Convert PDF Files

Export has special configuration options that allow greater control over the conversion of PDF files. These options can improve the fidelity and accuracy of the XML output.

Convert PDF Files to a Logical Reading Order

The PDF format is primarily designed for presentation and printing of brochures, magazines, forms, reports, and other materials with complex visual designs. Most PDF files do not contain the logical structure of the original document—the correct reading order, for example, and the presence and meaning of significant elements such as headers, footers, columns, tables, and so on.

KeyView can convert a PDF file either by using the file's internal unstructured paragraph flow, or by applying a structure to the paragraphs to reproduce the logical reading order of the visual page. Logical reading order enables KeyView to convert PDF files that contain right-to-left languages (such as Hebrew and Arabic) so that the output reads in the correct direction.

NOTE: The algorithm used to reproduce the reading order of a PDF page is based on common page layouts. The paragraph flow generated for PDFs with unique or complex page designs might not emulate the original reading order exactly.

For example, page design elements such as drop caps, callouts that cross column boundaries, and significant changes in font size might disrupt the logical flow of the output text.

Logical Reading Order and Paragraph Direction

By default, KeyView produces an unstructured text stream for PDF files. This means that PDF paragraphs are extracted in the order in which they are stored in the file, not the order in which they appear on the visual page. For example, a three-column article could be output with the headers and the title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.

You can configure KeyView to produce a structured text stream that flows in a specified direction. This means that PDF paragraphs are extracted in the order (logical reading order) and direction (left-to-right or right-to-left) in which they appear on the page.
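To make the difference between the two flows concrete, the following Java sketch reorders text blocks from their storage order into a column-based reading order. It is an illustration only, not KeyView's algorithm: the Block class, the columnWidth parameter, and the assumption of fixed-width columns are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: not KeyView's algorithm. Block, columnWidth, and the
// fixed-width column layout are assumptions made for this sketch.
class ReadingOrderSketch {

    // A paragraph and the position of its top-left corner on the page
    // (x grows to the right, y grows downward).
    static final class Block {
        final String text;
        final double x;
        final double y;

        Block(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the paragraph texts column by column, top to bottom within a column.
    // Columns are visited left to right, or right to left when rightToLeft is true.
    static List<String> logicalOrder(List<Block> stored, double columnWidth, boolean rightToLeft) {
        Comparator<Block> byColumn =
            Comparator.comparingInt((Block b) -> (int) (b.x / columnWidth));
        if (rightToLeft) {
            byColumn = byColumn.reversed();
        }
        List<Block> sorted = new ArrayList<>(stored);
        sorted.sort(byColumn.thenComparingDouble((Block b) -> b.y));

        List<String> result = new ArrayList<>();
        for (Block b : sorted) {
            result.add(b.text);
        }
        return result;
    }

    public static void main(String[] args) {
        // Storage order differs from visual order, as in an unstructured stream.
        List<Block> stored = Arrays.asList(
            new Block("column 2", 210, 100),
            new Block("column 1, lower", 10, 300),
            new Block("column 1, upper", 10, 100),
            new Block("column 3", 410, 100));
        System.out.println(logicalOrder(stored, 200, false));
        System.out.println(logicalOrder(stored, 200, true));
    }
}
```

A sort this simple fails as soon as a title or callout spans several columns, which is one reason why complex page designs might not be reproduced exactly.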

The following paragraph direction options are available.

  • Left-to-right: Paragraphs flow logically and read from left to right. You should specify this option when most of your documents are in a language that uses a left-to-right reading order, such as English or German.

  • Right-to-left: Paragraphs flow logically and read from right to left. You should specify this option when most of your documents are in a language that uses a right-to-left reading order, such as Hebrew or Arabic.

  • Dynamic: Paragraphs flow logically. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.

Conversions might be slower when logical reading order is enabled. For optimal speed, use an unstructured paragraph flow.

The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction in a paragraph. For example, suppose that a PDF file contains English paragraphs in three columns that read from left to right, but 80% of the second paragraph consists of Hebrew characters. If the left-to-right logical reading order is enabled, the paragraphs are ordered logically in the output (title paragraph, then paragraph 1, 2, 3, and so on) and flow from the top left of the first column to the bottom right of the third column. However, the text direction of the second paragraph is determined independently of the page by the PDF reader, and is output from right to left.

NOTE: Extraction of metadata is not affected by the paragraph direction setting. The characters and words in metadata fields are extracted in the correct reading direction regardless of whether logical reading order is enabled.

Enable Logical Reading Order

You can enable logical reading order by using either the API or the formats_e.ini file. Setting the direction in the API overrides the setting in the formats_e.ini file.

Use the Java API

To enable PDF logical reading order in the Java API

  1. Use the setPDFLogicalOrder(int orderFlag) method of the XmlExport object, and set the orderFlag argument to one of the following flags.

    • PDF_LOGICAL_ORDER_LTR: Logical reading order and left-to-right paragraph direction.

    • PDF_LOGICAL_ORDER_RTL: Logical reading order and right-to-left paragraph direction.

    • PDF_LOGICAL_ORDER_AUTO: Logical reading order. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. This option is used when a paragraph direction is not specified.

    • PDF_LOGICAL_ORDER_RAW: Unstructured paragraph flow. This is the default behavior. Set this flag if logical reading order is enabled, and you want to return to an unstructured paragraph flow.

For example:

objXMLExport.setPDFLogicalOrder(Export.PDF_LOGICAL_ORDER_RTL);
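When the dominant language of a document set is known in advance, the choice of flag can be made in one place. The helper below is a hypothetical sketch: the class, the method, and the short list of right-to-left language codes are assumptions, and it returns the flag name as a string rather than referencing the Export constants so that it compiles without the KeyView library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the KeyView API.
class PdfOrderChooser {

    // ISO 639-1 codes for some right-to-left languages (illustrative, not exhaustive).
    private static final Set<String> RTL_LANGUAGES =
        new HashSet<>(Arrays.asList("ar", "he", "fa", "ur"));

    // Returns the name of the flag to pass to setPDFLogicalOrder.
    // A null or blank code means the dominant language is unknown, so the
    // per-page detection flag is chosen. Any other code not listed above is
    // treated as left-to-right.
    static String flagNameFor(String languageCode) {
        if (languageCode == null || languageCode.trim().isEmpty()) {
            return "PDF_LOGICAL_ORDER_AUTO";
        }
        String code = languageCode.trim().toLowerCase(Locale.ROOT);
        return RTL_LANGUAGES.contains(code)
            ? "PDF_LOGICAL_ORDER_RTL"
            : "PDF_LOGICAL_ORDER_LTR";
    }

    public static void main(String[] args) {
        System.out.println(flagNameFor("he"));  // PDF_LOGICAL_ORDER_RTL
        System.out.println(flagNameFor("de"));  // PDF_LOGICAL_ORDER_LTR
        System.out.println(flagNameFor(null));  // PDF_LOGICAL_ORDER_AUTO
    }
}
```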

Use the formats_e.ini File

The formats_e.ini file is in the directory install\OS\bin, where install is the path name of the Export installation directory and OS is the name of the operating system.

To enable logical reading order by using the formats_e.ini file

  1. Change the PDF reader entry in the [Formats] section of the formats_e.ini file as follows:

    [Formats]
    230=lpdf
  2. Optionally, add the following section to the end of the formats_e.ini file:

    [pdf_flags]
    pdf_direction=paragraph_direction

    where paragraph_direction is one of the following:

    • LPDF_LTR: Left-to-right paragraph direction.

    • LPDF_RTL: Right-to-left paragraph direction.

    • LPDF_AUTO: The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.

    • LPDF_RAW: Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag.
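Taken together, the two steps above might leave a formats_e.ini file with entries like the following for right-to-left documents. This is a sketch that shows only the lines discussed in this section; it assumes that all other entries in the [Formats] section are left unchanged.

```ini
[Formats]
230=lpdf

[pdf_flags]
pdf_direction=LPDF_RTL
```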

Control Hyphenation

There are two types of hyphens in a PDF document:

  • A soft hyphen is added to a word by a word processor to divide the word across two lines. This is a discretionary hyphen and is used to ensure proper text flow in justified text.

  • A hard hyphen is intentionally added to a word regardless of the word's position in the text flow. It is required by the rules of grammar or word usage. For example, compound words, such as "three-week vacation" and "self-confident", contain hard hyphens.

By default, KeyView maintains the source document's soft hyphens in the output XML to more accurately represent the source document's layout. However, if you are using Export to generate text output for an indexing engine or are not concerned with maintaining the document's layout, OpenText recommends that you remove soft hyphens from the XML output. To remove soft hyphens, you must enable the soft hyphen flag.

NOTE: If the soft hyphen flag is enabled, every hyphen at the end of a line is considered a soft hyphen and removed from the XML output. If a hard hyphen appears at the end of a line, it is also removed. This might result in an intentionally hyphenated word being extracted without a hyphen.
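The effect described in the note can be imitated with a few lines of Java. The sketch below is an illustration of the documented behavior only, not KeyView's implementation; the class and method names are hypothetical.

```java
// Illustration only: mimics the documented effect of the soft hyphen flag,
// not KeyView's implementation.
class SoftHyphenSketch {

    // Removes every hyphen that ends a line and joins the two halves of the word.
    // Hyphens inside a line are kept.
    static String removeEndOfLineHyphens(String text) {
        return text.replaceAll("-\\r?\\n", "");
    }

    public static void main(String[] args) {
        // Soft hyphen: the word is restored as intended.
        System.out.println(removeEndOfLineHyphens("the source docu-\nment layout"));
        // prints: the source document layout

        // Hard hyphen at the end of a line: the hyphen is lost as well.
        System.out.println(removeEndOfLineHyphens("a self-\nconfident author"));
        // prints: a selfconfident author

        // Hard hyphen inside a line: unchanged.
        System.out.println(removeEndOfLineHyphens("a three-week vacation"));
        // prints: a three-week vacation
    }
}
```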

To remove soft hyphens from the XML output

  1. Create an instance of the ConfigOption class. Set the OptionType argument to CFG_DELSOFTHYPHEN and the OptionValue argument to 1.
  2. Call the setConfigOption method and pass the ConfigOption object.
  3. Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.

Extract Custom Metadata from PDF Files

To extract custom metadata from your PDF files, add the custom metadata names to the pdfsr.ini file provided, and copy the modified file to the \bin directory. You can then extract metadata as you normally would.

The pdfsr.ini file is in the directory samples\pdfini, and has the following structure:

<META>
<TOTAL>total_item_number</TOTAL>
/metadata_tag_name datatype
</META>

where the parameters are as follows:

  • total_item_number: The total number of metadata tags that are listed.

  • metadata_tag_name: The metadata tag name used in the PDF files.

  • datatype: The data type of the metadata field. The possible types are:

      • KV_String
      • KV_Int4
      • KV_DateTime
      • KV_ClipBoard
      • KV_Bool
      • KV_Unicode
      • KV_IEEE8
      • KV_Other

For example:

<META>
<TOTAL> 4 </TOTAL>
/part_number     INT4
/volume          INT4
/purchase_date   DATETIME
/customer        STRING
</META>
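Because the TOTAL value must agree with the number of tags listed, it can be worth checking an edited pdfsr.ini file before copying it to the bin directory. The following Java sketch is a hypothetical checker, not part of the KeyView API; it assumes one entry per line in the form /name datatype, as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical checker, not part of the KeyView API. It assumes one entry per
// line in the form "/name datatype", as in the example above.
class PdfsrIniSketch {

    // Parses the <META> block and returns tag name -> data type, in file order.
    // Throws IllegalArgumentException if an entry has no data type or if
    // <TOTAL> does not match the number of entries.
    static Map<String, String> parse(String content) {
        Map<String, String> tags = new LinkedHashMap<>();
        int total = -1;
        for (String raw : content.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.startsWith("<TOTAL>") && line.endsWith("</TOTAL>")) {
                String number = line.substring(
                    "<TOTAL>".length(), line.length() - "</TOTAL>".length());
                total = Integer.parseInt(number.trim());
            } else if (line.startsWith("/")) {
                String[] parts = line.substring(1).split("\\s+", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Missing data type: " + line);
                }
                tags.put(parts[0], parts[1].trim());
            }
        }
        if (total != tags.size()) {
            throw new IllegalArgumentException(
                "<TOTAL> is " + total + " but " + tags.size() + " tags are listed");
        }
        return tags;
    }

    public static void main(String[] args) {
        String ini = "<META>\n"
            + "<TOTAL> 4 </TOTAL>\n"
            + "/part_number     INT4\n"
            + "/volume          INT4\n"
            + "/purchase_date   DATETIME\n"
            + "/customer        STRING\n"
            + "</META>\n";
        System.out.println(parse(ini));
    }
}
```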