Filter PDF Files to a Logical Reading Order

The order of the text inside a PDF file has no relation to the layout of the text on the page or screen. By default, KeyView extracts paragraphs in the order in which they are stored in the file, not the order in which they appear on the page. For example, a three-column article could be output with the headers and title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.

You can configure KeyView to filter text in the order that it appears on the page (logical reading order).

NOTE: Filtering might be slower when logical reading order is enabled. Because the PDF file format does not provide the logical reading order, KeyView must calculate the order based on the position of elements on the page.

NOTE: The algorithm used to reproduce the reading order of a PDF page is based on common page layouts. The paragraph flow generated for PDFs with unique or complex page designs might not emulate the original reading order exactly. For example, page design elements such as drop caps, callouts that cross column boundaries, and significant changes in font size might disrupt the logical flow of the output text.

The following paragraph direction options are available:

Paragraph Direction   Description
Left-to-right         Paragraphs flow from left to right. Specify this option when your documents are in a language that uses a left-to-right reading order, such as English or German.
Right-to-left         Paragraphs flow from right to left. Specify this option when your documents are in a language that uses a right-to-left reading order, such as Hebrew or Arabic.
Auto                  The PDF reader automatically determines the paragraph direction for each page.

The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction within a paragraph. For example, a PDF file might contain English paragraphs in three columns that read from left to right, but the second paragraph might consist mostly (for example, 80%) of Hebrew characters. If the left-to-right logical reading order is enabled, the paragraphs are ordered logically in the output (title paragraph, then paragraph 1, 2, 3, and so on) and flow from the top left of the first column to the bottom right of the third column. However, KeyView determines the text direction of the second paragraph independently of the page, and outputs that paragraph from right to left.
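The distinction between paragraph direction and text direction can be illustrated with a minimal sketch. This is not KeyView's algorithm, which is not described here; it assumes a simplified page where every paragraph already has a known column index and vertical position. The names Block, Direction, and reading_order are hypothetical and are not part of the KeyView API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a paragraph found on a page; not a KeyView type.
struct Block {
    std::string text;  // paragraph content, never modified by the ordering
    int column;        // 0 = leftmost column on the page
    double top;        // distance from the top of the page
};

// Hypothetical paragraph direction, mirroring the options in the table above.
enum class Direction { LeftToRight, RightToLeft };

// Order paragraphs column by column, top to bottom within each column.
// The direction only decides which column is read first; the characters
// inside each paragraph are left exactly as they were.
std::vector<Block> reading_order(std::vector<Block> blocks, Direction dir) {
    std::stable_sort(blocks.begin(), blocks.end(),
        [dir](const Block& a, const Block& b) {
            if (a.column != b.column) {
                return dir == Direction::LeftToRight ? a.column < b.column
                                                     : a.column > b.column;
            }
            return a.top < b.top;
        });
    return blocks;
}
```

In this sketch, switching from Direction::LeftToRight to Direction::RightToLeft changes only which column comes first. The text of each paragraph passes through unchanged, which is why the text direction inside a paragraph has to be determined separately.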

NOTE: Extraction of metadata is not affected by the paragraph direction setting. The characters and words in metadata fields are extracted in the correct reading direction regardless of whether logical reading order is enabled.

Enable Logical Ordering through the API

To enable logical ordering for PDF files

  • In the C++ API, invoke pdf_logical_reading() on a Configuration object with any value from the enumerated list LogicalPDFDirection in Keyview_Enumerations.hpp. See The Configuration Class for more information.

Use the formats.ini File

To enable logical reading order by using the formats.ini file

  1. Change the PDF reader entry in the [Formats] section of the formats.ini file as follows:

    [Formats]
    230=lpdf

  2. Optionally, add the following section to the end of the formats.ini file:

    [pdf_flags]
    pdf_direction=paragraph_direction

    where paragraph_direction is one of the following:

    Flag        Description
    LPDF_LTR    Left-to-right paragraph direction.
    LPDF_RTL    Right-to-left paragraph direction.
    LPDF_AUTO   The PDF reader determines the paragraph direction for each PDF page.
    LPDF_RAW    Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag.
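
Taken together, the two steps might produce a formats.ini fragment like the following. This is a sketch that assumes you want the reader to determine the paragraph direction for each page (LPDF_AUTO) and that the rest of your formats.ini file is left unchanged; substitute any other flag from the list above as needed.

```ini
[Formats]
230=lpdf

[pdf_flags]
pdf_direction=LPDF_AUTO
```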