Convert PDF Files

Export has special configuration options that allow greater control over the conversion of PDF files. These options can improve the fidelity and accuracy of the HTML output.

Use the pdf2sr Reader

The pdf2sr reader generates a high fidelity raster image of each page in a PDF file. The output HTML will also contain the text, with a zero opacity value, so that you can search for text when viewing the output in a web browser.

The pdf2sr reader has the following features:

  • supports standard and custom metadata (non-XMP)
  • supports basic text extraction
  • supports password protected PDFs
  • supports selecting specific pages to convert (see Convert a Subset of Pages)

The pdf2sr reader has the following limitations:

  • does not support logical order
  • does not support bidi PDFs
  • does not extract subfiles
  • does not extract bookmarks from PDFs
  • does not estimate the percentage of embedded fonts that match the display glyphs
  • does not support XMP metadata
  • does not support headers or footers
  • supports annotations only in the raster output, not as searchable text
  • does not support content access stream
  • does not support tagged content (PDFs)
  • cannot be used when extracting files to a stream using fpOpenSubFile()
  • cannot reconstruct missing information from Arabic text in converted PDFs. When you use Microsoft Print to PDF to convert Word documents that contain Arabic text in the Calibri font, the resulting PDF is often incomplete because information that is required to interpret the text content is missing. The pdfsr reader can reconstruct the missing information, but the pdf2sr reader does not.

To use the pdf2sr reader

  1. Open the formats_e.ini file with a text editor.
  2. In the [Formats] section, set the following:

    230=pdf2
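To confirm that the edit took effect, a small helper can read the entry back from the file contents. The following is a minimal sketch and is not part of Export: the function name find_formats_entry is hypothetical, and the parsing is deliberately simplified (it assumes no spaces around the equals sign).

```c
#include <stddef.h>
#include <string.h>

/* Looks up `key` in the [Formats] section of INI-style text.
 * Copies the value into `out` and returns 1 if found; otherwise returns 0.
 * Simplified: assumes "key=value" with no surrounding spaces. */
static int find_formats_entry(const char *ini, const char *key,
                              char *out, size_t out_size)
{
    int in_formats = 0;
    size_t key_len = strlen(key);
    const char *line = ini;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        /* Ignore a trailing carriage return from Windows line endings. */
        if (len > 0 && line[len - 1] == '\r')
            len--;

        if (len > 0 && line[0] == '[') {
            /* A section header switches the current section. */
            in_formats = (len == 9 && strncmp(line, "[Formats]", 9) == 0);
        } else if (in_formats && len > key_len &&
                   strncmp(line, key, key_len) == 0 && line[key_len] == '=') {
            size_t val_len = len - key_len - 1;
            if (val_len >= out_size)
                return 0;   /* value does not fit in the caller's buffer */
            memcpy(out, line + key_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }

        if (end == NULL)
            break;
        line = end + 1;
    }
    return 0;
}
```

Calling find_formats_entry(text, "230", buf, sizeof buf) on the edited file contents should yield pdf2 when the pdf2sr reader is configured.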

Convert PDF Files to a Logical Reading Order

The PDF format is primarily designed for presentation and printing of brochures, magazines, forms, reports, and other materials with complex visual designs. Most PDF files do not contain the logical structure of the original document—the correct reading order, for example, and the presence and meaning of significant elements such as headers, footers, columns, tables, and so on.

KeyView can convert a PDF file either by using the file's internal unstructured paragraph flow, or by applying a structure to the paragraphs to reproduce the logical reading order of the visual page. Logical reading order also enables KeyView to output text from PDF files that contain right-to-left languages (such as Hebrew and Arabic) in the correct reading direction.

NOTE: The algorithm used to reproduce the reading order of a PDF page is based on common page layouts. The paragraph flow generated for PDFs with unique or complex page designs might not emulate the original reading order exactly.

For example, page design elements such as drop caps, callouts that cross column boundaries, and significant changes in font size might disrupt the logical flow of the output text.

Logical Reading Order and Paragraph Direction

By default, KeyView produces an unstructured text stream for PDF files. This means that PDF paragraphs are extracted in the order in which they are stored in the file, not the order in which they appear on the visual page. For example, a three-column article could be output with the headers and the title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.

You can configure KeyView to produce a structured text stream that flows in a specified direction. This means that PDF paragraphs are extracted in the order (logical reading order) and direction (left-to-right or right-to-left) in which they appear on the page.

The following paragraph direction options are available.

Paragraph Direction Option   Description
Left-to-right                Paragraphs flow logically and read from left to right. You should specify this option when most of your documents are in a language that uses a left-to-right reading order, such as English or German.
Right-to-left                Paragraphs flow logically and read from right to left. You should specify this option when most of your documents are in a language that uses a right-to-left reading order, such as Hebrew or Arabic.
Dynamic                      Paragraphs flow logically. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.

Conversions might be slower when logical reading order is enabled. For optimal speed, use an unstructured paragraph flow.

The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction in a paragraph. For example, suppose that a PDF file contains English paragraphs in three columns that read from left to right, but 80% of the second paragraph contains Hebrew characters. If the left-to-right logical reading order is enabled, the paragraphs are ordered logically in the output (title paragraph, then paragraph 1, 2, 3, and so on) and flow from the top left of the first column to the bottom right of the third column. However, the text direction of the second paragraph is determined independently of the page by the PDF reader, and is output from right to left.

NOTE: Extraction of metadata is not affected by the paragraph direction setting. The characters and words in metadata fields are extracted in the correct reading direction regardless of whether logical reading order is enabled.

Enable Logical Reading Order

You can enable logical reading order by using either the API or the formats_e.ini file. Setting the direction in the API overrides the setting in the formats_e.ini file.

Use the C API

To enable PDF logical reading order in the C API

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_LOGICALPDF
    nValue     One of the following flags, which are defined in kvtypes.h (see LPDF_DIRECTION):
               • LPDF_LTR: Logical reading order and left-to-right paragraph direction.
               • LPDF_RTL: Logical reading order and right-to-left paragraph direction.
               • LPDF_AUTO: Logical reading order. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.
               • LPDF_RAW: Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag.
    pData      NULL

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_LOGICALPDF, LPDF_RTL, NULL);

    The cnv2html sample program demonstrates this function. See cnv2html.

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().
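The configuration step above can be wrapped in a small helper so that the direction is chosen in one place. The following is a sketch under stated assumptions, not SDK code: the SKETCH_ constants, the SketchConfigFn type, and the fake_config function are local stand-ins so that the block compiles without the Export headers. The call shape (context, option type, integer value, data pointer) is inferred from the example call above, and the numeric values are placeholders rather than the values in kvtypes.h.

```c
#include <stddef.h>

/* Stand-ins for KVCFG_LOGICALPDF and the LPDF_* flags.
 * In a real program, use the definitions from the SDK headers instead. */
enum { SKETCH_KVCFG_LOGICALPDF = 1 };

typedef enum {
    SKETCH_LPDF_RAW,
    SKETCH_LPDF_LTR,
    SKETCH_LPDF_RTL,
    SKETCH_LPDF_AUTO
} SketchDirection;

/* Assumed shape of the configuration call (see the lead-in). */
typedef int (*SketchConfigFn)(void *ctx, int nType, int nValue, void *pData);

/* Step 2 of the procedure: request logical reading order in a direction.
 * Returns 0 without calling `config` if either pointer is missing. */
static int set_logical_pdf(SketchConfigFn config, void *ctx, SketchDirection dir)
{
    if (config == NULL || ctx == NULL)
        return 0;
    return config(ctx, SKETCH_KVCFG_LOGICALPDF, (int)dir, NULL);
}

/* Test double that records what it was asked to set. */
typedef struct {
    int lastType;
    int lastValue;
    int calls;
} SketchContext;

static int fake_config(void *ctx, int nType, int nValue, void *pData)
{
    SketchContext *c = (SketchContext *)ctx;
    (void)pData;            /* unused, as in the documented call */
    c->lastType = nType;
    c->lastValue = nValue;
    c->calls++;
    return 1;
}
```

In an application, the stand-in function pointer would be replaced by the pointer obtained from the Export interface after fpInit(), and the conversion call would follow as in step 3.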

Use the formats_e.ini File

The formats_e.ini file is in the directory install\OS\bin, where install is the path name of the Export installation directory and OS is the name of the operating system.

To enable logical reading order by using the formats_e.ini file

  1. Change the PDF reader entry in the [Formats] section of the formats_e.ini file as follows:

    [Formats]
    230=lpdf

  2. Optionally, add the following section to the end of the formats_e.ini file:

    [pdf_flags]
    pdf_direction=paragraph_direction

    where paragraph_direction is one of the following:

    Flag        Description
    LPDF_LTR    Left-to-right paragraph direction
    LPDF_RTL    Right-to-left paragraph direction
    LPDF_AUTO   The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.
    LPDF_RAW    Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag.
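A tool that validates formats_e.ini before deployment might map the pdf_direction value to a direction as follows. This is an illustrative sketch only: the PdfDirection enum and the parse_pdf_direction function are hypothetical and do not come from the SDK. A missing or empty value maps to the automatic direction, as the table above describes. Rejecting unrecognized values is an assumption of this sketch, not documented Export behavior.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    DIR_INVALID = -1,   /* value is not one of the four documented flags */
    DIR_LTR,
    DIR_RTL,
    DIR_AUTO,
    DIR_RAW
} PdfDirection;

/* Maps the text after "pdf_direction=" to a direction. */
static PdfDirection parse_pdf_direction(const char *value)
{
    if (value == NULL || value[0] == '\0')
        return DIR_AUTO;    /* no direction specified */
    if (strcmp(value, "LPDF_LTR") == 0)
        return DIR_LTR;
    if (strcmp(value, "LPDF_RTL") == 0)
        return DIR_RTL;
    if (strcmp(value, "LPDF_AUTO") == 0)
        return DIR_AUTO;
    if (strcmp(value, "LPDF_RAW") == 0)
        return DIR_RAW;
    return DIR_INVALID;
}
```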

Generate a Table of Contents from PDF Bookmarks

When you convert PDF files to HTML by using the basic reader (pdfsr), the table of contents is generated from "bookmarks" within the PDF file. The hyperlinked table of contents can appear either at the beginning of the HTML file or in a separate frame.

OpenText recommends that you configure the conversion so that the table of contents appears in a separate frame. The template pdfframe.ini demonstrates how to do this (see Set Conversion Options). Export uses absolute positioning when converting a PDF file; that is, the text appears in the same position as in the original document. Table of contents entries do not contain absolute positioning information. Therefore, if the main document and the table of contents are generated in the same output file, the table of contents entries might overlap the body text in the document.

NOTE: When PDF bookmarks are converted to a table of contents in HTML, the generated links do not lead to the exact location of the destination marker, but jump to the page on which the destination marker exists. This is similar to the behavior of the Adobe Acrobat Reader.

Disable Bookmark Conversion

By default, Export converts PDF bookmarks to a table of contents in the HTML output. However, you can configure Export not to generate a table of contents based on the PDF bookmarks.

To prevent conversion of PDF bookmarks

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_SUPPRESSTOCPRINTIMAGE
    nValue     TRUE (non-zero)
    pData      NULL

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_SUPPRESSTOCPRINTIMAGE, TRUE, NULL);

    The sample program cnv2html has KVCFG_SUPPRESSTOCPRINTIMAGE enabled. When you use this program to convert a PDF file with bookmarks, the HTML output does not include a table of contents.

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().

Convert Invisible Text

PDF documents sometimes contain invisible text. You can search this text in Adobe Acrobat Reader, but you cannot view it in a web browser.

Toggle Invisible Text

You can add a JavaScript button to the upper right corner of the exported page, which you can click to toggle between invisible and regular text. When you turn on invisible text, the invisible text is displayed and the regular content is hidden; when you turn off invisible text, the invisible text is hidden and the regular content is displayed.

Invisible text is hidden by default. The toggle button only appears if invisible text is detected in the PDF document.

To add an invisible text toggle button

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_SETPDFINVISTEXTTOGGLE
    nValue     0 (not used)
    pData      szButtonName

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_SETPDFINVISTEXTTOGGLE, 0, szButtonName);

    The cnv2html and htmlini sample programs demonstrate this function. See cnv2html and htmlini.

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().

    NOTE: If no invisible text is detected in the PDF document, no toggle button appears in the HTML output even if you set KVCFG_SETPDFINVISTEXTTOGGLE.

Specify Opacity of Invisible Text

Invisible text often occurs in PDF documents when the PDF software processes rasterized images through optical character recognition and then inserts the recognized text in the PDF. You might want to display both the invisible text and the rasterized image. To do so, set the invisible text opacity to an integer from 0 to 100, where 0 hides the invisible text and 100 displays it fully.

Invisible text opacity is set to 0 by default.

To set invisible text opacity

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_SETPDFINVISTEXTOPACITY
    nValue     iInvisOpacity
    pData      NULL

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_SETPDFINVISTEXTOPACITY, iInvisOpacity, NULL);

    The htmlini sample program demonstrates this function. See htmlini.

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().
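Because the opacity is documented as an integer from 0 to 100, it can be worth normalizing a user-supplied value before passing it as iInvisOpacity. The helpers below are a hypothetical sketch and not part of the SDK. The fraction helper only illustrates the scale; how Export applies the value in the generated HTML is not described in this section.

```c
/* Clamps a requested opacity to the documented range of 0 to 100. */
static int clamp_invis_opacity(int requested)
{
    if (requested < 0)
        return 0;
    if (requested > 100)
        return 100;
    return requested;
}

/* Expresses the 0-100 setting as a fraction between 0.0 and 1.0. */
static double opacity_fraction(int setting)
{
    return clamp_invis_opacity(setting) / 100.0;
}
```

The clamped result is the value that would be passed as nValue in the configuration call shown in step 2.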

Convert Rotated Text

By default, rotated text is displayed in its original position, at the original font size, and at 0 degrees rotation in the HTML output. The text is not rotated in the HTML output because text rotation is not supported by HTML.

Because the text is the original size, but might be displayed in a smaller space (at 0 degrees), the text might overlap adjacent text in the HTML output. To avoid this problem, you can specify that the rotated text be removed from its original position and displayed at the bottom of the HTML page on which it appears.

To specify that rotated text be displayed at the bottom of the HTML page

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_SETTEXTROTATE
    nValue     TRUE (non-zero)
    pData      NULL

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_SETTEXTROTATE, TRUE, NULL);

    The sample program cnv2html demonstrates how to use this function. See cnv2html.

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().

NOTE: When this feature is enabled, white space is added to the bottom of every HTML page to accommodate any rotated text.

Control Hyphenation

There are two types of hyphens in a PDF document:

  • A soft hyphen is added to a word by a word processor to divide the word across two lines. This is a discretionary hyphen and is used to ensure proper text flow in justified text.

  • A hard hyphen is intentionally added to a word regardless of the word's position in the text flow. It is required by the rules of grammar or word usage. For example, compound words such as "three-week vacation" and "self-confident" contain hard hyphens.

By default, KeyView maintains the source document's soft hyphens in the output HTML to more accurately represent the source document's layout. However, if you are using Export to generate text output for an indexing engine or are not concerned with maintaining the document's layout, OpenText recommends that you remove soft hyphens from the HTML output. To remove soft hyphens, you must enable the soft hyphen flag.

NOTE: If the soft hyphen flag is enabled, every hyphen at the end of a line is considered a soft hyphen and removed from the HTML output. If a hard hyphen appears at the end of a line, it is also removed. This might result in an intentionally hyphenated word being extracted without a hyphen.
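The effect described in this note can be illustrated with a few lines of C. This is a sketch of the general rule only and not the Export implementation: the function name strip_eol_hyphens is hypothetical. Dropping the line break along with the hyphen, so that the two halves of the word are joined, is an assumption made for the illustration.

```c
#include <stddef.h>

/* Copies `in` to `out`, dropping every hyphen that is immediately followed
 * by a line break, together with that line break.
 * Returns the number of hyphens removed, or -1 if `out` is too small. */
static int strip_eol_hyphens(const char *in, char *out, size_t out_size)
{
    size_t o = 0;
    int removed = 0;

    for (size_t i = 0; in[i] != '\0'; i++) {
        if (in[i] == '-' && in[i + 1] == '\n') {
            i++;            /* skip the line break as well */
            removed++;
            continue;
        }
        if (o + 1 >= out_size)
            return -1;      /* keep room for the terminating NUL */
        out[o++] = in[i];
    }
    if (out_size == 0)
        return -1;
    out[o] = '\0';
    return removed;
}
```

Given the input "docu-" followed by "ment" on the next line, the result is "document". Given "self-" followed by "confident", the result is "selfconfident", which shows how a hard hyphen at the end of a line is lost.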

To remove soft hyphens from the HTML output

  1. Call the fpInit() function.

  2. Call the KVHTMLConfig() function with the following arguments (see KVHTMLConfig()):

    Argument   Parameter
    nType      KVCFG_DELSOFTHYPHEN
    nValue     TRUE (non-zero)
    pData      NULL

    For example:

    (*fpHTMLConfig)(pKVHTML, KVCFG_DELSOFTHYPHEN, TRUE, NULL);

  3. Call the fpConvertStream() or KVHTMLConvertFile() function. See fpConvertStream() or KVHTMLConvertFile().

Extract Custom Metadata from PDF Files

To extract custom metadata from your PDF files, add the custom metadata names to the pdfsr.ini file provided, and copy the modified file to the \bin directory. You can then extract metadata as you normally would.

The pdfsr.ini file is in the directory samples\pdfini, and has the following structure:

<META>
<TOTAL>total_item_number</TOTAL>
/metadata_tag_name datatype
</META>

Parameter            Description
total_item_number    The total number of metadata tags that are listed.
metadata_tag_name    The metadata tag name used in the PDF files.
datatype             The data type of the metadata field. Data types are defined in KVSumInfoType. See KVSumInfoType.

For example:

<META>
<TOTAL> 4 </TOTAL>
/part_number     INT4
/volume          INT4
/purchase_date   DATETIME
/customer        STRING
</META>
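When the list of custom tags is maintained elsewhere, generating the block programmatically keeps the TOTAL value in step with the number of tags listed. The following is a minimal sketch that follows the layout of the example above. The MetaTag type and the format_meta_block function are hypothetical and are not part of the SDK.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;       /* metadata tag name as used in the PDF files */
    const char *datatype;   /* for example "INT4", "DATETIME", or "STRING" */
} MetaTag;

/* Writes a <META> block for `count` tags into `out`.
 * Returns the number of characters written, or -1 if `out` is too small. */
static int format_meta_block(const MetaTag *tags, int count,
                             char *out, size_t out_size)
{
    size_t used;
    int n = snprintf(out, out_size, "<META>\n<TOTAL> %d </TOTAL>\n", count);

    if (n < 0 || (size_t)n >= out_size)
        return -1;
    used = (size_t)n;

    for (int i = 0; i < count; i++) {
        n = snprintf(out + used, out_size - used, "/%s %s\n",
                     tags[i].name, tags[i].datatype);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_size - used, "</META>\n");
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

The resulting text can then be written to pdfsr.ini and copied to the \bin directory as described above.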

Convert a Subset of Pages

By default, KeyView converts all pages from a PDF file. When processing PDF files using the pdf2sr reader, you can specify a subset of pages to convert, to quickly get several pages from a large PDF file without having to process all the other pages.

To convert a subset of pages, set the page selection as described under fpSelectPDFPages(), and then call fpConvertStream() or KVHTMLConvertFile().

After you retrieve a subset of pages, you can get the rest of the pages in the PDF without processing the original subset again, by inverting the selection. To invert the selection, you set isExclusion to TRUE in the KVPageSelection structure. Then you call fpSelectPDFPages(), followed by fpConvertStream() or KVHTMLConvertFile() again.
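The effect of inverting a selection can be shown with a small model. The types below are hypothetical stand-ins and do not reproduce the layout of the KVPageSelection structure; only the isExclusion flag is taken from the text above. Treating page ranges as inclusive and 1-based is an assumption of this sketch.

```c
#include <stddef.h>

typedef struct {
    int first;   /* first page of the range (inclusive) */
    int last;    /* last page of the range (inclusive) */
} PageRange;

typedef struct {
    const PageRange *ranges;
    size_t rangeCount;
    int isExclusion;   /* non-zero: convert every page NOT in the ranges */
} PageSelectionSketch;

/* Returns 1 if `page` would be converted under `sel`, otherwise 0. */
static int page_is_converted(const PageSelectionSketch *sel, int page)
{
    int listed = 0;

    for (size_t i = 0; i < sel->rangeCount; i++) {
        if (page >= sel->ranges[i].first && page <= sel->ranges[i].last) {
            listed = 1;
            break;
        }
    }
    return sel->isExclusion ? !listed : listed;
}
```

With the same ranges and isExclusion set first to zero and then to non-zero, every page is converted by exactly one of the two passes, which is why the second pass does not repeat the original subset.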