KVXMLConfig()

This function is called directly and provides a way to configure options prior to the document conversion. Currently, the function is used for the following configurations:

  • Generate output without images

    Generate output with verbose markup and without images. To generate output with minimal markup (ID and style paragraph attributes) and without images, set the bIndexOnly member of the KVXMLOptions structure. See KVXMLOptions.

  • Enable PDF position information

    Include position information in the markup generated for a PDF document.

  • Configure PDF bookmarks

    Specify whether bookmarks in a PDF file are converted to simple XLinks in the XML output.

  • Configure Word bookmarks

    Disable the conversion of Microsoft Word bookmarks to zone elements.

  • Designate temporary directory

    Specify a directory in which temporary files created during XML conversion processes are stored.

  • Configure XML conversion

    Specify the elements and attributes extracted from an XML document based on the file's document type.

  • Enable PDF logical reading order

    Convert paragraphs in PDF files in the order in which they appear on the page and with left-to-right or right-to-left paragraph direction. See Convert PDF Files to a Logical Reading Order.

  • Configure PDF soft hyphens

    Specify whether soft hyphens are removed from the XML output. See Control Hyphenation.

  • Enable Revision Marks

    Convert text and graphics that were deleted from a document with revision tracking enabled and include revision tracking information in the XML output. See Convert Revision Tracking Information.

  • Protected file password

    Specify the password to use to open a password-protected file for export.

  • Specify output character set for summary information

    Specify the output character set for the document's metadata, when using fpGetSummaryInfo().

  • Include position and invisible text tokens (with bounding boxes) in the output

    Add top, left, height, width, and rotation attributes to <p> elements.

  • Enable or disable Optical Character Recognition (OCR)

    KeyView can perform Optical Character Recognition (OCR) on raster image files - see KVCFG_OCR in the table below.

Syntax

KVErrorCode pascal KVXMLConfig( 
    void    *pContext,
    int      nType,
    int      nValue,
    void    *p );

Arguments

pContext

A pointer to a KeyView Export session that you initialized by calling fpInit().

nType

The configuration flag. This is a symbolic constant defined in kvtypes.h. The available options are described in Configuration Flags.

nValue

The integer value defined for the flags above.

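This is TRUE or FALSE for all flags except KVCFG_LOGICALPDF, KVCFG_POSITIONINFOOUTPUTTYPE, KVCFG_SETMETADATACHARSET, KVCFG_SETTEMPDIRECTORY, and KVCFG_SETXMLCONFIGINFO.

For KVCFG_POSITIONINFOOUTPUTTYPE, this is the output type, for example KVPIOT_ATTRIBUTES.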

For KVCFG_LOGICALPDF, this is one of the paragraph direction options defined in the LPDF_DIRECTION enumerated type in kvtypes.h. See LPDF_DIRECTION.

For KVCFG_SETTEMPDIRECTORY and KVCFG_SETXMLCONFIGINFO, this is not set.

For KVCFG_SETMETADATACHARSET, this is a character set enumerated in KVCharSet in kvcharset.h. See Convert Character Sets.

p

The data for the configuration flag.

This is NULL for all flags except KVCFG_SETPASSWORD, KVCFG_SETTEMPDIRECTORY, and KVCFG_SETXMLCONFIGINFO.

For KVCFG_SETTEMPDIRECTORY, this is the path to the directory where temporary files are stored.

For KVCFG_SETXMLCONFIGINFO, this is a pointer to the KVXConfigInfo structure. See KVXConfigInfo.

For KVCFG_SETPASSWORD, this is the source file password.
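
The argument rules above can be summarized as a small lookup. The following is a minimal sketch for illustration only: the ExampleFlag values are placeholders, not the real KVCFG_* constants from kvtypes.h, and nvalue_kind() and uses_pointer() are hypothetical helpers that are not part of the KeyView API.

```c
/* Placeholder identifiers for illustration only. In real code, use the
   KVCFG_* constants from kvtypes.h; these are NOT the real values. */
typedef enum {
    EX_SUPPRESSIMAGES,
    EX_LOGICALPDF,
    EX_SETMETADATACHARSET,
    EX_SETTEMPDIRECTORY,
    EX_SETXMLCONFIGINFO,
    EX_SETPASSWORD
} ExampleFlag;

typedef enum { NVALUE_BOOLEAN, NVALUE_ENUM, NVALUE_UNUSED } NValueKind;

/* How nValue is interpreted for a flag, following the Arguments section. */
NValueKind nvalue_kind(ExampleFlag flag)
{
    switch (flag) {
    case EX_LOGICALPDF:          /* an LPDF_DIRECTION value */
    case EX_SETMETADATACHARSET:  /* a KVCharSet value */
        return NVALUE_ENUM;
    case EX_SETTEMPDIRECTORY:
    case EX_SETXMLCONFIGINFO:
        return NVALUE_UNUSED;    /* nValue is not set */
    default:
        return NVALUE_BOOLEAN;   /* TRUE or FALSE */
    }
}

/* Returns 1 if the flag reads data through p, 0 if p should be NULL. */
int uses_pointer(ExampleFlag flag)
{
    return flag == EX_SETTEMPDIRECTORY ||
           flag == EX_SETXMLCONFIGINFO ||
           flag == EX_SETPASSWORD;
}
```

A wrapper around KVXMLConfig() could use checks like these to catch a missing p or an unexpected nValue before the call is made.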

    Configuration Flags

    The following flags are available for the nType argument in KVXMLConfig(). These flags are defined in kvtypes.h.

    Flag

    Description

    KVCFG_SUPPRESSIMAGES

    If you set KVCFG_SUPPRESSIMAGES, the XML output includes verbose markup, but no images. If you do not set this option, embedded images in a document are regenerated as separate files and stored in the output directory. To generate output with minimal markup (ID and style paragraph attributes) and without images, set the bIndexOnly member of the KVXMLOptions structure to TRUE. See KVXMLOptions.

    KVCFG_ENABLEPOSITIONINFO

    If you set KVCFG_ENABLEPOSITIONINFO, a position element is included in the markup for PDF documents. The position element defines the absolute position of the text relative to the bottom left corner of the page, and includes additional information such as font and color.

    KVCFG_SETMETADATACHARSET

    This option enables you to specify the output character set for metadata when using fpGetSummaryInfo(). nValue is a character set enumerated in KVCharSet in kvcharset.h. See Convert Character Sets. This function should be called before fpGetSummaryInfo().

    KVCFG_SUPPRESSTOCPRINTIMAGE

    If you set KVCFG_SUPPRESSTOCPRINTIMAGE, bookmarks in a PDF file are not converted to simple XLinks in the XML output. By default, PDF bookmarks are converted to source and destination anchors. For example,

    <a xmlns:xlink="http://www.w3.org/TR/xlink" xlink:href="#bmk1">Highlight File Format</a>
    <a xmlns:xlink="http://www.w3.org/TR/xlink" name="bmk1"><img src="pdf14640.jpg"/>
    

    KVCFG_DISABLEZONE

    If you set KVCFG_DISABLEZONE, the conversion of Microsoft Word bookmarks to zone elements (<zone name="xxx">) in the output XML is disabled.

    A bookmark in Microsoft Word documents is a name given to a selected area of the document. The bookmark might enclose words, paragraphs, tables, table cells, lists, list items, or the entire document. In XML Export, bookmarks are converted to zone elements (<Zone name="xxx">) by using the KeyView KVT_ZONE token.

    Depending on how bookmarks are defined in the original document, the creation of zone elements might result in malformed XML. In this case, you can disable zone creation to avoid these validity errors. Zone element creation is enabled by default.
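
One plausible way that zone elements can break well-formedness is when two bookmarks partially overlap, because XML elements must either nest or be disjoint. The following sketch illustrates that constraint only. It is not KeyView code, and the Range type and can_nest() function are hypothetical names introduced for this example.

```c
/* A bookmark modeled as a half-open character range [start, end). */
typedef struct {
    int start;
    int end;
} Range;

/* Returns 1 if the two ranges can be written as properly nested (or
   disjoint) XML elements, 0 if they partially overlap. */
int can_nest(Range a, Range b)
{
    int disjoint   = a.end <= b.start || b.end <= a.start;
    int a_inside_b = b.start <= a.start && a.end <= b.end;
    int b_inside_a = a.start <= b.start && b.end <= a.end;
    return disjoint || a_inside_b || b_inside_a;
}
```

For example, ranges [0, 10) and [5, 15) cannot both be expressed as elements without one closing inside the other. Setting KVCFG_DISABLEZONE avoids the problem for such a document.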

    KVCFG_SETTEMPDIRECTORY

    The KVCFG_SETTEMPDIRECTORY flag enables you to specify the directory in which temporary files created during conversion processes are stored. By default, the system temporary directory is used.

    To define a directory for temporary files generated during an out-of-process conversion, set the tempfilepath parameter in the formats_e.ini file. See Convert Files Out of Process.

    On Windows, p must be in the local Windows code page.

    To set KVCFG_SETTEMPDIRECTORY when converting out-of-process, call this function before you call KVXMLStartOOPSession().

    KVCFG_SETXMLCONFIGINFO

    The KVCFG_SETXMLCONFIGINFO flag enables you to define which elements and attributes are extracted from XML documents with a specified format ID or root element. You can use this to override the default settings for the supported XML formats (see Convert XML Files), or to define settings for custom XML document types.

    The settings are defined in the KVXConfigInfo structure (see KVXConfigInfo). To set custom settings for more than one document type, call the KVXMLConfig() function once for each type.

    You can also modify element extraction settings by using the kvxconfig.ini file. See Configure Element Extraction for XML Documents.

    KVCFG_LOGICALPDF

    The KVCFG_LOGICALPDF flag converts paragraphs in a PDF file in the order in which they appear on the page (logical reading order). The nValue argument specifies the paragraph direction. See Convert PDF Files to a Logical Reading Order.

    KVCFG_DELSOFTHYPHEN

    If you set KVCFG_DELSOFTHYPHEN, soft hyphens in the source document are removed, and the hyphenated words are joined in the XML output. By default, soft hyphens are maintained. See Control Hyphenation.

    OpenText recommends that you remove soft hyphens if you use Export to generate text output for an indexing engine or are not concerned with maintaining the document's layout. See fpConvertStream() or KVXMLConvertFile() for more information on running Export in index mode.

    KVCFG_INCLREVISIONMARK

    If you set this flag to TRUE, text and graphics that were deleted from a document with a revision tracking feature enabled are converted, and revision tracking information is included in the XML output.

    To reset the flag and exclude deleted content and revision tracking information from the XML output, set the flag to FALSE. See Convert Revision Tracking Information. The default is FALSE.

    KVCFG_WP_NOCOMMENTS

    Set KVCFG_WP_NOCOMMENTS to TRUE to prevent text from comments and annotations from being exported.

    You can also toggle comment output by modifying the formats_e.ini file. See Show Hidden Data.

    KVCFG_WP_SHOWHIDDENTEXT

    Set KVCFG_WP_SHOWHIDDENTEXT to TRUE to export hidden text from Microsoft Word documents.

    KVCFG_WP_SHOWDATEFIELDCODE

    Set KVCFG_WP_SHOWDATEFIELDCODE to TRUE to export date field codes from Microsoft Word documents.

    KVCFG_WP_SHOWFILENAMEFIELDCODE

    Set KVCFG_WP_SHOWFILENAMEFIELDCODE to TRUE to export the file name field code from Microsoft Word documents.

    KVCFG_SS_SHOWHIDDENINFOR

    Set KVCFG_SS_SHOWHIDDENINFOR to TRUE to export hidden information from Microsoft Excel files.

    KVCFG_SS_SHOWCOMMENTS

    Set KVCFG_SS_SHOWCOMMENTS to TRUE to export comments from Microsoft Excel files.

    KVCFG_SS_SHOWFORMULA

    Set KVCFG_SS_SHOWFORMULA to TRUE to export formulas from Microsoft Excel files.

    KVCFG_PG_HIDEHIDDENSLIDE

    Set KVCFG_PG_HIDEHIDDENSLIDE to TRUE to prevent hidden slides in Microsoft PowerPoint files from being exported.

    KVCFG_PG_HIDECOMMENT

    Set KVCFG_PG_HIDECOMMENT to TRUE to prevent comments in Microsoft PowerPoint files from being exported. Comments are exported by default from PowerPoint 97 to 2000 files.

    KVCFG_PG_SHOWCOMMENTSSLIDE

    Set KVCFG_PG_SHOWCOMMENTSSLIDE to TRUE to export comments slides from Microsoft PowerPoint 2003 and 2007 files.

    KVCFG_PG_SHOWSLIDNOTES

    Set KVCFG_PG_SHOWSLIDNOTES to TRUE to export slide notes from Microsoft PowerPoint files.

    You can also toggle slide note output by modifying the formats_e.ini file. See Show Hidden Data.
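
Because the hidden-data flags above all take TRUE with a NULL p, several of them can be set in one loop. The following is a minimal sketch under stated assumptions: ExErrorCode, EX_SUCCESS, ExConfigFn, enable_flags(), and stub_config() are hypothetical stand-ins so that the example compiles without the KeyView headers. They are not part of the KeyView API.

```c
#include <stddef.h>

/* Stand-ins so this sketch compiles without the KeyView headers. In real
   code, include kvtypes.h and kverrorcodes.h instead; the real function
   pointer also uses the pascal calling convention shown in Syntax. */
typedef int ExErrorCode;
#define EX_SUCCESS 0 /* placeholder success value, an assumption */
typedef ExErrorCode (*ExConfigFn)(void *pContext, int nType, int nValue, void *p);

/* Sets every flag in flags[] to TRUE (1) with p == NULL. Stops at the
   first call that does not return EX_SUCCESS and returns that code. */
ExErrorCode enable_flags(ExConfigFn cfg, void *ctx, const int *flags, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        ExErrorCode rc = cfg(ctx, flags[i], 1, NULL);
        if (rc != EX_SUCCESS) {
            return rc;
        }
    }
    return EX_SUCCESS;
}

/* Recording stub that stands in for KVXMLConfig() so the helper can be
   exercised on its own. It counts calls and rejects negative flag values. */
int stub_call_count = 0;

ExErrorCode stub_config(void *pContext, int nType, int nValue, void *p)
{
    (void)pContext;
    (void)nValue;
    (void)p;
    stub_call_count++;
    return nType < 0 ? 1 : EX_SUCCESS;
}
```

In an application, the array would hold constants such as KVCFG_WP_SHOWHIDDENTEXT and KVCFG_SS_SHOWCOMMENTS, and the function pointer would be the one obtained for KVXMLConfig().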

    KVCFG_SETPASSWORD

    This flag enables you to define a password used to open a password-protected file for export. See Export Password Protected Files.

    nValue is TRUE.

    p is the source file password, which can have a maximum length of 255 characters (the final byte is null).
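
An application might check the password length before passing it to KVXMLConfig(). The sketch below assumes the limit means at most 255 characters before the terminating null, which is one reading of the description above. password_length_ok() and EX_MAX_PASSWORD_CHARS are hypothetical names, not part of the KeyView API.

```c
#include <stddef.h>

#define EX_MAX_PASSWORD_CHARS 255 /* limit stated for KVCFG_SETPASSWORD */

/* Returns 1 if password is non-NULL and has at most 255 characters
   before its terminating null, otherwise 0. */
int password_length_ok(const char *password)
{
    size_t n;
    if (password == NULL) {
        return 0;
    }
    /* Scan at most limit + 1 bytes, so an over-long string is rejected
       without walking its full length. */
    for (n = 0; n <= EX_MAX_PASSWORD_CHARS; n++) {
        if (password[n] == '\0') {
            return 1;
        }
    }
    return 0;
}
```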

    KVCFG_POSITIONINFOOUTPUTTYPE

    This flag enables you to extend the existing <p> tags to include bounding box information.

    KVCFG_OCR

    Specifies whether to perform Optical Character Recognition (OCR) on raster image files, to extract machine-printed text from the image. The output from XML Export includes the original image, exported to the format you specify in KVXMLOptions, and any text extracted by OCR inside <ocr> tags. If OCR detects that some of the text forms a table, it will be included in the output as a <table>.

    OCR is available only on certain platforms (see Optical Character Recognition in the platform differences section). OCR processes only standalone raster images and not subfiles, such as images embedded in a Word document.

    If your license includes OCR, it is enabled by default. To disable OCR, set this flag to FALSE.

    Returns

    The return value is one of the error codes defined in KVErrorCode in kverrorcodes.h.

    Discussion

    • You must call this function after the call to fpInit() and before the call to fpConvertStream() or KVXMLConvertFile().

    • This function runs in-process or out of process. See Convert Files Out of Process.

    • When converting out-of-process, you must call this function after the call to KVXMLStartOOPSession() and before the call to KVXMLEndOOPSession(). The exception is KVCFG_SETTEMPDIRECTORY: to set this flag, call this function before the call to KVXMLStartOOPSession().
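
The ordering rules above can be modeled as a simple check. This is a sketch of the rules as stated here, not KeyView code: SessionPhase and config_call_allowed() are hypothetical, and the model assumes that fpInit() has already been called when KVCFG_SETTEMPDIRECTORY is set for an out-of-process conversion.

```c
/* Hypothetical model of where a conversion session currently stands. */
typedef enum {
    PHASE_BEFORE_INIT,        /* fpInit() not yet called */
    PHASE_INITIALIZED,        /* fpInit() done, no OOP session started */
    PHASE_OOP_SESSION_OPEN,   /* KVXMLStartOOPSession() called */
    PHASE_CONVERTING_OR_DONE  /* conversion started, or OOP session ended */
} SessionPhase;

/* Returns 1 if, under the rules in this Discussion, KVXMLConfig() may be
   called now. out_of_process is nonzero for out-of-process conversion;
   is_temp_dir is nonzero when setting KVCFG_SETTEMPDIRECTORY. */
int config_call_allowed(SessionPhase phase, int out_of_process, int is_temp_dir)
{
    if (!out_of_process) {
        /* In process: after fpInit() and before the conversion call. */
        return phase == PHASE_INITIALIZED;
    }
    if (is_temp_dir) {
        /* Out of process: the temporary directory must be set before
           the OOP session starts. */
        return phase == PHASE_INITIALIZED;
    }
    /* Out of process, all other flags: between session start and end. */
    return phase == PHASE_OOP_SESSION_OPEN;
}
```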

    Examples

    • To generate verbose markup, but no images:

      (*fpXMLConfig)(pKVXML, KVCFG_SUPPRESSIMAGES, TRUE, NULL);
    • To produce summary information in UTF8:

      (*fpXMLConfig)(pKVXML, KVCFG_SETMETADATACHARSET, KVCS_UTF8, NULL);
    • To specify that bookmarks in a PDF file are not converted to XLinks in the XML output:

      (*fpXMLConfig)(pKVXML, KVCFG_SUPPRESSTOCPRINTIMAGE, TRUE, NULL);
    • To disable the conversion of Microsoft Word bookmarks to zone elements:

      (*fpXMLConfig)(pKVXML, KVCFG_DISABLEZONE, TRUE, NULL);
    • To set a directory for temporary files:

      char     tmpDir[250];
      strcpy (tmpDir, "c:\\temp\\xmlexport");
      (*fpXMLConfig)(pKVXML, KVCFG_SETTEMPDIRECTORY, 0, tmpDir);
    • To specify custom extraction settings for conversion of an XML file:

      KVXConfigInfo  xinfo;  /* populate xinfo */
      (*fpXMLConfig)(pKVXML, KVCFG_SETXMLCONFIGINFO, 0, &xinfo);
    • To specify that PDF files are converted to a logical reading order, and that the paragraph direction for the PDF output is left to right:

      (*fpXMLConfig)(pKVXML, KVCFG_LOGICALPDF, LPDF_LTR, NULL);
    • To specify that PDF files are converted to a logical reading order, and that the paragraph direction for the PDF output is right to left:

      (*fpXMLConfig)(pKVXML, KVCFG_LOGICALPDF, LPDF_RTL, NULL);
    • To specify that PDF files are converted to a logical reading order, and that the paragraph direction for the PDF output is determined on the fly for each page:

      (*fpXMLConfig)(pKVXML, KVCFG_LOGICALPDF, LPDF_AUTO, NULL);
    • To specify that soft hyphens are removed from the XML output:

      (*fpXMLConfig)(pKVXML, KVCFG_DELSOFTHYPHEN, TRUE, NULL);
    • To convert text and graphics that are identified by revision marks:

      (*fpXMLConfig)(pKVXML, KVCFG_INCLREVISIONMARK, TRUE, NULL);
    • To toggle hidden data output from Microsoft Word documents, use one of the KVCFG_WP flags:

      (*fpXMLConfig)(pKVXML, KVCFG_WP_NOCOMMENTS, TRUE, NULL);
    • To toggle hidden data output from Microsoft Excel documents, use one of the KVCFG_SS flags:

      (*fpXMLConfig)(pKVXML, KVCFG_SS_SHOWHIDDENINFOR, TRUE, NULL);
    • To toggle hidden data output from Microsoft PowerPoint documents, use one of the KVCFG_PG flags:

      (*fpXMLConfig)(pKVXML, KVCFG_PG_HIDEHIDDENSLIDE, TRUE, NULL);
    • To specify a password to open a password-protected file for export:

      (*fpXMLConfig)(pKVXML, KVCFG_SETPASSWORD, TRUE, password);

      where password is a null-terminated string of 255 or fewer characters.

    • To include a position element in the markup for PDF documents:

      (*fpXMLConfig)(pKVXML, KVCFG_ENABLEPOSITIONINFO, TRUE, NULL);

      Using the PDF position element significantly changes the generated markup. For example, without the option, the XML output from a section of a PDF document looks like this:

      <?xml version="1.0" encoding="utf-8" ?> 
        <!DOCTYPE VerityXMLExport (View Source for full doctype...)> 
      - <VerityXMLExport>
      - <WP>
      - <p id="p1" font-size="33pt">
        <img src="ecpe.pdf38760.jpg" height="140px" width="292px" /> 
        Economic Fiscal Update 
        <font size="18pt" color="#777777">Theand</font> 
        <font size="14pt" color="#ffffff">October 30, 2002</font> 
        <font size="29pt" color="#a4a4a4">Overview</font> 
        </p>

      With the option enabled, the same section of the PDF document looks like this:

      <?xml version="1.0" encoding="utf-8" ?>
        <!DOCTYPE VerityXMLExport (View Source for full doctype...)> 
      - <VerityXMLExport>
      - <WP>
        <Position style="position:absolute;top:534px;left:254px;font-family:'Times New Roman';font-size:33pt;white-space:nowrap;" /> 
        <Position style="position:absolute;top:393px;left:254px;white-space:nowrap;" /> 
        <img src="ecpe.pdf36000.jpg" height="140px" width="292px" /> 
        <Position style="position:absolute;top:308px;left:256px;font-family:'Times New Roman';font-size:33pt;white-space:nowrap;" /> 
        Economic 
        <Position style="position:absolute;top:346px;left:256px;font-family:'Times New Roman';font-size:33pt;white-space:nowrap;" /> 
        Fiscal Update 
        <Position style="position:absolute;top:298px;left:281px;font-family:'Times New Roman';font-size:18pt;color:#777777;background-color:#ffffff;white-space:nowrap;" /> 
        The 
        <Position style="position:absolute;top:336px;left:299px;font-family:'Times New Roman';font-size:18pt;color:#777777;background-color:#ffffff;white-space:nowrap;" /> 
        and 
        <Position style="position:absolute;top:543px;left:397px;font-family:'Times New Roman';font-size:14pt;color:#ffffff;background-color:#000000;white-space:nowrap;" /> 
        October 30, 2004 
        <Position style="position:absolute;top:627px;left:382px;font-family:'Times New Roman';font-size:29pt;color:#a4a4a4;background-color:#ffffff;white-space:nowrap;" /> 
        Overview 
    • To include position information in attributes of <p> tags:

      (*fpXMLConfig)(pKVXML, KVCFG_ENABLEPOSITIONINFO, TRUE, NULL);
      (*fpXMLConfig)(pKVXML, KVCFG_POSITIONINFOOUTPUTTYPE, KVPIOT_ATTRIBUTES, NULL);

      In this mode, each piece of content output by the reader with a position is put in its own <p> element. Line break (<br/>) tags are not included in the output.

      The <p> tags have position information, when this information is available from the reader. These are included in new attributes of the <p> tag: top, left, height, width, and rotation.

      The top, left, width, and height attributes are all expressed in pixels. The top and left attributes give the coordinates of the top left corner of the content (an image, text box, and so on) relative to the top left corner of the page. The width and height attributes are the width and height of the content.

      Rotation is expressed in degrees, and gives the clockwise rotation of the content about the top left corner. If the rotation attribute is not present, the rotation is assumed to be zero.

      NOTE: Not all readers output all these attributes for all pieces of content. Only pdf2sr outputs width, height and rotation information for text. pdf2sr does not put height and width attributes on <p> tags that enclose images; rather, the <img> tags themselves have the height and width. For example:

      <p id="p1" font-size="12pt" top="0px" left="0px"><img src="103453.pdf00.png" height="1261px" width="892px"/></p>
      <p id="p2" font-family="MyriadPro-It" font-size="16pt" top="59px" left="129px" height="21px" width="447px"><i>Aufforderung zur Einreichung von Vorschlägen 2005:
      </i></p>
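
An application that consumes this output might need to turn attribute values such as top="59px" into numbers. The following is a minimal sketch of one way to do that. parse_px() is a hypothetical helper, not part of the KeyView API, and it assumes that values are written as a decimal integer followed by px, as in the example above.

```c
#include <stdlib.h>

/* Parses a pixel attribute value such as "59px". Returns 1 and stores the
   number in *out on success. Returns 0 if the text is not a decimal
   integer immediately followed by "px" and nothing else. */
int parse_px(const char *text, long *out)
{
    char *end;
    long value;

    if (text == NULL || out == NULL) {
        return 0;
    }
    value = strtol(text, &end, 10);
    if (end == text) {
        return 0; /* no digits were read */
    }
    if (end[0] != 'p' || end[1] != 'x' || end[2] != '\0') {
        return 0; /* wrong or missing unit suffix */
    }
    *out = value;
    return 1;
}
```

For the second <p> element in the example, parsing left="129px" and width="447px" gives 129 and 447, so the unrotated content spans horizontally from 129 to 576 pixels.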