Configure Element Extraction for XML Documents

When filtering XML files, you can specify which elements and attributes to extract according to the file's format ID or root element. This option is useful when you want to extract only relevant text elements, such as abstracts from reports, or a list of authors from an anthology.

A root element is an element that contains all other elements. In the following XML sample, book is the root element:

<?xml version="1.0" encoding="UTF-8"?>
<book>
    <title>XML Introduction</title>
    <product id="33-657" status="draft">XML Tutorial</product>
    <chapter>Introduction to XML
        <para>What is HTML</para>
        <para>What is XML</para>
    </chapter>
    <chapter>XML Syntax
        <para>Elements must have a closing tag</para>
        <para>Elements must be properly nested</para>
    </chapter>
</book>

For example, you could specify that, for files with the root element book, the title element is extracted as metadata and only product elements whose status attribute has the value draft are extracted. When you extract an element, its child elements are extracted too: if you extract the chapter element from the previous sample, its para child elements are also extracted.

Modify Element Extraction Settings

You can use the C API to modify the settings for the standard XML document types or add configuration settings for your own XML document types.

To modify settings

  1. Call the fpInit() function.
  2. Define the KVXConfigInfo structure.
  3. Call the fpSetConfig() function with the following arguments:

    Argument   Parameter
    nType      KVFLT_SETXMLCONFIGINFO
    nValue     0
    pData      The address of the KVXConfigInfo structure

    For example:

    KVXConfigInfo xinfo;
    /* populate xinfo */

    (*fpSetConfig)(pKVFilter, KVFLT_SETXMLCONFIGINFO, 0, &xinfo);

  4. Repeat step 2 and step 3 until the settings for all the XML document types that you want to customize are defined.
  5. Call the fpFilter() or fpFilterToFile() function.
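
Step 2 leaves the contents of the structure open. As a rough sketch of what the settings for the book sample might contain, the following uses a stand-in structure, SketchXmlConfig, whose members mirror the option names used in the kvxconfig.ini file. Both SketchXmlConfig and make_book_config() are hypothetical names used for illustration and are not part of the Filter SDK. In a real program you would populate the KVXConfigInfo structure declared in the SDK headers, whose member names and types may differ, and pass its address to fpSetConfig() as in step 3.

```c
/* Stand-in for illustration only; this is NOT the SDK's KVXConfigInfo.
   Members mirror the option names used in kvxconfig.ini. */
typedef struct {
    const char *eKVFormat;           /* format ID, kept as text in this sketch */
    const char *szRoot;              /* root element that selects these settings */
    const char *szInMetaElement;     /* elements extracted as metadata */
    const char *szExMetaElement;     /* child elements excluded from metadata */
    const char *szInContentElement;  /* elements extracted as content text */
    const char *szExContentElement;  /* child elements excluded from content */
    const char *szInAttribute;       /* attribute values to extract */
} SketchXmlConfig;

/* Settings for files whose root element is book: title is extracted as
   metadata, and product elements with status="draft" as content. */
static SketchXmlConfig make_book_config(void)
{
    SketchXmlConfig cfg = {0};       /* unset options stay NULL */

    cfg.eKVFormat = "Unknown_Fmt";   /* custom XML type, matched by root element */
    cfg.szRoot = "book";
    cfg.szInMetaElement = "title";
    cfg.szInContentElement = "product@status=draft";
    return cfg;
}
```

Keeping each document type in its own small constructor function makes it easy to repeat steps 2 and 3 once per type.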

Explore XML Extraction Settings with the Sample Program

The filter and filtertest sample programs read XML extraction settings from a configuration file that you specify with the -x argument. This option lets you try XML extraction settings without programming.

An example configuration file, kvxconfig.ini, is provided with the Filter SDK. The file is in the directory install\OS\bin, where install is the installation directory and OS is the name of the operating system.

The file contains the default element extraction settings for some XML formats. Sections from [config0] to [config99] show the default settings defined internally in KeyView. For example, the section [config3] shows the default extraction settings for the format MS_Visio_XML_Fmt. You can modify these settings, but in most cases you do not need to.

To define custom extraction settings for a generic XML document, add a new section that describes which elements KeyView should extract from that document type. The sample program expects custom sections to be named [configN], where N is an integer that starts at 100 and increases by 1 for each additional file type: [config100], [config101], [config102], and so on.
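
If you generate configuration files from a script or tool, the numbering rule can be captured in a small helper. The following is a minimal sketch; custom_section_name() and FIRST_CUSTOM_SECTION are hypothetical names chosen for this example and are not provided by the Filter SDK.

```c
#include <limits.h>
#include <stdio.h>

/* First section number available for custom document types. */
#define FIRST_CUSTOM_SECTION 100

/* Writes the section header for the custom type at zero-based position
   `index` into buf, so index 0 gives "[config100]".
   Returns 0 on success, or -1 if index is out of range or buf is too small. */
static int custom_section_name(int index, char *buf, size_t buflen)
{
    int n;

    if (index < 0 || index > INT_MAX - FIRST_CUSTOM_SECTION)
        return -1;
    if (buf == NULL || buflen == 0)
        return -1;

    n = snprintf(buf, buflen, "[config%d]", FIRST_CUSTOM_SECTION + index);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}
```

Calling the helper with index 0, 1, and 2 produces the three section names listed above.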

To define custom settings for processing an XML file with the root element book, you could add the following:

[config100]
eKVFormat=
szRoot=book
szInMetaElement=
szExMetaElement=
szInContentElement=*
szExContentElement=para
szInAttribute=

This simple example extracts text from all elements except para elements.
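
To make the interaction between the include and exclude lists concrete, the following sketch models it as a simple name check. It is a simplified illustration and not KeyView's implementation: it ignores the element hierarchy, namespaces, and attribute qualifiers, and it does not model how KeyView treats an empty include list. The function names list_matches() and is_content() are hypothetical.

```c
#include <string.h>

/* Returns 1 if `name` appears in the comma-separated `list`, or if the list
   contains the wildcard "*". Returns 0 otherwise, including for a NULL or
   empty list. Whitespace around names is not trimmed in this sketch. */
static int list_matches(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (p != NULL && *p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        if (len == 1 && p[0] == '*')
            return 1;
        if (len == name_len && strncmp(p, name, len) == 0)
            return 1;
        p = comma ? comma + 1 : NULL;   /* move to the next list item */
    }
    return 0;
}

/* In this model, an element is content when it is included and not excluded. */
static int is_content(const char *include, const char *exclude, const char *name)
{
    return list_matches(include, name) && !list_matches(exclude, name);
}
```

With the settings from [config100], is_content("*", "para", "title") returns 1 and is_content("*", "para", "para") returns 0.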

The following table describes the configuration options in kvxconfig.ini. These are based on the structure KVXConfigInfo.

Configuration Option Description
eKVFormat

The format ID as detected by the KeyView detection module. This determines the file type to which these extraction settings apply. See Obtain Format Information for more information on format ID values.

If you are adding configuration settings for a custom XML document type, you must set eKVFormat to Unknown_Fmt.

szRoot

The file's root element. If eKVFormat is set to Unknown_Fmt, the root element is used to determine the file type to which these settings apply. Otherwise, szRoot is ignored.

szInMetaElement

The elements extracted from the file as metadata. All other elements are extracted as text.

szExMetaElement

The child elements in the included metadata elements that are not extracted from the file as metadata. For example, the default extraction settings for the Visio XML format extract the DocumentProperties element as metadata. This element includes child elements such as Title, Subject, Author, Description, and so on. However, the child element PreviewPicture is defined in szExMetaElement because it is binary data and should not be extracted.

You cannot exclude any metadata elements from the output for StarOffice files. All metadata is extracted regardless of this setting.

szInContentElement

The elements extracted from the file as content text.

szExContentElement

The child elements in the included content elements that are not extracted from the file as content text.

szInAttribute

The attribute values extracted from the file. If you do not define attributes here, attribute values are not extracted.

Syntax for Specifying Elements

When you specify XML elements in your configuration options, use the following guidelines:

  • You can specify multiple elements in a comma-separated list, up to a maximum of 20 elements. You must not specify multiple root elements.

  • You can use an asterisk (*) to match all elements (including child elements) or attributes.

  • To further qualify an element, you can specify that the element must exist in a certain namespace, must contain a specific attribute, or both. To define the namespace and attribute of an element, use the following syntax:

    ns_prefix:elemname@attribname=attribvalue

    NOTE: You must enclose attribute values that contain spaces in quotation marks.

    For example, the entry bg:language@id=xml extracts a language element in the namespace bg that has an id attribute with the value "xml". This entry extracts the following element from an XML file:

    <bg:language id="xml">XML is a simple, flexible text format derived from SGML</bg:language>

    but does not extract:

    <bg:language id="sgml">SGML is a system for defining markup languages.</bg:language>

    or

    <adv:language id="xml">The namespace should be a Uniform Resource Identifier (URI).</adv:language>
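
The qualifier syntax above can be illustrated with a small parser. The following is a sketch under stated assumptions, not KeyView's own parsing code: ElementSpec, parse_element_spec(), and spec_matches() are hypothetical names, each part is limited to 63 characters, the asterisk wildcard is not handled, and a part that is omitted from the entry is assumed to match anything.

```c
#include <string.h>

#define PART_MAX 64

/* One parsed entry; an empty string means the part was not specified. */
typedef struct {
    char prefix[PART_MAX];
    char name[PART_MAX];
    char attr[PART_MAX];
    char value[PART_MAX];
} ElementSpec;

/* Copies len characters of src into dst; returns -1 if they do not fit. */
static int copy_part(char *dst, const char *src, size_t len)
{
    if (len >= PART_MAX)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

/* Splits "ns_prefix:elemname@attribname=attribvalue" into its parts.
   The prefix and the attribute qualifier are both optional.
   Returns 0 on success, -1 on a malformed or oversized entry. */
static int parse_element_spec(const char *entry, ElementSpec *out)
{
    const char *at = strchr(entry, '@');
    size_t head_len = at ? (size_t)(at - entry) : strlen(entry);
    const char *colon = memchr(entry, ':', head_len);
    const char *name = colon ? colon + 1 : entry;
    size_t name_len = head_len - (size_t)(name - entry);

    memset(out, 0, sizeof *out);
    if (name_len == 0)
        return -1;
    if (colon && copy_part(out->prefix, entry, (size_t)(colon - entry)) != 0)
        return -1;
    if (copy_part(out->name, name, name_len) != 0)
        return -1;

    if (at) {
        const char *eq = strchr(at + 1, '=');
        const char *val;
        size_t val_len;

        if (eq == NULL || eq == at + 1)
            return -1;              /* attribute name or '=' is missing */
        if (copy_part(out->attr, at + 1, (size_t)(eq - (at + 1))) != 0)
            return -1;
        val = eq + 1;
        val_len = strlen(val);
        /* Strip one pair of surrounding quotation marks, if present. */
        if (val_len >= 2 && val[0] == '"' && val[val_len - 1] == '"') {
            val++;
            val_len -= 2;
        }
        if (copy_part(out->value, val, val_len) != 0)
            return -1;
    }
    return 0;
}

/* Returns 1 if an element with the given prefix, name, and one attribute
   satisfies the spec. Parts left empty in the spec match anything
   (an assumption of this sketch). */
static int spec_matches(const ElementSpec *s, const char *prefix,
                        const char *name, const char *attr, const char *value)
{
    if (strcmp(s->name, name) != 0)
        return 0;
    if (s->prefix[0] != '\0' && strcmp(s->prefix, prefix) != 0)
        return 0;
    if (s->attr[0] != '\0' &&
        (strcmp(s->attr, attr) != 0 || strcmp(s->value, value) != 0))
        return 0;
    return 1;
}
```

Parsing bg:language@id=xml and testing it against the three sample elements above accepts the first and rejects the other two, because the second has a different attribute value and the third has a different namespace prefix.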

NOTE: If an element matches as both metadata and content, KeyView outputs it as metadata only.