Configure Element Extraction for XML Documents

When filtering XML files, you can specify which elements and attributes to extract according to the file's format ID or root element. This option is useful when you want to extract only relevant text elements, such as abstracts from reports, or a list of authors from an anthology.

A root element is an element that contains all other elements. In the following XML sample, book is the root element:

<?xml version="1.0" encoding="UTF-8"?>
<book>
    <title>XML Introduction</title>
    <product id="33-657" status="draft">XML Tutorial</product>
    <chapter>Introduction to XML
        <para>What is HTML</para>
        <para>What is XML</para>
    </chapter>
    <chapter>XML Syntax
        <para>Elements must have a closing tag</para>
        <para>Elements must be properly nested</para>
    </chapter>
</book>

For example, you could specify that when filtering files with the root element book, the element title is extracted as metadata, and only product elements whose status attribute has the value draft are extracted. When you extract an element, its child elements are also extracted. For example, if you extract the element chapter from the previous sample, its para child elements are also extracted.
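The selection described above can be reproduced outside the product with Python's standard xml.etree.ElementTree module. This is only a sketch of the logic, not the File Content Extraction API. The variable names are assumptions made for the illustration, and whitespace inside each chapter is collapsed purely for readability.

```python
import xml.etree.ElementTree as ET

# The sample document from this section, held as bytes so that the
# encoding declaration is honored by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<book>
    <title>XML Introduction</title>
    <product id="33-657" status="draft">XML Tutorial</product>
    <chapter>Introduction to XML
        <para>What is HTML</para>
        <para>What is XML</para>
    </chapter>
    <chapter>XML Syntax
        <para>Elements must have a closing tag</para>
        <para>Elements must be properly nested</para>
    </chapter>
</book>"""

root = ET.fromstring(SAMPLE)

# The root element is the outermost element that contains all others.
root_name = root.tag

# Candidate metadata: the text of the title element.
title = root.findtext("title")

# Only product elements whose status attribute equals "draft".
draft_products = [p.text for p in root.findall("product[@status='draft']")]

# Selecting chapter also brings in the text of its para child elements.
chapters = [
    " ".join(" ".join(chapter.itertext()).split())
    for chapter in root.findall("chapter")
]

print(root_name, title, draft_products, chapters, sep="\n")
```

Here itertext() walks each chapter element and all of its descendants, which mirrors the rule that extracting an element also extracts its child elements.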

Modify Element Extraction Settings

In the C++ API, you can modify configuration settings for XML documents by using the xml_config function in the Configuration class.

Syntax for Specifying Elements

When you specify XML elements in your configuration options, use the following guidelines:

  • You can specify multiple elements in a comma-separated list, up to a maximum of 20 elements. You must not specify multiple root elements.

  • You can use an asterisk (*) to match all elements (including child elements) or attributes.

  • To further qualify an element, you can specify that the element must exist in a certain namespace, must contain a specific attribute, or both. To define the namespace and attribute of an element, enter the following:

    ns_prefix:elemname@attribname=attribvalue

    NOTE: You must enclose attribute values that contain spaces in quotation marks.

    For example, the entry bg:language@id=xml extracts a language element in the namespace bg that has an id attribute with the value xml. This entry extracts the following element from an XML file:

    <bg:language id="xml">XML is a simple, flexible text format derived from SGML</bg:language>

    but does not extract:

    <bg:language id="sgml">SGML is a system for defining markup languages.</bg:language>

    or

    <adv:language id="xml">The namespace should be a Uniform Resource Identifier (URI).</adv:language>
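To make the entry syntax concrete, the following Python sketch splits an entry of the form ns_prefix:elemname@attribname=attribvalue into its parts and tests it against the three language elements shown above. It is an illustration only and is not how the product parses its configuration. ElementSpec, parse_entry, parse_list, and matches are hypothetical names. The sketch also makes several assumptions: the asterisk is handled only as an element-name wildcard, an entry without a namespace prefix matches any prefix, and attribute values contain no commas.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_ELEMENTS = 20  # limit stated in the guidelines above


@dataclass(frozen=True)
class ElementSpec:
    """Hypothetical container for one parsed entry."""
    ns_prefix: Optional[str]
    name: str
    attr_name: Optional[str]
    attr_value: Optional[str]


def parse_entry(entry: str) -> ElementSpec:
    """Split one ns_prefix:elemname@attribname=attribvalue entry into parts."""
    element, _, attribute = entry.strip().partition("@")
    prefix, colon, name = element.rpartition(":")
    attr_name, equals, attr_value = attribute.partition("=")
    return ElementSpec(
        ns_prefix=prefix if colon else None,
        name=name,
        attr_name=attr_name or None,
        # Quotation marks only delimit the value, so they are dropped.
        attr_value=attr_value.strip('"') if equals else None,
    )


def parse_list(entries: str) -> List[ElementSpec]:
    """Parse a comma-separated list (assumes no commas inside values)."""
    specs = [parse_entry(item) for item in entries.split(",") if item.strip()]
    if len(specs) > MAX_ELEMENTS:
        raise ValueError(f"at most {MAX_ELEMENTS} elements may be specified")
    return specs


def matches(spec: ElementSpec, prefix: Optional[str], name: str,
            attributes: Dict[str, str]) -> bool:
    """Return True if an element satisfies every qualifier in the spec."""
    if spec.name != "*" and spec.name != name:
        return False
    if spec.ns_prefix is not None and spec.ns_prefix != prefix:
        return False
    if spec.attr_name is None:
        return True
    if spec.attr_name not in attributes:
        return False
    return spec.attr_value is None or attributes[spec.attr_name] == spec.attr_value


spec = parse_entry("bg:language@id=xml")
print(matches(spec, "bg", "language", {"id": "xml"}))    # extracted
print(matches(spec, "bg", "language", {"id": "sgml"}))   # wrong attribute value
print(matches(spec, "adv", "language", {"id": "xml"}))   # wrong namespace
```

The three calls at the end correspond, in order, to the element that is extracted and the two that are not.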

NOTE: If an element matches as both metadata and content, File Content Extraction outputs it as metadata only.
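The precedence rule in this note can be expressed as a small decision function. This Python sketch is illustrative only. classify is a hypothetical name, and the two sets stand in for whatever elements have been configured as metadata and as content.

```python
from typing import Optional, Set


def classify(element_name: str, metadata_names: Set[str],
             content_names: Set[str]) -> Optional[str]:
    """Decide how an element is output; metadata is checked first."""
    if element_name in metadata_names:
        return "metadata"      # wins even if the name is also in content_names
    if element_name in content_names:
        return "content"
    return None                # element is not extracted


metadata_names = {"title"}
content_names = {"title", "product"}

print(classify("title", metadata_names, content_names))    # metadata
print(classify("product", metadata_names, content_names))  # content
print(classify("chapter", metadata_names, content_names))  # None
```

Checking the metadata set before the content set is what guarantees that an element matching both is reported once, as metadata.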