Configure Element Extraction for XML Documents
When filtering XML files, you can specify which elements and attributes to extract according to the file's format ID or root element. This option is useful when you want to extract only relevant text elements, such as abstracts from reports, or a list of authors from an anthology.
A root element is an element that contains all other elements. In the following XML sample, book is the root element:
    <?xml version="1.0" encoding="UTF-8"?>
    <book>
      <title>XML Introduction</title>
      <product id="33-657" status="draft">XML Tutorial</product>
      <chapter>Introduction to XML
        <para>What is HTML</para>
        <para>What is XML</para>
      </chapter>
      <chapter>XML Syntax
        <para>Elements must have a closing tag</para>
        <para>Elements must be properly nested</para>
      </chapter>
    </book>
For example, you could specify that when filtering files with the root element book, the element title is extracted as metadata, and only product elements with a status attribute value of draft are extracted. When you extract an element, the child elements within the element are also extracted. For example, if you extract the element chapter from the previous sample, the child element para is also extracted.
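The selection rules described above can be tried out independently of the SDK. The following is a minimal sketch that uses only the standard JDK DOM parser, not the Filter API. The class name RootElementDemo and its helper methods are hypothetical names chosen for illustration. It shows how a root element is identified, how an element is selected by attribute value, and how the text of child elements comes along with the parent.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; it is not part of the Filter SDK.
class RootElementDemo {

    // Shortened version of the sample document shown above.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<book>"
        + "<title>XML Introduction</title>"
        + "<product id=\"33-657\" status=\"draft\">XML Tutorial</product>"
        + "<chapter>Introduction to XML <para>What is HTML</para> <para>What is XML</para></chapter>"
        + "</book>";

    /** Parses XML text into a DOM document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the name of the root element, the element that contains all others. */
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getTagName();
    }

    /** Returns the text of the first element with the given name, including child text. */
    static String firstText(String xml, String name) {
        NodeList nodes = parse(xml).getElementsByTagName(name);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    /** Returns the text of every element with the given name whose attribute has the given value. */
    static List<String> textWhere(String xml, String name, String attr, String value) {
        List<String> out = new ArrayList<>();
        NodeList nodes = parse(xml).getElementsByTagName(name);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (value.equals(e.getAttribute(attr))) {
                out.add(e.getTextContent());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rootName(SAMPLE));                                // book
        System.out.println(firstText(SAMPLE, "title"));                      // XML Introduction
        System.out.println(textWhere(SAMPLE, "product", "status", "draft")); // [XML Tutorial]
        // The chapter text includes the text of its para children.
        System.out.println(firstText(SAMPLE, "chapter"));
    }
}
```

The SDK applies its own extraction logic; this sketch only mirrors the concepts of root element, attribute condition, and inherited child content.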
Modify Element Extraction Settings
You can modify element extraction settings for XML documents through the Java API or through an initialization file.
Use the Java API
You can use the Java API to modify the settings for the standard XML document types or add configuration settings for your own XML document types.
To modify settings
- Declare an array of XMLConfigSet objects.
- Create an instance of ConfigOption with the following arguments:
  - Set the OptionType to CFG_SETXMLCONFIGINFO.
  - Set the OptionValue to 0.
  - Set OptionData to the array object.
- Call the setConfigOption method, and pass in the ConfigOption instance.
- Call a filter method.

For example:
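    // One XMLConfigSet per XML document type; populate the array before use.
    XMLConfigSet[] XMLInfo;
    // OptionType = CFG_SETXMLCONFIGINFO, OptionValue = 0, OptionData = the array.
    ConfigOption config = new ConfigOption(Filter.CFG_SETXMLCONFIGINFO, 0, XMLInfo);
    // objFilter is the filter object on which the filter method is later called.
    objFilter.setConfigOption(config);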
Use an Initialization File
You can use the initialization file to modify the settings for the standard XML document types or add configuration settings for your own XML document types.
To modify settings
- Modify the kvxconfig.ini file.
- Use the initialization file when processing the XML file. See Configure Element Extraction for XML Documents.

The Java sample program FilterTest demonstrates how to use the initialization file in the filtering process. See Sample Programs.
Explore XML Extraction Settings with the Sample Program
The FilterTest sample program reads XML extraction settings from a configuration file. This option lets you try XML extraction settings without programming.
An example configuration file, kvxconfig.ini, is provided with the Filter SDK. The file is in the directory install\OS\bin, where install is the path name of the Filter installation directory and OS is the name of the operating system.
This configuration file contains default element extraction settings for supported XML formats. For example, the following entry defines extraction settings for the Microsoft Visio 2003 XML format:
    [config3]
    eKVFormat=MS_Visio_XML_Fmt
    szRoot=
    szInMetaElement=DocumentProperties
    szExMetaElement=PreviewPicture
    szInContentElement=Text
    szExContentElement=
    szInAttribute=
The following table describes the configuration options in kvxconfig.ini.
Configuration Option | Description |
---|---|
eKVFormat | The format ID as detected by file format detection. This determines the file type to which these extraction settings apply. See Obtain Format Information for more information on format ID values. If you are adding configuration settings for a custom XML document type, you must set |
szRoot | The file's root element. If |
szInMetaElement | The elements extracted from the file as metadata. All other elements are extracted as text. |
szExMetaElement | The child elements in the included metadata elements that are not extracted from the file as metadata. For example, the default extraction settings for the Visio XML format extract the DocumentProperties element as metadata, but exclude its PreviewPicture child element. You cannot exclude any metadata elements from the output for StarOffice files. All metadata is extracted regardless of this setting. |
szInContentElement | The elements extracted from the file as content text. |
szExContentElement | The child elements in the included content elements that are not extracted from the file as content text. |
szInAttribute | The attribute values extracted from the file. If you do not define attributes here, attribute values are not extracted. |
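To inspect or validate entries before handing the file to the SDK, it can help to load them into a simple map. The following is a minimal sketch of such a reader, assuming the file follows the plain section and key=value layout shown in the [config3] example. The class name KvxConfigReader is hypothetical, the treatment of lines beginning with a semicolon as comments is an assumption, and this is not how the SDK itself reads the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting a kvxconfig.ini-style file; not part of the Filter SDK.
class KvxConfigReader {

    /** Parses INI text into section name -> (option -> value), keeping file order. */
    static Map<String, Map<String, String>> parse(String iniText) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : iniText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith(";")) {
                continue; // skip blank lines and comments (comment syntax is an assumption)
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                current = new LinkedHashMap<>();
                sections.put(line.substring(1, line.length() - 1), current);
                continue;
            }
            int eq = line.indexOf('=');
            if (current != null && eq > 0) {
                // Split on the first '=' only, because values such as
                // "opentext:division@name=keyview" contain '=' themselves.
                current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String ini = String.join("\n",
            "[config3]",
            "eKVFormat=MS_Visio_XML_Fmt",
            "szRoot=",
            "szInMetaElement=DocumentProperties",
            "szExMetaElement=PreviewPicture",
            "szInContentElement=Text",
            "szExContentElement=",
            "szInAttribute=");
        Map<String, String> visio = parse(ini).get("config3");
        System.out.println(visio.get("szInMetaElement")); // DocumentProperties
        System.out.println(visio.get("szRoot").isEmpty()); // true
    }
}
```

Options that are left empty, such as szRoot in the Visio entry, are kept as empty strings so that a caller can distinguish "present but blank" from "missing".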
Syntax for Specifying Elements
When you specify XML elements in your configuration options, use the following guidelines:
- You can specify multiple elements in a comma-separated list, up to a maximum of 20 elements. You must not specify multiple root elements.
- You can use an asterisk (*) to match all elements (including child elements) or attributes.
- To further qualify an element, you can specify that the element must exist in a certain namespace, must contain a specific attribute, or both. To define the namespace and attribute of an element, enter the following:

      ns_prefix:elemname@attribname=attribvalue

  NOTE: You must enclose attribute values that contain spaces in quotation marks.

  For example, the entry bg:language@id=xml extracts a language element in the namespace bg that contains the attribute id with the value "xml". This entry extracts the following element from an XML file:

      <bg:language id="xml">XML is a simple, flexible text format derived from SGML</bg:language>

  but does not extract:

      <bg:language id="sgml">SGML is a system for defining markup languages.</bg:language>

  or

      <adv:language id="xml">The namespace should be a Uniform Resource Identifier (URI).</adv:language>
NOTE: If an element matches as both metadata and content, File Content Extraction outputs it as metadata only.
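The element syntax in this section can be made concrete with a small parser and matcher. The following is a minimal sketch, not the SDK's implementation. The class name ElementSpec and its methods are hypothetical. Two behaviors are assumptions made for the sketch: an entry without a namespace prefix matches an element in any namespace, and attribute values do not contain commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of one entry such as bg:language@id=xml; not part of the Filter SDK.
class ElementSpec {
    final String prefix;    // namespace prefix, or null if none was given
    final String name;      // element name, or "*" for any element
    final String attrName;  // required attribute, or null if none was given
    final String attrValue; // required attribute value, or null if any value is accepted

    ElementSpec(String prefix, String name, String attrName, String attrValue) {
        this.prefix = prefix;
        this.name = name;
        this.attrName = attrName;
        this.attrValue = attrValue;
    }

    /** Parses ns_prefix:elemname@attribname=attribvalue; every part except elemname is optional. */
    static ElementSpec parse(String spec) {
        String element = spec.trim();
        String attrName = null;
        String attrValue = null;
        int at = element.indexOf('@');
        if (at >= 0) {
            String attr = element.substring(at + 1);
            element = element.substring(0, at);
            int eq = attr.indexOf('=');
            if (eq >= 0) {
                attrName = attr.substring(0, eq);
                attrValue = stripQuotes(attr.substring(eq + 1));
            } else {
                attrName = attr;
            }
        }
        int colon = element.indexOf(':');
        String prefix = colon >= 0 ? element.substring(0, colon) : null;
        String name = colon >= 0 ? element.substring(colon + 1) : element;
        return new ElementSpec(prefix, name, attrName, attrValue);
    }

    /** Parses a comma-separated list and enforces the documented limit of 20 elements. */
    static List<ElementSpec> parseList(String list) {
        String[] parts = list.split(",");
        if (parts.length > 20) {
            throw new IllegalArgumentException("At most 20 elements are allowed");
        }
        List<ElementSpec> specs = new ArrayList<>();
        for (String part : parts) {
            specs.add(parse(part));
        }
        return specs;
    }

    /** Removes the quotation marks that enclose values containing spaces. */
    static String stripQuotes(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    /** Tests an element, described by its prefix, local name, and attributes, against this entry. */
    boolean matches(String elemPrefix, String elemName, Map<String, String> attrs) {
        if (prefix != null && !prefix.equals(elemPrefix)) {
            return false;
        }
        if (!name.equals("*") && !name.equals(elemName)) {
            return false;
        }
        if (attrName == null) {
            return true;
        }
        String actual = attrs.get(attrName);
        return actual != null && (attrValue == null || attrValue.equals(actual));
    }

    public static void main(String[] args) {
        ElementSpec spec = parse("bg:language@id=xml");
        System.out.println(spec.matches("bg", "language", Map.of("id", "xml")));   // true
        System.out.println(spec.matches("bg", "language", Map.of("id", "sgml")));  // false
        System.out.println(spec.matches("adv", "language", Map.of("id", "xml")));  // false
    }
}
```

The three calls in main correspond to the three bg:language and adv:language elements in the example above.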
Add Configuration Settings for Custom XML Document Types
You can define element extraction settings for custom XML document types by adding the settings to the kvxconfig.ini file. For example, for files that contain the root element opentextxml, you can add the following section to the end of the initialization file:
    [config101]
    eKVFormat=
    szRoot=opentextxml
    szInMetaElement=dc:title,dc:meta@title,dc:meta@name=title
    szExMetaElement=
    szInContentElement=opentext:division@name=keyview,opentext:division@name=idol,p@style="Heading 1"
    szExContentElement=
    szInAttribute=opentext:division@name
The custom extraction settings must be preceded by a section heading named [configN], where N is an integer starting at 100 and increasing by 1 for each additional file type, as in [config100], [config101], [config102], and so on. The default extraction settings for the supported XML formats are numbered config0 to config99. Currently, only 0 to 6 are used.
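The numbering rule lends itself to a quick check before a new section is appended to the file. The following is a minimal sketch of such a check. The class name ConfigSectionName and its methods are hypothetical and are not part of the SDK. It classifies a heading as default (0 to 99) or custom (100 and above) and suggests the next custom heading.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for checking [configN] section headings; not part of the Filter SDK.
class ConfigSectionName {

    // Up to six digits, so that Integer.parseInt cannot overflow.
    private static final Pattern HEADING = Pattern.compile("\\[config(\\d{1,6})\\]");

    /** Returns N from "[configN]", or -1 if the heading has a different shape. */
    static int index(String heading) {
        Matcher m = HEADING.matcher(heading.trim());
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Headings 0 to 99 are reserved for the default settings of supported XML formats. */
    static boolean isDefault(String heading) {
        int n = index(heading);
        return n >= 0 && n <= 99;
    }

    /** Headings from 100 upward are for custom XML document types. */
    static boolean isCustom(String heading) {
        return index(heading) >= 100;
    }

    /** Returns the next custom heading: [config100] if none exists yet, otherwise highest + 1. */
    static String nextCustomHeading(List<String> existing) {
        int max = 99;
        for (String heading : existing) {
            max = Math.max(max, index(heading));
        }
        return "[config" + (max + 1) + "]";
    }

    public static void main(String[] args) {
        System.out.println(isDefault("[config3]"));   // true
        System.out.println(isCustom("[config101]"));  // true
        System.out.println(nextCustomHeading(List.of("[config0]", "[config100]", "[config101]"))); // [config102]
    }
}
```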
Since a custom XML document type is not recognized by file format detection, the format ID is not defined. The file type is identified by the file's root element only.
If a custom XML document type is not defined in the kvxconfig.ini file or by the setConfigOption method, then the default extraction settings for a generic XML document are used.