Parse XML into Documents
CFS attempts to parse any XML file that it receives according to rules that are specified in the [XMLParsing] section of its configuration file. The parameters in the [XMLParsing] section specify:
- How to divide the XML into documents.
- How to populate each document's DREREFERENCE field.
- How to populate each document's DRECONTENT field.
To configure settings for parsing XML
- Open the CFS configuration file.
- In the [XMLParsing] section, set the following parameters:
  - DocumentRootPaths: A comma-separated list of paths to nodes that contain a single document. Specify the paths relative to the root of the XML. Use a forward slash (/) to represent levels in the XML hierarchy. Any elements contained within the specified node are added to the document as metadata.
  - IncludeRootPath: A Boolean value (default false) that specifies whether to include the node specified by DocumentRootPaths in the document. You might set this parameter to TRUE if the root node has attributes that you need to include in the document.
  - ReferencePaths: A comma-separated list of possible paths to a node that contains the document reference. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. The XML for each document must contain exactly one node that matches the specified path(s).
  - ContentPaths: A comma-separated list of possible paths to a node that contains the document content. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. If multiple content nodes are identified for a single document, a document is produced with multiple sections.
- Save and close the configuration file.
Example
Consider the following XML:
<xml>
  <documents>
    <document>
      <metadata>
        <name>This is the name of the document</name>
        <created>28/02/15 11:01:17</created>
        <modified>28/02/15 15:23:00</modified>
      </metadata>
      <content>Here is some content</content>
    </document>
    <document>
      <metadata>
        <name>This is another document</name>
        <created>01/03/15 12:21:13</created>
        <modified>02/03/15 13:23:03</modified>
      </metadata>
      <different_content>Here is some content</different_content>
    </document>
  </documents>
</xml>
To ingest this XML file, you might use the following configuration:
[XMLParsing]
DocumentRootPaths=documents/document
ReferencePaths=metadata/name
ContentPaths=content,different_content
To ingest the XML, send the ingest action to CFS:
http://localhost:7000/action=ingest&adds=%3Cadds%3E%3Cadd%3E%3Csource%20filename%3D%22xmlfile.xml%22%20lifetime%3D%22permanent%22%20%2F%3E%3C%2Fadd%3E%3C%2Fadds%3E
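The adds parameter in this URL is a percent-encoded XML fragment, which is awkward to write by hand. The following Python sketch shows one way to build the same URL. The helper name build_ingest_url is hypothetical and not part of CFS; the host, port, file name, and lifetime are the values from the example above.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr


def build_ingest_url(host, port, filename, lifetime="permanent"):
    """Build a CFS ingest action URL (hypothetical helper, not part of CFS)."""
    # quoteattr() wraps the value in quotes and escapes XML special characters.
    adds = (
        "<adds><add>"
        f"<source filename={quoteattr(filename)} lifetime={quoteattr(lifetime)} />"
        "</add></adds>"
    )
    # safe="" percent-encodes every reserved character, including "/" and "=".
    return f"http://{host}:{port}/action=ingest&adds={quote(adds, safe='')}"


url = build_ingest_url("localhost", 7000, "xmlfile.xml")
print(url)
```

Before encoding, the adds value reads <adds><add><source filename="xmlfile.xml" lifetime="permanent" /></add></adds>.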
This would produce the following documents:
#DREREFERENCE This is the name of the document
#DREFIELD UUID="bfa1a8aac0b772d1ee467d830fa179bc"
#DREFIELD DocTrackingId="3cd0e5cf3160163adf7445d013ef10b1"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="28/02/15 11:01:17"
#DREFIELD metadata/modified="28/02/15 15:23:00"
#DRECONTENT
Here is some content
#DREENDDOC

#DREREFERENCE This is another document
#DREFIELD UUID="aadf6628fccd0c6b885a79e2e39f4357"
#DREFIELD DocTrackingId="66a63287d85b500159c5b5fb099b99a5"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="01/03/15 12:21:13"
#DREFIELD metadata/modified="02/03/15 13:23:03"
#DRECONTENT
Here is some content
#DREENDDOC
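To make the relationship between the three path parameters and the resulting documents easier to follow, the Python sketch below imitates the splitting on the example XML. It is only an illustration of the documented behavior, not how CFS is implemented: the function split_documents and its return format are hypothetical, and the sketch ignores metadata fields and IncludeRootPath.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<xml>
  <documents>
    <document>
      <metadata>
        <name>This is the name of the document</name>
        <created>28/02/15 11:01:17</created>
        <modified>28/02/15 15:23:00</modified>
      </metadata>
      <content>Here is some content</content>
    </document>
    <document>
      <metadata>
        <name>This is another document</name>
        <created>01/03/15 12:21:13</created>
        <modified>02/03/15 13:23:03</modified>
      </metadata>
      <different_content>Here is some content</different_content>
    </document>
  </documents>
</xml>"""


def split_documents(xml_text, document_root_paths, reference_paths, content_paths):
    """Imitate the documented [XMLParsing] rules (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    documents = []
    for root_path in document_root_paths:
        # DocumentRootPaths are relative to the root of the XML.
        for node in root.findall(root_path):
            # ReferencePaths are relative to the document node; the
            # documentation requires exactly one match per document.
            refs = [m for path in reference_paths for m in node.findall(path)]
            if len(refs) != 1:
                raise ValueError("expected exactly one reference node")
            # Each node matched by ContentPaths becomes one section.
            sections = [
                (m.text or "").strip()
                for path in content_paths
                for m in node.findall(path)
            ]
            documents.append(
                {"reference": (refs[0].text or "").strip(), "sections": sections}
            )
    return documents


docs = split_documents(
    SAMPLE_XML,
    document_root_paths=["documents/document"],
    reference_paths=["metadata/name"],
    content_paths=["content", "different_content"],
)
for doc in docs:
    print(doc["reference"], doc["sections"])
```

Each document node yields one reference and one content section here, because every document in the example matches exactly one of the two configured content paths.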