Parse XML into Documents

CFS attempts to parse every XML file that it receives, using the rules specified in the [XMLParsing] section of its configuration file. The parameters in this section specify:

  • How to divide the XML into documents.
  • How to populate each document's DREREFERENCE field.
  • How to populate each document's DRECONTENT field.

To configure settings for parsing XML

  1. Open the CFS configuration file.
  2. In the [XMLParsing] section, set the following parameters:

    DocumentRootPaths
        A comma-separated list of paths to nodes that each contain a single document. Specify the paths relative to the root of the XML. Use a forward slash (/) to represent levels in the XML hierarchy. Any elements contained within the specified node are added to the document as metadata.
    IncludeRootPath
        A Boolean value (default FALSE) that specifies whether to include the node specified by DocumentRootPaths in the document. You might set this parameter to TRUE if the root node has attributes that you need to include in the document.
    ReferencePaths
        A comma-separated list of possible paths to a node that contains the document reference. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. The XML for each document must contain exactly one node that matches the specified paths.
    ContentPaths
        A comma-separated list of possible paths to a node that contains the document content. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. If multiple content nodes are identified for a single document, CFS produces a document with multiple sections.
  3. Save and close the configuration file.

Example

Consider the following XML:

<xml>
   <documents>
      <document>
         <metadata>
            <name>This is the name of the document</name>
            <created>28/02/15 11:01:17</created>
            <modified>28/02/15 15:23:00</modified>
         </metadata>
         <content>Here is some content</content>
      </document>
      <document>
         <metadata>
            <name>This is another document</name>
            <created>01/03/15 12:21:13</created>
            <modified>02/03/15 13:23:03</modified>
         </metadata>
         <different_content>Here is some content</different_content>
      </document>
   </documents>
</xml>

To ingest this XML file, you might use the following configuration:

[XMLParsing]
DocumentRootPaths=documents/document
ReferencePaths=metadata/name
ContentPaths=content,different_content

To ingest the XML, send the ingest action to CFS:

http://localhost:7000/action=ingest&adds=%3Cadds%3E%3Cadd%3E%3Csource%20
                                         filename%3D%22xmlfile.xml%22%20
                                         lifetime%3D%22permanent%22%20%2F%3E
                                         %3C%2Fadd%3E%3C%2Fadds%3E
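
The value of the adds parameter in the action above is a percent-encoded XML fragment. As a minimal sketch of how such a URL could be assembled, the following Python uses only the standard library. The function name build_ingest_url and its host and port defaults are assumptions made for illustration (the defaults mirror the example URL); they are not part of CFS.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr


def build_ingest_url(filename, host="localhost", port=7000):
    """Return an ingest action URL for one file (hypothetical helper)."""
    # quoteattr() wraps the value in quotes and escapes XML special characters.
    adds = (
        f"<adds><add><source filename={quoteattr(filename)} "
        f'lifetime="permanent" /></add></adds>'
    )
    # safe="" percent-encodes every reserved character, including "/".
    return f"http://{host}:{port}/action=ingest&adds={quote(adds, safe='')}"


if __name__ == "__main__":
    print(build_ingest_url("xmlfile.xml"))
```

Called with "xmlfile.xml", the helper returns the same URL as the example above, on a single line with no line breaks.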

With the configuration above, ingesting this file would produce the following documents:

#DREREFERENCE This is the name of the document
#DREFIELD UUID="bfa1a8aac0b772d1ee467d830fa179bc"
#DREFIELD DocTrackingId="3cd0e5cf3160163adf7445d013ef10b1"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="28/02/15 11:01:17"
#DREFIELD metadata/modified="28/02/15 15:23:00"
#DRECONTENT
Here is some content
#DREENDDOC

#DREREFERENCE This is another document
#DREFIELD UUID="aadf6628fccd0c6b885a79e2e39f4357"
#DREFIELD DocTrackingId="66a63287d85b500159c5b5fb099b99a5"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="01/03/15 12:21:13"
#DREFIELD metadata/modified="02/03/15 13:23:03"
#DRECONTENT
Here is some content
#DREENDDOC
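
To make the mapping from the XML to these documents concrete, the following Python sketch applies the three example paths to the sample XML. It is an illustration only and does not show how CFS is implemented: the function names split_documents and leaf_fields are hypothetical, XML attributes and IncludeRootPath are ignored, and fields that CFS generates itself (such as UUID and DocTrackingId) are omitted.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<xml>
  <documents>
    <document>
      <metadata>
        <name>This is the name of the document</name>
        <created>28/02/15 11:01:17</created>
        <modified>28/02/15 15:23:00</modified>
      </metadata>
      <content>Here is some content</content>
    </document>
    <document>
      <metadata>
        <name>This is another document</name>
        <created>01/03/15 12:21:13</created>
        <modified>02/03/15 13:23:03</modified>
      </metadata>
      <different_content>Here is some content</different_content>
    </document>
  </documents>
</xml>"""

# Values copied from the example [XMLParsing] configuration.
DOCUMENT_ROOT_PATHS = ["documents/document"]
REFERENCE_PATHS = ["metadata/name"]
CONTENT_PATHS = ["content", "different_content"]


def leaf_fields(node, prefix=""):
    """Yield (path, text) for every leaf element below node."""
    for child in node:
        path = f"{prefix}/{child.tag}" if prefix else child.tag
        if len(child):  # the element has children, so descend into it
            yield from leaf_fields(child, path)
        else:
            yield path, (child.text or "").strip()


def split_documents(xml_text):
    """Divide the XML into documents using the three path lists above."""
    root = ET.fromstring(xml_text)
    skip = set(REFERENCE_PATHS) | set(CONTENT_PATHS)
    docs = []
    for root_path in DOCUMENT_ROOT_PATHS:
        for node in root.findall(root_path):
            refs = [n.text for p in REFERENCE_PATHS for n in node.findall(p)]
            if len(refs) != 1:
                raise ValueError("expected exactly one reference node")
            # Each matching content node becomes one section of the document.
            sections = [n.text for p in CONTENT_PATHS for n in node.findall(p)]
            # Remaining leaf elements become metadata fields named by path.
            fields = {k: v for k, v in leaf_fields(node) if k not in skip}
            docs.append(
                {"reference": refs[0], "sections": sections, "fields": fields}
            )
    return docs


if __name__ == "__main__":
    for doc in split_documents(SAMPLE_XML):
        print(doc["reference"], doc["fields"], doc["sections"])
```

In this sketch, the second document takes its content from different_content because ContentPaths lists both element names, and the metadata/created and metadata/modified keys correspond to the field names shown in the output above.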