Many systems export data in XML format. You can use HPE CFS to ingest the XML.
The XML must be encoded in UTF-8.
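Because the XML must be UTF-8, it can help to check a file before handing it to CFS. The following is a minimal sketch, not part of CFS: the helper names `is_valid_utf8` and `to_utf8` are hypothetical, and the source encoding is assumed to be known in advance.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# A Latin-1 encoded fragment is not valid UTF-8 until it is re-encoded.
sample = "<name>café</name>".encode("latin-1")
print(is_valid_utf8(sample))
print(is_valid_utf8(to_utf8(sample, "latin-1")))
```

If the file carries an XML declaration that names another encoding, that declaration would also need updating after re-encoding; the sketch does not do this.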
CFS attempts to parse any XML file that it receives according to the rules specified in the [XMLParsing] section of its configuration file. The parameters in the [XMLParsing] section specify:
- the nodes that each contain a single document.
- the node whose value populates the DREREFERENCE field.
- the node or nodes whose value populates the DRECONTENT field.

XML files that are successfully parsed are not sent to KeyView, but can still be manipulated by import tasks. HPE CFS sends an XML file to KeyView only when it is unable to parse the file.
To configure settings for ingesting XML
In the [XMLParsing] section, set the following parameters:
| Parameter | Description |
|---|---|
| DocumentRootPaths | A comma-separated list of paths to nodes that contain a single document. Specify the paths relative to the root of the XML. Use a forward slash (/) to represent levels in the XML hierarchy. Any elements contained within the specified node are added to the document as metadata. |
| IncludeRootPath | A Boolean value (default false) that specifies whether to include the node specified by DocumentRootPaths in the document. You might set this parameter to TRUE if the root node has attributes that you need to include in the document. |
| ReferencePaths | A comma-separated list of possible paths to a node that contains the document reference. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. The XML for each document must contain exactly one node that matches the specified path(s). |
| ContentPaths | A comma-separated list of possible paths to a node that contains the document content. Specify the paths relative to the node identified by DocumentRootPaths. Use a forward slash (/) to represent levels in the XML hierarchy. If multiple content nodes are identified for a single document, a document is produced with multiple sections. |
Consider the following XML:
<xml>
  <documents>
    <document>
      <metadata>
        <name>This is the name of the document</name>
        <created>28/02/15 11:01:17</created>
        <modified>28/02/15 15:23:00</modified>
      </metadata>
      <content>Here is some content</content>
    </document>
    <document>
      <metadata>
        <name>This is another document</name>
        <created>01/03/15 12:21:13</created>
        <modified>02/03/15 13:23:03</modified>
      </metadata>
      <different_content>Here is some content</different_content>
    </document>
  </documents>
</xml>
To ingest this XML file, you might use the following configuration:
[XMLParsing]
DocumentRootPaths=documents/document
ReferencePaths=metadata/name
ContentPaths=content,different_content
To ingest the XML, send the ingest action to CFS:
http://localhost:7000/action=ingest&adds=%3Cadds%3E%3Cadd%3E%3Csource%20filename%3D%22xmlfile.xml%22%20lifetime%3D%22permanent%22%20%2F%3E%3C%2Fadd%3E%3C%2Fadds%3E
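The adds parameter in this URL is a percent-encoded XML fragment. As a sketch of how such a URL could be assembled, the following Python builds it from the decoded XML. The function name `build_ingest_url` is hypothetical; the host, port, file name, and lifetime value are taken from the example URL.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr


def build_ingest_url(host: str, port: int, filename: str) -> str:
    """Build an ingest action URL for one file (illustrative helper)."""
    # Decoded form of the adds parameter used in the example URL.
    adds = (
        f"<adds><add><source filename={quoteattr(filename)} "
        f'lifetime="permanent" /></add></adds>'
    )
    # safe="" percent-encodes every reserved character, including "/".
    return f"http://{host}:{port}/action=ingest&adds={quote(adds, safe='')}"


url = build_ingest_url("localhost", 7000, "xmlfile.xml")
print(url)
```

The resulting string could then be requested with any HTTP client, for example `urllib.request.urlopen(url)`, assuming a CFS instance is listening on that host and port.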
Ingesting the file with this configuration would produce the following documents:
#DREREFERENCE This is the name of the document
#DREFIELD UUID="bfa1a8aac0b772d1ee467d830fa179bc"
#DREFIELD DocTrackingId="3cd0e5cf3160163adf7445d013ef10b1"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="28/02/15 11:01:17"
#DREFIELD metadata/modified="28/02/15 15:23:00"
#DRECONTENT
Here is some content
#DREENDDOC

#DREREFERENCE This is another document
#DREFIELD UUID="aadf6628fccd0c6b885a79e2e39f4357"
#DREFIELD DocTrackingId="66a63287d85b500159c5b5fb099b99a5"
#DREFIELD ImportVersion="1207655"
#DREFIELD KeyviewVersion="10220"
#DREFIELD metadata/created="01/03/15 12:21:13"
#DREFIELD metadata/modified="02/03/15 13:23:03"
#DRECONTENT
Here is some content
#DREENDDOC
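The way the three path parameters carve the sample XML into documents can be approximated in a few lines of Python. This is only an illustration of the path semantics described above, not how CFS is implemented: the function names and the dictionary layout are assumptions, and excluding the reference and content nodes from the metadata fields simply mirrors the sample output.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<xml><documents>
  <document>
    <metadata>
      <name>This is the name of the document</name>
      <created>28/02/15 11:01:17</created>
      <modified>28/02/15 15:23:00</modified>
    </metadata>
    <content>Here is some content</content>
  </document>
  <document>
    <metadata>
      <name>This is another document</name>
      <created>01/03/15 12:21:13</created>
      <modified>02/03/15 13:23:03</modified>
    </metadata>
    <different_content>Here is some content</different_content>
  </document>
</documents></xml>"""

# Values from the example [XMLParsing] configuration.
DOCUMENT_ROOT_PATHS = ["documents/document"]
REFERENCE_PATHS = ["metadata/name"]
CONTENT_PATHS = ["content", "different_content"]


def leaf_fields(node, skip_ids, prefix=""):
    """Collect leaf elements as {path: text}, skipping the given nodes."""
    fields = {}
    for child in node:
        if id(child) in skip_ids:
            continue
        path = prefix + child.tag
        if len(child):  # element has children, so descend
            fields.update(leaf_fields(child, skip_ids, path + "/"))
        else:
            fields[path] = child.text or ""
    return fields


def split_documents(xml_text):
    """Split the XML into documents using the configured paths."""
    root = ET.fromstring(xml_text)
    docs = []
    for root_path in DOCUMENT_ROOT_PATHS:
        for node in root.findall(root_path):
            refs = [n for p in REFERENCE_PATHS for n in node.findall(p)]
            if len(refs) != 1:
                raise ValueError("expected exactly one reference node")
            contents = [n for p in CONTENT_PATHS for n in node.findall(p)]
            skip_ids = {id(n) for n in refs + contents}
            docs.append({
                "reference": refs[0].text,
                "fields": leaf_fields(node, skip_ids),
                "sections": [n.text or "" for n in contents],
            })
    return docs


docs = split_documents(SAMPLE_XML)
for doc in docs:
    print(doc["reference"], doc["fields"], doc["sections"])
```

Each entry in `ContentPaths` is tried against every document node, which is why the second document, which uses `different_content` instead of `content`, still yields a content section.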