Ingest XML

Many systems export data in XML format. This section describes how to ingest XML into IDOL using NiFi Ingest.

The steps in this section assume that:

  • You have one or more files, each of which contains XML that should be parsed into one or more IDOL documents.
  • The XML is not necessarily in IDOL document format.

To ingest XML

  1. Add a GetFileSystem processor to your data flow to retrieve the XML file(s).

    • Set the property "Directory Paths" to the location of your XML files.
    • Set the property "Parse XML" to FALSE.
    • If you are running a NiFi cluster, set the dynamic property adv:FlowFileEmbedFiles to TRUE. For more information about this property, see Advanced Connector Properties.
  2. Add an ExecuteDocumentLua processor to the data flow.
  3. Connect the "success" relationship of the GetFileSystem processor to the ExecuteDocumentLua processor.
  4. Configure the ExecuteDocumentLua processor.

    1. Right-click the processor and click Configure.

      The Configure Processor dialog box opens.

    2. Click the Properties tab.
    3. Set the property "Lua script function arguments" to LuaFlowFileDocument, LuaProcessorSession.
    4. Click ADVANCED.

      The advanced configuration page opens.

    5. In the Lua Samples area, click Reading and writing a FlowFile document > Parse XML from the content file(name), and return new documents.
    6. Copy the example script into the Lua code area.

      The script uses the parse_document_xml function to parse the input file. If the incoming FlowFile contains a filename, the filename is passed directly to the function. If the incoming FlowFile contains an embedded file, the data is read from the FlowFile and passed to the parse_document_xml function as a string.

    7. At the beginning of the Lua script, modify the values in the xmlParams table so that they are suitable for your XML. For example, the document_root_paths option is a list of paths to elements that represent the root of a document in the input XML. For more information about these options, refer to the documentation for the parse_document_xml function.
    8. Click SAVE and then close the advanced configuration page.
  5. Connect the "returned" relationship of the ExecuteDocumentLua processor to your ingestion pipeline. The resulting documents are output to the "returned" relationship because they are explicitly returned from the handler function in the Lua script.

    TIP: The original FlowFiles that were routed to the ExecuteDocumentLua processor are routed to the "success" relationship.

    To avoid indexing documents that represent the original XML files, you can auto-terminate this relationship. However, if you are using a document registry service to ensure that documents are indexed in the correct sequence, route the "success" relationship to an UnregisterDocument processor instead. For more information about the document registry service, see Index Documents in the Correct Sequence.

  6. Start the GetFileSystem and ExecuteDocumentLua processors.
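
The document_root_paths option in step 4 can be easier to choose once you have seen how a document root splits an XML file into separate documents. The following Python sketch illustrates only the concept: it does not use the NiFi Ingest Lua API, and the function split_into_documents, the sample XML, and the path "record" are all hypothetical names invented for this illustration. It assumes a simple export in which each repeated element holds one record and each child element is a field.

```python
import xml.etree.ElementTree as ET


def split_into_documents(xml_text, document_root_path):
    """Return one dict of fields per element that matches document_root_path.

    document_root_path is an ElementTree path relative to the XML root
    element. It is a hypothetical stand-in for one entry in the
    document_root_paths option, not the product's own path syntax.
    """
    root = ET.fromstring(xml_text.strip())
    documents = []
    for element in root.findall(document_root_path):
        # Each direct child becomes a field. If a tag repeats inside one
        # record, the last value wins in this simplified sketch.
        fields = {child.tag: (child.text or "").strip() for child in element}
        documents.append(fields)
    return documents


# Hypothetical export file: two records under a single root element.
SAMPLE_XML = """
<export>
  <record><title>First</title><body>Alpha</body></record>
  <record><title>Second</title><body>Beta</body></record>
</export>
"""

# Treating every <record> element as a document root yields two documents.
docs = split_into_documents(SAMPLE_XML, "record")
for doc in docs:
    print(doc)
```

In this sketch, choosing "record" as the document root produces one document per record, whereas choosing the top-level element would produce a single document that contains everything. The same reasoning applies when you decide which element paths to list in the xmlParams table.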