Re-Ingest Documents that Match a Query

After making changes to your NiFi Ingest dataflow, you might want to re-ingest the documents in your IDOL index. For example, if you add a processing step to perform a new type of media analysis, you might want to re-ingest image files.

One approach is to clear the index and the state information stored by your connectors (see Connector Datastores) so that all of the items in your data repositories are ingested again. However, re-ingesting all of your documents can be time-consuming. Instead, you can send a query to your IDOL Content component and re-ingest only the documents that match the query.

NOTE: To do this, the connectors you are using must support the synchronize from identifiers feature.

To re-ingest documents that match a query

  1. Open the NiFi user interface.
  2. Add a QueryIDOL processor to the data flow. This processor queries your IDOL Content component and returns documents that match the query.

    1. Drag the processor icon from the components toolbar to the canvas.

      The Add Processor dialog box opens.

    2. In the Source list, click idol.nifi.

    3. Select the QueryIDOL processor and click ADD.

      The processor is added to the canvas.

    4. Right-click the QueryIDOL processor and click Configure.

      The Configure Processor dialog box opens.

    5. Click the Properties tab and set the following properties:

      IDOL Host: The host name or IP address of your IDOL Content component.
      IDOL ACI Port: The ACI port of your IDOL Content component.
      Text, Field Text, Database Match: These properties set the values of the Text, FieldText, and DatabaseMatch parameters in the Query action. Specify a query that returns the documents that you want to re-ingest.

      For example, to re-ingest PDF files you might set the Field Text property to:

      MATCH{230}:DOCUMENT_KEYVIEW_TYPE_NUMBER
    6. Click the Scheduling tab and in the Run Schedule box, type 1 hour.
    7. Click APPLY.
  3. Add a RouteOnAttribute processor to the data flow. This is necessary if your query returns documents that were originally retrieved by different connectors, because each document must be routed back to the connector that retrieved it. The following steps demonstrate how to do this for a File System Connector. To route documents to multiple connectors, repeat sub-steps 6 to 8 of this step to create additional output relationships.

    1. Drag the processor icon from the components toolbar to the canvas.

      The Add Processor dialog box opens.

    2. In the Source list, click org.apache.nifi.

    3. Select the RouteOnAttribute processor and click ADD.

      The processor is added to the canvas.

    4. Right-click the RouteOnAttribute processor and click Configure.

      The Configure Processor dialog box opens.

    5. Click the Properties tab.

    6. Click the add (+) button to add a new dynamic property. This creates a new output relationship.

      The Add Property dialog box opens.

    7. In the box, type a name for the output relationship and click OK. For example, to create an output relationship to route documents to a File System Connector, type FileSystem.
    8. Set the value of the property. For example, to select documents that were originally retrieved by a NiFi processor named "MyFileSystemConnector":

      ${idol.doc.source:getDelimitedField(2,':'):equals('MyFileSystemConnector')}

      This NiFi Expression Language expression reads the second colon-delimited field of the idol.doc.source FlowFile attribute and checks whether its value equals MyFileSystemConnector. For more information about this attribute, see Introduction to FlowFiles and Documents.

    9. Click APPLY.
  4. Create a connection between the QueryIDOL processor (using the "success" relationship) and the RouteOnAttribute processor.
  5. Create a connection between the RouteOnAttribute processor (using the "FileSystem" relationship that you added) and the File System Connector that originally retrieved the files. The connector should retrieve the files and route them into your ingestion pipeline.
  6. Start the QueryIDOL and RouteOnAttribute processors.

    The QueryIDOL processor sends the query to the IDOL Content component. You should be able to see the query in the Content component request log, which is available through action=GRL. Any result documents that were originally retrieved by the File System Connector are routed to the connector for re-ingestion.
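
To check a query and a routing rule before starting the processors, the logic in the procedure above can be approximated outside NiFi. The following Python sketch is illustrative only and is not part of NiFi or IDOL. It assumes a Content component at localhost with ACI port 9100 (substitute your own values), uses a wildcard Text value as a placeholder, and the function names and the sample idol.doc.source value are hypothetical. The real attribute format is described in Introduction to FlowFiles and Documents, and NiFi's getDelimitedField also supports quoting and escape characters, which this stand-in ignores.

```python
from urllib.parse import quote


def build_query_url(host, port, text="*", field_text="", database_match=""):
    """Build an ACI Query URL from the same values as the processor properties.

    The host, port, and default wildcard text are assumptions for illustration.
    """
    params = [("Text", text)]
    if field_text:
        params.append(("FieldText", field_text))
    if database_match:
        params.append(("DatabaseMatch", database_match))
    # Percent-encode each value so braces and colons survive in the URL.
    query = "&".join(f"{name}={quote(value, safe='')}" for name, value in params)
    return f"http://{host}:{port}/action=Query&{query}"


def get_delimited_field(value, index, delimiter=":"):
    """Simplified stand-in for getDelimitedField: 1-based index, no quoting rules."""
    parts = value.split(delimiter)
    if 1 <= index <= len(parts):
        return parts[index - 1]
    return None  # this sketch returns None when the field does not exist


def routes_to(source_attribute, connector_name):
    """Mimic ${idol.doc.source:getDelimitedField(2,':'):equals('<name>')}."""
    return get_delimited_field(source_attribute, 2) == connector_name


if __name__ == "__main__":
    # Field Text from the PDF example in the procedure.
    print(build_query_url("localhost", 9100,
                          field_text="MATCH{230}:DOCUMENT_KEYVIEW_TYPE_NUMBER"))
    # Made-up attribute value with the connector name in the second field.
    print(routes_to("part1:MyFileSystemConnector:part3", "MyFileSystemConnector"))
```

Opening the generated URL in a browser shows which documents the Field Text selects, and running routes_to against attribute values copied from real FlowFiles confirms that the dynamic property would match the intended connector.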