Synchronize with a Repository
The primary purpose of a connector is to retrieve data from a repository so that it can be processed and indexed into an IDOL Content component. The connector retrieves new documents, updated documents, and a list of documents that have been deleted, so that the IDOL index is kept up to date with the data source. IDOL NiFi Ingest processors that synchronize with a repository have names that begin with "Get", such as GetFileSystem, GetWeb, or GetSharePoint.
Some IDOL NiFi Ingest connectors also support a feature called synchronize from identifiers. If a connector supports this feature, you can route documents that were not processed successfully back to the connector, so that the connector retrieves each of the failed documents again.
The behavior of the "Get" processor changes depending on whether you configure an incoming connection.
- If the processor has no input, it synchronizes with the repository each time it is scheduled to run. (For information about how to configure the schedule, see Configure Synchronize Schedules.)
- If the processor has an incoming connection, it runs only when there is a FlowFile in the input queue. The connector reads the identifier (a unique value that is assigned by the connector) of the incoming document and attempts to retrieve that document from the repository again. If the source data is retrieved successfully, the connector discards the FlowFile that it received and outputs a new FlowFile to the success relationship. If the source data is not retrieved successfully, the connector outputs the FlowFile that it received to the failed identifiers relationship.
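The two behaviors described above can be sketched as a small Python model. This is an illustration only, not the connector's implementation: the FlowFile class, the run_once function, and the fetch and synchronize callbacks are hypothetical names introduced here, and the sketch assumes for simplicity that a single run drains the whole input queue.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi FlowFile."""
    identifier: str                  # unique value assigned by the connector
    content: Optional[bytes] = None


def run_once(
    has_incoming_connection: bool,
    input_queue: List[FlowFile],
    fetch: Callable[[str], Optional[bytes]],
    synchronize: Callable[[], List[FlowFile]],
) -> Dict[str, List[FlowFile]]:
    """Model one scheduled run; return FlowFiles grouped by relationship."""
    out: Dict[str, List[FlowFile]] = {"success": [], "failed identifiers": []}

    if not has_incoming_connection:
        # No input: synchronize with the repository on every scheduled run.
        out["success"].extend(synchronize())
        return out

    # Incoming connection: there is work only if FlowFiles are queued.
    while input_queue:
        incoming = input_queue.pop(0)
        data = fetch(incoming.identifier)
        if data is not None:
            # Retrieved again: discard the incoming FlowFile, output a new one.
            out["success"].append(FlowFile(incoming.identifier, data))
        else:
            # Still failing: output the FlowFile that was received, unchanged.
            out["failed identifiers"].append(incoming)
    return out
```

In this model, a FlowFile whose identifier can no longer be fetched is the same object that arrived, which is what allows it to be routed back to the processor for another attempt.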
The following procedure demonstrates how to perform both types of synchronize task using a single processor.
To perform synchronize and synchronize from identifiers with a single processor
- Add and configure the "Get" processor for the relevant repository.
TIP: The same processor performs both synchronize and synchronize from identifiers. If you want to perform a synchronize task every 24 hours but process failed documents every 30 minutes, schedule the processor to run every 30 minutes. (Use the highest common factor of your two chosen schedules.)
- Route failed documents back to the "Get" processor.
At this point the processor performs only the "synchronize from identifiers" task, and synchronizes only the documents that are added to its input queue.
- Add a GenerateFlowFile processor to the dataflow, and configure it to create a FlowFile that will start a synchronize task:
- Add a processor by dragging the processor icon from the components toolbar to the canvas.
The Add Processor dialog box opens.
- In the Source list, click all groups.
- Search for and select the GenerateFlowFile processor and click ADD.
The processor is added to the canvas.
- Create a connection between the GenerateFlowFile processor and the "Get" processor. Hover the mouse over the GenerateFlowFile processor until you see the connection icon, and then drag the icon to the "Get" processor.
The Create Connection dialog box opens.
- In the For Relationships area, select the success check box and click ADD.
The connection appears on the canvas. In the following image the GenerateFlowFile processor has been named "StartSynchronize".
- Right-click the GenerateFlowFile processor and click Configure.
The Configure Processor dialog box opens.
- Click the SCHEDULING tab.
- Configure how often to synchronize with the repository. For example, in the Run Schedule box, type 24 hours.
- Click the Properties tab.
- Click Add.
The Add Property dialog box opens.
- Type idol.get.action and click OK.
Another box opens so that you can specify the value.
- Type synchronize and click OK.
- Click APPLY to close the Configure Processor dialog box.
- You can now start both the connector processor and the GenerateFlowFile processor.
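The scheduling tip in the first step (use the highest common factor of the two schedules) can be worked through in a few lines of Python. This is a minimal sketch: the function name processor_run_schedule_minutes is hypothetical, and it assumes that both schedules are expressed as whole minutes.

```python
from math import gcd


def processor_run_schedule_minutes(synchronize_minutes: int, retry_minutes: int) -> int:
    """Return the longest interval that divides both schedules evenly."""
    return gcd(synchronize_minutes, retry_minutes)


# Values from the tip: synchronize every 24 hours, retry failures every 30 minutes.
full_synchronize = 24 * 60   # 1440 minutes
retry_failed = 30            # minutes

run_every = processor_run_schedule_minutes(full_synchronize, retry_failed)
print(f"Schedule the Get processor to run every {run_every} minutes")
```

With the values from the tip, the result is 30 minutes, which matches the schedule recommended there. Under the same assumption, schedules that do not divide evenly, such as 24 hours and 50 minutes, would give a shorter interval (10 minutes).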