DeduplicateDocuments
A processor to identify FlowFiles that represent duplicate documents.
The processor identifies duplicates by matching hashes stored in a FlowFile (or calculated from a FlowFile) against previously-seen hashes that are stored in a database table. The processor routes the duplicates to a separate relationship so that they can be processed separately from other FlowFiles.
TIP: To calculate hashes for document content, metadata, and files, OpenText recommends that you use a HashDocument processor upstream of this processor.
Properties
Name | Description |
---|---|
IDOL License Service | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server. |
Database Service | An optional DatabaseServiceImpl used by the processor to store the hash values that it has seen, allowing it to identify duplicate documents. If you do not specify a database service, the processor stores the data in files on disk. |
Hash Table Name | The name of the database table in which to store hash data. The processor automatically creates this table. |
Configuration JSON file | The path to a file in which to store the processor configuration in JSON format. To configure the processor, click ADVANCED and follow the on-screen instructions. |
Relationships
Name | Description |
---|---|
duplicate | FlowFiles that are identified as duplicates. |
success | FlowFiles that were processed but were not identified as duplicates. |
failure | FlowFiles with an invalid or unknown format. |
Advanced Configuration
The default configuration expects to find a hash value in the IDOL document metadata, in the field(s) part/hashes/MD5. This is the default location in which the HashDocument processor writes an MD5 hash for a file part. For example:
    <xmlmetadata>
      <part id="1">
        <file.name>image.tif</file.name>
        <mimetype>application/octet-stream</mimetype>
        <hashes>
          <MD5>cfa17dde23fa9e1c5ac2332e9f93acc4</MD5>
        </hashes>
      </part>
      ...
    </xmlmetadata>
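
To see how a hash at part/hashes/MD5 can be located in metadata of this shape, here is a minimal Python sketch. It is an illustration only and not the processor's implementation: the function name `read_md5_hashes` and the constant `SAMPLE_METADATA` are hypothetical, and the sample omits the elided content of the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the metadata layout shown above.
SAMPLE_METADATA = """
<xmlmetadata>
  <part id="1">
    <file.name>image.tif</file.name>
    <mimetype>application/octet-stream</mimetype>
    <hashes>
      <MD5>cfa17dde23fa9e1c5ac2332e9f93acc4</MD5>
    </hashes>
  </part>
</xmlmetadata>
"""


def read_md5_hashes(xml_text):
    """Return a list of (part id, MD5 hash) pairs found at part/hashes/MD5."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for part in root.findall("part"):
        for md5 in part.findall("hashes/MD5"):
            pairs.append((part.get("id"), (md5.text or "").strip()))
    return pairs


if __name__ == "__main__":
    for part_id, md5 in read_md5_hashes(SAMPLE_METADATA):
        print(part_id, md5)
```

A document with several parts would yield one pair per part that has an MD5 field, which is why the source refers to "field(s)".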
The default configuration performs the following actions:
Conditions | Action | Explanation |
---|---|---|
The processor sees a hash value that it has not seen before, and the FlowFile attribute idol.reference.action is add. | Add a row to the database table containing the hash and the details of the FlowFile being processed (reference, source, and identifier). | Storing the information in the database is necessary to identify future duplicates. |
The processor sees a hash value that it has seen before, and the IDOL document reference also matches. | The FlowFile is not considered a duplicate. | Files can be synchronized multiple times, so a matching reference indicates that the FlowFile does not represent a duplicate. |
The processor sees a hash value that it has seen before, and the IDOL document reference does not match. | The FlowFile is considered a duplicate and the processor runs the action "set match fields". This creates a field in the document metadata named DUPLICATE_MATCH, and populates it with the stored reference, source, and identifier values from the database. It also routes the FlowFile to the "duplicate" relationship. | The "duplicate" FlowFile is tagged with the details of the "original" file and routed to the "duplicate" relationship so that it can be processed separately. |
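
The three rows above amount to a small decision procedure. The following Python sketch illustrates it with an in-memory dictionary standing in for the database table. All names (`StoredDetails`, `apply_default_rules`) are hypothetical, and the handling of an unseen hash whose reference action is not add is an assumption, because the table above does not describe that case.

```python
from dataclasses import dataclass


@dataclass
class StoredDetails:
    """Values kept for each hash (hypothetical structure)."""
    reference: str
    source: str
    identifier: str


def apply_default_rules(store, md5, reference, source, identifier, reference_action):
    """Return (relationship, duplicate_match) for one FlowFile.

    store maps a hash to the StoredDetails of the first FlowFile seen with it.
    """
    stored = store.get(md5)
    if stored is None:
        # Unseen hash: record it only when the reference action is "add".
        # Assumption: other reference actions pass through without storing.
        if reference_action == "add":
            store[md5] = StoredDetails(reference, source, identifier)
        return "success", None
    if stored.reference == reference:
        # Same document synchronized again: not a duplicate.
        return "success", None
    # Known hash under a different reference: duplicate. The stored details
    # play the role of the DUPLICATE_MATCH field.
    return "duplicate", stored


if __name__ == "__main__":
    store = {}
    print(apply_default_rules(store, "cfa17d", "ref-a", "src", "id-1", "add"))
    print(apply_default_rules(store, "cfa17d", "ref-b", "src", "id-2", "add"))
```

In this sketch the second call returns the "duplicate" relationship together with the details stored for ref-a, mirroring how the "original" file's details are attached to the duplicate.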
You can customize the configuration by changing the settings in the advanced configuration UI.
"Reference Actions" refer to the value of the FlowFile attribute idol.reference.action (see Introduction to FlowFiles and Documents). If a section of the configuration does not specify a reference action, it applies to any reference action.
"Document Actions" are tasks that the processor can perform:
Document Action | Description | FlowFile Destination |
---|---|---|
add hash | Add a row to the database containing the details of the FlowFile being processed. This is how the database is populated, so you must use this action somewhere in your configuration. | success |
delete hash | Delete a row from the database where the hash in the database matches the FlowFile being processed. | success |
delete hash by [field] | Delete a row from the database where the IDOL document reference, source (the value of the idol.doc.source FlowFile attribute), or IDOL document identifier matches the FlowFile being processed. | success |
update hash details | Update the stored values for reference, source, and identifier in the database. | success |
not duplicate | Do nothing. The FlowFile is not routed to the "duplicate" relationship unless another part of the configuration routes it there. | success |
set match fields | Create a field in the IDOL document metadata named DUPLICATE_MATCH, and populate it with the stored reference, source, and identifier values from the database. This also routes the FlowFile to the "duplicate" relationship. | duplicate |
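
As a rough illustration of what the database-facing document actions do, the following Python sketch models them against an in-memory SQLite table. This is a sketch under stated assumptions: the table name `hashes`, its column names, and every function name are hypothetical, and the real table that the processor creates may be structured differently. It also assumes that "update hash details" locates the row by hash, which the table above does not state.

```python
import sqlite3

# Hypothetical columns; the real schema is not documented here.
ALLOWED_FIELDS = {"reference", "source", "identifier"}


def open_store():
    """Create an in-memory table that stands in for the hash table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hashes (hash TEXT, reference TEXT, source TEXT, identifier TEXT)"
    )
    return conn


def add_hash(conn, md5, reference, source, identifier):
    """'add hash': store the hash with the details of the FlowFile."""
    conn.execute(
        "INSERT INTO hashes VALUES (?, ?, ?, ?)", (md5, reference, source, identifier)
    )


def delete_hash(conn, md5):
    """'delete hash': remove rows whose hash matches the FlowFile."""
    conn.execute("DELETE FROM hashes WHERE hash = ?", (md5,))


def delete_hash_by(conn, field, value):
    """'delete hash by [field]': remove rows matching reference, source, or identifier."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unsupported field: {field}")
    # The column name comes from a fixed allow-list, so interpolating it is safe.
    conn.execute(f"DELETE FROM hashes WHERE {field} = ?", (value,))


def update_hash_details(conn, md5, reference, source, identifier):
    """'update hash details': overwrite the stored values (assumed keyed by hash)."""
    conn.execute(
        "UPDATE hashes SET reference = ?, source = ?, identifier = ? WHERE hash = ?",
        (reference, source, identifier, md5),
    )


def lookup(conn, md5):
    """Return (reference, source, identifier) for a hash, or None if absent."""
    return conn.execute(
        "SELECT reference, source, identifier FROM hashes WHERE hash = ?", (md5,)
    ).fetchone()
```

The "not duplicate" and "set match fields" actions do not modify the table, so they are omitted here. In this model, "set match fields" would read a row with `lookup` and copy its values into DUPLICATE_MATCH.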