DeduplicateDocuments

A processor to identify FlowFiles that represent duplicate documents.

The processor identifies duplicates by matching hashes stored in a FlowFile (or calculated from a FlowFile) against previously-seen hashes that are stored in a database table. The processor routes the duplicates to a separate relationship so that they can be processed separately from other FlowFiles.

TIP: To calculate hashes for document content, metadata, and files, OpenText recommends that you use a HashDocument processor upstream of this processor.
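
As an illustration of why hashes work for duplicate detection, the following minimal Python sketch computes an MD5 digest with the standard hashlib module. It is not how the HashDocument processor is implemented; the helper name md5_hex and the sample byte strings are assumptions made for this example only.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the lowercase hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(content).hexdigest()


# Identical content always produces the same digest, whatever the file is called.
original = md5_hex(b"quarterly report")
copy = md5_hex(b"quarterly report")
# Any change to the content produces a different digest.
edited = md5_hex(b"quarterly report v2")

print(original == copy, original == edited)
```

Because the digest depends only on the bytes, two FlowFiles with different references but the same content carry the same hash, which is the signal this processor looks for.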

Properties

Name | Description
IDOL License Service | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.
Database Service | An optional DatabaseServiceImpl used by the processor to store the hash values that it has seen, allowing it to identify duplicate documents. If you do not specify a database service, the processor stores the data in files on disk.
Hash Table Name | The name of the database table in which to store hash data. The processor automatically creates this table.
Configuration JSON file | The path to a file in which to store the processor configuration in JSON format. To configure the processor, click ADVANCED and follow the on-screen instructions.

Relationships

Name | Description
duplicate | FlowFiles that are identified as duplicates.
success | FlowFiles that were processed but were not identified as duplicates.
failure | FlowFiles with an invalid or unknown format.

Advanced Configuration

The default configuration expects to find a hash value in the IDOL document metadata, in the field(s) part/hashes/MD5. This is the default location in which the HashDocument processor writes an MD5 hash for a file part. For example:

<xmlmetadata>
  <part id="1">
    <file.name>image.tif</file.name>
    <mimetype>application/octet-stream</mimetype>
    <hashes>
      <MD5>cfa17dde23fa9e1c5ac2332e9f93acc4</MD5>
    </hashes>
  </part>
  ...
</xmlmetadata>
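
To make the field path concrete, here is a minimal Python sketch that reads the value at part/hashes/MD5 from metadata shaped like the example above, using the standard xml.etree.ElementTree module. The function name part_md5_hashes and the SAMPLE constant are assumptions for illustration; the processor itself does not expose such a function.

```python
import xml.etree.ElementTree as ET

# A copy of the example metadata, without the elided content.
SAMPLE = """<xmlmetadata>
  <part id="1">
    <file.name>image.tif</file.name>
    <mimetype>application/octet-stream</mimetype>
    <hashes>
      <MD5>cfa17dde23fa9e1c5ac2332e9f93acc4</MD5>
    </hashes>
  </part>
</xmlmetadata>"""


def part_md5_hashes(xml_text: str) -> dict:
    """Map each part id to the MD5 value found at part/hashes/MD5."""
    root = ET.fromstring(xml_text)
    hashes = {}
    for part in root.findall("part"):
        md5 = part.findtext("hashes/MD5")
        if md5:  # skip parts that carry no MD5 hash
            hashes[part.get("id", "")] = md5.strip()
    return hashes


print(part_md5_hashes(SAMPLE))
```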

The default configuration performs the following actions:

Conditions: The processor sees a hash value that it has not seen before, and the FlowFile attribute idol.reference.action is add.
Action: Add a row to the database table containing:

  • the hash value
  • the IDOL document reference
  • the FlowFile source (the value of the idol.doc.source FlowFile attribute, which is populated automatically by IDOL connectors)
  • the IDOL document identifier
  • a timestamp

Explanation: Storing the information in the database is necessary to identify future duplicates.

Conditions: The processor sees a hash value that it has seen before, but the IDOL document reference also matches.
Action: The FlowFile is not considered a duplicate.
Explanation: Files can be synchronized multiple times, so a matching reference indicates that the FlowFile does not represent a duplicate.

Conditions: The processor sees a hash value that it has seen before, and the IDOL document reference does not match.
Action: The FlowFile is considered a duplicate and the processor runs the action "set match fields". This creates a field in the document metadata named DUPLICATE_MATCH, and populates it with the stored reference, source, and identifier values from the database. It also routes the FlowFile to the "duplicate" relationship.
Explanation: The "duplicate" FlowFile is tagged with the details of the "original" file and routed to the "duplicate" relationship so that it can be processed separately.
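
The three default rules can be sketched in Python as follows. This is a simplified model and not the processor's implementation: it uses an in-memory SQLite table, and the table name, column names, and the functions open_store and check are all hypothetical. The sketch also assumes that an unseen hash with a reference action other than add is routed to "success" without being stored, which the description above does not specify.

```python
import sqlite3
import time


def open_store() -> sqlite3.Connection:
    """Create an in-memory hash table; the column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hashes (hash TEXT PRIMARY KEY, reference TEXT, "
        "source TEXT, identifier TEXT, seen_at REAL)"
    )
    return conn


def check(conn, doc_hash, reference, source, identifier, action="add"):
    """Return (relationship, match details) following the three default rules."""
    row = conn.execute(
        "SELECT reference, source, identifier FROM hashes WHERE hash = ?",
        (doc_hash,),
    ).fetchone()
    if row is None:
        # Rule 1: unseen hash. Store it only when the reference action is add.
        if action == "add":
            conn.execute(
                "INSERT INTO hashes VALUES (?, ?, ?, ?, ?)",
                (doc_hash, reference, source, identifier, time.time()),
            )
        return "success", None
    if row[0] == reference:
        # Rule 2: same hash and same reference, so a re-synchronized file.
        return "success", None
    # Rule 3: same hash, different reference. Tag with the original's details.
    match = {"reference": row[0], "source": row[1], "identifier": row[2]}
    return "duplicate", match


conn = open_store()
md5 = "cfa17dde23fa9e1c5ac2332e9f93acc4"
first = check(conn, md5, "C:/docs/a.tif", "filesystem", "id-1")
again = check(conn, md5, "C:/docs/a.tif", "filesystem", "id-1")
copy = check(conn, md5, "C:/docs/b.tif", "filesystem", "id-2")
print(first[0], again[0], copy[0])
```

In this model the third call is the only one routed to "duplicate", and its match details point back to the first file, mirroring what the DUPLICATE_MATCH field is described as holding.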

You can customize the configuration by changing the settings in the advanced configuration UI.

"Reference Actions" refer to the value of the FlowFile attribute idol.reference.action - see Introduction to FlowFiles and Documents. If a section of the configuration does not specify a reference action, it applies to any reference action.

"Document Actions" are tasks that the processor can perform:

Document Action | Description | FlowFile Destination
add hash | Add a row to the database containing the details of the FlowFile being processed. This is how the database is populated, so you must use this action somewhere in your configuration. | success
delete hash | Delete a row from the database where the hash in the database matches the FlowFile being processed. | success
delete hash by [field] | Delete a row from the database where the IDOL document reference, source (the value of the idol.doc.source FlowFile attribute), or IDOL document identifier matches the FlowFile being processed. | success
update hash details | Update the stored values for reference, source, and identifier in the database. | success
not duplicate | Do nothing. A FlowFile is not routed to the "duplicate" relationship, unless routed there by another part of the configuration. | success
set match fields | Create a field in the IDOL document metadata named DUPLICATE_MATCH, and populate it with the stored reference, source, and identifier values from the database. This also routes the FlowFile to the "duplicate" relationship. | duplicate
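
To show how the document actions relate to the stored rows, the following Python sketch models each action as a function over an in-memory SQLite table. Everything here is an assumption made for illustration: the table layout, the function names, the dictionary used to stand in for a document, and the choice to key "update hash details" on the hash value. It does not reflect the processor's internal code.

```python
import sqlite3
import time

# Hypothetical table layout standing in for the processor's hash table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hashes (hash TEXT, reference TEXT, source TEXT, "
    "identifier TEXT, seen_at REAL)"
)


def add_hash(doc):
    """add hash: store the details of the document being processed."""
    conn.execute(
        "INSERT INTO hashes VALUES (?, ?, ?, ?, ?)",
        (doc["hash"], doc["reference"], doc["source"], doc["identifier"], time.time()),
    )
    return "success"


def delete_hash(doc):
    """delete hash: remove rows whose hash matches the document."""
    conn.execute("DELETE FROM hashes WHERE hash = ?", (doc["hash"],))
    return "success"


def delete_hash_by(doc, field):
    """delete hash by [field]: remove rows matching reference, source, or identifier."""
    if field not in ("reference", "source", "identifier"):
        raise ValueError(f"unsupported field: {field}")
    conn.execute(f"DELETE FROM hashes WHERE {field} = ?", (doc[field],))
    return "success"


def update_hash_details(doc):
    """update hash details: overwrite the stored reference, source, and identifier."""
    conn.execute(
        "UPDATE hashes SET reference = ?, source = ?, identifier = ? WHERE hash = ?",
        (doc["reference"], doc["source"], doc["identifier"], doc["hash"]),
    )
    return "success"


def set_match_fields(doc):
    """set match fields: copy the stored details into DUPLICATE_MATCH."""
    row = conn.execute(
        "SELECT reference, source, identifier FROM hashes WHERE hash = ?",
        (doc["hash"],),
    ).fetchone()
    if row is None:
        raise LookupError("no stored row for this hash")
    doc["DUPLICATE_MATCH"] = dict(zip(("reference", "source", "identifier"), row))
    return "duplicate"


original = {"hash": "cfa17dde23fa9e1c5ac2332e9f93acc4", "reference": "C:/docs/a.tif",
            "source": "filesystem", "identifier": "id-1"}
duplicate = dict(original, reference="C:/docs/b.tif", identifier="id-2")

add_hash(original)
destination = set_match_fields(duplicate)
print(destination, duplicate["DUPLICATE_MATCH"]["reference"])
```

Each function returns the FlowFile destination listed for its action, so only set_match_fields returns "duplicate".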