Prevent Duplicate Documents
You can configure the IDOL Content component to implement deduplication when indexing documents. This process prevents storage of the same document or document content. If Content determines that the document to index matches an existing document, it replaces the existing document with the new document.
The IDOL Content component uses deduplication options to determine whether documents match. See Deduplication Options—KillDuplicates.
You can enable deduplication in one of three ways:
-
Enable deduplication for all indexing jobs by using the KillDuplicates configuration parameter in the [Server] section of the IDOL Content component configuration file. See Enable Deduplication for all Index Jobs.
You can use the KillDuplicatesChecksumField configuration parameter with deduplication to prevent the IDOL Content component from unnecessarily updating existing documents. See Use KillDuplicatesChecksumField to Prevent Unnecessary Indexing.
You can also use the KillDuplicatesPreserveFields configuration parameter with deduplication to copy the specified IDX fields from an existing document to a newer version.
-
Enable deduplication for individual indexing jobs by using the KillDuplicates action parameter in the DREADD and DREADDDATA actions. See Enable Deduplication for Individual Index Jobs.
Use the KeepExisting action parameter with deduplication to discard the incoming document instead of replacing the existing document. This option reduces the indexing load. See Use KeepExisting to Minimize the Index Load.
-
Enable deduplication when indexing with Connector Framework Server (CFS) by setting the KillDuplicates configuration parameter for the connector. See Enable Deduplication for Connector Index Jobs.
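As an illustration of the second option, the following minimal Python sketch builds a DREADD index action URL that carries the KillDuplicates action parameter, and optionally KeepExisting. It only constructs the URL and does not send it. The host name, index port, and IDX file path are placeholders, the helper name build_dreadd_url is hypothetical, and the values REFERENCE and true are assumptions used for illustration. See Deduplication Options—KillDuplicates for the values that your version supports.

```python
from urllib.parse import quote, urlencode


def build_dreadd_url(host, index_port, idx_path,
                     kill_duplicates="REFERENCE", keep_existing=False):
    """Build a DREADD index action URL with deduplication parameters.

    Hypothetical helper for illustration only. The default value
    "REFERENCE" and the KeepExisting value "true" are assumptions;
    check the Content reference for the values your version accepts.
    """
    params = {"KillDuplicates": kill_duplicates}
    if keep_existing:
        # Discard the incoming document instead of replacing the existing one.
        params["KeepExisting"] = "true"
    # The file to index follows the "?" directly; the action
    # parameters are appended after it, separated by "&".
    return (f"http://{host}:{index_port}/DREADD?"
            f"{quote(idx_path, safe='/')}&{urlencode(params)}")


if __name__ == "__main__":
    # Placeholder host, port, and file path.
    print(build_dreadd_url("localhost", 9101, "/data/idx/news.idx",
                           keep_existing=True))
```

Because the action parameter applies only to the request that carries it, a helper like this lets one indexing job use deduplication while other jobs do not.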
Some other IDOL Content component parameters affect the behavior of the deduplication settings. See Deduplication Constraints.
You can deduplicate after indexing by using the DREDUPLICATE index action. See Locate Duplicate Documents.