Deduplication Constraints
There are some constraints on deduplication when using other IDOL parameters.
Use the Combine Operation
The IDOL Content component cannot use the same ReferenceType field for deduplication as it uses for the Combine action parameter. The Combine operation occurs at query time and clashes with deduplication. If you intend to deduplicate when indexing and also use the Combine action parameter, you must set up separate ReferenceType fields for these processes.
Use Deduplication with DIH Reference-Based Indexing
You can enable the DIH for reference-based indexing. For more information, refer to the DIH Administration Guide.
If you index documents into IDOL with the DIH enabled for reference-based indexing, it might prevent deduplication of documents with different references. In this case, use only one of the following deduplication options:
- KillDuplicates=REFERENCE
- KillDuplicates=NONE
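The following is a minimal sketch of why reference-based distribution limits the available options. It does not reproduce the DIH's actual distribution algorithm; the child_for function, the child count, and the sample documents are all hypothetical. It only illustrates that when placement depends on the reference alone, documents with identical content but different references can end up on different child servers, where no single server sees both copies.

```python
import hashlib


def child_for(key: str, n_children: int) -> int:
    """Map a routing key to a child index (illustrative, not the DIH algorithm)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_children


N_CHILDREN = 4  # hypothetical number of child servers

# 64 documents with identical content but different references.
docs = [{"reference": f"doc-{i}", "content": "same text"} for i in range(64)]

# Reference-based routing: placement depends only on the reference,
# so the identical content plays no part in choosing the child server.
placement = {d["reference"]: child_for(d["reference"], N_CHILDREN) for d in docs}
children_used = set(placement.values())

print(f"identical content spread over {len(children_used)} child server(s)")
```

In this sketch a document with the same reference always maps to the same child, which is why deduplicating on the reference remains reliable under this kind of distribution.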
Use Deduplication with DIH Field-Based Indexing
You can use field-based indexing in the DIH to ensure correct deduplication in a distributed system. For more information on configuring the DIH for field-based indexing, refer to the DIH Administration Guide.
If you set KeepExisting to False, or use KillDuplicatesDB options, it might prevent correct deduplication. To deduplicate correctly, you can distribute data by the DeDupeHash field (MD5 hash) of the documents. In this way, the DIH sends all duplicates to the same child server. Setting KillDuplicates to DeDupeHash during the indexing action then ensures accurate deduplication.
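The idea of distributing by a content hash can be sketched as follows. This is a toy model under stated assumptions, not the DIH or Content implementation: the dedupe_hash and index functions, the child count, and the sample documents are hypothetical, and the hash is simply an MD5 of the content string. Because the routing key is derived from the content, duplicates always reach the same child, which can then remove them locally.

```python
import hashlib
from collections import defaultdict


def dedupe_hash(content: str) -> str:
    """Hypothetical stand-in for a DeDupeHash value: MD5 of the content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def index(docs, n_children):
    """Route each document by its content hash and deduplicate per child."""
    children = defaultdict(dict)  # child index -> {content hash: reference}
    for reference, content in docs:
        h = dedupe_hash(content)
        child = int(h, 16) % n_children  # same content always gives the same child
        children[child][h] = reference   # a later duplicate replaces the earlier one
    return children


DOCS = [("ref-1", "alpha"), ("ref-2", "alpha"), ("ref-3", "beta")]

children = index(DOCS, n_children=4)
kept = sorted(ref for store in children.values() for ref in store.values())
print(kept)  # ['ref-2', 'ref-3']
```

Here ref-1 and ref-2 share the same content, so they hash to the same child and only one of them survives, regardless of how many child servers there are.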
To use a field for deduplication, you must configure it as a ReferenceType field. You do not need to configure it as ReferenceType in the DIH configuration file.
Deduplication of content occurs for all reference fields specified in a single PropertyFieldCSVs list in the IDOL Content component configuration file. To use only the DeDupeHash field to deduplicate, and not also the DREREFERENCE field, you must set these reference fields in separate field processing sections in the IDOL Content component configuration file.