To enable deduplication for all indexing jobs (in other words, to set deduplication by default for the DREADD and DREADDATA actions), use the KillDuplicates configuration parameter in the [Server] section of the configuration file. You must enable deduplication before you start indexing documents into HPE IDOL Server.
You can use the KillDuplicatesChecksumField parameter to configure HPE IDOL Server to reverse normal deduplication and retain the existing document instead of the incoming document, based on the value of a specified field in the incoming document.
You can use the KillDuplicatesPreserveFields parameter to configure one or more IDX fields that HPE IDOL Server copies to a newer version of a duplicate document.
Open the HPE IDOL Server configuration file in a text editor.
In the [Server] section, set the KillDuplicates parameter to REFERENCE, REFERENCEMATCHN, the names of the ReferenceType fields to use to determine which documents are duplicates, or a combination of a ReferenceType field and a field that contains a document version number. For more information about these options, see Deduplication Options—KillDuplicates, or refer to the IDOL Server Reference.
You can identify fields that contain document references by setting up an appropriate field process. When you index a document that has the same value in the same ReferenceType field as an existing document in HPE IDOL Server, HPE IDOL Server detects the duplicate. It deletes the existing document and replaces it with the new one.
Save and close the configuration file. Restart HPE IDOL Server for your changes to take effect. You can now index documents into HPE IDOL Server.
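As a minimal sketch of what this procedure produces, the [Server] section might contain a line like the one below. REFERENCE is just one of the values listed above, chosen for illustration. Any other settings already in your [Server] section are omitted, and the // comment style is assumed to match the rest of your configuration file.

```ini
[Server]
// Assumption: REFERENCE is used for illustration only; choose the option that suits your data.
KillDuplicates=REFERENCE
```

Because the setting takes effect only after a restart, and only for documents indexed afterward, make this change before you run the first indexing job.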
You identify fields as ReferenceType fields through field processes. If the PropertyFieldCSVs parameter that lists the field you use for deduplication also lists other fields, HPE IDOL Server uses all of those fields to eliminate duplicate documents. For example:
[SetReferenceFields]
Property=Reference
PropertyFieldCSVs=*/DREREFERENCE,*/URL
In this example, HPE IDOL Server uses both the DREREFERENCE field and the URL field to eliminate duplicate copies if you set KillDuplicates to DREREFERENCE.
If you want to define multiple ReferenceType fields but do not want to use them all for duplicate elimination, set up multiple field processes. For example:
[SetReferenceFields]
Property=Reference
PropertyFieldCSVs=*/DREREFERENCE

[SetMoreReferenceFields]
Property=Reference
PropertyFieldCSVs=*/URL
In this example, HPE IDOL Server uses only the DREREFERENCE field to eliminate duplicate copies if you set KillDuplicates to DREREFERENCE. It does not use the URL field.
By default, when HPE IDOL Server detects that a new document is a duplicate of an existing one, it replaces the existing document with the new one.
For either of these two KillDuplicates options, you can also use the KillDuplicatesChecksumField configuration parameter to specify a checksum field. HPE IDOL Server then checks the value of this field in both documents. If the value is the same, HPE IDOL Server keeps the existing document rather than replacing it with the new document.
This process prevents unnecessary updates. For example, when refetching a Web site, use KillDuplicatesChecksumField to configure HPE IDOL Server to update the index for this site only if the site has changed.
The field that you specify in KillDuplicatesChecksumField must be a ReferenceType field.
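The following fragment is a hedged sketch of how these settings might fit together. The field name CHECKSUM and the process name SetChecksumField are hypothetical. Placing KillDuplicatesChecksumField in the [Server] section next to KillDuplicates, and writing its value as a bare field name, are assumptions that you should check against the IDOL Server Reference.

```ini
[Server]
KillDuplicates=DREREFERENCE
// Assumption: section placement and value format are not confirmed by this guide.
// CHECKSUM is a hypothetical field name.
KillDuplicatesChecksumField=CHECKSUM

[SetReferenceFields]
Property=Reference
PropertyFieldCSVs=*/DREREFERENCE

// Hypothetical separate field process: it makes CHECKSUM a ReferenceType field
// without listing it alongside DREREFERENCE.
[SetChecksumField]
Property=Reference
PropertyFieldCSVs=*/CHECKSUM
```

Defining the checksum field in its own field process follows the earlier guidance on multiple field processes. This keeps the field from being used as a deduplication key while still satisfying the ReferenceType requirement.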
If there is a field that you want to keep in all versions of a document, regardless of whether it is later deleted or changed, you can use the KillDuplicatesPreserveFields configuration parameter.
To preserve fields, set KillDuplicatesPreserveFields to a comma-separated list of the fields that you want to keep.
When HPE IDOL Server receives a duplicate document, it copies these fields from the existing version of the document to the newer version when it performs KillDuplicates.
If there is more than one copy of the document in the HPE IDOL Server index when a new version arrives, HPE IDOL Server copies the preserved fields from the existing duplicate with the highest document ID.
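As an illustrative sketch only, a configuration that preserves two fields across versions might look like the fragment below. The field names RATING and USER_TAGS are hypothetical. The [Server] placement and the exact way field names are written in the list are assumptions to verify in the IDOL Server Reference.

```ini
[Server]
KillDuplicates=REFERENCE
// Assumption: RATING and USER_TAGS are hypothetical fields added to documents after indexing.
// The comma-separated form follows the description above; the exact field-name syntax is not confirmed here.
KillDuplicatesPreserveFields=RATING,USER_TAGS
```

With settings like these, a re-indexed document would be expected to carry over the RATING and USER_TAGS values from the existing duplicate with the highest document ID, even if the incoming version lacks those fields.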