Customize Field Standardization
Field standardization modifies documents so that they have a consistent structure and consistent field names. You can use field standardization so that documents indexed into IDOL through different connectors use the same fields to store the same type of information. Field standardization only modifies fields that are specified in a dictionary, which is defined in XML format. A standard dictionary, named dictionary.xml, is supplied in the CFS installation folder.
In most cases you should not need to modify the standard dictionary, but you can modify it to suit your requirements or create dictionaries for different purposes. By modifying the dictionary, you can configure CFS to apply rules that modify documents before they are ingested. For example, you can move fields, delete fields, or change the format of field values.
The following examples demonstrate how to perform some operations with field standardization.
The following rule renames the field Author to DOCUMENT_METADATA_AUTHOR_STRING. This rule applies to all components that run field standardization and applies to all documents.
<FieldStandardization>
  <Field name="Author">
    <Move name="DOCUMENT_METADATA_AUTHOR_STRING"/>
  </Field>
</FieldStandardization>
The following rule demonstrates how to use the Delete operation. This rule instructs CFS to remove the field KeyviewVersion from all documents. The Product element ensures that this rule is run only by CFS.
<FieldStandardization>
  <Product key="ConnectorFrameWork">
    <Field name="KeyviewVersion">
      <Delete/>
    </Field>
  </Product>
</FieldStandardization>
There are several ways to select fields to process using the Field element.
Field element attribute | Description | Example |
---|---|---|
name | Select a field where the field name matches a fixed value. | Select the field MyField: <Field name="MyField"> ... </Field> Select the field Subfield, which is a subfield of MyField: <Field name="MyField"> <Field name="Subfield"> ... </Field> </Field> |
path | Select a field where its path matches a fixed value. | Select the field Subfield, which is a subfield of MyField: <Field path="MyField/Subfield"> ... </Field> |
nameRegex | Select all fields at the current depth where the field name matches a regular expression. | In this case the field name must begin with the word File: <Field nameRegex="File.*"> ... </Field> |
pathRegex | Select all fields where the path of the field matches a regular expression. This operation can be inefficient because every metadata field must be checked. If possible, select the fields to process another way. | This example selects all subfields of MyField: <Field pathRegex="MyField/[^/]*"> ... </Field> This approach would be more efficient: <Field name="MyField"> <Field nameRegex=".*"> ... </Field> </Field> |
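As a sketch of how these selectors can be combined, the following rule nests a nameRegex selection inside a name selection to delete every subfield of MyField whose name begins with File. The names MyField and File are placeholders, and the rule assumes that the Delete operation applies to each field matched by the inner Field element.

```xml
<FieldStandardization>
  <!-- Select the top-level field by its fixed name (placeholder name) -->
  <Field name="MyField">
    <!-- Select only the subfields at this depth whose names begin with "File" -->
    <Field nameRegex="File.*">
      <Delete/>
    </Field>
  </Field>
</FieldStandardization>
```

Nesting a name selection and a nameRegex selection in this way avoids the pathRegex attribute, so that not every metadata field needs to be checked.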
You can also limit the fields that are processed based on their value, by using one of the following:
Field element attribute | Description | Example |
---|---|---|
matches | Process a field if its value matches a fixed value. | Process a field named MyField if its value matches abc: <Field name="MyField" matches="abc"> ... </Field> |
matchesRegex | Process a field if its entire value matches a regular expression. | Process a field named MyField if its value consists of one or more digits: <Field name="MyField" matchesRegex="\d+"> ... </Field> |
containsRegex | Process a field if its value contains a match to a regular expression. | Process a field named MyField if its value contains three consecutive digits: <Field name="MyField" containsRegex="\d{3}"> ... </Field> |
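A value restriction can be combined with an operation in the same way as a name selection. The following minimal sketch renames MyField only when its entire value is numeric. The field names MyField and MY_NUMERIC_FIELD are hypothetical and are not part of the standard dictionary.

```xml
<FieldStandardization>
  <!-- Process MyField only if its whole value consists of digits -->
  <Field name="MyField" matchesRegex="\d+">
    <!-- Rename the matching field (hypothetical target name) -->
    <Move name="MY_NUMERIC_FIELD"/>
  </Field>
</FieldStandardization>
```

Because matchesRegex must match the entire value, a value such as 123abc would not be renamed by this rule. Using containsRegex with the same expression would select it.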
The following rule deletes every field or subfield where the name of the field or subfield begins with temp.
<FieldStandardization>
  <Field pathRegex="(.*/)?temp[^/]*">
    <Delete/>
  </Field>
</FieldStandardization>
The following rule instructs CFS to rename the field Author to DOCUMENT_METADATA_AUTHOR_STRING, but only when the document contains a field named DocumentType with the value 230 (the KeyView format code for a PDF file).
<FieldStandardization>
  <Product key="ConnectorFrameWork">
    <IfField name="DocumentType" matches="230"> <!-- PDF -->
      <Field name="Author">
        <Move name="DOCUMENT_METADATA_AUTHOR_STRING"/>
      </Field>
    </IfField>
  </Product>
</FieldStandardization>
TIP: In this example, the IfField element is used to check the value of the DocumentType field. The IfField element does not change the current position in the document. If you used the Field element, field standardization would attempt to find an Author field that is a subfield of DocumentType, rather than the Author field at the root of the document.
The following rules demonstrate how to use the ValueFormat operation to change the format of dates. The only format that you can convert date values into is the IDOL AUTNDATE format. The first rule transforms the value of a field named CreatedDate. The second rule transforms the value of an attribute named Created, on a field named Date.
<FieldStandardization>
  <Field name="CreatedDate">
    <ValueFormat type="autndate" format="YYYY-SHORTMONTH-DD HH:NN:SS"/>
  </Field>
  <Field name="Date">
    <Attribute name="Created">
      <ValueFormat type="autndate" format="YYYY-SHORTMONTH-DD HH:NN:SS"/>
    </Attribute>
  </Field>
</FieldStandardization>
As demonstrated by this example, you can select field attributes to process in a similar way to selecting fields.
You must select attributes using either a fixed name or a regular expression:
Select a field attribute by name | <Attribute name="MyAttribute"> |
Select attributes that match a regular expression | <Attribute nameRegex=".*"> |
You can then add a restriction to limit the attributes that are processed:
Process an attribute only if its value matches a fixed value | <Attribute name="MyAttribute" matches="abc"> |
Process an attribute only if its value matches a regular expression | <Attribute name="MyAttribute" matchesRegex=".*"> |
Process an attribute only if its value contains a match to a regular expression | <Attribute name="MyAttribute" containsRegex="\w+"> |
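As an illustrative sketch, the following rule removes the attribute MyAttribute from the field MyField, but only when the attribute value is abc. The names and the value are placeholders, and the rule assumes that a Delete operation placed inside an Attribute element removes the attribute rather than the field that carries it.

```xml
<FieldStandardization>
  <Field name="MyField">
    <!-- Select the attribute by name, restricted to a fixed value -->
    <Attribute name="MyAttribute" matches="abc">
      <!-- Assumed to delete the selected attribute, not the parent field -->
      <Delete/>
    </Attribute>
  </Field>
</FieldStandardization>
```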
The following rule moves all of the attributes of a field to subfields, if the parent field has no value. The id attribute on the first Field element gives the matching field a name so that later operations can refer to it. The GetName and GetValue operations save the name and value of a selected field or attribute (in this case an attribute) into variables (in this case $'name' and $'value'), which can be used by later operations. The AddField operation uses the variables to add a new field at the selected location (the field identified by id="parent").
<FieldStandardization>
  <Field pathRegex=".*" matches="" id="parent">
    <Attribute nameRegex=".*">
      <GetName var="name"/>
      <GetValue var="value"/>
      <Field fieldId="parent">
        <AddField name="$'name'" value="$'value'"/>
      </Field>
      <Delete/>
    </Attribute>
  </Field>
</FieldStandardization>
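A narrower variant of the same pattern can be sketched as follows. It copies a single attribute into a subfield and leaves the attribute in place. MyField and MyAttribute are placeholder names, and the sketch assumes that AddField accepts a fixed name together with a variable value.

```xml
<FieldStandardization>
  <!-- Name the selected field so that it can be referred to later -->
  <Field name="MyField" id="parent">
    <Attribute name="MyAttribute">
      <!-- Save the attribute value into the variable $'value' -->
      <GetValue var="value"/>
      <!-- Return to the named field and add the new subfield there -->
      <Field fieldId="parent">
        <AddField name="MyAttribute" value="$'value'"/>
      </Field>
    </Attribute>
  </Field>
</FieldStandardization>
```

Omitting the Delete operation is what keeps the original attribute on the field.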
The following rule demonstrates how to move all of the subfields of UnwantedParentField to the root of the document, and then delete the field UnwantedParentField.
<FieldStandardization id="root">
  <Product key="ConnectorFrameWork">
    <Field name="UnwantedParentField">
      <Field nameRegex=".*">
        <Move destId="root"/>
      </Field>
      <Delete/>
    </Field>
  </Product>
</FieldStandardization>
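To move only some of the subfields, the inner selector can be narrowed. The following sketch relies on the same id and destId mechanism and moves only the subfields whose names begin with Keep, which is a placeholder prefix. It assumes that deleting the parent field afterwards also removes any subfields that were not moved.

```xml
<FieldStandardization id="root">
  <Product key="ConnectorFrameWork">
    <Field name="UnwantedParentField">
      <!-- Move only the subfields whose names begin with "Keep" -->
      <Field nameRegex="Keep.*">
        <Move destId="root"/>
      </Field>
      <!-- Delete the parent field once the wanted subfields have moved -->
      <Delete/>
    </Field>
  </Product>
</FieldStandardization>
```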