Enable Automatic Language Detection
If your IDOL license includes automatic language detection, the IDOL Content component can automatically identify the language and encoding of a document when it is indexed. Content analyzes a certain amount of text in the document content fields (fields for which SourceType is set to True in the IDOL Content component configuration file).
-
Open the IDOL Content component configuration file in a text editor.
-
Find the [Server] section and set AutoDetectLanguagesAtIndex to True:
AutoDetectLanguagesAtIndex=True
-
Set DiscardUnconfiguredLanguagesAtIndex to True if you do not want to index documents with a language type that is not configured.
Set DiscardUnknownLanguagesAtIndex to True if you do not want to index documents whose language Content cannot recognize. For example, Content might not recognize the language because the document contains no natural-language text, or because the document has too little text for Content to determine the language.
By default, Content indexes these documents using the default language type. It also logs a warning message in the index log, so that you can add an appropriate language type.
-
You can change the amount of text that Content analyzes to detect the language of a document. By default, Content uses only a few sentences. In some situations, increasing the amount of text to analyze can give more accurate results, such as when significant amounts of a minor second language are present.
Add the MaxLanguageDetectTerms setting to the [Server] section, specifying the number of terms (words) that Content uses for detection. For example:
MaxLanguageDetectTerms=1000
-
By default, Content detects text that contains only 7-bit ASCII characters as UTF-8. If you instead want to group these documents with documents using 8-bit ASCII, disable the LangDetectUTF8 parameter by setting it to False.
Ensure that the encoding option you want is present in the language type configuration (see Define Language Types). If there are no compatible encodings configured for the detected language, IDOL assigns the default language type.
-
Save and close the configuration file.
-
Restart the IDOL Content component for your changes to take effect.
NOTE: If you enable automatic language detection and set up a field process that reads the language of a document from one of its fields, Content uses the field process rather than autodetection to determine the document language and encoding.
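The settings from the procedure above can be collected into a single fragment. This is a minimal sketch rather than a complete configuration. The values are illustrative, and the 1000-term limit is the figure used in the example above. Placing DiscardUnconfiguredLanguagesAtIndex, DiscardUnknownLanguagesAtIndex, and LangDetectUTF8 in the [Server] section is an assumption, because the procedure names that section only for AutoDetectLanguagesAtIndex and MaxLanguageDetectTerms. The // comment lines are also an assumption about the comment syntax. Check the parameter reference for your IDOL version before you use the fragment.

```ini
[Server]
// Detect the language and encoding of each document at index time.
AutoDetectLanguagesAtIndex=True

// Assumed to belong in [Server]: do not index documents whose
// detected language has no configured language type.
DiscardUnconfiguredLanguagesAtIndex=True

// Assumed to belong in [Server]: do not index documents whose
// language Content cannot recognize.
DiscardUnknownLanguagesAtIndex=True

// Analyze up to 1000 terms (words) for detection (illustrative value).
MaxLanguageDetectTerms=1000

// Assumed to belong in [Server]: group 7-bit ASCII documents with
// 8-bit ASCII documents instead of detecting them as UTF-8.
LangDetectUTF8=False
```

With both Discard parameters set to True, documents in unconfigured or unrecognized languages are not indexed. Leave the parameters at their defaults if you prefer Content to index those documents with the default language type and log a warning.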