Language Detection
CFS can identify the language of a document, and write the name of the language to a document field. A front-end application could use this field to provide a way to filter documents by language. You can also use language detection to reject invalid documents (when a language cannot be detected).
Language detection can be configured as a post-import task. Set the Post parameter to LangDetect and specify the name of a configuration file section that contains the task settings. For example:
[ImportTasks]
Post0=LangDetect:LangDetectSettings

[LangDetectSettings]
LanguageDetectionDirectory=./filters/datafiles/
OutputField=DetectedLanguage
FailIfLanguageUnknown=TRUE
You must set the parameter LanguageDetectionDirectory to the path of the folder that contains the file langdetect.dat. The remaining parameters are optional. The parameter OutputField specifies the name of the document field in which CFS writes the name of the detected language. By default, CFS rejects documents for which it cannot detect a language; you can change this behavior with FailIfLanguageUnknown. To continue processing documents when a language cannot be detected, set FailIfLanguageUnknown=FALSE.
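As a sketch of the non-rejecting setup described above, the following configuration keeps documents in the import pipeline when no language can be detected. It uses only the parameters named in this section; the section name LangDetectSettings and the field name DetectedLanguage are illustrative choices carried over from the earlier example, not required values. This section does not say what CFS writes to the output field for such documents, so a front-end filter should not assume the field is always populated.

```
[ImportTasks]
Post0=LangDetect:LangDetectSettings

[LangDetectSettings]
LanguageDetectionDirectory=./filters/datafiles/
OutputField=DetectedLanguage
FailIfLanguageUnknown=FALSE
```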