looks_like_language
The looks_like_language function analyzes document content to determine whether it contains text in the specified language. You can use this function to check whether Optical Character Recognition (OCR) has successfully extracted meaningful text from a file.
Syntax
looks_like_language( doc [, params] )
Arguments
Argument | Description |
---|---|
doc | (LuaDocument) The document that you want to check. |
params | (table) Additional named parameters that configure the analysis performed. The table maps parameter names (String) to parameter values. For information about the parameters that you can set, see the following table. For information about how to use named parameters, refer to the Connector Framework Server Administration Guide. |
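As a minimal sketch, a CFS Lua script might call the function with only the document argument, so that every named parameter takes its default value. The `handler` entry point and the `addField` method are assumed to be provided by the CFS Lua environment, and the field name `LANGUAGE_CHECK` is hypothetical.

```lua
-- Sketch only: assumes the script runs inside CFS, which supplies
-- looks_like_language and the LuaDocument object passed to handler.
function handler(document)
    -- No params table: all named parameters take their default values.
    local matched = looks_like_language(document)

    if not matched then
        -- LANGUAGE_CHECK is a hypothetical field name used for illustration.
        document:addField("LANGUAGE_CHECK", "FAILED")
    end

    -- Assumption: returning true lets the document continue to be processed.
    return true
end
```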
Named Parameters
Named Parameter | Description | Configuration Parameter |
---|---|---|
section | (string) The name of a section in the CFS configuration file. If you set this then any parameters not set in the parameters table are read from this section of the configuration file. | |
threshold | (integer) The maximum quality score that a document can have and match the specified language. The quality score is an integer in the range 0-200, where lower numbers indicate higher quality. The default is 75. | Threshold |
term_file | (string) The filename of the language termlist. | TermFile |
stop_list | (string) The filename of the language stoplist. | StopList |
language | (string) The name of the language against which the document is checked. The default is “ENGLISH”. | Language |
encoding | (string) The expected type of encoding used to encode the document. The default is “UTF-8”. | Encoding |
minimum_valid_terms | (integer) The minimum number of valid terms required for a document to contain content in the specified language. If you do not specify a value, this check is ignored. | MinimumValidTerms |
minimum_percentage_terms_in_language | (integer) The minimum percentage of valid terms required for a document to contain content in the specified language. If you do not specify a value, this check is ignored. | PercentageLanguageTerms |
maximum_percentage_terms_not_in_language | (integer) The maximum percentage of invalid terms. If the percentage of invalid terms exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. | PercentageNonLanguageTerms |
punctuation | (string) Characters that are considered to be punctuation characters. Specify all of the characters in a single string, for example ".£,%()$ " | Punctuation |
maximum_percentage_punctuation | (integer) The maximum percentage of characters in a document that can be punctuation characters. If the percentage of punctuation characters exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. | PercentagePunctuation |
maximum_percentage_alphanumeric_terms | (integer) The maximum percentage of terms in a document that can contain numbers. If the percentage of alphanumeric terms exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. | PercentageAlphaNumerical |
classify_short_documents | (boolean) Specifies how short documents are processed: | ClassifyShortDocuments |
quality_score_field | (string) The name of a new field that is created in the document, and filled with the numeric score (threshold). If you do not specify a value, the field is not created. | QualityScoreField |
report_field | (string) The name of a new field that is created in the document, and filled with a report on the language state of the document. If you do not specify a value, the field is not created. | ReportField |
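The named parameters in the table above might be passed as a Lua table, as in the following sketch. The numeric values, the section name `LanguageCheck`, and the field names are hypothetical and chosen only for illustration; the `handler` entry point is assumed to be provided by the CFS Lua environment.

```lua
-- Sketch only: all values below are illustrative assumptions, not recommendations.
local language_check_params = {
    -- Hypothetical configuration section; parameters that are not set in
    -- this table are read from that section of the CFS configuration file.
    section = "LanguageCheck",
    language = "ENGLISH",                 -- the documented default
    threshold = 60,                       -- stricter than the default of 75
    minimum_valid_terms = 20,             -- skip this check by omitting the key
    maximum_percentage_punctuation = 30,
    -- Hypothetical field names for the score and the report.
    quality_score_field = "LANGUAGE_QUALITY_SCORE",
    report_field = "LANGUAGE_REPORT",
}

function handler(document)
    -- The second argument carries the named parameters.
    looks_like_language(document, language_check_params)
    return true
end
```

Because lower quality scores indicate higher quality, lowering `threshold` makes the check harder to pass.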
Returns
Boolean. Returns True if the document meets all of the conditions and is therefore likely to contain content in the specified language and encoding.
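One way to act on the return value is to record an OCR status on the document for later processing. This is a sketch under assumptions: the helper `ocr_status`, the field name `OCR_STATUS`, and its values are hypothetical, and `addField` is assumed to be available on the LuaDocument object.

```lua
-- Hypothetical helper: maps the boolean result to a status string.
local function ocr_status(matched)
    if matched then
        return "TEXT_OK"
    end
    return "NEEDS_REVIEW"
end

function handler(document)
    -- The result is true only if the document met all of the conditions.
    local matched = looks_like_language(document, { language = "ENGLISH" })

    -- OCR_STATUS is a hypothetical field name used for illustration.
    document:addField("OCR_STATUS", ocr_status(matched))
    return true
end
```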