OpticalCharacterRecognition

Runs optical character recognition on the file(s) associated with an IDOL document FlowFile, and adds the text to the IDOL document.

This processor cannot handle video input.

The processor can handle the following image formats:

Additionally, if you configure your MediaServiceImpl controller service to use a KeyView Export Service, the processor can handle document formats, including:

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.
Media Service   A MediaServiceImpl that manages media analysis resources.
Any Orientation false A Boolean value that specifies whether to process text in any orientation, rather than just upright.
Languages  

A list of languages that you expect to appear in the text. Specify a comma-separated list of language names or ISO 639-1 language codes, for example en,fr,ja for English, French, and Japanese.

For a list of supported languages, right-click the processor and click View Usage, or refer to the Media Server Administration Guide.

Text finding mode document

Specifies the type of media source:

  • Document. A printed page of formatted text, such as a report, magazine or letter.
  • Scene. An image of a general scene that contains text, such as a photograph.
User Dictionary  

You can create your own dictionaries to improve OCR performance when the media that you are analyzing contains proper names or technical terms.

Use this parameter to specify a list of paths to the dictionaries to use.

Each dictionary file must meet the following requirements:

  • The dictionary must be a text file, in ASCII or UTF-8 encoding.
  • Words must be separated by whitespace.
  • The first two letters of the file name must be the language code of the corresponding language. For example, FrenchTownNames.txt starts with "Fr", so it is used with documents written in French (fr).
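
The requirements above can be checked before you point the processor at a dictionary file. The following is a minimal sketch, not part of the product: check_user_dictionary and language_code are hypothetical helper names, and the checks cover only the three requirements listed here.

```python
import os


def check_user_dictionary(path):
    """Return a list of problems found in a user dictionary file.

    Hypothetical helper: it checks only the requirements listed in this
    section and says nothing about how the processor reads the file.
    """
    problems = []
    prefix = os.path.basename(path)[:2]
    # Requirement: the first two letters of the file name give the language code.
    if len(prefix) < 2 or not prefix.isalpha():
        problems.append("file name does not start with a two-letter language code")
    # Requirement: ASCII or UTF-8 text. ASCII is a subset of UTF-8,
    # so a single UTF-8 decode covers both encodings.
    try:
        with open(path, "rb") as handle:
            text = handle.read().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII or UTF-8")
        return problems
    # Requirement: words separated by whitespace. str.split() with no
    # arguments splits on any run of spaces, tabs, or newlines.
    if not text.split():
        problems.append("file contains no words")
    return problems


def language_code(path):
    """Return the two-letter prefix of the file name, lowercased."""
    return os.path.basename(path)[:2].lower()
```

An empty list from check_user_dictionary means that none of the checked requirements were violated.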
Word Rejection Threshold 0 The minimum confidence level that a word must have to be included in the output. Enter a value from 0 to 100. A value of 0 accepts all words.
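
To illustrate the effect of the threshold, the sketch below filters a list of recognized words by confidence. It is an illustration only: the (text, confidence) pair format and the filter_words name are assumptions, and the processor applies the threshold internally rather than through code like this.

```python
def filter_words(words, threshold=0):
    """Keep the words whose confidence is at least the threshold.

    words is an iterable of (text, confidence) pairs, with confidence on
    a 0-100 scale. This data shape is an assumption for illustration.
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # A threshold of 0 keeps every word, because no confidence is below 0.
    return [text for text, confidence in words if confidence >= threshold]


recognized = [("invoice", 92), ("t0tal", 35)]
kept = filter_words(recognized, threshold=50)
```

With a threshold of 50, only the word with confidence 92 is kept. With the default of 0, both words are kept.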
Restrict Character Types  

A list of character types to include in the character set used for recognition. Specify the types of characters that you expect to appear in your media. If you know that your media contains only certain types of characters, such as uppercase characters, limit recognition to those types, because this can increase accuracy.

You can specify one or more of the following:

  • digit
  • letter - includes both lowercase and uppercase letters
  • lowercase
  • uppercase
  • punctuation
  • symbol - includes any character that the letter and digit options do not cover. This means punctuation as well as currency, mathematical, and other non-punctuation symbols.
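
The relationship between these character types can be approximated as follows. This sketch uses Python's own Unicode predicates, which is an assumption: the processor's exact classification may differ, and character_types and is_allowed are hypothetical names.

```python
import unicodedata


def character_types(ch):
    """Return the set of character type names that apply to one character.

    Approximation only: the mapping to Python's Unicode predicates is an
    assumption, not the processor's definition.
    """
    types = set()
    if ch.isdigit():
        types.add("digit")
    if ch.isalpha():
        # letter covers both cases; lowercase and uppercase narrow it down.
        types.add("letter")
        if ch.islower():
            types.add("lowercase")
        if ch.isupper():
            types.add("uppercase")
    # Unicode general categories that start with "P" are punctuation.
    if unicodedata.category(ch).startswith("P"):
        types.add("punctuation")
    # symbol covers everything outside letter and digit, including punctuation.
    if not ch.isalpha() and not ch.isdigit():
        types.add("symbol")
    return types


def is_allowed(ch, restrict_types):
    """Return True if the character belongs to at least one listed type."""
    return bool(character_types(ch) & set(restrict_types))
```

For example, a comma falls under both punctuation and symbol, whereas a currency sign falls under symbol only, so restricting recognition to punctuation would exclude the currency sign.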
Disabled Characters   A list of characters to exclude from the character set used for recognition. Do not put a separator, such as a comma, between the characters. OCR does not return any of the characters that you specify.
Extra Enabled Characters   A list of extra characters to add to the character set used for recognition, in addition to the characters included by the Languages parameter.
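
The list-valued properties use different formats: Languages takes a comma-separated list, whereas Disabled Characters takes the characters with no separator. The sketch below assembles such values as plain strings. The build_ocr_properties function is hypothetical, and how the resulting values are applied to the processor is outside its scope.

```python
def build_ocr_properties(languages, disabled_characters=(), word_rejection_threshold=0):
    """Return property values as strings, keyed by the property names above.

    Hypothetical helper for illustration; it only formats values and does
    not communicate with the processor.
    """
    if not 0 <= word_rejection_threshold <= 100:
        raise ValueError("Word Rejection Threshold must be between 0 and 100")
    return {
        # Comma-separated language names or ISO 639-1 codes.
        "Languages": ",".join(languages),
        # Characters are concatenated with no separator between them.
        "Disabled Characters": "".join(disabled_characters),
        "Word Rejection Threshold": str(word_rejection_threshold),
    }


properties = build_ocr_properties(["en", "fr", "ja"], ["|", "~"], 40)
```

Here the Languages value becomes en,fr,ja and the Disabled Characters value becomes |~.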

Relationships

Name Description
success Processing was successful.
failure Processing failed.