OpticalCharacterRecognition
Runs optical character recognition on the file(s) associated with an IDOL document FlowFile, and adds the text to the IDOL document.
This processor cannot handle video input.
The processor can handle the following image formats:
- TIFF
- JPEG
- JPEG 2000
- PNG
- GIF (only the first frame of an animated GIF)
- BMP (compressed BMP files are not supported) and ICO
- PBM, PGM, and PPM
- WebP
Additionally, if you configure your MediaServiceImpl controller service to use a KeyView Export Service, the processor can handle document formats, including:
- Adobe PDF
- Microsoft Word Document (.DOC and .DOCX)
- Microsoft Excel Sheet (.XLS and .XLSX)
- Microsoft PowerPoint Presentation (.PPT and .PPTX)
- OpenDocument Text (.ODT)
- OpenDocument Spreadsheet (.ODS)
- OpenDocument Presentation (.ODP)
- Rich Text (RTF)
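As a rough illustration of the two format groups above, the following sketch sorts file names into images, which the processor handles directly, and documents, which need a KeyView Export Service. The helper `classify_for_ocr`, its return labels, and the extension sets are hypothetical, chosen only to mirror the lists above. The document list is not exhaustive, and the processor is not known to rely on file extensions.

```python
from pathlib import Path

# Assumption: extensions inferred from the format names listed above.
IMAGE_EXTENSIONS = {
    ".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png", ".gif",
    ".bmp", ".ico", ".pbm", ".pgm", ".ppm", ".webp",
}
DOCUMENT_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".odt", ".ods", ".odp", ".rtf",
}


def classify_for_ocr(filename: str, keyview_export_configured: bool) -> str:
    """Return a hypothetical label describing how a file relates to the lists above."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in DOCUMENT_EXTENSIONS:
        # Document formats are only handled when a KeyView Export Service is configured.
        return "document" if keyview_export_configured else "needs-keyview"
    # Anything else (for example video) is not covered by the lists above.
    return "unknown"
```

A check like this could run before FlowFiles reach the processor, so that files outside both lists are routed elsewhere instead of ending up on the failure relationship.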
Properties
Name | Default Value | Description |
---|---|---|
IDOL License Service | | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server. |
Media Service | | A MediaServiceImpl that manages media analysis resources. |
Any Orientation | false | A Boolean value that specifies whether to process text in any orientation, rather than just upright. |
Languages | | A list of languages that you expect to appear in the text. Specify a comma-separated list of language names or ISO 639-1 language codes. For a list of supported languages, right-click the processor and click View Usage, or refer to the Media Server Administration Guide. |
Text finding mode | document | Specifies the type of media source: |
User Dictionary | | You can create your own dictionaries to improve OCR performance when the media that you are analyzing contains proper names or technical terms. Use this parameter to specify a list of paths to the dictionaries to use. Each dictionary file must meet the following requirements: |
Word Rejection Threshold | 0 | The minimum confidence level required to include a word in the output. Enter a value between zero and 100. The value zero specifies that all words are accepted. |
Restrict Character Types | | A list of character types to include in the character set used for recognition. Specify the types of characters that you expect to appear in your media. If you know that your media contains only certain types of characters, such as uppercase characters, limit recognition to those types because this can increase accuracy. You can specify one or more of the following: |
Disabled Characters | | A list of characters to exclude from the character set used for recognition. Do not include a separator, such as a comma, between each character. OCR does not return any of the characters that you specify. |
Extra Enabled Characters | | A list of extra characters to add to the character set used for recognition, in addition to the characters included by the Languages parameter. |
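The value rules for Languages and Word Rejection Threshold can be sketched as follows. This is a minimal illustration, not the processor's implementation: the function names, the `(text, confidence)` word representation, and the sample language codes are all assumptions made for the example.

```python
def parse_languages(value: str) -> list[str]:
    """Split a comma-separated Languages value into trimmed, non-empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_threshold(value: int) -> int:
    """Reject Word Rejection Threshold values outside the documented 0 to 100 range."""
    if not 0 <= value <= 100:
        raise ValueError(
            f"Word Rejection Threshold must be between 0 and 100, got {value}"
        )
    return value


def apply_threshold(words: list[tuple[str, float]], threshold: int) -> list[str]:
    """Keep words whose confidence meets the threshold; zero keeps every word."""
    check_threshold(threshold)
    # Assumption: each word is a hypothetical (text, confidence) pair on a 0-100 scale.
    return [text for text, confidence in words if confidence >= threshold]


# Illustrative values only; "en" and "fr" are ISO 639-1 codes used as samples.
languages = parse_languages("en, fr")
kept = apply_threshold([("invoice", 92.0), ("t0tal", 35.5)], 50)
```

With a threshold of 50, only the word with confidence 92.0 is kept. With the default of 0, both words are kept, which matches the description that zero accepts all words.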
Relationships
Name | Description |
---|---|
success | Processing was successful. |
failure | Processing failed. |