OpticalCharacterRecognition
Runs optical character recognition on the file(s) associated with an IDOL document FlowFile, and adds the text to the IDOL document.
This processor cannot handle video input.
The processor can handle the following image formats:
- TIFF
- JPEG
- JPEG 2000
- PNG
- GIF (only the first frame of an animated GIF)
- BMP (compressed BMP files are not supported) and ICO
- PBM, PGM, and PPM
- WebP
Additionally, if you configure your MediaServiceImpl controller service to use a KeyView Export Service, the processor can handle document formats, including:
- Adobe PDF
- Microsoft Word Document (.DOC and .DOCX)
- Microsoft Excel Sheet (.XLS and .XLSX)
- Microsoft PowerPoint Presentation (.PPT and .PPTX)
- OpenDocument Text (.ODT)
- OpenDocument Spreadsheet (.ODS)
- OpenDocument Presentation (.ODP)
- Rich Text (RTF)
Properties
Name | Default Value | Description |
---|---|---|
IDOL License Service | | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server. |
Media Service | | A MediaServiceImpl that manages media analysis resources. |
Any Orientation | false | A Boolean value that specifies whether to process text in any orientation, rather than just upright. |
Languages | | A list of languages that you expect to appear in the text. Specify a comma-separated list of language names or ISO 639-1 language codes. For a list of supported languages, right-click the processor and click View Usage, or refer to the Media Server Administration Guide. |
Text finding mode | document | Specifies the type of media source: |
User Dictionary | | You can create your own dictionaries to improve OCR performance when the media that you are analyzing contains proper names or technical terms. Use this parameter to specify a list of paths to the dictionaries to use. Each dictionary file must meet the following requirements: |
Word Rejection Threshold | 0 | The minimum confidence level required to include a word in the output. Enter a value between zero and 100. The value zero specifies that all words are accepted. |
Restrict Character Types | | A list of character types to include in the character set used for recognition. Specify the types of characters that you expect to appear in your media. If you know that your media only contains certain types of characters, such as uppercase characters, limit recognition to these characters because this can increase accuracy. You can specify one or more of the following: |
Disabled Characters | | A list of characters to exclude from the character set used for recognition. Do not include a separator, such as a comma, between each character. OCR does not return any of the characters that you specify. |
Extra Enabled Characters | | A list of extra characters to add to the character set used for recognition, in addition to the characters included by the Languages parameter. |
Relationships
Name | Description |
---|---|
success | Processing was successful. |
failure | Processing failed. |
Example Output
The following example shows metadata that was added to an IDOL document by OCR.
<idol_media>
  <ocr>
    <block>
      <angle>0</angle>
      <line page="1">
        <region height="17" left="126" page="1" top="162" width="328">
          <text>Rainfall measurements were taken daily</text>
          <word height="14" left="126" top="162" width="59">Rainfall</word>
          <word height="13" left="192" top="163" width="121">measurements</word>
          <word height="10" left="319" top="166" width="40">were</word>
          <word height="14" left="365" top="162" width="45">taken</word>
          <word height="17" left="416" top="162" width="38">daily</word>
        </region>
      </line>
    </block>
  </ocr>
</idol_media>
There is a `block` element for each block of text, such as a heading or paragraph. The `angle` element gives the orientation of the block (rotated clockwise in degrees from upright). There is a `line` element for each line of text that exists within the block.
Each `line` element contains a `region` element that describes the position of the line. The `left`, `top`, `width`, and `height` attributes provide the position and size of the region in pixels (`left` specifies the distance from the left side of the image to the left side of the region, and `top` specifies the distance from the top of the image to the top of the region).
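To illustrate the coordinate convention described above, the following minimal Python sketch converts the `left`, `top`, `width`, and `height` attributes into corner coordinates. The `Box` type and the `region_to_box` helper are hypothetical names introduced for this illustration, and treating `left + width` and `top + height` as the right and bottom edges is an assumption, not something this page specifies.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Box(NamedTuple):
    """Corner coordinates in pixels, measured from the top-left of the image."""
    left: int
    top: int
    right: int
    bottom: int


def region_to_box(element: ET.Element) -> Box:
    """Convert left/top/width/height attributes into corner coordinates.

    Assumption: the right and bottom edges are left + width and top + height.
    """
    left = int(element.get("left"))
    top = int(element.get("top"))
    width = int(element.get("width"))
    height = int(element.get("height"))
    return Box(left, top, left + width, top + height)


# The region element from the example output, with its children omitted.
region = ET.fromstring(
    '<region height="17" left="126" page="1" top="162" width="328"/>'
)
box = region_to_box(region)
print(box)  # Box(left=126, top=162, right=454, bottom=179)
```

The same helper works on `word` elements, because they carry the same four attributes.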
The `region` element includes:
- A `text` element that contains the text that was recognized.
- A `word` element for each word, that describes the exact position of the word.
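The structure described above can be read with any XML parser. The following Python sketch uses the standard library to collect the recognized text and word positions for each line. It assumes that the OCR metadata is available as a standalone XML string rooted at `idol_media`, as in the example output; how you obtain that string from an IDOL document is outside the scope of this sketch, and `extract_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The example output from this page, used as sample input.
SAMPLE = """
<idol_media>
  <ocr>
    <block>
      <angle>0</angle>
      <line page="1">
        <region height="17" left="126" page="1" top="162" width="328">
          <text>Rainfall measurements were taken daily</text>
          <word height="14" left="126" top="162" width="59">Rainfall</word>
          <word height="13" left="192" top="163" width="121">measurements</word>
          <word height="10" left="319" top="166" width="40">were</word>
          <word height="14" left="365" top="162" width="45">taken</word>
          <word height="17" left="416" top="162" width="38">daily</word>
        </region>
      </line>
    </block>
  </ocr>
</idol_media>
"""


def extract_lines(xml_text):
    """Return one dictionary per line element, including its words."""
    root = ET.fromstring(xml_text)
    lines = []
    for block in root.iter("block"):
        # The angle applies to every line in the block.
        angle = float(block.findtext("angle", default="0"))
        for line in block.iter("line"):
            region = line.find("region")
            if region is None:
                continue
            words = [
                {
                    "text": word.text or "",
                    "left": int(word.get("left")),
                    "top": int(word.get("top")),
                    "width": int(word.get("width")),
                    "height": int(word.get("height")),
                }
                for word in region.findall("word")
            ]
            lines.append(
                {
                    "page": int(line.get("page")),
                    "angle": angle,
                    "text": region.findtext("text", default=""),
                    "words": words,
                }
            )
    return lines


for line in extract_lines(SAMPLE):
    print(line["page"], line["text"])
```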
OCR can identify tables that occur in images. The following example shows metadata that was added to an IDOL document when a table was detected.
<idol_media>
  <ocr>
    <table>
      <angle>0</angle>
      <row>
        ...
        <cell page="1">
          <span>1</span>
          <region height="26" left="306" page="1" top="206" width="187">
            <text>January total (mm)</text>
            <word height="24" left="306" top="208" width="84">January</word>
            <word height="25" left="388" top="207" width="46">total</word>
            <word height="26" left="438" top="206" width="55">(mm)</word>
          </region>
        </cell>
        ...
      </row>
    </table>
  </ocr>
</idol_media>
There is a `table` element for each detected table. The `angle` element gives the orientation of the table (rotated clockwise in degrees from upright). There is a `row` element for each row of cells in the table, and these contain a `cell` element for each cell in the row. Each `cell` element contains:
- `span` - the number of columns spanned by the cell.
- `region` - the position and size of the table cell. This element contains:
  - A `text` element that contains the text that was recognized.
  - A `word` element for each word, that describes the exact position of the word.
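As a rough illustration of how the table metadata might be consumed, the following Python sketch walks the `table`, `row`, and `cell` elements and returns the text and column span of each cell. The sample input is the example above trimmed to its single visible cell (the elided cells are not reconstructed), and it assumes the metadata is available as a standalone XML string rooted at `idol_media`. `extract_tables` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The table example from this page, reduced to the one cell that it shows.
TABLE_SAMPLE = """
<idol_media>
  <ocr>
    <table>
      <angle>0</angle>
      <row>
        <cell page="1">
          <span>1</span>
          <region height="26" left="306" page="1" top="206" width="187">
            <text>January total (mm)</text>
            <word height="24" left="306" top="208" width="84">January</word>
            <word height="25" left="388" top="207" width="46">total</word>
            <word height="26" left="438" top="206" width="55">(mm)</word>
          </region>
        </cell>
      </row>
    </table>
  </ocr>
</idol_media>
"""


def extract_tables(xml_text):
    """Return a list of tables; each table is a list of rows of (text, span)."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for row in table.findall("row"):
            cells = []
            for cell in row.findall("cell"):
                # Treat a missing span element as a single column (assumption).
                span = int(cell.findtext("span", default="1"))
                text = cell.findtext("region/text", default="")
                cells.append((text, span))
            rows.append(cells)
        tables.append(rows)
    return tables


for table in extract_tables(TABLE_SAMPLE):
    for row in table:
        print(row)
```

Keeping the span alongside the text lets a caller rebuild the column layout later, for example by padding a row with empty entries for each extra column that a cell spans.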