OpticalCharacterRecognition

Runs optical character recognition on the file(s) associated with an IDOL document FlowFile, and adds the text to the IDOL document.

This processor cannot handle video input.

The processor can handle the following image formats:

  • TIFF
  • JPEG
  • JPEG 2000
  • PNG
  • GIF (only the first frame of an animated GIF)
  • BMP (compressed BMP files are not supported) and ICO
  • PBM, PGM, and PPM
  • WebP

Additionally, if you configure your MediaServiceImpl controller service to use a KeyView Export Service, the processor can handle document formats, including:

  • Adobe PDF
  • Microsoft Word Document (.DOC and .DOCX)
  • Microsoft Excel Sheet (.XLS and .XLSX)
  • Microsoft PowerPoint Presentation (.PPT and .PPTX)
  • OpenDocument Text (.ODT)
  • OpenDocument Spreadsheet (.ODS)
  • OpenDocument Presentation (.ODP)
  • Rich Text (RTF)

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.
Media Service   A MediaServiceImpl that manages media analysis resources.
Any Orientation false A Boolean value that specifies whether to process text in any orientation, rather than just upright.
Languages  

A list of languages that you expect to appear in the text. Specify a comma-separated list of language names or ISO 639-1 language codes. For example, en,fr,ja specifies English, French, and Japanese.

For a list of supported languages, right-click the processor and click View Usage, or refer to the Media Server Administration Guide.

Text finding mode document

Specifies the type of media source:

  • Document. A printed page of formatted text, such as a report, magazine or letter.
  • Scene. An image of a general scene that contains text, such as a photograph.
User Dictionary  

You can create your own dictionaries to improve OCR performance when the media that you are analyzing contains proper names or technical terms.

Use this parameter to specify a list of paths to the dictionaries to use.

Each dictionary file must meet the following requirements:

  • The dictionary must be a text file, in ASCII or UTF-8 encoding.
  • Words must be separated by whitespace.
  • The first two letters of the file name must be the language code for the dictionary (for example, FrenchTownNames.txt begins with "Fr", so it is used with text written in French, language code fr).
Word Rejection Threshold 0 The minimum confidence level required to include a word in the output. Enter a value from 0 to 100. A value of 0 accepts all words.
Restrict Character Types  

A list of character types to include in the character set used for recognition. Specify the types of characters that you expect to appear in your media. If you know that your media only contains certain types of characters, such as uppercase characters, limit recognition to these characters because this can increase accuracy.

You can specify one or more of the following:

  • digit
  • letter - includes both lowercase and uppercase letters
  • lowercase
  • uppercase
  • punctuation
  • symbol - includes any character that the letter and digit options do not include. This covers punctuation as well as currency, mathematical, and other non-punctuation symbols.
Disabled Characters   A list of characters to exclude from the character set used for recognition. Do not include a separator, such as a comma, between the characters. OCR does not return any of the characters that you specify.
Extra Enabled Characters   A list of extra characters to add to the character set used for recognition, in addition to the characters included by the Languages parameter.
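
The User Dictionary requirements listed above can be checked before you configure the processor. The following Python sketch is illustrative only: check_dictionary is a hypothetical helper, not part of the product, and it assumes that the Languages property holds ISO 639-1 codes and that the file-name prefix is compared without regard to case.

```python
from pathlib import Path


def check_dictionary(path, languages):
    """Return a list of problems found in a user dictionary file.

    An empty list means that none of the checked requirements failed.
    `languages` is assumed to be a list of ISO 639-1 codes, such as ["en", "fr"].
    """
    path = Path(path)
    problems = []

    # Requirement: the first two letters of the file name give the language code.
    # Assumption: the comparison is case-insensitive ("Fr" matches "fr").
    code = path.name[:2].lower()
    if code not in {lang.lower() for lang in languages}:
        problems.append(f"file name prefix '{code}' is not a configured language")

    # Requirement: the file is text in ASCII or UTF-8 (ASCII is a subset of UTF-8).
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII or UTF-8")
        return problems

    # Requirement: words are separated by whitespace, so split() yields the words.
    if not text.split():
        problems.append("file contains no words")

    return problems
```

For example, check_dictionary("FrenchTownNames.txt", ["en", "fr"]) returns an empty list when the file exists, is UTF-8 encoded, and contains at least one word.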

Relationships

Name Description
success Processing was successful.
failure Processing failed.

Example Output

The following example shows metadata that was added to an IDOL document by OCR.

<idol_media>
  <ocr>
    <block>
      <angle>0</angle>
      <line page="1">
        <region height="17" left="126" page="1" top="162" width="328">
          <text>Rainfall measurements were taken daily</text>
          <word height="14" left="126" top="162" width="59">Rainfall</word>
          <word height="13" left="192" top="163" width="121">measurements</word>
          <word height="10" left="319" top="166" width="40">were</word>
          <word height="14" left="365" top="162" width="45">taken</word>
          <word height="17" left="416" top="162" width="38">daily</word>
        </region>
      </line>
    </block>
  </ocr>
</idol_media>

There is a block element for each block of text, such as a heading or paragraph. The angle element gives the orientation of the block (rotated clockwise in degrees from upright). There is a line element for each line of text that exists within the block.

Each line element contains a region element that describes the position of the line. The left, top, width, and height attributes provide the position and size of the region in pixels (left specifies the distance from the left side of the image to the left side of the region, and top specifies the distance from the top of the image to the top of the region).

The region element includes:

  • A text element that contains the text that was recognized.
  • A word element for each word, giving the exact position of the word.
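
As an illustration of how this structure can be consumed, the following Python sketch reads the block, region, text, and word elements from the example output and converts each left/top/width/height set into corner coordinates. It assumes that you have already extracted the idol_media fragment as a string; the function names box and read_lines are hypothetical and are not part of the product.

```python
import xml.etree.ElementTree as ET

# The example output from this page, held as a string.
SAMPLE = """<idol_media>
  <ocr>
    <block>
      <angle>0</angle>
      <line page="1">
        <region height="17" left="126" page="1" top="162" width="328">
          <text>Rainfall measurements were taken daily</text>
          <word height="14" left="126" top="162" width="59">Rainfall</word>
          <word height="13" left="192" top="163" width="121">measurements</word>
          <word height="10" left="319" top="166" width="40">were</word>
          <word height="14" left="365" top="162" width="45">taken</word>
          <word height="17" left="416" top="162" width="38">daily</word>
        </region>
      </line>
    </block>
  </ocr>
</idol_media>"""


def box(element):
    """Return (left, top, right, bottom) in pixels for a region or word element."""
    left = int(element.get("left"))
    top = int(element.get("top"))
    right = left + int(element.get("width"))
    bottom = top + int(element.get("height"))
    return (left, top, right, bottom)


def read_lines(xml_text):
    """Return one dictionary per region (line of text) found in block elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for block in root.iter("block"):
        # The angle applies to every line in the block.
        angle = float(block.findtext("angle", default="0"))
        for region in block.iter("region"):
            lines.append({
                "page": int(region.get("page")),
                "angle": angle,
                "text": region.findtext("text"),
                "box": box(region),
                "words": [(w.text, box(w)) for w in region.findall("word")],
            })
    return lines


for line in read_lines(SAMPLE):
    print(line["page"], line["box"], line["text"])
```

Corner coordinates are often more convenient than width and height when you draw or crop the recognized regions on the source image.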

OCR can identify tables that occur in images. The following example shows metadata that was added to an IDOL document when a table was detected.

<idol_media>
  <ocr>
    <table>
      <angle>0</angle>
      <row>
        ...
        <cell page="1">
          <span>1</span>
          <region height="26" left="306" page="1" top="206" width="187">
            <text>January total (mm)</text>
            <word height="24" left="306" top="208" width="84">January</word>
            <word height="25" left="388" top="207" width="46">total</word>
            <word height="26" left="438" top="206" width="55">(mm)</word>
          </region>
        </cell>
        ...
      </row>
    </table>
  </ocr>
</idol_media>

There is a table element for each detected table. The angle element gives the orientation of the table (rotated clockwise in degrees from upright). There is a row element for each row of cells in the table, and these contain a cell element for each cell in the row. Each cell element contains:

  • span - the number of columns spanned by the cell.
  • region - the position and size of the table cell. This element contains:

    • A text element that contains the text that was recognized.
    • A word element for each word, giving the exact position of the word.
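
The table structure can be read in a similar way. The following Python sketch is illustrative only: it assumes that the idol_media fragment is available as a string, read_tables is a hypothetical function name, and the sample contains only the single cell shown in the example above. Treating a missing span element as a span of 1 is also an assumption.

```python
import xml.etree.ElementTree as ET

# A single-cell table, taken from the example output on this page.
TABLE_SAMPLE = """<idol_media>
  <ocr>
    <table>
      <angle>0</angle>
      <row>
        <cell page="1">
          <span>1</span>
          <region height="26" left="306" page="1" top="206" width="187">
            <text>January total (mm)</text>
            <word height="24" left="306" top="208" width="84">January</word>
            <word height="25" left="388" top="207" width="46">total</word>
            <word height="26" left="438" top="206" width="55">(mm)</word>
          </region>
        </cell>
      </row>
    </table>
  </ocr>
</idol_media>"""


def read_tables(xml_text):
    """Return each table as a list of rows, and each row as a list of cells."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for row in table.findall("row"):
            cells = []
            for cell in row.findall("cell"):
                region = cell.find("region")
                cells.append({
                    # Assumption: a cell without a span element spans one column.
                    "span": int(cell.findtext("span", default="1")),
                    "text": region.findtext("text") if region is not None else "",
                })
            rows.append(cells)
        tables.append(rows)
    return tables


for rows in read_tables(TABLE_SAMPLE):
    for cells in rows:
        print([(c["text"], c["span"]) for c in cells])
```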