Optical Character Recognition

When it processes raster image files, KeyView can perform Optical Character Recognition (OCR) to filter any text that is visible in the image. If KeyView detects that text forms part of a table, it filters that text in the same way as a table in a word processing document.

NOTE: KeyView performs OCR only on standalone raster files, not on images embedded inside other documents. For embedded images, you must first extract the images by using the Extract Images option.

NOTE: OCR is available only on certain platforms (see Optical Character Recognition in the platform differences section).

If your license includes OCR, it is enabled by default.

To enable or disable OCR

Optimize OCR Performance

The default settings for OCR attempt to detect as much text as possible. For example, KeyView attempts to detect text in multiple languages and alphabets, and rotated text in increments of 90 degrees from upright. This increases the amount of text that can be detected, prioritizing recall over processing time.

If you know in advance what kind of images you will process, you can specify OCR options to reduce processing time.

To configure OCR through the C API, create a KVOcrOptions structure, and then call fpSetConfig() with the KVFLT_OCR option.

For example, if the input is scanned pages that contain only English or only Japanese text, the following configuration could improve performance. However, it might fail to recognize text in some images, such as landscape pages where the text is not upright.

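KVOcrOptions options;
KVStructInit(&options);                   /* initialize the structure before setting fields */
options.textFindingMode = KVOcrDocument;  /* input is scanned pages of formatted text */
options.languages = "en ja";              /* consider only English and Japanese */
options.orientation = KVOcrUpright;       /* assume upright text; no rotation detection */
options.detectAlphabet = KVOcrListed;     /* detect the alphabet among the listed languages */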

Text Finding Mode

OCR can use different algorithms for finding text. Each algorithm is optimized for a different type of image:

  • Document - A scanned or printed page of formatted text, such as a report, magazine, or letter.
  • Scene - An image of a general scene that contains text, such as a photograph or TV footage.
  • Hollow - A scene image that contains outlined text, such as the white characters with a black border that are often used in television subtitles.
  • Auto - The IDOL OCR library selects the algorithm automatically.

Languages

OCR supports many different languages. For a list of supported languages, see OCR Supported Languages. If you know that your files contain text in only one language or a small number of languages, you can improve both processing speed and accuracy by configuring OCR with this information.

Orientation

By default, OCR attempts to detect text that appears rotated, in 90-degree increments from upright. This means that KeyView can filter text from an image, even if it has been rotated or was scanned upside-down. If you know that your images contain only upright text, you can improve processing speed by disabling this feature.

Alphabet Detection

If you do not know the language of the input text in advance, you might specify multiple languages. OCR requires more processing time for each additional language, especially when the languages span multiple alphabets (Latin, Cyrillic, Chinese, Arabic, and so on).

You can configure OCR to detect the alphabet for each image, before attempting to recognize characters. You can choose one of the following options.

  • Off. By default, OCR does not detect the alphabet. Use this option when you have specified a single language or multiple languages that use the same alphabet. OpenText also recommends this option when you expect an image to use multiple alphabets (for example, when there is English and Arabic text on the same page).
  • Listed. OCR detects the alphabet, but only considers alphabets that are represented in your chosen list of languages. This option can reduce the time required to recognize characters, because languages that do not match the detected alphabet are ignored. For example, if you set languages="en ja ko" (English, Japanese, and Korean) and OCR detects the Latin alphabet, OCR ignores the Japanese and Korean languages. OpenText recommends using this option when each source image uses a single alphabet, and the list of possible languages is known but spans multiple alphabets.
  • Any. OCR detects the alphabet that is used, and considers all alphabets. This option can reduce the time required to recognize characters, because languages that do not match the detected alphabet are ignored. If none of your chosen languages match the detected alphabet, OCR does not recognize characters and there is no output. OpenText recommends using this option instead of Listed when you want to reject images that do not match any of the specified languages.
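
The difference between the three options can be modeled with a short, self-contained C sketch. It does not use the KeyView API and does not describe how OCR is implemented internally. The types Alphabet and DetectMode, the functions lookup_alphabet(), alphabet_is_candidate(), and select_languages(), and the small language-to-alphabet table (including the "ru" entry) are all hypothetical names introduced here for illustration. The sketch only shows which alphabets a detector would be allowed to report, and which of the listed languages would still be tried once an alphabet has been reported.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; these are not KeyView types. */
typedef enum { ALPHA_LATIN, ALPHA_CYRILLIC, ALPHA_JAPANESE, ALPHA_KOREAN } Alphabet;
typedef enum { DETECT_OFF, DETECT_LISTED, DETECT_ANY } DetectMode;

typedef struct {
    const char *code;
    Alphabet alphabet;
} LangInfo;

/* Assumed language-to-alphabet mapping, deliberately tiny. */
static const LangInfo kLangTable[] = {
    { "en", ALPHA_LATIN },
    { "ru", ALPHA_CYRILLIC },
    { "ja", ALPHA_JAPANESE },
    { "ko", ALPHA_KOREAN },
};

/* Returns 1 and stores the alphabet if the code is in the table, else 0. */
int lookup_alphabet(const char *code, Alphabet *out)
{
    size_t i;
    for (i = 0; i < sizeof kLangTable / sizeof kLangTable[0]; ++i) {
        if (strcmp(kLangTable[i].code, code) == 0) {
            *out = kLangTable[i].alphabet;
            return 1;
        }
    }
    return 0;
}

/* Can the detector report this alphabet under the given mode? */
int alphabet_is_candidate(DetectMode mode, Alphabet alphabet,
                          const char *const *langs, size_t n)
{
    size_t i;
    Alphabet a;
    if (mode == DETECT_OFF) return 0;  /* Off: there is no detection step */
    if (mode == DETECT_ANY) return 1;  /* Any: every alphabet is considered */
    for (i = 0; i < n; ++i) {          /* Listed: only alphabets in the list */
        if (lookup_alphabet(langs[i], &a) && a == alphabet) return 1;
    }
    return 0;
}

/* Copies into out the languages that recognition would still try and
 * returns how many were kept. out must have room for n entries.
 * With Off, every listed language is kept; otherwise only languages
 * whose alphabet matches the detected one are kept. */
size_t select_languages(DetectMode mode, Alphabet detected,
                        const char *const *langs, size_t n,
                        const char **out)
{
    size_t i, kept = 0;
    Alphabet a;
    for (i = 0; i < n; ++i) {
        if (mode == DETECT_OFF ||
            (lookup_alphabet(langs[i], &a) && a == detected)) {
            out[kept++] = langs[i];
        }
    }
    return kept;
}
```

In this model, with the language list "en", "ja", "ko" and Listed, a detected Latin alphabet leaves only "en" to be tried, and Cyrillic is never a candidate. With Any, Cyrillic can be reported, in which case no listed language matches and nothing is recognized, which corresponds to the rejection behavior described above.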

If your input contains Chinese, Japanese, or Korean text with some ASCII characters, you can safely set alphabet detection to any of the available options, because OCR includes ASCII characters for those languages.