Image Server uses optical character recognition (OCR) to extract text, encoded as UTF-8, from images that contain text, such as scanned documents or frames of TV footage with subtitles.
Image Server first determines which parts of an image have 'text-like' properties, for example by searching for similarly-sized ink blobs grouped into 'word-like' sequences. It then compares the properties of these blobs with known character properties stored in Image Server, and selects the most probable characters. Image Server bases its selection on both the appearance of the text and language context information, including dictionaries. For example, character selections that produce known dictionary words are favored over selections that produce random-looking sequences of letters. Combining all the individual character selections produces a UTF-8 representation of the text parts of the image.
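The idea of favoring dictionary words can be illustrated with a small Python sketch. This is a toy model only, not Image Server's actual algorithm: the function name pick_word, the dictionary_bonus weighting, and the candidate probabilities are all hypothetical, chosen to show how appearance scores and dictionary context might be combined.

```python
from itertools import product
from math import prod


def pick_word(candidates, dictionary, dictionary_bonus=2.0):
    """Choose the best-scoring word from per-position character candidates.

    candidates: one list per character position, holding (char, probability)
    pairs that stand in for appearance-based scores (hypothetical values).
    dictionary: set of known lowercase words.
    dictionary_bonus: assumed multiplier applied to words in the dictionary.
    """
    best_word, best_score = "", 0.0
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        # Appearance score: product of the per-character probabilities.
        score = prod(p for _, p in combo)
        # Language context: favor selections that form a known word.
        if word.lower() in dictionary:
            score *= dictionary_bonus
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# By appearance alone, the second blob looks slightly more like "1" than "l".
candidates = [
    [("c", 0.6), ("e", 0.4)],
    [("1", 0.55), ("l", 0.45)],
    [("o", 0.7), ("0", 0.3)],
    [("c", 0.5), ("e", 0.5)],
    [("k", 0.9), ("h", 0.1)],
]

word_with_dictionary, _ = pick_word(candidates, {"clock"})
word_without_dictionary, _ = pick_word(candidates, set())
```

With the dictionary, the sketch selects "clock" even though "1" is the likelier reading of the second blob in isolation. Without it, the appearance scores alone decide the result. The exhaustive search over every combination is practical only for short words and is used here for clarity.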
By default, Image Server returns any detected text as a string of characters (UTF-8 encoding), together with an overall confidence score and the bounding box of the text on the page. Alternatively, you can configure OCR tasks to return the output broken down into individual lines, words, or even individual characters, together with their own confidence scores and bounding boxes.
Note: Details about individual words and characters are not available for text elements in PDF files. Image Server also does not generate confidence scores for these text elements, because each one would score 100%.
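The output described above (text, a confidence score, and a bounding box, optionally broken down into lines, words, or characters) can be pictured as a nested structure. The following Python sketch is illustrative only: the class names, field names, and sample values are hypothetical and do not represent Image Server's actual response format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Box:
    """Bounding box in pixels (hypothetical field names)."""
    left: int
    top: int
    width: int
    height: int


@dataclass
class TextElement:
    """A region, line, word, or character with its own score and box."""
    text: str
    confidence: float  # assumed to be a percentage, 0 to 100
    box: Box
    children: List["TextElement"] = field(default_factory=list)


def low_confidence(element: TextElement, threshold: float) -> Iterator[TextElement]:
    """Yield the most detailed elements whose confidence is below threshold."""
    if not element.children:
        if element.confidence < threshold:
            yield element
        return
    for child in element.children:
        yield from low_confidence(child, threshold)


# Sample values invented for illustration: one text region split into words.
region = TextElement(
    text="Hello world",
    confidence=88.0,
    box=Box(10, 20, 200, 30),
    children=[
        TextElement("Hello", 97.0, Box(10, 20, 90, 30)),
        TextElement("world", 79.0, Box(110, 20, 100, 30)),
    ],
)

doubtful_words = [word.text for word in low_confidence(region, threshold=85.0)]
```

A breakdown like this makes it possible to flag only the doubtful words for manual review instead of rejecting a whole region because of its overall score.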
OCR is also used for redaction analysis and Intelligent Document Recognition (IDR).