Introduction
Media Server can run Optical Character Recognition (OCR) on images such as scanned documents and photographs of documents. You can also run OCR on video to extract subtitles and scrolling text that sometimes appears during television news broadcasts.
Media Server OCR:
- searches images and video for text-like regions, and performs OCR only on those regions.
- provides options to restrict the language and character types used during recognition, which can increase the accuracy of OCR in some cases.
- supports specialized font types.
- supports many languages.
- can automatically adjust when scanned documents are rotated by either 90 or 180 degrees from upright.
- automatically adjusts for skewed text in scanned documents and photographs.
NOTE: Media Server OCR recognizes machine-printed text. Handwritten text is not supported.
OCR and PDF Files
PDF files can contain text stored in the form of an image, and text stored as text (text elements).
When you ingest a PDF file, Media Server uses KeyView Export functionality to generate a raster image of each page. Any visible text elements are rendered as part of the image. You can choose how to use the text elements:
- By default, OCR ignores any part of an image that is covered by a text element, and returns the text contained in the text element. This should result in perfect accuracy and require almost no processing time. The remaining parts of the image are then processed by running OCR.
- If you set ProcessTextElements=FALSE, Media Server uses OCR to process the whole image and does not use the text elements that are embedded in the document.
Sometimes, text elements are added to scanned documents so that users can search the PDF, or highlight and copy text from the document. This can be done by adding invisible text elements over the image of the text. If you ingest one of these documents then by default Media Server will use the text elements rather than running OCR. If the text elements are not accurate then you can set ProcessTextElements=FALSE so that Media Server runs OCR on the original image.
NOTE: Some OCR features are not supported when text is obtained from text elements:
- Media Server does not provide word- or character-level output for text elements. This means that text from text elements does not appear in the WordData, WordResult, or CharResult tracks.
- You can run OCR on part of a page, by setting the Region parameter. Text from a text element is included in the output if any part of the line overlaps the region. (Running OCR on an image produces more precise position information, so individual characters are included or excluded depending on whether they overlap with the region.)
If you require these features, you can set ProcessTextElements=FALSE, but this will take longer and might be less accurate, because Media Server runs OCR on the page and ignores any text elements.
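The difference in how a region is applied to text elements and to OCR output can be illustrated with a small geometric sketch. This is not Media Server code: the Box type, the function names, and the pixel values are hypothetical, and the overlap rule (any shared area counts) is an assumption made only to show line-level versus character-level filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixels (hypothetical type, not a Media Server structure)."""
    left: int
    top: int
    width: int
    height: int

    def overlaps(self, other: "Box") -> bool:
        # True when the two rectangles share a non-zero area (assumed rule).
        return (
            self.left < other.left + other.width
            and other.left < self.left + self.width
            and self.top < other.top + other.height
            and other.top < self.top + self.height
        )


def text_from_text_element(line_text: str, line_box: Box, region: Box) -> str:
    # A text element only has a line-level position, so the line is
    # returned whole if any part of it overlaps the region.
    return line_text if line_box.overlaps(region) else ""


def text_from_ocr(chars: list[tuple[str, Box]], region: Box) -> str:
    # OCR yields a position for each character, so each character is
    # tested against the region individually.
    return "".join(ch for ch, box in chars if box.overlaps(region))


# A line reading "INVOICE" whose characters are each 10 px wide, starting at x=0.
chars = [(ch, Box(i * 10, 0, 10, 12)) for i, ch in enumerate("INVOICE")]
line_box = Box(0, 0, 70, 12)
region = Box(30, 0, 40, 12)  # covers x = 30 to 70, the last four characters

print(text_from_text_element("INVOICE", line_box, region))  # INVOICE
print(text_from_ocr(chars, region))                          # OICE
```

With the same region, the text element contributes the whole line, while OCR contributes only the characters that fall inside the region.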
OCR and Office Documents
Media Server can ingest Microsoft Word and Excel documents, but it does not generate a raster image for each page as it does for PDF files. When you run OCR on a Word or Excel document, Media Server always reads the text using KeyView. The parameter ProcessTextElements has no effect. OCR is used only to process embedded images that KeyView extracts from the Word or Excel file.
Media Server does not use text elements when processing Microsoft PowerPoint presentations. Each slide is exported as an image and text is extracted using OCR.
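The behavior described for PDF and Office files can be summarized as a lookup. The following is only an illustrative sketch: the function name, the returned wording, and the file extensions used to identify each document type are assumptions, and the function does not call Media Server.

```python
def text_sources(extension: str, process_text_elements: bool = True) -> dict[str, str]:
    """Summarize where text comes from for a file type (illustrative only)."""
    ext = extension.lower().lstrip(".")

    if ext == "pdf":
        if process_text_elements:
            # Default: trust the text elements, run OCR on the rest of the page.
            return {
                "document_text": "text elements used as-is",
                "ocr": "parts of the page image not covered by text elements",
            }
        # ProcessTextElements=FALSE: ignore text elements entirely.
        return {"document_text": "ignored", "ocr": "whole page image"}

    if ext in {"doc", "docx", "xls", "xlsx"}:
        # ProcessTextElements has no effect for Word and Excel files.
        return {"document_text": "read using KeyView", "ocr": "embedded images only"}

    if ext in {"ppt", "pptx"}:
        # Slides are exported as images, so all text comes from OCR.
        return {"document_text": "ignored", "ocr": "each slide exported as an image"}

    raise ValueError(f"document type not covered by this sketch: {extension!r}")


for name in ("pdf", "docx", "pptx"):
    print(name, text_sources(name))
```

Calling text_sources("pdf", process_text_elements=False) returns the whole-page OCR case, while the same flag leaves the Word and Excel result unchanged.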