OCR Results

This section describes the format of the results produced by an OCR analysis task.

Results by Line

The following XML shows records from the Result track of an OCR task. The analysis engine produces one record for each line of text in the analyzed image or video frame.

When you process a document, some of the records in the Result track can represent text that was extracted from text elements in the document, unless you have set ProcessTextElements=FALSE.

<record>
    ...
    <trackname>OCR.Result</trackname>
    <OCRResult>
        <id>c0cf6d75-ad43-4fce-8589-e2a297923996</id>
        <text>New rover discovers life on Mars</text>
        <region>
            <left>35</left>
            <top>21</top>
            <width>290</width>
            <height>15</height>
        </region>
        <confidence>99</confidence>
        <angle>0</angle>
        <block>0</block>
        <fontSize>13.56</fontSize>
        <source>image</source>
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
    </OCRResult>
</record>
<record>
    ...
    <trackname>OCR.Result</trackname>
    <OCRResult>
        <id>e17ee583-e980-4d07-92c1-579657f46c3e</id>
        <text>Some more text</text>
        <region>
            <left>89</left>
            <top>66</top>
            <width>140</width>
            <height>15</height>
        </region>
        <confidence>99</confidence>
        <angle>0</angle>
        <block>1</block>
        <fontSize>13.56</fontSize>
        <source>image</source>
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
    </OCRResult>
</record>

Each record contains the following information:

  • The id element provides a unique identifier for the line of text. The OCR analysis engine issues an ID for each detected appearance of a line of text. If you are running OCR on video and consecutive frames show the same text, all records related to that appearance will have the same ID.

    For example, if text appears in the same location for fifty consecutive video frames, the engine uses the same ID for each record in the Data track and produces a single record in the Result track. The record in the Result track has a timestamp that covers all fifty frames.

    If the text moves to a different location on the screen, or disappears and then reappears, the engine treats this as a new detection and produces a new ID and a new record in the Result track.

  • The text element contains the text recognized by OCR.
  • The region element describes the position of the text in the ingested media. If the record represents a text element that has been extracted from a document, the region is accurate only if the source media was a PDF file or a presentation format (such as Microsoft PowerPoint). Position information is not extracted from other document formats.
  • The confidence element provides the confidence score for the OCR process (from 0 to 100). For text that was extracted from a text element in a document, the confidence score is always 100.
  • The angle element gives the orientation of the text (rotated clockwise in degrees from upright).
  • The block element indicates which block of text the result belongs to. Media Server starts counting at zero and increments the counter each time it recognizes a new heading, paragraph, or table. The counter is reset to zero at the start of each page. For example, a page that consists of ten paragraphs will have blocks numbered from zero to nine, in logical reading order. A table, including headings, columns, and rows, is considered to be a single block.
  • The fontSize element provides information about the font size. Larger values may indicate that the text is a heading or more important than other blocks that have a smaller font size. If you know the resolution at which the image was scanned or created (in dots per inch), you can convert this value to points by multiplying it by 72 / DPI.
  • The source element specifies the origin of the text. The possible values are:

    • image - static text from an image or video.
    • image table - text from an image that forms part of a table. The record will be referenced by an OCRTableResult record (see Tables).
    • scroller, left - text from video of a news ticker, with text scrolling from right to left.
    • text - text from a text element in a document.
    • text table - text from a text element in a PDF file that forms part of a table. The record will be referenced by an OCRTableResult record (see Tables).
  • The parentID element is empty, unless you configure the analysis engine with Region=Input, in which case it contains the UUID of the input record. This provides a way to link the result with the record (from another analysis task) that supplied the region to analyze. To generate a single record that combines the information, you can use the Combine ESP engine and the example Lua script parentuuidMatch.lua.
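
As an illustration of reading these fields, the following Python sketch parses one Result record with the standard library and applies the 72 / DPI conversion described for fontSize. The function names parse_ocr_result and font_size_in_points are hypothetical. The sketch assumes that the record is available as a standalone XML string and that the region values are integers, as in the example above.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the Result track example (the "..." line is omitted).
RECORD_XML = """
<record>
    <trackname>OCR.Result</trackname>
    <OCRResult>
        <id>c0cf6d75-ad43-4fce-8589-e2a297923996</id>
        <text>New rover discovers life on Mars</text>
        <region>
            <left>35</left>
            <top>21</top>
            <width>290</width>
            <height>15</height>
        </region>
        <confidence>99</confidence>
        <angle>0</angle>
        <block>0</block>
        <fontSize>13.56</fontSize>
        <source>image</source>
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
    </OCRResult>
</record>
"""


def parse_ocr_result(record_xml):
    """Return the fields of one OCR.Result record as a dictionary."""
    result = ET.fromstring(record_xml).find("OCRResult")
    region = result.find("region")
    return {
        "id": result.findtext("id"),
        "text": result.findtext("text"),
        # Assumption: region values are integers, as in the sample record.
        "region": {name: int(region.findtext(name))
                   for name in ("left", "top", "width", "height")},
        "confidence": int(result.findtext("confidence")),
        "angle": float(result.findtext("angle")),
        "block": int(result.findtext("block")),
        "fontSize": float(result.findtext("fontSize")),
        "source": result.findtext("source"),
        # parentID is empty unless the engine was configured with Region=Input.
        "parentID": result.findtext("parentID") or None,
    }


def font_size_in_points(font_size, dpi):
    """Convert the fontSize value to points by multiplying by 72 / DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return font_size * 72.0 / dpi
```

For example, for an image created at 96 DPI, font_size_in_points(13.56, 96) gives approximately 10.17 points.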

Results by Word

OCR also produces a WordResult output track. This track contains a record for each recognized word. The following XML shows an example record.

NOTE: Text that is extracted from a text element in a document is not output to the WordResult or WordData tracks.

<record>
    ...
    <trackname>OCR.WordResult</trackname>
    <OCRResult>
        <id>c0cf6d75-ad43-4fce-8589-e2a297923996</id>
        <text>New</text>
        <region>
            <left>35</left>
            <top>21</top>
            <width>39</width>
            <height>15</height>
        </region>
        <confidence>99</confidence>
        <angle>0</angle>
        <block>0</block>
        <fontSize>13.56</fontSize>
        <source>image</source>
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
    </OCRResult>
</record>

Each record contains the following information:

  • The id element provides a unique identifier for the word.
  • The text element contains the recognized word.
  • The region element describes the position of the word.
  • The confidence, angle, block, fontSize, and source elements provide the same information as described for the Result track.
  • The parentID element provides the identifier of the line of text to which the word belongs. This identifier will match a record in the Result track, and you can use that record to obtain information about the line.
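
The link between words and lines can be sketched in Python as follows. This is a minimal illustration, not part of Media Server: the function name words_by_line, the <records> wrapper element, and the short placeholder IDs are all assumptions made for the example. It also assumes unrotated, left-to-right text, so that sorting by the left coordinate gives reading order.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative input only: the <records> wrapper and the short IDs are
# placeholders, not real Media Server output.
WORDS_XML = """
<records>
  <record><trackname>OCR.WordResult</trackname><OCRResult>
    <id>word-2</id><text>rover</text>
    <region><left>80</left><top>21</top><width>48</width><height>15</height></region>
    <parentID>line-1</parentID>
  </OCRResult></record>
  <record><trackname>OCR.WordResult</trackname><OCRResult>
    <id>word-1</id><text>New</text>
    <region><left>35</left><top>21</top><width>39</width><height>15</height></region>
    <parentID>line-1</parentID>
  </OCRResult></record>
  <record><trackname>OCR.WordResult</trackname><OCRResult>
    <id>word-3</id><text>Some</text>
    <region><left>89</left><top>66</top><width>40</width><height>15</height></region>
    <parentID>line-2</parentID>
  </OCRResult></record>
</records>
"""


def words_by_line(xml_text):
    """Map each line ID (the word's parentID) to its words, ordered by left edge."""
    grouped = defaultdict(list)
    for result in ET.fromstring(xml_text).iter("OCRResult"):
        left = int(result.findtext("region/left"))
        grouped[result.findtext("parentID")].append((left, result.findtext("text")))
    # Assumption: unrotated, left-to-right text, so sorting by "left" is reading order.
    return {line_id: [word for _, word in sorted(words)]
            for line_id, words in grouped.items()}
```

Each key in the returned dictionary can then be looked up against the id of a record in the Result track to obtain the full line.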

Results by Character

When you analyze an image or document (but not video), OCR produces a CharResult output track. This track contains a record for each line of text. However, the records in this track also provide detail about individual characters. The following XML shows an example record.

NOTE: Text that is extracted from a text element in a document is not output to the CharResult track.

<record>
    ...
    <trackname>OCR.CharResult</trackname>
    <OCRDetail>
        <id>c0cf6d75-ad43-4fce-8589-e2a297923996</id>
        <text>New rover discovers life on Mars</text>
        <region>
            <left>35</left>
            <top>21</top>
            <width>290</width>
            <height>15</height>
        </region>
        <angle>0</angle>
        <block>0</block>
        <fontSize>13.56</fontSize>
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
        <character>
            <text>N</text>
            <region>
                <left>35</left>
                <top>21</top>
                <width>12</width>
                <height>15</height>
            </region>
        </character>
        <character>
            <text>e</text>
            <region>
                <left>49</left>
                <top>25</top>
                <width>10</width>
                <height>11</height>
            </region>
        </character>
        ...
    </OCRDetail>
</record>

Each record includes the following information:

  • The id element provides a unique identifier for the line of text. Every record in the CharResult track has a different id.
  • The text, region, angle, block, fontSize, and parentID elements provide the same information as described for the Result track.
  • There is a character element for each character on the line, including spaces. This element includes the following information:

    • text - the character that was recognized. This element is empty if the character is a space.
    • region - the location of the character in the source media.
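
The following Python sketch shows one way to read the character elements from a CharResult record. The function name read_characters is hypothetical. The first two characters in the sample are taken from the example above, and the remaining regions are made up for illustration. Because the text element is empty for a space, the sketch substitutes a space character.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: "N" and "e" come from the example record; the regions for
# the remaining characters are invented for illustration.
CHAR_RECORD_XML = """
<record>
    <trackname>OCR.CharResult</trackname>
    <OCRDetail>
        <text>New r</text>
        <character><text>N</text>
            <region><left>35</left><top>21</top><width>12</width><height>15</height></region>
        </character>
        <character><text>e</text>
            <region><left>49</left><top>25</top><width>10</width><height>11</height></region>
        </character>
        <character><text>w</text>
            <region><left>60</left><top>25</top><width>14</width><height>11</height></region>
        </character>
        <character><text></text>
            <region><left>74</left><top>21</top><width>6</width><height>15</height></region>
        </character>
        <character><text>r</text>
            <region><left>80</left><top>25</top><width>7</width><height>11</height></region>
        </character>
    </OCRDetail>
</record>
"""


def read_characters(record_xml):
    """Return a list of (character, (left, top, width, height)) tuples."""
    detail = ET.fromstring(record_xml).find("OCRDetail")
    characters = []
    for element in detail.findall("character"):
        # The text element is empty when the character is a space.
        char = element.findtext("text") or " "
        box = tuple(int(element.findtext("region/" + name))
                    for name in ("left", "top", "width", "height"))
        characters.append((char, box))
    return characters
```

Joining the first item of each tuple reproduces the text of the line, including its spaces.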

Page Information

When you analyze an image or document (but not video), OCR produces a PageResult track that provides information about each page.

<record>
    <pageNumber>1</pageNumber>
    <trackname>OCR.PageResult</trackname>
    <OCRPageResult>
        <id>69ad32b4-3f7b-4b92-b153-7bbe6f850e20</id>
        <alphabet>Latin</alphabet>
        <mode>Document</mode>
        <orientation>0</orientation>
        <parentID/>
    </OCRPageResult>
</record>

Each record contains the following information:

  • The alphabet element provides a space-separated list of alphabets that were used on the page. The value can also be "Unknown" if Media Server failed to recognize the text.
  • The mode element describes the mode used by the OCR engine. For descriptions of these modes, see the documentation for the OcrMode configuration parameter. This element never contains the value "Auto"; instead, Media Server reports the mode that was chosen.
  • The orientation element describes the orientation of the page (in 90-degree increments from upright).
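
As a small illustration, the following Python sketch reads a PageResult record and splits the space-separated alphabet list. The function name parse_page_result is hypothetical, and treating "Unknown" as an empty list is a choice made for this example rather than Media Server behavior. The orientation value is returned as it appears in the record, without interpretation.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the PageResult example.
PAGE_XML = """
<record>
    <pageNumber>1</pageNumber>
    <trackname>OCR.PageResult</trackname>
    <OCRPageResult>
        <id>69ad32b4-3f7b-4b92-b153-7bbe6f850e20</id>
        <alphabet>Latin</alphabet>
        <mode>Document</mode>
        <orientation>0</orientation>
        <parentID/>
    </OCRPageResult>
</record>
"""


def parse_page_result(record_xml):
    """Return the page number, alphabets, mode, and orientation for one page."""
    record = ET.fromstring(record_xml)
    page = record.find("OCRPageResult")
    alphabets = page.findtext("alphabet", default="").split()
    return {
        "pageNumber": int(record.findtext("pageNumber")),
        # Example-only choice: report "Unknown" as an empty list of alphabets.
        "alphabets": [] if alphabets == ["Unknown"] else alphabets,
        "mode": page.findtext("mode"),
        "orientation": int(page.findtext("orientation")),
    }
```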

Tables

Media Server can identify tables that occur in images and tables that are constructed from text elements in PDF files. OCR identifies tables only when your session configuration uses the TableResult track.

When OCR recognizes text that appears to be arranged in a table, Media Server outputs a record in the TableResult track that describes the table. The record includes enough structural information to reconstruct the table. The records in the TableResult track do not include the recognized text. Instead, they include record IDs that match the records in the Result, WordResult, and CharResult tracks. This means that when a table is identified, some of the records in the Result, WordResult, and CharResult tracks represent table cells rather than lines of text. The following XML shows an example record from the TableResult track.

<record>
    ...
    <trackname>OCR.TableResult</trackname>
    <OCRTableResult>
        <id>6596a664-b69a-4a33-b9fc-8adb2be6c37f</id>
        <region>
            <left>256</left>
            <top>166</top>
            <width>1213</width>
            <height>362</height>
        </region>
        <block>0</block>
        <columnCount>9</columnCount>
        <rowCount>10</rowCount>
        <row>
            <cell>
                <columnSpan>1</columnSpan>
            </cell>
            <cell>
                <columnSpan>2</columnSpan>
                <OCRResultID>2240914c-440c-40cc-9254-c3c59727953e</OCRResultID>
            </cell>
            <cell>
                <columnSpan>3</columnSpan>
            </cell>
            <cell>
                <columnSpan>3</columnSpan>
                <OCRResultID>9ea804cc-a1a2-4d31-99f8-d1d96a3a1c9e</OCRResultID>
            </cell>
        </row>
        ...
        <parentID>4d69390f-a8c4-4c5d-a0b0-705a3f98aa9b</parentID>
    </OCRTableResult>
</record>

Each record contains the following elements:

  • id - a unique identifier for the table.
  • region - the position and size of the table in the media source.
  • block - indicates which block of text the result belongs to. Media Server starts counting at zero and increments the counter each time it recognizes a new heading, paragraph, or table. The counter is reset to zero at the start of each page. For example, a page that consists of ten paragraphs will have blocks numbered from zero to nine, in logical reading order. A table, including headings, columns, and rows, is considered to be a single block.
  • columnCount - the total number of columns.
  • rowCount - the total number of rows.
  • row - contains the information for a single row. Each row element contains cell elements. Usually the number of cells in a row matches the value of columnCount, but there can be fewer when cells span multiple columns. The number of columns spanned by a cell is given by the columnSpan element. The OCRResultID element provides the ID of an OCR result. This ID matches the ID of relevant records in the Result, WordResult, and CharResult tracks, so that you can obtain the recognized text. If the cell is empty, the OCRResultID element is omitted.
  • parentID - empty, unless you configure the analysis engine with Region=Input, in which case it contains the UUID of the input record. This provides a way to link the result with the record (from another analysis task) that supplied the region to analyze. To generate a single record that combines the information, you can use the Combine ESP engine and the example Lua script parentuuidMatch.lua.

Media Server includes an example session configuration and XSL transform, named Table.cfg and toHTMLTable.xsl, that use the information in the TableResult track to output HTML tables.
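
To show how the row, cell, columnSpan, and OCRResultID elements fit together, the following Python sketch builds a simple HTML table from one TableResult record. It is an independent illustration and does not reproduce toHTMLTable.xsl. The function name table_to_html and the text_by_id dictionary (a mapping from Result record IDs to recognized text that you would build yourself) are hypothetical, and the cell texts in the usage example are placeholders. The sketch assumes at most one OCRResultID per cell, as in the example record.

```python
import xml.etree.ElementTree as ET
from html import escape

# Trimmed sample based on the TableResult example: only the first row is kept,
# so rowCount is set to 1 here.
TABLE_XML = """
<record>
    <trackname>OCR.TableResult</trackname>
    <OCRTableResult>
        <id>6596a664-b69a-4a33-b9fc-8adb2be6c37f</id>
        <columnCount>9</columnCount>
        <rowCount>1</rowCount>
        <row>
            <cell><columnSpan>1</columnSpan></cell>
            <cell>
                <columnSpan>2</columnSpan>
                <OCRResultID>2240914c-440c-40cc-9254-c3c59727953e</OCRResultID>
            </cell>
            <cell><columnSpan>3</columnSpan></cell>
            <cell>
                <columnSpan>3</columnSpan>
                <OCRResultID>9ea804cc-a1a2-4d31-99f8-d1d96a3a1c9e</OCRResultID>
            </cell>
        </row>
    </OCRTableResult>
</record>
"""


def table_to_html(table_record_xml, text_by_id):
    """Render one OCR.TableResult record as an HTML table string."""
    table = ET.fromstring(table_record_xml).find("OCRTableResult")
    html_rows = []
    for row in table.findall("row"):
        cells = []
        for cell in row.findall("cell"):
            span = int(cell.findtext("columnSpan", default="1"))
            # OCRResultID is omitted when the cell is empty.
            result_id = cell.findtext("OCRResultID")
            text = text_by_id.get(result_id, "") if result_id else ""
            cells.append(f'<td colspan="{span}">{escape(text)}</td>')
        html_rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(html_rows) + "</table>"
```

For example, with text_by_id = {"2240914c-440c-40cc-9254-c3c59727953e": "Q1", "9ea804cc-a1a2-4d31-99f8-d1d96a3a1c9e": "Q2"}, the sample row produces four td elements whose colspan values (1, 2, 3, and 3) add up to the columnCount of 9.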