Extract Entities from Tables

Eduction Table mode allows you to extract entities from a table, according to the values in the header of that table. This process allows you to target extraction on likely values in structured data, rather than extracting every possible entity value from a table. It can also improve the confidence that an ambiguous entity value corresponds to a particular type of data.

In standard extraction, Eduction searches text for a value that matches a particular entity. In many cases the entity values are distinctive, and so you can be reasonably confident that matches are relevant. For example, a string that matches an address entity is unlikely to be anything else.

Many other entity values are potentially ambiguous. For example, a number might match several entity types, and a date might be a date of birth or an event date. Without further information, it is difficult to determine whether these values are useful.

For unstructured text, you can use landmarks to find relevant information. Landmarks are values that identify a particular entity, without being a part of the entity value. For example, the phrase Date of Birth is a landmark. When a document contains the value Date of Birth: 06/07/80, it is highly likely that the date is a date of birth, and you can treat the data accordingly.
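The landmark idea can be illustrated with a minimal sketch. It is not how Eduction grammars are implemented; the two regular expressions below are hypothetical stand-ins for a landmark entity and a date entity, and only the date that directly follows the landmark is returned.

```python
import re

# Hypothetical landmark pattern: the phrase that identifies the value.
LANDMARK = r"Date of Birth"
# Hypothetical, deliberately simple date pattern (for example 06/07/80).
DATE = r"\d{2}/\d{2}/\d{2,4}"

# The landmark must precede the date, but only the date is captured.
DOB_PATTERN = re.compile(rf"{LANDMARK}\s*:\s*({DATE})", re.IGNORECASE)


def find_dates_of_birth(text):
    """Return the dates that directly follow a 'Date of Birth' landmark."""
    return DOB_PATTERN.findall(text)


sample = "Date of Birth: 06/07/80. Appointment date: 01/02/23."
print(find_dates_of_birth(sample))  # ['06/07/80']
```

The second date in the sample is ignored because no landmark precedes it, which is the behavior that makes landmarks useful for ambiguous values.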

NOTE: The IDOL PII Package, IDOL PHI Package, and IDOL PCI Package provide landmark entities in most grammars. To extract entities from tables with the Eduction standard grammar files, you might need to create your own landmark entities.

For structured data, it is less likely that the landmark occurs next to the entity. You might have the value Date of Birth in a table heading, and the actual date values in the rows below. In this case, you can use table extraction to extract the values that correspond to the landmark.

Table Formats

In table mode, Eduction can find header and cell values for CSV (comma-separated values) and TSV (tab-separated values) files. If you use Eduction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also use structured XML tables.

Eduction can also process multiple tables from a single text stream when they are separated by the correct table delimiters:

  • Start table: "<blank line><tab line>" (\n\n\t\n)

  • End table: "<tab line><blank line>" (\n\t\n\n)

These delimiters reset the header matches, so that Eduction looks for a new header row.

NOTE: The first table does not need a start delimiter. If there is text outside the table delimiters, Eduction treats it as a new, separate table.
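As a rough sketch of the delimiter format, the following Python code joins several tab-separated tables into one text stream. The function names are hypothetical, and the code assumes that every table (including the first) is wrapped in both delimiters and that the last row has no trailing newline of its own. It has not been verified against Eduction.

```python
TABLE_START = "\n\n\t\n"  # <blank line><tab line>
TABLE_END = "\n\t\n\n"    # <tab line><blank line>


def rows_to_tsv(rows):
    """Join cells with tabs and rows with newlines (no trailing newline)."""
    return "\n".join("\t".join(row) for row in rows)


def build_stream(tables):
    """Wrap each table in the start and end delimiters and concatenate."""
    return "".join(TABLE_START + rows_to_tsv(t) + TABLE_END for t in tables)


people = [["Name", "DOB"], ["Ann", "06/07/80"]]
ids = [["ID"], ["1"]]
stream = build_stream([people, ids])
print(repr(stream))
```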

If you use KeyView to extract table data from your files to send to Eduction, you can configure KeyView to output the correct delimiter format when it extracts tables. To do so, make the following changes in your KeyView configuration:

  • Set the target character set to UTF-8.

  • Enable the Tab Delimited option.

  • Enable the Output Table Delimiters option.

For more information about the KeyView configuration, refer to the KeyView Filter SDK Programming Guide.

Configure Table Extraction

In table extraction, you define an entity or entities that you want to detect in the header row, and entities that you want to detect in the cells under that header. When Eduction matches one of these entities in the header row of a table, it attempts to extract the corresponding cell entities from the cells in that column.

To configure these entities, use the HeaderEntityN and CellEntityN configuration parameters.

For example:

[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all

This example matches date of birth landmark values in the header, and for all subsequent rows in that column, it extracts any date values.
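The header-then-column behavior can be sketched conceptually as follows. This is not the Eduction implementation: the two regular expressions are hypothetical stand-ins for the header and cell entities, and the zero-based row and column numbering is an assumption made for the illustration.

```python
import csv
import io
import re

# Hypothetical stand-ins for the header and cell entities. Real Eduction
# grammars are far more thorough than these two patterns.
HEADER_PATTERN = re.compile(r"date of birth|dob", re.IGNORECASE)
CELL_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2,4}\b")


def extract_from_table(csv_text):
    """Return (row, column, value) for cell matches under matching headers."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Columns whose header cell matches the header pattern.
    columns = [i for i, cell in enumerate(header) if HEADER_PATTERN.search(cell)]
    matches = []
    for row_index, row in enumerate(body, start=1):
        for col in columns:
            if col < len(row):
                for m in CELL_PATTERN.finditer(row[col]):
                    matches.append((row_index, col, m.group()))
    return matches


table = "Name,Date of Birth,Joined\nAnn,06/07/80,01/02/20\nBob,11/12/75,03/04/21\n"
print(extract_from_table(table))  # [(1, 1, '06/07/80'), (2, 1, '11/12/75')]
```

The dates in the Joined column are not returned, even though they match the cell pattern, because that column's header does not match the header pattern.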

NOTE: You can specify multiple entities, either by providing a comma-separated list, or by using wildcard characters. In this case, if the table header matches any of the configured header entities, Eduction matches the cell content against any of the configured cell entities.

This option might be useful if you want to match a particular entity in multiple languages, or if you want to include a custom entity in addition to a standard one.

You must configure any entities that you want to use for matching in the ResourceFiles parameter. For example, the example configuration above uses the combined_date.ecr grammar from the PII grammar set:

ResourceFiles=combined_date.ecr

You can optionally also set:

  • MaxSearchHeaderRow. The number of rows at the top of the table to search for header entities. This option might be useful if there is irrelevant information in the first few rows of your tables. Eduction searches up to the first N non-empty rows, and stops when it finds one of the configured header entities.

  • HeaderEntityMatchLimitN and CellEntityMatchLimitN. The maximum number of header column and cell matches to allow for the corresponding entities. These options might be useful if you want to find some matches for a particular entity, but would prefer to ignore further matches in favor of reducing the processing time.
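The header search described for MaxSearchHeaderRow can be sketched as follows. The function and pattern are hypothetical illustrations of the described behavior (examine up to the first N non-empty rows, stop at the first row that contains a header match), not Eduction code.

```python
import re

# Hypothetical stand-in for a configured header entity.
HEADER_PATTERN = re.compile(r"date of birth", re.IGNORECASE)


def find_header_row(rows, max_search_header_row):
    """Return the index of the first header row, or None if none is found.

    Only the first `max_search_header_row` non-empty rows are examined.
    """
    examined = 0
    for index, row in enumerate(rows):
        if not any(cell.strip() for cell in row):
            continue  # empty rows do not count towards the limit
        if examined >= max_search_header_row:
            break
        examined += 1
        if any(HEADER_PATTERN.search(cell) for cell in row):
            return index
    return None


rows = [["Report 2024"], [], ["Name", "Date of Birth"], ["Ann", "06/07/80"]]
print(find_header_row(rows, 2))  # 2
print(find_header_row(rows, 1))  # None
```

With a limit of 1, only the title row is examined, so the header in the third row is never found; raising the limit to 2 skips the empty row and reaches it.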

To use table extraction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also add the EntityFieldN parameter. This parameter specifies the document field to which CFS or NiFi writes the extracted entities.

In this case, if you do not set EntityFieldN, Eduction uses the value of CellEntityN to create a default field name (the capitalized entity name, with the /, *, and ? characters replaced by underscores).

[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
EntityField0=DATE_OF_BIRTH

NOTE: You cannot specify EntityFieldN for only some of your CellEntityN values; you must either use the default value for all, or set EntityFieldN for all.
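The default field name rule described above can be approximated with a short sketch. The function name is hypothetical, and the code assumes that "capitalized" means converted entirely to upper case; the actual default that Eduction generates might differ in detail.

```python
def default_field_name(cell_entity):
    """Upper-case the entity name and replace / * ? with underscores."""
    return cell_entity.upper().translate(str.maketrans("/*?", "___"))


print(default_field_name("pii/date/nocontext/all"))  # PII_DATE_NOCONTEXT_ALL
```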

These parameters are the same for extracting entities from CSV or TSV table files, and for structured table data in XML, such as the output from Media Server OCR. For structured XML tables, there is an additional parameter, TableCellPath, for CFS and IDOL NiFi Ingest. TableCellPath describes the structure of the XML to allow Eduction to find the cells. For more information, refer to the Connector Framework Server or NiFi Ingest documentation.

For the Eduction SDK, you do not need to configure TableCellPath, because you use functions to locate the cells.

NOTE: You cannot extract entities from structured XML data in Eduction Server or edktool. In these cases you must use a CSV or TSV table file.

Run Table Extraction

After you configure table extraction, you can run Eduction as normal, with a CSV or TSV table file as input.

  • In the Eduction SDK:

    • C: You provide a table file by using the AddInputText or SetInputStream functions. You can use the EdkGetMatchTablePosition function to retrieve the row and column details of a match.

      For structured XML, call EdkAddTableCell to add table cell data to the session. You can optionally also populate an EdkOffset struct with offset information, and pass in a pointer to this as part of the EdkAddTableCell call. This option allows you to generate matches with offsets that reflect the global position of the cell. By default, the produced matches have offsets relative to the start of the cell.

      When a row is complete, call EdkEndTableRow. For the last row of the table, set the bFinalRow argument to true.

    • Java: You provide the table file by using the addInputText or setInputStream functions. You can use public EDKMatch.TablePosition getTablePosition() to return an object with two public members, row and column.

      For structured XML, call addTableCell to add table cell data to the session. To generate matches with offsets that reflect the global position of the cell, call the version that accepts offsetBytes and offsetCodepoints as arguments. The other version produces matches with offsets that are relative to the start of the cell.

      When a row is complete, call endTableRow. For the last row of the table, set the finalRow argument to true.

    • .NET: You provide the table file by using the AddInputText or SetInputStream functions. You can use the read-only property public IExtractionMatchTablePosition TablePosition to return an object that has the read-only properties Row and Column.

      For structured XML, call AddTableCell to add table cell data to the session. To generate matches with offsets that reflect the global position of the cell, call the version that accepts the TextOffset parameter, which is a simple struct that contains the offsets in bytes and Unicode characters, of the start of the cell data in the global input stream. The other version produces matches with offsets that are relative to the start of the cell.

      When a row is complete, call EndTableRow. For the last row of the table, set the final_row argument to true.

    NOTE: To use Table Extraction with the Eduction SDK, you must create an Eduction engine with a configuration file. See the Standalone API Usage section for your language in API Reference.

  • In Eduction Server and the edktool command-line tool, you provide the table file as plain input text. Eduction returns the matches in the response.

  • In CFS and NiFi, the ingestion process sends the table file to the Eduction engine. CFS and NiFi add the match details to the output documents.
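The difference between cell-relative and global offsets, mentioned above for structured XML cells in the SDK, can be illustrated with a small sketch. It does not call the Eduction SDK. The helper names are hypothetical, and the code assumes that the cells of a row appear in the global stream joined by tabs, encoded as UTF-8, starting at offset zero.

```python
def cell_offsets(cells, separator="\t"):
    """Return (byte_offset, codepoint_offset) for the start of each cell.

    Assumes the row appears in the global stream as the cells joined by
    `separator`, encoded as UTF-8, starting at offset zero.
    """
    offsets = []
    byte_pos = 0
    cp_pos = 0
    for cell in cells:
        offsets.append((byte_pos, cp_pos))
        byte_pos += len((cell + separator).encode("utf-8"))
        cp_pos += len(cell + separator)
    return offsets


def to_global(cell_offset, relative_offset):
    """Shift a cell-relative (bytes, codepoints) offset to a global one."""
    return (cell_offset[0] + relative_offset[0],
            cell_offset[1] + relative_offset[1])


row = ["Zo\u00eb", "06/07/80"]  # the first cell contains a two-byte character
offsets = cell_offsets(row)
print(offsets)                        # [(0, 0), (5, 4)]
print(to_global(offsets[1], (0, 0)))  # (5, 4)
```

The byte and code point offsets diverge as soon as a cell contains a multi-byte character, which is why both values are tracked separately.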