Extract Entities from Tables
Eduction Table mode allows you to extract entities from a table, according to the values in the header of that table. This process allows you to target extraction on likely values in structured data, rather than extracting every possible entity value from a table. It can also improve the confidence that an ambiguous entity value corresponds to a particular type of data.
In standard extraction, Eduction searches text for a value that matches a particular entity. In many cases the entity values are distinctive, and so you can be reasonably confident that matches are relevant. For example, a string that matches an address entity is unlikely to be anything else.
Many other entity values are potentially ambiguous. For example, a number might match several entity types, and a date might be a date of birth or an event date. Without further information, it is difficult to determine whether these values are useful.
For unstructured text, you can use landmarks to find relevant information. Landmarks are values that identify a particular entity, without being a part of the entity value. For example, the phrase Date of Birth is a landmark. When a document contains the value Date of Birth: 06/07/80, it is highly likely that the date is a date of birth, and you can treat the data accordingly.
NOTE: The IDOL PII Package, IDOL PHI Package, and IDOL PCI Package provide landmark entities in most grammars. To extract entities from tables with the Eduction standard grammar files, you might need to create your own landmark entities.
For structured data, it is less likely that the landmark occurs next to the entity. You might have the value Date of Birth in a table heading, and the actual date values in the rows below. In this case, you can use table extraction to extract the values that correspond to the landmark.
Table Formats
In table mode, Eduction can find header and cell values for CSV (comma-separated values) and TSV (tab-separated values) files. If you use Eduction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also use structured XML tables.
Eduction can also process multiple tables from a single text stream when they are separated by the correct table delimiters:
- Start table: "<blank line><tab line>" (\n\n\t\n)
- End table: "<tab line><blank line>" (\n\t\n\n)
These delimiters reset the header matches, so that Eduction looks for a new header row.
NOTE: If you use multiple tables, you must use tab separators (TSV) for your tables. This prevents ambiguities between separator commas and commas in lines of irrelevant content that occur inside your tab delimiters (see MaxSearchHeaderRow).
The first table does not need a start delimiter. If there is text outside table delimiters, Eduction treats it as a new separate table.
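As a rough illustration of the delimiters described above, the following Python sketch joins several tables into one tab-separated text stream. The helper names rows_to_tsv and join_tables are hypothetical, and placing an end delimiter directly before the next start delimiter is an assumption; check the resulting stream against your own data before relying on it.

```python
# Delimiter strings as given in the text above.
START_TABLE = "\n\n\t\n"   # <blank line><tab line>
END_TABLE = "\n\t\n\n"     # <tab line><blank line>


def rows_to_tsv(rows):
    """Serialize a list of rows (each a list of cell strings) as TSV text."""
    return "\n".join("\t".join(cells) for cells in rows)


def join_tables(tables):
    """Concatenate several tables into one text stream (hypothetical helper)."""
    parts = []
    for index, rows in enumerate(tables):
        # The first table does not need a start delimiter.
        prefix = "" if index == 0 else START_TABLE
        parts.append(prefix + rows_to_tsv(rows) + END_TABLE)
    return "".join(parts)


tables = [
    [["Name", "Date of Birth"], ["A. Example", "06/07/80"]],
    [["Event", "Event Date"], ["Launch", "01/02/21"]],
]
stream = join_tables(tables)
```

The sketch uses tab separators throughout, in line with the requirement that multiple tables in one stream use TSV.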
If you use KeyView to extract table data from your files to send to Eduction, you can configure KeyView to produce the correct delimiter format when it extracts tables. To get the right format from KeyView, you must make the following changes in your KeyView configuration:
- Set the target character set to UTF-8.
- Enable the Tab Delimited option.
- Enable the Output Table Delimiters option.
For more information about the KeyView configuration, refer to the KeyView Filter SDK Programming Guide.
Configure Table Extraction
In table extraction, you define an entity or entities that you want to detect in the header row, and entities that you want to detect in the cells under that header. When Eduction matches one of these entities in the header row of a table, it attempts to extract the corresponding cell entities from the cells in that column.
To configure these, you use the HeaderEntityN and CellEntityN configuration parameters.
For example:
[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
This example matches date of birth landmark values in the header, and for all subsequent rows in that column, it extracts any date values.
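To make the header-then-cell behavior concrete, the following Python sketch imitates it on a small CSV table. It is not how Eduction is implemented: the two regular expressions are hypothetical stand-ins for the header and cell entities, the function name extract_from_table is invented for illustration, and the row numbering (data rows counted from 1) is an assumption.

```python
import csv
import io
import re

# Stand-ins for Eduction entities (hypothetical patterns, not real grammars).
HEADER_PATTERN = re.compile(r"date of birth|dob", re.IGNORECASE)
CELL_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2,4}\b")


def extract_from_table(text):
    """Return (row, column, value) for cells under a matching header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Columns whose header cell matches the landmark pattern.
    columns = [i for i, cell in enumerate(header) if HEADER_PATTERN.search(cell)]
    matches = []
    for row_number, row in enumerate(body, start=1):
        for column in columns:
            if column < len(row):
                for value in CELL_PATTERN.findall(row[column]):
                    matches.append((row_number, column, value))
    return matches


sample = (
    "Name,Date of Birth,Joined\n"
    "A. Example,06/07/80,01/02/21\n"
    "B. Sample,11/12/75,03/04/19\n"
)
results = extract_from_table(sample)
# results == [(1, 1, "06/07/80"), (2, 1, "11/12/75")]
```

The dates in the Joined column are ignored because that header does not match the landmark pattern, which is the targeting effect that table mode is intended to provide.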
NOTE: You can specify multiple entities, either by providing a comma-separated list, or by using wildcard characters. In this case, if the table header matches any of the configured header entities, Eduction matches the cell content against any of the configured cell entities.
This option might be useful if you want to match a particular entity in multiple languages, or if you want to include a custom entity in addition to a standard one.
You must configure any entities that you want to use for matching in the ResourceFiles parameter. For example, the example configuration above uses the combined_date.ecr grammar from the PII grammar set:
ResourceFiles=combined_date.ecr
You can optionally also set:
- MaxSearchHeaderRow. The number of rows at the top of the table to search for header entities. This option might be useful if there is irrelevant information in the first few rows of your tables. Eduction searches up to the first N non-empty rows, and stops when it finds one of the configured header entities.
- HeaderEntityMatchLimitN and CellEntityMatchLimitN. The maximum number of header column and cell matches to allow for the corresponding entities. These options might be useful if you want to find some matches for a particular entity, but would prefer to ignore further matches in favor of reducing the processing time.
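The MaxSearchHeaderRow behavior can be pictured with the following Python sketch. It only imitates the description above (search up to the first N non-empty rows, stop at the first header match); the function name find_header_row and the regular expression are hypothetical, and the real matching is done by Eduction entities rather than by a regex.

```python
import re

# Hypothetical stand-in for a configured header entity.
HEADER_PATTERN = re.compile(r"date of birth", re.IGNORECASE)


def find_header_row(rows, max_search_header_row):
    """Return the index of the first row with a header match, or None."""
    searched = 0
    for index, row in enumerate(rows):
        if not any(cell.strip() for cell in row):
            continue  # empty rows do not count toward the limit
        if searched == max_search_header_row:
            break  # limit reached without finding a header
        searched += 1
        if any(HEADER_PATTERN.search(cell) for cell in row):
            return index  # stop at the first header match
    return None


rows = [
    ["Quarterly export"],
    [],
    ["Generated by HR"],
    ["Name", "Date of Birth"],
    ["A. Example", "06/07/80"],
]
found_with_three = find_header_row(rows, 3)   # header is the third non-empty row
found_with_two = find_header_row(rows, 2)     # limit too low, header not reached
```

In this sketch a limit of 3 finds the header at index 3, while a limit of 2 returns None because the header is the third non-empty row.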
To use table extraction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also add the TableEntityFieldN or EntityFieldN parameter. These parameters specify the field that CFS or NiFi writes the extracted entities to in your documents.
NOTE: When you configure table extraction, you can use either TableEntityFieldN or EntityFieldN. If you configure both, Eduction uses TableEntityFieldN for table entities.
You can use TableEntityFieldN when you configure a mix of table and free text (non-table) entities, to extract table entities to a different field from the free text entities. See Configure Mixed Table and Free Text Entities.
In this case, if you do not set TableEntityFieldN or EntityFieldN, Eduction uses the value of CellEntityN to create a default field name (the capitalized entity name, with /, *, and ? characters replaced with underscores).
[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
EntityField0=DATE_OF_BIRTH
NOTE: You cannot specify a field value (by using TableEntityFieldN or EntityFieldN) for only some of your CellEntityN values; you must either use the default value for all, or set a value for all.
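The default field name rule described above (used when neither TableEntityFieldN nor EntityFieldN is set) can be approximated with the following Python sketch. The function name default_field_name is hypothetical, and reading "capitalized" as "converted to upper case" is an assumption; verify the actual field names in your output documents.

```python
import re


def default_field_name(cell_entity):
    """Approximate the default field name for a CellEntityN value."""
    # Replace /, *, and ? with underscores.
    # Assumption: "capitalized" means converted to upper case.
    return re.sub(r"[/*?]", "_", cell_entity).upper()


name = default_field_name("pii/date/nocontext/all")
# In this sketch, name == "PII_DATE_NOCONTEXT_ALL"
```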
These parameters are the same for extracting entities from CSV or TSV table files, and for structured table data in XML, such as the output from Media Server OCR. For structured XML tables, there is an additional parameter, TableCellPath, for CFS and IDOL NiFi Ingest. TableCellPath describes the structure of the XML to allow Eduction to find the cells. For more information, refer to the Connector Framework Server or NiFi Ingest documentation.
For the Eduction SDK, you do not need to configure TableCellPath, because you use functions to locate the cells.
NOTE: You cannot extract entities from structured XML data in Eduction Server or edktool. In these cases you must use a CSV or TSV table file.
Configure Mixed Table and Free Text Entities
You can configure table entities alongside free text (non-table) entities. In this mixed mode, Eduction identifies tables in your input text and searches them for table entity matches, and it searches any blocks of free text for free text entity matches.
When Eduction identifies a table but does not find a header match for a particular column, it searches the rows of that column for any configured free text entity matches. In this way, Eduction can still search for entity matches even if it does not match the headers. Similarly, if you configure MaxSearchHeaderRow, Eduction searches the initial rows that do not contain header matches for free text entity matches.
NOTE: When Eduction does find a HeaderEntityN for a particular column, it searches only for the configured CellEntityN entities in that column.
In mixed mode for CFS and IDOL NiFi Ingest, you can configure both TableEntityFieldN and EntityFieldN to avoid ambiguity between the table and non-table entities. Eduction writes table entity values to the TableEntityFieldN, and writes free text entity values to the EntityFieldN.
The following example shows a mixed configuration with both table and free text entities.
[Eduction]
ResourceFiles=testfiles/simple_pii.xml
# Free text entities
Entity0=simple_pii/name
EntityField0=FREE_TEXT_MATCH_NAME
Entity1=simple_pii/weather
EntityField1=FREE_TEXT_MATCH_WEATHER
# Table entities
HeaderEntity0=simple_pii/name_header
CellEntity0=simple_pii/name
TableEntityField0=TABLE_MATCH_NAME
HeaderEntity1=simple_pii/number_header
CellEntity1=simple_pii/number
TableEntityField1=TABLE_MATCH_NUMBER
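The per-column routing in mixed mode can be pictured with the following Python sketch. It is a simplified imitation, not Eduction's implementation: the regular expressions are hypothetical stand-ins for the weather and number entities, the function name extract_mixed is invented, and the sketch assumes that the first row is the header row and does not search that row for free text matches.

```python
import re

# Hypothetical stand-ins for the entities in the configuration above.
FREE_TEXT = {
    "FREE_TEXT_MATCH_WEATHER": re.compile(r"\b(?:sunny|rainy)\b", re.IGNORECASE),
}
TABLE = [
    # (header pattern, cell pattern, target field)
    (re.compile(r"^number$", re.IGNORECASE), re.compile(r"\b\d+\b"), "TABLE_MATCH_NUMBER"),
]


def extract_mixed(rows):
    """Route each column to table or free text patterns, keyed by field name."""
    header, body = rows[0], rows[1:]
    fields = {}
    for column, title in enumerate(header):
        matched = [(cell_re, field) for head_re, cell_re, field in TABLE
                   if head_re.search(title)]
        # Header match: use only the cell patterns for that header.
        # No header match: fall back to the free text patterns.
        patterns = matched or [(regex, field) for field, regex in FREE_TEXT.items()]
        for row in body:
            if column >= len(row):
                continue
            for regex, field in patterns:
                fields.setdefault(field, []).extend(regex.findall(row[column]))
    return fields


rows = [["Number", "Notes"], ["42", "sunny, gate 7"], ["17", "rainy"]]
fields = extract_mixed(rows)
```

In this sketch the 7 in the Notes column is not reported as a number, because that column has no header match and the number pattern is configured only as a table (cell) pattern.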
Run Table Extraction
After you configure table extraction, you can run Eduction as normal, with a CSV or TSV table file as input.
- In the Eduction SDK:
  - C: You provide a table file by using the AddInputText or SetInputStream functions. You can use the EdkGetMatchTablePosition function to retrieve the row and column details of a match. You can use the EdkGetMatchTableNumber function to retrieve the table number of a match (for when the input contains multiple tables).
    For structured XML, call EdkAddTableCell to add table cell data to the session. You can optionally also populate an EdkOffset struct with offset information, and pass in a pointer to this as part of the EdkAddTableCell call. This option allows you to generate matches with offsets that reflect the global position of the cell. By default, the produced matches have offsets relative to the start of the cell.
    When a row is complete, call EdkEndTableRow. For the last row of the table, set the bFinalRow argument to true.
  - Java: You provide the table file by using the addInputText or setInputStream functions. You can use public EDKMatch.TablePosition getTablePosition() to return an object with two public members, row and column. You can use public EDKMatch.getTableNumber() to return the table number (for when the input contains multiple tables).
    For structured XML, call addTableCell to add table cell data to the session. To generate matches with offsets that reflect the global position of the cell, call the version that accepts offsetBytes and offsetCodepoints as arguments. The other version produces matches with offsets that are relative to the start of the cell.
    When a row is complete, call endTableRow. For the last row of the table, set the finalRow argument to true.
  - .NET: You provide the table file by using the AddInputText or SetInputStream functions. You can use the readonly property public IExtractionMatchTablePosition TablePosition to return an object that has the readonly properties Row and Column. You can use the readonly property public IExtractionMatchTablePosition TableNumber to return the table number (for when the input contains multiple tables).
    For structured XML, call AddTableCell to add table cell data to the session. To generate matches with offsets that reflect the global position of the cell, call the version that accepts the TextOffset parameter, which is a simple struct that contains the offsets, in bytes and Unicode characters, of the start of the cell data in the global input stream. The other version produces matches with offsets that are relative to the start of the cell.
    When a row is complete, call EndTableRow. For the last row of the table, set the final_row argument to true.
NOTE: To use Table Extraction with the Eduction SDK, you must create an Eduction engine with a configuration file. See the Standalone API Usage section for your language in API Reference.
- In Eduction Server and the edktool command-line tool, you provide the table file as plain input text. Eduction returns the matches in the response.
- In CFS and NiFi, the ingestion process sends the table file to the Eduction engine. CFS and NiFi add the match details to the output documents.