Eduction

23.2.0

New Features

  • In table mode, Eduction now provides a zero-indexed table number for matches, to avoid ambiguity when extracting entities from an input stream that contains multiple tables.

    In the Eduction SDK, the following methods and attributes are available for obtaining the table number (which is -1 if the match was not sourced from a table):

    • C API: EdkError EdkGetMatchTableNumber(EdkSessionHandle pSession, int * pnTableNumber)

    • .NET API: IExtractionMatchTablePosition.TableNumber

    • Java API: EdkMatch.getTableNumber()

    When Eduction is in table mode, edktool and Eduction Server now output the match table number. Eduction Server also now outputs the row and column details of a match, as edktool already did.

  • You can now configure both table and free text (non-table) entities at the same time. In this mixed mode, Eduction identifies tables and searches them for table entity matches, and it searches any blocks of free text for free text entity matches.

    In addition, when Eduction identifies a table but does not find a header match for a particular column, it searches the rows of that column for free text entity matches instead. In this way, Eduction can still search for entity matches even if it does not match the headers. Similarly, if you configure MaxSearchHeaderRow to search for tables beyond the first line of the input, Eduction can now search the initial rows that do not contain header matches for free text entity matches.

    You can use the new TableEntityFieldN parameter to avoid ambiguity in mixed mode. Use this parameter to configure a field for table entities where you have set EntityFieldN for the free text entities. For example:

    [Eduction]
    ResourceFiles=testfiles/simple_pii.xml
    # Free text entities
    Entity0=simple_pii/name
    EntityField0=FREE_TEXT_MATCH_NAME
    Entity1=simple_pii/weather
    EntityField1=FREE_TEXT_MATCH_WEATHER
    # Table entities
    HeaderEntity0=simple_pii/name_header
    CellEntity0=simple_pii/name
    TableEntityField0=TABLE_MATCH_NAME
    HeaderEntity1=simple_pii/number_header
    CellEntity1=simple_pii/number
    TableEntityField1=TABLE_MATCH_NUMBER
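
    As an illustration only (this is not how Eduction is implemented), the per-column fallback in mixed mode can be modeled in Python. The regular expressions, the search_table helper, the sample table, and the row numbering are hypothetical stand-ins; only the field names come from the example configuration.

```python
import re

# Hypothetical stand-ins for grammar entities. Real Eduction entities are
# compiled grammars, not Python regular expressions.
# field name -> (header pattern, cell pattern)
TABLE_ENTITIES = {
    "TABLE_MATCH_NAME": (re.compile(r"(?i)^name$"),
                         re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")),
    "TABLE_MATCH_NUMBER": (re.compile(r"(?i)^number$"),
                           re.compile(r"\d+")),
}
# field name -> free text pattern
FREE_TEXT_ENTITIES = {
    "FREE_TEXT_MATCH_WEATHER": re.compile(r"(?i)\b(?:sunny|rainy|cloudy)\b"),
}


def search_table(rows, table_number=0):
    """Return (field, text, table, row, column) tuples for one table.

    Assumption: the header is row 0, so data rows are numbered from 1.
    """
    header, body = rows[0], rows[1:]
    matches = []
    for col, title in enumerate(header):
        # Table entities whose header pattern matches this column title.
        cell_patterns = {
            field: cell
            for field, (head, cell) in TABLE_ENTITIES.items()
            if head.search(title.strip())
        }
        # No header match: fall back to the free text entities for this column.
        patterns = cell_patterns or FREE_TEXT_ENTITIES
        for row_index, row in enumerate(body, start=1):
            for field, pattern in patterns.items():
                for found in pattern.finditer(row[col]):
                    matches.append(
                        (field, found.group(), table_number, row_index, col)
                    )
    return matches


table = [
    ["Name", "Number", "Notes"],
    ["Ada Lovelace", "42", "Sunny all day"],
]
for match in search_table(table):
    print(match)
```

    In this sketch the Notes column has no header match, so its cells are searched with the free text entity and the result carries the free text field name rather than a table field name.
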
  • Two new functions, setMatchOffset and setMatchOffsetLength, have been added to the Eduction match component in Lua to allow you to set the offset for a component in your post-processing scripts. The setMatchOffset function sets the offset for the component inside the matched text in bytes. The setMatchOffsetLength function sets the offset for the component inside the matched text in codepoints. Both functions take a single integer argument.
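
    The two units differ whenever the matched text contains multi-byte characters. The following Python sketch is independent of the Eduction Lua API and assumes that the text is UTF-8 encoded (the encoding is an assumption, not something stated here). It computes both values for a component, to show which number corresponds to each function. The component_offsets helper is hypothetical.

```python
def component_offsets(matched_text, component):
    """Return (byte_offset, codepoint_offset) of component in matched_text.

    The byte offset assumes UTF-8 encoding.
    """
    codepoint_offset = matched_text.index(component)
    # Count the bytes used by everything that precedes the component.
    byte_offset = len(matched_text[:codepoint_offset].encode("utf-8"))
    return byte_offset, codepoint_offset


# "\u00e9" is the precomposed character é, which takes two bytes in UTF-8.
byte_offset, codepoint_offset = component_offsets("Jean-L\u00e9on Huens", "Huens")
print(byte_offset, codepoint_offset)  # prints "11 10"
```

    Under these assumptions, 11 is the kind of value you would pass to setMatchOffset (bytes) and 10 the kind of value you would pass to setMatchOffsetLength (codepoints).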

  • You can now configure Eduction to select a higher scoring match over a longer or shorter match (depending on your NonGreedyMatch configuration) when you have set AllowMultipleResults to False or OnePerEntity.

    To use this option, set the new PrioritiseScore configuration parameter to True. The default value is False. When two entities have equal scores, Eduction uses the length as a tie breaker. You can also set this option in the C API by using the EdkSetPrioritiseScore function.
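
    The selection rule can be sketched in Python as follows. This is a simplified model, not Eduction code: the Match class, the select_match function, the entity names, and the scores are hypothetical, and the model assumes that the candidates overlap so that only one can be returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    entity: str
    text: str
    score: float


def select_match(candidates, prioritise_score=False, non_greedy=False):
    """Pick one match from a list of overlapping candidates.

    Length preference: longest by default, shortest when non_greedy is set.
    With prioritise_score, the score is compared first and the length is
    used only to break ties.
    """
    def length_key(match):
        return -len(match.text) if non_greedy else len(match.text)

    if prioritise_score:
        return max(candidates, key=lambda m: (m.score, length_key(m)))
    return max(candidates, key=length_key)


candidates = [
    Match("example/long", "John Smith Ltd", 0.6),
    Match("example/short", "John Smith", 0.9),
]
print(select_match(candidates).text)                         # prints "John Smith Ltd"
print(select_match(candidates, prioritise_score=True).text)  # prints "John Smith"
```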

Enhancements to Eduction Grammars

  • The psi_api_credentials.ecr grammar (in the Eduction standard grammars) has been updated with additional entities for authorization headers and JSON Web Tokens (JWT).

  • The PHI dea.ecr grammar has been updated with new entities for National Drug Codes (NDC) and NDC billing derivatives.

  • The PHI healthplan.ecr grammar has been updated with new entities for National Provider Identifiers (NPI), Medicare Beneficiary Identifiers (MBI), Health Insurance Claim Number (HICN), and Healthcare Common Procedure Codes (HCPCS) level I and II.

  • A new PII grammar, voter_id.ecr, is available, which contains entities for matching voter IDs for the UK, India, and Mexico. This grammar is also available in EJR format, and in a combined grammar, combined_voter_id.ecr.

  • The PII national_id grammar now includes national ID entities for Cambodia (kh), Honduras (hn), Vietnam (vn), and Qatar (qa).

  • The PII names grammar has the following improvements:

    • Handling of known surnames that begin with a prefix (for example, Mc) has been improved.

    • Handling of surname prefixes that have more than one part (for example, van der) has been improved.

    • The ability to match speculative names for various countries has been improved, by expanding the permissible character set for those countries.

    • Stoplist handling has been improved for known first names and surnames for various countries (for example, Snow is acceptable as a surname for certain countries, even though it appears in the stoplist).

    • Matching of multi-character initials (for example "Hans Chr. Schmidt" and "Alekos St. Papadopoulos") has been improved.

    • Matching of hyphenated forenames and surnames where either one of the hyphenated names is known and the other is known or unknown (for example "Jean-Léon Huens" and "Christiane Teschl-Hofmeister") has been improved.

    • Precision has been improved by reducing the score of non-CJKVT name matches that contain both an uppercase component and a title case component (not including values that match as initials, titles, or surname prefixes), for example "AC Milan" or "ABBA Gold".

  • The PII names grammar has been expanded to include Russian (ru) and Ukrainian (ua) names.

  • The PII address grammar has been expanded to include Russian (ru) and Ukrainian (ua) addresses.

  • The PII postcode grammar has been expanded to include Russian (ru) and Ukrainian (ua) post codes.

  • Recall for US addresses (in the PHI and PII address grammars) has been improved, by adding direction and apartment data and by matching buildings as part of addresses, for example 'One Irvington Center' in the following address:

    One Irvington Center
    700 King Farm Boulevard
    Suite 125
    Rockville, MD 20850-5736
    USA
  • The PCI grammars now include combined_name.ecr, combined_name_cjkvt.ecr, and scripts/names_stoplist.lua, so that you can find names from any of the supported countries.

  • In the GOV grammar entity_identifiers.ecr, matching of Legal Entity Identifier (LEI) numbers has been improved. The reserved characters (fifth and sixth) are no longer required to be 00, and there is no longer a penalty for a prefix (first four characters) that is not in the predefined list (any four characters are allowed). With these changes, any 20-character code matches in the nocontext entities, but codes with an incorrect checksum are discarded by post-processing.
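
    These notes do not show how the post-processing step computes the checksum. LEI check digits are defined in ISO 17442 using the ISO 7064 MOD 97-10 scheme, so a check along those lines might look like the following Python sketch. The function names and the sample code body are made up for illustration, and this is not the post-processing script that ships with the grammar.

```python
def _to_digits(code):
    """Map 0-9 to themselves and A-Z to 10-35, as in ISO 7064 MOD 97-10."""
    return "".join(str(int(ch, 36)) for ch in code.upper())


def lei_check_digits(body18):
    """Compute the two check digits for an 18-character LEI body."""
    remainder = int(_to_digits(body18) + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(code):
    """Return True for 20 ASCII alphanumeric characters with a valid checksum."""
    if len(code) != 20 or not code.isascii() or not code.isalnum():
        return False
    if not code[18:].isdigit():
        return False
    # A valid code, read as one large integer, leaves remainder 1 modulo 97.
    return int(_to_digits(code)) % 97 == 1


# Made-up 18-character body. Note that the fifth and sixth characters are
# not checked, in line with the relaxed matching described above.
body = "ABCD00EXAMPLEBODY1"
code = body + lei_check_digits(body)
print(code, is_valid_lei(code))
```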

  • In the GOV grammar us_dod_markings.ecr, the classification_authority_block/downgrade, classification_authority_block/declassify, and classification_authority_block/reason entities have been updated to remove the leading label from the normalized text, so that it is more consistent with other entities. For example, where the text/normalized text was previously Downgrade To: UNCLASSIFIED on 20200319, it is now UNCLASSIFIED on 20200319.

  • The PII and PHI medical_terms.ecr grammars have been updated to improve precision and recall.

Enhancements to Eduction Server

Eduction Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.

Resolved Issues

  • The Malta EHIC format could also match valid codes for other countries, because it is very broad. The scoring of the /context/mt entity has been reduced for matches with a non-country-specific landmark (such as "EHIC").

  • In the PII names CJKVT grammar, Eduction sometimes matched characters after a title as part of the title, resulting in incorrect name matches.

  • When a component name was changed (for example SURNAME changed to FORENAME), Eduction did not respect all stoplist values and exceptions.

  • Entities for Vietnam (vn) were available in the banking.ecr grammar, rather than banking_cjkvt.ecr.

  • In the C SDK, calling the EdkFillMatches or EdkFillMatchesTimed functions could result in a memory leak. These functions were also called indirectly by the deprecated EDKProcessableMatchesCollection class in the Java Eduction SDK.

  • The example EDK C code eduction_from_config.c could leak memory.

  • If a synonym was used in a case-insensitive entity, Eduction could produce an incorrect headword when matching the alternative case.

  • In the PII and PHI grammars, after post-processing, matched text was not returned correctly for names that contained a stoplisted component.

  • In the PII and PHI grammars, stoplisted name components were removed from valid name components when Eduction updated the matched text, where the stoplist component was a substring of the valid component (for example, And Andrew Adams matched as rew Adams).
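
    The effect can be reproduced with a small Python sketch. This only illustrates the class of error. It is not the Eduction post-processing code, and both helper functions are hypothetical.

```python
def remove_by_substring(text, stoplisted):
    """Faulty approach: deletes the stoplisted value wherever it occurs."""
    return " ".join(text.replace(stoplisted, "").split())


def remove_by_component(text, stoplisted):
    """Safer approach: drops only whole components equal to the value."""
    return " ".join(part for part in text.split() if part != stoplisted)


print(remove_by_substring("And Andrew Adams", "And"))  # prints "rew Adams"
print(remove_by_component("And Andrew Adams", "And"))  # prints "Andrew Adams"
```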

  • When the configured ResourceFile path was absolute but also contained relative elements (for example, /my/path/to/../../grammar.xml), then inclusions in that grammar failed because Eduction did not correctly resolve the parent path.
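
    As a rough illustration of the expected behavior, the following Python sketch normalizes the path before taking its parent directory. It assumes that inclusions are resolved relative to the directory of the including grammar. The resolve_include helper and the included file name are hypothetical, and normpath works lexically (it does not follow symbolic links).

```python
import posixpath


def resolve_include(resource_file, included):
    """Resolve an include relative to the normalized parent of resource_file."""
    parent = posixpath.dirname(posixpath.normpath(resource_file))
    return posixpath.normpath(posixpath.join(parent, included))


print(resolve_include("/my/path/to/../../grammar.xml", "common/names.xml"))
# prints "/my/common/names.xml"
```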

  • Eduction could add erroneous extra characters to the output string when it matched a synonym that was longer than the headword.

  • The Eduction SDK C API documentation package was incomplete.

  • When a file contained multiple tables, if a potential header row inside a table delimiter contained a comma, Eduction treated the whole table as a comma-separated values (CSV) table, and could miss matches.

    NOTE: As part of this change, files with multiple tables can now use only TSV tables.

  • During post-processing for names, if a component (such as a forename) was removed because of stoplist rules, Eduction did not adjust the offsetlength correctly. Eduction now gives the correct offsetlength for the remainder of the name.

Notes

  • Deprecated functions in the Eduction C SDK have been moved into edk_deprecated.h. If you still need to use these deprecated functions, you must explicitly include this header.

Supported Platforms

The Eduction Server and Eduction SDK are supported on the following platforms.

Windows (x86-64)

  • Windows Server 2022
  • Windows Server 2019
  • Windows Server 2016
  • Windows Server 2012

Linux (x86-64)

The minimum supported versions of particular distributions are:

  • Red Hat Enterprise Linux (RHEL) 7
  • CentOS 7
  • SuSE Linux Enterprise Server (SLES) 12
  • Ubuntu 14.04
  • Debian 8

The Eduction SDK is also available on the following additional platforms:

Linux ARM64

  • CentOS AArch 64

MacOS x86-64

MacOS M1

Documentation

The following documentation is available for Eduction version 23.2.0.

  • Eduction User and Programming Guide
  • Eduction Server Help