Eduction

 

24.1.1

New Features

There were no new features in this release.

Resolved Issues

  • When adding Eduction session input in chunks by making repeated calls to EdkAddInputText, a match was sometimes skipped if the final input chunk was added before all matches from previously-added data had been consumed.

  • When using EdkSetInputStream, matches close to the end of the input buffer were sometimes missed if the stream length was an exact multiple of 1024 bytes.

  • When streaming input to an Eduction session, and using a pre-filter, under certain circumstances the Unicode character offset (offset length) values for individual matches were set incorrectly.

  • When streaming input to an Eduction session, if a multi-byte Unicode character was split across multiple stream reads, the Unicode character offset (offset length) values for subsequent matches were sometimes incorrect.
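The last two issues above involve character offsets when the input arrives in pieces. The following Python sketch is not Eduction code and does not use the EDK API. It only illustrates why a multi-byte character that straddles two reads needs care: an incremental decoder holds back the incomplete bytes until the next chunk completes the character, so the running character count stays correct. The function name char_offsets_per_chunk is hypothetical.

```python
import codecs

def char_offsets_per_chunk(chunks):
    """Return the cumulative character count after each chunk of UTF-8 bytes.

    An incremental decoder buffers an incomplete multi-byte sequence until
    a later chunk completes it, so the running count stays correct.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    total = 0
    offsets = []
    for chunk in chunks:
        total += len(decoder.decode(chunk))
        offsets.append(total)
    # Flush the decoder; this raises UnicodeDecodeError if the stream
    # ended in the middle of a character.
    total += len(decoder.decode(b"", final=True))
    return offsets, total

data = "naïve café".encode("utf-8")          # 'ï' and 'é' take two bytes each
split_at = data.index("ï".encode("utf-8")) + 1   # cut inside the bytes of 'ï'
chunks = [data[:split_at], data[split_at:]]
offsets, total = char_offsets_per_chunk(chunks)
```

Here the byte length (12) and the character length (10) differ, and the first chunk contributes only two characters because the first byte of 'ï' is held until the second chunk arrives.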

24.1.0

New Features

  • You can now enforce whole word matching for an Eduction dictionary prefilter task by setting the new PrefilterMatchWholeWord parameter to True. By default, Eduction can match dictionary entries as substrings of words.
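The difference between substring and whole word matching can be sketched in Python with a regular expression. This is an illustration only: it does not call Eduction, and the regex \b word boundary is an assumption that might not match how Eduction decides where a word starts and ends. The function name find_entries is hypothetical.

```python
import re

def find_entries(text, entries, whole_word=False):
    """Return (entry, start_offset) pairs for dictionary entries found in text."""
    hits = []
    for entry in entries:
        pattern = re.escape(entry)
        if whole_word:
            # Require a word boundary on both sides of the entry.
            pattern = r"\b" + pattern + r"\b"
        for m in re.finditer(pattern, text):
            hits.append((entry, m.start()))
    return hits

text = "The cart is behind the car."
substring_hits = find_entries(text, ["car"])
whole_word_hits = find_entries(text, ["car"], whole_word=True)
```

With substring matching, the entry "car" is also found inside "cart". With whole word matching, only the standalone word is reported.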

Enhancements to Eduction Grammars

  • The PII telephone grammar now supports Russian (ru) and Ukrainian (ua) telephone numbers.

  • The PII nationality grammar now supports Russian (ru) and Ukrainian (ua) nationalities.

  • The PII driving license grammar now supports Ukrainian (ua) driving licenses.

  • In the names grammars, precision has been improved for Croatian names by preventing "Basic" (unaccented) from matching as a surname (Bašić still matches).

  • In the names grammars, matching of multi-character initials (for example "Hans Chr. Schmidt" and "Alekos St. Papadopoulos") has been further improved to reduce false positives.

  • The PII and PCI name CJKVT grammar now supports foreign names in katakana in the new name/foreign/jp entity. For example, エミー・ネーター (Emmy Noether) and フランクリン・D・ルーズベルト (Franklin D. Roosevelt).

    There are also the following additional new entities for foreign names in katakana: 

    • name/given_name/context/cjkvt/foreign/jp (for example 名称ジャン・バティスト (Name Jean Baptiste))

    • name/given_name/nocontext/cjkvt/foreign/jp (for example アレクサンドラ (Alexandra))

    • name/given_name/nocontext/cjkvt_spaced/foreign/jp (for example ア ン ト ニ オ (Antonio))

    • name/surname/context/cjkvt/foreign/jp (for example 姓: シン (Surname: Singh))

    • name/surname/nocontext/cjkvt/foreign/jp (for example アッ=サカフィー (al-Saqafi))

    • name/surname/nocontext/cjkvt_spaced/foreign/jp (for example ア ッ ツ ォ ー リ (Azzoli))

    • name/title_surname/cjkvt/foreign/jp (for example ドルフマイスター君 (Dorfmeister-kun))

  • A new dictionary prefilter cjkvt_foreign_lastnames_prefilter.dpf has been added. This prefilter attempts to find matches in windows in which there are foreign Katakana surnames.

  • The medical terms prefilter dictionary file has been improved to make it smaller and more accurate.

  • Recall has been improved for the PII and PCI name/jp entity, by adding new native Hiragana and Katakana forename data, as well as Hiragana surname data.

  • Many grammars now have metadata JSON files to help you decide which entities to select when you configure Eduction. For example, you might use the metadata files to create a UI that allows your end users to select entities by the languages they cover or the region they fall into.

    Metadata files are available for all PII, PCI, PHI, and GOV grammars, as well as sentiment analysis grammars and Protected Security Information (PSI) grammars, which are available in the general Eduction grammar set. A metadata_schema.json file is also provided with the JSON schema for these metadata files.

  • Recall has been improved for the PII and PHI banking grammar routing_number nocontext entities by applying a variable penalty that decreases the longer the routing number is.

  • Matching in the PII passport grammars has been improved by tightening the restrictions on the non-numeric components of Australian and Slovenian passport numbers and allowing Slovakian passports with seven numeric digits.

  • Recall has been improved for the PII telephone grammar. Improvements include updated area codes and number lengths for particular countries, as well as greater delimiter variations.

Resolved Issues

  • US healthplan IDs could match input that contained only letters, so the entity sometimes returned 14-letter capitalized words. Post-processing now penalizes these strings to a score of 0.2.

  • The nocontext entities for streetlocation in Japanese and Chinese sometimes returned matches that contained only numbers. Post-processing now penalizes these matches so that they are filtered out with the standard post-processing threshold.

  • An entity match for a pattern that had score=0 sometimes returned matches if the input also matched another pattern in the entity with non-zero score.

  • Compilation of entities that include long lists of headwords became slower in EDK 12.11 and later.

  • On the MacOS_x86-64 platform, creating an Eduction engine in a program thread, from an in-memory configuration buffer, could cause the program to terminate with an error report when the thread exited.

23.4.0

New Features

  • The performance of entity match limit has been improved.

  • Session creation has been improved, so that it takes about the same amount of time to create or reuse a session, improving overall processing time.

  • Compilation times have been improved for grammars that mostly consist of headwords and synonyms.

  • The edktool command-line tool unify action now accepts a -c <compile_config> argument. This option is a path to a JSON file that contains entity definitions. You can combine this option with the -e parameter, but you must not include duplicate entity definitions. The compilation configuration has the following syntax:

    {
      "combined": [
        {"name": "combined-name", "entities": ["source-name1", "source-name2"]},
        ...
      ]
    }
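As a rough sketch of how such a file could be generated, the following Python builds a dictionary with the structure shown above and serializes it as JSON. The helper names are hypothetical, and the duplicate check is only one reading of the "no duplicate entity definitions" rule: it assumes that a combined name must not also appear in the list passed to -e.

```python
import json

def build_compile_config(combined):
    """Build a compile configuration from {combined_name: [source entities]}."""
    return {
        "combined": [
            {"name": name, "entities": list(sources)}
            for name, sources in combined.items()
        ]
    }

def duplicate_names(config, extra_entities=()):
    """Return names defined more than once across the config and an extra list."""
    seen, dupes = set(), set()
    names = [item["name"] for item in config["combined"]] + list(extra_entities)
    for name in names:
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return sorted(dupes)

config = build_compile_config({"combined-name": ["source-name1", "source-name2"]})
text = json.dumps(config, indent=2)   # write this string to the JSON file
```

Calling duplicate_names(config, ["combined-name"]) reports the clash before you run edktool, which is cheaper than finding it at compile time.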

Enhancements to Eduction Grammars

  • The PCI numbers grammar (pci_numbers.ecr) now accepts fullwidth digits, hyphens and spaces when matching numbers. Full width is supported only in the value itself, not in the landmarks.

  • The PII National ID grammar (national_id_cjkvt.ecr) now supports national IDs for Macau.

  • The PII National ID grammar (national_id_cjkvt.ecr) now supports older formats of the Taiwan National ID, which are still valid until 2031. This older format is also available for the TIN grammar. The older format scores slightly lower than the more recent format.

  • The PII National ID grammar (national_id.ecr) now supports national IDs for Greenland, and additional formats for Croatia (Croatia Unique Master Citizen Number) and Russia (SNILS (Individual insurance account number)).

  • The PII National ID grammar (national_id.ecr) now supports the Spanish social security number (SSN/Número de la Seguridad Social).

  • The PII Passport grammar (passport.ecr) now supports passports for the Philippines.

  • The PII Driving License grammar (driving.ecr) now supports driving licenses for India.

  • The address grammars have been improved to allow a pound sign (#) before the house or apartment number.

  • For PII national ID, postprocessing scripts have been added to check the pattern for Honduras (hn), Cambodia (kh), Qatar (qa), and Vietnam (vn).

  • The pci_numbers grammar now includes entities for the following new security numbers:

    • Committee on Uniform Securities Identification Procedures (CUSIP), including entities for internal CUSIP numbers and private placement numbers.

      NOTE: These numbers were previously available in the number_banking_us.ecr standard grammar. They are now available only in the PCI grammar set.

    • Stock Exchange Daily Official List (SEDOL) numbers

    • Wertpapierkennnummer (WKN)

    • Financial Instrument Global Identifier (FIGI) numbers

    • International Securities Identification Number (ISIN)

  • The names stop list has been nearly doubled in size to improve precision.

  • Postprocessing for names now includes weak stoplist exceptions to penalize some uncommon names rather than discarding them. The penalty depends on the number of these components in a given match.

    If a stoplist word is a leading or trailing component in a particular match, Eduction initially discards the stoplist word. However, if discarding the component results in the match being discarded, Eduction penalizes the stop word instead.

    For example, the name Mike Board has a trailing stop word. Dropping the stop word Board would invalidate the name match, so postprocessing instead penalizes the stop word match. In this case, the score is still above the 0.5 threshold, so the match returns.

  • The following entities have been improved to include additional options for delimiters (such as dashes, dots, and spaces) in numbers:

    • pii/id/nocontext/at and pii/id/context/at now support spaces and dashes in appropriate places (for example 1788-011550 and 1788 01 15 50).

    • pii/health/nhs_number/gb now supports variations with dots or without spaces (for example 943.476.5919 and 9434765919).

    • pii/tin/vatin/nocontext/au and pii/tin/vatin/context/au now support dashes between parts of the ID (for example 123-456-782).

  • The Name grammars have been improved to support multiple spaces (or a tab) in place of a single space. For example, the name John  Smith now matches, even with two spaces.

  • The PII TIN grammar (tin.ecr) now includes patterns for Greenland (gl), Malaysia (my), India (in), and Andorra (ad), and the CJKVT TIN grammar (tin_cjkvt.ecr) includes patterns for Singapore (sg) and Thailand (th).

  • The PII national ID grammar national_id_cjkvt.ecr was updated to include a new format for Singapore NRIC that was introduced in 2022 (the first letter can now be an M).

  • The PII banking grammar now supports account numbers, routing numbers, and CLABE numbers for Mexico (mx).

  • The following PII grammars now include additional entities for Singapore: 

    • Dates, with landmarks for dates and months, date of birth, and date of death for Tamil (tam) and Malay (may).

    • Driving licenses, with landmarks in Tamil and Malay.

    • Medical terms in Tamil (tam) and Malay (may).

    • Address and postcode.

    • Names.

  • Eduction grammars that contain mostly synonym entries have been optimized.

  • The PII medical terms grammars have been optimized for faster matching.

  • The PII national ID grammar now includes the Rentenversicherungsnummer (pension number) for Germany.

  • The PII Driving License grammar (driving.ecr) now supports driving licenses for Russia.

  • The PII telephone grammar has new additional components to identify landmarks.

  • The following environment variables have been added to allow you to configure post-processing for the PII telephone grammar:

    • IDOL_PII_TELEPHONE_EXTRA_STOPWORDS_CHECKS. Set this string to TRUE to activate additional stopwords checks in the Lua post-processing of telephone matches. In this case, Eduction rejects matches when the upper-case version of that match contains any of the stopwords in the STOPWORDS table of the postprocessing_telephone Lua script.

      Currently, only the word COUNT is a stopword.

    • IDOL_PII_TELEPHONE_EXTRA_MATCH_QUALITY_CHECKS. Set this string to TRUE to activate additional match quality checks in the Lua post-processing of telephone matches. For contextual matches, Eduction rejects a match when the landmark is considered weak and one other quality check fails.

      Currently the landmark T without a period following it is considered a weak landmark.

      Additional quality check failures include: 

      • a matched telephone number with 5 or fewer digits.

      • two or more types of separators (non-digit characters) present in the matched telephone number.
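The telephone checks described above can be sketched as follows. This Python is an illustration of the rules as worded in these notes, not the postprocessing_telephone Lua script itself. The function names are hypothetical, the "weak landmark plus one other failure" rule is read here as "at least one failure", and the case-sensitive comparison against TRUE is an assumption.

```python
import os
import re

STOPWORDS = ("COUNT",)      # per the notes, currently the only stopword
WEAK_LANDMARKS = ("T",)     # the landmark T without a following period

def enabled(name, environ=os.environ):
    """True if the named environment variable is set to the string TRUE."""
    return environ.get(name, "") == "TRUE"

def has_stopword(match_text):
    """True if the upper-cased match contains any stopword."""
    upper = match_text.upper()
    return any(word in upper for word in STOPWORDS)

def quality_failures(number):
    """Return the described quality checks that a matched number fails."""
    failures = []
    if len(re.sub(r"\D", "", number)) <= 5:
        failures.append("too_few_digits")
    # Count the distinct non-digit characters used in the number.
    if len(set(re.findall(r"\D", number))) >= 2:
        failures.append("mixed_separators")
    return failures

def reject_contextual(landmark, number):
    """Reject when the landmark is weak and at least one quality check fails."""
    return landmark in WEAK_LANDMARKS and bool(quality_failures(number))
```

For example, a number written with both a space and a dash fails the separator check, so it would be rejected after the weak landmark T but kept after a stronger landmark.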

Resolved Issues

  • In table mode, when processing tabular input where columns were separated with tabs or commas, Eduction sometimes encountered a memory error if the input data had a table with a differing number of columns in each row (for example, the header row has fewer columns than a body row). This error did not occur when using KeyView to insert the delimiters.
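If you want to find inputs of the kind that triggered this issue, a small Python check like the following can flag rows whose column count differs from the header row. It is a sketch for inspecting your own data and is unrelated to how Eduction parses tables. The function name ragged_rows is hypothetical.

```python
import csv
import io

def ragged_rows(text, delimiter=","):
    """Return (row_index, column_count) for rows whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    width = len(rows[0])
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != width]

# A tab-separated table in which the first body row has an extra column.
table = "name\tphone\nAlice\t555-0100\textra\nBob\t555-0101\n"
uneven = ragged_rows(table, delimiter="\t")
```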

23.3.0

New Features

  • When you compile and save a grammar file, EDK now removes any data associated with inaccessible entities before it writes the grammar to disk. Previously, data with the type private was included from ECR files that were included by the grammar definition, increasing the file size of the output grammar.

    You can now use edktool to create a reduced version of a grammar file, containing only selected entities. For example:

    edktool compile -l license.dat -i original.ecr -e entity1,entity2,... -o selected.ecr

    Previously, this command produced an ECR of the same size, where the unselected entities were made private. Now, edktool removes the definitions of unreferenced entities, resulting in a reduced-footprint ECR.

  • Improvements have been made to the Eduction Table mixed mode:

    • You can now use the addTableCell and endTableRow API calls in mixed mode in the same way as for normal table mode.

    • Setting the finalRow argument to true for the addTableCell and endTableRow API calls now sets Eduction up to process another table, rather than resetting the session. This change allows you to feed in multiple tables and get the matches afterward. Eduction now also keeps track of the table numbering, where previously the session reset meant that it treated every table as the first table.

  • Speed improvements have been made within Eduction sessions, relating to match identification and Lua post-processing of identified matches.

  • Creating a session from an Eduction engine is now more efficient.

Enhancements to Eduction Grammars

  • In the PHI package, the profession agentboolean IDX has been replaced with the new profession.ecr grammar.

  • Scoring has been improved so that matches of certain patterns of Swedish no-context valid national IDs now have a score above the 0.4 threshold.

  • Contextual entities now allow an arbitrary number of spaces or tabs before a colon that separates the landmark from the nocontext entity. For example, the PII telephone context grammar (for GB) now matches the text Mobile :07928 875 419.

  • The PII medical terms grammar has new entities to match ICD10 medical condition codes and ICD10 procedure codes.

  • The PII medical terms grammar has been updated to add Turkish, Ukrainian (Cyrillic) and Russian (Cyrillic) languages.

  • A new PII internet grammar has been added, with entities to match email addresses for many countries.

  • The driving license grammar has new formats for Australian driving licenses. The grammar still supports older formats, with a 10% matching penalty, so that older formats without context now score at 0.36 (below the typical 0.4 minimum threshold).

  • Post-processing for names has been improved:

    • Eduction now rejects certain matches where a title such as Prince, King, or Queen occurs in a title_surname entity match, to reduce spurious matches of names that were partially stoplisted.

      For example, the name Prince Edward Island (a place in Canada) initially matches as Title+Forename+Surname. Island is removed because it is in the stoplist, and then Edward is promoted to surname, which produced a match of just EDWARD. Eduction now rejects this match.

    • Eduction now rejects matches that have at least one INITIAL component and do not have a multi character SURNAME component. This change prevents acronyms and abbreviations such as A.I matching (where I is a Korean surname).

    • Eduction now removes components that are hyphenated with a stoplisted word. For example, Saint is stoplisted, so Eduction now also removes Dié in Saint-Dié Christian Pierret, resulting in a match for Christian Pierret. This change includes cases where the stoplisted word appears after the hyphen, for example Dié-Saint.

    • Eduction no longer includes ampersands (&) as part of the original matched text. For example, Extraterrestrial Life, John Wiley & Sons previously produced the match John Wiley &. It now matches John Wiley.

    • Eduction now checks stop names again after removing any leading or trailing stopwords. For example, the text "Rio Grande River" no longer matches. "Rio Grande" is a stop name and "River" is in the stoplist, so previously Eduction dropped "River" and returned "Rio Grande" as a match. Now, after dropping "River", Eduction checks the remaining text, "Rio Grande", against the stopnames and drops it, so it no longer returns a match.
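The repeated stop name check in the last item can be sketched as a two-pass filter. The following Python is a simplified illustration, not the names post-processing script. The stoplist and stop name entries are hypothetical stand-ins taken from the example above, and the function names are invented for this sketch.

```python
STOPLIST = {"river", "island"}     # hypothetical stoplist entries
STOP_NAMES = {"rio grande"}        # hypothetical stop names

def strip_stopwords(words):
    """Drop leading and trailing stoplist words."""
    while words and words[0].lower() in STOPLIST:
        words = words[1:]
    while words and words[-1].lower() in STOPLIST:
        words = words[:-1]
    return words

def filter_name(candidate):
    """Return the surviving name, or None if the candidate is rejected."""
    if candidate.lower() in STOP_NAMES:
        return None
    remainder = " ".join(strip_stopwords(candidate.split()))
    # Second check: the trimmed text might itself be a stop name.
    if not remainder or remainder.lower() in STOP_NAMES:
        return None
    return remainder
```

Without the second check, "Rio Grande River" would be trimmed to "Rio Grande" and returned. With it, the trimmed text is rejected.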

  • The National ID grammar for Belgium now accepts a dot before the final two digits, in addition to what was previously accepted. For example, 85.07.30-033.28 now matches.

  • The telephone grammar has been updated with new area codes for Canada (236, 365, 367, 368, 431, 437, 474, 548, 584, 639, 672, 782, 825, 873).

  • In the PII address grammar for Brazil, the neighborhood component has been added to full address matching as an optional component, which can come before the city.

  • Precision has been improved by penalizing stoplist exception components in some cases, depending on the number of such components.

  • The PII banking grammar now supports greater variations of number groupings and delimiters for US bank account numbers.

Resolved Issues

  • The function EdkSessionSetEntityMatchLimit(), in the C API, could incorrectly return an error when the Eduction engine had been configured to match entities by calling EdkLoadResourceFile() and EdkAddTargetEntity().

  • The method session.setEntityMatchLimit(), in the Java API, could incorrectly throw an exception when the Eduction engine had been configured to match entities by calling engine.loadResourceFile() and engine.addTargetEntity().

  • Telephone numbers in the PII and PHI entity packages sometimes failed to match examples that included extra spacing around the area code (for example, "(310)    840-7089"). This format can occur, for example, when the area code and local subscriber number are extracted from distinct fields in a PDF form.

  • Post-processing for PII, PHI, and PCI names entities could fail on certain names listed in reverse format (for example "Smith, John Washington (2023)").

  • An Eduction engine configured with both Entity and HeaderEntity/CellEntity (mixed mode) failed to return results for a structured table API call (for example EdkAddTableCell).

  • The PHI validation scripts Lua table had an incorrect key for DEA entities, which caused a Lua error at runtime.

  • The combined banking grammars were missing entities for non-country-specific landmarks (IBAN and SWIFT).

  • When a file had a byte order mark (BOM) at the start of the file, Eduction did not make any matches at the start of the file (first word). The BOM is now treated as punctuation to allow these matches.

  • When using edktool, if an incorrect path to a license file was provided in the -l flag, edktool would return a misleading Error: License key is not valid for Eduction error. It now returns the same Error: Open file error as for an incorrect path to the -i input.txt or -c config.cfg.

  • In the PII national ID grammars, the Lua checksum function for Saudi Arabian IDs could raise an error unexpectedly if the calculated value to perform check digit validation was not a two-digit number.

  • In mixed table mode, setting the final flag to true on addInputText after passing a whole table in did not always correctly reset the session. In this case, the table number was not reset properly.
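The byte order mark issue listed above is easy to reproduce outside Eduction. The following Python sketch shows that, after plain UTF-8 decoding, the BOM stays attached to the first word, which is why a matcher that expects a word boundary there can miss it. It illustrates the symptom only. Eduction's fix treats the BOM as punctuation, whereas the utf-8-sig codec shown here strips it.

```python
import codecs

raw = codecs.BOM_UTF8 + "John Smith lives here".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as U+FEFF, attached to the first word.
first_plain = raw.decode("utf-8").split()[0]

# The "utf-8-sig" codec removes a leading BOM while decoding.
first_stripped = raw.decode("utf-8-sig").split()[0]
```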

23.2.0

New Features

  • In table mode, Eduction now provides a zero-indexed table number for matches, to avoid ambiguity when extracting entities from an input stream that contains multiple tables.

    In the Eduction SDK, the following methods and attributes are available for obtaining the table number (which is -1 if the match was not sourced from a table):

    • C API: EdkError EdkGetMatchTableNumber(EdkSessionHandle pSession, int * pnTableNumber)

    • .NET API: IExtractionMatchTablePosition.TableNumber

    • Java API: EdkMatch.getTableNumber()

    When Eduction is in table mode, edktool and Eduction Server now output the match table number and Eduction Server now outputs the row and column details of a match, as was already the case for edktool.

  • You can now configure both table and free text (non-table) entities at the same time. In this mixed mode, Eduction identifies tables and searches them for table entity matches, and it searches any blocks of free text for free text entity matches.

    In addition, when Eduction identifies a table but does not find a header match for a particular column, it searches the rows of that column for free text entity matches instead. In this way, Eduction can still search for entity matches even if it does not match the headers. Similarly, if you configure MaxSearchHeaderRow to search for tables beyond the first line of the input, Eduction can now search the initial rows that do not contain header matches for free text entity matches.

    You can use the new TableEntityFieldN parameter to avoid ambiguity in mixed mode. Use this parameter to configure a field for table entities where you have set EntityFieldN for the free text entities. For example:

    [Eduction]
    ResourceFiles=testfiles/simple_pii.xml
    # Free text entities
    Entity0=simple_pii/name
    EntityField0=FREE_TEXT_MATCH_NAME
    Entity1=simple_pii/weather
    EntityField1=FREE_TEXT_MATCH_WEATHER
    # Table entities
    HeaderEntity0=simple_pii/name_header
    CellEntity0=simple_pii/name
    TableEntityField0=TABLE_MATCH_NAME
    HeaderEntity1=simple_pii/number_header
    CellEntity1=simple_pii/number
    TableEntityField1=TABLE_MATCH_NUMBER

  • Two new functions, setMatchOffset and setMatchOffsetLength, have been added to the Eduction match component in Lua to allow you to set the offset for a component in your post-processing scripts. The setMatchOffset function sets the offset for the component inside the matched text in bytes. The setMatchOffsetLength function sets the offset for the component inside the matched text in codepoints. Both functions take a single integer argument.

  • You can now configure Eduction to select a higher scoring match over a longer or shorter match (depending on your NonGreedyMatch configuration) when you have set AllowMultipleResults to False or OnePerEntity.

    To use this option, set the new PrioritiseScore configuration parameter to True. The default value is False. When two entities have equal scores, Eduction uses the length as a tie breaker. You can also set this option in the C API by using the EdkSetPrioritiseScore function.
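The selection rule can be sketched as follows. This Python is a simplified model of the behavior described above, not Eduction's implementation. The Match class and pick function are hypothetical, and the model assumes that all candidates overlap so that only one can be returned.

```python
from dataclasses import dataclass

@dataclass
class Match:
    entity: str
    text: str
    score: float

def pick(matches, prioritise_score=False, non_greedy=False):
    """Pick one match from a set of overlapping candidates."""
    def length_key(m):
        # Non-greedy matching prefers shorter matches; otherwise longer wins.
        return -len(m.text) if non_greedy else len(m.text)
    if prioritise_score:
        # Score decides first; length only breaks ties.
        return max(matches, key=lambda m: (m.score, length_key(m)))
    return max(matches, key=length_key)

candidates = [
    Match("entity_a", "John Smith Ltd", 0.4),
    Match("entity_b", "John Smith", 0.9),
]
by_length = pick(candidates)
by_score = pick(candidates, prioritise_score=True)
```

In this model, by_length is the longer, lower scoring match and by_score is the shorter, higher scoring one.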

Enhancements to Eduction Grammars

  • The psi_api_credentials.ecr grammar (in the Eduction standard grammars) has been updated with additional entities for authorization headers and JSON Web Token (JWT).

  • The PHI dea.ecr grammar has been updated with new entities for National Drug Codes (NDC) and NDC billing derivatives.

  • The PHI healthplan.ecr grammar has been updated with new entities for National Provider Identifiers (NPI), Medicare Beneficiary Identifiers (MBI), Health Insurance Claim Number (HICN), and Healthcare Common Procedure Codes (HCPCS) level I and II.

  • A new PII grammar, voter_id.ecr, is available, which contains entities for matching voter IDs for the UK, India, and Mexico. This grammar is also available in EJR format, and as a combined grammar, combined_voter_id.ecr.

  • The PII national_id grammar now includes national ID entities for Cambodia (kh), Honduras (hn), Vietnam (vn), and Qatar (qa).

  • The PII names grammar has the following improvements:

    • Handling of known surnames that begin with a prefix (for example, Mc) has been improved.

    • Handling of surname prefixes that have more than one part (for example van der) has been improved.

    • The ability to match speculative names for various countries has been improved, by expanding the permissible character set for those countries.

    • Stop list handling has been improved for known first names and surnames for various countries (for example Snow is acceptable as a surname for certain countries, despite it appearing in the stoplist).

    • Matching of multi-character initials (for example "Hans Chr. Schmidt" and "Alekos St. Papadopoulos") has been improved.

    • Matching of hyphenated forenames and surnames where either one of the hyphenated names is known and the other is known or unknown (for example "Jean-Léon Huens" and "Christiane Teschl-Hofmeister") has been improved.

    • Precision has been improved by reducing the score of non-CJKVT name matches that contain uppercase and title case components (not including values that match as initials, titles, or surname prefixes), for example "AC Milan" or "ABBA Gold".

  • The PII names grammar has been expanded to include Russian (ru) and Ukrainian (ua) names.

  • The PII address grammar has been expanded to include Russian (ru) and Ukrainian (ua) addresses.

  • The PII postcode grammar has been expanded to include Russian (ru) and Ukrainian (ua) post codes.

  • Recall for US addresses (in the PHI and PII address grammars) has been improved, by adding direction and apartment data and by matching buildings as part of addresses, for example 'One Irvington Center' in the following address:

    One Irvington Center
    700 King Farm Boulevard
    Suite 125
    Rockville, MD 20850-5736
    USA

  • The PCI grammars now include the combined_name.ecr, combined_name_cjkvt.ecr and scripts/names_stoplist.lua, to allow you to find names from any of the supported countries.

  • In the GOV grammar entity_identifiers.ecr, matching of Legal Entity Identifier (LEI) numbers has been improved. There is no longer a restriction for the reserved numbers (fifth and sixth character) being 00, and there is no longer a penalty for having a prefix (first four characters) that is not in the predefined list (any four characters are allowed). With these changes, any 20 character code matches in the nocontext entities, but those with an incorrect checksum are discarded by postprocessing.

  • In the GOV grammar us_dod_markings.ecr, the classification_authority_block/downgrade, classification_authority_block/declassify, and classification_authority_block/reason entities have been updated so that the normalized text is more consistent with other entities, removing the start. For example, where the text/normalized text was Downgrade To: UNCLASSIFIED on 20200319 it is now UNCLASSIFIED on 20200319.

  • The PII and PHI medical_terms.ecr grammars have been updated to improve precision and recall.
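The LEI change above relies on the checksum to discard invalid 20 character codes. As a sketch of that kind of check, the following Python applies the ISO 7064 MOD 97-10 scheme that the LEI standard (ISO 17442) uses for its two check digits. It is not the Eduction post-processing script. The function names are hypothetical, and the sample body used in the usage lines is made up for illustration.

```python
def _to_digits(code):
    """Map 0-9 to themselves and A-Z to 10-35, then concatenate."""
    return "".join(str(int(ch, 36)) for ch in code.upper())

def lei_check_digits(base18):
    """Compute the two check digits for an 18-character LEI body."""
    remainder = int(_to_digits(base18 + "00")) % 97
    return f"{98 - remainder:02d}"

def lei_is_valid(lei):
    """True if the 20-character code passes the MOD 97-10 check."""
    if len(lei) != 20 or not lei.isascii() or not lei.isalnum():
        return False
    return int(_to_digits(lei)) % 97 == 1

body = "ABCD00EXAMPLE12345"              # hypothetical 18-character body
lei = body + lei_check_digits(body)      # append the computed check digits
tampered = lei[:17] + "6" + lei[18:]     # change one digit of the body
```

Here lei passes the check and tampered fails it, because the scheme detects any single changed digit.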

Enhancements to Eduction Server

Eduction Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.

Resolved Issues

  • The Malta EHIC format could also match valid codes for other countries, because it is very broad. The scoring of the /context/mt entity has been reduced for matches with a non-country-specific landmark (such as "EHIC").

  • In the PII names CJKVT grammar, Eduction sometimes matched characters after a title as part of the title, resulting in incorrect name matches.

  • When a component name was changed (for example SURNAME changed to FORENAME), Eduction did not respect all stoplist values and exceptions.

  • Entities for Vietnam (vn) were available in the banking.ecr grammar, rather than banking_cjkvt.ecr.

  • In the C SDK, calling the EdkFillMatches or EdkFillMatchesTimed functions could result in a memory leak. These functions were also called indirectly by the deprecated EDKProcessableMatchesCollection class in the Java Eduction SDK.

  • The example EDK C code eduction_from_config.c could leak memory.

  • If a synonym was used in a case-insensitive entity, Eduction could produce an incorrect headword when matching the alternative case.

  • In the PII and PHI grammars, after post-processing, matched text was not returned correctly for names that contained a stoplisted component.

  • In the PII and PHI grammars, stoplisted name components were removed from valid name components when Eduction updated the matched text, where the stoplist component was a substring of the valid component (for example And Andrew Adams matched as rew Adams).

  • When the configured ResourceFile path was absolute but also contained relative elements (for example, /my/path/to/../../grammar.xml), then inclusions in that grammar failed because Eduction did not correctly resolve the parent path.

  • Eduction could add erroneous extra characters to the output string when it matched a synonym that was longer than the headword.

  • The Eduction SDK C API documentation package was incomplete.

  • When a file contained multiple tables, if a potential header row inside a table delimiter contained a comma, Eduction treated the whole table as a comma-separated values (CSV) table, and could miss matches.

    NOTE: As part of this change, files with multiple tables can now use only TSV tables.

  • During post-processing for names, if a component (such as a forename) was removed because of stoplist rules, Eduction did not adjust the offsetlength correctly. Eduction now gives the correct offsetlength for the remainder of the name.
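The ResourceFile path issue listed above (an absolute path that also contains .. elements) comes down to normalizing the path before taking its parent directory. The following Python sketch shows that step with the example path from the notes. It is not Eduction code, the function name include_base_dir is hypothetical, and lexical normalization like this does not account for symbolic links.

```python
import posixpath

def include_base_dir(resource_file):
    """Return the directory against which relative inclusions would resolve."""
    return posixpath.dirname(posixpath.normpath(resource_file))

resource = "/my/path/to/../../grammar.xml"
normalized = posixpath.normpath(resource)
base_dir = include_base_dir(resource)
# A grammar that includes "included.xml" would then resolve it here:
included = posixpath.join(base_dir, "included.xml")
```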

Notes

  • Deprecated functions in the Eduction C SDK have been moved into edk_deprecated.h. If you still need to use these deprecated functions you must explicitly include this header.