This section describes the enhancements to the IDOL PII Package in version 12.7.
The IDOL PII Package now includes resources for South Africa. A complete set of entities is available to extract information including addresses and postcodes, dates, driving license numbers, names, nationality, national ID numbers, passport numbers, telephone numbers, and tax identification numbers.
The IDOL PII Package now includes resources for Taiwan. A complete set of entities is available to extract information including addresses and postcodes, dates, driving license numbers, names, nationality, national ID numbers, health numbers, passport numbers, telephone numbers, and tax identification numbers.
A new grammar, device_id, has been added to match various device identifiers. This grammar is available in ECR and EJR formats. For more details, see Eduction Grammar Reference.
The address grammar now returns the PO_BOX component for all countries (previously, only the UK and USA had this component). In addition, the grammar now detects the post office name, returned in the POST_OFFICE component, for the following countries: Canada, Italy, Lithuania, Norway, New Zealand, Portugal, Taiwan, and Japan.
The address grammar has been improved to reduce false positive matches. In particular, in the recommended configuration, the grammar no longer matches an unknown street with an unknown city. Either the street name or the city must belong to one of the known lists.
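The rule described above can be pictured with a short Python sketch. This is not how the address grammar is implemented; the list contents, the KNOWN_STREETS and KNOWN_CITIES names, and the accept_address helper are all hypothetical and serve only to show the "at least one side must be known" condition.

```python
# Illustrative sketch only; the real grammar's lists and matching are not shown here.
# KNOWN_STREETS, KNOWN_CITIES, and accept_address are hypothetical names.
KNOWN_STREETS = {"baker street", "main street"}
KNOWN_CITIES = {"london", "cape town"}


def accept_address(street: str, city: str) -> bool:
    """Accept a candidate only if the street or the city is in a known list."""
    street_known = street.strip().lower() in KNOWN_STREETS
    city_known = city.strip().lower() in KNOWN_CITIES
    # An unknown street paired with an unknown city is rejected.
    return street_known or city_known
```

In this sketch, a known street with an unknown city is still accepted, as is an unknown street with a known city; only the combination of two unknown values is dropped.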
The standard PII grammars now detect additional types of spaces in input text wherever a regular space was previously expected. This change adds detection for U+00A0 (no-break space), U+2007 (figure space), and U+3000 (ideographic space). Where these spaces are detected in input text, they are normalized to regular spaces.
Similarly, the PII grammars now detect and normalize additional apostrophe characters wherever a regular apostrophe was previously expected. This change adds detection for U+2019 (right single quotation mark) and U+FF07 (full-width apostrophe).
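The effect of the space and apostrophe normalization described above can be approximated outside the grammars with a small Python sketch. It is not the Eduction implementation; the normalize_pii_text function and the table names are hypothetical, and only the five code points listed above are handled.

```python
# Illustrative sketch only; this is not the Eduction implementation.
# normalize_pii_text and the table names below are hypothetical.
SPACE_VARIANTS = {
    "\u00A0": " ",  # no-break space
    "\u2007": " ",  # figure space
    "\u3000": " ",  # ideographic space
}
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\uFF07": "'",  # full-width apostrophe
}

# A single translation table mapping each variant to its regular equivalent.
_TRANSLATION = str.maketrans({**SPACE_VARIANTS, **APOSTROPHE_VARIANTS})


def normalize_pii_text(text: str) -> str:
    """Replace the listed space and apostrophe variants with regular ones."""
    return text.translate(_TRANSLATION)
```

For example, a name written with a right single quotation mark and followed by a no-break space would come out with a regular apostrophe and a regular space, which is the form a match would be expected to take after normalization.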
The national ID grammar now matches national IDs for Bahrain, Dominican Republic, Egypt, Indonesia, Mexico, Pakistan, and Russia.
The name_cjkvt grammar now has additional entities to match Latin-only and CJKVT-only versions of full names.
The name
For the TIN and national ID entities, you can now enable ambiguous entity matching by setting ambiguous_tin_id_entities=true in the pii_postprocessing.lua script. This option returns multiple possible country matches. By default, post-processing returns only one country, which is more efficient.
Synonyms have been added to street components of the address entities.