PII Grammar Customization

If the PII grammars miss particular matches in your input, you can customize them. This section describes the supported customizations.

The following grammars support customization:

  • address.ecr

  • address_cjkvt.ecr

  • name.ecr

  • name_cjkvt.ecr

The combined versions of these grammars also support customization. See Combined Grammars.

NOTE: It is technically possible to extend any public entity in a PII grammar, but it can involve a lot of work. To extend an entity that is not in the following list, see Modify Other Grammars and Entities.

For each grammar that supports customization, you can customize the following entities:

  • address

    • pii/address/knowncity_headwords/CC

    • pii/address/knownstreet/CC

  • name

    • pii/name/surname/nocontext/CC

    • pii/name/given_name/nocontext/CC

  • name_cjkvt

    • pii/name/surname/nocontext/latin/CC

    • pii/name/surname/nocontext/cjkvt/CC

    • pii/name/surname/nocontext/cjkvt_spaced/CC

    • pii/name/given_name/nocontext/latin/CC

    • pii/name/given_name/nocontext/cjkvt/CC

    • pii/name/given_name/nocontext/cjkvt_spaced/CC

In this list, CC means country code (for example: gb, us, nz). See Country and Language Support.

You can use customizations to add entries that the existing entities do not match (such as unusual names). You might also use them if your data uses unusual separators or punctuation. The following sections provide examples of these changes.

TIP: When you customize an entity, you can either replace or extend the definition. For PII grammars, OpenText recommends that you only extend the entity definitions.

If you replace an entity, you are likely to miss matches or reduce performance. In addition, existing definitions cover many match cases that you might not consider, so there is a lot of value in using these definitions as a base.

TIP: When you add names to the name list grammars, OpenText recommends that you use the following scores:

5.0 The most common names.
2.05 Less common, but still frequently used names.
1.05 Rare or uncommonly-used names.
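
As a minimal sketch of how these scores might be applied, the following extension appends three surnames to surname/nocontext/gb, one at each recommended score. The names Alphason, Betaford, and Gammasby are hypothetical placeholders, and the file name name_scores.xml is an assumption; substitute names from your own data.

name_scores.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <grammar name="pii/name">

      <!-- Hypothetical placeholder names, one for each recommended score -->
      <entity name="surname/nocontext/gb" extend="append" case="insensitive">
         <!-- Assumed to be among the most common names in your data -->
         <entry headword="Alphason" score="5.0"/>
         <!-- Assumed to be less common, but still frequently used -->
         <entry headword="Betaford" score="2.05"/>
         <!-- Assumed to be rare -->
         <entry headword="Gammasby" score="1.05"/>
      </entity>

   </grammar>
</grammars>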

Example 1: New Street Address

The following grammar definition shows an example of extending address.ecr.

address_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="address.ecr"/>
   <grammar name="pii/address">

      <entity name="suffixes/gb" type="private">
         <entry headword="Cury"/>
         <entry headword="CURY"/>
      </entity>
  
      <entity name="knownstreet/gb" extend="append" type="private">
         <pattern>[A-Z][a-z]+ (?A:suffixes/gb)</pattern>
      </entity>

      <entity name="streetlocation/nocontext/gb" extend="append">
         <pattern score="0.75">(?A=STREET:(?A:knownstreet/gb))</pattern>
      </entity>

   </grammar>
</grammars>

This definition extends the knownstreet/gb and streetlocation/nocontext/gb entities in the PII address grammar:

  • It adds Cury as a known street suffix.
  • It extends the knownstreet/gb entity to accept any two-word street name that ends with the new Cury suffix.
  • It extends the streetlocation/nocontext/gb entity to use the extended knownstreet entity, so that these changes take effect.

The result of these changes is that Petty Cury matches as a street location with a score of 0.75. Previously, it would not have matched at all.

TIP: You do not need to redeclare the full address entity to use the extended knownstreet entity.

For example, with these changes 123 Petty Cury, Cambridge CB4 0WZ now matches pii/address/gb with a score of 1. Previously, this address would have matched, but with a lower score.

Adding known street names or patterns for your country of interest improves the scores for matches that contain these customizations.

Example 2: New Known City

The following grammar definition adds more known cities to address.ecr.

address_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="address.ecr"/>
   <grammar name="pii/address">
      <entity name="knowncity_headwords/gb" extend="append" type="private">
         <entry headword="Chesterton"/>
      </entity>

      <entity name="city/nocontext/gb" extend="append">
         <pattern>(?A=CITY:(?A^knowncity_headwords/gb))</pattern>
      </entity>

   </grammar>
</grammars>

This example definition:

  • adds Chesterton to the knowncity_headwords/gb entity.
  • extends the city/nocontext/gb entity to use the extended knowncity_headwords/gb entity, so that the change takes effect.

The result of these changes is that Chesterton matches as a city with a score of 1. Previously, it would have matched as a speculative city name, with a lower score.

Again, you do not need to change the full address entity to pick up this new declaration. For example, 123 Main Street, Chesterton CB4 0WZ now matches pii/address/gb with a score of 1, which is an improved score. Previously, it would have matched with a lower score, because the city was a speculative match.

TIP: The definition for city/nocontext/gb refers to knowncity_headwords/gb with the dynamic reference syntax, which begins with (?A^ instead of (?A:. OpenText recommends this syntax for performance reasons when you refer to that entity, because the version of this entity for each country often contains several thousand entries.

To make both sets of changes for known streets and cities, merge the declarations in examples 1 and 2 into a single XML file.
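
As a sketch of that merge, the following file places the declarations from Examples 1 and 2 under a single pii/address grammar element. It contains only declarations that the two examples already show; reusing the file name address_extended.xml is an assumption.

address_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="address.ecr"/>
   <grammar name="pii/address">

      <!-- From Example 1: new street suffix and known street pattern -->
      <entity name="suffixes/gb" type="private">
         <entry headword="Cury"/>
         <entry headword="CURY"/>
      </entity>

      <entity name="knownstreet/gb" extend="append" type="private">
         <pattern>[A-Z][a-z]+ (?A:suffixes/gb)</pattern>
      </entity>

      <entity name="streetlocation/nocontext/gb" extend="append">
         <pattern score="0.75">(?A=STREET:(?A:knownstreet/gb))</pattern>
      </entity>

      <!-- From Example 2: new known city -->
      <entity name="knowncity_headwords/gb" extend="append" type="private">
         <entry headword="Chesterton"/>
      </entity>

      <entity name="city/nocontext/gb" extend="append">
         <pattern>(?A=CITY:(?A^knowncity_headwords/gb))</pattern>
      </entity>

   </grammar>
</grammars>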

Example 3: New Name and Custom Separator

Another way to use entity customizations is to declare patterns with custom separators. For example, if your input data contains unusual spacing or characters between entities, you can declare these in your entity extensions.

The following grammar definition extends name.ecr.

name_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <grammar name="pii/name">

      <entity name="given_name/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Fobo" score="2"/>
      </entity>

      <entity name="surname/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Jobo" score="2"/>
      </entity>

      <entity name="gb" extend="append">
         <pattern>(?A=SURNAME:(?A:surname/nocontext/gb))@@(?A=FORENAME:(?A:given_name/nocontext/gb))</pattern>
      </entity>

   </grammar>
</grammars>

This declaration makes two changes:

  • It adds new entries for given_name and surname. This change allows Fobo Jobo to match as a name for the gb entity.

  • It declares a new pattern for the gb entity, to match a name in reverse order, with the elements separated by a custom separator (two @ symbols). This change allows Jobo@@Fobo to match as a name.

TIP: The grammar already handles hyphenated known names. For example, after this definition change, Eduction matches Fobo-Fobo Jobo with a score of 1, with no further changes required. You do not need to add hyphenated entries to the given_name/nocontext or surname/nocontext entities.
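
The same approach might cover other layouts. As a sketch, the following extension adds the same Fobo and Jobo entries, and appends a pattern for the forward order (given name first) with the same @@ separator, so that input such as Fobo@@Jobo could match. The file name name_extended_forward.xml is hypothetical, and this variant is an illustration rather than part of the original example.

name_extended_forward.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="name.ecr"/>
   <grammar name="pii/name">

      <!-- Same new name entries as in name_extended.xml -->
      <entity name="given_name/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Fobo" score="2"/>
      </entity>

      <entity name="surname/nocontext/gb" extend="append" case="insensitive">
         <entry headword="Jobo" score="2"/>
      </entity>

      <!-- Assumed variant: given name, then @@, then surname -->
      <entity name="gb" extend="append">
         <pattern>(?A=FORENAME:(?A:given_name/nocontext/gb))@@(?A=SURNAME:(?A:surname/nocontext/gb))</pattern>
      </entity>

   </grammar>
</grammars>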

Example 4: New Names for CJKVT Grammar

The following example adds new CJKVT and Latin names, and adds a tab as a custom separator.

Example extension XML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars version="4.0">
   <include path="name_cjkvt.ecr"/>
   <grammar name="pii/name">
      <entity name="given_name/nocontext/cjkvt/jp" extend="append">
         <entry headword="亮美" score="1.05"/>
      </entity>

      <entity name="surname/nocontext/cjkvt/jp" extend="append">
         <entry headword="電話" score="1.05"/>
      </entity>

      <entity name="given_name/nocontext/latin/jp" extend="append">
         <entry headword="Fobo" score="2.05"/>
      </entity>

      <entity name="surname/nocontext/latin/jp" extend="append">
         <entry headword="Jobo" score="2.05"/>
      </entity>

      <entity name="jp" extend="append">
         <pattern>(?A=SURNAME:(?A:surname/nocontext/cjkvt/jp))\t(?A=FORENAME:(?A:given_name/nocontext/cjkvt/jp))</pattern>
      </entity>
   </grammar>
</grammars>

This declaration makes two changes: 

  • It extends the lists of known CJKVT and Latin names for Japan, allowing 電話亮美 to match as a CJKVT full name, and Fobo Jobo to match as a Latin name.

  • It adds a new full name format, allowing tab-separated surname+given name to match.
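
The tab-separated pattern in this example refers only to the CJKVT name lists. As a sketch, assuming you also want tab-separated Latin names to match (for example Jobo, then a tab character, then Fobo), a further pattern could refer to the latin entities instead. The file name name_cjkvt_latin_tab.xml is hypothetical, and this variant is an illustration rather than part of the original example.

name_cjkvt_latin_tab.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars version="4.0">
   <include path="name_cjkvt.ecr"/>
   <grammar name="pii/name">

      <!-- Same Latin name entries as in the example above -->
      <entity name="given_name/nocontext/latin/jp" extend="append">
         <entry headword="Fobo" score="2.05"/>
      </entity>

      <entity name="surname/nocontext/latin/jp" extend="append">
         <entry headword="Jobo" score="2.05"/>
      </entity>

      <!-- Assumed variant: Latin surname, then a tab, then Latin given name -->
      <entity name="jp" extend="append">
         <pattern>(?A=SURNAME:(?A:surname/nocontext/latin/jp))\t(?A=FORENAME:(?A:given_name/nocontext/latin/jp))</pattern>
      </entity>

   </grammar>
</grammars>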

Combined Grammars

You can make the same extensions for the combined grammars. The following example updates the combined_address grammar to make the same changes as in Example 1: New Street Address and Example 2: New Known City.

combined_address_extended.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "../published/edk.dtd">
<grammars version="4.0">
   <include path="combined_address.ecr"/>
   <grammar name="pii/address">

      <entity name="suffixes/gb" type="private">
         <entry headword="Cury"/>
         <entry headword="CURY"/>
      </entity>

      <entity name="knownstreet/gb" extend="append" type="private">
         <pattern>[A-Z][a-z]+ (?A:suffixes/gb)</pattern>
      </entity>

      <entity name="streetlocation/nocontext/all" extend="append">
         <pattern score="0.75">(?A=STREET:(?A:knownstreet/gb))</pattern>
      </entity>

      <entity name="knowncity_headwords/gb" extend="append" type="private">
         <entry headword="Chesterton"/>
      </entity>

      <entity name="city/nocontext/all" extend="append">
         <pattern>(?A=CITY:(?A^knowncity_headwords/gb))</pattern>
      </entity>

   </grammar>
</grammars>

NOTE: The public entities use all as the country code, while the private ones continue to use the appropriate country code.

Compile Custom Grammars

As with any Eduction grammar, OpenText recommends that you compile your grammar extensions before using them. You can use the edktool command-line tool to compile the XML file that contains your extension declarations into an ECR file.

For more information about compiling custom grammars, refer to the Eduction User and Programming Guide.

Modify Other Grammars and Entities

It is possible to extend any public entity in a PII grammar. However, you cannot use the various private entities that the public ones use in their definitions.

For entities in the simpler grammars, such as driving or national ID, this might be less of a problem, as long as you know the format for the data portion of the entity. For example, you might want to add new landmarks to these entities.

However, be aware that existing definitions account for factors such as varying spaces, and additional words between the landmark and the data. In this case, you must emulate this behavior in your extensions, which might take a lot of work.

In practice, OpenText recommends that you make a support request to make these changes to the official PII grammars, unless you need to add support in a very short time frame. The existing definitions provide a lot of value because they cover so many match cases, and you might miss these cases when you extend the public entities where these definitions are not available.