Eduction Grammar Tutorial

This topic gives an example of how to create an effective and efficient Eduction grammar.

The intended audience is an Eduction user who wants to detect entities that are not currently supported by the Eduction grammars available from Micro Focus. In this case, you might want to consider writing your own grammar.

TIP: If you need to detect entities that are not supported by the Eduction grammars available from Micro Focus, raise the issue with your support contact. The entity you want to detect might be supported in an upcoming release. Alternatively, Micro Focus might be able to add support in future, if other customers want it too.

Using an official grammar means you do not have to maintain it.

Overview

Entities and Grammars

An entity is a group of patterns or headwords that you want to match in the source text and map to a single concept. Entities are defined in grammars. For example, you might define entities to match credit card numbers, postcodes, or email addresses (these entities are supported by existing Micro Focus Eduction grammars).

Source Files

Eduction source files are XML documents that define grammars and entities. Eduction can either use these source files directly, or you can compile them into ECR files. This tutorial focuses on generating the XML source files.

ECR files

Eduction grammar source files are compiled into ECR files. ECR files are binary files that encode the information in a format that Eduction can load quickly, which improves performance at run time. In general, Micro Focus recommends that you compile your sources into ECR for performance reasons.

Employee Identity Numbers

This tutorial uses a scenario where your business assigns each employee a unique Employee Identity Number (EIN). The company security policy requires you to detect these identity numbers in documents or written into emails. There is no official Micro Focus Eduction grammar for these EINs.

Research

The EINs have the format of one uppercase Latin letter followed by nine digits. The first digit must be 0 or 1, and the last digit is a checksum (see Improve Match Confidence: Checksum Verification).

Specification

The Eduction grammar must:

  • Return matches for the EINs. When a match is returned, we want to be confident that the match is actually an EIN.

  • Return the actual identity number only. However, other text in the vicinity of the number can be used to increase our confidence in the match.

In addition:

  • You might want to write further identity number entities in future, so we define the entity in a general grammar, called identity_number.
  • We call the entity itself employee.
  • The grammar must detect EINs in English text.
  • The detection speed at Eduction run time is the largest performance concern.

Start the Source File

Eduction source files are UTF-8-encoded XML documents, where the top-level element is grammars. Create a file called idnumber_ein.xml and start with the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
</grammars>

The grammars tag has optional attributes of version, case, and debug. The value of the version attribute provides version information for the file. Micro Focus recommends that you use a number that starts with 4.

IMPORTANT: Eduction does not allow a version value that starts with 3. or that is the string VERSION_NOT_SPECIFIED.

Here we have defined version="4.0", using Eduction grammar convention. The default values of case and debug are acceptable for now. For the full grammar reference, refer to the Eduction User and Programming Guide.

The grammars tag can have multiple grammar tags defined. Each grammar tag defines a collection of entities under a grammar name. In this example, we only need a single grammar tag called identity_number.

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <grammar name="identity_number">
   </grammar>
</grammars>

The name attribute is required for the grammar tag. The grammar tag also supports the optional attributes case, extend, and debug, for which the default values are acceptable. For the full grammar reference, refer to the Eduction User and Programming Guide.

Match the Identity Number

Now that we have an empty grammar, we can define and insert the entity (employee) to match the EIN.

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <grammar name="identity_number">
      <entity name="employee" type="public">
         <pattern>[A-Z][01]\d{8}</pattern>
      </entity>
   </grammar>
</grammars>

The required name attribute of the entity is defined as employee, and the optional type attribute is set to public, so that Eduction can return it as a match. The other optional attributes (case, extend, and debug) have acceptable defaults.

You define what to match in child elements of the entity element. You can either use pattern, as in this example, or entry (see Improve Match Confidence: Supporting Landmarks).

The pattern element defines a regular expression that is matched against candidate text to detect the entity.

  • [A-Z] specifies that the first character can be any uppercase letter in the range A-Z.
  • [01] specifies that the next character must be the number 0 or 1.
  • \d{8} specifies that eight digits must follow the second character.

NOTE: Eduction uses a subset of the standard regular expression syntax.

For further information on what is supported in the regular expressions used in pattern elements, refer to the grammar reference section of the Eduction User and Programming Guide.

All of the pattern element attributes have acceptable default values for now. For more information about available attributes, refer to the grammar reference section of the Eduction User and Programming Guide.

We now have a grammar that detects the correct pattern, but it might produce spurious matches. In the Micro Focus IDOL PII grammars, it would be a nocontext match (that is, a match without any context to prove that it represents the concept the entity refers to). These matches are considered poor quality, and they are given low scores (in the default PII Eduction configuration, these matches are not returned).
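To see why the bare pattern is prone to spurious matches, it can help to try an equivalent expression outside Eduction. The following is a minimal sketch that uses Python's re module as a stand-in. It is not Eduction itself, and the sample text and the find_candidates name are made up for illustration.

```python
import re

# Python equivalent of the tutorial pattern. re.ASCII restricts \d to 0-9.
EIN_PATTERN = re.compile(r"[A-Z][01]\d{8}", re.ASCII)

def find_candidates(text):
    """Return every substring that has the shape of an EIN."""
    return EIN_PATTERN.findall(text)

# Hypothetical sample text: one labeled EIN and one unrelated part number.
sample = "Employee ID X123456786 ordered part B012345678 on invoice 5512."
print(find_candidates(sample))
```

Both X123456786 and B012345678 are reported, although only the first is labeled as an employee ID in the text. The pattern alone cannot tell them apart.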

Improve Match Confidence: Supporting Landmarks

We can use text in the immediate vicinity of a match to improve our confidence that the match really is an EIN. In the IDOL PII grammars, the additional text that provides context is called landmark text. You can then define:

  • landmark entities. Entities that match the contextualizing pieces of text.
  • entities with context. Entities that combine the landmark entities with the nocontext entity to provide confident matches.

Let's define our landmarks in a separate grammar. Create a new file, landmarks_idnumbers.xml, with the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <grammar name="landmarks">
      <entity name="number">
         <entry headword="number" />
         <entry headword="#" />
         <entry headword="no." />
         <entry headword="№" />
      </entity>
      <entity name="employee_id">
         <entry headword="Employee ID" />
         <entry headword="EI" />
         <entry headword="EID" />
      </entity>
      <entity name="employee_id_number">
         <pattern>(?A:employee_id)( ?(?A:number))?</pattern>
         <entry headword="EIN" />
      </entity>
   </grammar>
</grammars>

In line with our Specification, landmarks have been defined in English. The constant strings have been defined in the headword attributes of entry elements in the number and employee_id entities.

Next, these are assembled into a single pattern in the employee_id_number entity with optional portions, allowing flexibility in what the landmark can match.

For example, the pattern would match Employee ID number, EI №, or just EIN, according to the entry headword in the employee_id_number entity.
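One way to check what the landmark entity accepts is to enumerate the combinations by hand. The sketch below does this in Python, outside Eduction, by expanding the employee_id and number headwords through the optional portions of the pattern. It assumes exact, case-sensitive headwords, which might differ from the behavior you get from the case attribute, and the landmark_strings name is hypothetical.

```python
from itertools import product

# Headwords copied from the number and employee_id entities.
NUMBER = ["number", "#", "no.", "№"]
EMPLOYEE_ID = ["Employee ID", "EI", "EID"]

def landmark_strings():
    """List every string accepted by the employee_id_number entity."""
    strings = set()
    for emp in EMPLOYEE_ID:
        strings.add(emp)                      # number part omitted
        for space, num in product(["", " "], NUMBER):
            strings.add(emp + space + num)    # optional space, then number
    strings.add("EIN")                        # the separate entry headword
    return sorted(strings)

for landmark in landmark_strings():
    print(landmark)
```

Under these assumptions the entity accepts 28 distinct strings: 27 from the pattern, plus the EIN headword.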

The previously defined entities are used in the employee_id_number pattern. The (?A:entity_name) operator copies the entity entity_name into that position in the pattern. This operation is the quickest way to use a previously defined entity at Eduction run time.

The drawback is increased compilation time, compiled file size and memory usage. In general, Micro Focus recommends the copy operator unless the copied entity is especially large. This recommendation is in line with our run time performance consideration.

TIP: You could use the (?A^entity_name) operator instead. This operator references the entity, rather than copying it. This option has better compilation time, file size and memory usage, at the expense of slower run time performance.

Generally, Micro Focus recommends that you use the copy operator unless the entity you want to use consists of several hundred or more headwords. In such cases, use the reference operator instead.

If you are not sure, try both forms to check performance for your application.

In general, you should aim to define small private entities for parts of things that you want to match, and then define public entities that use those private entities to define the full match. This method makes it easy to add or modify entities in future.

All the entities defined in this file have the default type of private, because the grammar does not need to return matches for the landmarks themselves.

Include Other Grammars

When you develop different grammar sources to detect similar things (for example, if you add an entity for Product Identity Numbers, which are similar to Employee Identity Numbers), you might find that the same entities appear repeatedly. You can redefine these entities separately in each grammar source, but this approach makes the grammar more difficult to maintain. If you need to change the definition, you must change it wherever it occurs to be consistent and correct.

A better approach is to define the entity in a single grammar, which you can include in other grammars when necessary. In this case, the definition is in one place that you can update easily.

For our employee identity number grammars, you can include the landmark entities from landmarks_idnumbers.xml in the main grammar source:

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <include path="landmarks_idnumbers.ecr"/>
   <grammar name="identity_number">
      <entity name="employee" type="public">
         <pattern>[A-Z][01]\d{8}</pattern>
      </entity>
   </grammar>
</grammars>

To include a grammar, you use the include element, where the required path attribute is a path to either a compiled ECR grammar or its XML source. You can set the type attribute to private to ensure that you do not unexpectedly expose any public entities from the included file, although in this case, all the included entities are private anyway.

In general, you should include grammars in ECR form rather than XML form. If you include an XML grammar, Eduction must compile that file separately before it can use it. This file might have inclusions of its own, resulting in potentially lengthy compilation times. When the file is in ECR form, Eduction only has to load it in order to use its definitions, which is much more efficient.

TIP: When you include ECR files, you should ensure that your included ECR files are up to date. See Grammar Dependency.

Now we can use these included entities to improve the Employee Identity Number entity:

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <include path="landmarks_idnumbers.ecr"/>
   <grammar name="identity_number">
      <entity name="employee_nocontext">
         <pattern>[A-Z][01]\d{8}</pattern>
      </entity>
      <entity name="employee_context">
         <pattern>(?A!(?A:landmarks/employee_id_number)((: ?)| ))(?A:employee_nocontext)</pattern>
      </entity>
      <entity name="employee" type="public">
         <pattern>(?A:employee_context)</pattern>
         <pattern score="0.4">(?A:employee_nocontext)</pattern>
      </entity>
   </grammar>
</grammars>

In this version:

  • The original public employee entity was renamed to employee_nocontext, and made private.

  • A new private employee_context entity was added.

  • The public employee entity matches either the employee_context entity, with the default score of 1, or the employee_nocontext entity at a lower score of 0.4.

  • The employee_context entity pattern uses the no output operator (?A!), which specifies that you want to match the expression, but not return it as part of the match. This operator is needed to satisfy the Specification point that the entity must return only the identity number.

    You declare the no output operator as (?A!expression), where expression can be any valid Eduction expression. This expression is not returned as part of the match.

  • The landmark must be followed either by a colon and optional space, or just a space.
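The combined behavior of these entities can be approximated outside Eduction. The following Python sketch is an emulation only: a non-capturing group stands in for the no output operator, a capturing group stands in for the returned EIN, and the way overlapping context and nocontext matches are resolved is an assumption rather than documented Eduction behavior. The find_eins name is hypothetical.

```python
import re

# Landmark alternatives, mirroring landmarks_idnumbers.xml (case-sensitive here).
LANDMARK = r"(?:(?:Employee ID|EID|EI)(?: ?(?:number|no\.|#|№))?|EIN)"
EIN = r"[A-Z][01]\d{8}"

# The landmark and separator are matched but not captured, so only the
# EIN is returned. This plays the role of (?A!...) in the grammar.
CONTEXT = re.compile(rf"{LANDMARK}(?:: ?| )({EIN})")
NOCONTEXT = re.compile(EIN)

def find_eins(text):
    """Return (ein, score) pairs: 1.0 with a landmark, 0.4 without."""
    results = []
    covered = set()
    for m in CONTEXT.finditer(text):
        results.append((m.group(1), 1.0))
        covered.add(m.start(1))
    for m in NOCONTEXT.finditer(text):
        if m.start() not in covered:   # assumption: context match wins
            results.append((m.group(0), 0.4))
    return results

print(find_eins("Employee ID number: X123456786 and also B012345678"))
```

The first number is reported with a score of 1.0 because a landmark precedes it, and the second with a score of 0.4 because no landmark is present. In both cases only the identity number itself is returned.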

Grammar Dependency

When you use a build automation tool (for example, make) to compile a grammar that includes other grammars, you can define the included grammars as dependencies to ensure that the compiled grammars are always up to date with respect to their includes. Micro Focus recommends that you use such a tool if you do not already use one.
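The core of what such a tool does is a timestamp comparison between a compiled file and everything it depends on. The sketch below shows that check in Python. It only decides whether a rebuild is needed and does not invoke any compiler; the is_stale name and the example file names in the note that follows are assumptions for illustration.

```python
import os

def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # A source modified after the target was built means a rebuild is due.
    return any(os.path.getmtime(src) > target_time for src in sources)
```

For example, is_stale("idnumber_ein.ecr", ["idnumber_ein.xml", "landmarks_idnumbers.ecr"]) would report whether the compiled main grammar needs to be rebuilt after a change to its source or to the included landmarks grammar.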

Scores

You can assign scores to patterns and entries by using the score attribute. The default score is 1. The score for a match is the product of all the scores given to the patterns and entries that matched.

The scores that you give to a match are arbitrary, and have meaning only in the context of the MinScore and PostprocessThreshold configuration parameters. In general, it is best to be consistent with assigning scores to patterns and entries in your grammars.

In the IDOL PII grammars, the convention is to assign good matches a maximum score of 1, and to discard matches that have a score of less than 0.5.
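The arithmetic behind this is small enough to sketch. The Python below multiplies the scores of the patterns and entries that took part in a match and applies a minimum score of 0.5, following the PII convention described above. The function names are hypothetical, and the threshold stands in for whatever you configure in Eduction.

```python
import math

def match_score(scores):
    """Multiply the scores of every pattern and entry that matched."""
    return math.prod(scores)

def is_kept(scores, min_score=0.5):
    """Apply a minimum score; matches below it are discarded."""
    return match_score(scores) >= min_score

print(is_kept([1.0]))        # match with default scores only
print(is_kept([0.4]))        # match through a pattern scored 0.4
print(match_score([0.4, 0.9]))
```

A match that uses only default scores is kept, while a match through a pattern with a score of 0.4 falls below the threshold. Combining a 0.4 pattern with a 0.9 entry lowers the score further, to about 0.36.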

Which Entities Should Be Public?

The entity elements support the type attribute, which can take the values of public and private, with the default value private.

Eduction returns only public entities as matches. You can use private entities in the definitions of other entities. Therefore, mark only entities that you want to return as matches as public.

TIP: If an entity represents the entirety of something that you want to match, make it public. Otherwise, make it private.

Improve Match Confidence: Checksum Verification

The final digit of an EIN is a checksum, which for this tutorial uses the following algorithm:

  1. Convert the letter to a number (A=1, B=2, and so on).
  2. Multiply each digit by its corresponding weight, which is given in the worked example below.
  3. Sum the multiplied numbers. For the checksum to be valid, this sum must be divisible by 10 with no remainder.

You can implement this algorithm as a Lua script to run against matches as an Eduction postprocessing task.

We can be much more confident that matches with a valid checksum are truly EINs, so we can boost their scores, or discard matches with invalid checksums.

Worked Example

                 Letter  Digit 1  Digit 2  Digit 3  Digit 4  Digit 5  Digit 6  Digit 7  Digit 8  Checksum
Example Number   X (24)  1        2        3        4        5        6        7        8        6
Weights          1       2        3        4        5        6        7        8        9        1
Multiplied       24      2        6        12       20       30       42       56       72       6

  • Sum: 24 + 2 + 6 + 12 + 20 + 30 + 42 + 56 + 72 + 6 = 270
  • Divide by 10: 270 / 10 = 27 remainder 0

Therefore X123456786 is a valid EIN.
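The same arithmetic can be checked with a few lines of code, which is also a convenient way to generate valid test numbers. The following Python sketch reproduces the worked example and derives the checksum digit for a given letter and eight digits. It assumes well-formed input (one uppercase letter followed by digits) and does no format checking; the function names are hypothetical.

```python
# Weights for the letter, the eight digits, and the checksum digit.
WEIGHTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]

def weighted_sum(ein):
    """Sum of each value multiplied by its weight (A=1, B=2, and so on)."""
    values = [ord(ein[0]) - ord("A") + 1] + [int(c) for c in ein[1:]]
    return sum(w * v for w, v in zip(WEIGHTS, values))

def is_valid(ein):
    """A number is valid when the weighted sum is divisible by 10."""
    return weighted_sum(ein) % 10 == 0

def check_digit(prefix):
    """Compute the checksum digit for a letter plus eight digits."""
    partial = weighted_sum(prefix + "0")
    return (10 - partial % 10) % 10

print(weighted_sum("X123456786"), is_valid("X123456786"))
print(check_digit("X12345678"))
```

For X123456786 the weighted sum is 270, so the number is valid, and check_digit("X12345678") returns 6, which agrees with the table.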

Lua Script Overview

Lua scripts must define a processmatch function that takes a single argument of type edkMatch and returns a Boolean value. If the match is valid, the function must return true, and otherwise it should return false.

You can access and adjust the edkMatch object through its methods. For more information, refer to the Eduction Lua Methods Reference in the Eduction User and Programming Guide.

You can use Lua scripts to process matches in any Eduction process. You can configure the scripts in the configuration file that the Eduction process uses. If you use the Eduction C or Java SDK, you can also call functions to set up usage of Lua scripts. For more information, refer to the Eduction User and Programming Guide, and the API references in the Eduction SDK installation.

Example Script

-- transforms the first letter to its corresponding number
-- (A = 1, B = 2, etc)
local function letter_to_number (letter)
    return string.byte(letter) - string.byte("A") + 1
end

local PATTERN = "([A-Z])(%d%d%d%d%d%d%d%d)(%d)"
local DIGITS_WEIGHTS = { 2, 3, 4, 5, 6, 7, 8, 9 }
-- Returns true if checksum is valid
local function validate_checksum (number)
    -- Get the different parts of the ID number
    local first, last, letter, serial, checksum = string.find(number, PATTERN)
    
    if not first or first ~= 1 or last ~= #number then
        -- Number did not match the expected format
        return false
    end

    -- Store the sum that will be used to validate the checksum,
    -- and initialise its value to that defined by the first letter.
    local sum = letter_to_number(letter)

    -- Go through each digit in the serial, multiply it by its
    -- corresponding weight, and add it to the sum
    local digit = 1
    for d in string.gmatch(serial, "%d") do
        local n = tonumber(d)
        sum = sum + (DIGITS_WEIGHTS[digit] * n)
        digit = digit + 1
    end

    -- Add the checksum
    sum = sum + tonumber(checksum)

    -- Valid numbers have sums that divide wholly by 10
    return (sum % 10) == 0
end

-- Simple tests of validation function
assert(validate_checksum("X123456786"))
assert(not validate_checksum("X123456789"))

function processmatch (edkmatch)
    -- Get the number to check
    local number = edkmatch:getOutputText()
    if validate_checksum(number) then
        -- Boost the score
        local score = edkmatch:getScore()
        edkmatch:setScore(score * 1.1)
        return true
    end
    -- Throw away invalid matches
    return false
end

processmatch Function

This function gets the number (the matched text) from edkmatch and calls the validation function. If the number is valid, the match score is given a 10% boost and the function returns true. Otherwise, the function returns false, which means Eduction discards the match.

validate_checksum Function

This function implements the checksum algorithm, and returns true if the checksum is valid and false otherwise. This function:

  1. Captures the various portions of the number by using a pattern, which also checks the format of the supplied string.
  2. Transforms the letter to a number.
  3. Multiplies each digit of the serial number by its weight, and adds it to the sum.
  4. Adds the checksum to the sum and checks the validation condition.

Lua Script Tips

  • Ensure that your script can accept matches with any value returned by edkmatch:getOutputText. That is, do not assume that Eduction will provide only matches that fit the expected pattern to the script. This makes the script much more robust and reusable.
  • If you include Lua modules in your script, make them local includes. This approach avoids function name clashes. For example, if you want to reuse a processmatch function to provide additional validation for a match, using a local include means that you can clearly define which function you are calling.
  • The Micro Focus Eduction packages include Lua implementations of some algorithms. For example, the PII package includes the Luhn algorithm. You might be able to use these instead of your own script implementations, or include them in your scripts.
  • Consider using pcall to call functions that can fail reasonably often. For example, if you need to send data to a server to validate a match, use pcall to call the function that sends the data. This approach simplifies error handling and avoids unexpected failures. For more information about pcall, refer to the Lua error handling documentation: https://www.lua.org/pil/8.4.html.

Conclusion

We now have a working system for detecting Employee Identity Numbers with high confidence. You have seen:

  • how to define a basic Eduction grammar source.
  • how to refer to other entities when you define an entity.
  • how to split up grammar sources and include other grammars.
  • how to assign scores to patterns and entries.
  • when to make entities public or private.
  • how to increase match confidence by using the surrounding context of the document.
  • how to increase match confidence further by implementing a checksum verification Lua script.

There are still further improvements that you can make. For example, you might define further landmark strings to detect these numbers in different contexts. For now, this tutorial describes a setup for this use case that is functionally similar to the PII Eduction setup.

Advanced Functionality

This tutorial has covered basic grammar functionality, allowing you to write grammars for simple matching cases. However, you can also use more advanced functionality to cover more cases, or increase the value of identified matches.

Pre-Filtering Tasks

For some types of entities, you can improve performance by adding a pre-filtering task. Pre-filtering sets up a quick initial matching step (a regex) that finds sections of text that contain likely matches. This process reduces the total amount of text that Eduction must analyze.

Pre-filtering can improve performance for some entities when there is an appropriately broad way to find a potential match without either matching too much of the input text, or eliminating valid matches.

For the employee ID example, you might add a pre-filtering task to find any numeric values in your text. Eduction then creates a window around the initial numeric matches and checks them against the full entity. This method might improve performance if the employee IDs occur infrequently in the documents that you want to analyze.

For example:

[Eduction]
PrefilterTask0=IDPrefilter

[IDPrefilter]
Regex=\d{9}
WindowCharsBeforeMatch=100
WindowCharsAfterMatch=100
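To get a feel for how much text such a pre-filter leaves for the full grammar, you can approximate the windowing step outside Eduction. The Python sketch below finds each run of nine digits, expands it by a fixed number of characters on each side, and merges windows that overlap. The merging rule and the prefilter_windows name are assumptions for illustration, not a description of how Eduction implements pre-filtering.

```python
import re

def prefilter_windows(text, regex=r"\d{9}", before=100, after=100):
    """Return merged (start, end) character windows around regex matches."""
    windows = []
    for m in re.finditer(regex, text):
        start = max(0, m.start() - before)
        end = min(len(text), m.end() + after)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window, so extend it instead.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows

# Hypothetical document: one EIN surrounded by unrelated text.
text = "a" * 500 + "X123456786" + "b" * 500
print(prefilter_windows(text))
```

In this sample only the window from offset 401 to 610 would be passed on for full matching, rather than all 1010 characters. The leading letter of the EIN falls inside the window because of the characters kept before the numeric match.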

NOTE: The pre-filter method is less useful for entities that match a list of possible words, such as names, where there is no simple regular expression that matches all your possible entities.

Components

It is possible to tag parts of a match with an arbitrary identifier. For example, if you wrote a grammar to match personal names, you could tag the first name, middle name and surname. In Eduction, these tags are known as components.

To tag part of a match, use the component operator: (?A=component_name:expression). The component_name is an arbitrary name for the component. By convention, component names are capitalized. The expression is any valid Eduction expression.

If you want to tag the individual parts of the EIN match, you could use something like the following example:

<entity name="employee_context_tagged">
   <pattern>(?A=LANDMARK:(?A:landmarks/employee_id_number))((: ?)| )(?A=EIN:(?A:employee_nocontext))</pattern>
</entity>

In addition to returning the matched text, Eduction also returns two components:

  • LANDMARK, which contains the text for the landmark only.
  • EIN, which contains the text for the EIN only.

Each component also has its own offset information, allowing you to see where it occurred.
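Components behave much like named groups in a conventional regular expression. The following Python sketch uses named groups as an analogy to show the kind of information that comes back: the text of each tagged part and its offsets. It is an illustration outside Eduction, it matches headwords case-sensitively, and the components function name is hypothetical.

```python
import re

# Named groups play the role of the LANDMARK and EIN components.
TAGGED = re.compile(
    r"(?P<LANDMARK>(?:Employee ID|EID|EI)(?: ?(?:number|no\.|#|№))?|EIN)"
    r"(?:: ?| )"
    r"(?P<EIN>[A-Z][01]\d{8})"
)

def components(text):
    """Return {name: (text, (start, end))} for the first match, or None."""
    m = TAGGED.search(text)
    if m is None:
        return None
    return {name: (m.group(name), m.span(name)) for name in ("LANDMARK", "EIN")}

print(components("EIN: X123456786"))
```

For the input EIN: X123456786, the LANDMARK part is EIN at offsets 0 to 3 and the EIN part is X123456786 at offsets 5 to 15.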

Components allow end users to make more sense of what has matched. They are particularly beneficial for more complex matches, such as addresses. For example, being able to easily pick out the city or town from an address might be useful for census takers or statisticians.

User-Defined Extensions

If an Eduction grammar doesn't quite match what you want it to match and you don't have access to the source, you can extend it by using the user-defined extension mechanism. This option is useful if, for example, you want to make minor modifications to an existing Micro Focus grammar for your use case.

This mechanism allows you to augment or replace the definition for an entity. You need to know the name of the entity you are modifying. It is possible to modify private entities, but this might be hard if you don't have the source file to hand.

Below is an example of modifying entities from the Micro Focus US name grammar:

<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <include path="grammars/person_name_engus.ecr" type="private"/>

   <grammar name="person">
   
      <entity name="lastname/engus" type="private" extend="append">
         <entry score="0.9" headword="Ananthakrishnan" />
      </entity>

      <entity name="firstname/engus" type="private" extend="replace">
         <pattern>(?A:malefirstname/engus)|(?A:femalefirstname/engus)</pattern>
      </entity>

   </grammar>
</grammars>

This grammar includes the person_name_engus grammar and modifies two of its entities. The extend attribute declares what kind of modification to apply to the entity in question: append adds new definitions to that entity, while replace replaces the old definition of the entity by the one we define in this file. You can use this attribute for both public and private entities.

In this example, we add a new surname to the lastname entity. We also overwrite the definition of the firstname entity to include female names as well as male ones.

We would compile this file and use it wherever we would use the person_name_engus grammar. The new grammar mimics the behavior of the original grammar and also reflects the modifications we have made.

TIP: If you find deficiencies in Micro Focus standard grammars, report these to your support contact. Micro Focus can address these deficiencies, so that all customers can benefit from the fixes, and you do not need to maintain extensions.

However, if you need to make quick changes to support your use case, knowing about this mechanism is helpful.