IDOL Eduction Grammars

The following section describes the Eduction grammars available in the IDOL PII Package.

You can use these grammars with IDOL Eduction through Eduction Server, the edktool command-line utility, or the Eduction SDK. For more information, refer to the IDOL Eduction User Guide and the Eduction SDK Programming Guide.

IMPORTANT: The format of the EJR grammars and DPF pre-filter files in the IDOL PII Package changed in version 12.9.0. To use the files from the 12.13 package, you must use Eduction tools at version 12.9 or later. Files from older versions of the package work only with Eduction tools at version 12.8 or earlier.

IMPORTANT: To use the Eduction grammars in the IDOL PII Package, you must have a license that enables them. To obtain a license, contact Micro Focus Support.

The IDOL PII Package includes a default configuration file that contains the basic settings required to use the PII grammars.

NOTE: If you create your own configuration file, you must carry over some of the settings from the default configuration file, such as the post-processing task and Eduction components (see Configure Post Processing).

Configure Post Processing

When you use the IDOL PII Package Eduction grammars, you must configure a Lua post-processing task to run the script pii_postprocessing.lua. This script performs post-processing steps that improve results for various entities, including stop list filtering, entity name mapping for combined grammars (see Combined Entities), ambiguous landmark detection (see Ambiguous Entities), and checksum validation (see Validated ID Numbers).

IMPORTANT: If you do not run this script, you might encounter unexpected behavior.

The default configuration file provided in the IDOL PII Package includes a suitable post-processing task. If you use a different configuration, you must add the post-processing task to your Eduction configuration. For example:

[Eduction]
PostProcessingTask0=MyPostProcessingSection

[MyPostProcessingSection]
Type=Lua
Script=scripts/pii_postprocessing.lua
Entities=pii/*,gdpr/*

IMPORTANT: The post-processing script requires Eduction components (see Components). The default PII configuration file enables components. If you use a custom configuration file, you must set the EnableComponents parameter to True so that Eduction returns components.
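The two requirements above (a Lua task that runs pii_postprocessing.lua, and components enabled) can be checked mechanically before you deploy a custom configuration. The following Python sketch is an illustration only and is not part of the IDOL PII Package. The function name check_pii_config is hypothetical, and the sketch assumes that EnableComponents is set in the [Eduction] section and that the configuration parses as a standard INI file.

```python
import configparser


def check_pii_config(text):
    """Return a list of problems found in an Eduction configuration string."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    if not cfg.has_section("Eduction"):
        return ["missing [Eduction] section"]
    problems = []
    eduction = cfg["Eduction"]
    # Assumption: EnableComponents lives in the [Eduction] section.
    if not eduction.getboolean("EnableComponents", fallback=False):
        problems.append("EnableComponents is not set to True")
    # configparser lowercases option names, so compare in lowercase.
    tasks = [v for k, v in eduction.items() if k.startswith("postprocessingtask")]
    runs_script = any(
        cfg.has_section(task)
        and cfg[task].get("Script", "").endswith("pii_postprocessing.lua")
        for task in tasks
    )
    if not runs_script:
        problems.append("no post-processing task runs pii_postprocessing.lua")
    return problems


example = """
[Eduction]
PostProcessingTask0=MyPostProcessingSection

[MyPostProcessingSection]
Type=Lua
Script=scripts/pii_postprocessing.lua
Entities=pii/*,gdpr/*
"""
print(check_pii_config(example))
```

Run against the example configuration shown earlier, the check reports only the missing EnableComponents setting, because that example does not set it.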

For more information about configuring post-processing tasks, refer to the Eduction User and Programming Guide.

Configure Pre-Filtering

Pre-filtering allows the IDOL PII Package to run a quick initial check to find potential matches in your input text. It then selects match windows around these potential matches, which reduces the amount of text that it must match against your grammars. This process can improve performance in some cases.

Micro Focus recommends that you use the following pre-filtering configuration with the address.ecr and combined_address.ecr grammars:

[Eduction]
PrefilterTask0=AddressPrefilter

[AddressPrefilter]
Regex=\d{1,7}
WindowCharsBeforeMatch=100
WindowCharsAfterMatch=100
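To illustrate the idea behind these settings, the following Python sketch finds every match of the pre-filter regular expression and builds a window of 100 characters on each side. It is a conceptual model only, not how Eduction implements pre-filtering. The function name prefilter_windows is hypothetical, and merging overlapping windows is an assumption made for the illustration.

```python
import re


def prefilter_windows(text, pattern=r"\d{1,7}", before=100, after=100):
    """Return merged (start, end) character windows around each regex match."""
    windows = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - before)
        end = min(len(text), match.end() + after)
        if windows and start <= windows[-1][1]:
            # Overlaps or touches the previous window, so extend that window.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


# Made-up input: one house number surrounded by text with no digits.
text = "x" * 500 + "12 Main Street" + "y" * 500
print(prefilter_windows(text))  # [(400, 602)]
```

In this example only 202 of the 1014 input characters fall inside a window, so the grammar would need to examine a fraction of the text. When digits occur throughout the input, the windows merge to cover most of the text and the pre-filter saves little.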

NOTE: A pre-filter task runs for all configured entities, so you must use it only in configurations that contain the appropriate entities, to ensure that it does not affect the results for other entities.

The IDOL PII Package also includes sample pre-filter configuration files for the name, address, and medical grammars, including dictionary pre-filter files where they are required by the sample configuration.

IMPORTANT: To use the DPF files from the 12.13 package, you must use Eduction tools with a version of 12.9 or later.

NOTE: The provided medical grammar pre-filter files can improve match performance when there is a low density of matches. However, they can reduce performance when there is a high density of matches.

For more information about pre-filtering, refer to the Eduction User and Programming Guide.

Entity Context

Some entities are available in two versions, with and without context. The context-based entities match the entity when it occurs in an easily identifiable location in the text. For example, a context-based entity might match a telephone number that occurs next to the prefix Phone:.

The entities that do not have context attempt to match the entity wherever it occurs. This version might over-match significantly (that is, it is likely to return values that are similar to the entity patterns, such as a number that is not a telephone number). However, it also reduces the number of false negatives (that is, it misses fewer matches).

You can configure Eduction to use both versions of an entity; matches located with context are given a higher score in the results.

When you have data in tables, the context for an entity might not occur next to the entity value. For example, you might have a table with columns titled name and date of birth, but the values themselves do not occur next to these headers.

In this case, you can use Eduction table extraction to extract entities according to the landmarks detected in the table headers. For example, you can configure Eduction so that if it finds a table heading that matches the landmark date of birth, it extracts dates from that column.
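The following Python sketch illustrates the idea of landmark-driven extraction on a small in-memory table. It is a conceptual illustration only and does not use or reproduce Eduction table extraction. The mapping LANDMARK_PATTERNS, the function extract_from_table, the simplified date pattern, and the sample rows are all hypothetical.

```python
import re

# Hypothetical mapping from a header landmark to a simplified value pattern.
LANDMARK_PATTERNS = {
    "date of birth": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def extract_from_table(header, rows):
    """Return (landmark, value) pairs from columns whose header is a landmark."""
    results = []
    for col, title in enumerate(header):
        landmark = title.strip().lower()
        pattern = LANDMARK_PATTERNS.get(landmark)
        if pattern is None:
            continue  # This column header is not a known landmark.
        for row in rows:
            if col < len(row):
                for match in pattern.finditer(row[col]):
                    results.append((landmark, match.group()))
    return results


# Made-up sample data for illustration.
header = ["Name", "Date of Birth"]
rows = [
    ["Ann Lee", "03/07/1985"],
    ["Raj Patel", "unknown"],
    ["Mia Chen", "21/11/1990"],
]
print(extract_from_table(header, rows))
```

The values are matched because of the column they appear in, even though no cell contains the words date of birth next to the date itself.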

For more information about how to configure table extraction, refer to the Eduction User and Programming Guide.

ECR and EJR Grammars

Some grammars are available in two formats, ECR and EJR. In these cases, both formats contain the same entities for extraction, and the format that you use depends on your input data.

EJR files are performance-optimized for cases where the expected match density in your input text is low. Micro Focus recommends that you use EJR files when you expect less than 10% of the input text to be valid matches. In all other cases, use the ECR files.
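The guidance above reduces to a single threshold. The following Python sketch expresses it as a helper; the function name recommended_format is hypothetical, and the expected match density is an estimate that you supply yourself.

```python
def recommended_format(match_fraction):
    """Return 'ejr' or 'ecr' for the expected fraction of text that is valid matches."""
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must be between 0.0 and 1.0")
    # Guidance from the text: EJR below 10% match density, ECR in all other cases.
    return "ejr" if match_fraction < 0.10 else "ecr"


for fraction in (0.02, 0.10, 0.45):
    print(fraction, recommended_format(fraction))
```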

When you use EJR grammars, you must run them in a separate matching engine from any ECR grammars, although you can run multiple EJR grammars in the same engine.

For example, the following configuration is allowed:

ResourceFiles=passport.ejr,date.ejr

You cannot set ResourceFiles=passport.ejr,date.ecr, because this value mixes EJR and ECR grammars in the same engine.
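A rule like this is easy to check before you start an engine. The following Python sketch validates a ResourceFiles value by file extension. It is an illustration only: the function name check_resource_files is hypothetical, and the sketch assumes that the value is a comma-separated list of file names, as in the examples above.

```python
import os


def check_resource_files(value):
    """Return the list of files, or raise ValueError if EJR and ECR are mixed."""
    files = [name.strip() for name in value.split(",") if name.strip()]
    # Collect the lowercase extensions used in this ResourceFiles value.
    extensions = {os.path.splitext(name)[1].lower() for name in files}
    if ".ejr" in extensions and ".ecr" in extensions:
        raise ValueError("EJR and ECR grammars cannot share a matching engine")
    return files


print(check_resource_files("passport.ejr,date.ejr"))

try:
    check_resource_files("passport.ejr,date.ecr")
except ValueError as error:
    print("rejected:", error)
```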