Custom Grammar Guidelines

This section describes some guidelines that you can use when you create custom grammars.

Eduction is generally very fast at grammar compilation and entity extraction. However, some expressions in the grammar patterns can increase the extraction and compilation times significantly.

The grammar files that are included in the Eduction packages are designed to be as fast as possible. The following sections describe some ways to ensure that your custom Eduction grammars are also fast.

In general, the more concise a grammar is, the faster it returns matches. For the best performance, use the simplest entity that matches what you need, and use additional features only if you need them.

TIP: Before you create a custom grammar, check the standard grammars to see whether the entity you want is already supported. If it is not, contact Micro Focus support. The entity you want to detect might be supported in an upcoming release. Alternatively, Micro Focus might be able to add support in future, if other customers want it too.

Using an official grammar means you do not have to maintain it.

It is also generally easier to extend an existing grammar by using a user extension file than to create a completely new grammar.

The following sections describe the most efficient use of specific features, and how you might be able to avoid using slow features in some cases.

For details of the grammar syntax, see Grammar Format Reference. In particular, for details of the regular expression syntax used in these examples, see Regular Expressions.

NOTE: Some configuration settings affect extraction speed, for example MatchCase, MatchWholeWord, AllowOverlaps, and NonGreedyMatch.

For a tutorial that gives an example of how to create a custom grammar, refer to IDOL Expert.

Exclusions and Negations

Always try to describe the value that you want to match, rather than a value to exclude.

For example, you can exclude a match by using a score of zero (score="0" in the pattern definition). However, this option can significantly increase processing time. Micro Focus recommends that you define your patterns such that your entities do not match the values you want to exclude.

For example, to extract mobile phone numbers that do not end in 25:

   <pattern>07[0-9]{7}[013-9][0-9]</pattern>
   <pattern>07[0-9]{7}[0-9][0-46-9]</pattern>

is faster than

   <pattern>07[0-9]{9}</pattern>
   <pattern score="0">07[0-9]{7}25</pattern>

TIP: You can also use a Lua script to filter out the exclusions in post-processing. This option might be preferable when you want to define more general patterns.

Similarly, Micro Focus recommends that you describe the values to match rather than using the negation operator. For example, to avoid matching the digit 0, use [1-9] rather than [^0]. The negation operator can increase compilation and processing times.
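
As a minimal sketch of this recommendation, the following fragment defines a four-digit code that must not start with 0. The entity names and the four-digit format are illustrative assumptions, not part of any standard grammar.

```xml
<!-- Preferred: describe the characters that are allowed. -->
<entity name="code_positive" type="public">
   <pattern>[1-9][0-9]{3}</pattern>
</entity>

<!-- Avoid: the negated class accepts any character other than 0,
     including letters and punctuation, and the negation operator
     can increase compilation and processing times. -->
<entity name="code_negated" type="public">
   <pattern>[^0][0-9]{3}</pattern>
</entity>
```

The positive form is also more precise, because it cannot match a non-digit in the first position.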

Case Sensitivity

See Also: Case Sensitive Matches.

Wherever possible, use case-sensitive matching. After exclusions, case-insensitive matching is the slowest feature in Eduction. In particular, avoid using the case="insensitive" tag in entities or patterns. When case-insensitive matching is essential, consider using one of the following options:

  • Use alternate casing in a regular expression pattern. For example, to match the word Paul case insensitively, you might use the pattern [Pp][Aa][Uu][Ll].

    NOTE: Extensive use of this format can greatly increase grammar compilation times and the size of the compiled ECR file.

  • Use text normalization. The Eduction process can normalize text before matching, to convert the input to all lower- or all uppercase. You can define your entities in one case, and normalize the input text the same way.

    This approach might be unsuitable if you need to match capitalized words in certain places.
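
The first option can be sketched as follows. The entity names are hypothetical, and the placement of the case attribute on the entity element is an assumption based on the description above.

```xml
<!-- Avoid where possible: case-insensitive matching for the whole entity. -->
<entity name="first_name_ci" type="public" case="insensitive">
   <pattern>paul</pattern>
</entity>

<!-- Alternative: spell out both cases for each letter. -->
<entity name="first_name_alt" type="public">
   <pattern>[Pp][Aa][Uu][Ll]</pattern>
</entity>
```

The second form suits a small number of short words. For long word lists, text normalization is likely to scale better, because alternate casing increases compilation time and ECR size.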

Reference or Copy Entities

When you create a custom grammar, you can match a previously defined entity and either:

  • copy it to the new entity by using the syntax (?A:

  • reference it in the new entity by using the syntax (?A^

For large or complex Eduction grammars, copying entities results in a very large grammar file, which can take an extremely long time to compile. In addition, the resulting file can take longer to load and scan than the equivalent file created by using references.

For example, if an entity matches a static list of several thousand names, always use the reference operator to include it in other patterns. Similarly, reference an entity if it contains patterns that can match a wide variety of expressions.

For very simple grammar files, it might be faster to copy entities, because this method creates an ECR with more efficient instructions for extraction.

In general, Micro Focus recommends that you use references in all cases, unless your grammar file is very simple, and extraction speed is critical.
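
The following sketch shows the two operators side by side. The entity names and the title list are illustrative assumptions. In practice, the surname entity would contain several thousand entries rather than three.

```xml
<!-- A list entity (only three of many entries shown). -->
<entity name="surname" type="private">
   <entry headword="Smith"/>
   <entry headword="Jones"/>
   <entry headword="Taylor"/>
</entity>

<!-- Reference (recommended in most cases). -->
<entity name="title_surname_ref" type="public">
   <pattern>(Mr|Mrs|Ms|Dr) (?A^surname)</pattern>
</entity>

<!-- Copy: consider this only for very simple grammars
     where extraction speed is critical. -->
<entity name="title_surname_copy" type="public">
   <pattern>(Mr|Mrs|Ms|Dr) (?A:surname)</pattern>
</entity>
```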

Merge Entities

When you know what entities you want to use for extraction in advance, you can improve performance by creating a single public entity that includes each of these entities. It is quicker for Eduction to process a single large entity definition than for it to use several smaller definitions. Similarly, extraction is faster because it needs to check only one entity for a match.

For example, the following public entity merges the entities animal, vegetable, and mineral:

<entity name="my_entity" type="public">
   <pattern>(?A^animal)</pattern>
   <pattern>(?A^vegetable)</pattern>
   <pattern>(?A^mineral)</pattern>
</entity>

In your Eduction configuration, you can use this merged entity rather than the individual ones.

TIP: You can use merged entities to improve performance even if you only use entities from Micro Focus grammar files.

Micro Focus recommends that you merge any entities that you can, unless merging them alters what the grammar can match.

Include Grammars

When you include another grammar in your custom grammar, use the ECR form rather than the XML. When you use an XML inclusion, compiling your grammar requires in-memory compilation of the included grammar file, which might in turn require the same for any further inclusions.

It is always quicker to compile a grammar that includes ECR files.
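
As a rough sketch, the two forms of inclusion might look like the following. The file names are hypothetical, and the include element with a path attribute is an assumption here. Check Grammar Format Reference for the exact syntax.

```xml
<!-- Preferred: include the precompiled ECR file. -->
<include path="common_names.ecr"/>

<!-- Slower to compile: the XML source must be compiled in memory first,
     along with any grammars that it includes in turn. -->
<include path="common_names.xml"/>
```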

Use Common Forms of Matches

Where possible, use only the most common match cases for your grammar. Additional forms for acceptable matches can result in increased processing times, particularly for complex forms. Consider your matches carefully and add further forms only if they are necessary.

Quantifiers

The syntax expression{n,m} matches at least n, but at most m consecutive occurrences of the specified expression. When m is large, it can result in a large ECR file and slow extraction.

In this situation, Micro Focus recommends that you use {n,} unless the upper bound m is important.
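
For illustration, the following fragment contrasts a large bounded range with the open-ended form. The entity names and the bounds 6 and 200 are arbitrary assumptions.

```xml
<!-- A large upper bound can result in a large ECR file and slow extraction. -->
<entity name="long_number_bounded" type="public">
   <pattern>[0-9]{6,200}</pattern>
</entity>

<!-- Open-ended form, suitable when the upper bound is not important. -->
<entity name="long_number_open" type="public">
   <pattern>[0-9]{6,}</pattern>
</entity>
```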

Reduce the Number of Ways to Match

When there are many ways to match a particular entity, Eduction must try many methods to determine whether an input string matches, and to attempt to find the longest match. Where possible, make sure that there are as few ways to match your entity as possible. In particular, avoid using the {n,m} operator for entities that can match a wide variety of tokens.

For example, the following entity matches a five-letter word that occurs between one and three times. This five-letter word must occur between two matches of an entity called name.

<entity name="myentity">
   <pattern>(?A:name) ([A-Za-z]{5}){1,3} (?A:name)</pattern>
</entity>

The example name entity might also include five-letter names, such as Chris, James, or Alice, that also match the regex pattern. When Eduction processes text, it tries every combination that might lead to a match. In this case there might be a very large number of options, which could be very slow.
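
One possible way to narrow the options is sketched below. It assumes that the filler word is always lowercase and occurs exactly once, that name matches only capitalized names, and that matching is case sensitive. None of these assumptions come from the original example, so adjust the pattern to your data.

```xml
<entity name="myentity_narrow" type="public">
   <!-- A lowercase-only filler does not overlap with capitalized names
        such as Chris or James, and the fixed count removes the choice
        that {1,3} introduces. -->
   <pattern>(?A^name) [a-z]{5} (?A^name)</pattern>
</entity>
```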

Lua Post-Processing

You can use a Lua script for more advanced matching, which might improve processing speed in complex cases. You can use Lua scripts in many ways, from checksum validation to checking match proximity (if you use the en masse matching mode).

The match proximity check might be useful for complex entities. That is, if matches from a certain set of entity names are close together, you can consider them to form a single, complex entity. This approach is also useful in cases where the major elements of an entity are separated by filler or unknown content.

Optional Phrases

Try to start your entities with a required entity or phrase. An entity that starts with one or more optional phrases can be very slow during extraction. For example, the following type of pattern might be very slow:

<pattern>(?A^animal)?(?A^vegetable)*(?A^mineral)?(?A^name)</pattern>

In this case, the entity name is required, while animal, vegetable, and mineral are optional. When extracting, Eduction must check each word for matches in the animal entity, then check whether it matches vegetable, mineral, and then name. If the word matches animal, Eduction must then check whether the following word matches vegetable, mineral, or name, and so on.

This process can be time-consuming, particularly if each of the optional entities occurs regularly in the input text.

This issue does not occur if the pattern starts with the required phrase. For example:

<pattern>(?A^name)(?A^animal)?(?A^vegetable)*(?A^mineral)?</pattern>

In this case, Eduction needs to check each word only for matches in name, and it checks for the optional phrases only when it finds a match for name.

Use Private Entities

Eduction has public and private entities. Public entities are available to match during extraction. Private entities are available to use in other entities, but you cannot extract them directly as matches.

You can use private entities to break up complicated pattern expressions into several simple patterns, which you can use in a single public entity. This process keeps your patterns simple, which makes them easier to maintain and troubleshoot.

Making an entity public unnecessarily in a grammar can result in longer processing times if you use all entities from the grammar. Micro Focus recommends that you mark all entities as private by default. You can then expose as public only the entities that represent the entirety of a match, rather than subelements.
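
A minimal sketch of this layout follows. The entity names and number formats are hypothetical, and type="private" is assumed by analogy with the type="public" attribute used elsewhere in this section.

```xml
<!-- Building blocks: private, so they are never returned as matches. -->
<entity name="area_code" type="private">
   <pattern>0[1-9][0-9]{2,3}</pattern>
</entity>
<entity name="local_number" type="private">
   <pattern>[0-9]{6}</pattern>
</entity>

<!-- Only the entity that represents the whole match is public. -->
<entity name="phone_number" type="public">
   <pattern>(?A^area_code) (?A^local_number)</pattern>
</entity>
```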

Output Exclusions

You can use output exclusions (the (?A! operator) for text that is useful when you identify a match, but that is not part of the match itself. For example, if you have form labels that identify a piece of information (such as Name, Telephone Number, and so on), you can use these in an entity to match the correct information, and then use an output exclusion so that the final output includes only the important content.

NOTE: Output exclusions can increase compilation times if you exclude complex entities.
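
The form label case might be sketched as follows. The entity name, label text, and number format are hypothetical. The closing parenthesis on the (?A! operator is assumed by analogy with the other (?A operators, so check Regular Expressions for the exact syntax.

```xml
<entity name="labelled_phone" type="public">
   <!-- The label helps to identify the match, but the output exclusion
        keeps it out of the returned value. -->
   <pattern>(?A!Telephone Number: )[0-9]{11}</pattern>
</entity>
```

The excluded part here is a short literal, which keeps the effect on compilation time small.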

Patterns and Headwords

You can define the text to match either as a regular expression, by using a pattern (the <pattern> element), or by explicitly listing each possible match as a headword (in the <entry> element).

For example, the following alternatives are equivalent:

   <pattern>[Ee]xtract(ed|ing|s)?</pattern>

and

   <entry headword="Extract"/>
   <entry headword="Extracted"/>
   <entry headword="Extracting"/>
   <entry headword="Extracts"/>
   <entry headword="extract"/>
   <entry headword="extracted"/>
   <entry headword="extracting"/>
   <entry headword="extracts"/>

Patterns are often faster to write, easier to maintain, and faster for extraction. However, if one pattern represents fewer than about 50 entries, compilation is faster for headwords.

Micro Focus recommends that you use patterns in your entities, unless the pattern becomes too complex. Additionally, if the compilation time becomes too slow, you might want to consider replacing the simplest patterns with headwords.

NOTE: You can also provide a list of words by adding multiple <pattern> elements with a word in each, unless the word or phrase contains characters that are also valid regular expression syntax.
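
For example, a word list written as patterns might look like the following sketch. The entity name and words are hypothetical. The last pattern assumes that a backslash escapes a regular expression character, as in most regular expression syntaxes.

```xml
<entity name="company_suffix" type="public">
   <pattern>Limited</pattern>
   <pattern>Incorporated</pattern>
   <!-- An unescaped dot is regular expression syntax and matches
        more than a literal full stop, so it is escaped here. -->
   <pattern>Inc\.</pattern>
</entity>
```

Where many entries need escaping, headwords are likely to be easier to read and maintain.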

Components

See Also: Components

In some cases, including components in an Eduction grammar file can increase the extraction time, even if you do not enable the components during the extraction. This occurs because the ECR is less compact than the equivalent file that does not describe components. Do not include components if you do not need them.

When you do use components, Micro Focus recommends that you make the structure of the components as uniform as possible.

For example:

   <pattern>(?A=COMPONENT:(?A^entity_A) )(?A^entity_B) (?A^entity_C) (?A^entity_D)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A)) (?A^entity_B)(?A^entity_D) (?A^entity_C)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B))(?A^entity_C) (?A^entity_E)</pattern>

might be slower than:

   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_D)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_D) (?A^entity_C)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_E)</pattern>

Scores

You can add scoring to your entities to indicate the relative confidence for matches. The score for a match is the product of all the scores of the patterns or entities it includes, with a default value of one.

It is up to you how you use scores in your use case. Eduction does not define the meaning of a score.
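
The product rule can be illustrated with the following sketch. The entity names, patterns, and score values are arbitrary assumptions chosen to make the arithmetic easy to follow.

```xml
<entity name="street_type" type="private">
   <!-- No score attribute: the default score of 1 applies. -->
   <pattern>Street</pattern>
   <!-- The abbreviation is treated as a weaker indicator. -->
   <pattern score="0.8">St</pattern>
</entity>

<entity name="street_address" type="public">
   <pattern score="0.5">[0-9]{1,4} [A-Z][a-z]+ (?A^street_type)</pattern>
</entity>
```

Under the rule described above, a match such as 12 High Street would be expected to score 0.5 × 1 = 0.5, and 12 High St would be expected to score 0.5 × 0.8 = 0.4.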

Whitespace

You can define a space in grammars by using a space character, or by using the \s syntax. The \s syntax matches all types of whitespace, such as spaces, tabs, and newlines. In most practical situations, the matches you want from the input text only include spaces and it is slightly faster to use a space, rather than \s.
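
As a small illustration, the following two patterns differ only in how they match the separator. The digit groups are an arbitrary example.

```xml
<!-- Slightly faster: matches a single space character only. -->
<pattern>[0-9]{3} [0-9]{4}</pattern>

<!-- Matches any single whitespace character between the groups,
     including a tab or a newline. -->
<pattern>[0-9]{3}\s[0-9]{4}</pattern>
```

Use the second form only when the input can contain tabs or line breaks at that position.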