Eduction is very fast at grammar compilation and entity extraction. However, some expressions in the grammar patterns can increase the extraction and compilation times significantly.
The grammar files that are included in the Eduction SDK package are designed to be as fast as possible. The following section describes some ways to ensure that user-created Eduction grammars also work quickly. As a general rule, the more concise a grammar is, the faster it returns matches.
Some configuration settings also affect extraction speed, for example MatchCase, MatchWholeWords, AllowOverlaps, and NonGreedyMatch.
When you create a custom grammar, you can match a previously defined entity and either:
copy it to the new entity by using the syntax (?A:
reference it in the new entity by using the syntax (?A^
For very simple grammar files, it is generally faster to copy entities by using (?A: because this method creates an .ECR file with slightly more efficient instructions for extraction.
For complicated Eduction grammars, copying entities by using (?A: results in a very large grammar file, which can take an extremely long time to compile. In addition, the resulting file can take longer to load and scan than the equivalent file created by using (?A^ references.
Micro Focus recommends that you use (?A^ in all cases, unless your grammar file is very simple and extraction speed is critical.
You can define a space in grammars by using a space character ( ), or by using the \s syntax. The \s syntax matches all types of whitespace, such as spaces, tabs, and newlines. In most practical situations, the matches you want from the input text include only spaces. In these cases, it is slightly faster to use a space rather than \s.
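The difference in what the two forms match can be sketched as follows. This example uses Python's re module purely as a stand-in for Eduction pattern syntax (an assumption; it does not measure Eduction speed), and the sample text is hypothetical.

```python
import re

# Hypothetical input: the same phrase separated by a space, a tab, and a newline.
text = "New York\nNew\tYork\nNew\nYork"

# A literal space matches only the space-separated occurrence.
literal_space = re.compile(r"New York")
# \s matches any whitespace character, so all three occurrences match.
any_whitespace = re.compile(r"New\sYork")

space_matches = literal_space.findall(text)
whitespace_matches = any_whitespace.findall(text)

print(len(space_matches), len(whitespace_matches))
```

If the tab-separated and newline-separated occurrences are matches you need, \s is the right choice; otherwise the literal space expresses the intent more narrowly.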
You can extract regular expressions by using patterns (the <pattern> element), or by explicitly listing each possible match as an entry (the <entry> element). For example, the following alternatives are equivalent:
Example 1:
<pattern>[Ee]xtract(ed|ing|s)?</pattern>
Example 2:
<entry headword="Extract"/> <entry headword="Extracted"/> <entry headword="Extracting"/> <entry headword="Extracts"/> <entry headword="extract"/> <entry headword="extracted"/> <entry headword="extracting"/> <entry headword="extracts"/>
The first alternative is faster to code and maintain, and slightly faster for extraction. However, if there are fewer than about 50 entries represented by one pattern, the compilation time is faster for entries.
Micro Focus recommends that you use patterns unless the compilation time becomes too slow, in which case you might consider replacing the simplest patterns with entries.
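If you do decide to replace a simple pattern with entries, the list of entries can be generated rather than typed by hand. The following is a minimal sketch that expands the example pattern into its eight headwords and checks each one against the pattern. It uses Python's re module as a stand-in for Eduction pattern matching, and the variable names are illustrative assumptions.

```python
import re
from itertools import product

# The example pattern, checked here with Python's re module (an assumption:
# this simple pattern behaves the same way in both syntaxes).
pattern = re.compile(r"[Ee]xtract(ed|ing|s)?")

# The alternatives that the pattern encodes.
first_letters = ["E", "e"]
suffixes = ["", "ed", "ing", "s"]

# Every headword that the pattern stands for, in sorted order.
headwords = sorted(
    first + "xtract" + suffix
    for first, suffix in product(first_letters, suffixes)
)

# Render each headword as an <entry> element.
entries = ['<entry headword="{}"/>'.format(word) for word in headwords]

# Confirm that the pattern matches every generated headword in full.
all_match = all(pattern.fullmatch(word) is not None for word in headwords)

print("\n".join(entries))
```

Generating the entries from the same alternatives that the pattern uses keeps the two forms in step if the word list changes later.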
The syntax expression{n,m} matches at least n, but at most m, consecutive occurrences of the specified expression. When m is large, it can result in a large .ECR file and slow extraction.
In this situation, Micro Focus recommends that you use {n,} unless the upper bound m is important.
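Whether the upper bound matters depends on whether longer runs should be rejected. The following sketch shows the difference in matching behavior only, not in speed or .ECR size. It uses Python's re module as a stand-in for Eduction pattern syntax, with illustrative bounds of 2 and 6.

```python
import re

# Illustrative bounds (assumptions): n = 2, m = 6.
bounded = re.compile(r"[0-9]{2,6}")
unbounded = re.compile(r"[0-9]{2,}")

short_run = "12345"       # 5 digits, within the bound
long_run = "1234567890"   # 10 digits, beyond the bound

# Both forms accept a run that is within the bound.
same_on_short = (
    bounded.fullmatch(short_run) is not None
    and unbounded.fullmatch(short_run) is not None
)

# Only the unbounded form accepts the longer run as a whole.
bounded_rejects_long = bounded.fullmatch(long_run) is None
unbounded_accepts_long = unbounded.fullmatch(long_run) is not None

print(same_on_short, bounded_rejects_long, unbounded_accepts_long)
```

If runs longer than m never occur in your input, or are acceptable as matches, the two forms are interchangeable and {n,} is the simpler choice.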
The following type of pattern can be slow during extraction:
<pattern>(?A^entity_A)?(?A^entity_B)*(?A^entity_C)?(?A^entity_D)</pattern>
In this example, each time Eduction encounters a new word, it must check whether the word matches entity_A, then entity_B, then entity_C, and then entity_D. If the word does match entity_A, Eduction must then check whether the following word is matched by entity_B, entity_C, or entity_D, and so on.
This process can be time-consuming, especially if each of the optional entities occurs regularly in the input text. When extraction speed is critical, Micro Focus recommends that you remove any unnecessary optional entities at the beginning of a pattern.
This issue does not occur if the optional phrases are not at the start of the pattern.
In some cases, including components in an Eduction grammar XML file increases the time for extraction by up to 50%, even if the components are not enabled during the extraction. This occurs because the resulting .ECR is less compact than the equivalent file that does not describe components. Do not use components if you do not need them.
When you do use components, Micro Focus recommends that you make the structure of the components as uniform as possible.
For example:
<pattern>(?A=COMPONENT:(?A^entity_A) )(?A^entity_B) (?A^entity_C) (?A^entity_D)</pattern> <pattern>(?A=COMPONENT:(?A^entity_A)) (?A^entity_B)(?A^entity_D) (?A^entity_C)</pattern> <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B))(?A^entity_C) (?A^entity_E)</pattern>
might be slower than:
<pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_D)</pattern> <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_D) (?A^entity_C)</pattern> <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_E)</pattern>
You can set the score of an entry or pattern to zero to exclude that entry or pattern. However, this method can reduce performance during extraction.
Where possible, and when extraction speed is critical, Micro Focus recommends that you consider ways to list matches to extract, rather than the matches to exclude.
For example, to extract mobile phone numbers that do not end in 25:
<pattern>07[0-9]{9}</pattern> <pattern score="0">07[0-9]{7}25</pattern>
is slower than:
<pattern>07[0-9]{7}[013-9][0-9]</pattern> <pattern>07[0-9]{7}[0-9][0-46-9]</pattern>
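When you rewrite an exclusion as a list of positive patterns, it is worth checking that both forms accept the same values. The following sketch performs that check for the phone number example. It uses Python's re module as a stand-in for Eduction pattern matching, models the zero-score pattern as a simple rejection, and uses hypothetical function names and test numbers.

```python
import re

# Exclusion approach: match everything, then reject numbers that end in 25.
all_numbers = re.compile(r"07[0-9]{9}")
excluded = re.compile(r"07[0-9]{7}25")

# Listing approach: two patterns that together cover every ending except 25.
listed = [
    re.compile(r"07[0-9]{7}[013-9][0-9]"),   # second-to-last digit is not 2
    re.compile(r"07[0-9]{7}[0-9][0-46-9]"),  # last digit is not 5
]


def by_exclusion(number):
    """Accept a number that matches the general pattern but not the excluded one."""
    return (
        all_numbers.fullmatch(number) is not None
        and excluded.fullmatch(number) is None
    )


def by_listing(number):
    """Accept a number that matches at least one of the positive patterns."""
    return any(p.fullmatch(number) is not None for p in listed)


# Hypothetical numbers that cover all 100 possible two-digit endings.
candidates = ["071234567{:02d}".format(i) for i in range(100)]
disagreements = [n for n in candidates if by_exclusion(n) != by_listing(n)]

print(disagreements)
```

An empty list of disagreements indicates that the two forms accept the same endings for these test numbers.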
The following set of entities runs fast most of the time:
<entity name="entity1"> <pattern>[0-9]{2,3}</pattern> </entity> <entity name="entity2"> <pattern>[0-9]{3,4}</pattern> </entity> <entity name="entity3"> <pattern>((?A^entity1)|(?A^entity2))+</pattern> </entity>
However, on some data they might run extremely slowly, for example when the input text includes:
123 234 345 456 567 678 789 890 900 000
In this example, either entity can match every number, so there are 2^10 (1024) different ways that Eduction can match this phrase. It might try many of these combinations while looking for a longer match.
In general, Micro Focus recommends that you avoid having a large number of possible ways to match a given phrase.
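The figure of 1024 can be reproduced by enumerating the possible assignments of an entity to each number. The following is a simplified sketch: it models the two digit-range patterns with length checks and counts the combinations. It does not reflect how Eduction searches internally, and the helper name is hypothetical.

```python
from itertools import product

tokens = "123 234 345 456 567 678 789 890 900 000".split()


def candidate_entities(token):
    """Return the names of the entities whose pattern matches the whole token."""
    names = []
    if token.isdigit() and 2 <= len(token) <= 3:   # models [0-9]{2,3}
        names.append("entity1")
    if token.isdigit() and 3 <= len(token) <= 4:   # models [0-9]{3,4}
        names.append("entity2")
    return names


# Each token can be matched by either entity, so the choices multiply.
labelings = list(product(*(candidate_entities(t) for t in tokens)))
count = len(labelings)

print(count)
```

Each additional ambiguous number doubles the count, which is why a long run of such numbers can slow extraction sharply.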
When patterns refer to entities by reference, Eduction checks for a match using each entity separately. While this is usually fast in practice, the following:
<pattern>(?A^entity_ABC)</pattern>
is likely to be faster than:
<pattern>((?A^entity_A)|(?A^entity_B)|(?A^entity_C))</pattern>
where entity_ABC contains everything in entity_A, entity_B, and entity_C.
Micro Focus recommends that you merge any entities that can be merged, unless merging them alters what the grammar can match.