Select Matches

By default, Eduction does not return all possible matches. For example, it does not return matches that overlap a previous match:

  • If you have patterns for "fox jumps" and "jumps over", then for the text "The quick brown fox jumps over the lazy dog" Eduction returns only the first match (fox jumps), because the second match (jumps over) overlaps it.

When more than one match starts at the same position, Eduction returns the longest one. For example:

  • If you have a pattern for "brown fox" and one for "brown fox jumps", Eduction returns only the second (longer) match.

Eduction provides configuration parameters that you can use to change these defaults and other matching behavior.
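The default selection rules described above can be modeled in a few lines of Python. This is a simplified sketch for illustration, not Eduction's implementation: it treats each pattern as a literal string, and the function names find_candidates and select_default are hypothetical.

```python
import re


def find_candidates(text, patterns):
    """Return every (start, end, matched_text) span for each literal pattern."""
    spans = []
    for pattern in patterns:
        for m in re.finditer(re.escape(pattern), text):
            spans.append((m.start(), m.end(), m.group()))
    return spans


def select_default(spans):
    """Keep the longest match at each start position, then drop overlaps."""
    # Sort by start offset; when two spans start together, longest first.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    selected = []
    last_end = 0
    for start, end, value in ordered:
        # A span that begins before the previous kept span ends overlaps it.
        if start >= last_end:
            selected.append((start, end, value))
            last_end = end
    return selected
```

With the sentence "The quick brown fox jumps over the lazy dog", passing the patterns fox jumps and jumps over through these two functions keeps only fox jumps, and passing brown fox and brown fox jumps keeps only brown fox jumps.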

Return the Shortest Match

When Eduction finds two matches that start at the same position, it returns only one match unless you enable overlaps. By default, Eduction returns the longer match.

You can configure it to use the shorter match by enabling NonGreedyMatch (see NonGreedyMatch). However, always consider whether you need to use this option, or whether you could redefine your Eduction grammar to be more precise instead.

Generally, Micro Focus recommends that you make your Eduction grammar definition as precise as possible, which reduces the chance of getting two matches at the same position. A precise Eduction grammar is also more efficient during extraction.
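The difference between the default and the non-greedy choice can be sketched as follows. This is an illustrative model only: the function name pick_per_start and its non_greedy argument are hypothetical stand-ins for the effect of NonGreedyMatch, and spans are plain (start, end) character offsets.

```python
def pick_per_start(spans, non_greedy=False):
    """Keep one (start, end) span per start offset.

    By default the longest span wins; with non_greedy=True the shortest wins.
    """
    choose = min if non_greedy else max
    ends_by_start = {}
    for start, end in spans:
        ends_by_start.setdefault(start, []).append(end)
    return [(start, choose(ends)) for start, ends in sorted(ends_by_start.items())]
```

For two candidates that both start at offset 10, such as (10, 19) and (10, 25), the default call returns (10, 25) and the non-greedy call returns (10, 19). A grammar precise enough to produce only one candidate at each position makes the choice unnecessary.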

Overlapping and Duplicate Matches

You can return overlapping matches by enabling the AllowOverlaps parameter (see AllowOverlaps).

When the same string occurs at more than one position in the input data, by default Eduction returns only the first match. You can allow duplicates (for example, if you need to find the positions of all occurrences) by setting the AllowMultipleResults parameter (see AllowMultipleResults).

If you want to return only unique matches in each document, set EnableUniqueMatches to True (see EnableUniqueMatches). With this setting, Eduction returns only a single occurrence of a particular value (the first match), even if the other occurrences match different entities.
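The effect of returning only unique matches can be modeled as a first-occurrence filter. The sketch below is an assumption-laden illustration, not Eduction's implementation: it assumes values are compared as exact strings, and the function name unique_matches and the entity names in the usage note are hypothetical.

```python
def unique_matches(matches):
    """Keep the first occurrence of each matched value in a document.

    matches is a list of (entity, value, offset) tuples in document order.
    Later occurrences of a value are dropped even if the entity differs.
    """
    seen = set()
    kept = []
    for entity, value, offset in matches:
        if value in seen:
            continue
        seen.add(value)
        kept.append((entity, value, offset))
    return kept
```

Given the matches ("person", "Jordan", 0), ("place", "Jordan", 40), and ("person", "Smith", 60), the filter keeps the first and third tuples and drops the second, because the value Jordan has already been returned.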

Return Multiple Results for a Single Match

In some instances, you might want to get multiple results for a single match. For example, if a word can occur in different contexts, you might want to tag a document with each context in which the word can apply. Consider the following entity:

<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>

With this entity, a match of CompanyA returns the headword software, while a match of CompanyB returns the headword hardware. A match of HP might return either software or hardware, because HP is a synonym in both entries. If you want to use this entity to return both software and hardware for HP, set the AllowMultipleResults configuration parameter to True (see AllowMultipleResults).
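The lookup described above can be sketched in Python by reading the same entity definition and collecting every headword whose entry lists the matched term. This is a minimal model, not Eduction's implementation: the function name headwords_for and its allow_multiple argument are hypothetical, and returning the first headword when allow_multiple is False is an arbitrary choice made for the sketch.

```python
import xml.etree.ElementTree as ET

ENTITY_XML = """
<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>
"""


def headwords_for(term, entity_xml, allow_multiple=False):
    """Return the headwords of the entries that list term as a synonym."""
    root = ET.fromstring(entity_xml.strip())
    found = []
    for entry in root.findall("entry"):
        synonyms = [s.text for s in entry.findall("synonym")]
        if term in synonyms:
            found.append(entry.get("headword"))
    # Assumption: without multiple results, only one headword is reported.
    return found if allow_multiple else found[:1]
```

Calling headwords_for("HP", ENTITY_XML, allow_multiple=True) yields both software and hardware, whereas CompanyA and CompanyB each yield a single headword regardless of the setting.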