Select Matches

By default, Eduction does not return all possible matches. For example, it does not return matches that overlap a previous match.

If your text includes the sentence "The quick brown fox jumps over the lazy dog", and you have patterns for "fox jumps" and "jumps over", Eduction returns only the first match, because the second match overlaps it.

When several matches start at the same position, Eduction returns the longest one.

If there is a pattern for "brown fox" and one for "brown fox jumps", Eduction returns only the second match.
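The two default rules can be modeled in a few lines of Python. This is only a toy sketch of the behavior described above, not Eduction's actual algorithm or API: the function names are hypothetical, and the patterns are treated as literal strings rather than grammar entities.

```python
from typing import List, Tuple

# (start offset, end offset exclusive, matched text)
Match = Tuple[int, int, str]


def find_candidates(text: str, patterns: List[str]) -> List[Match]:
    """Find every literal occurrence of every pattern (hypothetical helper)."""
    found = []
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            found.append((start, start + len(pattern), pattern))
            start = text.find(pattern, start + 1)
    return found


def select_default(candidates: List[Match]) -> List[Match]:
    """Keep the longest candidate at each position, then skip overlaps."""
    # Sort by start offset; for equal starts, the longer candidate comes first.
    ordered = sorted(candidates, key=lambda m: (m[0], -(m[1] - m[0])))
    selected = []
    last_end = 0
    for start, end, value in ordered:
        # A candidate that starts before the previous match ends overlaps it.
        if start >= last_end:
            selected.append((start, end, value))
            last_end = end
    return selected


text = "The quick brown fox jumps over the lazy dog"
overlap = select_default(find_candidates(text, ["fox jumps", "jumps over"]))
longest = select_default(find_candidates(text, ["brown fox", "brown fox jumps"]))
print([m[2] for m in overlap])  # ['fox jumps']
print([m[2] for m in longest])  # ['brown fox jumps']
```

In this model, "jumps over" is dropped because it starts before "fox jumps" ends, and "brown fox" is dropped because a longer candidate starts at the same offset.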

You can change these defaults, and other aspects of match selection, by using the configuration parameters described in the following sections.

Return the Smallest Match

When Eduction finds two matches that start at the same position, it returns only one match unless you enable overlaps. By default, Eduction returns the longer match.

To return the shorter match instead, enable NonGreedyMatch. Before you use this option, consider whether you could make your Eduction grammar more precise instead.

Generally, you should make your Eduction grammar definition as precise as possible, which reduces the chance of getting two matches at the same position. A precise Eduction grammar is also more efficient during extraction.
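As a rough illustration of the choice between the longer and the shorter match, the following Python sketch keeps one candidate per start position. It is an assumption-laden model rather than Eduction code: the function name is hypothetical, and the non_greedy flag only stands in for the NonGreedyMatch parameter.

```python
from typing import Dict, List, Tuple

# (start offset, end offset exclusive, matched text)
Match = Tuple[int, int, str]


def pick_per_start(candidates: List[Match], non_greedy: bool = False) -> List[Match]:
    """Keep one candidate per start offset.

    The longest candidate wins by default; the shortest wins when
    non_greedy is set (a stand-in for NonGreedyMatch).
    """
    best: Dict[int, Match] = {}
    for match in candidates:
        start, end, _ = match
        current = best.get(start)
        if current is None:
            best[start] = match
        elif non_greedy and end < current[1]:
            best[start] = match
        elif not non_greedy and end > current[1]:
            best[start] = match
    return [best[start] for start in sorted(best)]


candidates = [(10, 19, "brown fox"), (10, 25, "brown fox jumps")]
print(pick_per_start(candidates))                   # [(10, 25, 'brown fox jumps')]
print(pick_per_start(candidates, non_greedy=True))  # [(10, 19, 'brown fox')]
```

If the grammar defined only one pattern that could match at offset 10, the flag would make no difference, which is why a more precise grammar is usually the better fix.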

Overlapping and Duplicate Matches

To return overlapping matches, enable the AllowOverlaps parameter.

In some cases, the same string might occur at more than one position in the input data. By default, Eduction returns only the first match. To allow duplicates, for example so that you can find the position of every occurrence, set the AllowDuplicates parameter to the names of the fields where you want to allow duplicates.

To return only unique matches in each document, set EnableUniqueMatches to True. Eduction then returns only the first occurrence of a particular value, even if the value matches for different entities.
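The following Python sketch models the duplicate handling described above on a list of matches that have already been found. It is an illustration only, not Eduction code. The function name and its two flags are hypothetical stand-ins for AllowDuplicates and EnableUniqueMatches, the sketch applies the flags globally rather than per field, and it assumes that the default behavior removes repeated values within a single entity.

```python
from typing import List, Tuple

# (entity name, start offset, matched value)
Match = Tuple[str, int, str]


def filter_repeats(
    matches: List[Match],
    allow_duplicates: bool = False,
    unique_values: bool = False,
) -> List[Match]:
    """Drop repeated values from matches given in document order."""
    if allow_duplicates:
        # Keep every occurrence, so that all positions are reported.
        return list(matches)
    seen = set()
    kept = []
    for entity, start, value in matches:
        # Assumption: the default compares values within one entity,
        # while unique_values compares values across all entities.
        key = value if unique_values else (entity, value)
        if key not in seen:
            seen.add(key)
            kept.append((entity, start, value))
    return kept


matches = [("company", 0, "HP"), ("company", 40, "HP"), ("ticker", 80, "HP")]
print(filter_repeats(matches))
# [('company', 0, 'HP'), ('ticker', 80, 'HP')]
print(filter_repeats(matches, allow_duplicates=True))
# [('company', 0, 'HP'), ('company', 40, 'HP'), ('ticker', 80, 'HP')]
print(filter_repeats(matches, unique_values=True))
# [('company', 0, 'HP')]
```

The entity names company and ticker are invented for the example.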

Return Multiple Results for a Single Match

In some instances, you might want to get multiple results for a single match. For example, if a word can occur in different contexts, you might want to tag a document with each context in which the word can occur. In the following entity, HP is a synonym for both software and hardware:

<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>

With this entity, CompanyA returns software, while CompanyB returns hardware. A match of HP might return either software or hardware. If you want to use this entity to return both software and hardware for HP, set the AllowMultipleResults configuration parameter to True.
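To make the difference concrete, the following Python sketch parses the entity above with the standard library and looks up the headwords for a matched synonym. It is a simplified model, not how Eduction processes grammars: the function names are hypothetical, the allow_multiple_results flag only stands in for AllowMultipleResults, and returning the first headword when the flag is off is an arbitrary choice, because either headword might be returned in that case.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

ENTITY_XML = """
<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>
"""


def headwords_by_synonym(entity_xml: str) -> Dict[str, List[str]]:
    """Map each synonym to every headword that lists it."""
    index: Dict[str, List[str]] = {}
    for entry in ET.fromstring(entity_xml.strip()).findall("entry"):
        for synonym in entry.findall("synonym"):
            index.setdefault(synonym.text, []).append(entry.get("headword"))
    return index


def results_for(
    index: Dict[str, List[str]],
    matched_text: str,
    allow_multiple_results: bool = False,
) -> List[str]:
    """Return one headword per match, or all of them when the flag is set."""
    headwords = index.get(matched_text, [])
    # Arbitrary choice for the sketch: take the first headword when off.
    return list(headwords) if allow_multiple_results else headwords[:1]


index = headwords_by_synonym(ENTITY_XML)
print(results_for(index, "CompanyA"))                        # ['software']
print(results_for(index, "HP", allow_multiple_results=True)) # ['software', 'hardware']
```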

