By default, Eduction does not return all possible matches. For example, it does not return matches that overlap a previous match.
If your text includes the sentence "The quick brown fox jumps over the lazy dog", and you have patterns for "fox jumps" and "jumps over", only the first match is returned, because the second match overlaps it.
When several matches start at the same position, Eduction returns the longest one.
If there is a pattern for "brown fox", and one for "brown fox jumps", only the second match is returned.
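The default selection described above can be illustrated with a small Python model. This is only a sketch of the documented behavior, not Eduction's implementation: the names Match, find_candidates, and select_default are hypothetical, and patterns are treated as plain literal strings.

```python
import re
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int  # offset of the first character
    end: int    # offset one past the last character
    text: str


def find_candidates(text: str, patterns: List[str]) -> List[Match]:
    """Collect occurrences of each literal pattern; matches of different
    patterns may overlap each other."""
    found = []
    for pattern in patterns:
        for m in re.finditer(re.escape(pattern), text):
            found.append(Match(m.start(), m.end(), m.group()))
    return found


def select_default(candidates: List[Match]) -> List[Match]:
    """Keep the longest match at each position and drop any match that
    overlaps a match already kept (a model of the default behavior)."""
    # Earliest start first; at the same start, longest first.
    ordered = sorted(candidates, key=lambda c: (c.start, -(c.end - c.start)))
    kept = []
    last_end = 0
    for c in ordered:
        if c.start >= last_end:  # does not overlap the previous kept match
            kept.append(c)
            last_end = c.end
    return kept


SENTENCE = "The quick brown fox jumps over the lazy dog"

overlapping = select_default(find_candidates(SENTENCE, ["fox jumps", "jumps over"]))
print([m.text for m in overlapping])    # ['fox jumps']

same_position = select_default(find_candidates(SENTENCE, ["brown fox", "brown fox jumps"]))
print([m.text for m in same_position])  # ['brown fox jumps']
```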
You can configure this and other matching behavior.
When Eduction finds two matches that start at the same position, it returns only one match unless you enable overlaps. By default, Eduction returns the longer match.
You can configure it to use the shorter match by enabling NonGreedyMatch. However, you should consider whether you need to use this option, or if you could redefine your Eduction grammar to be more precise instead.
Generally, you should make your Eduction grammar definition as precise as possible, which reduces the chance of getting two matches at the same position. A precise Eduction grammar is also more efficient during extraction.
You can return overlaps by enabling the AllowOverlaps parameter.
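The effect of these two options can be sketched with a toy Python model. The function select and its keyword arguments non_greedy and allow_overlaps are hypothetical stand-ins for the NonGreedyMatch and AllowOverlaps parameters. How the real engine combines the options is not stated here, so treat the logic as an assumption made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def select(candidates: List[Span], non_greedy: bool = False,
           allow_overlaps: bool = False) -> List[Span]:
    """Toy model: choose which candidate spans to return."""
    if allow_overlaps:
        # Assumption: with overlaps enabled, every candidate is returned.
        return sorted(set(candidates))
    # At the same start position prefer the shortest span when non_greedy
    # is set, otherwise the longest.
    sign = 1 if non_greedy else -1
    ordered = sorted(set(candidates), key=lambda s: (s[0], sign * (s[1] - s[0])))
    kept = []
    last_end = 0
    for start, end in ordered:
        if start >= last_end:  # skip spans that overlap a kept span
            kept.append((start, end))
            last_end = end
    return kept


# Offsets in "The quick brown fox jumps over the lazy dog":
# "brown fox" = (10, 19), "brown fox jumps" = (10, 25), "jumps over" = (20, 30)
spans = [(10, 19), (10, 25), (20, 30)]

print(select(spans))                       # [(10, 25)]
print(select(spans, non_greedy=True))      # [(10, 19), (20, 30)]
print(select(spans, allow_overlaps=True))  # [(10, 19), (10, 25), (20, 30)]
```

In this model, choosing the shorter match also frees the text after it for a later match, which is one reason a more precise grammar is usually easier to reason about than switching the matching mode.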
In some cases, the same string might occur at more than one position in the input data. By default, Eduction returns only the first match. You can allow duplicates, for example so that you can find the positions of all occurrences, by setting the AllowDuplicates parameter to the names of the fields where you want to allow duplicates.
If you want to return only unique matches in each document, set EnableUniqueMatches to True. Eduction then returns a single occurrence of a particular value, even if the matches occur for different entities. Only the first match is returned.
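The difference between the default, duplicate-allowing, and unique modes can be modeled in a few lines of Python. This is a sketch under stated assumptions: the Match type, the filter_matches function, and the field names are hypothetical. It also assumes that default de-duplication is applied per field and that the unique mode takes precedence over allowed duplicates, neither of which is confirmed here.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Match(NamedTuple):
    field: str   # output field the match belongs to (hypothetical names)
    text: str    # matched string
    offset: int  # position of the match in the input


def filter_matches(matches: Iterable[Match],
                   allow_duplicates: FrozenSet[str] = frozenset(),
                   unique: bool = False) -> List[Match]:
    """Toy model of duplicate handling; earlier matches win."""
    seen = set()
    kept = []
    for m in sorted(matches, key=lambda m: m.offset):
        if not unique and m.field in allow_duplicates:
            kept.append(m)  # every occurrence is kept for this field
            continue
        # Unique mode compares values across all fields; the default
        # (assumed) compares values within one field.
        key = m.text if unique else (m.field, m.text)
        if key not in seen:
            seen.add(key)
            kept.append(m)
    return kept


found = [
    Match("PERSON", "Paris", 0),
    Match("PLACE", "Paris", 20),
    Match("PLACE", "Paris", 40),
]

print([m.offset for m in filter_matches(found)])                                       # [0, 20]
print([m.offset for m in filter_matches(found, allow_duplicates=frozenset({"PLACE"}))])  # [0, 20, 40]
print([m.offset for m in filter_matches(found, unique=True)])                          # [0]
```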
In some instances, you might want to get multiple results for a single match. For example, if a word can occur in different contexts, you might want to tag a document according to the occurrence of the word.
<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>
With this entity, CompanyA returns software, while CompanyB returns hardware. A match of HP might return either software or hardware. If you want to use this entity to return both software and hardware for HP, set the AllowMultipleResults configuration parameter to True.
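To make the lookup concrete, the following Python sketch parses the entity above and resolves a matched synonym to its headwords. The functions build_lookup and results_for are hypothetical, and the allow_multiple_results argument only mimics the AllowMultipleResults parameter. Returning the first headword when the option is off is an assumption, because the text says only that either headword might be returned.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

ENTITY_XML = """
<entity name="IT_industry">
   <entry headword="software">
      <synonym>CompanyA</synonym>
      <synonym>HP</synonym>
   </entry>
   <entry headword="hardware">
      <synonym>CompanyB</synonym>
      <synonym>HP</synonym>
   </entry>
</entity>
"""


def build_lookup(xml_text: str) -> Dict[str, List[str]]:
    """Map each synonym to the headwords of the entries that list it."""
    lookup = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    for entry in root.iter("entry"):
        for synonym in entry.iter("synonym"):
            lookup[synonym.text].append(entry.get("headword"))
    return dict(lookup)


def results_for(term: str, lookup: Dict[str, List[str]],
                allow_multiple_results: bool = False) -> List[str]:
    """Return the headwords for a matched term."""
    headwords = lookup.get(term, [])
    if allow_multiple_results:
        return list(headwords)
    # Assumption: only one result is returned; this model picks the first.
    return headwords[:1]


lookup = build_lookup(ENTITY_XML)
print(results_for("CompanyA", lookup))                        # ['software']
print(results_for("HP", lookup, allow_multiple_results=True))  # ['software', 'hardware']
```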