Find Repeated Matches

A repeated match occurs when Eduction finds exactly the same matched text again at a different location (a different offset) in the input. For example, if you are searching input text for telephone numbers, Eduction might find the match "564123". If the same text occurs later in the document, that occurrence is a repeated match. Repeated matches might belong to the same entity, but this is not always the case, because multiple entities can match the same text.

TIP: Eduction does not return repeated matches if you configure the engine with EnableUniqueMatches set to TRUE.

Eduction normally returns matches in the order that they appear in the input, but you might prefer to process a match, followed by all of its repeated matches, before you move on to the next unique match. When the matched text is repeated many times, this approach also gives you a convenient point at which to stop processing the repeats and move on to the next unique match, or even to the next document.
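
The difference between input order and "unique match followed by its repeats" order can be sketched without the SDK. The following Java snippet is an illustration only: it does not call the Eduction API, and the class name RepeatedMatchOrder, the method group, and the maxRepeats cap are hypothetical names introduced for this example. It assumes that you already have the matched text and offset of each match, in input order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: this class does not call the Eduction API.
class RepeatedMatchOrder {

    // Groups offsets by matched text. texts[i] is the text matched at offsets[i],
    // and both arrays are in input order. For each text, the first offset is the
    // unique match and the following offsets are its repeats. At most maxRepeats
    // repeats are kept per text, which models stopping early on heavily repeated text.
    static Map<String, List<Integer>> group(String[] texts, int[] offsets, int maxRepeats) {
        // LinkedHashMap keeps the texts in the order in which they first appear.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < texts.length; i++) {
            List<Integer> seen = grouped.computeIfAbsent(texts[i], k -> new ArrayList<>());
            // seen.size() == 0 means this is the unique match; otherwise it is a repeat.
            if (seen.size() <= maxRepeats) {
                seen.add(offsets[i]);
            }
        }
        return grouped;
    }
}
```

For example, with the texts "564123", "998877", "564123", "564123" at offsets 10, 40, 75, and 120, and maxRepeats set to 1, the sketch keeps offsets 10 and 75 for "564123" and offset 40 for "998877".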

Each Eduction API provides a way to find the next repeated match. The SDK includes sample programs, in each language, that demonstrate this functionality.

NOTE: This feature is not supported for streaming input or in table mode.

C API

In the C API, after you retrieve a match with EdkGetNextMatch, you can call EdkGetNextRepeatedMatch instead of calling EdkGetNextMatch again.

If the input contains another match with the same text as the current match, you can then call EdkGetRepeatedMatchByteOffset and EdkGetRepeatedMatchCodepointOffset to establish the location of the repeated match.

You can call EdkGetNextRepeatedMatch repeatedly. If the matched text does not occur again, Eduction returns EdkNoMatch and you can proceed to the next unique match by calling EdkGetNextMatch.

Any repeated matches that you access using EdkGetNextRepeatedMatch are not returned by subsequent calls to EdkGetNextMatch. By using EdkGetNextRepeatedMatch you are changing the order in which the matches are returned.

The following code sample demonstrates how you might use these functions. For more information about these functions, refer to the API reference documentation.

while (EdkGetNextMatch(session) == EdkSuccess)
{
    // call match accessors and do something with the information

    while (EdkGetNextRepeatedMatch(session) == EdkSuccess)
    {
        size_t nRepCodepointOffset = 0;
        size_t nRepByteOffset = 0;
        EdkGetRepeatedMatchCodepointOffset(session, &nRepCodepointOffset);
        EdkGetRepeatedMatchByteOffset(session, &nRepByteOffset);

        // do something with the offsets...
    }
} 
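
The ordering rule described in this section (a repeated match that you have already retrieved is not returned again by a later call for the next match) can be modeled in a few lines. The following Java sketch is a toy model of that rule, not the Eduction API: the class MatchCursorModel and its methods are hypothetical, matches are represented only by their text, and the return values are positions in the input-order list rather than real offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the ordering rule; it does not call the Eduction API.
class MatchCursorModel {
    private final String[] texts;      // matched text of each match, in input order
    private final boolean[] consumed;  // true once a match has been returned
    private int current = -1;          // index of the current unique match
    private int repeatScan = -1;       // last index examined while looking for repeats

    MatchCursorModel(String[] textsInInputOrder) {
        this.texts = textsInInputOrder.clone();
        this.consumed = new boolean[texts.length];
    }

    // Models "get next match": the next match that has not been returned yet, or -1.
    int nextMatch() {
        for (int i = current + 1; i < texts.length; i++) {
            if (!consumed[i]) {
                consumed[i] = true;
                current = i;
                repeatScan = i;
                return i;
            }
        }
        current = texts.length;
        return -1;
    }

    // Models "get next repeated match": the next later match with the same text
    // as the current match, or -1 if the text does not occur again.
    int nextRepeatedMatch() {
        if (current < 0 || current >= texts.length) {
            return -1;
        }
        for (int i = repeatScan + 1; i < texts.length; i++) {
            if (!consumed[i] && texts[i].equals(texts[current])) {
                consumed[i] = true;
                repeatScan = i;
                return i;
            }
        }
        repeatScan = texts.length;
        return -1;
    }

    // Visits every match, draining all repeats after each unique match,
    // and returns the order in which the input positions were visited.
    static List<Integer> drainOrder(String[] texts) {
        MatchCursorModel model = new MatchCursorModel(texts);
        List<Integer> order = new ArrayList<>();
        int match;
        while ((match = model.nextMatch()) != -1) {
            order.add(match);
            int repeat;
            while ((repeat = model.nextRepeatedMatch()) != -1) {
                order.add(repeat);
            }
        }
        return order;
    }
}
```

In this model, if you never ask for repeated matches, nextMatch simply walks the matches in input order. If you drain the repeats after each unique match, the repeats are pulled forward and later calls to nextMatch skip them.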

Java API

The repeated match functionality in the Java API is similar to that of the C API. You can iterate over the repeated matches of a match, as shown in the following code sample. To establish the position of a repeated match, call the getByteOffset() and getCodepointOffset() methods on it.

for (EDKMatch match : session)
{
    // call match accessors and do something with the information
    
    for (EDKRepeatedMatchOffset repeat : match)
    {
        long byteOffset = repeat.getByteOffset();
        long codepointOffset = repeat.getCodepointOffset();
        
        // do something with the offsets
    }
}

.NET API

The repeated match functionality in the .NET API is similar to that of the C API. You can iterate over the repeated matches of a match, as shown in the following code sample. The ByteOffset and CodepointOffset properties provide the position of a repeated match.

foreach (IExtractionMatch match in session)
{
    // call match accessors and do something with the information
    
    foreach (IExtractionRepeatedMatch repeat in match.RepeatedMatches)
    {
        long byteOffset = repeat.ByteOffset;
        long codepointOffset = repeat.CodepointOffset;
        
        // do something with the offsets
    }
}