Match Special Characters

By default, Eduction matches whole words; that is, all matches start at the beginning of a word and stop at the end of a word. Any pattern that corresponds to a substring of a word does not match the substring. For example, the pattern rain does not match the word raining.

Additionally, by default Eduction treats all punctuation as a word boundary, so you cannot match punctuation at the start of a word.

You can change this Eduction matching behavior if the default settings do not allow you to extract the information you want from your data.

Match Part of a Word

In normal use, it usually makes sense to match only whole words with entities. If you do need to allow partial matches, you can set the MatchWholeWord configuration parameter to False (see MatchWholeWord).

NOTE: This change can have a significant performance impact.

When you set MatchWholeWord to False, Eduction does not distinguish between word characters and word boundaries, so a match might begin with a word boundary character. However, if you explicitly want to match punctuation at the start of a match, OpenText recommends that you use TangibleCharacters instead of MatchWholeWord (see Match Punctuation at the Start of a Match and TangibleCharacters).

IMPORTANT: If you set MatchWholeWord to False, TangibleCharacters has no effect.
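
The difference between whole-word and partial matching can be approximated with ordinary regular expressions. The following Python sketch is an analogy only, not Eduction's implementation: the find_matches function and its match_whole_word argument are hypothetical names chosen to mirror the MatchWholeWord parameter, and a word is assumed to be a run of characters that the regex class \w accepts.

```python
import re

def find_matches(pattern, text, match_whole_word=True):
    """Toy model: return the substrings of text that match pattern.

    With match_whole_word=True, a match must not be preceded or
    followed by another word character.
    """
    if match_whole_word:
        regex = re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)")
    else:
        regex = re.compile(pattern)
    return [m.group(0) for m in regex.finditer(text)]

# Whole-word matching: "rain" is only part of the word "raining".
print(find_matches("rain", "It is raining"))                          # []
# Partial matching: the substring is returned.
print(find_matches("rain", "It is raining", match_whole_word=False))  # ['rain']
# A standalone word matches under both settings.
print(find_matches("rain", "Heavy rain today"))                       # ['rain']
```

In this model the partial-match case has to consider every character position as a possible match start, which gives some intuition for why disabling whole-word matching can cost performance.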

Match Punctuation at the Start of a Match

By default, Eduction treats all punctuation as a word boundary. When Eduction parses and tokenizes the input data, it ignores any word boundary characters (such as spaces) at the start of a word. To match punctuation characters at the start of a match, you can set the punctuation as a tangible character (see TangibleCharacters).

For example:

  • To match UK phone numbers with area codes, such as (01223) 448000, you can set TangibleCharacters to ( to include the opening parenthesis as part of the match.
  • To match negative numbers, such as -455 in the text Bob's account shows -455 pounds, you can set TangibleCharacters to - to specify that - is part of the word that you want to extract. By default, Eduction extracts 455 instead of -455, even if the - character is part of the pattern in your Eduction grammar file.
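
One way to picture the effect of tangible characters is as a change to how text is split into words. The following Python sketch is a simplified model under stated assumptions, not Eduction's tokenizer: the tokenize function and its tangible_characters argument are hypothetical, and a word character is assumed to be any alphanumeric character plus whatever has been declared tangible.

```python
def tokenize(text, tangible_characters=""):
    """Toy model: split text into words.

    Alphanumeric characters and tangible characters belong to words;
    every other character is treated as a word boundary and dropped.
    """
    tokens, current = [], ""
    for ch in text:
        if ch.isalnum() or ch in tangible_characters:
            current += ch
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

text = "Bob's account shows -455 pounds"
# Default: the minus sign is a boundary, so the word is 455.
print(tokenize(text))        # ['Bob', 's', 'account', 'shows', '455', 'pounds']
# With - tangible, the minus sign becomes part of the word.
print(tokenize(text, "-"))   # ['Bob', 's', 'account', 'shows', '-455', 'pounds']
# With ( tangible, the opening parenthesis stays attached to the area code.
print(tokenize("(01223) 448000", "("))   # ['(01223', '448000']
```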

You can also configure all punctuation marks as tangible characters, by enabling TokenWithPunctuation (see TokenWithPunctuation).

NOTE: Punctuation in this case refers to punctuation in the ASCII character set.

Punctuation marks regarded as characters from Chinese, Japanese, or Korean (CJK) languages are treated in the same way as other CJK text (see Match CJK Text).

When you set tangible characters, Eduction treats those characters as part of the word you want to match, rather than as word boundaries. For this reason, the pattern in your grammar file must explicitly include any characters you have set as tangible.

For example, if your grammar file contains an entity to match the pattern bob, by default Eduction returns a match for the phrase One day "bob" walked into town.

However, if you set TangibleCharacters to ", the same pattern does not retrieve a match for this phrase. In this case, Eduction treats " as part of the word that it has to match, so bob is not the same as "bob", and it does not match. You must include the " character in your grammar pattern to retrieve a match for "bob".

Eduction tries to match whole words inside word boundaries. If you set up a grammar rule that includes punctuation characters that you want to match, but you do not include those characters as tangible characters, the results might not be what you expect. For example:

  • Eduction ignores boundary characters if the next set of characters in the data matches a defined pattern.

    For example, if TangibleCharacters does not include !, the grammar pattern [!"bo]* (match ! or " or b or o, zero or more times) returns a match for bob when a document contains the text !bob.

  • Eduction does not return a match for boundary characters that appear on their own in documents.

  • Eduction returns a match for boundary characters if they are included in the pattern and are embedded in another match.

    For example, the grammar pattern [!"bo]* returns a match for b"o!b if a document contains the text b"o!b. If you did not include " or ! in the grammar pattern, Eduction would treat them as boundary characters and would not return a match.
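
The bob examples and the three behaviors listed above can be reproduced with a small model. The following Python sketch rests on several assumptions and is not how Eduction works internally: the educe function is hypothetical, word characters are assumed to be alphanumerics plus the tangible characters, a match is assumed to start at the first character of a word and end at the last character of a word, and Python's re module stands in for the Eduction grammar syntax. It also tries only the longest regex match at each position.

```python
import re

def educe(pattern, text, tangible_characters=""):
    """Toy model of whole-word matching with tangible characters."""
    def is_word_char(ch):
        return ch.isalnum() or ch in tangible_characters

    regex = re.compile(pattern)
    results, pos = [], 0
    while pos < len(text):
        # A match may only start where a word starts.
        starts_word = is_word_char(text[pos]) and (
            pos == 0 or not is_word_char(text[pos - 1]))
        if starts_word:
            m = regex.match(text, pos)
            if m and m.end() > pos:  # ignore empty matches
                end = m.end()
                # A match may only end where a word ends.
                ends_word = is_word_char(text[end - 1]) and (
                    end == len(text) or not is_word_char(text[end]))
                if ends_word:
                    results.append(m.group(0))
                    pos = end
                    continue
        pos += 1
    return results

phrase = 'One day "bob" walked into town'
print(educe('bob', phrase))                # ['bob']
print(educe('bob', phrase, '"'))           # []
print(educe('"bob"', phrase, '"'))         # ['"bob"']

# Leading boundary characters are skipped; embedded ones can match.
print(educe('[!"bo]*', '!bob'))            # ['bob']
print(educe('[!"bo]*', 'b"o!b'))           # ['b"o!b']
# Boundary characters on their own never produce a match.
print(educe('[!"bo]*', '! "'))             # []
```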

Examples

The following table shows some examples of the difference in Eduction matches when TangibleCharacters includes !, " and # and when it does not.

These examples use the grammar rule [A-Za-z!"#]+@com (match one or more characters, each of which is an uppercase letter, a lowercase letter, !, ", or #, followed by the string @com).

Document text                                  | Returns (TangibleCharacters does not include !, " and #) | Returns (TangibleCharacters includes !, " and #)
!Chris@com went to town to pick up some fruit  | Chris@com                                                | !Chris@com
Ch!ris@com went to town to pick up some melons | Ch!ris@com                                               | Ch!ris@com
"Chris@com" went to town to get some tea       | Chris@com                                                | "Chris@com"
!@com went to town to get nothing              | No match returned                                        | !@com

Whether to use TangibleCharacters depends on what you are trying to achieve and on the type of content you are working with. For example, it is likely to be less helpful if you are working with continuous strings of information, where punctuation characters separate possible Eduction matches.

Match CJK Text

All input data and grammars for Eduction must be encoded in UTF-8.

For characters in Chinese, Japanese, and Korean (CJK) languages, the data is tokenized character by character, with a word boundary on both sides of each character. This process ensures that when Eduction processes data that includes CJK characters, it can match the logical word unit in the data without disabling MatchWholeWord.