Match Special Characters

By default, Eduction matches whole words: all matches start at the beginning of a word and stop at the end of a word. Any pattern that corresponds to a substring of a word does not match the substring.

For example, the pattern rain does not match the word raining.
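As a rough analogy only (this uses Python's re module, not Eduction), whole-word matching behaves much like a regular expression anchored with \b word boundaries. The variable names below are illustrative.

```python
import re

# Analogy only: \b is Python's word-boundary anchor, not an Eduction feature.
whole_word = re.compile(r"\brain\b")
substring = re.compile(r"rain")

in_raining = whole_word.search("It is raining today")    # no whole-word match
in_rain = whole_word.search("The rain stopped")          # whole-word match
partial = substring.search("It is raining today")        # substring match

print(in_raining)        # None
print(in_rain.group())   # rain
print(partial.group())   # rain
```

The unanchored pattern corresponds loosely to matching part of a word, which Eduction does not do by default.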

By default, all punctuation is treated as a word boundary, and so cannot be matched at the beginning of a word.

You can change the Eduction matching behavior. You should consider your data, and determine whether you need to modify the matching behavior to extract the information that you want.

Match Part of a Word

Normally, you do not need to match part of a word. If you do, you can disable the MatchWholeWord configuration parameter. Disabling it has a small performance impact.

When you disable MatchWholeWord, Eduction does not distinguish between words and word boundaries. In this case, a word boundary character might match at the start of a match. However, if you want to match punctuation at the start of a match, you should use tangible characters instead.

NOTE:

If you set MatchWholeWord to False, any TangibleCharacters settings have no effect.

Match Punctuation at the Start of a Match

By default, all punctuation is treated as a word boundary. When Eduction parses and tokenizes the input data, it skips any word boundary characters (such as spaces) at the beginning of a word. To match punctuation characters at the beginning of a match, you can set the punctuation as a tangible character.

For example, to match UK phone numbers with area codes, such as (01223) 448000, you can set TangibleCharacters to ( so that Eduction can match the opening parenthesis at the start of a match.

Similarly, to match negative numbers, as in Bob's account shows -455 pounds, set TangibleCharacters to - to specify that - is part of the word that you want to extract. Otherwise, Eduction extracts 455, even if the - character is included in the pattern in your Eduction grammar file.

You can also configure all punctuation marks as tangible characters, by enabling TokenWithPunctuation.

NOTE:

Punctuation in this case refers to punctuation in the ASCII character set. Punctuation marks encoded in UTF-8 with multiple bytes, such as those used in Asian languages, are treated in the same way as other multiple byte characters.

If you set specific punctuation characters as tangible characters, Eduction treats those characters as part of the word you want to match, rather than as characters that indicate a word boundary. For this reason, the pattern in your grammar file must explicitly include any characters you have set as tangible.

For example, if your grammar file specifies the pattern bob, Eduction returns a match when the phrase One day "bob" walked into town appears in a document. However, if you set TangibleCharacters to ", the same pattern does not return a match in the same document, because Eduction now treats " as part of the word that it has to match: bob is not the same as "bob". To return a match for "bob", you must include the " character in your grammar pattern.
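The negative number and "bob" examples above can be approximated with a small Python sketch. This is a toy model, not Eduction's implementation: it assumes that a match must start at the start of a word and end at the end of a word, and that whitespace and all ASCII punctuation not listed as tangible count as word boundaries. The function find_matches and its tangible argument are hypothetical names used only for illustration.

```python
import re
import string

def find_matches(text, pattern, tangible=""):
    """Toy model of whole-word matching with optional tangible characters."""
    # Assumption: whitespace and non-tangible ASCII punctuation are boundaries.
    boundary = set(string.whitespace) | (set(string.punctuation) - set(tangible))

    def is_boundary(i):
        return i < 0 or i >= len(text) or text[i] in boundary

    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos < len(text):
        # A match may only start on a non-boundary character that follows a boundary.
        if not is_boundary(pos) and is_boundary(pos - 1):
            m = regex.match(text, pos)
            # The match must also end on a non-boundary character followed by a boundary.
            if m and m.end() > pos and not is_boundary(m.end() - 1) and is_boundary(m.end()):
                results.append(m.group())
                pos = m.end()
                continue
        pos += 1
    return results

account = "Bob's account shows -455 pounds"
print(find_matches(account, r"-?\d+"))                # ['455']
print(find_matches(account, r"-?\d+", tangible="-"))  # ['-455']

story = 'One day "bob" walked into town'
print(find_matches(story, r"bob"))                    # ['bob']
print(find_matches(story, r"bob", tangible='"'))      # []
print(find_matches(story, r'"bob"', tangible='"'))    # ['"bob"']
```

The sketch only tries the greedy regular expression match at each word start, so it does not reproduce every case that Eduction handles; it is meant to show why a tangible character must also appear in the pattern.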

Eduction tries to match whole words within word boundaries. If you have set up a grammar rule that includes punctuation characters that you want to match, but you have not specified that those characters are tangible characters rather than characters that mark word boundaries, the results might not be what you expected.

The following table shows examples of the difference in Eduction matches when TangibleCharacters includes !, " and # and when it does not, for the grammar rule [A-Za-z!"#]+@com (one or more characters, each of which is an uppercase or lowercase letter, !, " or #, followed by the string @com).

Document text | Returns (TangibleCharacters does not include !, " and #) | Returns (TangibleCharacters includes !, " and #)
!Chris@com went to town to pick up some fruit | Chris@com | !Chris@com
Ch!ris@com went to town to pick up some melons | Ch!ris@com | Ch!ris@com
"Chris@com" went to town to get some tea | Chris@com | "Chris@com"
!@com went to town to get nothing | No match returned | !@com

Your use of the TangibleCharacters setting should be determined by what you are trying to achieve, and by the type of content you are working with. For example, it is likely to be less helpful if you are working with continuous strings of information, where punctuation characters are used to separate possible Eduction matches.

Match Multiple Byte Characters

All input data and grammars for Eduction must be encoded in UTF-8. For characters that have more than one byte in UTF-8, typically Chinese, Japanese, and Korean (CJK) language characters, the data is tokenized character by character, with a word boundary on both sides of each character. This process ensures that when Eduction processes data that includes CJK characters, the logical word unit in the data can be matched without disabling MatchWholeWord.
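The character-by-character tokenization described above can be illustrated with a short Python sketch. It is a simplification, not Eduction's tokenizer: it assumes that only whitespace separates single-byte characters, and the function name tokenize is hypothetical.

```python
def tokenize(text):
    """Toy tokenizer: each multibyte UTF-8 character becomes its own token."""
    tokens = []
    word = ""
    for ch in text:
        if len(ch.encode("utf-8")) > 1:
            # Multibyte character: close any open word, then emit it alone.
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)
        elif ch.isspace():
            # Assumption: whitespace is the only single-byte boundary here.
            if word:
                tokens.append(word)
                word = ""
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

print(tokenize("Tokyo 東京 2020"))  # ['Tokyo', '東', '京', '2020']
```

Because each CJK character is its own token in this model, a pattern for a single character or a sequence of characters can match without relaxing whole-word matching.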

