By default, Eduction matches whole words: all matches start at the beginning of a word and stop at the end of a word. Any pattern that corresponds to a substring of a word does not match the substring.
The pattern rain does not match the word raining.
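As a rough analogy only, the following Python sketch uses the standard re module (not the Eduction engine) to contrast whole-word matching with substring matching. The \b anchors stand in for Eduction's default whole-word behavior; the sample text is invented for illustration.

```python
import re

text = "It was raining, then the rain stopped."

# \b anchors the pattern to word boundaries, similar in spirit to
# whole-word matching: "rain" inside "raining" is not matched.
whole_word = [m.group() for m in re.finditer(r"\brain\b", text)]

# Without the anchors, the pattern also matches inside "raining".
substring = [m.group() for m in re.finditer(r"rain", text)]

print(len(whole_word))  # 1
print(len(substring))   # 2
```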
By default, all punctuation is treated as a word boundary, and so cannot be matched at the beginning of a word.
You can change the Eduction matching behavior. You should consider your data, and determine whether you need to modify the matching behavior to extract the information that you want.
Normally, you do not need to match part of a word. If you do, you can disable the MatchWholeWord configuration parameter. This has a small performance impact.
When you disable MatchWholeWord, Eduction does not distinguish between words and word boundaries. In this case, a word boundary character might be matched at the start of a match. However, if you want to match punctuation at the start of a match, you should use tangible characters.
Note: If you set MatchWholeWord to False, any TangibleCharacters settings have no effect.
By default, all punctuation is treated as a word boundary. When Eduction parses and tokenizes the input data, it passes over all the word boundaries at the beginning of a word (like spaces). To match punctuation characters at the beginning of a match, you can set the punctuation as a tangible character.
Match UK phone numbers with area codes: (01223) 448000. In this case, you can set TangibleCharacters to ( so that Eduction can match the opening parenthesis at the start of a match.
Match negative numbers: Bob's account shows -455 pounds. Unless you set TangibleCharacters to - to specify that - is part of the word that you want to extract, Eduction extracts 455 instead, even if the - character is included in the pattern to match in your Eduction grammar file.
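The effect of tangible characters on tokenization can be illustrated with a toy tokenizer. The following Python sketch is not Eduction's implementation: the tokenize function and its tangible argument are hypothetical names, and the model simply assumes that whitespace and ASCII punctuation are word boundaries unless a character is listed as tangible.

```python
import string

def tokenize(text, tangible=""):
    """Split text into words (toy model, hypothetical helper).

    Whitespace and ASCII punctuation act as word boundaries,
    except for characters listed in `tangible`.
    """
    boundaries = (set(string.punctuation) - set(tangible)) | set(string.whitespace)
    words, current = [], []
    for ch in text:
        if ch in boundaries:
            # A boundary closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

sentence = "Bob's account shows -455 pounds."
print(tokenize(sentence))                # [..., 'shows', '455', 'pounds']
print(tokenize(sentence, tangible="-"))  # [..., 'shows', '-455', 'pounds']
print(tokenize("(01223) 448000", tangible="("))  # ['(01223', '448000']
```

In this model, the minus sign and the opening parenthesis survive as part of a word only when they are listed as tangible, which mirrors the two examples above.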
You can also configure all punctuation marks as tangible characters, by enabling TokenWithPunctuation.
Note: Punctuation in this case refers to punctuation in the ASCII character set. Punctuation marks encoded in UTF-8 with multiple bytes, such as those used in Asian languages, are treated in the same way as other multiple byte characters.
If you set specific punctuation characters as tangible characters, Eduction treats those characters as part of the word you want to match, rather than as characters that indicate a word boundary. For this reason, the pattern in your grammar file must explicitly include any characters you have set as tangible.
If you specify in your grammar file that Eduction should match the pattern bob, Eduction returns a match if, for example, the phrase One day "bob" walked in to town appears in a document. However, if you set TangibleCharacters to ", the same pattern does not retrieve a match in the same document. This is because Eduction in this case treats " as part of the word that it has to match; bob is not the same as "bob", so a match is not found. You must include the " character in your grammar pattern to retrieve a match for "bob".
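The bob example can be reproduced with a small Python model. This is a sketch under simplifying assumptions, not Eduction itself: the words and matches_whole_word helpers are hypothetical, the pattern is treated as a literal string, and a match is assumed to require equality with a whole word.

```python
import re
import string

def words(text, tangible=""):
    """Split on whitespace and on ASCII punctuation that is not tangible."""
    boundary = string.whitespace + "".join(
        c for c in string.punctuation if c not in tangible
    )
    return [w for w in re.split("[" + re.escape(boundary) + "]+", text) if w]

def matches_whole_word(pattern, text, tangible=""):
    """Return True if some whole word in text equals the literal pattern."""
    return pattern in words(text, tangible)

doc = 'One day "bob" walked in to town'

print(matches_whole_word("bob", doc))                  # True
print(matches_whole_word("bob", doc, tangible='"'))    # False
print(matches_whole_word('"bob"', doc, tangible='"'))  # True
```

Once the quotation mark is tangible, the word in the document becomes "bob" with its quotes, so only a pattern that includes the quotes matches.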
Eduction tries to match whole words within word boundaries. If you have set up a grammar rule that includes punctuation characters that you want to match, but you have not specified that those characters are tangible characters rather than characters that mark word boundaries, the results might not be what you expected. For example:
Eduction ignores boundary characters if the next set of characters in the data matches a defined pattern. This means that if TangibleCharacters does not include !, the grammar pattern [!"bo]* (match ! or " or b or o, zero or many times) returns a match for bob if a document contains the text !bob.
Eduction does not return a match for boundary characters that appear on their own in documents.
Eduction returns a match for boundary characters if they are included in the pattern and are embedded in another match. For example, the grammar pattern [!"bo]* returns a match for b"o!b if a document contains the text b"o!b. If you did not include " or ! in the grammar pattern, Eduction would treat them as boundary characters and would not return a match.
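The three behaviors above can be approximated with a short Python sketch. It is a toy model rather than the Eduction algorithm: the find_matches function is hypothetical, Python's re syntax stands in for the grammar pattern, and the model assumes that a match must start on a non-boundary character at the start of a word and end where a word ends.

```python
import re
import string

def find_matches(pattern, text, tangible=""):
    """Toy model of whole-word matching with boundary characters."""
    boundaries = (set(string.punctuation) | set(string.whitespace)) - set(tangible)
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos < len(text):
        # Leading boundary characters are passed over: a match may only
        # begin on a non-boundary character that starts a word.
        at_word_start = text[pos] not in boundaries and (
            pos == 0 or text[pos - 1] in boundaries
        )
        m = regex.match(text, pos) if at_word_start else None
        # The match must be non-empty and must stop at the end of a word.
        if m and m.end() > pos and (
            m.end() == len(text) or text[m.end()] in boundaries
        ):
            results.append(m.group())
            pos = m.end()
        else:
            pos += 1
    return results

pattern = r'[!"bo]*'
print(find_matches(pattern, "!bob"))   # ['bob']
print(find_matches(pattern, 'b"o!b'))  # ['b"o!b']
print(find_matches(pattern, '! "'))    # []
```

The leading ! is skipped, boundary characters on their own produce nothing, and boundary characters embedded in a word are matched because the pattern includes them.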
The following table shows examples of the difference in Eduction matches when TangibleCharacters includes !, " and # and when it does not, for the grammar rule [A-Za-z!"#]+@com (match one or more uppercase or lowercase letters, or !, " or #, followed by the string @com).
Document text | Returns (TangibleCharacters does not include !, " and #) | Returns (TangibleCharacters includes !, " and #) |
---|---|---|
!Chris@com went to town to pick up some fruit | Chris@com | !Chris@com |
Ch!ris@com went to town to pick up some melons | Ch!ris@com | Ch!ris@com |
"Chris@com" went to town to get some tea | Chris@com | "Chris@com" |
!@com went to town to get nothing | No match returned | !@com |
Your use of the TangibleCharacters setting should be determined by what you are trying to achieve, and by the type of content you are working with. For example, it is likely to be less helpful if you are working with continuous strings of information, where punctuation characters are used to separate possible Eduction matches.
All input data and grammars for Eduction must be encoded in UTF-8. For characters that have more than one byte in UTF-8, typically Chinese, Japanese, and Korean (CJK) language characters, the data is tokenized character by character, with a word boundary on both sides of each character. This process ensures that when Eduction processes data that includes CJK characters, the logical word unit in the data can be matched without disabling MatchWholeWord.