Character Tokenization
You can tokenize characters into N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram group.
NOTE: You must not use NGram with the SentenceBreaking configuration parameter.
For example, if you set NGram to 2, Content tokenizes the word hello as:

he el ll lo
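The sliding-window behavior described above can be sketched in a few lines of Python. This is an illustration of character N-gram tokenization, not the Content component's implementation; the function name is made up for the example.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character N-grams of length n,
    mirroring the NGram=2 example: 'hello' -> he el ll lo."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
```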
To tokenize only multibyte strings, set NGramMultiByteOnly to True.
[Japanese]
NGram=2
NGramMultiByteOnly=True
With this configuration, if a document contains both English and Asian (multibyte) text, Content tokenizes the Asian text according to the NGram parameter, but does not apply N-gram tokenization to the English text.
To tokenize only multibyte strings in Chinese, Japanese, and Korean characters (and ignore multibyte strings in other languages), set NGramSentenceBrokenScriptOnly to True.
[Japanese]
NGram=2
NGramSentenceBrokenScriptOnly=True
With this configuration, if a document contains multibyte text in both Japanese and Greek, Content tokenizes the Japanese text according to the NGram parameter, but does not apply N-gram tokenization to the Greek multibyte text.
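The script-based filter can be sketched by checking each character against a few common CJK Unicode blocks. The block ranges below are illustrative and not exhaustive, and this is not the component's actual script table; Greek letters fall outside these ranges, so they are left as whole words while Japanese text is N-grammed.

```python
import itertools

def is_cjk(ch):
    """Rough CJK test via a few common Unicode blocks (illustrative only)."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30FF      # Hiragana and Katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul Syllables

def tokenize_cjk_only(text, n=2):
    """Apply character N-grams only to runs of CJK characters;
    other scripts (Latin, Greek, ...) are kept as whole words."""
    tokens = []
    for cjk, group in itertools.groupby(text, key=is_cjk):
        run = "".join(group)
        if cjk:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
        else:
            tokens.extend(run.split())
    return tokens

print(tokenize_cjk_only("東京都 αβγ"))  # ['東京', '京都', 'αβγ']
```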