You can tokenize text into character N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram.
You must not use NGram with the SentenceBreaking configuration parameter.
For example, if you set NGram to 2, HPE IDOL Server tokenizes the word Hello as:
he el ll lo
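The overlapping grouping shown above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not HPE IDOL Server's implementation: the function name char_ngrams is hypothetical, the lowercasing simply mirrors the example output, and the handling of words no longer than N is an assumption.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character N-grams of a word, lowercased."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    text = word.lower()  # assumption: mirrors the lowercase example output
    if not text:
        return []
    if len(text) <= n:
        # Assumption: a word no longer than N is kept as a single token.
        return [text]
    # Slide a window of n characters across the word, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(" ".join(char_ngrams("Hello", 2)))  # he el ll lo
```

A word of length L yields L - N + 1 groups, so larger values of N produce fewer, longer terms.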
To tokenize only multibyte strings, set NGramMultiByteOnly to True.
[Japanese]
NGram=2
NGramMultiByteOnly=True
For this configuration, if you have a document that contains both English and Asian (multibyte) text, HPE IDOL Server tokenizes the Asian text according to the NGram parameter. It does not tokenize the English text.
To tokenize only multibyte strings that consist of Chinese, Japanese, and Korean characters (and ignore multibyte strings in other languages), set NGramOrientalOnly to True.
[Japanese]
NGram=2
NGramOrientalOnly=True
For this configuration, if you have a document that contains multibyte text in both Japanese and Greek, HPE IDOL Server tokenizes the Japanese text according to the NGram parameter. It does not tokenize the Greek multibyte text.
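The difference between the two restrictions can be illustrated with a rough Python sketch. It rests on several assumptions that the text above does not confirm: "multibyte" is taken to mean a character that needs more than one byte in UTF-8, Chinese, Japanese, and Korean characters are approximated by their Unicode character names, and text that is not N-gram tokenized is left as whole words. The names ngrams, is_multibyte, is_cjk, and tokenize are hypothetical and are not part of HPE IDOL Server.

```python
import unicodedata
from typing import Callable


def ngrams(word: str, n: int) -> list[str]:
    """Overlapping character N-grams; short words are kept whole (assumption)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def is_multibyte(ch: str) -> bool:
    # Assumption: "multibyte" means more than one byte when encoded as UTF-8.
    return len(ch.encode("utf-8")) > 1


def is_cjk(ch: str) -> bool:
    # Approximation: detect CJK ideographs, kana, and Hangul by character name.
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL"))


def tokenize(text: str, n: int, applies: Callable[[str], bool]) -> list[str]:
    """N-gram only the words whose characters all satisfy `applies`."""
    tokens: list[str] = []
    for word in text.split():
        if all(applies(ch) for ch in word):
            tokens.extend(ngrams(word, n))
        else:
            tokens.append(word)  # left as a whole word
    return tokens


sample = "東京都 hello αβγδ"
print(tokenize(sample, 2, is_multibyte))  # like NGramMultiByteOnly=True
print(tokenize(sample, 2, is_cjk))        # like NGramOrientalOnly=True
```

Under these assumptions, the first call splits both the Japanese and the Greek word into 2-grams, while the second splits only the Japanese word and leaves the Greek word intact.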