SentenceBreakingOptions
Additional options to use for the Chinese and Japanese sentence-breaking libraries.
Options for the Chinese Sentence-Breaking Library
- traditional. Use a traditional Chinese dictionary. By default, Chinese sentence breaking uses the simplified dictionary.
Options for the Japanese Sentence-Breaking Library
For the Japanese sentence-breaking library, SentenceBreakingOptions specifies how IDOL Server normalizes equivalent multi-byte characters, to ensure that it tokenizes terms that contain equivalent characters in the same way. You can use one or more of the following options:
- kana. Normalizes half-width and full-width Kana characters.
- oldnew. Normalizes old Kanji characters (Kyujitai) and new Kanji characters (Shinjitai).
- hyphen. Normalizes Katakana prolonged sound marks (similar to a Latin hyphen) at the end of a Katakana term when the Katakana term (including the hyphen) is four letters or more.
- dbcs. Normalizes the Japanese Double Byte Character Set (DBCS) alphabet and the ASCII Single Byte Character Set (SBCS) alphabet.
- number. Normalizes Japanese Kanji numbers and ASCII numbers. Currently, it supports only a single digit.
- weakeol. Ignores new line characters ("\r" and "\n") at indexing time, to ensure that IDOL Server correctly tokenizes Japanese terms that break across lines. It ignores new lines only for multi-byte characters.
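To illustrate what "equivalent characters" means for the kana and dbcs options, the following minimal sketch uses Unicode NFKC normalization from the Python standard library. This is only a stand-in for the idea of width folding: it is not IDOL Server's implementation, and the function name normalize_width is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold half-width and full-width variants to one form using NFKC.

    Illustrative only: IDOL Server's own normalization rules may differ.
    """
    return unicodedata.normalize("NFKC", text)


half_width_kana = "ｶﾀｶﾅ"    # half-width Katakana
full_width_kana = "カタカナ"  # the same term in full-width Katakana
full_width_latin = "ＩＤＯＬ"  # full-width (double-byte) Latin letters

# After folding, both spellings of each term compare equal,
# so a tokenizer would treat them as the same term.
print(normalize_width(half_width_kana) == full_width_kana)
print(normalize_width(full_width_latin) == "IDOL")
```

Without this kind of folding, a query typed in one width would not match content indexed in the other.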
Separate multiple options with commas. There must be no space before or after a comma.
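The comma rule is easy to get wrong when editing a configuration file by hand. The following sketch checks a value against the Japanese options listed above before you save it. The helper check_options and its messages are hypothetical and are not part of IDOL Server.

```python
import re

# Option names taken from the Japanese sentence-breaking list above.
JAPANESE_OPTIONS = {"kana", "oldnew", "hyphen", "dbcs", "number", "weakeol"}


def check_options(value):
    """Return a list of problems found in a SentenceBreakingOptions value."""
    problems = []
    # The value must not contain a space before or after a comma.
    if re.search(r"\s,|,\s", value):
        problems.append("whitespace next to a comma")
    for option in value.split(","):
        name = option.strip()
        if name not in JAPANESE_OPTIONS:
            problems.append(f"unknown option: {name!r}")
    return problems


print(check_options("kana,dbcs,number,weakeol"))
print(check_options("kana, dbcs"))
```

An empty list means the value follows the rules described here. It does not confirm that the server accepts the value.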
Type: | String |
Default: | |
Required: | No |
Configuration Section: | LanguageTypes or MyLanguage |
Example: | SentenceBreakingOptions=kana,dbcs,number,weakeol |
See Also: | SentenceBreaking, NGram |
NOTE: If you change this setting after you have indexed content into IDOL Server, the new setting applies only to new content, and the server logs a warning. To clear the warning and ensure that your change applies to all your content, you must initialize your index and reindex the content.