Additional options to use for the Chinese and Japanese sentence-breaking libraries.
For the Chinese sentence-breaking library, you can use the following option:
traditional. Uses a traditional Chinese dictionary. By default, Chinese sentence breaking uses the simplified dictionary.
For the Japanese sentence-breaking library, SentenceBreakingOptions specifies how HPE Content Component normalizes equivalent multi-byte characters, to ensure that it tokenizes terms that contain equivalent characters in the same way. You can use one or more of the following options:
kana. Normalizes half-width and full-width Kana characters.
oldnew. Normalizes old Kanji characters (Kyujitai) and new Kanji characters (Shinjitai).
hyphen. Normalizes Katakana prolonged sound marks (similar to a Latin hyphen) at the end of a Katakana term when the term, including the mark, is four or more characters long.
dbcs. Normalizes the Japanese Double Byte Character Set (DBCS) alphabet and the ASCII Single Byte Character Set (SBCS) alphabet.
number. Normalizes Japanese Kanji numbers and ASCII numbers. Currently, only single-digit numbers are supported.
weakeol. Ignores new line characters ("\r" and "\n") at indexing time, to ensure that HPE Content Component correctly tokenizes Japanese terms that break across lines. New lines are ignored only when they occur between multi-byte characters.
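To make the effect of the kana, dbcs, and weakeol options more concrete, the following Python sketch imitates them outside the product. It is an illustration only: this reference does not describe how HPE Content Component implements normalization, so Unicode NFKC is used here as a stand-in, and the helper names `normalize_width` and `drop_weak_newlines` are hypothetical. NFKC also folds other compatibility characters, so it is broader than the options it illustrates.

```python
import re
import unicodedata


def normalize_width(text: str) -> str:
    """Fold half-width Kana and full-width Latin letters to a single form.

    Assumption: NFKC stands in for the kana and dbcs options; the
    component's actual algorithm and target form are not documented here.
    """
    return unicodedata.normalize("NFKC", text)


# Treat a line break as ignorable only when a non-ASCII character sits on
# both sides of it. This is one reading of the weakeol description.
_WEAK_EOL = re.compile(r"(?<=[^\x00-\x7f])[\r\n]+(?=[^\x00-\x7f])")


def drop_weak_newlines(text: str) -> str:
    """Remove line breaks that split a run of multi-byte characters."""
    return _WEAK_EOL.sub("", text)


# Half-width and full-width Katakana normalize to the same string.
print(normalize_width("ｶﾀｶﾅ") == normalize_width("カタカナ"))
# Full-width (DBCS) Latin letters normalize to their ASCII equivalents.
print(normalize_width("ＩＤＯＬ") == "IDOL")
# A break inside a Japanese term is dropped; ASCII text is left untouched.
print(drop_weak_newlines("東京\r\n都"))
print(drop_weak_newlines("abc\ndef") == "abc\ndef")
```

Once equivalent forms map to one string, a term tokenizes the same way whichever form a document uses, which is the purpose the options above describe.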
Separate multiple options with commas. There must be no space before or after a comma.
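Because a stray space or a misspelled option name is easy to miss in a hand-edited configuration file, a small helper can assemble the value. The Python sketch below is not part of the product; `build_sentence_breaking_options` and `KNOWN_OPTIONS` are hypothetical names, and the set of accepted options is taken only from the lists on this page.

```python
# Option names documented on this page (Chinese and Japanese libraries).
KNOWN_OPTIONS = {
    "traditional",
    "kana", "oldnew", "hyphen", "dbcs", "number", "weakeol",
}


def build_sentence_breaking_options(options):
    """Return a SentenceBreakingOptions line with no spaces around commas."""
    cleaned = []
    for option in options:
        name = option.strip()  # remove accidental surrounding whitespace
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown sentence-breaking option: {name!r}")
        if name not in cleaned:  # keep first occurrence, preserve order
            cleaned.append(name)
    return "SentenceBreakingOptions=" + ",".join(cleaned)


print(build_sentence_breaking_options(["kana", " dbcs", "number ", "weakeol"]))
```

The helper does not check that the chosen options match the language section they are placed in; for example, it would accept traditional alongside the Japanese options.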
Type: | String |
Default: | |
Required: | No |
Configuration Section: | LanguageTypes or MyLanguage |
Example: | SentenceBreakingOptions=kana,dbcs,number,weakeol |