SentenceBreakingOptions

Additional options to use for the Chinese and Japanese sentence-breaking libraries.

Options for the Chinese Sentence-Breaking Library

  • traditional. Use a traditional Chinese dictionary. By default, Chinese sentence breaking uses the simplified dictionary.

Options for the Japanese Sentence-Breaking Library

For the Japanese sentence-breaking library, SentenceBreakingOptions specifies how IDOL Server normalizes equivalent multi-byte characters, to ensure that it tokenizes terms that contain equivalent characters in the same way. You can use one or more of the following options:

  • kana. Normalizes half-width and full-width Kana characters.

  • oldnew. Normalizes old Kanji characters (Kyujitai) and new Kanji characters (Shinjitai).

  • hyphen. Normalizes Katakana prolonged sound marks (similar to a Latin hyphen) at the end of a Katakana term when the Katakana term (including the hyphen) is four letters or more.

  • dbcs. Normalizes the Japanese Double Byte Character Set (DBCS) alphabet and the ASCII Single Byte Character Set (SBCS) alphabet.

  • number. Normalizes Japanese Kanji numbers and ASCII numbers. Currently, this option supports only single digits.

  • weakeol. Ignores new line characters ("\r" and "\n") at indexing time, to ensure that IDOL Server correctly tokenizes Japanese terms that break across lines. It ignores new lines only for multi-byte characters.
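As a rough illustration of what normalizing equivalent characters means for the kana and dbcs options, the following Python sketch uses Unicode NFKC normalization as a stand-in. This is an assumption made for demonstration only: it is not how IDOL Server implements these options, the helper name fold_width is hypothetical, and NFKC does not reproduce the oldnew, hyphen, number, or weakeol behavior.

```python
import unicodedata


def fold_width(text):
    """Fold half-width and full-width variants to one canonical form.

    NFKC is used here only as an approximation of the kana and dbcs
    options; it is not IDOL Server's own normalization.
    """
    return unicodedata.normalize("NFKC", text)


# Half-width Katakana and full-width Katakana spellings of the same term.
half_width = "ｶﾞｲﾄﾞ"
full_width = "ガイド"
print(fold_width(half_width) == full_width)

# Full-width (DBCS) Latin letters and digits versus their ASCII forms.
print(fold_width("ＩＤＯＬ１２３") == "IDOL123")
```

After folding, both spellings of a term compare as equal, which is the property the options above aim for at tokenization time.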

Separate multiple options with commas. There must be no space before or after a comma.
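The formatting rule can be checked before you deploy a configuration file. The following Python sketch is a hypothetical helper, not part of IDOL Server: it flags spaces around commas and option names that do not appear in the lists above. How the server itself treats a malformed value is not stated here, so the checks are a conservative assumption.

```python
# Option names taken from the Chinese and Japanese lists in this section.
KNOWN_OPTIONS = {
    "traditional",
    "kana", "oldnew", "hyphen", "dbcs", "number", "weakeol",
}


def find_problems(value):
    """Return a list of problems found in a SentenceBreakingOptions value."""
    problems = []
    if value == "":
        # An empty value is allowed: the parameter is not required.
        return problems
    for item in value.split(","):
        if item != item.strip():
            problems.append("whitespace around %r" % item.strip())
        elif item not in KNOWN_OPTIONS:
            problems.append("unknown option %r" % item)
    return problems


print(find_problems("kana,dbcs,number,weakeol"))
print(find_problems("kana, dbcs"))
```

The helper does not check whether an option matches the language of the configuration section (for example, traditional applies to Chinese only).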

Type: String
Default:  
Required: No
Configuration Section: LanguageTypes or MyLanguage
Example: SentenceBreakingOptions=kana,dbcs,number,weakeol
See Also: SentenceBreaking, NGram

NOTE: If you change this setting after you have indexed content into IDOL Server, the new setting applies only to new content, and the server logs a warning. To clear the warning and ensure that your change applies to all your content, you must initialize your index and reindex the content.