SentenceBreakingOptions

Additional options to use for the Chinese and Japanese sentence-breaking libraries.

Options for the Chinese Sentence-Breaking Library

  • traditional. Uses a traditional Chinese dictionary. By default, Chinese sentence breaking uses the simplified dictionary.

Options for the Japanese Sentence-Breaking Library

For the Japanese sentence-breaking library, SentenceBreakingOptions specifies how IDOL Content Component normalizes equivalent multi-byte characters, to ensure that it tokenizes terms that contain equivalent characters in the same way. You can use one or more of the following options:

  • kana. Normalizes half-width and full-width Kana characters.

  • oldnew. Normalizes old Kanji characters (Kyujitai) and new Kanji characters (Shinjitai).

  • hyphen. Normalizes Katakana prolonged sound marks (similar to a Latin hyphen) at the end of a Katakana term when the Katakana term (including the hyphen) is four or more characters long.

  • dbcs. Normalizes the Japanese Double Byte Character Set (DBCS) alphabet and the ASCII Single Byte Character Set (SBCS) alphabet.

  • number. Normalizes Japanese Kanji numbers and ASCII numbers. Currently, this option supports only single-digit numbers.

  • weakeol. Ignores new line characters ("\r" and "\n") at indexing time, to ensure that IDOL Content Component correctly tokenizes Japanese terms that break across lines. It ignores new lines only for multi-byte characters.
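
To give a feel for what normalizing equivalent characters means, the following Python sketch uses the standard library's Unicode NFKC normalization as a rough analogy for the kana and dbcs options. It is not how IDOL Content Component implements these options: the function name fold_width is hypothetical, and NFKC folds more than width variants.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold half-width and full-width variants to one form using Unicode NFKC.

    Assumption: this only approximates the effect of the kana and dbcs
    options. It is not IDOL's actual normalization.
    """
    return unicodedata.normalize("NFKC", text)


# Half-width Katakana and full-width Katakana fold to the same string.
print(fold_width("ｶﾀｶﾅ") == fold_width("カタカナ"))

# Full-width (DBCS) Latin letters and digits fold to their ASCII forms.
print(fold_width("ＩＤＯＬ１０") == "IDOL10")
```

Once both spellings fold to the same string, a tokenizer produces the same term for either one, which is the outcome the options above aim for.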

Separate multiple options with commas. There must be no space before or after a comma.
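
The formatting rule can be checked before a value reaches the configuration file. The Python sketch below is a hypothetical helper, not part of IDOL: the name parse_sentence_breaking_options and the KNOWN_OPTIONS set are assumptions built only from the options listed on this page, and the check is slightly stricter than the rule because it rejects whitespace anywhere in the value.

```python
from typing import List

# Option names taken from this page (assumption: no other values exist).
KNOWN_OPTIONS = {
    "traditional",                                            # Chinese
    "kana", "oldnew", "hyphen", "dbcs", "number", "weakeol",  # Japanese
}


def parse_sentence_breaking_options(value: str) -> List[str]:
    """Split a SentenceBreakingOptions value and reject malformed input."""
    if value == "":
        return []  # no options set
    if any(ch.isspace() for ch in value):
        raise ValueError("whitespace is not allowed in SentenceBreakingOptions")
    options = value.split(",")
    unknown = [opt for opt in options if opt not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f"unknown option(s): {unknown}")
    return options
```

For example, parse_sentence_breaking_options("kana,dbcs,number,weakeol") returns the four option names in order, whereas "kana, dbcs" raises ValueError because of the space after the comma. The helper does not check whether the chosen options match the language being configured.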

Type: String
Default:  
Required: No
Configuration Section: LanguageTypes or MyLanguage
Example: SentenceBreakingOptions=kana,dbcs,number,weakeol
See Also: SentenceBreaking, NGram

NOTE: If you change this setting after you have indexed content into IDOL Server, the new setting applies only to new content, and the server logs a warning. To clear the warning and ensure that your change applies to all your content, you must initialize your index and reindex the content.