Unicode Script Ranges
The following table lists the Unicode script ranges that IDOL Server identifies.
Script | Begin | End |
---|---|---|
Arabic | U+0600 | U+06FF |
BasicLatin | U+0000 | U+007F |
Bengali | U+0981 | U+09FB |
Burmese | U+1000 | U+109F |
CJKComp | U+3300 | U+33FF |
CJKComp | U+2F00 | U+2FDF |
CJKComp | U+FE30 | U+FE4F |
CJKCompIdeo | U+F900 | U+FAFF |
CJKCompIdeo | U+2F800 | U+2FA1F |
CJKRadicalsSup | U+2E80 | U+2EFF |
CJKRadicalsSup | U+3000 | U+303F |
CJKRadicalsSup | U+31C0 | U+31EF |
CJKUnifIdeo | U+4E00 | U+9FFF |
CJKUnifIdeo | U+20000 | U+2A6D6 |
CJKUnifIdeo | U+2A700 | U+2B73F |
CJKUnifIdeo | U+2B740 | U+2B81F |
CJKUnifIdeo | U+3200 | U+32FF |
CJKUnifIdeo | U+2FF0 | U+2FFF |
CJKUnifIdeoExtA | U+3400 | U+4DBF |
Cyrillic | U+0400 | U+04FF |
Cyrillic | U+0500 | U+052F |
Cyrillic | U+2DE0 | U+2DFF |
Cyrillic | U+A640 | U+A69F |
Devanagari | U+0901 | U+097F |
Ethiopic | U+1200 | U+1399 |
Georgian | U+10A0 | U+10FF |
GreekAndCoptic | U+0370 | U+03FF |
GreekAndCoptic | U+1F00 | U+1FFF |
Gujarati | U+0A81 | U+0AF1 |
Hangul | U+AC00 | U+D7A3 |
Hangul | U+1100 | U+11FF |
Hangul | U+3130 | U+318F |
Hangul | U+A960 | U+A97F |
Hangul | U+D7B0 | U+D7FF |
Hebrew | U+0590 | U+05FF |
Hiragana | U+3040 | U+309F |
Kannada | U+0C82 | U+0CF2 |
Katakana | U+30A0 | U+30FF |
Katakana | U+31F0 | U+31FF |
Lao | U+0E81 | U+0EDF |
Latin1Sup | U+0080 | U+00FF |
LatinExtA | U+0100 | U+017F |
LatinExtB | U+0180 | U+024F |
Malayalam | U+0D02 | U+0D7F |
Mongolian | U+1800 | U+18AA |
OrientalMisc | U+3105 | U+312C |
OrientalMisc | U+31A0 | U+31BF |
OrientalMisc | U+3190 | U+319F |
OrientalMisc | U+4DC0 | U+4DFF |
Oriya | U+0B01 | U+0B77 |
Sinhala | U+0D82 | U+0DF4 |
Tamil | U+0B82 | U+0BFA |
Telugu | U+0C01 | U+0C7F |
Thai | U+0E01 | U+0E5B |
Tibetan | U+0F00 | U+0FDA |
Vietnamese | U+1EA0 | U+1EF9 |
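As an illustration of how the table can be read, the following minimal Python sketch maps a character to a script name by checking its code point against the Begin and End values. It copies only a few rows of the table, and the names `SCRIPT_RANGES` and `script_of` are hypothetical. It is not IDOL Server code and does not show how IDOL Server performs the lookup internally.

```python
# Illustrative subset of the table above: (script name, begin, end), inclusive.
# SCRIPT_RANGES and script_of are hypothetical names, not part of IDOL Server.
SCRIPT_RANGES = [
    ("BasicLatin", 0x0000, 0x007F),
    ("Latin1Sup", 0x0080, 0x00FF),
    ("GreekAndCoptic", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Arabic", 0x0600, 0x06FF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJKUnifIdeo", 0x4E00, 0x9FFF),
    ("Hangul", 0xAC00, 0xD7A3),
]


def script_of(char):
    """Return the name of the first listed range that contains char, or None."""
    code_point = ord(char)
    for name, begin, end in SCRIPT_RANGES:
        if begin <= code_point <= end:
            return name
    return None


if __name__ == "__main__":
    for sample in ["A", "ж", "あ", "中", "한"]:
        print(f"U+{ord(sample):04X} -> {script_of(sample)}")
```

Characters that fall outside every listed range return `None` in this sketch. Extending it to the full table only requires adding the remaining rows to `SCRIPT_RANGES`.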
Chinese, Japanese, and Korean Scripts
When processing text, IDOL Server identifies the script range that a character belongs to. In some cases, the script range can determine how that part of the text is processed. For example, when a language has NGramSentenceBrokenScriptOnly set to True in the configuration, IDOL Server only produces NGrams from words that consist entirely of characters that belong to one of the following Chinese, Japanese, and Korean script ranges:
- CJKUnifIdeo
- CJKUnifIdeoExtA
- CJKCompIdeo
- CJKComp
- CJKRadicalsSup
- Hiragana
- Katakana
- Hangul
- OrientalMisc
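The "consist entirely of" condition can be sketched in Python as follows. The code points come from the table of script ranges in this topic. The function names `in_cjk_range` and `is_cjk_only` are hypothetical, and the sketch assumes that text has already been split into words. It illustrates the rule only and does not reproduce how IDOL Server tokenizes text or generates NGrams.

```python
# Code point ranges (inclusive) for the Chinese, Japanese, and Korean script
# ranges named in the list above. Names in this sketch are hypothetical.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x20000, 0x2A6D6), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x3200, 0x32FF), (0x2FF0, 0x2FFF),   # CJKUnifIdeo
    (0x3400, 0x4DBF),                                         # CJKUnifIdeoExtA
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),                     # CJKCompIdeo
    (0x3300, 0x33FF), (0x2F00, 0x2FDF), (0xFE30, 0xFE4F),     # CJKComp
    (0x2E80, 0x2EFF), (0x3000, 0x303F), (0x31C0, 0x31EF),     # CJKRadicalsSup
    (0x3040, 0x309F),                                         # Hiragana
    (0x30A0, 0x30FF), (0x31F0, 0x31FF),                       # Katakana
    (0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F),
    (0xA960, 0xA97F), (0xD7B0, 0xD7FF),                       # Hangul
    (0x3105, 0x312C), (0x31A0, 0x31BF), (0x3190, 0x319F),
    (0x4DC0, 0x4DFF),                                         # OrientalMisc
]


def in_cjk_range(char):
    """Return True if char falls inside any of the listed ranges."""
    code_point = ord(char)
    return any(begin <= code_point <= end for begin, end in CJK_RANGES)


def is_cjk_only(word):
    """Return True if word is non-empty and every character is in a listed range."""
    return bool(word) and all(in_cjk_range(char) for char in word)


if __name__ == "__main__":
    for word in ["東京", "東京2020", "カタカナ", "한국어", "hello"]:
        print(word, is_cjk_only(word))
```

In this sketch a mixed word such as 東京2020 fails the check because the digits belong to BasicLatin, so under the rule described above it would not be a candidate for NGram generation.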