IDOL Language Support Concepts

IDOL uses probabilistic modeling and therefore does not require any form of language-dependent parsing, dictionaries, or translation modules.

Treating words as abstract symbols of meaning allows IDOL technology to derive understanding through the context in which symbols occur rather than a rigid definition of grammar. Slang and other variations in language do not affect the software analysis.

The IDOL Content component can build up a statistical understanding of the patterns in any language. The more material Content has on a particular subject (for example, legal terms, pharmaceutical developments, technology, and so on), the better it understands that subject.

You can think of a new language as simply another type of information, for which Content needs enough material to learn from. Therefore, it is possible to mix more than one language in Content as long as you have sufficient amounts of each language to build its understanding.

The choice of language does not compromise the accuracy of the concepts extracted by the IDOL Content component. The underlying algorithm is the same regardless of the language used.

IDOL internationalization functionality enables:

  • automatic language detection. Content can automatically detect the language and encoding of documents that it processes. This feature allows you to set up processes that Content automatically applies to documents or document metadata if they are in a specific language. For example, if Content identifies a document as Chinese, it automatically applies the appropriate preliminary linguistic tools.

    NOTE: If a document contains multiple languages, Content determines which language is predominant, and processes the document according to the settings for that language.

  • cross-lingual systems. You can set up cross-lingual systems in Content. This feature allows you to produce multilingual results for queries, or to restrict results to documents in a specific language or encoding. For example, an English query can return information both in English and Spanish.
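
The predominant-language rule in the note above can be illustrated with a toy sketch. This is not how Content detects languages. The `MARKERS` word lists and the `predominant_language` function are hypothetical names invented for this illustration, and counting marker words is a deliberately simple stand-in for real detection.

```python
from collections import Counter
from typing import Optional

# Hypothetical marker words per language (illustration only, not IDOL data).
MARKERS = {
    "english": {"the", "and", "of", "to", "is"},
    "spanish": {"el", "la", "y", "de", "es"},
}


def predominant_language(text: str) -> Optional[str]:
    """Return the language whose marker words occur most often, or None."""
    counts = Counter()
    for word in text.lower().split():
        for language, markers in MARKERS.items():
            if word in markers:
                counts[language] += 1
    if not counts:
        return None  # no marker word found, so nothing to decide on
    return counts.most_common(1)[0][0]


# A mixed document is assigned to the language that contributes most markers.
print(predominant_language("the cat is on the mat y el perro"))
```

In this sketch the mixed sentence contains three English markers and two Spanish ones, so the whole text would be handled with the English settings.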

Although IDOL technology is language independent, it can be beneficial to use language-dependent features to optimize the ability of the IDOL Content component to match concepts irrespective of their appearance in text. IDOL therefore provides the following features:

  • stop word lists. Every language has words that carry little meaning on their own. In grammatical terms these are normally prepositions, conjunctions, auxiliary verbs, and so on (for example, words such as the, a, and to in English). These words can be safely ignored when processing content.

    IDOL provides as standard a set of stop word lists for the most commonly used languages.

  • stemming. In many languages, groups of words share a common morphological root. IDOL provides stemming algorithms that reduce words to this root. This process allows you to match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping, and helped can all be stripped to their stem help without significant loss of meaning.

    IDOL provides as standard a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in the IDOL Content component) and at query time (IDOL removes stop words and stems query text before matching).

    NOTE: Content also supports per-language use of a stemming file, which you can use in conjunction with the stemming algorithms to specify stems for individual words.

  • multiple encodings. Content supports multiple encodings for languages such as Greek and Russian. You can use different encodings interchangeably, so it does not matter which encoding the text of a language is supplied in. For example, it is possible to query in one recognized encoding for a language and receive results that are in other encodings.

  • transliteration schemes. Transliteration represents letters that do not belong to the Latin alphabet, or letters that carry accents, with the corresponding characters of another alphabet. As a result, users do not need to be familiar with the accents and special characters of different languages.

  • canonicalization of characters. Some encodings have more than one way to represent a character. For example, the Japanese katakana script can have full-width or half-width characters. Regardless of its width, the character itself carries the same meaning.

    The IDOL software infrastructure uses canonicalization to ensure that it treats all character forms equally. It automatically converts characters to an internationally recognized canonical form.
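
The language-dependent features above can be sketched as a small text pipeline: canonicalize characters, transliterate accented letters, discard stop words, then stem what remains. The following is a minimal illustration that uses only the Python standard library. It is not IDOL's implementation. The `STOP_WORDS` and `SUFFIXES` values and all function names are hypothetical, Unicode NFKC normalization stands in for whichever canonical form IDOL uses, and the suffix stripper is far cruder than a real stemming algorithm.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list (not an IDOL list).
STOP_WORDS = {"the", "a", "to", "and", "of"}

# Hypothetical suffixes for a naive stemmer (not an IDOL algorithm).
SUFFIXES = ("ing", "ful", "ed", "s")


def canonicalize(text: str) -> str:
    """Fold compatibility forms (such as half-width katakana) into one form."""
    return unicodedata.normalize("NFKC", text)


def transliterate(text: str) -> str:
    """Replace accented Latin letters with their unaccented base letter."""
    out = []
    for ch in text:
        base = unicodedata.normalize("NFD", ch)[0]
        # Only touch Latin letters; leave other scripts exactly as they are.
        out.append(base if "LATIN" in unicodedata.name(base, "") else ch)
    return "".join(out)


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text: str) -> list:
    """Canonicalize, transliterate, drop stop words, then stem."""
    words = re.findall(r"\w+", transliterate(canonicalize(text)).lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


print(terms("The helpful café helped"))
print(canonicalize("ｶﾀｶﾅ") == "カタカナ")
```

Applying the same `terms` function to both stored text and query text mirrors the index-time and query-time processing described above: a query for helping and a document containing helped reduce to the same stem, so they can match.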