HPE IDOL Server uses probabilistic modeling and therefore does not require any form of language-dependent parsing, dictionaries, or translation modules.
Treating words as abstract symbols of meaning allows HPE IDOL technology to derive understanding through the context in which symbols occur rather than a rigid definition of grammar. Slang and other variations in language do not affect the software analysis.
HPE IDOL Server can build up a statistical understanding of the patterns in any language. The more information HPE IDOL Server has about a particular type of information (for example, legal terms, pharmaceutical developments, technology, and so on), the more understanding it gains of those topics.
You can think of a new language as simply another type of information, for which HPE IDOL Server needs enough material to learn from. Therefore, it is possible to mix more than one language in HPE IDOL Server as long as you have sufficient amounts of each language to build its understanding.
The choice of language does not compromise the accuracy of the concepts extracted by HPE IDOL Server. The underlying algorithm is the same regardless of the language used.
HPE IDOL internationalization functionality enables:
automatic language detection. HPE IDOL Server can automatically detect the language and encoding of documents that it processes. This feature allows you to set up processes that HPE IDOL Server automatically applies to documents or document metadata if they are in a specific language. For example, if HPE IDOL Server identifies a document as Chinese, it automatically applies the appropriate preliminary linguistic tools.
If a document contains multiple languages, HPE IDOL Server determines which language predominates, and processes the whole document according to the settings for that language.
cross-lingual systems. You can set up cross-lingual systems in HPE IDOL Server. This feature allows you to produce multilingual results for queries, or to restrict results to documents in a specific language or encoding. For example, an English query can return information both in English and Spanish.
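The rule for documents that contain more than one language can be sketched as follows. This is a minimal illustration only, not the HPE IDOL detection algorithm, which is not described here. It assumes that some detector has already split the document into segments and labeled each with a language, and it assumes that "most" is measured by character count. The function name dominant_language and the segment format are hypothetical.

```python
from collections import Counter


def dominant_language(segments):
    """Return the language that covers the most text.

    segments: iterable of (language, text) pairs, assumed to come from
    an upstream language detector (hypothetical input format).
    """
    totals = Counter()
    for language, text in segments:
        # Assumption: the amount of each language is measured in characters.
        totals[language] += len(text)
    if not totals:
        raise ValueError("no segments to inspect")
    language, _ = totals.most_common(1)[0]
    return language


document = [
    ("english", "The quarterly report covers sales in three regions."),
    ("spanish", "Resumen breve."),
]
# The whole document would then be processed with the English settings.
settings_language = dominant_language(document)
```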
Although HPE IDOL technology is language independent, it can be beneficial to use language-dependent features to optimize the ability of HPE IDOL Server to match concepts irrespective of their appearance in text. HPE IDOL therefore provides the following features:
stop word lists. Every language has words that do not carry much significant meaning. In grammatical terms these are normally prepositions, conjunctions, auxiliary verbs, and so on (for example, words such as the, a, and to in English). These words can be safely ignored when processing content.
HPE IDOL provides as standard a set of stop word lists for the most commonly used languages.
stemming. In many languages, groups of words share a common morphological root. HPE IDOL provides stemming algorithms that reduce words to this root. This process allows you to match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping, and helped can all be stripped to their stem help without significant loss of meaning.
HPE IDOL provides as standard a set of stemming algorithms for the most commonly used languages. HPE IDOL applies stemming after it discards stop words, both at index time (when content is stored in HPE IDOL Server) and at query time (HPE IDOL removes stop words and stems query text before matching).
HPE IDOL Server also supports per-language use of a stemming file, which you can use in conjunction with the stemming algorithms to specify stems for individual words.
multiple encodings. HPE IDOL supports multiple encodings for languages such as Greek and Russian. You can use different encodings interchangeably, which means that it does not matter which encoding a language is given in. For example, it is possible to query in one recognized encoding for a language and receive results that are in other encodings.
transliteration schemes. Transliteration is the ability to represent letters that do not belong to the Latin alphabet, or words that contain accented letters, with the corresponding characters of another alphabet. This means that users do not need to be familiar with the accents and special characters of different languages.
canonicalization of characters. Some encodings have more than one way to represent a character. For example, the Japanese katakana script can have full-width or half-width characters. Regardless of its width, the character carries the same meaning.
The HPE IDOL software infrastructure uses canonicalization to ensure that it treats all character forms equally. It automatically converts characters to an internationally recognized canonical form.
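Several of the features listed above can be pictured as one small text-preparation pipeline: decode from the source encoding, canonicalize characters, discard stop words, then stem. The sketch below uses only the Python standard library and is not HPE IDOL code. The stop word list and the suffix-stripping stemmer are toy assumptions, Unicode NFKC normalization stands in for whichever canonical form HPE IDOL actually uses, and the function names are hypothetical.

```python
import unicodedata

# Toy assumptions for illustration; real stop word lists and stemmers
# are far more complete.
STOP_WORDS = {"the", "a", "to"}
SUFFIXES = ("ful", "ing", "ed")


def to_canonical_text(raw: bytes, encoding: str) -> str:
    """Decode bytes and fold equivalent character forms together.

    NFKC maps, for example, half-width katakana to full-width katakana.
    """
    return unicodedata.normalize("NFKC", raw.decode(encoding))


def stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(raw: bytes, encoding: str) -> list:
    """Canonicalize, drop stop words, then stem what remains."""
    words = to_canonical_text(raw, encoding).lower().split()
    return [stem(word) for word in words if word not in STOP_WORDS]


english = index_terms("the helpful helping helped".encode("utf-8"), "utf-8")
# The same Russian word in two encodings yields identical terms.
koi8 = index_terms("привет".encode("koi8_r"), "koi8_r")
cp1251 = index_terms("привет".encode("cp1251"), "cp1251")
```

In this sketch the same function would run on both indexed content and query text, which is why a query in one encoding, or in a different grammatical form, can still match stored content.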