Character Encoding
To ensure that all filtered text is output in the same character encoding, File Content Extraction performs character encoding conversion. In most cases, if your license includes advanced character set detection, File Content Extraction can detect the character encoding used in a source file, and automatically outputs filtered text in the encoding you choose. OpenText recommends that you specify your preferred target encoding. In the rare cases where File Content Extraction cannot detect the character encoding used in a source file, you can also specify the source encoding.
Many encodings have significant overlap in terms of the bytes used to encode characters, with the same bytes used to express different characters in different encodings. Advanced character set detection performs frequency analysis on the characters of the text to determine the most likely language and encoding. Detection therefore works best on large portions of natural language, and might not be able to determine the correct source encoding where the text is very short, contains a random sequence of characters, has some blocks of characters repeated many times, or contains characters from multiple languages.
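The byte overlap described above can be illustrated with a small sketch. It is not part of File Content Extraction; the function names are hypothetical, and only the parts of ISO-8859-1 and Windows-1251 needed for the illustration are modeled. It shows the same two bytes decoding to different characters depending on which encoding is assumed, which is why detection has to rely on statistics rather than on the bytes alone.

```cpp
#include <cassert>
#include <cstdint>

// ISO-8859-1 (Latin-1): every byte maps to the code point with the same value.
char32_t decode_latin1(std::uint8_t byte) {
    return byte;
}

// Windows-1251: ASCII is unchanged and bytes 0xC0..0xFF map onto the Cyrillic
// letters U+0410..U+044F. Bytes 0x80..0xBF are not handled in this sketch and
// yield U+FFFD (the replacement character).
char32_t decode_windows1251(std::uint8_t byte) {
    if (byte < 0x80) return byte;
    if (byte >= 0xC0) return 0x0410 + (byte - 0xC0);
    return 0xFFFD;
}

// The two bytes E4 E0 read as "äà" in Latin-1 but as the Russian word "да"
// in Windows-1251. Nothing in the bytes themselves says which is intended.
void check_overlap_example() {
    assert(decode_latin1(0xE4) == 0x00E4);       // ä
    assert(decode_latin1(0xE0) == 0x00E0);       // à
    assert(decode_windows1251(0xE4) == 0x0434);  // д
    assert(decode_windows1251(0xE0) == 0x0430);  // а
}
```

With only two bytes of input, both readings are plausible, which matches the note above that very short text might not be detected correctly.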
Specify a Target Character Encoding
OpenText recommends that you specify a target character encoding when you initialize File Content Extraction, and recommends using UTF-8 or UTF-16 because these are widely supported and can encode a diverse range of characters.
DEPRECATED: Target character sets other than UTF-8 and UTF-16 are deprecated in File Content Extraction version 24.4 and later. OpenText recommends that you use only the widely supported UTF-8 and UTF-16 formats. Other character sets are still available, but might be removed in a future release. For a full list of these target encodings, see Coded Character Sets.
To specify a target character encoding
- In the C++ API, set the property target_encoding on the configuration object. Set the target encoding to one of the values in the enumerated list Encoding in Keyview_Enumerations.hpp.
Performance Considerations
When a file format does not specify a character encoding, File Content Extraction attempts to detect the encoding automatically. Some character encodings, including UTF-8 and UTF-16, can be detected by core File Content Extraction functionality but others can be detected only if your license includes advanced character set detection. Advanced character set detection is enabled by default (if it is included in your license), but can increase the time required to filter some documents.
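One reason UTF-8 and UTF-16 can be cheap to recognize is that files in these encodings often start with a byte order mark (BOM). The sketch below checks for one. It is an illustration only, and is not a description of how File Content Extraction detects these encodings; the `BomEncoding` type and `detect_bom` function are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class BomEncoding { None, Utf8, Utf16LittleEndian, Utf16BigEndian };

// Inspect the first bytes of a buffer for a Unicode byte order mark.
// Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE,
// so this sketch would report it as UTF-16 little-endian.
BomEncoding detect_bom(const std::vector<std::uint8_t>& data) {
    const std::size_t n = data.size();
    if (n >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BomEncoding::Utf8;
    if (n >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BomEncoding::Utf16LittleEndian;
    if (n >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BomEncoding::Utf16BigEndian;
    return BomEncoding::None;  // no BOM: statistical detection would be needed
}
```

A check like this reads at most three bytes, whereas frequency analysis has to examine a substantial portion of the text, which is consistent with the extra filtering time mentioned above.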
You can disable advanced character set detection on a file-by-file basis. Before doing this, be aware that File Content Extraction cannot output filtered text in your chosen encoding unless it detects the encoding of the source file, or you specify the source encoding yourself through the API.
To disable advanced character set detection
- In the C++ API, set character_set_detection to FALSE.
Specify a Source Character Encoding
In most cases, File Content Extraction can automatically detect the character encoding of an input file and specifying a source encoding is not necessary. You might need to specify the source character encoding if you have disabled advanced character set detection.
To specify the source character encoding
- In the C++ API, set source_encoding on a Configuration object, before creating a session, using any value in the enumerated list Encoding in Keyview_Enumerations.hpp. See The Configuration Class for more information.
Disable Character Encoding Conversion
You can completely disable character encoding conversion, and retain the original character encoding of the document.
To disable character encoding conversion
- In the C++ API, set no_encoding_conversion to TRUE.