Character Encoding

To ensure that all filtered text is output in the same character encoding, File Content Extraction performs character encoding conversion. In most cases, if your license includes advanced character set detection, File Content Extraction can detect the character encoding used in a source file, and automatically outputs filtered text in the encoding you choose. OpenText recommends that you specify your preferred target encoding. In the rare cases where File Content Extraction cannot detect the character encoding used in a source file, you can also specify the source encoding.

Many encodings have significant overlap in terms of the bytes used to encode characters, with the same bytes used to express different characters in different encodings. Advanced character set detection performs frequency analysis on the characters of the text to determine the most likely language and encoding. Detection therefore works best on large portions of natural language, and might not be able to determine the correct source encoding where the text is very short, contains a random sequence of characters, has some blocks of characters repeated many times, or contains characters from multiple languages.
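The overlap described above can be illustrated with a minimal sketch. It is not part of File Content Extraction; the names demo_encoding and demo_decode_byte are hypothetical, and the decoders cover only the byte ranges needed for the illustration. The same byte value decodes to a different character in each of three single-byte encodings, which is why the bytes alone cannot identify the encoding.

```c
#include <stdint.h>

/* Hypothetical names for illustration only; not part of any SDK. */
typedef enum { ENC_LATIN1, ENC_WINDOWS_1251, ENC_ISO_8859_7 } demo_encoding;

/* Decode one byte to a Unicode code point.
 * Returns 0xFFFD for bytes outside the small ranges this sketch handles. */
uint32_t demo_decode_byte(demo_encoding enc, uint8_t b)
{
    if (b < 0x80) {
        return b; /* ASCII range is shared by all three encodings */
    }
    switch (enc) {
    case ENC_LATIN1:
        /* ISO-8859-1: bytes 0xA0..0xFF map directly to U+00A0..U+00FF */
        return (b >= 0xA0) ? (uint32_t)b : 0xFFFDu;
    case ENC_WINDOWS_1251:
        /* Windows-1251: bytes 0xE0..0xFF are Cyrillic U+0430..U+044F */
        return (b >= 0xE0) ? 0x0430u + (uint32_t)(b - 0xE0) : 0xFFFDu;
    case ENC_ISO_8859_7:
        /* ISO-8859-7: bytes 0xE1..0xFE are Greek U+03B1..U+03CE */
        return (b >= 0xE1 && b <= 0xFE) ? 0x03B1u + (uint32_t)(b - 0xE1)
                                        : 0xFFFDu;
    }
    return 0xFFFDu;
}
```

For example, the byte 0xE9 decodes to U+00E9 (é) as ISO-8859-1, to U+0439 (й) as Windows-1251, and to U+03B9 (ι) as ISO-8859-7. A detector must therefore rely on statistical evidence from the surrounding text, and short or unusual text provides little of that evidence.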

Specify a Target Character Encoding

OpenText recommends that you specify a target character encoding when you initialize File Content Extraction, and recommends using UTF-8 or UTF-16 because these are widely supported and can encode a diverse range of characters.

DEPRECATED: Target character sets other than UTF-8 and UTF-16 are deprecated in File Content Extraction version 24.4 and later. OpenText recommends that you only use the widely supported UTF-8 and UTF-16 formats. Other character sets are still available, but might be removed in future. For a full list of these target encodings, see Coded Character Sets.

To specify a target character encoding

  • In the C API, set the outputCharSet member of the KVFilterInitOptions structure that you pass to fpInit().

    After filtering, you can verify the output encoding by calling the function fpGetTrgCharSet(). If the result is KVCS_UNKNOWN, File Content Extraction was unable to determine the source character encoding and therefore no conversion occurred. If you know the character encoding used in the source file, you can specify it through the API (see Specify a Source Character Encoding).

Performance Considerations

When a file format does not specify a character encoding, File Content Extraction attempts to detect the encoding automatically. Some character encodings, including UTF-8 and UTF-16, can be detected by core File Content Extraction functionality but others can be detected only if your license includes advanced character set detection. Advanced character set detection is enabled by default (if it is included in your license), but can increase the time required to filter some documents.
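This guide does not describe how core detection works internally. As a general illustration of why encodings such as UTF-8 and UTF-16 can be cheap to recognize, the following sketch checks a buffer for a Unicode byte order mark (BOM). The names demo_bom and demo_detect_bom are hypothetical and are not part of the File Content Extraction API. Text without a BOM, or text in a legacy single-byte encoding, carries no such signature, which is where statistical detection becomes necessary.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not part of any SDK. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } demo_bom;

/* Report which Unicode byte order mark, if any, starts the buffer.
 * Simplification: a UTF-32 little-endian BOM (FF FE 00 00) also begins
 * with FF FE, so this sketch would report it as UTF-16 little-endian. */
demo_bom demo_detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        return BOM_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        return BOM_UTF16_BE;
    }
    return BOM_NONE; /* no signature: encoding must be inferred another way */
}
```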

You can disable advanced character set detection on a file-by-file basis. Before doing this, be aware that File Content Extraction cannot output filtered text in your chosen encoding unless it detects the encoding of the source file, or you specify the source encoding yourself through the API.

To disable advanced character set detection

  • In the C API, call fpSetConfig() and set the flag KVFLT_CHARSETDETECTION to FALSE.

Specify a Source Character Encoding

In most cases, File Content Extraction can automatically detect the character encoding of an input file, so specifying a source encoding is not necessary. You might need to specify the source character encoding if you have disabled advanced character set detection, or in the rare cases where detection fails.

To specify the source character encoding

Disable Character Encoding Conversion

You can completely disable character encoding conversion, and retain the original character encoding of the document.

To disable character encoding conversion

  • In the C API, set the flag KVF_NODEFAULTCHARSETCONVERT in the dwFlags argument of the KVFilterInitOptions structure that you pass to fpInit().