Convert Character Sets

Export allows you to control the character set of both the input and the output text. This is accomplished by either

  • setting the source and/or target character set in the API, or

  • basing the input/output on the character set of the document (if the document character set is stored in the document and can be determined by the document reader).

The character sets are defined as constants in the Export class.

Not all character sets can be used to specify the target character set. See Coded Character Sets for a list of character sets that can be used as a target character set.

Determine the Character Set of the Output Text

To determine the output character set of a converted document, Export considers the following:

  • Whether the reader can extract the character set from the document. This depends on whether the file format can provide character set information and whether the document actually contains character set information.

    If character set information cannot be determined for your document type, you must set the source character set, the target character set, or both in the API.

  • Whether a source character set is set in the API.

    NOTE: To set the source character set, you must specify a character set and set the parameter setForceSourceCharSet to TRUE.

  • Whether a target character set is set in the API.

    NOTE: To set the target character set, you must specify a character set and set the parameter setForceOutputCharSet to TRUE.

Guidelines for Character Set Conversion

The following diagram shows how the output character set is determined when the document character set can be determined:

[Figure: Document Character Set Can Be Determined]

The following diagram shows how the output character set is determined when the document character set cannot be determined:

[Figure: Document Character Set Cannot Be Determined]

Examples of Character Set Conversion

The examples below demonstrate possible configurations for mapping character sets and the expected output for each scenario.

Document Character Set Can Be Determined

For the example in the following table, the document is an RTF file. The document character set is Traditional Chinese (BIG5).

Document character set can be determined

Source charset set  Target charset set  Output charset  Description
KVCS_GB             KVCS_UTF8           KVCS_UTF8       Converts GB (Simplified Chinese) to UTF-8. The output character set is the target character set specified in the API.
KVCS_GB             --                  KVCS_GB         Converts BIG5 to GB (Simplified Chinese). The output character set is the source character set specified in the API.
--                  KVCS_UTF8           KVCS_UTF8       Converts BIG5 to UTF-8. The output character set is the target character set specified in the API.
--                  --                  KVCS_BIG5       The output character set is the document character set. No conversion.

Document Character Set Cannot Be Determined

For the example in the following table, the document is an ASCII file. The section Document Readers indicates that the document character set cannot be obtained from this file type. The document character set is KVCS_1251.

Document character set cannot be determined

Source charset set  Target charset set  Output charset  Description
KVCS_1252           KVCS_UTF8           KVCS_UTF8       Converts KVCS_1252 to KVCS_UTF8. The output character set is the target character set specified in the API.
KVCS_1252           KVCS_UNKNOWN        KVCS_1252       The output character set is the source character set specified in the API because KVCS_UNKNOWN cannot be used as a target character set. No conversion.
KVCS_1252           --                  KVCS_1252       The output character set is the source character set specified in the API. No conversion.
--                  KVCS_1252           KVCS_1252       Converts the OS code page to KVCS_1252. The output character set is the target character set specified in the API.
--                  --                  OS code page    The output character set is the OS code page. No conversion.
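
The rows of the two example tables are consistent with a simple precedence rule: a usable target character set wins, then a source character set set in the API, then the document character set, and finally the OS code page. The following Java sketch encodes that rule so it can be checked against the table rows. It is an illustration only and not part of the Export API: the class name OutputCharsetRule, the method resolve, the use of null for "not set" or "cannot be determined", and the string labels are all assumptions made for this example.

```java
/** Illustrative model of the example tables; not part of the Export API. */
final class OutputCharsetRule {
    // Hypothetical labels used only by this sketch.
    static final String OS_CODE_PAGE = "OS code page";
    static final String UNKNOWN = "KVCS_UNKNOWN";

    /**
     * Returns the output character set implied by the example tables.
     * A null argument means "not set in the API" (source, target) or
     * "cannot be determined by the reader" (document).
     */
    static String resolve(String documentCharset, String source, String target) {
        // A target set in the API wins, unless it is KVCS_UNKNOWN.
        if (target != null && !UNKNOWN.equals(target)) {
            return target;
        }
        // Otherwise a source set in the API is used as the output.
        if (source != null) {
            return source;
        }
        // Otherwise fall back to the document character set, if known.
        if (documentCharset != null) {
            return documentCharset;
        }
        // Nothing is known: the tables list the OS code page.
        return OS_CODE_PAGE;
    }

    public static void main(String[] args) {
        // RTF document whose character set (BIG5) can be determined.
        System.out.println(resolve("KVCS_BIG5", "KVCS_GB", "KVCS_UTF8")); // KVCS_UTF8
        System.out.println(resolve("KVCS_BIG5", "KVCS_GB", null));        // KVCS_GB
        System.out.println(resolve("KVCS_BIG5", null, null));             // KVCS_BIG5
        // ASCII document whose character set cannot be determined.
        System.out.println(resolve(null, "KVCS_1252", "KVCS_UNKNOWN"));   // KVCS_1252
        System.out.println(resolve(null, null, null));                    // OS code page
    }
}
```

The sketch covers only the output character set column. Which conversion Export actually performs in each case is described in the table rows themselves.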

Set the Character Set During Conversion

You can convert the character set of a file at the time the file is converted.

To specify the source character set, use the setSourceCharSet method of the OptionInfo object and set setForceSourceCharSet to TRUE.

To specify the target character set, use the setOutputCharSet method of the OptionInfo object and set setForceOutputCharSet to TRUE.
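
The pairing described above (a character set plus its matching "force" flag) can be sketched as follows. This is a minimal illustration under stated assumptions, not SDK code: StubOptionInfo is a hypothetical stand-in that exists only so the example compiles without the SDK, the parameter types (int and boolean) are assumed, and the constant values are placeholders. Only the four method names come from the text; check the actual OptionInfo signatures and the Export class constants in the API reference.

```java
/** Hypothetical stand-in for OptionInfo; the real class ships with the SDK. */
final class StubOptionInfo {
    int sourceCharSet = -1;
    boolean forceSourceCharSet;
    int outputCharSet = -1;
    boolean forceOutputCharSet;

    // Method names are taken from the text; parameter types are assumptions.
    void setSourceCharSet(int charSet) { this.sourceCharSet = charSet; }
    void setForceSourceCharSet(boolean force) { this.forceSourceCharSet = force; }
    void setOutputCharSet(int charSet) { this.outputCharSet = charSet; }
    void setForceOutputCharSet(boolean force) { this.forceOutputCharSet = force; }
}

final class CharsetOptionsSketch {
    // Placeholder values only; the real constants are defined in the Export class.
    static final int KVCS_1252 = 1;
    static final int KVCS_UTF8 = 2;

    /** Requests a forced KVCS_1252 source and a forced KVCS_UTF8 output. */
    static StubOptionInfo configure() {
        StubOptionInfo options = new StubOptionInfo();

        // Source: specify the character set AND enable the force flag.
        options.setSourceCharSet(KVCS_1252);
        options.setForceSourceCharSet(true);

        // Target: specify the character set AND enable the force flag.
        options.setOutputCharSet(KVCS_UTF8);
        options.setForceOutputCharSet(true);

        return options;
    }

    public static void main(String[] args) {
        StubOptionInfo options = configure();
        System.out.println("source forced: " + options.forceSourceCharSet);
        System.out.println("output forced: " + options.forceOutputCharSet);
    }
}
```

The point of the sketch is that each character set is set together with its force flag, since the text requires both.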

Set the Character Set During File Extraction from a Container

You can convert the character set of a container subfile at the time the subfile is extracted from the container and before it is converted to HTML. This is most often used to set the output character set of a mail message's body text. See Use the File Extraction API.

To specify the source and target character set of a subfile

  1. Use the methods of the ExtSubFileExtractConfig object to set the source and target character set.

  2. Call the extExtractSubFile method of the Export object and pass in the ExtSubFileExtractConfig object.