Convert Character Sets
Export allows you to control the character set of both the input and the output text. This is accomplished by either:
- setting the source and/or target character set in the API, or
- basing the input/output on the character set of the document (if the document character set is stored in the document and can be determined by the document reader).
The character sets are defined as constants in the Export class.
Not all character sets can be used to specify the target character set. See Coded Character Sets for a list of character sets that can be used as a target character set.
Determine the Character Set of the Output Text
To determine the output character set of a converted document, Export considers the following:
- Whether the reader can extract the character set from the document. This depends on whether the file format can provide character set information and whether the document actually contains character set information.
If character set information cannot be determined for your document type, you must set the source, the target character set, or both, in the API.
Guidelines for Character Set Conversion
The following diagram shows how the output character set is determined when the document character set can be determined:
Document Character Set Can Be Determined
The following diagram shows how the output character set is determined when the document character set cannot be determined:
Document Character Set Cannot Be Determined
Examples of Character Set Conversion
The examples below demonstrate possible configurations for mapping character sets and the expected output for each scenario.
Document Character Set Can be Determined
For the example in the following table, the document is an RTF file. The document character set is Traditional Chinese (BIG5).
Source character set | Target character set | Output character set |
---|---|---|
KVCS_GB | KVCS_UTF8 | Converts GB (Simplified Chinese) to UTF-8. The output character set is the target character set specified in the API. |
KVCS_GB | -- | Converts BIG5 to GB (Simplified Chinese). The output character set is the source character set specified in the API. |
-- | KVCS_UTF8 | Converts BIG5 to UTF-8. The output character set is the target character set specified in the API. |
-- | -- | The output character set is the document character set. No conversion. |
Document Character Set Cannot be Determined
For the example in the following table, the document is an ASCII file. The section Document Readers indicates that the document character set cannot be obtained from this file type. The document character set is KVCS_1251.
Source character set | Target character set | Output character set |
---|---|---|
KVCS_1252 | KVCS_UTF8 | Converts |
KVCS_1252 | KVCS_UNKNOWN | The output character set is the source character set specified in the API because |
KVCS_1252 | -- | The output character set is the source character set specified in the API. No conversion. |
-- | KVCS_1252 | Converts OS code page to |
-- | -- | The output character set is the OS code page. No conversion. |
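The rules illustrated by the two tables can be summarized as a small decision function. The following is only a sketch of that logic, not part of the Export API: the class `OutputCharsetRules`, the method `resolve`, and the use of plain strings as character set labels are all hypothetical. It also assumes that the row with a source of KVCS_1252 and a target of KVCS_UTF8 follows the same pattern as the other rows (the target wins when it is set).

```java
import java.util.Optional;

/** Hypothetical model of the output-charset rules shown in the tables; not the Export API. */
final class OutputCharsetRules {

    /** Label treated as "not set", mirroring KVCS_UNKNOWN in the tables. */
    static final String UNKNOWN = "KVCS_UNKNOWN";

    private OutputCharsetRules() {}

    /** A value counts as set if it is present (not "--", modeled as null) and not KVCS_UNKNOWN. */
    static boolean isSet(String charset) {
        return charset != null && !charset.equals(UNKNOWN);
    }

    /**
     * Picks the output character set.
     *
     * @param source          source character set set in the API, or null for "--"
     * @param target          target character set set in the API, or null for "--"
     * @param documentCharset character set read from the document, empty if it cannot be determined
     * @param osCodePage      fallback used when nothing else is available
     */
    static String resolve(String source, String target,
                          Optional<String> documentCharset, String osCodePage) {
        if (isSet(target)) {
            return target;               // a usable target always decides the output
        }
        if (isSet(source)) {
            return source;               // otherwise the forced source is used
        }
        return documentCharset.orElse(osCodePage); // document charset, else OS code page
    }

    public static void main(String[] args) {
        // RTF document whose character set (BIG5) can be determined.
        System.out.println(resolve(null, "KVCS_UTF8", Optional.of("BIG5"), "OS code page"));
        // ASCII document whose character set cannot be determined, nothing set in the API.
        System.out.println(resolve(null, null, Optional.empty(), "OS code page"));
    }
}
```

Modeling "--" as `null` and the undetermined document character set as an empty `Optional` keeps the two kinds of missing information separate, which matches how the tables distinguish API settings from what the document reader can extract.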
Set the Character Set During Conversion
You can convert the character set of a file at the time the file is converted.
To specify the source character set, use the setSourceCharSet method of the OptionInfo object and set setForceSourceCharSet to TRUE.
To specify the target character set, use the setOutputCharSet method of the OptionInfo object and set setForceOutputCharSet to TRUE.
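To make concrete what forcing a source and a target character set means, the following sketch performs the same kind of transcoding with the standard Java library only. It does not call the Export API; the class `TranscodeSketch` and the method `transcode` are hypothetical names used for illustration, and ISO-8859-1 and UTF-8 are chosen only because every Java runtime is guaranteed to support them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration of source-to-target transcoding using java.nio only; not the Export API. */
final class TranscodeSketch {

    private TranscodeSketch() {}

    /**
     * Interprets the input bytes in the given source character set and
     * re-encodes the resulting text in the target character set.
     */
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        String text = new String(input, source); // decode: bytes are read as the source charset
        return text.getBytes(target);            // encode: text is written in the target charset
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" encoded in ISO-8859-1 (one byte)
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2: "é" is the two bytes C3 A9 in UTF-8
    }
}
```

The decode step shows why the source character set matters: if the bytes are read with the wrong source character set, the text is already wrong before any target encoding is applied.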
Set the Character Set During File Extraction from a Container
You can convert the character set of a container subfile at the time the subfile is extracted from the container and before it is converted to HTML. This is most often used to set the output character set of a mail message's body text.
To specify the source and target character set of a subfile
- Use the methods of the ExtSubFileExtractConfig object to set the source and target character set.
- Call the extExtractSubFile method of the Export object and pass in the ExtSubFileExtractConfig object.