Export enables you to control the character set of both the input and the output text. This is accomplished either by setting the source character set, the target character set, or both in the API, or by basing the input and output on the character set of the document (if the document character set is stored in the document and can be determined by the document reader).
The character sets are defined as constants in the Export class. Not all character sets can be used to specify the target character set. See Coded Character Sets for a list of character sets that can be used as a target character set.
To determine the output character set of a converted document, Export considers the following:
Whether the reader can extract the character set from the document. This depends on whether the file format can provide character set information and whether the document actually contains character set information.
The section Supported Formats indicates the file formats for which character set information can be extracted. If character set information cannot be determined for your document type, you must set the source, the target character set, or both in the API.
The following diagram shows how the output character set is determined when the document character set can be determined.
The following diagram shows how the output character set is determined when the document character set cannot be determined.
The examples below demonstrate possible configurations for mapping character sets and the expected output for each scenario.
For the example in the following table, the document is an RTF file. The section Word Processing Formats indicates that the document character set can be obtained from this file type. The document character set is Traditional Chinese (BIG5).
Source charset set | Target charset set | Output charset |
---|---|---|
KVCS_GB | KVCS_UTF8 | Converts GB (Simplified Chinese) to UTF-8. The output character set is the target character set specified in the API. |
KVCS_GB | -- | Converts BIG5 to GB (Simplified Chinese). The output character set is the source character set specified in the API. |
-- | KVCS_UTF8 | Converts BIG5 to UTF-8. The output character set is the target character set specified in the API. |
-- | -- | The output character set is the document character set. No conversion. |
For the example in the following table, the document is an ASCII file. The section Word Processing Formats indicates that the document character set cannot be obtained from this file type. The document's source character set is KVCS_1251.
Source charset set | Target charset set | Output charset |
---|---|---|
KVCS_1252 | KVCS_UTF8 | Converts |
KVCS_1252 | KVCS_UNKNOWN | The output character set is the source character set specified in the API because |
KVCS_1252 | -- | The output character set is the source character set specified in the API. No conversion. |
-- | KVCS_1252 | Converts OS code page to |
-- | -- | The output character set is OS code page. No conversion. |
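The precedence that the two tables suggest can be sketched in a few lines of Java. This is only one reading of the table rows, not the SDK's implementation: the class `OutputCharsetSketch`, the method `resolveOutput`, and the use of plain strings for character set names are all hypothetical, and `null` stands for the `--` entries (not set, or not determinable).

```java
// Illustrative sketch only: models the precedence implied by the tables.
// Nothing here is part of the Export API.
final class OutputCharsetSketch {

    /**
     * Returns the output character set implied by the tables.
     *
     * @param documentCharset charset read from the document, or null if the
     *                        reader cannot determine it
     * @param apiSource       source charset set in the API, or null for "--"
     * @param apiTarget       target charset set in the API, or null for "--"
     * @param osCodePage      the OS code page, used as the last fallback
     */
    public static String resolveOutput(String documentCharset, String apiSource,
                                       String apiTarget, String osCodePage) {
        // A target set in the API wins. The second table treats KVCS_UNKNOWN
        // as if no usable target had been given.
        if (apiTarget != null && !apiTarget.equals("KVCS_UNKNOWN")) {
            return apiTarget;
        }
        // Otherwise a source set in the API becomes the output.
        if (apiSource != null) {
            return apiSource;
        }
        // Otherwise the document's own charset is used, if it is known.
        if (documentCharset != null) {
            return documentCharset;
        }
        // Last resort: the OS code page.
        return osCodePage;
    }

    public static void main(String[] args) {
        // RTF example: document charset known (BIG5), only the source is set.
        System.out.println(resolveOutput("BIG5", "KVCS_GB", null, "OS code page"));
        // ASCII example: document charset unknown, nothing set in the API.
        System.out.println(resolveOutput(null, null, null, "OS code page"));
    }
}
```

The sketch covers only which character set the output ends up in. Whether a conversion actually takes place, and from which character set, is described row by row in the tables.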
You can convert the character set of a file at the time the file is converted.
To specify the source character set, use the setSourceCharSet method of the OptionInfo object and set setForceSourceCharSet to TRUE.
To specify the target character set, use the setOutputCharSet method of the OptionInfo object and set setForceOutputCharSet to TRUE.
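The call sequence described above might look roughly like the following. Because the SDK classes are not reproduced here, the sketch uses a hypothetical stand-in class, `OptionInfoStandIn`, so that it compiles on its own. Only the four method names come from the text. The parameter types, the fields, and the use of strings instead of the constants defined in the Export class are assumptions.

```java
// Illustrative sketch only. OptionInfoStandIn is a hypothetical stand-in for
// the SDK's OptionInfo object; check the real signatures in the API reference.
final class CharsetOptionsSketch {

    static final class OptionInfoStandIn {
        String sourceCharSet;   // assumed representation; the SDK uses constants
        String outputCharSet;
        boolean forceSourceCharSet;
        boolean forceOutputCharSet;

        void setSourceCharSet(String charSet) { this.sourceCharSet = charSet; }
        void setForceSourceCharSet(boolean force) { this.forceSourceCharSet = force; }
        void setOutputCharSet(String charSet) { this.outputCharSet = charSet; }
        void setForceOutputCharSet(boolean force) { this.forceOutputCharSet = force; }
    }

    /** Builds options with the given charsets; pass null to leave one unset. */
    static OptionInfoStandIn configure(String source, String target) {
        OptionInfoStandIn options = new OptionInfoStandIn();
        if (source != null) {
            // Set the source character set and force it, as the text describes.
            options.setSourceCharSet(source);
            options.setForceSourceCharSet(true);
        }
        if (target != null) {
            // Set the target (output) character set and force it.
            options.setOutputCharSet(target);
            options.setForceOutputCharSet(true);
        }
        return options;
    }

    public static void main(String[] args) {
        OptionInfoStandIn options = configure("KVCS_GB", "KVCS_UTF8");
        System.out.println(options.sourceCharSet + " -> " + options.outputCharSet);
    }
}
```

Pairing each setter with its force flag in one helper keeps the two calls together, since the text requires both for the setting to take effect.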
You can convert the character set of a container subfile at the time the subfile is extracted from the container and before it is converted to HTML. This is most often used to set the character set of a mail message's body text. See Use the File Extraction API.
To specify the source and target character set of a subfile
Use the methods of the ExtSubFileExtractConfig object to set the source and target character set.
Call the extExtractSubFile method of the Export object and pass in the ExtSubFileExtractConfig object.