Convert Character Sets
Export allows you to control the character set of both the input and the output text. This is accomplished by either:
- setting the source and/or target character set in the API, or
- basing the input/output on the character set of the document (if the document character set is stored in the document and can be determined by the document reader).
The character sets are enumerated in KVCharSet in kvcharset.h.
Not all character sets can be used to specify the target character set. See Code Character Sets for a list of character sets that can be used as a target character set.
Determine the Character Set of the Output Text
To determine the output character set of a converted document, Export considers the following:
- Whether the reader can extract the character set from the document. This depends on whether the file format can provide character set information and whether the document actually contains character set information.
  The section Document Readers indicates the file formats for which character set information can be extracted. If character set information cannot be determined for your document type, you must set the source character set, the target character set, or both, in the API.
- Whether a source character set is set in the API.
  NOTE: To set the source character set, you must specify a character set and set the bForceSrcCharSet member of the KVXMLOptions structure to TRUE.
- Whether a target character set is set in the API.
  NOTE: To set the target character set, you must specify a character set and set the bForceOutputCharSet member of the KVXMLOptions structure to TRUE.
Guidelines for Character Set Conversion
The following diagram shows how the output character set is determined when the document character set can be determined:
Document Character Set Can Be Determined
The following diagram shows how the output character set is determined when the document character set cannot be determined:
Document Character Set Cannot Be Determined
Examples of Character Set Conversion
The examples below demonstrate possible configurations for mapping character sets and the expected output for each scenario.
Document Character Set Can be Determined
For the example in the following table, the document is an RTF file. The section Document Readers indicates that the document character set can be obtained from this file type. The document character set is Traditional Chinese (BIG5).
Source charset set | Target charset set | Output charset |
---|---|---|
KVCS_GB | KVCS_UTF8 | Converts GB (Simplified Chinese) to UTF-8. The output character set is the target character set specified in the API. |
KVCS_GB | -- | Converts BIG5 to GB (Simplified Chinese). The output character set is the source character set specified in the API. |
-- | KVCS_UTF8 | Converts BIG5 to UTF-8. The output character set is the target character set specified in the API. |
-- | -- | The output character set is the document character set. No conversion. |
Document Character Set Cannot be Determined
For the example in the following table, the document is an ASCII file. The section Document Readers indicates that the document character set cannot be obtained from this file type. The document character set is KVCS_1251.
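Source charset set | Target charset set | Output charset |
---|---|---|
KVCS_1252 | KVCS_UTF8 | Converts KVCS_1252 to KVCS_UTF8. The output character set is the target character set specified in the API. |
KVCS_1252 | KVCS_UNKNOWN | The output character set is the source character set specified in the API because the target character set is KVCS_UNKNOWN. |
KVCS_1252 | -- | The output character set is the source character set specified in the API. No conversion. |
-- | KVCS_1252 | Converts OS code page to KVCS_1252. The output character set is the target character set specified in the API. |
-- | -- | The output character set is the OS code page. No conversion. |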
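
The outcomes listed in the two example tables appear to follow a simple precedence: a usable target character set wins, then a forced source character set, then the document character set, and finally the OS code page. The following is a minimal sketch of that precedence, inferred from the tables rather than taken from the Export implementation. All type and function names (SketchCharSet, SketchInputs, resolve_output_charset) are hypothetical stand-ins; the real character set values are enumerated in KVCharSet in kvcharset.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for KVCharSet; values are placeholders only. */
typedef enum {
    CS_UNKNOWN = 0,  /* plays the role of KVCS_UNKNOWN */
    CS_BIG5,
    CS_GB,
    CS_1252,
    CS_UTF8
} SketchCharSet;

/* Hypothetical bundle of the inputs that the tables vary. */
typedef struct {
    SketchCharSet docCharSet;  /* CS_UNKNOWN if the reader cannot determine it */
    SketchCharSet srcCharSet;  /* source character set given in the API */
    bool forceSrc;             /* mirrors bForceSrcCharSet */
    SketchCharSet outCharSet;  /* target character set given in the API */
    bool forceOut;             /* mirrors bForceOutputCharSet */
    SketchCharSet osCodePage;  /* character set of the OS code page */
} SketchInputs;

/* Returns the output character set according to the inferred precedence. */
SketchCharSet resolve_output_charset(const SketchInputs *in)
{
    assert(in != NULL);

    /* 1. A forced target wins, unless it is the unknown character set. */
    if (in->forceOut && in->outCharSet != CS_UNKNOWN)
        return in->outCharSet;

    /* 2. Otherwise a forced source character set becomes the output. */
    if (in->forceSrc && in->srcCharSet != CS_UNKNOWN)
        return in->srcCharSet;

    /* 3. Otherwise use the document character set, if the reader found one. */
    if (in->docCharSet != CS_UNKNOWN)
        return in->docCharSet;

    /* 4. Nothing else is known: fall back to the OS code page. */
    return in->osCodePage;
}
```

For instance, with a determinable BIG5 document, a forced source of CS_GB, and no target, the function returns CS_GB, which matches the second row of the first table.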
Set the Character Set During Conversion
You can convert the character set of a file at the time the file is converted.
To specify the source character set for documents from which the document character set cannot be obtained by the reader
- Set the eSrcCharSet member of the KVXMLOptions structure to one of the character sets enumerated in KVCharSet in kvcharset.h. See KVXMLOptions.
- Set the bForceSrcCharSet member of the KVXMLOptions structure to TRUE.
To specify the target character set
- Set the eOutputCharSet member of the KVXMLOptions structure to one of the character sets enumerated in KVCharSet in kvcharset.h. See KVXMLOptions.
- Set the bForceOutputCharSet member of the KVXMLOptions structure to TRUE.
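The two procedures above can be sketched as a pair of small helpers. This is an illustration only: DemoXMLOptions and DemoCharSet are hypothetical stand-ins for the KVXMLOptions structure and the KVCharSet enumeration, reduced to the four members named in this section, and the enum values are placeholders. In real code you would include the KeyView headers and set these members on an actual KVXMLOptions instance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real types come from the KeyView headers. */
typedef int DemoBool;              /* stand-in for BOOL */
#define DEMO_TRUE  1
#define DEMO_FALSE 0

typedef enum {
    DEMO_CS_UNKNOWN = 0,           /* plays the role of KVCS_UNKNOWN */
    DEMO_CS_1252,                  /* plays the role of KVCS_1252 */
    DEMO_CS_UTF8                   /* plays the role of KVCS_UTF8 */
} DemoCharSet;

/* Only the members discussed in this section are modeled. */
typedef struct {
    DemoCharSet eSrcCharSet;
    DemoBool    bForceSrcCharSet;
    DemoCharSet eOutputCharSet;
    DemoBool    bForceOutputCharSet;
} DemoXMLOptions;

/* Source: choose the character set, then set the force flag to TRUE. */
void set_source_charset(DemoXMLOptions *opts, DemoCharSet cs)
{
    assert(opts != NULL);
    opts->eSrcCharSet = cs;
    opts->bForceSrcCharSet = DEMO_TRUE;
}

/* Target: choose the character set, then set the force flag to TRUE. */
void set_target_charset(DemoXMLOptions *opts, DemoCharSet cs)
{
    assert(opts != NULL);
    opts->eOutputCharSet = cs;
    opts->bForceOutputCharSet = DEMO_TRUE;
}
```

Pairing each character set member with its force flag in one helper keeps the two steps from drifting apart, because a character set value without the corresponding flag set to TRUE is not treated as set.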
Set the Character Set During File Extraction from a Container
You can convert the character set of a container subfile at the time the subfile is extracted from the container and before it is converted to XML. This is most often used to set the output character set of a mail message's body text.
To specify the source character set of a subfile, call the fpExtractSubFile() function, and set the KVExtractSubFileArg->srcCharset argument to any value in the enumerated list in KVCharSet in kvcharset.h. See fpExtractSubFile().
To specify the target character set of a subfile, call the fpExtractSubFile() function, and set the KVExtractSubFileArg->trgCharSet argument to any value in the enumerated list in KVCharSet in kvcharset.h.
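As a rough illustration of the two settings above, the sketch below fills in the character set members of an extraction argument before a subfile is extracted. SubExtractArg and SubCharSet are hypothetical stand-ins that model only the srcCharset and trgCharSet members named in this section, with placeholder enum values. The call to fpExtractSubFile() itself is omitted because its signature is not shown here; see fpExtractSubFile() for the actual usage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for KVCharSet; values are placeholders only. */
typedef enum {
    SUB_CS_UNKNOWN = 0,
    SUB_CS_1252,
    SUB_CS_UTF8
} SubCharSet;

/* Hypothetical stand-in modeling only the two character set members
   of the extraction argument described in this section. */
typedef struct {
    SubCharSet srcCharset;   /* source character set of the subfile */
    SubCharSet trgCharSet;   /* target character set of the subfile */
} SubExtractArg;

/* Sets both character sets on the argument before extraction. */
void set_subfile_charsets(SubExtractArg *arg, SubCharSet src, SubCharSet trg)
{
    assert(arg != NULL);
    arg->srcCharset = src;
    arg->trgCharSet = trg;
}
```

For a mail message body, for example, one might pass the character set of the message as the source and SUB_CS_UTF8 as the target, then hand the populated argument to the extraction call.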