Convert Character Sets
Export allows you to control the character set of both the input and the output text. You can do this by either:
- setting the source character set, the target character set, or both, in the API, or
- basing the input and output on the character set of the document (if the document character set is stored in the document and the document reader can determine it).
The character sets are enumerated in KVCharSet in kvcharset.h.
Not all character sets can be used to specify the target character set. See Coded Character Sets for a list of character sets that can be used as a target character set.
Determine the Character Set of the Output Text
To determine the output character set of a converted document, Export considers the following:
- Whether the reader can extract the character set from the document. This depends on whether the file format can provide character set information and whether the document actually contains character set information.
  If character set information cannot be determined for your document type, you must set the source character set, the target character set, or both, in the API.
- Whether a source character set is set in the API.
  NOTE: To set the source character set, you must specify a character set and set the bForceSrcCharSet member of the KVHTMLOptionsEx structure to TRUE.
- Whether a target character set is set in the API.
  NOTE: To set the target character set, you must specify a character set and set the bForceOutputCharSet member of the KVHTMLOptionsEx structure to TRUE.
Guidelines for Character Set Conversion
The following diagram shows how the output character set is determined when the document character set can be determined:
Document Character Set Can Be Determined
The following diagram shows how the output character set is determined when the document character set cannot be determined:
Document Character Set Cannot Be Determined
Examples of Character Set Conversion
The examples below demonstrate possible configurations for mapping character sets and the expected output for each scenario.
Document Character Set Can be Determined
For the example in the following table, the document is an RTF file. The document character set is Traditional Chinese (BIG5).
Source charset set in the API | Target charset set in the API | Output charset |
---|---|---|
KVCS_GB | KVCS_UTF8 | Converts GB (Simplified Chinese) to UTF-8. The output character set is the target character set specified in the API. |
KVCS_GB | -- | Converts BIG5 to GB (Simplified Chinese). The output character set is the source character set specified in the API. |
-- | KVCS_UTF8 | Converts BIG5 to UTF-8. The output character set is the target character set specified in the API. |
-- | -- | The output character set is the document character set. No conversion. |
Document Character Set Cannot be Determined
For the example in the following table, the document is an ASCII file. The section Document Readers indicates that the document character set cannot be obtained from this file type. The document character set is KVCS_1251.
Source charset set in the API | Target charset set in the API | Output charset |
---|---|---|
KVCS_1252 | KVCS_UTF8 | Converts |
KVCS_1252 | KVCS_UNKNOWN | The output character set is the source character set specified in the API because |
KVCS_1252 | -- | The output character set is the source character set specified in the API. No conversion. |
-- | KVCS_1252 | Converts OS code page to |
-- | -- | The output character set is OS code page. No conversion. |
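The rows of the two tables above suggest a simple precedence: a forced target character set wins, then a forced source character set, then the document character set, and finally the OS code page. The following C sketch models only that precedence as read from the tables. It is not Export's implementation. DemoCharSet, its values, and demo_output_charset() are hypothetical stand-ins for KVCharSet and the SDK's internal logic, and treating an unknown target character set as "not set" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-in for KVCharSet; the real values are in kvcharset.h. */
typedef enum {
    DEMO_CS_UNKNOWN = 0,    /* not set, or cannot be determined */
    DEMO_CS_BIG5,
    DEMO_CS_GB,
    DEMO_CS_UTF8,
    DEMO_CS_1251,
    DEMO_CS_1252,
    DEMO_CS_OS_CODEPAGE     /* placeholder for the OS code page */
} DemoCharSet;

/*
 * Hypothetical helper that mirrors the table rows:
 *   doc      - character set the reader found in the document, or UNKNOWN
 *   src/trg  - character sets specified in the API
 *   forceSrc - models bForceSrcCharSet
 *   forceTrg - models bForceOutputCharSet
 */
DemoCharSet demo_output_charset(DemoCharSet doc,
                                DemoCharSet src, bool forceSrc,
                                DemoCharSet trg, bool forceTrg)
{
    /* 1. A forced, known target character set always wins. */
    if (forceTrg && trg != DEMO_CS_UNKNOWN)
        return trg;

    /* 2. Otherwise a forced, known source character set is used. */
    if (forceSrc && src != DEMO_CS_UNKNOWN)
        return src;

    /* 3. Otherwise fall back to the document character set, if known. */
    if (doc != DEMO_CS_UNKNOWN)
        return doc;

    /* 4. Nothing is known: the tables say the OS code page is used. */
    return DEMO_CS_OS_CODEPAGE;
}
```

For example, with a BIG5 document and only a source character set of GB forced, the sketch returns the GB value, which matches the second row of the first table.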
Set the Character Set During Conversion
You can convert the character set of a file at the time the file is converted.
To specify the source character set for documents from which the document character set cannot be obtained by the reader
- Set the eSrcCharSet member of the KVHTMLOptionsEx structure to one of the character sets enumerated in KVCharSet in kvcharset.h. See KVHTMLOptionsEx.
- Set the bForceSrcCharSet member of the KVHTMLOptionsEx structure to TRUE.
To specify the target character set
- Set the OutputCharSet member of the KVHTMLOptionsEx structure to one of the character sets enumerated in KVCharSet in kvcharset.h. See KVHTMLOptionsEx.
- Set the bForceOutputCharSet member of the KVHTMLOptionsEx structure to TRUE.
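As a minimal sketch of the two procedures above, the following C fragment sets the four members named in the steps. DemoHTMLOptions is a hypothetical stand-in that contains only those members (the real KVHTMLOptionsEx structure is declared in the SDK headers and has more), DemoCharSet stands in for KVCharSet, and the helper function names are invented for illustration. The stand-in uses the C99 true where the real structure expects TRUE.

```c
#include <stdbool.h>

/* Hypothetical stand-in for KVCharSet; the real values are in kvcharset.h. */
typedef enum {
    DEMO_CS_UNKNOWN = 0,
    DEMO_CS_1251,
    DEMO_CS_UTF8
} DemoCharSet;

/* Hypothetical stand-in holding only the members named in the steps above. */
typedef struct {
    DemoCharSet eSrcCharSet;          /* source character set            */
    bool        bForceSrcCharSet;     /* must be set for it to apply     */
    DemoCharSet OutputCharSet;        /* target character set            */
    bool        bForceOutputCharSet;  /* must be set for it to apply     */
} DemoHTMLOptions;

/* Both steps of the source procedure: choose the set, then force it. */
void demo_force_source_charset(DemoHTMLOptions *opts, DemoCharSet cs)
{
    opts->eSrcCharSet = cs;
    opts->bForceSrcCharSet = true;
}

/* Both steps of the target procedure: choose the set, then force it. */
void demo_force_target_charset(DemoHTMLOptions *opts, DemoCharSet cs)
{
    opts->OutputCharSet = cs;
    opts->bForceOutputCharSet = true;
}
```

Pairing each character set with its force flag in one helper makes it harder to set a character set and forget the flag, in which case the notes earlier in this section indicate that the value is not treated as set.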
Set the Character Set During File Extraction from a Container
You can convert the character set of a container subfile at the time the subfile is extracted from the container and before it is converted to HTML. This is most often used to set the output character set of a mail message's body text.
To specify the source character set of a subfile, call the fpExtractSubFile() function, and set the KVExtractSubFileArg->srcCharset argument to any value in the enumerated list in KVCharSet in kvcharset.h. See fpExtractSubFile().
To specify the target character set of a subfile, call the fpExtractSubFile() function, and set the KVExtractSubFileArg->trgCharSet argument to any value in the enumerated list in KVCharSet in kvcharset.h.
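The two settings above can be sketched in C as follows. DemoExtractSubFileArg is a hypothetical stand-in that holds only the two members named above (spelled as they appear in this section), DemoCharSet stands in for KVCharSet, and demo_set_subfile_charsets() is an invented helper. The fpExtractSubFile() call itself is omitted because its signature is not given here.

```c
/* Hypothetical stand-in for KVCharSet; the real values are in kvcharset.h. */
typedef enum {
    DEMO_CS_UNKNOWN = 0,
    DEMO_CS_1252,
    DEMO_CS_UTF8
} DemoCharSet;

/* Hypothetical stand-in with only the two character set members. */
typedef struct {
    DemoCharSet srcCharset;   /* source character set of the subfile */
    DemoCharSet trgCharSet;   /* target character set of the subfile */
} DemoExtractSubFileArg;

/* Fill in both members before the argument is passed to the extraction call. */
void demo_set_subfile_charsets(DemoExtractSubFileArg *arg,
                               DemoCharSet src, DemoCharSet trg)
{
    arg->srcCharset = src;
    arg->trgCharSet = trg;
}
```

A typical use, under these assumptions, is requesting UTF-8 output for the body text of a mail message whose source character set is known to be Windows-1252.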