Determine the Character Set of the Output Text

To determine the output character set of a filtered document, Filter considers the following:

  • Whether the document reader can determine the character set of the file format. If the document reader cannot determine the character set for the document type, you must set the source character set in the API.
  • Whether the source character set is specified in the API.
  • Whether the target character set is specified in the API.

Guidelines for Character Set Conversion

The following rules determine how character sets are mapped:

  • If the source character set is neither determined by the document reader nor configured in the API, the character set of the output text is always unknown, regardless of the target character set configuration. A document can be converted to a target character set or to the operating system's code page only when its source character set is known.
  • If the target character set is not specified in the API, and the source character set is identified, then the operating system's code page is used for the output text.
  • If the source character set is identified, and the target character set is specified in the API, then the target character set specified in the API is used for the output text.
  • For documents that contain multiple character sets, OpenText recommends that the target character set be forced to UNICODE or UTF-8.
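
The guidelines above can be sketched as a small decision function. This is only an illustration of the rules as stated here, not part of the Filter API: the function name `output_charset`, its parameters, and the string labels are hypothetical, and the sketch does not model which source takes precedence when the document reader and the API both supply one.

```python
from typing import Optional

# Hypothetical labels used only for this illustration.
NO_CONVERSION = "no conversion"
OS_CODE_PAGE = "OS code page"


def output_charset(detected_by_reader: bool,
                   source_in_api: Optional[str],
                   target_in_api: Optional[str]) -> str:
    """Return the output character set implied by the guidelines above.

    detected_by_reader: True if the document reader identified the source charset.
    source_in_api:      source charset configured in the API, or None.
    target_in_api:      target charset configured in the API, or None.
    """
    # The source is known if either the reader or the API supplies it.
    source_known = detected_by_reader or source_in_api is not None

    if not source_known:
        # Without a known source, no conversion is possible,
        # whatever target is configured.
        return NO_CONVERSION

    if target_in_api is None:
        # Known source, no target: fall back to the OS code page.
        return OS_CODE_PAGE

    # Known source and explicit target: use the target.
    return target_in_api


if __name__ == "__main__":
    print(output_charset(False, "KVCS_936", "UNICODE"))
    print(output_charset(False, None, "UNICODE"))
```

Under these assumptions, the first call prints UNICODE because the source is supplied through the API, and the second prints "no conversion" because the source is unknown.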

The following table illustrates how Filter determines the character set of the output text.

Determining the Output Character Set—Example

Source charset read by Filter   Source charset specified in API   Target charset specified in API   Output charset
No                              No                                No                                no conversion
No                              KVCS_936                          No                                OS code page
No                              No                                UNICODE                           no conversion
No                              KVCS_936                          UNICODE                           UNICODE
Yes                             No                                No                                OS code page
Yes                             KVCS_936                          No                                OS code page
Yes                             No                                UNICODE                           UNICODE
Yes                             KVCS_936                          UNICODE                           UNICODE