Determine the Character Set of the Output Text
To determine the output character set of a filtered document, Filter considers the following:
- Whether the document reader can determine the character set of the file format. If the document reader cannot determine the character set information for the document type, set the source character set in the API.
- Whether the source character set is specified in the API.
- Whether the target character set is specified in the API.
Guidelines for Character Set Conversion
The following rules determine how character sets are mapped:
- If the source character set is neither determined by the document reader nor configured in the API, the character set of the output text is always unknown, regardless of the target character set configuration. Filter cannot convert the document to a target character set or to the operating system's code page unless the source character set is known.
- If the target character set is not specified in the API, and the source character set is identified, the operating system's code page is used for the output text.
- If the source character set is identified, and the target character set is specified in the API, the target character set specified in the API is used for the output text.
- For documents that contain multiple character sets, Micro Focus recommends that the target character set be forced to UNICODE or UTF-8.
The following table illustrates how Filter determines the character set of the output text.
Source charset read by Filter | Source charset specified in API | Target charset specified in API | Output charset |
---|---|---|---|
No | No | No | no conversion |
No | KVCS_936 | No | OS code page |
No | No | UNICODE | no conversion |
No | KVCS_936 | UNICODE | UNICODE |
Yes | No | No | OS code page |
Yes | KVCS_936 | No | OS code page |
Yes | No | UNICODE | UNICODE |
Yes | KVCS_936 | UNICODE | UNICODE |
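
The decision logic in the guidelines and table can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not Filter's implementation: the function name `resolve_output_charset`, its parameters, and the string results are hypothetical and are not part of the Filter API. `None` stands for "not specified in the API".

```python
from typing import Optional

# Labels taken from the "Output charset" column of the table.
NO_CONVERSION = "no conversion"
OS_CODE_PAGE = "OS code page"


def resolve_output_charset(reader_detected_source: bool,
                           api_source: Optional[str],
                           api_target: Optional[str]) -> str:
    """Hypothetical helper that mirrors the documented rules.

    reader_detected_source: True if the document reader identified the
        source character set.
    api_source: source character set configured in the API, or None.
    api_target: target character set configured in the API, or None.
    """
    # The source is "known" if either the reader or the API supplies it.
    source_known = reader_detected_source or api_source is not None

    # Rule 1: without a known source, no conversion is possible,
    # regardless of the target setting.
    if not source_known:
        return NO_CONVERSION

    # Rule 2: known source, no target -> operating system's code page.
    if api_target is None:
        return OS_CODE_PAGE

    # Rule 3: known source and explicit target -> the target is used.
    return api_target


if __name__ == "__main__":
    # Enumerate the same eight combinations as the table.
    for detected in (False, True):
        for target in (None, "UNICODE"):
            for source in (None, "KVCS_936"):
                result = resolve_output_charset(detected, source, target)
                print(detected, source, target, "->", result)
```

Note that the sketch only decides which character set applies. It does not model what happens when the reader and the API both supply a source character set, because the rules above do not state which one takes precedence.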