XML Document Declarations

An XML document must identify itself as such to the SERNET messaging server in order to be routed properly to and from XML Services. In addition, once an XML document reaches an XML parser or similar XML processor on either the server or the client, the document must declare the type of XML document it is. This allows the XML parser to interpret the document data structures properly.

Identifying XML Documents

Standard-compliant XML relies on a combination of file naming conventions and declarations in the XML instance document itself to flag XML documents for processing. Conventions for doing this differ somewhat on distributed systems and mainframes.

Distributed systems usually identify XML documents by the Web-style .xml file name extension, which is appended to a base file name of up to 8 characters (or more on modern systems). The file name extension identifies the document type immediately for Web browsers and other distributed applications that work with XML. This eliminates the need for these applications to open each document they receive and inspect the contents to determine whether it contains XML. If you access XML Services from a distributed client, you may want to append the .xml file extension to any file names when saving reusable Serena XML documents in your local development environment. This facilitates the integration of ChangeMan ZMF with distributed applications.

Mainframes do not support the same file naming conventions used on most distributed systems. The SERNET messaging server therefore cannot rely on file naming conventions to identify XML documents. Instead, SERNET inspects the first line of an incoming message to determine whether or not it contains XML. For this reason, XML Services requires that XML documents always include an <?xml?> declaration to identify themselves. This requirement applies regardless of the type of system on which the document originates.
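
A client can approximate this first-line check before sending a message. The Python sketch below is illustrative only. The function name `has_xml_declaration` is hypothetical, and the exact inspection logic used by SERNET is not described in this section.

```python
def has_xml_declaration(message: str) -> bool:
    """Return True if the first line of the message opens with an <?xml ...?> declaration."""
    # Only the first line is examined, as an approximation of the check described above.
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    # XML names are case-sensitive, so the declaration must begin with lowercase "<?xml".
    return first_line.startswith("<?xml ") and "?>" in first_line


# Hypothetical usage: a preflight check before submitting a document.
document = '<?xml version="1.0"?>\n<service name="example"/>'
if not has_xml_declaration(document):
    raise ValueError("document does not start with an <?xml?> declaration")
```

A check like this catches the most common mistake, a blank line or comment placed ahead of the declaration, without requiring a full XML parser.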

Mainframe users may find it useful to define a library type called “XML” for storing reusable XML documents. However, this is not a requirement of XML Services.

<?XML?> Declaration Syntax

An <?xml?> declaration is required on the first line of an XML document. Because it is not properly an XML statement, it precedes the XML root tag of your document. It also precedes any other non-XML declarations or processing instructions that appear before the root tag.

The <?xml?> declaration looks something like this:

<?xml version="1.0" encoding="UTF-8"?>

The version attribute is required. The encoding attribute is optional (the default is UTF-8).

<?XML?> Version Attribute

The version attribute in the <?xml?> declaration refers to the particular W3C syntax standard followed in your XML document. XML Services recognizes XML Version 1.0, Second Edition, which was published by the W3C in October 2000. This is the only version that XML Services accepts; attempts to use other versions will fail. Consequently, your <?xml?> declaration will always have the following version attribute:

<?xml version="1.0"?>
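
The rules stated so far (a required version of 1.0 and an optional encoding that defaults to UTF-8) can be expressed as a small validation routine. The following Python sketch is an assumption-laden illustration, not the parser that XML Services uses. The names `DECLARATION_PATTERN` and `parse_declaration` are hypothetical, and the pattern handles only the version and encoding attributes discussed here.

```python
import re

# Matches <?xml version="..." ?> with an optional encoding="..." attribute.
# Group 1 and group 3 capture the opening quote so the closing quote must match.
DECLARATION_PATTERN = re.compile(
    r'^<\?xml\s+version=(["\'])(?P<version>[^"\']+)\1'
    r'(?:\s+encoding=(["\'])(?P<encoding>[^"\']+)\3)?'
    r'\s*\?>'
)


def parse_declaration(first_line: str) -> tuple:
    """Return (version, encoding) from an <?xml?> declaration, applying the stated defaults."""
    match = DECLARATION_PATTERN.match(first_line)
    if match is None:
        raise ValueError("missing or malformed <?xml?> declaration")
    version = match.group("version")
    if version != "1.0":
        # The text above states that versions other than 1.0 will fail.
        raise ValueError("unsupported XML version: " + version)
    # The encoding attribute is optional and defaults to UTF-8.
    encoding = match.group("encoding") or "UTF-8"
    return version, encoding


print(parse_declaration('<?xml version="1.0"?>'))  # ('1.0', 'UTF-8')
```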

<?XML?> Encoding Attribute

The encoding attribute in the <?xml?> declaration identifies the character encoding standard used to represent text in your XML document. To ensure both cross-platform and international language compatibility, the W3C specification for XML requires that all standards-compliant XML parsers support Unicode. Support for additional character sets is optional.

Unicode is a superset of the 7-bit ASCII character code, with international language and special symbol extensions. The most widely supported variant of Unicode is UTF-8, a variable-length encoding that uses one to four 8-bit bytes to represent characters and symbols. It yields compact file sizes for Latin-based alphabetic text, yet expands to support non-Latin alphabets, ideographic characters, and a wide variety of special symbols on demand. The first 128 code points in UTF-8 (character codes 0 to 127) correspond to the same character codes in 7-bit ASCII.

XML Services supports 7-bit ASCII and the full U.S. EBCDIC character set, as well as the subset of UTF-8 that matches 7-bit ASCII. Any of the following encoding attributes is therefore valid in the <?xml?> declaration for XML Services:

<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="US-ASCII"?>
<?xml version="1.0" encoding="EBCDIC-US"?>

Note

You may also omit the encoding attribute and it will default to UTF-8.

The values for the encoding attribute have the meanings shown in the following table:

Exhibit 3-2. XML Character Encoding Attributes

| Attribute Value | Character Encoding Description |
| --- | --- |
| UTF-8 | Variable-length Unicode representation in one to four 8-bit bytes. Supports international languages, including non-Latin and ideographic scripts. The default encoding for XML. XML Services accepts documents with this attribute, but interprets them as 7-bit ASCII at this time. Codes higher than 127 are ignored. |
| US-ASCII | 8-bit ASCII character set. XML Services accepts documents with this attribute, but interprets them as 7-bit ASCII at this time. Codes higher than 127 are ignored. |
| EBCDIC-US | 1987 standard EBCDIC for U.S. English & IBM 3270 terminals. Fully supported by XML Services. |
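
Because codes higher than 127 are ignored for UTF-8 and US-ASCII documents, it can be useful to locate such characters before a document is sent. The Python sketch below shows one way to do so. The names `ACCEPTED_ENCODINGS` and `find_ignored_characters` are hypothetical, and the exact-match comparison of encoding names is an assumption, since this section does not say whether XML Services treats them as case-sensitive.

```python
# Encoding attribute values listed as valid for XML Services in the table above.
ACCEPTED_ENCODINGS = ("UTF-8", "US-ASCII", "EBCDIC-US")


def find_ignored_characters(encoding: str, text: str) -> list:
    """Return (position, character) pairs whose codes exceed 127 in a 7-bit-only encoding."""
    if encoding not in ACCEPTED_ENCODINGS:
        raise ValueError("encoding not accepted: " + encoding)
    if encoding == "EBCDIC-US":
        # The 7-bit restriction is stated only for UTF-8 and US-ASCII,
        # so nothing is flagged for EBCDIC-US in this sketch.
        return []
    # UTF-8 and US-ASCII documents are interpreted as 7-bit ASCII.
    return [(index, ch) for index, ch in enumerate(text) if ord(ch) > 127]


print(find_ignored_characters("UTF-8", "caf\u00e9"))  # [(3, 'é')]
```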

Undefined Character Code Handling

The double-byte variant of Unicode is UTF-16. UTF-16 reserves the range of character codes E000 – F8FF as the Private Use Area (PUA) range. The PUA range is reserved for private use by software vendors.

When converting from EBCDIC to UTF-16 or UTF-8, conversion will fail for characters that are not defined in the EBCDIC code page. To handle characters that fail conversion, SERNET utilizes PUA range F800 – F8FF. For UTF-16, undefined characters are converted to F8xx, where xx is the hexadecimal value of the undefined EBCDIC character.

For UTF-8, the same F8xx code point is encoded as three bytes, which in binary corresponds to:

11101111 101000*bb* 10*bbbbbb*

Where *bbbbbbbb* is the binary value of the undefined EBCDIC character.

When converting from UTF-16 or UTF-8 back to EBCDIC, SERNET will convert the F8xx characters back to their original xx form.
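
The round trip through the Private Use Area can be worked through in a few lines of Python. This is a minimal sketch of the mapping as described above, not SERNET's implementation. The function names `undefined_ebcdic_to_pua` and `pua_to_ebcdic` are hypothetical, and the byte value 0x41 is used purely as sample input, without any claim that it is undefined in a particular EBCDIC code page.

```python
def undefined_ebcdic_to_pua(byte_value: int) -> str:
    """Map an undefined EBCDIC byte xx to the Private Use Area code point F8xx."""
    if not 0x00 <= byte_value <= 0xFF:
        raise ValueError("EBCDIC byte must be in the range 00-FF")
    return chr(0xF800 + byte_value)


def pua_to_ebcdic(character: str) -> int:
    """Recover the original EBCDIC byte xx from a code point in the range F800-F8FF."""
    code_point = ord(character)
    if not 0xF800 <= code_point <= 0xF8FF:
        raise ValueError("character is outside the F800-F8FF range")
    return code_point - 0xF800


placeholder = undefined_ebcdic_to_pua(0x41)
# UTF-16 (big-endian) holds the code point directly as F8 41.
print(placeholder.encode("utf-16-be").hex())  # f841
# UTF-8 uses three bytes: 11101111 10100001 10000001.
print(" ".join(format(b, "08b") for b in placeholder.encode("utf-8")))
# Converting back yields the original byte value.
print(hex(pua_to_ebcdic(placeholder)))  # 0x41
```

For the sample byte 0x41 (binary 01000001), the top two bits 01 fill the second UTF-8 byte and the low six bits 000001 fill the third, which matches the binary pattern given above.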