UTF-8 Data

The UTF-8 encoding scheme is a variable-width Unicode encoding. Each valid code point is encoded using one to four 8-bit bytes. UTF-8 is popular because it is backward-compatible with ASCII, it is independent of endianness, and it often provides a more compact representation of Unicode than UTF-16.
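The variable width described above can be seen outside COBOL. The following is a minimal sketch in Python (chosen only because it is easy to run; it does not use Enterprise Developer), and the helper names utf8_len and utf16_len are illustrative, not part of any product API:

```python
# Encoded size of sample code points in UTF-8 and UTF-16.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # U+0041, U+00E9, U+20AC, U+1F600

def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))

def utf16_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-16 (no byte order mark)."""
    return len(ch.encode("utf-16-be"))

for ch in samples:
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len(ch)} byte(s), UTF-16 {utf16_len(ch)} byte(s)")

# ASCII compatibility: an ASCII character has the same byte value in both encodings.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```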

Enterprise Developer provides native COBOL support for defining, comparing, and moving UTF-8 data.

UTF-8 data items

UTF-8 data items can either be of fixed character length, or fixed byte length:

Fixed character-length data items

Define this type of data item within the PICTURE clause by specifying one or more U characters, or a single U followed by a repetition factor. Each U character represents a single Unicode code point, which can be between 1 and 4 bytes long. (For fixed character-length items, do not specify the BYTE-LENGTH clause.)

01 U1 PIC U(4).
01 U2 PIC UUUU.

Four bytes of storage are reserved for each character; therefore, each of the data items shown above requires 16 bytes of storage. Because UTF-8 is a variable-width encoding, not all characters need all 4 bytes, so a move operation might not fill all of the reserved bytes; any unused bytes are padded with the UTF-8 encoding of the space character, x'20'. If truncation is required during a move operation, truncation occurs on a character boundary.
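The storage, padding, and truncation rules above can be modelled as follows. This is a sketch in Python, not COBOL, and it rests on two assumptions: that a PIC U(n) item holds at most n characters, and that padding is appended on the right. The function name move_to_pic_u is hypothetical.

```python
def move_to_pic_u(value: str, char_len: int) -> bytes:
    """Model the bytes stored when `value` is moved to a PIC U(char_len) item."""
    storage = 4 * char_len                      # 4 bytes reserved per character
    # Assumption: at most char_len characters are kept. Slicing a Python str
    # works on code points, so the cut always falls on a character boundary.
    encoded = value[:char_len].encode("utf-8")
    # Unused reserved bytes are filled with the UTF-8 space, x'20'.
    return encoded.ljust(storage, b"\x20")

stored = move_to_pic_u("h\u00e9llo", 4)   # 5 characters into a 4-character item
print(len(stored), stored)
```

In this model the fifth character is dropped whole, and the remaining bytes of the 16-byte area are spaces.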

Fixed byte-length data items

Define this type of data item within the PICTURE clause by specifying a single U character, and include the BYTE-LENGTH phrase. The BYTE-LENGTH phrase specifies the exact length of the data item in bytes.

01 U3 PIC U BYTE-LENGTH 24.

Because UTF-8 characters vary in length, the number of characters that the data item can hold also varies, depending on the size of each character; however, it is always in the range [ceil(n/4), n], where n is the specified byte length.
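The bounds follow from the fact that each character occupies between 1 and 4 bytes. A short Python sketch of the arithmetic (the function name char_count_range is illustrative only):

```python
import math

def char_count_range(byte_length: int) -> tuple[int, int]:
    """Fewest and most characters that exactly fill `byte_length` bytes."""
    fewest = math.ceil(byte_length / 4)   # every character takes 4 bytes
    most = byte_length                    # every character takes 1 byte
    return fewest, most

low, high = char_count_range(24)          # matches the BYTE-LENGTH 24 example
print(low, high)

# Both extremes really do occupy 24 bytes once encoded.
all_ascii = ("A" * high).encode("utf-8")
all_four_byte = ("\U0001F600" * low).encode("utf-8")
```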

UTF-8 literals

There are two types of UTF-8 literal: basic and hexadecimal.

Basic literals

A basic literal can be defined as follows:

01 U4 PIC U VALUE u'lit-string'.

where lit-string is the literal value. If you specify any double-byte characters, these must be delimited with the shift-out and shift-in characters. Because UTF-8 is a variable-width encoding, the maximum number of characters that lit-string can contain varies.

lit-string can also contain Unicode escape sequences in the following formats:

\uhhhh - where each h represents a hexadecimal digit in the range 0-9, a-f, and A-F inclusive. This escape sequence corresponds to a Unicode code point from the Basic Multilingual Plane (BMP), within the range U+0000 to U+FFFF.
\U00hhhhhh - where each h represents a hexadecimal digit in the range 0-9, a-f, and A-F inclusive. This escape sequence corresponds to a Unicode code point from the Basic Multilingual Plane, or any of the Supplementary Planes. This means that as well as the range specified above, it also includes U+10000 to U+10FFFF.

Note: Code points U+D800 to U+DFFF are reserved for the high and low halves of surrogate pairs used by UTF-16; therefore, do not specify \uD800 through \uDFFF or \U0000D800 through \U0000DFFF as Unicode escape sequences in UTF-8 literals.

To include the text \uhhhh or \U00hhhhhh literally in a UTF-8 literal, escape the backslash (\) with another backslash; for example, \\u00FF is not processed as a Unicode escape sequence.
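The escape rules above can be sketched as a small expander. This is a Python model, not the compiler's implementation: the function name expand_escapes is hypothetical, and it assumes that an escaped backslash (\\) yields a single backslash and that a surrogate escape is treated as an error.

```python
import re

# A backslash followed by: another backslash, u + 4 hex digits, or U00 + 6 hex digits.
_ESCAPE = re.compile(r"\\(\\|u[0-9A-Fa-f]{4}|U00[0-9A-Fa-f]{6})")

def expand_escapes(lit_string: str) -> str:
    """Replace \\uhhhh and \\U00hhhhhh escapes with the code points they name."""
    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body == "\\":
            return "\\"                    # assumed: \\ collapses to one backslash
        code_point = int(body[1:], 16)     # drop the leading u or U
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code points are not allowed")
        if code_point > 0x10FFFF:
            raise ValueError("code point is outside the Unicode range")
        return chr(code_point)
    return _ESCAPE.sub(replace, lit_string)

print(expand_escapes(r"caf\u00E9").encode("utf-8"))     # BMP escape
print(expand_escapes(r"\U0001F600").encode("utf-8"))    # supplementary-plane escape
print(expand_escapes(r"\\u00FF"))                       # escaped backslash, no expansion
```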

Hexadecimal literals

A hexadecimal literal can be defined as follows:

01 U5 PIC U VALUE ux'hex-string'.

where hex-string must contain at least 2 hexadecimal digits, each in the range 0-9, a-f, or A-F inclusive. Each pair of digits represents one byte of a UTF-8 encoded character; a single character can therefore require between one and four pairs.

The sequence of bytes represented by hex-string is validated to ensure that it is a valid sequence of UTF-8 bytes. If it is valid, the bytes are stored as UTF-8 characters, and the content has the same meaning as a basic UTF-8 literal that specifies the same characters.
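The validation step can be approximated with a strict UTF-8 decoder. The following Python sketch is only a model of the check described above; the function name hex_literal_to_text is hypothetical, and the exact diagnostics that the compiler issues are not reproduced.

```python
def hex_literal_to_text(hex_string: str) -> str:
    """Convert the digits of a ux'...' literal to the characters they encode."""
    if len(hex_string) < 2:
        raise ValueError("at least two hexadecimal digits are required")
    raw = bytes.fromhex(hex_string)   # ValueError on non-hex or an odd digit count
    return raw.decode("utf-8")        # UnicodeDecodeError if not valid UTF-8

# C3 A9 is the two-byte encoding of U+00E9; E2 82 AC is the three-byte U+20AC.
print(hex_literal_to_text("C3A9") == "\u00e9")
print(hex_literal_to_text("E282AC") == "\u20ac")

def is_valid_hex_literal(hex_string: str) -> bool:
    """True if hex_string would pass the model's validation."""
    try:
        hex_literal_to_text(hex_string)
        return True
    except ValueError:                # UnicodeDecodeError is a ValueError subclass
        return False
```

For example, is_valid_hex_literal("C3") returns False in this model, because C3 is a lead byte with no continuation byte.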

COBOL statements that support UTF-8 data items

The following COBOL statements support the use of UTF-8 data items:

ALLOCATE and FREE
EVALUATE and IF (see UTF-8 comparisons, below)
INITIALIZE - the category default for UTF-8 data items is UTF-8 spaces (x'20')
MERGE and SORT (see UTF-8 comparisons, below)
MOVE - the basic MOVE rules are:
- An item of class UTF-8 can only be moved to an item of class national or UTF-8.
- An item of class UTF-8 can only receive an item of class alphabetic, alphanumeric, national, or UTF-8.
  Note: Sending items also include numeric-edited, alphanumeric-edited, national-edited, and national-numeric-edited.
STRING and UNSTRING

UTF-8 comparisons

A UTF-8 comparison is a comparison in which at least one operand is of class UTF-8. The other operand must be of class UTF-8, alphabetic, alphanumeric, or national; an operand that is not of class UTF-8 is converted to class UTF-8 before the comparison.

During MERGE or SORT operations, comparisons are performed using a binary, byte-by-byte comparison, which produces the same order as a corresponding set of national strings representing the same Unicode code points (assuming all code points are taken from the Basic Multilingual Plane).
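The ordering claim, and the reason for the Basic Multilingual Plane caveat, can be checked with a short Python sketch. It assumes that national strings are compared by their UTF-16 code unit values, which is modelled here with big-endian bytes; the function names are illustrative.

```python
def utf8_order(strings):
    """Sort by the raw UTF-8 bytes, as a binary byte-by-byte comparison would."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))

def utf16_order(strings):
    """Sort by UTF-16 code units (big-endian bytes preserve code unit order)."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))

bmp_only = ["z", "\u00e9", "\u20ac", "\uff5e", "A"]
with_supplementary = ["\uff5e", "\U00010000"]

same_for_bmp = utf8_order(bmp_only) == utf16_order(bmp_only)
same_with_supplementary = (
    utf8_order(with_supplementary) == utf16_order(with_supplementary)
)
print(same_for_bmp, same_with_supplementary)
```

The two orders agree for the BMP-only list but not for the second list: U+10000 is stored in UTF-16 as the surrogate pair D800 DC00, which sorts below FF5E, whereas its UTF-8 lead byte F0 sorts above EF.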

If the operands are of unequal length, the comparison is performed as if the shorter operand were padded (with trailing UTF-8 space characters) to the length of the other operand.

If the operands are of equal length (or assumed to be, due to the additional padding), the comparison compares each corresponding character position, starting at the left-most position, until either unequal UTF-8 characters are encountered or the right-most character position is reached, whichever comes first. The operands are considered equal if all corresponding UTF-8 characters are equal.

The first pair of unequal characters determines the relationship of the operands. The operand that contains the UTF-8 character with the higher collating sequence value is the greater operand.

Note: The higher collating sequence value is determined using the hexadecimal value of characters; the PROGRAM COLLATING SEQUENCE clause has no effect on UTF-8 comparisons.
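The comparison rules above can be summarized in a short model. This is a Python sketch rather than the runtime's algorithm: the function name compare_utf8 is hypothetical, and it assumes that the shorter operand is padded to the same number of characters as the longer one.

```python
def compare_utf8(left: str, right: str) -> int:
    """Return -1, 0, or 1 according to the UTF-8 comparison rules described above."""
    width = max(len(left), len(right))
    # Pad the shorter operand with trailing spaces (x'20').
    left_bytes = left.ljust(width, " ").encode("utf-8")
    right_bytes = right.ljust(width, " ").encode("utf-8")
    # Comparing the encoded bytes gives the same result as comparing code
    # points, so no collating sequence is involved.
    return (left_bytes > right_bytes) - (left_bytes < right_bytes)

print(compare_utf8("abc", "abc  "))    # differ only by trailing spaces
print(compare_utf8("abc", "abd"))      # first difference is at the third character
print(compare_utf8("\u00e9", "z"))     # x'C3A9' compared with x'7A'
```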

Intrinsic function support for UTF-8 data items

The following intrinsic functions support the processing of UTF-8 data in native COBOL:

BIT-OF
BYTE-LENGTH
DISPLAY-OF
HEX-OF
LENGTH
LOWER-CASE
NATIONAL-OF
TRIM
ULENGTH
UPOS
UPPER-CASE
USUBSTR
USUPPLEMENTARY
UVALID
UWIDTH