Draft Proposal for UTF-G-16 Specification


Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated August 26, 2007

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

1. Introduction

The name of this UTF (UCS Transformation Format) is "UTF-G-16". UTF-G-16 extends UTF-16 to support code points up to U+7FFFFFFF.

UTF-G-16 is one of the encodings defined as part of UCS-G, which also includes similar extensions for UTF-8 and UTF-32. For general information about UCS-G, please see the UCS-G Specification.

2. Properties

UTF-G-16 preserves and extends useful properties of UTF-16. For code points less than or equal to U+10FFFF, it is identical to UTF-16.

UTF-G-16 employs sixteen-bit code units. Single-unit codes are for code points less than or equal to U+FFFF. Codes that consist of two units are standard UTF-16 surrogate pairs, and cover code points up to U+10FFFF.

Codes for code points greater than or equal to U+110000 have three or four units, all in the range DC04..DFFF. (Hexadecimal notation is used here and below except where binary or decimal are explicitly specified.)

Leading (initial) units can be distinguished from trailing (non-initial) units according to rules explained below under Details. Because of this distinction, a search cannot produce a "false match" that begins in the middle of a code.

The length of each code is specified by the leading code unit.

UTF-16 is the most compact UTF for text whose characters fall mostly in the range U+0800..U+FFFF of the Basic Multilingual Plane, such as CJK text: each such character takes two bytes in UTF-16, versus three in UTF-8 and four in UTF-32. UTF-G-16 preserves this beneficial property of UTF-16.

Binary comparison or sorting of UTF-16 codes does not always yield the same order as numerical comparison of code points, because surrogate pairs (whose leading units are in the range D800..DBFF) sort before the single-unit codes E000..FFFF. Any compatible extension of UTF-16 (including UTF-G-16) necessarily inherits this limitation. In this respect, UTF-16 and UTF-G-16 differ from UTF-8, UTF-G-8, UTF-32, and UTF-G-32, which all yield the same sort-order as a numerical comparison of code points. (UTF-G-16 does preserve the property that codes for code points outside the range U+E000..U+FFFF sort in numerical order.)
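The sort-order behavior described above can be illustrated with a small sketch. The function name utfg16_binary_compare is hypothetical (it is not taken from ConvertUTFG.c), and the sketch simply compares two sequences unit by unit, which is what "binary comparison" is assumed to mean here.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two sequences of 16-bit units, unit by unit (binary order).
 * Returns -1, 0, or 1, in the manner of memcmp.
 * Hypothetical helper for illustration only. */
int utfg16_binary_compare(const uint16_t *a, size_t alen,
                          const uint16_t *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;   /* a proper prefix sorts first */
}
```

Comparing U+FFFF (FFFF) with U+10000 (D800 DC00) returns 1, so the smaller code point sorts later; this is the inherited UTF-16 limitation. Comparing U+10FFFF (DBFF DFFF) with U+110000 (DC04 DE80 DE00) returns -1, which agrees with numerical order.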

3. Details

(Note: some readers may prefer to skip down to view the examples first, then return here to read the details.)

3.1 Terms and Abbreviations

3.2 Code points less than or equal to U+10FFFF

For code points less than or equal to U+10FFFF, UTF-G-16 is identical to UTF-16. Codes are either one unit (up to U+FFFF) or two units (above U+FFFF, using surrogate pairs). (A UTF-16 surrogate pair consists of a leading unit in the range D800..DBFF followed by a trailing unit in the range DC00..DFFF.)
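As a minimal sketch of this case, the following function encodes a code point up to U+10FFFF as one or two units, exactly as plain UTF-16 does. The name utfg16_encode_short is hypothetical, and rejecting the surrogate values D800..DFFF as input follows standard UTF-16 practice; it is an assumption here, not something stated in this draft.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point <= U+10FFFF as one or two 16-bit units (plain UTF-16).
 * Returns the number of units written to out, or 0 if cp is out of range
 * or is a surrogate value (assumption: rejected as in standard UTF-16).
 * Hypothetical helper; not the ConvertUTFG.c implementation. */
size_t utfg16_encode_short(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;                       /* single unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        uint32_t v = cp - 0x10000;                   /* 20-bit offset */
        out[0] = (uint16_t)(0xD800 | (v >> 10));     /* leading: top 10 bits */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));   /* trailing: low 10 bits */
        return 2;
    }
    return 0;                                        /* needs 3 or 4 units */
}
```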

3.3 Code points in the range U+110000..U+7FFFFFFF

3.3.1 For code points in the range U+110000..U+3FFFFFF, a code consists of three units (forty-eight bits). The leading unit is in the range DC04..DCFF, with bit pattern 11011100xxxxxxxx (binary), where the eight x bits are available for storage of the USV.

3.3.2 For code points in the range U+4000000..U+7FFFFFFF, a code consists of four units (sixty-four bits). The leading unit is in the range DD00..DD0F, with bit pattern 110111010000xxxx (binary), where the four x bits are available for storage of the USV.

3.3.3 Trailing units are in the range DE00..DFFF, with bit pattern 1101111xxxxxxxxx (binary), where the nine x bits are available for storage of the USV.

3.3.4 The USV is stored in the available bits. A code always uses the minimum number of units necessary to store the USV. Leading zero bits on the USV are omitted if this omission results in a shorter code. (For example, the first hexadecimal digit of U+1FFFFFF is 1, which in binary is 0001, with three leading zero bits which are NOT all stored in the code.) If more bits are available than are needed to store the USV, then the USV is effectively padded on the left with zero bits to fill the leftover space.
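The bit layout in 3.3.1 through 3.3.4 can be sketched as follows. The function name utfg16_encode_long is hypothetical and the code is an illustration under the rules stated above, not the ConvertUTFG.c implementation. A three-unit code holds 8 + 9 + 9 = 26 bits, and a four-unit code holds 4 + 9 + 9 + 9 = 31 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point in U+110000..U+7FFFFFFF as three or four units.
 * Returns the number of units written to out, or 0 if cp is outside
 * that range. Hypothetical helper for illustration only. */
size_t utfg16_encode_long(uint32_t cp, uint16_t out[4])
{
    if (cp < 0x110000 || cp > 0x7FFFFFFF)
        return 0;

    if (cp <= 0x3FFFFFF) {
        /* 26 bits: 8 in the leading unit, 9 in each of two trailing units */
        out[0] = (uint16_t)(0xDC00 | (cp >> 18));
        out[1] = (uint16_t)(0xDE00 | ((cp >> 9) & 0x1FF));
        out[2] = (uint16_t)(0xDE00 | (cp & 0x1FF));
        return 3;
    }

    /* 31 bits: 4 in the leading unit, 9 in each of three trailing units */
    out[0] = (uint16_t)(0xDD00 | (cp >> 27));
    out[1] = (uint16_t)(0xDE00 | ((cp >> 18) & 0x1FF));
    out[2] = (uint16_t)(0xDE00 | ((cp >> 9) & 0x1FF));
    out[3] = (uint16_t)(0xDE00 | (cp & 0x1FF));
    return 4;
}
```

For U+1FFFFFF (25 significant bits) this yields DC7F DFFF DFFF: one padding zero bit is stored in the leading unit, and the other leading zero bits of the USV are dropped.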

3.4 Analysis

Length    USV                     Leading       Trailing
2 units   U+10000..U+10FFFF       D800..DBFF    DC00..DFFF
3 units   U+110000..U+3FFFFFF     DC04..DCFF    DE00..DFFF
4 units   U+4000000..U+7FFFFFFF   DD00..DD0F    DE00..DFFF

3.4.1 The table above illustrates the forms of two-, three-, and four-unit codes (as already defined in sections 3.2 and 3.3).
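Since the length of a code is determined by its leading unit, the ranges defined in sections 3.2 and 3.3 translate directly into a lookup. The sketch below assumes the caller already knows that the unit is a leading unit; the name utfg16_length_from_leading is hypothetical.

```c
#include <stdint.h>

/* Given a unit already known to be a LEADING unit, return the length of
 * its code in units (1..4), or 0 if the value cannot lead a UTF-G-16 code.
 * Hypothetical helper for illustration only. */
int utfg16_length_from_leading(uint16_t lead)
{
    if (lead < 0xD800 || lead > 0xDFFF)
        return 1;                      /* single unit, U+0000..U+FFFF */
    if (lead <= 0xDBFF)
        return 2;                      /* D800..DBFF: surrogate pair */
    if (lead >= 0xDC04 && lead <= 0xDCFF)
        return 3;                      /* U+110000..U+3FFFFFF */
    if (lead >= 0xDD00 && lead <= 0xDD0F)
        return 4;                      /* U+4000000..U+7FFFFFFF */
    return 0;                          /* DC00..DC03, DD10..DDFF, DE00..DFFF */
}
```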

Length         D800..DBFF   DC00..DDFF   DE00..DFFF
2 units        leading      trailing     trailing
3 or 4 units   unused       leading      trailing

3.4.2 The second table rearranges some of the same information. It shows that leading units can be distinguished from trailing units based on their values, except for some surrogates in the range DC00..DDFF that can function either as trailing units in 2-unit codes, or as leading units in longer codes. Units DC00..DDFF can be recognized as leading or trailing by these two rules:

Either rule can be applied, whichever is most convenient; the results are the same for well-formed UTF-G-16 text. (See the function isLeadingUTFG16Unit() in ConvertUTFG.c for an implementation of these rules.)
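A minimal sketch of two such tests follows. It ASSUMES one plausible formulation that is consistent with the tables in this section: (a) a unit in DC00..DDFF is trailing if the preceding unit is in D800..DBFF, otherwise leading; (b) a unit in DC00..DDFF is leading if the following unit is in DE00..DFFF, otherwise trailing. Both the formulation and the function names are assumptions for illustration; this is not the code of isLeadingUTFG16Unit() in ConvertUTFG.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Units outside DC00..DDFF are classified by value alone:
 * returns 1 for leading, 0 for trailing, -1 for "context needed". */
static int classify_by_value(uint16_t u)
{
    if (u < 0xD800 || u > 0xDFFF) return 1;   /* single-unit code */
    if (u <= 0xDBFF) return 1;                /* leads a surrogate pair */
    if (u >= 0xDE00) return 0;                /* always trailing */
    return -1;                                /* DC00..DDFF */
}

/* Assumed rule (a): look at the preceding unit. */
bool is_leading_by_prev(const uint16_t *text, size_t i)
{
    int c = classify_by_value(text[i]);
    if (c >= 0) return c == 1;
    bool prev_leads_pair =
        i > 0 && text[i - 1] >= 0xD800 && text[i - 1] <= 0xDBFF;
    return !prev_leads_pair;
}

/* Assumed rule (b): look at the following unit. */
bool is_leading_by_next(const uint16_t *text, size_t len, size_t i)
{
    int c = classify_by_value(text[i]);
    if (c >= 0) return c == 1;
    return i + 1 < len && text[i + 1] >= 0xDE00 && text[i + 1] <= 0xDFFF;
}
```

In the well-formed sequence DBFF DFFF DC04 DE80 DE00 D800 DC04 0041, both functions classify the first DC04 as leading (it starts the code for U+110000) and the second DC04 as trailing (it completes a surrogate pair).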

3.4.3 Only part of the range DC00..DDFF, namely DC04..DD0F, is used for leading units in UTF-G-16. (The two rules in 3.4.2 are formulated to cover further extensions, such as UTF-E-16, which may use DD10..DDFF as leading units.)

3.4.4 Traditionally, for UTF-16 surrogate pairs, leading and trailing units have also been called "high-surrogate code units" and "low-surrogate code units", respectively. In the context of three- and four-unit codes, however, the "high/low" terminology would only lead to confusion.

4. Examples

 U+0041 = 0041 (one unit; the code for the letter 'A')

 U+10FFFF = DBFF DFFF (two units; the last UTF-16 code, composed of two surrogates)

 U+110000 = DC04 DE80 DE00 (three units; the first code that is not UTF-16)

 U+3FFFFFF = DCFF DFFF DFFF (the last three-unit code)

 U+4000000 = DD00 DF00 DE00 DE00 (the first four-unit code)

 U+7FFFFFFF = DD0F DFFF DFFF DFFF (four units; the last code point supported by UCS-G)
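The examples above can be checked with a small decoder. The following is a sketch under the rules of section 3, assuming the input starts at a leading unit; the name utfg16_decode_one is hypothetical and the code is not taken from ConvertUTFG.c. It rejects codes that use more units than necessary, in line with 3.3.4.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code starting at units[0], assumed to be a leading unit.
 * Stores the code point in *cp and returns the number of units consumed,
 * or 0 if the sequence is truncated, malformed, or longer than necessary.
 * Hypothetical helper for illustration only. */
size_t utfg16_decode_one(const uint16_t *units, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;
    uint16_t u = units[0];

    if (u < 0xD800 || u > 0xDFFF) {             /* one unit */
        *cp = u;
        return 1;
    }
    if (u <= 0xDBFF) {                           /* two units: surrogate pair */
        if (len < 2 || units[1] < 0xDC00 || units[1] > 0xDFFF) return 0;
        *cp = 0x10000u + (((uint32_t)(u & 0x3FF) << 10)
                          | (uint32_t)(units[1] & 0x3FF));
        return 2;
    }

    size_t n;
    uint32_t v, min;
    if (u >= 0xDC04 && u <= 0xDCFF) {            /* three units, 8 payload bits */
        n = 3; v = u & 0xFFu; min = 0x110000u;
    } else if (u >= 0xDD00 && u <= 0xDD0F) {     /* four units, 4 payload bits */
        n = 4; v = u & 0x0Fu; min = 0x4000000u;
    } else {
        return 0;                                /* not a leading unit */
    }
    if (len < n) return 0;
    for (size_t i = 1; i < n; i++) {
        if (units[i] < 0xDE00 || units[i] > 0xDFFF) return 0;
        v = (v << 9) | (units[i] & 0x1FFu);      /* 9 payload bits per trailer */
    }
    if (v < min) return 0;                       /* not the shortest form */
    *cp = v;
    return n;
}
```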
