Draft Proposal for UTF-∞-16 Specification


Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated August 26, 2007

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

1. Introduction

The name of this UTF (UCS Transformation Format) is "UTF-∞-16", to be read in English as "UTF-Infinity-Sixteen". UTF-∞-16 extends UTF-16 to support an infinite number of characters.

UTF-∞-16 is one of the encodings defined as part of UCS-∞, which also includes similar extensions for UTF-8 and UTF-32. For general information about UCS-∞, please see the UCS-∞ Specification.

2. Properties

UTF-∞-16 preserves and extends useful properties of UTF-16, UTF-G-16, and UTF-E-16. For code points less than or equal to U+10FFFF, it is identical to UTF-16. For code points less than or equal to U+7FFFFFFF, it is identical to UTF-G-16. For code points less than or equal to U+7FFFFFFFFFFFFFFF, it is identical to UTF-E-16.

Codes for code points greater than or equal to U+8000000000000000 have eight or more units, all in the range DDF1..DFFF.

UTF-∞-16 specifies the length of each code within the first "few" code units, so that it is not necessary to scan all the way to the end of a code to determine its length. Except for extremely long codes (thirteen units or longer), the length is completely specified in the leading code unit.

Three- and four-unit codes are sufficient for over seventeen billion code points, far beyond the two billion code points originally supported by ISO 10646 (UCS-4). UTF-16 is the most efficient UTF for text whose characters are mostly in the upper half of the Basic Multilingual Plane, such as CJK characters, and UTF-∞-16 preserves this beneficial property of UTF-16.

In other respects, UTF-∞-16 shares the same properties listed in the UTF-G-16 and UTF-E-16 specifications (q.v.).

3. Details

(Note: some readers may prefer to skip down to view the examples first, then return here to read the details.)

3.1 Terms and Abbreviations

3.2 Code points less than or equal to U+10FFFF

For code points less than or equal to U+10FFFF, UTF-∞-16 is identical to UTF-16. Codes are either one unit (up to U+FFFF) or two units (above U+FFFF, using surrogate pairs).
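As an illustration of this range, the following is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. The function name encode_utf16 is hypothetical and is not part of this proposal.

```python
def encode_utf16(usv: int) -> list[int]:
    """Encode a USV up to U+10FFFF as one or two 16-bit units (plain UTF-16)."""
    if not 0 <= usv <= 0x10FFFF:
        raise ValueError("this sketch only covers code points up to U+10FFFF")
    if 0xD800 <= usv <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if usv <= 0xFFFF:
        return [usv]                      # one unit
    v = usv - 0x10000                     # 20 bits remain
    return [0xD800 | (v >> 10),           # high surrogate: top ten bits
            0xDC00 | (v & 0x3FF)]         # low surrogate: bottom ten bits
```

For instance, encode_utf16(0x10FFFF) yields the units DBFF DFFF.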

3.3 Code points in the range U+110000..U+3FFFFFFFFFFFFFFFFFFFFFF

3.3.1 For code points in the range U+110000..U+3FFFFFFFFFFFFFFFFFFFFFF (3 followed by twenty-two F's), a code consists of between three and eleven units. The leading unit is in the range DC04..DDFE and conforms to the bit pattern 1101110xxxxxxxxx (binary), where the nine x bits indicate the number of units as follows:

        Bit pattern of low nine       Number 
          bits in leading unit       of units
               0yyyyyyyy .............. 3
               10yyyyyyy .............. 4
               110yyyyyy .............. 5
               1110yyyyy .............. 6
               11110yyyy .............. 7
               111110yyy .............. 8
               1111110yy .............. 9
               11111110y .............. ten
               111111110 .............. eleven

3.3.2 The bits marked y in the table are available for storage of the USV.
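The table can be read mechanically: the number of one bits at the top of the nine-bit field, before the first zero bit, gives the number of units minus three. A minimal sketch follows; the function name units_from_leading is hypothetical.

```python
def units_from_leading(lead: int) -> tuple[int, int]:
    """Return (number of units, number of y bits) for a leading unit DC04..DDFE."""
    if not 0xDC04 <= lead <= 0xDDFE:
        raise ValueError("not the leading unit of a three- to eleven-unit code")
    low9 = lead & 0x1FF
    ones = 0
    # Count the run of one bits at the top of the nine-bit field.
    while low9 & (0x100 >> ones):
        ones += 1
    # The run of ones is followed by a zero bit; whatever is left is y bits.
    return ones + 3, 8 - ones
```

For instance, units_from_leading(0xDDC9) returns (6, 5): a six-unit code with five y bits in the leading unit.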

3.3.3 Trailing units are in the range DE00..DFFF, with bit pattern 1101111xxxxxxxxx (binary), where the nine x bits are available for storage of the USV.

3.3.4 The USV is stored in the available bits. A code always uses the minimum number of units necessary to store the USV. Leading zero bits on the first udigit are omitted if this omission results in a shorter code. (For example, the first udigit of U+1FFFFFF is 1, which in binary is 0001, with three leading zero bits which are NOT all stored in the code.) If more bits are available than are needed to store the USV, then the USV is effectively padded on the left with zero bits to fill the leftover space.
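Sections 3.3.1 to 3.3.4 can be sketched as a small encoder. An n-unit code offers (11 - n) y bits in the leading unit plus nine bits in each of the (n - 1) trailing units, that is, 8n + 2 bits in total. The function name encode_short is hypothetical, and the sketch assumes the USV is held in an arbitrary-precision integer.

```python
def encode_short(usv: int) -> list[int]:
    """Encode a USV in U+110000..U+3FFFFFFFFFFFFFFFFFFFFFF as 3 to 11 units."""
    if not 0x110000 <= usv < (1 << 90):
        raise ValueError("USV is outside the range covered by section 3.3")
    # Pick the smallest n whose 8n + 2 available bits can hold the USV.
    n = 3
    while usv.bit_length() > 8 * n + 2:
        n += 1
    prefix = 0x200 - (1 << (12 - n))       # (n - 3) one bits, then a zero bit
    y = usv >> (9 * (n - 1))               # bits left over for the leading unit
    units = [0xDC00 | prefix | y]
    for i in range(n - 2, -1, -1):         # trailing units, nine bits each
        units.append(0xDE00 | ((usv >> (9 * i)) & 0x1FF))
    return units
```

For instance, " ".join(f"{u:04X}" for u in encode_short(0x110000)) gives "DC04 DE80 DE00". Left-padding with zero bits happens implicitly, because the shifts simply produce zero for bits the USV does not have.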

3.4 Code points greater than or equal to U+40000000000000000000000

3.4.1 The leading unit is DDFF. Trailing units are in the range DE00..DFFF.

3.4.2 Starting with the second unit, NUD (the number of udigits) is indicated by a variable-length sequence of one or more units. These units are referred to as "length-storage units".

3.4.3 Instead of NUD, the length value actually stored is NMT = NUD minus twenty-three (since NUD is greater than or equal to twenty-three).

3.4.4 If NMT is less than or equal to FF hexadecimal (255 decimal), there is only a single length-storage unit DExx, which stores NMT in the low eight bits. Examples:

  DDFF DE00 ... (NMT = zero;        NUD = twenty-three)
  DDFF DE03 ... (NMT = three;       NUD = twenty-six)
  DDFF DE10 ... (NMT = sixteen;     NUD = thirty-nine)
  DDFF DEFF ... (NMT = 255 decimal; NUD = 278 decimal)

3.4.5 If NMT requires more than eight bits, then it is stored in as many DExx units as necessary, with eight bits in each unit, and with leading zero bits as needed to make a multiple of eight. The length-storage DExx units are preceded by one or more DFB4 ("before") units. The number of DFB4 units is one less than the number of length-storage DExx units, and thereby indicates which is the last length-storage DExx unit. Examples:

  DDFF DFB4 DE01 DE23 ...                     NMT = 123 hexadecimal
  DDFF DFB4 DFB4 DE65 DE43 DE21 ...           NMT = 654321 hexadecimal
  DDFF DFB4 DFB4 DFB4 DE02 DE46 DE8A DECE ... NMT = 2468ACE hexadecimal
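The length-storage rules of 3.4.4 and 3.4.5 can be sketched as a pair of helpers. The names length_storage_units and read_nmt are hypothetical; read_nmt expects the units that follow the leading DDFF.

```python
def length_storage_units(nmt: int) -> list[int]:
    """Return the DFB4 and DExx units that store NMT (NUD minus twenty-three)."""
    if nmt < 0:
        raise ValueError("NMT cannot be negative")
    n_bytes = max(1, (nmt.bit_length() + 7) // 8)   # eight bits per DExx unit
    marks = [0xDFB4] * (n_bytes - 1)                # one fewer than the DExx units
    data = [0xDE00 | b for b in nmt.to_bytes(n_bytes, "big")]
    return marks + data


def read_nmt(units: list[int]) -> tuple[int, int]:
    """Parse NMT from the units after DDFF; return (NMT, units consumed)."""
    k = 0
    while k < len(units) and units[k] == 0xDFB4:
        k += 1
    if len(units) < 2 * k + 1:
        raise ValueError("truncated length-storage sequence")
    nmt = 0
    for u in units[k:2 * k + 1]:                    # k DFB4 units, then k + 1 DExx units
        if not 0xDE00 <= u <= 0xDEFF:
            raise ValueError("expected a DExx length-storage unit")
        nmt = (nmt << 8) | (u & 0xFF)
    return nmt, 2 * k + 1
```

For instance, length_storage_units(0x123) returns the units DFB4 DE01 DE23, and read_nmt applied to those units returns (0x123, 3).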

3.4.6 The remaining units in a code (after DDFF and length-storage units) are referred to as "USV-storage units". Each USV-storage unit can hold nine bits of the USV. A code always uses the minimum number of units necessary to store the USV. Leading zero bits on the first udigit are omitted if this omission results in a shorter code. If more bits are available than are needed to store the USV, then the USV is effectively padded on the left with zero bits to fill the leftover space.
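Putting sections 3.4.1 to 3.4.6 together, an encoder for these long codes might look like the following sketch. The function name encode_long is hypothetical. It assumes that NUD is the number of hexadecimal digits of the USV without leading zeros, and that the minimum number of USV-storage units is the bit length of the USV divided by nine, rounded up. It covers encoding only.

```python
def encode_long(usv: int) -> list[int]:
    """Encode a USV >= U+40000000000000000000000 with leading unit DDFF."""
    if usv < (1 << 90):
        raise ValueError("USV is covered by shorter codes (sections 3.2, 3.3)")
    nud = (usv.bit_length() + 3) // 4               # number of udigits
    nmt = nud - 23                                  # stored length value
    n_bytes = max(1, (nmt.bit_length() + 7) // 8)
    units = [0xDDFF] + [0xDFB4] * (n_bytes - 1)     # leading unit, "before" units
    units += [0xDE00 | b for b in nmt.to_bytes(n_bytes, "big")]
    n_store = (usv.bit_length() + 8) // 9           # minimum USV-storage units
    for i in range(n_store - 1, -1, -1):            # nine bits per unit
        units.append(0xDE00 | ((usv >> (9 * i)) & 0x1FF))
    return units
```

For instance, encode_long(1 << 90) returns thirteen units: DDFF DE00 DE01 followed by ten DE00 units.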

3.5 Analysis

  Length     USV                                                          Leading      Trailing
  8 units    U+400000000000000..U+3FFFFFFFFFFFFFFFF                       DDF0..DDF7   DE00..DFFF
  9 units    U+40000000000000000..U+3FFFFFFFFFFFFFFFFFF                   DDF8..DDFB   DE00..DFFF
  10 units   U+4000000000000000000..U+3FFFFFFFFFFFFFFFFFFFF               DDFC..DDFD   DE00..DFFF
  11 units   U+400000000000000000000..U+3FFFFFFFFFFFFFFFFFFFFFF           DDFE         DE00..DFFF
  13 units   U+40000000000000000000000..U+7FFFFFFFFFFFFFFFFFFFFFFFF       DDFF         DE00..DFFF
  14 units   U+8000000000000000000000000..(etc.)                          DDFF         DE00..DFFF

3.5.1 The table above illustrates the forms of eight-to-fourteen-unit codes (as already defined in sections 3.3 and 3.4). (See the UTF-E-16 specification for shorter codes.) Note the jump from eleven- to thirteen-unit codes, corresponding to the start of codes with leading unit DDFF, which continue forever.
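The USV bounds for codes of three to eleven units follow from the bit counts in section 3.3: an n-unit code holds 8n + 2 bits, so it ends at 2^(8n+2) - 1 and begins just past the end of the (n - 1)-unit range. A minimal sketch that recomputes those bounds is given below; the function name usv_range is hypothetical.

```python
def usv_range(n: int) -> tuple[int, int]:
    """Return (first, last) USV of the n-unit codes of section 3.3 (3 <= n <= 11)."""
    if not 3 <= n <= 11:
        raise ValueError("only three- to eleven-unit codes are covered here")
    # Three-unit codes start just beyond UTF-16; longer codes start where
    # the (n - 1)-unit capacity of 8(n - 1) + 2 bits runs out.
    first = 0x110000 if n == 3 else 1 << (8 * (n - 1) + 2)
    last = (1 << (8 * n + 2)) - 1
    return first, last


for n in range(8, 12):
    first, last = usv_range(n)
    print(f"{n} units  U+{first:X}..U+{last:X}")
```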

3.5.2 Leading units can be distinguished from trailing units according to the same rules that apply to UTF-G-16 and UTF-E-16.

4. Examples

 U+0041 = 0041 (one unit; the code for the letter 'A')

 U+10FFFF = DBFF DFFF (two units; the last code point in UCS-M)

 U+110000 = DC04 DE80 DE00 (three units; the first code point beyond UCS-M)

 U+3FFFFFF = DCFF DFFF DFFF (the last three-unit code)

 U+4000000 = DD00 DF00 DE00 DE00 (the first four-unit code)

 U+7FFFFFFF = DD0F DFFF DFFF DFFF (four units; the last code point in UCS-G)

 U+80000000 = DD10 DE00 DE00 DE00 (the first code point beyond UCS-G)

 U+3FFFFFFFF = DD7F DFFF DFFF DFFF (the last four-unit code; NUD = 9;
                                    3FFFFFFFF hex = 17,179,869,183 decimal)

 U+123456789ABCD = DDC9 DE34 DEAC DFE2 DED5 DFCD (six units; NUD = thirteen)

 U+3FFFFFFFFFFFFFFFFFFFFFF
  = DDFE DFFF DFFF DFFF DFFF DFFF DFFF DFFF DFFF DFFF DFFF
  (the last eleven-unit code; NUD = twenty-three)

 U+40000000000000000000000
  = DDFF DE00 DE01 DE00 DE00 DE00 DE00 DE00 DE00 DE00 DE00 DE00 DE00
  (the first thirteen-unit code; no twelve-unit codes exist;
   the second unit DE00 indicates NMT = 0, NUD = twenty-three)

 U+FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  = DDFF DE0E DE0F DFFF DFFF ... (eleven DFFF's omitted) ... DFFF DFFF DFFF
  (the second unit DE0E indicates NMT = fourteen, NUD = thirty-seven)

 U+FF...(two hundred and seventy-five F's omitted)...FF
  = DDFF DFB4 DE01 DE00 DFFF DFFF ...
            (one hundred and twenty DFFF's omitted) ... DFFF DFFF
  (the first code that employs a DFB4 length-storage unit and
   two DExx length-storage units; DE01 DE00 indicates NMT = 100 hexadecimal
   = 256 decimal; NUD = 256 + 23 = 279 decimal)
