Draft Proposal for UTF-∞-8 Specification

Introduction | Properties | Details | Examples | UCS-X Home Page

Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated November 28, 2009 (changes since 2007 are only stylistic)

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

1. Introduction

The name of this UTF (UCS Transformation Format) is "UTF-∞-8", to be read in English as "UTF-Infinity-Eight". It extends UTF-8 to support an infinite number of characters.

UTF-∞-8 is one of the encodings defined as part of UCS-∞, which also includes similar extensions for UTF-16 and UTF-32. For general information about UCS-∞, please see the UCS-∞ Specification.

2. Properties

UTF-∞-8 preserves and extends useful properties of UTF-8. For code points less than U+80000000, UTF-∞-8 is identical to the original one-to-six-byte UTF-8 encoding. (This implies that for ASCII text, UTF-∞-8 is identical to ASCII.)

The rule for distinguishing leading (initial) and trailing (non-initial) bytes still applies. All trailing bytes have the bit pattern 10xxxxxx (binary). In other words, they are in the range 80..BF, which is the same range used by trailing bytes in one-to-six-byte UTF-8. The only new leading bytes are FE and FF, which conform to the same bit pattern 11xxxxxx (binary) as the leading bytes C2..FD. (Hexadecimal notation is used here and below except where binary or decimal are explicitly specified.)

One valuable consequence of the distinction between leading and trailing bytes is that there is no risk of a "false match" when searching.

Another great property of UTF-8, which is still true for UTF-∞-8, is that a simple binary comparison of strings (with the C function strcmp(), for example) yields the same sort-order as a numerical comparison of code points, or a binary comparison of the same strings in a fixed-width encoding (such as big-endian UTF-32).

Yet another useful property of one-to-six-byte UTF-8 is that the leading byte indicates the length of a code. Since a single byte can't indicate the unlimited variable lengths of UTF-∞-8, this property requires modification. For codes longer than seven bytes, UTF-∞-8 does the next best thing by specifying the length in the first "few" bytes of a code, so that it is not necessary to scan all the way to the end of a code to determine its length.

The regular correspondence of two USV-storage bytes for three udigits makes conversion between USV and UTF-∞-8 codes simple. Conversion between UTF-∞-8 and other formats does not require storing the entire USV as an integer in a machine register. There is no need to do any arithmetic with USV values too large for a thirty-two-bit CPU. Although slightly more efficient encodings could be invented (e.g., by allowing an odd number of USV-storage bytes in a code longer than fourteen bytes, and omitting the first USV-storage byte if it has the value 80), there would be a price to pay in other qualities (such as simplicity). When appropriate, data-compression methods can be employed.

3. Details

(Note: some readers may prefer to skip down to view the examples first, then return here to read the details.)

3.1 Terms and Abbreviations

3.2 Code points less than U+80000000

For code points less than U+80000000, UTF-∞-8 is identical to the one-to-six-byte encoding defined by the original UTF-8 specification.

NOTE: see, for example, RFC 2279.

3.3 Code points in the range U+80000000..U+FFFFFFFFF

Seven-byte codes are used for code points in this range. The leading byte is FE. The six trailing bytes are in the range 80..BF. In other words, they each have the bit pattern 10xxxxxx (binary), where the six "x" bits are available for storing the USV. Thirty-six = 6*6 = 9*4 USV-storage bits are thus available. The USV is effectively padded on the left with zeros as needed to fill the available bits.

3.4 Code points in the range U+1000000000..U+7FFFFFFFFFFFFFFFFF (seventeen F's)

Thirteen-byte codes are used for code points in this range. The leading byte is FF. The second byte is in the range 80..9F, with the bit pattern 100xxxxx (binary), where the five "x" bits are available for storing the USV. The last eleven trailing bytes are in the range 80..BF, with the bit pattern 10xxxxxx (binary), where the six "x" bits are available for storing the USV. The USV is effectively padded on the left with zeros as needed to fill the available bits. Seventy-one ( = 5 + 11*6 = 3 + 17*4) USV-storage bits are thus available.

3.5 Code points greater than U+7FFFFFFFFFFFFFFFFF (seventeen F's)

3.5.1 The leading byte is FF.

3.5.2 Immediately following the leading byte, NUD (the number of udigits) is indicated by zero or more bytes whose value is B4 ("before"), followed by one or more bytes in the range A0..AF ("Ax"). These B4 and Ax bytes are referred to as "length-storage bytes", and their usage is explained in the following three paragraphs.

3.5.3 Instead of NUD, the length value actually stored is NME = NUD minus eighteen (since NUD is greater than or equal to eighteen).

3.5.4 If NME is less than sixteen, it is stored in a single Ax which immediately follows FF, and there is no B4 before the Ax. Examples:

  FF A0 ... (NME = zero;    NUD = eighteen)
  FF A3 ... (NME = three;   NUD = twenty-one)
  FF AF ... (NME = fifteen; NUD = thirty-three)

3.5.5 If NME is sixteen or more, it is stored in two or more Ax bytes (as many as necessary). Before the Ax bytes are one or more B4 bytes, to indicate how many length-storage Ax bytes follow, and also to ensure that shorter codes are sorted before longer codes, thus enabling the useful binary comparison property described above. The number of B4 bytes is one less than the number of Ax bytes. Examples:

  FF B4 A1 A0 ...       (NME =  10 hexadecimal = sixteen)
  FF B4 AF AF ...       (NME =  FF hexadecimal = 255 decimal)
  FF B4 B4 A1 A0 A0 ... (NME = 100 hexadecimal = 256 decimal)

(Notice that the last example would sort before the second-to-last example if B4 were missing, since A1 is less than AF.)

3.5.6 The remaining trailing bytes in a code (after FF and length-storage bytes) are referred to as "USV-storage bytes". Each USV-storage byte is in the range 80..BF, and stores six bits of the USV in the low six bits. In other words, three udigits (twelve bits) are stored in two USV-storage bytes.

3.5.7 If NUD is not a multiple of three, then the udigits are effectively padded on the left with one or two leading zero nybbles (four or eight zero bits), to reach a multiple of three nybbles (twelve bits), before being packed into the low six bits of each USV-storage byte. Consequently, the number of USV-storage bytes is ((NUD + 2) / 3) * 2, which is an even number. (Here the symbol / indicates integer division, e.g., 7 / 3 = 2.)

4. Examples

 U+0041 = 41 (the one-byte ASCII code for the letter 'A')

 U+10FFFF = F4 8F BF BF (four bytes; the last code that is compatible with UTF-16)

 U+110000 = F4 90 80 80 (four bytes; the first code that is incompatible with UTF-16)

 U+7FFFFFFF = FD BF BF BF BF BF (six bytes; the last original UTF-8 code)

 U+80000000 = FE 82 80 80 80 80 80
            (the first seven-byte code)

 U+FFFFFFFFF = FE BF BF BF BF BF BF (the last seven-byte code)

 U+1000000000 = FF 80 80 80 80 80 81 80 80 80 80 80 80 (the first thirteen-byte code)

 U+7FFFFFFFFFFFFFFF = FF 80 87 BF BF BF BF BF BF BF BF BF BF
                      (thirteen bytes; the last code point in UCS-E)

 U+8000000000000000 = FF 80 88 80 80 80 80 80 80 80 80 80 80
                      (thirteen bytes; the first code point beyond UCS-E)

 U+7FFFFFFFFFFFFFFFFF = FF 9F BF BF BF BF BF BF BF BF BF BF BF
                      (the last thirteen-byte code)

 U+800000000000000000 = FF A0 A0 80 80 80 80 80 80 80 80 80 80 80
          (fourteen bytes; the first A0 means NME = zero, NUD = eighteen;
                           the second A0 is a USV-storage byte)

 U+FFFFFFFFFFFFFFFFFF = FF A0 BF BF BF BF BF BF BF BF BF BF BF BF
          (the last fourteen-byte code)

 U+1000000000000000000 = FF A1 80 81 80 80 80 80 80 80 80 80 80 80 80 80
          (the first sixteen-byte code; there are no fifteen-byte codes;
           NME = 1, NUD = nineteen)

 U+FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  = FF AF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF BF
  (the last twenty-four-byte code; NME = fifteen, NUD = thirty-three;
   this is the last code with only a single length-storage Ax byte and
   without any length-storage B4 bytes)

 U+1000000000000000000000000000000000
  = FF B4 A1 A0 80 81 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80
    80 80 80 80 80 80
  (the first twenty-eight-byte code; NME = sixteen, NUD = thirty-four;
   this is the first code with a B4 and two Ax bytes for length-storage)

 U+AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  = FF B4 A2 A1 AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
    AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
  (thirty-four bytes; NME = 21 hexadecimal = thirty-three; NUD = fifty-one)

 U+A...(one hundred A's omitted)...A
  = FF B4 A5 A4 AA ... (sixty-six AA's omitted) ... AA
  (seventy-two bytes; NME = 54 hexadecimal = eighty-four;
   NUD = one hundred and two)

To the UCS-∞ Specification

UCS-X Home Page

Valid XHTML 1.0!