Draft Proposal for UTF-E-8 Specification


Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated October, 2009

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

1. Introduction

The name of this UTF (UCS Transformation Format) is "UTF-E-8". It extends UTF-8 to support code points up to U+7FFFFFFFFFFFFFFF.

UTF-E-8 is one of the encodings defined as part of UCS-E, which also includes similar extensions for UTF-16 and UTF-32. For general information about UCS-E, please see the UCS-E Specification.

UTF-E-8 was already implemented, under the name "utf8", in Perl version 5 (created by Larry Wall) for 64-bit computers. The authors of this specification do not claim any credit for it, except for giving it the new name "UTF-E-8".

2. Properties

UTF-E-8 preserves and extends useful properties of UTF-8. For code points less than U+80000000, UTF-E-8 is identical to the original one-to-six-byte UTF-8 encoding. (This implies that for ASCII text, UTF-E-8 is identical to ASCII.)

The rule for distinguishing leading (initial) and trailing (non-initial) bytes still applies. All trailing bytes have the bit pattern 10xxxxxx (binary). In other words, they are in the range 80..BF, which is the same range used by trailing bytes in one-to-six-byte UTF-8. The only new leading bytes are FE and FF, which conform to the same bit pattern 11xxxxxx (binary) as the leading bytes C2..FD. (Hexadecimal notation is used here and below except where binary or decimal are explicitly specified.)
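The bit-pattern rule above can be sketched in a few lines of Python. This is only an illustration, not part of the specification; the function names are assumptions introduced here. Note that the pattern test 11xxxxxx also accepts C0 and C1, which lie outside the range C2..FD mentioned above.

```python
def is_trailing_byte(b: int) -> bool:
    """True for bytes with the bit pattern 10xxxxxx, i.e. the range 80..BF."""
    return (b & 0xC0) == 0x80


def is_leading_byte(b: int) -> bool:
    """True for bytes with the bit pattern 11xxxxxx, i.e. the range C0..FF.

    This tests the bit pattern only, so it also accepts C0 and C1.
    Bytes 00..7F are one-byte codes and match neither pattern.
    """
    return (b & 0xC0) == 0xC0
```

For example, is_leading_byte(0xFE) and is_leading_byte(0xFF) are both true, while is_trailing_byte(0xBF) is true and is_trailing_byte(0xC2) is false.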

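One valuable consequence of the distinction between leading and trailing bytes is that there is no risk of a "false match" when searching: because a code can begin only with a non-trailing byte, a match for an encoded string can never start in the middle of another code.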

Another great property of UTF-8, which is still true for UTF-E-8, is that a simple binary comparison of strings (with the C function strcmp(), for example) yields the same sort-order as a numerical comparison of code points, or a binary comparison of the same strings in a fixed-width encoding (such as big-endian UTF-32).
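As an informal check of this sort-order property, the following Python sketch takes the byte sequences listed in the examples of section 4 and confirms that sorting them as raw bytes gives the same order as sorting by code point. The variable names are assumptions introduced for illustration. Python compares bytes objects byte by byte as unsigned values, which plays the role of the binary comparison described above.

```python
# (code point, UTF-E-8 bytes) pairs copied from the examples in section 4.
EXAMPLES = [
    (0x7FFFFFFFFFFFFFFF, bytes.fromhex("FF 80 87 BF BF BF BF BF BF BF BF BF BF")),
    (0x0041, bytes.fromhex("41")),
    (0x1000000000, bytes.fromhex("FF 80 80 80 80 80 81 80 80 80 80 80 80")),
    (0x10FFFF, bytes.fromhex("F4 8F BF BF")),
    (0xFFFFFFFFF, bytes.fromhex("FE BF BF BF BF BF BF")),
    (0x110000, bytes.fromhex("F4 90 80 80")),
    (0x80000000, bytes.fromhex("FE 82 80 80 80 80 80")),
    (0x7FFFFFFF, bytes.fromhex("FD BF BF BF BF BF")),
]

# Order the encodings by numeric code point ...
by_code_point = [enc for cp, enc in sorted(EXAMPLES, key=lambda pair: pair[0])]
# ... and by plain binary comparison of the encoded bytes.
by_bytes = sorted(enc for cp, enc in EXAMPLES)

same_order = by_code_point == by_bytes
```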

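Yet another useful property of one-to-six-byte UTF-8 is that the leading byte indicates the length of a code. UTF-E-8 preserves this property: a leading byte of FE indicates a seven-byte code, and a leading byte of FF indicates a thirteen-byte code.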
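A minimal Python sketch of this length rule follows. The function name is an assumption introduced here; the lengths for leading bytes up to FD follow the original one-to-six-byte UTF-8 scheme, and the lengths for FE and FF follow sections 3.3 and 3.4 of this draft.

```python
def sequence_length(lead: int) -> int:
    """Return the total length in bytes of the code that starts with `lead`."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")
    if lead < 0x80:
        return 1       # 0xxxxxxx: one-byte (ASCII) code
    if lead < 0xC0:
        raise ValueError("a trailing byte (80..BF) cannot start a code")
    if lead < 0xE0:
        return 2       # 110xxxxx
    if lead < 0xF0:
        return 3       # 1110xxxx
    if lead < 0xF8:
        return 4       # 11110xxx
    if lead < 0xFC:
        return 5       # 111110xx
    if lead < 0xFE:
        return 6       # 1111110x
    if lead == 0xFE:
        return 7       # seven-byte code, section 3.3
    return 13          # FF: thirteen-byte code, section 3.4
```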

3. Details

(Note: some readers may prefer to skip down to view the examples first, then return here to read the details.)

3.1 Terms and Abbreviations

3.2 Code points less than U+80000000

For code points less than U+80000000, UTF-E-8 is identical to the one-to-six-byte encoding defined by the original UTF-8 specification.

NOTE: see, for example, RFC 2279.
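For readers without the original specification at hand, the one-to-six-byte scheme can be sketched in Python as follows. This is an illustrative sketch only; the function name is an assumption, and no special treatment is given to any particular code points (surrogates, for instance).

```python
def encode_one_to_six_bytes(cp: int) -> bytes:
    """Encode a code point below U+80000000 as one-to-six-byte UTF-8."""
    if not 0 <= cp < 0x80000000:
        raise ValueError("code point out of range for one-to-six-byte UTF-8")
    if cp < 0x80:
        return bytes([cp])  # one-byte (ASCII) code
    # (exclusive upper limit, leading-byte marker bits, number of trailing bytes)
    table = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, marker, n_trail in table:
        if cp < limit:
            break
    # The leading byte carries the marker plus the highest bits of the value.
    lead = marker | (cp >> (6 * n_trail))
    # Each trailing byte is 10xxxxxx with the next six bits of the value.
    trail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n_trail - 1, -1, -1)]
    return bytes([lead] + trail)
```

For example, encode_one_to_six_bytes(0x10FFFF) yields F4 8F BF BF, and encode_one_to_six_bytes(0x7FFFFFFF) yields FD BF BF BF BF BF, in agreement with the examples in section 4.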

3.3 Code points in the range U+80000000..U+FFFFFFFFF

Seven-byte codes are used for code points in this range. The leading byte is FE. The six trailing bytes are in the range 80..BF. In other words, they each have the bit pattern 10xxxxxx (binary), where the six "x" bits are available for storing the USV. Thirty-six ( = 6*6 = 9*4) USV-storage bits are thus available. The USV is effectively padded on the left with zeros as needed to fill the available bits.
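The seven-byte layout can be sketched in Python as follows. This is an illustration under the rules stated in this section, not a reference implementation; the function name is an assumption.

```python
def encode_seven_byte(cp: int) -> bytes:
    """Encode a code point in U+80000000..U+FFFFFFFFF as a seven-byte code."""
    if not 0x80000000 <= cp <= 0xFFFFFFFFF:
        raise ValueError("code point out of range for a seven-byte code")
    # Split the 36-bit value into six 6-bit groups, most significant first,
    # and store each group in a trailing byte of the form 10xxxxxx.
    trail = [0x80 | ((cp >> shift) & 0x3F) for shift in range(30, -1, -6)]
    return bytes([0xFE] + trail)
```

For example, encode_seven_byte(0x80000000) yields FE 82 80 80 80 80 80, the first seven-byte code listed in section 4.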

3.4 Code points in the range U+1000000000..U+7FFFFFFFFFFFFFFF

Thirteen-byte codes are used for code points in this range. The leading byte is FF. The twelve trailing bytes are in the range 80..BF. In other words, they each have the bit pattern 10xxxxxx (binary), where the six "x" bits are available for storing the USV. The USV is effectively padded on the left with zeros as needed to fill the available bits. Seventy-two ( = 12*6 = 18*4) USV-storage bits would thus be available; however since only sixty-three bits are needed for U+7FFFFFFFFFFFFFFF, the leftmost nine ( = 72 - 63) bits are always zero for UTF-E-8, and are reserved for future extension (as in UTF-∞-8). Hence, for UTF-E-8 the second byte is always 80 (with six reserved bits) and the third byte is in the range 80..87 (with three reserved bits).
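The thirteen-byte layout, including the reserved bits, can be sketched in Python as follows. This is an illustration under the rules stated in this section; the function name is an assumption.

```python
def encode_thirteen_byte(cp: int) -> bytes:
    """Encode a code point in U+1000000000..U+7FFFFFFFFFFFFFFF as thirteen bytes."""
    if not 0x1000000000 <= cp <= 0x7FFFFFFFFFFFFFFF:
        raise ValueError("code point out of range for a thirteen-byte code")
    # Split the value, padded on the left with zeros to 72 bits, into twelve
    # 6-bit groups, most significant first. Because the value needs at most
    # 63 bits, the first group is always zero (byte 80) and the second group
    # is at most 7 (bytes 80..87).
    trail = [0x80 | ((cp >> shift) & 0x3F) for shift in range(66, -1, -6)]
    return bytes([0xFF] + trail)
```

For example, encode_thirteen_byte(0x7FFFFFFFFFFFFFFF) yields FF 80 87 followed by ten BF bytes, the last code listed in section 4.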

4. Examples

 U+0041 = 41 (the one-byte ASCII code for the letter 'A')

 U+10FFFF = F4 8F BF BF (four bytes; the current maximum for Unicode)

 U+110000 = F4 90 80 80 (four bytes; the first code point beyond the U+10FFFF limit)

 U+7FFFFFFF = FD BF BF BF BF BF (six bytes; the last original UTF-8 code)

 U+80000000 = FE 82 80 80 80 80 80 (seven bytes; the first seven-byte code)

 U+FFFFFFFFF = FE BF BF BF BF BF BF (the last seven-byte code)

 U+1000000000 = FF 80 80 80 80 80 81 80 80 80 80 80 80 (the first thirteen-byte code)

 U+7FFFFFFFFFFFFFFF = FF 80 87 BF BF BF BF BF BF BF BF BF BF
                        (thirteen bytes; the last code point in UCS-E)
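The examples above can be checked mechanically. The following Python sketch decodes a single code and compares the result with each listed code point. It is a minimal illustration, not a validating decoder: the names are assumptions introduced here, and it does not reject over-long forms (a small value stored in a longer code than necessary).

```python
def decode_utf_e_8(data: bytes) -> int:
    """Decode the single UTF-E-8 code at the start of `data`; return its code point."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return lead                      # one-byte (ASCII) code
    if lead < 0xC0:
        raise ValueError("input starts with a trailing byte")
    if lead == 0xFF:
        n_trail, cp = 12, 0              # thirteen-byte code (section 3.4)
    elif lead == 0xFE:
        n_trail, cp = 6, 0               # seven-byte code (section 3.3)
    else:
        # Two-to-six-byte code: count the 1 bits after the first two to find
        # the number of trailing bytes; the remaining low bits are payload.
        n_trail, mask = 1, 0x20
        while lead & mask:
            n_trail += 1
            mask >>= 1
        cp = lead & (mask - 1)
    trailing = data[1:1 + n_trail]
    if len(trailing) != n_trail:
        raise ValueError("truncated code")
    for b in trailing:
        if (b & 0xC0) != 0x80:
            raise ValueError("expected a trailing byte in the range 80..BF")
        cp = (cp << 6) | (b & 0x3F)
    if cp > 0x7FFFFFFFFFFFFFFF:
        raise ValueError("reserved bits must be zero in UTF-E-8")
    return cp


# (code point, bytes) pairs copied from the examples in this section.
EXAMPLES = [
    (0x0041, "41"),
    (0x10FFFF, "F4 8F BF BF"),
    (0x110000, "F4 90 80 80"),
    (0x7FFFFFFF, "FD BF BF BF BF BF"),
    (0x80000000, "FE 82 80 80 80 80 80"),
    (0xFFFFFFFFF, "FE BF BF BF BF BF BF"),
    (0x1000000000, "FF 80 80 80 80 80 81 80 80 80 80 80 80"),
    (0x7FFFFFFFFFFFFFFF, "FF 80 87 BF BF BF BF BF BF BF BF BF BF"),
]

all_match = all(decode_utf_e_8(bytes.fromhex(hx)) == cp for cp, hx in EXAMPLES)
```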
