The UCS-X Family of UCS Extensions (Draft Proposal)

Set     Encoding forms                    Maximum code point     Size (code points)
UCS-M   UTF-8     UTF-16     UTF-32       U+10FFFF               1,114,112 (mega)
UCS-G   UTF-G-8   UTF-G-16   UTF-G-32     U+7FFFFFFF             2^31 ≈ 2×10^9 (giga)
UCS-E   UTF-E-8   UTF-E-16   UTF-E-32     U+7FFFFFFFFFFFFFFF     2^63 ≈ 9×10^18 (exa)
UCS-∞   UTF-∞-8   UTF-∞-16   UTF-∞-32     (variable)             ∞ (infinity)

Important: these are drafts of work in progress, not finalized specifications. They all may change.

Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated May 15, 2012.

Abstract

Compatible extensions are specified for the Universal Character Set (UCS) and the encoding forms UTF-8, UTF-16, and UTF-32, enabling usage of an unlimited number of character code points. These extensions overcome the current limitation of UCS to approximately one million code points.

1. Introduction

Unicode and ISO 10646 definitions make UCS the foundation for all modern text processing. The current document assumes familiarity with those standards.

UCS has been defined with different sizes:

Here “size” means the number of code points, including unassigned (reserved) as well as assigned (designated) code points. (249,031 code points are assigned in Unicode 6.0, including: 137,468 private-use; 109,242 graphic; 2,048 surrogates; 142 format; 66 noncharacters; and 65 control.)

We define “UCS-M” to mean the 1,114,112 code points (1,112,064 UCS scalar values, maximum U+10FFFF) specified by the Unicode Standard and ISO 10646, providing a permanent explicit name for the set of this particular size.
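The figures quoted above can be checked with a little arithmetic. The following is a minimal sketch; the variable names are ours and purely illustrative, and the per-category counts are simply the Unicode 6.0 numbers quoted in the preceding paragraph.

```python
# Size of UCS-M: 17 planes of 65,536 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000
SURROGATES = 0xDFFF - 0xD800 + 1  # 2,048 surrogate code points

ucs_m_size = PLANES * CODE_POINTS_PER_PLANE      # all code points
ucs_m_scalar_values = ucs_m_size - SURROGATES    # code points minus surrogates

# Assigned code points in Unicode 6.0, as quoted in the text above.
unicode_6_0_assigned = {
    "private-use": 137468,
    "graphic": 109242,
    "surrogates": 2048,
    "format": 142,
    "noncharacters": 66,
    "control": 65,
}
assigned_total = sum(unicode_6_0_assigned.values())

print(ucs_m_size, ucs_m_scalar_values, assigned_total)
```

The first two values match the 1,114,112 code points and 1,112,064 scalar values given for UCS-M, and the category counts add up to the quoted total of 249,031.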

We specify three supersets of UCS-M, named “UCS-G”, “UCS-E”, and “UCS-∞”. Each superset has a specification for extending each of the three encoding forms UTF-8, UTF-16, and UTF-32 to support more code points. Hence, there are nine (3x3) detailed encoding specifications, “UTF-G-8” through “UTF-∞-32”. (See the table at the top of this page.)

The larger sets are supersets of the smaller ones. (In other words, the smaller sets are subsets of the larger ones.) The encoding forms with larger limits are compatible extensions of those with smaller limits.
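The nesting of the sets can be illustrated by finding the smallest set that contains a given code point. This is only a sketch: the function name smallest_ucs_set is hypothetical, and the maxima are taken from the table at the top of this page.

```python
# Maximum code point of each finite set, smallest set first
# (values from the table at the top of this page).
UCS_SET_MAXIMA = [
    ("UCS-M", 0x10FFFF),
    ("UCS-G", 0x7FFFFFFF),
    ("UCS-E", 0x7FFFFFFFFFFFFFFF),
]


def smallest_ucs_set(code_point):
    """Return the name of the smallest set containing code_point."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    for name, maximum in UCS_SET_MAXIMA:
        if code_point <= maximum:
            return name
    # Anything larger fits only in the unbounded set.
    return "UCS-∞"
```

For example, smallest_ucs_set(0x123456789) returns "UCS-E", because that code point exceeds the UCS-G maximum of U+7FFFFFFF but is within the UCS-E maximum.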

Widespread usage is likely to occur for UCS-G (smallest of the supersets) sooner than for UCS-E or UCS-∞. Therefore, the UCS-G specification is written so that it stands on its own, and can be read first. The specifications for UCS-E and UCS-∞ occasionally refer back to UCS-G, for brevity and clarity, rather than repeating the same information. (Readers who prefer to start with ∞ are welcome to do so, of course. After ∞ has been fully grasped, the encoding details of E and G are mostly redundant, but some discussion, analysis, etc., that appear in the G or E specifications should still be read for completeness.)

Most details of UCS-X are provided in the separate specifications for UCS-G, UCS-E, and UCS-∞, and their UTF encoding forms, which are to be regarded as portions of this UCS-X specification.

2. Meaning of “compatible extension”

UCS-X provides compatible extensions of the encoding forms UTF-8, UTF-16, and UTF-32, in much the same way that UTF-8 is a compatible extension of ASCII. The following attributes have made UTF-8 especially popular for compatibility with applications, protocols, formats, etc., that were originally ASCII-based (e.g., Unix, email, HTML):

(Note on terminology: in a multibyte UTF-8 code, the first byte is called “leading” and all the other bytes are called “trailing.” Similarly, for UTF-16 and extensions of UTF-32, in a code consisting of more than one code unit, the first unit is called “leading” and all the other units are called “trailing.”)

Fortunately for the success of UTF-8, many applications that originally supported ASCII were written so that bytes with the values 00..7F (hexadecimal) were recognized as ASCII, while bytes 80..FF were treated as generic characters and generally stored, transmitted, etc., intact. Applications passing this test were called “8-bit safe.”

There is another sense in which UTF-8 is NOT compatible with ASCII. A “strict” application of ASCII might disallow any character or byte beyond 7F. When encountering an “illegal” character, such an application might crash, report an error and quit, delete the character, or otherwise bend/fold/mutilate it. Nevertheless, many such applications could be made 8-bit safe, often by means of trivial modifications.

Each of the extended encoding forms is intended to provide analogous compatibility for the corresponding UCS-M encoding form (UTF-8/16/32):

These attributes should make it relatively easy to use or adapt existing applications to handle UCS-X. This is the most basic way in which UCS-X is a “compatible extension.” In addition, compatibility requires recognition that Unicode and ISO 10646 are complex, detailed standards, and UCS-X is designed to follow and harmonize with them as much as possible.

3. Rationale

This specification is based on the assumption that availability of more than a million UCS code points may be useful. It does not assume any particular future application. It proposes the assignment of private-use code points that could be used for any application; see the next section.

NOTE: The authors are motivated by our belief that a permanent and truly universal character set and information encoding standard should enable use of an unlimited number of character code points. As an optional supplement to the specification itself, there is documentation explaining the rationale for UCS-X, including ideas for a variety of hypothetical applications.

4. Assignment of code points

4.1 All code points greater than U+10FFFF whose first (most significant) hexadecimal digit is 1, E, or F (U+1..., U+E..., and U+F...) are assigned for extended private use (or “user-defined”). None of these code points are “noncharacters.”

4.2 All other code points greater than U+10FFFF (i.e., those whose first digit is 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, or D) are reserved for future assignment by standards organizations.
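Rules 4.1 and 4.2 depend only on the most significant hexadecimal digit, so they can be sketched in a few lines. The function name classify_extended and the returned labels are illustrative assumptions, not part of the specification.

```python
UCS_M_MAX = 0x10FFFF
PRIVATE_USE_LEADING_DIGITS = {"1", "E", "F"}


def classify_extended(code_point):
    """Classify a code point according to rules 4.1 and 4.2."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point <= UCS_M_MAX:
        # Within UCS-M, assignment is governed by Unicode and ISO 10646.
        return "UCS-M"
    # format(..., "X") has no leading zeros, so index 0 is the
    # most significant hexadecimal digit.
    leading_digit = format(code_point, "X")[0]
    if leading_digit in PRIVATE_USE_LEADING_DIGITS:
        return "extended private use"
    return "reserved"
```

Under this sketch, U+110000 and U+E00000 are extended private use, while U+200000 and U+7FFFFFFF are reserved.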

NOTES:

5. Byte-order (big-/little-endian encoding schemes and BOM)

The Unicode Standard makes a distinction between encoding “forms” (which don't specify byte-order) and encoding “schemes” (which do). UCS-X specifies “forms.” Byte-order is irrelevant for UTF-X-8. For UTF-X-16 and UTF-X-32, byte-order can be specified with names such as UTF-G-16LE, etc., or possibly with BOM (byte-order mark) U+FEFF (but see the second note below). The corresponding schemes exactly follow the analogy of UTF-16LE, etc., as defined by the Unicode Standard.
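The difference between a form and a scheme can be sketched as follows: a form yields a sequence of code units, and a scheme fixes how each unit is laid out in bytes. The function serialize_units is a hypothetical helper. The example deliberately uses an ordinary UTF-16 surrogate pair (U+1F600), since the code units for code points beyond U+10FFFF are defined in the separate UTF-X-16 specifications and are not computed here.

```python
import struct

BOM = 0xFEFF  # byte-order mark


def serialize_units(units, unit_bits, little_endian, with_bom=False):
    """Lay out 16- or 32-bit code units as bytes in the given byte order."""
    unit_format = {16: "H", 32: "I"}[unit_bits]
    order = "<" if little_endian else ">"
    units = list(units)
    if with_bom:
        units = [BOM] + units
    return struct.pack(f"{order}{len(units)}{unit_format}", *units)


# U+1F600 as UTF-16 code units: D83D DE00.
pair = [0xD83D, 0xDE00]
print(serialize_units(pair, 16, little_endian=False).hex())  # d83dde00
print(serialize_units(pair, 16, little_endian=True).hex())   # 3dd800de
```

With with_bom=True and little-endian order, the output begins with the bytes FF FE, which is how a reader can detect the byte order when no scheme name is given.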

NOTES:

6. Conformance

6.1 To help ensure compatibility, security, and stability, conformance to UCS-X requires conformance to the Unicode Standard and ISO 10646, with the exception that code points beyond U+10FFFF (and within a specified upper limit) are legal for interchange between UCS-X-compatible processes.

6.2 A UCS-X-compatible process must never feed noninterpretable out-of-range data to another process. A UCS-X-compatible process must be explicitly identified as such (in its documentation or through suitable protocols). In the absence of explicit identification, no process should be assumed to be UCS-X-compatible or to support any code points greater than U+10FFFF.

NOTE: Simply saving a file might not literally be “feeding it to another process”; but, for example, a file containing an XML declaration with encoding="UTF-8" and also containing the UTF-E-8 code for U+123456789 would violate the spirit of this rule. Better would be encoding="X-UTF-E-8". (The prefix "X-" should be used, in accordance with www.w3.org/TR/xml/, until the UCS-X encoding names have been added to the IANA Registry.) See also the second note about BOM in the previous section (5. Byte-Order).

6.3 A UCS-X-compatible process must indicate its upper limit as a maximum number of USV hex digits (maxNUD). The only allowable values for maxNUD are (in decimal):

6.4 The Unicode Standard specifies how a Unicode-compatible process must correctly handle every USV up to U+10FFFF. In an exactly analogous way, a UCS-X-compatible process must correctly handle every USV up to its maximum.

6.5 In each encoding form (UTF-X-8, UTF-X-16, and UTF-X-32), only the unique shortest well-formed code that could represent a particular USV is allowed; a longer code with excessive zero padding which might otherwise represent the same USV is ill-formed and illegal. (This is to prevent problems analogous to those with some early implementations of UTF-8.)
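The shortest-form rule can be illustrated for the 8-bit form. The sketch below assumes that UTF-G-8 uses the byte layout of the original six-byte UTF-8 (RFC 2279), as stated in section 9; the authoritative definition is in the separate UTF-G-8 specification. The names encode_shortest and decode_strict are hypothetical.

```python
# (largest value, total bytes, leading-byte marker, leading-byte mask)
# for each code length, following the RFC 2279 layout.
_FORMS = [
    (0x7F, 1, 0x00, 0x80),
    (0x7FF, 2, 0xC0, 0xE0),
    (0xFFFF, 3, 0xE0, 0xF0),
    (0x1FFFFF, 4, 0xF0, 0xF8),
    (0x3FFFFFF, 5, 0xF8, 0xFC),
    (0x7FFFFFFF, 6, 0xFC, 0xFE),
]


def encode_shortest(usv):
    """Return the unique shortest byte code for a scalar value."""
    if not 0 <= usv <= 0x7FFFFFFF:
        raise ValueError("out of range for this sketch")
    if 0xD800 <= usv <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    # Pick the first (shortest) length whose range contains the value.
    length, lead = next((n, m) for limit, n, m, _ in _FORMS if usv <= limit)
    if length == 1:
        return bytes([usv])
    trailing = []
    for _ in range(length - 1):
        trailing.append(0x80 | (usv & 0x3F))  # six payload bits per trailing byte
        usv >>= 6
    return bytes([lead | usv] + trailing[::-1])


def decode_strict(code):
    """Decode one complete code, rejecting zero-padded (overlong) forms."""
    code = bytes(code)
    if not code:
        raise ValueError("empty input")
    first = code[0]
    for _, length, lead, mask in _FORMS:
        if (first & mask) == lead:
            break
    else:
        raise ValueError("not a leading byte")
    if len(code) != length:
        raise ValueError("wrong number of bytes")
    value = first & ~mask & 0xFF
    for byte in code[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("bad trailing byte")
        value = (value << 6) | (byte & 0x3F)
    # Re-encoding must reproduce the input; otherwise the input was padded.
    if encode_shortest(value) != code:
        raise ValueError("not the shortest form")
    return value
```

Here encode_shortest(0x110000) gives F4 90 80 80, while decode_strict rejects C0 80, which would otherwise be a padded two-byte code for U+0000.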

6.6 Software and hardware modules implementing UCS-X should specify maxNUD, and also specify their behaviors when codes beyond that limit are encountered. (For example, replacement of out-of-range codes with U+FFFD.) Protocols should be developed to accomplish this communication at run-time. For example, a compatible library might include a function for querying the upper limit for the library. Functions that accept UCS-∞ as input or produce it as output should have arguments for their callers to specify their own upper limits. Even a function that is in principle capable of handling any size code should be configurable to enforce a limit, for compatibility with other modules and to avoid excessive memory usage or processing time.

NOTE: According to the original definition of UTF-8, it was correct for a program to interpret the 4-byte UTF-8 code F4 90 80 80 as U+110000, and convert it to UCS-2 as FFFD (U+FFFD REPLACEMENT CHARACTER). Under the U+10FFFF limit, however, F4 90 80 80 may be treated as an illegal byte sequence, and converted to UTF-16 as FFFD FFFD FFFD FFFD. See Unicode Technical Report #36, especially section 3.6 Secure Encoding Conversion, which implies that stopping conversion and reporting an error (“strategy 1”), converting to FFFD (“strategy 2”), or converting to FFFD FFFD FFFD FFFD (“strategy 3”) may all be considered safe, while “Strategy 2 is the most natural.” UCS-X applications should observe UTR #36, and apply the same principle, for example, when encountering the 7-byte UTF-E-8 code for U+80000000 given maxNUD = 8. For security reasons (see also the following section), it is essential to follow the rule in UTR #36 that “an illegal byte sequence must not include bytes that encode valid characters or are leading bytes for valid characters.” The same rules apply when “byte” is replaced with “code unit.”
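The handling of out-of-range codes can be sketched at the level of already-decoded code points. The function enforce_upper_limit and its strict flag are hypothetical. The limit is passed directly as a maximum code point (from the table at the top of this page) rather than as maxNUD, so the sketch makes no assumption about how maxNUD values map to limits.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def enforce_upper_limit(code_points, max_code_point, strict=False):
    """Apply a process's upper limit to a sequence of code points.

    strict=True  stops with an error at the first out-of-range value
                 ("strategy 1" in UTR #36).
    strict=False replaces each out-of-range value with a single U+FFFD
                 ("strategy 2").
    """
    result = []
    for code_point in code_points:
        if 0 <= code_point <= max_code_point:
            result.append(code_point)
        elif strict:
            raise ValueError(f"code point {code_point:#X} exceeds the upper limit")
        else:
            result.append(REPLACEMENT_CHARACTER)
    return result


# U+80000000 is beyond the UCS-G maximum, so it becomes U+FFFD.
print([hex(cp) for cp in enforce_upper_limit([0x41, 0x80000000, 0x42], 0x7FFFFFFF)])
```

A real decoder would apply the same check while reading code units, taking care that a rejected sequence never swallows units that begin a valid character.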

7. Security

UCS-X is still experimental and should not be hastily employed in situations where unpredictability is hazardous. Errors are liable to be made in the design and implementation of any new or extended technology.

Precise specification of UCS-X will help to coordinate experimentation and testing, to answer questions about potential effects of UCS-X on efficiency, reliability, compatibility, and other security-related issues.

“[A]ctual implementations must deal with error conditions, including out-of-range errors, and having two specifications [namely: the original UTF-8, which we call UTF-G-8; and UTF-16] which treat that edge condition somewhat differently can be real trouble in distributed software.” (N3248) The risk of such trouble would be greatly multiplied in the case of multiple specifications for sets of different sizes, lacking a common protocol for consistent handling of out-of-range characters. The previous section (6. Conformance) provides a foundation for such protocols, including the establishment of UCS-M as the default set. Some details remain to be specified (especially section 6.6). The issue of interoperability is crucial from the standpoint of security.

There is also potential risk in moving too slowly with testing and standardizing UCS-X, given that 64-bit computers are already common. Font-related technologies are very powerful tools, ripe for application to new domains. International standards need to anticipate the possible deployment of technologies supporting more than 137,468 private-use code points. Without adequate standards, a wide variety of incompatible UCS extensions could arise, resulting in garbled communications, breakdowns, and worse. Problems of UCS extension should be solved, at least in theory, and with some experimental evidence, before they become urgent.

8. Extending non-UCS encodings

Some nations and other groups use encodings that are not UCS, yet can be mapped to and from UCS. For handling code points beyond U+10FFFF, users of such encodings may choose either to adopt UCS-X directly, or to create equivalent extensions of their own encodings. We give one example of how that might be done. The Chinese national standard character encoding GB18030 already has codes that map to and from all UCS code points up to U+10FFFF, whose GB code is E3 32 9A 35. An effective compatible extension of GB could easily be made as follows: to obtain the GB-X code for any USV greater than U+10FFFF, start with the UTF-∞-8 code and insert the byte 30 between the first and second bytes. For example, U+110000 would be F4 30 90 80 80; and U+7FFFFFFF would be FD 30 BF BF BF BF BF. Such codes would be different from all currently defined GB codes and easily recognized as such from the first two bytes. Of course, extension of local standards is a matter for local authorities. This possibility simply demonstrates that flexible solutions for continued compatibility between standards may be fairly easy to find.
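The byte insertion described above is simple enough to sketch directly. The functions to_gb_x and from_gb_x are hypothetical names; they take and return raw bytes, and assume the input is the complete UTF-∞-8 code for a single USV greater than U+10FFFF (which is at least four bytes long).

```python
def to_gb_x(utf_code):
    """Insert the byte 30 between the first and second bytes."""
    code = bytes(utf_code)
    if len(code) < 4:
        raise ValueError("expected the code of a USV above U+10FFFF")
    return code[:1] + b"\x30" + code[1:]


def from_gb_x(gb_code):
    """Remove the inserted byte 30 to recover the original code."""
    code = bytes(gb_code)
    if len(code) < 5 or code[1] != 0x30:
        raise ValueError("not an extended code of this form")
    return code[:1] + code[2:]


# The two examples from the text: U+110000 and U+7FFFFFFF.
print(to_gb_x(bytes.fromhex("F4908080")).hex(" ").upper())
print(to_gb_x(bytes.fromhex("FDBFBFBFBFBF")).hex(" ").upper())
```

These print F4 30 90 80 80 and FD 30 BF BF BF BF BF, matching the examples given above, and from_gb_x reverses the mapping.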

9. History and status of UCS-X

The authors (Bishop and Cook) came up with the idea of UCS-X (but not UCS), starting in 2007. We don't claim any authority (power to set standards); we're acting as individuals, not on behalf of an organization. UCS-X has not been approved by the Unicode Consortium, or ISO, or any other standards organization. We're letting people know about it on an experimental basis. In no event shall we be liable for any damages. The original UTF-8 for codes up to six bytes in length (which we call UTF-G-8) was invented by Ken Thompson, and the extension of UTF-8 to thirteen bytes (which we call UTF-E-8) was invented by Larry Wall.

10. Implementation

A Perl script that implements conversion, in both directions, between UCS-X scalar values and encoding forms (UTF-X-8, UTF-X-16, and UTF-X-32) is provided for running or viewing.

Also, C source code is provided that supports UCS-G. The three files ConvertUTFG.c, ConvertUTFG.h, and utfg_harness.c are modified versions of the UTF conversion programs that were formerly on the unicode.org website. (The original programs are no longer on unicode.org.)

C source code for UCS-E and UCS-∞ is not yet available.

11. Future development

To do:

12. Request for comments

Please email comments, suggestions, and constructive criticisms to the authors.

∞. References

The Unicode Consortium (HTML)

WG2/N3248 Synchronization Issues for UTF-8 Ken Whistler (2007-04-20) (Microsoft Word document)

UTR 17 Character Encoding Model Ken Whistler, Mark Davis, Asmus Freytag (HTML)

UTF-8 and Unicode FAQ Markus Kuhn (HTML)

ISO-10646-UTF-8 (HTML)

ISO-10646-UTF-16 (HTML)

UTF-8 History Ken Thompson, Rob Pike, et al. (plain text)

utf-fss.c (more history) from CJKV Information Processing by Ken Lunde (plain text)

The Perl programming language Larry Wall, The Perl Foundation (HTML)

JTC1/SC2/WG2 - ISO/IEC 10646 - UCS (HTML)

UTF-8: RFC 2279 (1998, max. U+7FFFFFFF), RFC 3629 (2003, max. U+10FFFF) (HTML)

UTF-16: RFC 2781 (2000) (HTML)

