The UCS-X Family of UCS Extensions (Draft Proposal)

Set	Encoding forms			Maximum code point	Size (code points)
UCS-M	UTF-8	UTF-16	UTF-32	`U+10FFFF`	1,114,112 (mega)
UCS-G	UTF-G-8	UTF-G-16	UTF-G-32	`U+7FFFFFFF`	2³¹ ≈ 2x10⁹ (giga)
UCS-E	UTF-E-8	UTF-E-16	UTF-E-32	`U+7FFFFFFFFFFFFFFF`	2⁶³ ≈ 9x10¹⁸ (exa)
UCS-∞	UTF-∞-8	UTF-∞-16	UTF-∞-32	(variable)	∞ (infinity)

Important: these are drafts of work in progress, not finalized specifications. They all may change.

Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated May 15, 2012.

Abstract

Compatible extensions are specified for the Universal Character Set (UCS) and the encoding forms UTF-8, UTF-16, and UTF-32, enabling usage of an unlimited number of character code points. These extensions overcome the current limitation of UCS to approximately one million code points.

1. Introduction
2. Meaning of “compatible extension”
3. Rationale
4. Assignment of code points
5. Byte-order (big-/little-endian encoding schemes and BOM)
6. Conformance
7. Security
8. Extending non-UCS encodings
9. History and status of UCS-X
10. Implementation
11. Future development
12. Request for comments
∞. References

1. Introduction

The Unicode Standard and ISO 10646 define UCS, which is the foundation for all modern text processing. The current document assumes familiarity with those standards.

“UCS” = “Universal Character Set” = “Unlimited Code Space.”
“UTF” = “UCS Transformation Format” = “encoding form.”
The letter “X” means “extended,” and is also a wildcard meaning any one of the following:
- M = mega = million;
- G = giga = 10⁹ (billion or thousand million);
- E = exa = 10¹⁸ (quintillion or million million million);
- ∞ = infinity.

UCS has been defined with different sizes:

2,147,483,648 (maximum U+7FFFFFFF) by the earliest version of ISO 10646
65,536 (maximum U+FFFF) by the earliest version of the Unicode Standard
1,114,112 (maximum U+10FFFF) by later versions of the standards.

Here “size” means the number of code points, including unassigned (reserved) as well as assigned (designated) code points. (249,031 code points are assigned in Unicode 6.0, including: 137,468 private-use; 109,242 graphic; 2,048 surrogates; 142 format; 66 noncharacters; and 65 control.)

We define “UCS-M” to mean the 1,114,112 code points (1,112,064 UCS scalar values, maximum U+10FFFF) specified by the Unicode Standard and ISO 10646, providing a permanent explicit name for the set of this particular size.
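The sizes quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only; the variable names are ours and are not part of any specification.

```python
# Illustrative arithmetic only; all names here are hypothetical.
MAX_UCS_M = 0x10FFFF                      # highest UCS-M code point
SURROGATES = 0xDFFF - 0xD800 + 1          # U+D800..U+DFFF are not scalar values

ucs_m_size = MAX_UCS_M + 1                # code points U+0000..U+10FFFF
ucs_m_scalars = ucs_m_size - SURROGATES   # UCS scalar values

# Breakdown of assigned code points in Unicode 6.0, as quoted in the text.
assigned_6_0 = {
    "private-use": 137468,
    "graphic": 109242,
    "surrogates": 2048,
    "format": 142,
    "noncharacters": 66,
    "control": 65,
}

print(f"{ucs_m_size:,} code points, {ucs_m_scalars:,} scalar values")
print(f"{sum(assigned_6_0.values()):,} assigned in Unicode 6.0")
```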

We specify three supersets of UCS-M, named “UCS-G”, “UCS-E”, and “UCS-∞”. Each superset has a specification for extending each of the three encoding forms UTF-8, UTF-16, and UTF-32 to support more code points. Hence, there are nine (3x3) detailed encoding specifications, “UTF-G-8” through “UTF-∞-32”. (See the table at the top of this page.)

The larger sets are supersets of the smaller ones. (In other words, the smaller sets are subsets of the larger ones.) The encoding forms with larger limits are compatible extensions of those with smaller limits.

Widespread usage is likely to occur for UCS-G (smallest of the supersets) sooner than for UCS-E or UCS-∞. Therefore, the UCS-G specification is written so that it stands on its own, and can be read first. The specifications for UCS-E and UCS-∞ occasionally refer back to UCS-G, for brevity and clarity, rather than repeating the same information. (Readers who prefer to start with ∞ are welcome to do so, of course. After ∞ has been fully grasped, the encoding details of E and G are mostly redundant, but some discussion, analysis, etc., that appear in the G or E specifications should still be read for completeness.)

Most details of UCS-X are provided in the separate specifications for UCS-G, UCS-E, and UCS-∞, and their UTF encoding forms, which are to be regarded as portions of this UCS-X specification.

2. Meaning of “compatible extension”

UCS-X provides compatible extensions of the encoding forms UTF-8, UTF-16, and UTF-32, in much the same way that UTF-8 is a compatible extension of ASCII. The following attributes have made UTF-8 especially popular for compatibility with applications, protocols, formats, etc., that were originally ASCII-based (e.g., Unix, email, HTML):

Any ASCII file is a UTF-8 file, preserving character identities.
Any UTF-8 file that contains only ASCII characters is also an ASCII file.
Codes are compatible: searching does not produce false matches, leading/trailing bytes are distinguished.

(Note on terminology: in a multibyte UTF-8 code, the first byte is called “leading” and all the other bytes are called “trailing.” Similarly, for UTF-16 and extensions of UTF-32, in a code consisting of more than one code unit, the first unit is called “leading” and all the other units are called “trailing.”)
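The three attributes listed above can be observed directly with standard UTF-8. The following Python sketch is only an illustration; the helper name is_trailing is ours.

```python
def is_trailing(byte: int) -> bool:
    """True for UTF-8 trailing bytes, which always have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80

text = "plain ASCII"
# An ASCII file is byte-for-byte a UTF-8 file.
assert text.encode("ascii") == text.encode("utf-8")

# In a multibyte code, only the first byte is a leading byte.
code = "€".encode("utf-8")               # U+20AC -> E2 82 AC
flags = [is_trailing(b) for b in code]   # [False, True, True]

# Searching for an ASCII byte never matches inside a multibyte code,
# because every byte of a multibyte code is in the range 80..FF.
assert all(b >= 0x80 for b in code)
```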

Fortunately for the success of UTF-8, many applications that originally supported ASCII were written in such a way that bytes with the values 00..7F (hexadecimal) were recognized as ASCII, while bytes 80..FF were treated as generic characters and generally stored, transmitted, etc., intact. Applications passing this test were called “8-bit safe.”

There is another sense in which UTF-8 is NOT compatible with ASCII. A “strict” application of ASCII might disallow any character or byte beyond 7F. When encountering an “illegal” character, such an application might crash, report an error and quit, delete the character, or otherwise bend/fold/mutilate it. Nevertheless, many such applications could be made 8-bit safe, often by means of trivial modifications.

Each of the extended encoding forms is intended to provide analogous compatibility for the corresponding UCS-M encoding form (UTF-8/16/32):

Any UCS-M (UTF-8/16/32) file is a UCS-X (UTF-X-8/16/32) file, preserving character identities.
Any UCS-X (UTF-X-8/16/32) file that contains only UCS-M characters is also a UCS-M (UTF-8/16/32) file.
Codes are compatible: searching does not produce false matches, leading/trailing units are distinguished.

These attributes should make it relatively easy to use or adapt existing applications to handle UCS-X. This is the most basic way in which UCS-X is a “compatible extension.” In addition, compatibility requires recognition that Unicode and ISO 10646 are complex, detailed standards, and UCS-X is designed to follow and harmonize with them as much as possible.

3. Rationale

This specification is based on the assumption that availability of more than a million UCS code points may be useful. It does not assume any particular future application. It proposes the assignment of private-use code points that could be used for any application; see the next section.

NOTE: The authors are motivated by our belief that a permanent and truly universal character set and information encoding standard should enable use of an unlimited number of character code points. As an optional supplement to the specification itself, there is documentation explaining the rationale for UCS-X, including ideas for a variety of hypothetical applications.

4. Assignment of code points

4.1 All code points greater than U+10FFFF whose first (most significant) hexadecimal digit is 1, E, or F (U+1..., U+E..., and U+F...) are assigned for extended private use (or “user-defined”). None of these code points are “noncharacters.”

4.2 All other code points greater than U+10FFFF (i.e., those whose first digit is 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, or D) are reserved for future assignment by standards organizations.
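Rules 4.1 and 4.2 depend only on the most significant hexadecimal digit of the code point. A minimal Python sketch of that test follows; the function name classify_extended and the returned strings are hypothetical, chosen only for illustration.

```python
def classify_extended(cp: int) -> str:
    """Classify a code point above U+10FFFF under rules 4.1 and 4.2."""
    if cp <= 0x10FFFF:
        raise ValueError("rules 4.1 and 4.2 apply only above U+10FFFF")
    first_digit = format(cp, "X")[0]      # most significant hex digit, no zero padding
    return "private-use" if first_digit in "1EF" else "reserved"

for cp in (0x110000, 0x200000, 0xE00000, 0x123FFFE, 0x222FFFFE):
    print(f"U+{cp:X}: {classify_extended(cp)}")
```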

NOTES:

As of Unicode version 2.0 (and still in version 6.0; see also the following note), there are 137,468 private-use code points, namely the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. The assignment here proposed would effectively add three private-use ranges to UCS-G: U+110000..U+1FFFFF; U+E00000..U+1FFFFFF; and U+E000000..U+1FFFFFFF, total 321,847,296 code points. In UCS-E, the last and largest private-use range would be U+E00000000000000..U+1FFFFFFFFFFFFFFF with 1,297,036,692,682,702,848 code points.
The Unicode Character Encoding Stability Policy states, “The General_Category property value Private_Use (Co) is immutable: the set of code points with that value will never change.” It may therefore be appropriate to define a new property or value Extended_Private_Use.
It may eventually be useful to distinguish between different kinds of private use. According to Unicode 6.0, “By convention, the primary Private Use Area [U+E000..U+F8FF] is divided into a corporate use subarea for platform writers, starting at U+F8FF and extending downward in values, and an end-user subarea, starting at U+E000 and extending upward.” More conventions of this sort may arise.
Unicode 6.0 assigns 66 non-character code points, or “noncharacters,” including all code points up to U+10FFFF ending in ...FFFE or ...FFFF. Applications are free to use any of these code points internally but should never attempt to interchange them. (In effect, noncharacters can be thought of as application-internal private-use code points.) It is not certain whether adding to the set of noncharacters would be beneficial. If all code points greater than U+10FFFF ending in ...FFFE or ...FFFF were assigned as noncharacters, it would be impossible to have a numerically contiguous range of character code points, even in a private-use area, longer than 65,534. Such ubiquitous noncharacters might do more harm than good, when noncharacters are capable of triggering errors or being deleted/ignored, or treated as unassigned. Therefore this specification does not assign any more noncharacters. Code points assigned for private use, such as U+123FFFE and U+FFFFFFFF, are not noncharacters. An unassigned code point such as U+222FFFFE might someday be assigned as a noncharacter.
For UCS-∞, infinitely many code points would be reserved, while an infinite set of private-use code points would be assigned, including contiguous ranges of any desired length. The entire set of nonnegative integers can be mapped into private-use USV (UCS scalar values), for example by adding 1,114,112 (decimal) to the number (to go beyond U+10FFFF), writing the sum as hexadecimal, and inserting “U+1” as a prefix to ensure it is private-use. (Similarly, mappings could be devised that include negative integers, rational numbers, or any countably infinite set; however, the binary sorting of codes might not agree with numerical ordering for sets other than the nonnegative integers.)
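The range sizes given in the first note, and the integer mapping described in the last note, can be checked with the following Python sketch. The function names are ours; the mapping is the one example the note describes, not a required algorithm.

```python
def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1

# The three private-use ranges that the proposal adds to UCS-G.
ucs_g_added = (
    range_size(0x110000, 0x1FFFFF)
    + range_size(0xE00000, 0x1FFFFFF)
    + range_size(0xE000000, 0x1FFFFFFF)
)

# Last private-use range of UCS-E: E followed by 14 zeros .. 1 followed by 15 F's.
ucs_e_last = range_size(0xE << 56, (1 << 61) - 1)

def private_use_usv(n: int) -> int:
    """Map a nonnegative integer to a private-use scalar value, as in the note."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    digits = format(n + 1_114_112, "X")   # go beyond U+10FFFF, write as hexadecimal
    return int("1" + digits, 16)          # prefix the digit 1 so that it is private-use

print(f"{ucs_g_added:,} and {ucs_e_last:,}")
print(f"0 -> U+{private_use_usv(0):X}")
```

Because the sum never has leading zeros, larger integers always map to larger scalar values, which is why binary sorting agrees with numerical order for this mapping.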

5. Byte-order (big-/little-endian encoding schemes and BOM)

The Unicode Standard makes a distinction between encoding “forms” (which don't specify byte-order) and encoding “schemes” (which do). UCS-X specifies “forms.” Byte-order is irrelevant for UTF-X-8. For UTF-X-16 and UTF-X-32, byte-order can be specified with names such as UTF-G-16LE, etc., or possibly with BOM (byte-order mark) U+FEFF (but see the second note below). The corresponding schemes exactly follow the analogy of UTF-16LE, etc., as defined by the Unicode Standard.

NOTES:

The Unicode Standard (5.0, page 107) states: “The UTF-32 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-32 encoding scheme is big-endian.” This conformance requirement needs to be followed for UCS-X as well. There is a potential problem with little-endian UTF-X-32 files if they do not include BOM (FF FE 00 00). Since the first three bytes could have almost any values, there is risk of confusion with BOM for a different encoding scheme. For example, U+5FEFF is encoded as FF FE 05 00, which can be mistaken for UTF-16 BOM (followed by U+0005). Similarly, in little-endian UTF-X-32, U+5BFBBEF is encoded as EF BB BF 05, which can be mistaken for UTF-8 BOM (EF BB BF, followed by U+0005). Therefore, little-endian UTF-X-32 should not be used in the absence of a higher-level protocol (or possibly BOM, but see the following note).
If BOM (U+FEFF) is narrowly interpreted to indicate only one of the UCS-M encodings, not any of the extended UCS-X encodings, then it could be misleading to start a text with BOM and then include codes for any code points greater than U+10FFFF. (Compare the note about XML declarations in the following section.) Possibly a code point serving as an encoding signature should be assigned for UCS-X, or even three code points for UCS-G/E/∞, respectively. This detail of the UCS-X specification still needs to be decided. In the meantime, higher-level protocols should be used to indicate the encoding of UCS-X text with code points beyond U+10FFFF, and BOM (U+FEFF) should not be used in such text.
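The risk of mistaking the first bytes of little-endian 32-bit text for the signature of another encoding can be reproduced with the Python standard library. This sketch only packs single 32-bit code units; the helper name utf32le_bytes is ours, and the sample values are chosen so that their bytes begin with a known signature.

```python
import codecs
import struct

def utf32le_bytes(cp: int) -> bytes:
    """Little-endian 32-bit code unit for cp (illustrative helper only)."""
    return struct.pack("<I", cp)

samples = {
    0xFEFF: "the BOM itself",
    0x5FEFF: "begins with the UTF-16LE signature FF FE",
    0x5BFBBEF: "begins with the UTF-8 signature EF BB BF",
}
for cp, note in samples.items():
    print(f"U+{cp:X}: {utf32le_bytes(cp).hex(' ').upper()}  ({note})")
```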

6. Conformance

6.1 To help ensure compatibility, security, and stability, conformance to UCS-X requires conformance to the Unicode Standard and ISO 10646, with the exception that code points beyond U+10FFFF (and within a specified upper limit) are legal for interchange between UCS-X-compatible processes.

6.2 A UCS-X-compatible process must never feed noninterpretable out-of-range data to another process. A UCS-X-compatible process must be explicitly identified as such (in its documentation or through suitable protocols). In the absence of explicit identification, no process should be assumed to be UCS-X-compatible or to support any code points greater than U+10FFFF.

NOTE: Simply saving a file might not literally be “feeding it to another process”; but, for example, a file containing an XML declaration with encoding="UTF-8" and also containing the UTF-E-8 code for U+123456789 would violate the spirit of this rule. Better would be encoding="X-UTF-E-8". (The prefix "X-" should be used, in accordance with www.w3.org/TR/xml/, until the UCS-X encoding names have been added to the IANA Registry.) See also the second note about BOM in the previous section (5. Byte-Order).

6.3 A UCS-X-compatible process must indicate its upper limit as a maximum number of USV hex digits (maxNUD). The only allowable values for maxNUD are (in decimal):

6, meaning maximum U+10FFFF (same as UCS-M)
8, meaning maximum U+7FFFFFFF (same as UCS-G)
16, meaning maximum U+7FFFFFFFFFFFFFFF (fifteen F's, same as UCS-E)
32, 64, 128, ... (power of two ≥ 32), meaning maximum U+7FFF... (the digit 7 followed by maxNUD − 1 F's)
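The relationship between maxNUD and the maximum code point can be written compactly. The following Python sketch is one possible reading of rule 6.3; the function names are hypothetical.

```python
def is_valid_max_nud(max_nud: int) -> bool:
    """6 is allowed as a special case; otherwise a power of two that is at least 8."""
    if max_nud == 6:
        return True
    return max_nud >= 8 and max_nud & (max_nud - 1) == 0

def max_code_point(max_nud: int) -> int:
    """Highest code point permitted for a given maxNUD."""
    if not is_valid_max_nud(max_nud):
        raise ValueError(f"maxNUD {max_nud} is not allowed by rule 6.3")
    if max_nud == 6:
        return 0x10FFFF
    # Digit 7 followed by (max_nud - 1) F's: every bit set except the top one.
    return (1 << (4 * max_nud - 1)) - 1

for n in (6, 8, 16, 32):
    print(f"maxNUD {n}: U+{max_code_point(n):X}")
```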

6.4 The Unicode Standard specifies how a Unicode-compatible process must correctly handle every USV up to U+10FFFF. In an exactly analogous way, a UCS-X-compatible process must correctly handle every USV up to its maximum.

6.5 In each encoding form (UTF-X-8, UTF-X-16, and UTF-X-32), only the unique shortest well-formed code that could represent a particular USV is allowed; a longer code with excessive zero padding which might otherwise represent the same USV is ill-formed and illegal. (This is to prevent problems analogous to those with some early implementations of UTF-8.)
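The shortest-form rule can be illustrated with a small encoder and decoder. Section 9 notes that UTF-G-8 is our name for the original UTF-8 with codes up to six bytes, so the sketch below assumes that layout; it is not taken from the detailed UTF-G-8 specification, it does not exclude surrogates, and the names encode and decode are ours.

```python
# (leading-byte prefix, total payload bits) for codes of 1..6 bytes,
# following the original six-byte UTF-8 layout (an assumption of this sketch).
_LAYOUT = [(0x00, 7), (0xC0, 11), (0xE0, 16), (0xF0, 21), (0xF8, 26), (0xFC, 31)]

def encode(usv: int) -> bytes:
    """Shortest-form code for a scalar value up to U+7FFFFFFF."""
    if usv < 0:
        raise ValueError("negative value")
    for length, (lead, bits) in enumerate(_LAYOUT, start=1):
        if usv < (1 << bits):
            trailing = [0x80 | ((usv >> (6 * i)) & 0x3F)
                        for i in range(length - 2, -1, -1)]
            return bytes([lead | (usv >> (6 * (length - 1)))] + trailing)
    raise ValueError("beyond U+7FFFFFFF")

def decode(code: bytes) -> int:
    """Decode one complete code, rejecting anything that is not the shortest form."""
    if not code:
        raise ValueError("empty input")
    first = code[0]
    for length, (lead, bits) in enumerate(_LAYOUT, start=1):
        payload = bits - 6 * (length - 1)          # payload bits in the leading byte
        if first >> payload == lead >> payload:
            break
    else:
        raise ValueError("not a leading byte")
    if len(code) != length or any(b & 0xC0 != 0x80 for b in code[1:]):
        raise ValueError("wrong length or bad trailing byte")
    value = first & ((1 << payload) - 1)
    for b in code[1:]:
        value = (value << 6) | (b & 0x3F)
    if encode(value) != code:
        raise ValueError("not the shortest form (excess zero padding)")
    return value
```

For example, decode(b"\xC0\x80") raises ValueError, because U+0000 already has the one-byte code 00.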

6.6 Software and hardware modules implementing UCS-X should specify maxNUD, and also specify their behaviors when codes beyond that limit are encountered. (For example, replacement of out-of-range codes with U+FFFD.) Protocols should be developed to accomplish this communication at run-time. For example, a compatible library might include a function for querying the upper limit for the library. Functions that accept UCS-∞ as input or produce it as output should have arguments for their callers to specify their own upper limits. Even a function that is in principle capable of handling any size code should be configurable to enforce a limit, for compatibility with other modules and to avoid excessive memory usage or processing time.
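One way to make the out-of-range behavior explicit is for every function that accepts scalar values to take the caller's upper limit as an argument. The Python sketch below shows replacement with U+FFFD under that assumption; the function name and its parameters are hypothetical, and other behaviors (such as reporting an error) are equally possible.

```python
REPLACEMENT = 0xFFFD   # U+FFFD REPLACEMENT CHARACTER

def filter_scalars(scalars, limit=0x10FFFF, replacement=REPLACEMENT):
    """Replace each scalar value above limit with one replacement value.

    The default limit is the UCS-M maximum, so a caller that says nothing
    about UCS-X support is treated as supporting only U+0000..U+10FFFF.
    """
    return [cp if cp <= limit else replacement for cp in scalars]

text = [0x41, 0x110000, 0x80000000]
print([f"U+{cp:X}" for cp in filter_scalars(text)])                    # UCS-M caller
print([f"U+{cp:X}" for cp in filter_scalars(text, limit=0x7FFFFFFF)])  # UCS-G caller
```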

NOTE: According to the original definition of UTF-8, it was correct for a program to interpret the 4-byte UTF-8 code F4 90 80 80 as U+110000, and convert it to UCS-2 as FFFD (U+FFFD REPLACEMENT CHARACTER). Under the U+10FFFF limit, however, F4 90 80 80 may be treated as an illegal byte sequence, and converted to UTF-16 as FFFD FFFD FFFD FFFD. See Unicode Technical Report #36, especially section 3.6 Secure Encoding Conversion, which implies that stopping conversion and reporting an error (“strategy 1”), converting to FFFD (“strategy 2”), or converting to FFFD FFFD FFFD FFFD (“strategy 3”) may all be considered safe, while “Strategy 2 is the most natural.” UCS-X applications should observe UTR #36, and apply the same principle, for example, when encountering the 7-byte UTF-E-8 code for U+80000000 given maxNUD = 8. For security reasons (see also the following section), it is essential to follow the rule in UTR #36 that “an illegal byte sequence must not include bytes that encode valid characters or are leading bytes for valid characters.” The same rules apply when “byte” is replaced with “code unit.”

7. Security

UCS-X is still experimental and should not be hastily employed in situations where unpredictability is hazardous. Errors are liable to be made in the design and implementation of any new or extended technology.

Precise specification of UCS-X will help to coordinate experimentation and testing, to answer questions about potential effects of UCS-X on efficiency, reliability, compatibility, and other security-related issues.

“[A]ctual implementations must deal with error conditions, including out-of-range errors, and having two specifications [namely: the original UTF-8, which we call UTF-G-8; and UTF-16] which treat that edge condition somewhat differently can be real trouble in distributed software.” (N3248) The risk of such trouble would be greatly multiplied in the case of multiple specifications for sets of different sizes, lacking a common protocol for consistent handling of out-of-range characters. The previous section (6. Conformance) provides a foundation for such protocols, including the establishment of UCS-M as the default set. Some details remain to be specified (especially section 6.6). The issue of interoperability is crucial from the standpoint of security.

There is also potential risk in moving too slowly with testing and standardizing UCS-X, given that 64-bit computers are already common. Font-related technologies are very powerful tools, ripe for application to new domains. International standards need to anticipate the possible deployment of technologies supporting more than 137,468 private-use code points. Without adequate standards, a wide variety of incompatible UCS extensions could arise, resulting in garbled communications, breakdowns, and worse. Problems of UCS extension should be solved, at least in theory, and with some experimental evidence, before they become urgent.

8. Extending non-UCS encodings

Some nations and other groups use encodings that are not UCS, yet can be mapped to and from UCS. For handling code points beyond U+10FFFF, users of such encodings may choose either to adopt UCS-X directly, or to create equivalent extensions of their own encodings. We give one example of how that might be done. The Chinese national standard character encoding GB18030 already has codes that map to and from all UCS code points up to U+10FFFF, whose GB code is E3 32 9A 35. An effective compatible extension of GB could easily be made as follows: to obtain the GB-X code for any USV greater than U+10FFFF, start with the UTF-∞-8 code and insert the byte 30 between the first and second bytes. For example, U+110000 would be F4 30 90 80 80; and U+7FFFFFFF would be FD 30 BF BF BF BF BF. Such codes would be different from all currently defined GB codes and easily recognized as such from the first two bytes. Of course, extension of local standards is a matter for local authorities. This possibility simply demonstrates that flexible solutions for continued compatibility between standards may be fairly easy to find.
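The byte insertion described above is simple enough to show in a few lines. This Python sketch takes an already encoded UTF-∞-8 code as input and reproduces the two examples from the text; the function name is hypothetical, and no such extension of GB18030 has been defined by any authority.

```python
def gbx_from_utf8_code(code: bytes) -> bytes:
    """Insert the byte 30 between the first and second bytes of the code."""
    if len(code) < 4:
        # Codes for scalar values above U+10FFFF are at least four bytes long.
        raise ValueError("expected the code of a scalar value above U+10FFFF")
    return code[:1] + b"\x30" + code[1:]

for utf8_code in ("F4 90 80 80", "FD BF BF BF BF BF"):
    gbx = gbx_from_utf8_code(bytes.fromhex(utf8_code))
    print(f"{utf8_code} -> {gbx.hex(' ').upper()}")
```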

9. History and status of UCS-X

The authors (Bishop and Cook) came up with the idea of UCS-X (but not UCS), starting in 2007. We don't claim any authority (power to set standards); we're acting as individuals, not on behalf of an organization. UCS-X has not been approved by the Unicode Consortium, or ISO, or any other standards organization. We're letting people know about it on an experimental basis. In no event shall we be liable for any damages. The original UTF-8 for codes up to six bytes in length (which we call UTF-G-8) was invented by Ken Thompson, and the extension of UTF-8 to thirteen bytes (which we call UTF-E-8) was invented by Larry Wall.

10. Implementation

A Perl script that implements conversion, in both directions, between UCS-X scalar values and encoding forms (UTF-X-8, UTF-X-16, and UTF-X-32) is provided for running or viewing.

Run a web-based version of the script (in a separate window)
View the source code (in a separate window)

Also, C source code is provided that supports UCS-G. The three files ConvertUTFG.c, ConvertUTFG.h, and utfg_harness.c are modified versions of the UTF conversion programs that were formerly on the unicode.org website. (The original programs are no longer on unicode.org.)

C source code for UCS-E and UCS-∞ is not yet available.

11. Future development

To do:

refine conformance rules and protocols for security and stability
develop and test implementations for UCS-E and UCS-∞ in the C programming language (to supplement the existing Perl implementation for UCS-X, and C implementation for UCS-G)
implement “filtering” utilities, for removing (or replacing with U+FFFD, or a string, etc.) all codes for code points above a given limit (such as U+10FFFF, or U+7FFFFFFF, etc.)
experiment with adding UCS-X support to essential Unicode-based libraries such as ICU and PCRE
revise wording of specifications to conform more closely to conventions used in prior specifications, RFCs, etc.
organize references better
make known to UCS people, form working group (formal or informal)
translate the specs into languages besides English
consider optional use of underscores in “U+” notation for USV with more than eight digits, to improve legibility; for example “U+1_23456789” would be equivalent to “U+123456789”, and “U+FFF_FFFFFFFF_FFFFFFFF” would be equivalent to “U+FFFFFFFFFFFFFFFFFFF” (Perl and Java support underscores between digits)
consider a naming convention such that, for example, “UCS-X512” would mean “UCS-∞ (maxNUD=512)”, and “UTF-X1024-8” would mean “UTF-∞-8 (maxNUD=1024)” (then “M”, “G”, and “E” might be considered abbreviations for “X6”, “X8”, and “X16”, respectively)
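The last two items are only under consideration, but a parser for them would be short. The following Python sketch assumes the conventions exactly as worded above; both function names are hypothetical, and names containing “∞” are not handled.

```python
import re

def parse_usv(notation: str) -> int:
    """Parse 'U+' notation, allowing single underscores between hex digits."""
    if not re.fullmatch(r"U\+[0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*", notation):
        raise ValueError(f"not a valid U+ notation: {notation!r}")
    return int(notation[2:].replace("_", ""), 16)

_NAME = re.compile(r"UCS-(M|G|E|X\d+)|UTF-(G|E|X\d+)-(?:8|16|32)")

def max_nud_from_name(name: str) -> int:
    """Read maxNUD from a name such as 'UCS-X512', 'UTF-X1024-8', or 'UCS-G'."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized name: {name!r}")
    tag = match.group(1) or match.group(2)
    # M, G, and E are treated as abbreviations for X6, X8, and X16.
    return {"M": 6, "G": 8, "E": 16}.get(tag) or int(tag[1:])

print(hex(parse_usv("U+1_23456789")))
print(max_nud_from_name("UTF-X1024-8"))
```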