Draft Proposal for UTF-G-32 Specification


Authors: Tom Bishop (tbishop@wenlin.com) and Richard Cook (rscook@wenlin.com).

Updated August 26, 2007

Please Note: This is not a finalized specification. It is still at the "draft proposal" stage and may change.

1. Introduction

The name of this UTF (UCS Transformation Format) is "UTF-G-32". UTF-G-32 extends UTF-32 to support over two billion characters (2^31 code points), with code points from U+0000 up to U+7FFFFFFF.

UTF-G-32 is one of the encodings defined as part of UCS-G, which also includes similar extensions for UTF-8 and UTF-16. For general information about UCS-G, please see the UCS-G Specification.

UTF-G-32 is identical to the original UCS-4 encoding; only the name is new. Unlike UTF-32, it explicitly supports code points greater than U+10FFFF. (Readers already familiar with UCS-4 will probably find the remainder of this UTF-G-32 specification superfluous.)

2. Properties

UTF-G-32 preserves and extends useful properties of UTF-32 and UCS-4. For code points less than or equal to U+10FFFF, it is identical to UTF-32. For all code points, including those above U+10FFFF, it is identical to the original UCS-4 encoding.
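As an informal check of the UTF-32 identity claim, the following Python sketch compares a plain 32-bit encoding of a few code points with Python's built-in UTF-32 codec. Big-endian byte order is an assumption made for the illustration; this draft does not state a byte order. Surrogate code points are left out of the sample list because Python's codec rejects them.

```python
import struct

# Sample code points at or below U+10FFFF (surrogates deliberately excluded).
samples = [0x41, 0x4E2D, 0xFFFF, 0x10000, 0x10FFFF]

for cp in samples:
    # One 32-bit unit holding the code point value (big-endian is assumed here).
    unit_bytes = struct.pack(">I", cp)
    # Python's UTF-32 codec, big-endian variant, which writes no byte order mark.
    utf32_bytes = chr(cp).encode("utf-32-be")
    assert unit_bytes == utf32_bytes
```

Above U+10FFFF the comparison cannot be made this way, because Python's `chr` itself rejects values greater than 0x10FFFF.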

UTF-G-32 uses 32-bit code units. Every code is exactly one unit long.

A simple binary comparison of UTF-G-32 codes (comparing code units as unsigned 32-bit integers) yields the same sort order as a numerical comparison of code points.
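The sort-order property can be illustrated with a small Python sketch. It assumes that "binary comparison" is applied either to code units as unsigned integers or to their bytes in big-endian order; the byte orders shown are choices made for this illustration, not something this draft prescribes. The last lines show that a byte-wise comparison of little-endian data does not preserve the order.

```python
import struct

code_points = [0x7FFFFFFF, 0x41, 0x110000, 0x10FFFF, 0x12345678]
by_value = sorted(code_points)

# Byte-wise sort of big-endian units, decoded back to code points.
big_endian_sorted = sorted(struct.pack(">I", cp) for cp in code_points)
by_big_endian_bytes = [struct.unpack(">I", b)[0] for b in big_endian_sorted]

# Byte-wise sort of little-endian units, decoded back to code points.
little_endian_sorted = sorted(struct.pack("<I", cp) for cp in code_points)
by_little_endian_bytes = [struct.unpack("<I", b)[0] for b in little_endian_sorted]

print(by_big_endian_bytes == by_value)     # big-endian bytes keep code point order
print(by_little_endian_bytes == by_value)  # little-endian bytes do not
```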

3. Details

3.1 Terms and Abbreviations

3.2 Code points less than or equal to U+7FFFFFFF

A code is a single unit whose value is the USV itself.
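A minimal Python sketch of this rule follows. The function names `encode_utf_g_32` and `decode_utf_g_32` are hypothetical, and the big-endian serialization is an assumption of the sketch rather than part of this draft. The only real work is the range check against U+7FFFFFFF.

```python
import struct

MAX_CODE_POINT = 0x7FFFFFFF  # upper limit stated in this draft

def encode_utf_g_32(code_points):
    """Encode an iterable of code points, one 32-bit unit each (big-endian assumed)."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError(f"value {cp:#x} is outside the UTF-G-32 range")
        out += struct.pack(">I", cp)
    return bytes(out)

def decode_utf_g_32(data):
    """Decode big-endian 32-bit units back into a list of code points."""
    if len(data) % 4 != 0:
        raise ValueError("data length is not a multiple of 4 bytes")
    code_points = []
    for (unit,) in struct.iter_unpack(">I", data):
        if unit > MAX_CODE_POINT:
            raise ValueError(f"unit {unit:#010x} is outside the UTF-G-32 range")
        code_points.append(unit)
    return code_points

encoded = encode_utf_g_32([0x41, 0x110000, 0x7FFFFFFF])
assert len(encoded) == 12  # three code points, four bytes each
assert decode_utf_g_32(encoded) == [0x41, 0x110000, 0x7FFFFFFF]
```

Because every code is one unit, the number of code points in a buffer is simply its byte length divided by four.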

4. Examples

 U+0041 = 00000041 (the code for the letter 'A')

 U+10FFFF = 0010FFFF (the last UTF-32 code)

 U+110000 = 00110000

 U+12345678 = 12345678

 U+7FFFFFFF = 7FFFFFFF (the last original UCS-4 code and the last UTF-G-32 code)
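The examples above can be reproduced with a few lines of Python. The helper name `code_hex` is hypothetical; it formats the single 32-bit code unit for a code point as eight hexadecimal digits, matching the notation used in this section.

```python
def code_hex(code_point):
    """Return the single UTF-G-32 code unit as eight uppercase hex digits."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point is outside the UTF-G-32 range")
    return f"{code_point:08X}"

# The five examples listed in this section.
examples = {
    0x0041: "00000041",
    0x10FFFF: "0010FFFF",
    0x110000: "00110000",
    0x12345678: "12345678",
    0x7FFFFFFF: "7FFFFFFF",
}

for cp, expected in examples.items():
    assert code_hex(cp) == expected
```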
