Network Working Group M. Crispin
Request for Comments: 4042 Panda Programming
Category: Informational 1 April 2005
UTF-9 and UTF-18
Efficient Transformation Formats of Unicode
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2005).
Abstract
ISO-10646 defines a large character set called the Universal
Character Set (UCS), which encompasses most of the world's writing
systems. The same set of codepoints is defined by Unicode, which
further defines additional character properties and other
implementation details. By policy of the relevant standardization
committees, changes to Unicode and amendments and additions to
ISO/IEC 10646 track each other, so that the character repertoires and
code point assignments remain in synchronization.
The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
are efficient in neither storage nor computation on platforms that
utilize the 9-bit nonet as a natural storage unit instead of the
8-bit octet.
This document describes a transformation format of Unicode that takes
advantage of the nonet so that the format will be storage and
computation efficient.
1. Introduction
A number of Internet sites utilize platforms that are not based upon
the traditional 8-bit byte or octet. One such platform is the PDP-
10, which is based upon a 36-bit word. On these platforms, it is
wasteful to represent data in octets, since 4 bits are left unused in
each word. The 9-bit nonet is a much more sensible representation.
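The arithmetic behind this claim can be sketched as follows. This is
an illustrative sketch only: the helper names unused_bits and
pack_word are hypothetical, and the left-justified, high-order-first
layout is an assumption made for the example, not something this
memo specifies.

```python
WORD_BITS = 36  # PDP-10 word size, as stated in the text above


def unused_bits(unit_bits, word_bits=WORD_BITS):
    """Bits left over after filling a word with whole storage units."""
    return word_bits % unit_bits


def pack_word(units, unit_bits, word_bits=WORD_BITS):
    """Pack units into one word, high-order unit first.

    Assumed layout: units are left-justified, so any bits that do
    not fit a whole unit sit at the low-order end and are zero.
    """
    per_word = word_bits // unit_bits
    if len(units) > per_word:
        raise ValueError("too many units for one word")
    word = 0
    for u in units:
        if not 0 <= u < (1 << unit_bits):
            raise ValueError("unit value out of range")
        word = (word << unit_bits) | u
    # Left-justify: shift past the bits not occupied by the units.
    return word << (word_bits - len(units) * unit_bits)


print(unused_bits(8))  # 4: four octets occupy 32 of the 36 bits
print(unused_bits(9))  # 0: four nonets fill the word exactly
```

Under the same assumed layout, pack_word([0x1FF] * 4, 9) sets all 36
bits, while pack_word([0xFF] * 4, 8) leaves the low-order 4 bits
zero.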
Although these platforms support IETF standards, many of them still
utilize a text representation based upon the septet, which is only
suitable for [US-ASCII] (although it has been used for various
ISO 646 national variants).
To maximize international and multi-lingual interoperability, the IAB
has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
coded character set.
Although other transformation formats of [UNICODE] exist, and
conceivably can be used on nonet-oriented machines (most notably
[UTF-8]), they suffer significant disadvantages: