ISO/IEC 10646-1:1993/Amd 2:1996
Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane — Amendment 2: UCS Transformation Format 8 (UTF-8)
Technologies de l'information — Jeu universel de caractères codés à plusieurs octets — Partie 1: Architecture et table multilingue — Amendement 2: Format de transformation UCS 8 (UTF-8)
Standards Content (Sample)
INTERNATIONAL STANDARD
ISO/IEC 10646-1
First edition 1993-05-01
AMENDMENT 2
1996-10-15
Information technology - Universal Multiple-Octet Coded Character Set (UCS) -
Part 1: Architecture and Basic Multilingual Plane
AMENDMENT 2: UCS Transformation Format 8 (UTF-8)
Technologies de l'information - Jeu universel de caractères codés à plusieurs octets -
Partie 1: Architecture et table multilingue
AMENDEMENT 2: Format de transformation UCS 8 (UTF-8)
Reference number
ISO/IEC 10646-1:1993/Amd.2:1996(E)
Contents
Foreword
Introduction
2 Conformance
5 General structure of the UCS
Annexes
F The use of “signatures” to identify UCS
M External references to character repertoires
R UCS Transformation Format 8 (UTF-8)
R.1 Features of UTF-8
R.2 Specification of UTF-8
R.3 Notation
R.4 Mapping from UCS-4 form to UTF-8 form
R.5 Mapping from UTF-8 form to UCS-4 form
R.6 Identification of UTF-8
R.7 Incorrect sequences of octets: Interpretation by receiving devices
© ISO/IEC 1996
All rights reserved. Unless otherwise specified no part of this publication may be
reproduced or utilized in any form or by any means, electronic or mechanical, including
photocopying and microfilm, without permission in writing from the publisher.
ISO/IEC Copyright Office • Case postale 56 • CH-1211 Genève 20 • Switzerland
Printed in Switzerland
Foreword
ISO (the International Organization for Standardization) and IEC (the
International Electrotechnical Commission) form the specialized system for
worldwide standardization. National bodies that are members of ISO or IEC
participate in the development of International Standards through technical
committees established by the respective organization to deal with particular
fields of technical activity. ISO and IEC technical committees collaborate in
fields of mutual interest. Other international organizations, governmental and
non-governmental, in liaison with ISO and IEC, also take part in the work.
In the field of information technology, ISO and IEC have established a joint
technical committee, ISO/IEC JTC 1. Draft International Standards adopted
by the joint technical committee are circulated to national bodies for voting.
Publication as an International Standard requires approval by at least 75 %
of the national bodies casting a vote.
Amendment 2 to International Standard ISO/IEC 10646-1:1993 was prepared
by Joint Technical Committee ISO/IEC JTC 1, Information technology.
Introduction
ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character
Set (UCS). It is applicable to the representation, transmission,
interchange, processing, storage, input and presentation of the written form of
the languages (scripts) of the world as well as additional symbols.
This amendment to ISO/IEC 10646 specifies an additional transformation
format, UTF-8. In UTF-8 all the characters of the UCS have a coded
representation which is suitable for use in communications and other
environments where some octet values of the code are assumed to have
a fixed definition according to ISO/IEC 4873.
Information technology - Universal Multiple-Octet
Coded Character Set (UCS) -
Part 1:
Architecture and Basic Multilingual Plane
AMENDMENT 2: UCS Transformation Format 8 (UTF-8)
2 Conformance
Clause 2 as amended by Amendment 1 applies with 2.2a) amended as follows.
Replace:
or Annex Q,
with:
or Annex Q or Annex R,

5 General structure of the UCS
Clause 5 applies with the following new paragraph added at the end of the
clause:
A UCS Transformation Format (UTF-8) is specified in Annex R which can be used
to transmit text data through communication systems which are sensitive to
octet values for control characters coded according to the 8-bit structure of
ISO/IEC 2022, and to ISO/IEC 4873. UTF-8 also avoids the use of octet values
according to ISO/IEC 4873 which have special significance during the parsing
of file-name character strings in widely-used file-handling systems.

Annex F - The use of “signatures” to identify UCS
Annex F applies with the text amended as follows.
After:
UCS-4 signature: 0000 FEFF
insert:
UTF-8 signature: EF BB BF

Annex M - External references to character repertoires
Annex M as amended by Amendment 1 applies with M.3 amended as follows.
In the third paragraph replace:
- UTF16-form (5).
with:
- UTF16-form (5), or
- UTF8-form (8).
Replace:
- “ISO 10646 part-1 utf-16”.
with:
- “ISO 10646 part-1 utf-16”
- “ISO 10646 part-1 utf-8”.
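
The UTF-8 signature that the Annex F amendment adds (EF BB BF) is the UTF-8 coded representation of the same value, FEFF, that the UCS-4 signature carries. The following is a minimal Python sketch of how a receiver might test for that signature and skip it. The names `UTF8_SIGNATURE`, `has_utf8_signature` and `strip_utf8_signature` are hypothetical and chosen for illustration only; they are not defined by the standard.

```python
# The three signature octets given in the amended Annex F.
UTF8_SIGNATURE = bytes([0xEF, 0xBB, 0xBF])


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the octet string begins with the UTF-8 signature."""
    return data.startswith(UTF8_SIGNATURE)


def strip_utf8_signature(data: bytes) -> bytes:
    """Return the octet string without a leading UTF-8 signature, if present."""
    if has_utf8_signature(data):
        return data[len(UTF8_SIGNATURE):]
    return data


# Sanity check with Python's built-in codec: the value FEFF encodes to EF BB BF.
assert "\ufeff".encode("utf-8") == UTF8_SIGNATURE
```

Whether a receiver should strip the signature or pass it through depends on the application; this sketch only shows the octet-level test.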
Add the following new annex:

Annex R
(normative)
UCS Transformation Format 8 (UTF-8)

UTF-8 is an alternative coded representation form for all of the characters
of the UCS. It can be used to transmit text data through communication
systems which assume that individual octets in the range 00 to 7F have a
definition according to ISO/IEC 4873, including a C0 set of control functions
according to the 8-bit structure of ISO/IEC 2022. UTF-8 also avoids the use
of octet values in this range which have special significance during the
parsing of file-name character strings in widely-used file-handling systems.

The number of octets in the UTF-8 coded representation of the characters of
the UCS ranges from one to six; the value of the first octet indicates the
number of octets in that coded representation.

R.1 Features of UTF-8

• UCS characters from the BASIC LATIN collection are represented in UTF-8 in
  accordance with ISO/IEC 4873, i.e. single octets with values ranging from
  20 to 7E.
• Control functions in positions 0000 0000 to 0000 001F, and the DELETE
  character in position 0000 007F, are represented without the padding octets
  specified in clause 16, i.e. as single octets with values ranging from 00
  to 1F, and 7F respectively in accordance with ISO/IEC 4873 and with the
  8-bit structure of ISO/IEC 2022.
• Octet values 00 to 7F do not otherwise occur in the UTF-8 coded
  representation of any character. This provides compatibility with existing
  file-handling systems and communications sub-systems which parse
  CC-data-elements for these octet values.
• The first octet in the UTF-8 coded representation of any character can be
  d...

R.2 Specification of UTF-8

In the UTF-8 coded representation form each character from this
International Standard shall have a coded representation that comprises a
sequence of octets of length 1, 2, 3, 4, 5, or 6 octets.

For all sequences of one octet the most significant bit shall be a ZERO bit.

For all sequences of more than one octet, the number of ONE bits in the first
octet, starting from the most significant bit position, shall indicate the
number of octets in the sequence. The next most significant bit shall be a
ZERO bit.

NOTE 1 - For example, the first octet of a 2-octet sequence has bits 110 in
the most significant positions, and the first octet of a 6-octet sequence has
bits 1111110 in the most significant positions.

All of the octets, other than the first in a sequence, are known as
continuing octets. The two most significant bits of a continuing octet shall
be a ONE bit followed by a ZERO bit.

The remaining bit positions in the octets of the sequence shall be “free bit
positions” that are used to distinguish between the characters of this
International Standard. These free bit positions shall be used, in order of
increasing significance, for the bits of the UCS-4 coded representation of
the character, starting from its least significant bit. Some of the
high-order ZERO bits of the UCS-4 representation shall be omitted, as
specified below.

Table 1 below shows the format of the octets of a coded character according
to UTF-8. Each free bit position available for distinguishing between the
characters is indicated by an x. Each entry in the column “Maximum UCS-4
value” indicates the upper end of the range of coded representations from
UCS-4 that may be represented in a UTF-8
...
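
The bit-level rules of R.2 can be illustrated with a short Python sketch. It is a reading of the prose rules only, not the normative mapping of R.4 and R.5, which this sample does not reproduce. Because Table 1 is cut off here, the sketch derives each length limit from the number of free bit positions (7 - n in the first octet of an n-octet sequence, 6 in each continuing octet) and, as an assumption, picks the shortest length that can hold the value. The function names `utf8_encode_ucs4` and `utf8_sequence_length` are hypothetical.

```python
def utf8_encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value (0 to 7FFFFFFF) as 1 to 6 octets per R.2."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    if value <= 0x7F:
        # One octet: the most significant bit is ZERO, seven free bits.
        return bytes([value])
    # Assumption: use the shortest length n whose free bit positions
    # can hold the value (7 - n bits in the first octet, 6 per continuing octet).
    n = 2
    while value >= 1 << ((7 - n) + 6 * (n - 1)):
        n += 1
    octets = []
    for _ in range(n - 1):
        # Continuing octet: bits 10, then six free bits, filled starting
        # from the least significant bits of the UCS-4 value.
        octets.append(0x80 | (value & 0x3F))
        value >>= 6
    # First octet: n ONE bits, a ZERO bit, then the remaining free bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    octets.append(lead_prefix | value)
    return bytes(reversed(octets))


def utf8_sequence_length(first_octet: int) -> int:
    """Read the sequence length from the leading ONE bits of a first octet."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")
    if first_octet & 0x80 == 0:
        return 1  # most significant bit is ZERO: single-octet sequence
    count, mask = 0, 0x80
    while first_octet & mask:
        count += 1
        mask >>= 1
    if count == 1:
        raise ValueError("continuing octet (10xxxxxx), not a first octet")
    if count > 6:
        raise ValueError("no sequence length above 6 is defined")
    return count
```

For example, `utf8_encode_ucs4(0xFEFF)` yields the octets EF BB BF, which matches the UTF-8 signature added to Annex F, and `utf8_sequence_length(0xEF)` reports a 3-octet sequence. Every octet produced for a multi-octet sequence has its most significant bit set, which is consistent with the R.1 statement that octet values 00 to 7F do not otherwise occur.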