ISO/IEC 10646:2003
Information technology - Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 10646:2003 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages of the world as well as additional symbols.

ISO/IEC 10646:2003:
- specifies the architecture of ISO/IEC 10646:2003;
- defines terms used in ISO/IEC 10646:2003;
- describes the general structure of the coded character set;
- specifies the Basic Multilingual Plane (BMP) of the UCS;
- specifies supplementary planes of the UCS: the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP) and the Supplementary Special-purpose Plane (SSP);
- defines a set of graphic characters used in scripts and the written form of languages on a world-wide scale;
- specifies the names for the graphic characters of the BMP, SMP, SIP, SSP and their coded representations;
- specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
- specifies a two-octet (16-bit) BMP form of the UCS: UCS-2;
- specifies a multiple-octet (one to four octets) transformation format, UTF-8, for use with ISO 646 (ASCII) byte-oriented environments;
- specifies a form using two 16-bit units, and the associated transformation format UTF-16, for supplementary characters;
- specifies collection identifiers for selected subsets of characters;
- specifies the coded representations for control functions;
- specifies the management of future additions to this coded character set;
- incorporates the Unicode bi-directional algorithm and normalization forms by reference.

The UCS is a coding system different from that specified in ISO/IEC 2022. A graphic character will be assigned only one code position in ISO/IEC 10646:2003, located either in the BMP or in one of the supplementary planes.

NOTE - The Unicode Standard, Version 4.0 includes a set of characters, names, and coded representations that are identical with those in ISO/IEC 10646:2003. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers. Version 4.0 strengthens Unicode support for worldwide communication, software availability, and publishing.

By defining a consistent way of encoding multilingual text, ISO/IEC 10646:2003 enables the exchange of data internationally. The information technology industry gains data stability, greater global interoperability and data interchange. ISO/IEC 10646:2003 has been widely adopted in new Internet and W3C protocols and markup languages such as XML and HTML, and implemented in modern operating systems and computer programming languages. This edition covers over 96 000 characters from the world's scripts.
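The encoding forms named in the abstract differ mainly in how many octets a single character occupies. The following is a minimal Python sketch of that difference, not part of the standard: it assumes Python's built-in `utf-32-be`, `utf-16-be` and `utf-8` codecs are acceptable stand-ins for UCS-4, UTF-16 and UTF-8, and the helper name `encoded_lengths` is hypothetical.

```python
# Illustrative only: Python's built-in codecs stand in for the encoding forms
# named in the abstract ("utf-32-be" is assumed here to model four-octet UCS-4).

def encoded_lengths(ch: str) -> dict:
    """Return how many octets one character occupies in each form."""
    return {
        "UCS-4": len(ch.encode("utf-32-be")),   # fixed width: four octets
        "UTF-16": len(ch.encode("utf-16-be")),  # two octets in the BMP, four outside it
        "UTF-8": len(ch.encode("utf-8")),       # one to four octets
    }

# One ASCII letter, one Latin-1 letter, one CJK ideograph (all in the BMP),
# and one musical symbol from a supplementary plane.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the last character lies outside the BMP, which is why it is the only one that needs four octets in UTF-16.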
Technologies de l'information — Jeu universel de caractères codés sur plusieurs octets (JUC)
ISO/IEC 10646:2003 standardizes the Universal Multiple-Octet Coded Character Set (UCS; JUC in French). It applies to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages of the world and of additional symbols. ISO/IEC 10646:2003 describes the architecture of ISO/IEC 10646:2003; defines the terms used in ISO/IEC 10646:2003; describes the general structure of the coded character set; describes the Basic Multilingual Plane (BMP) of the UCS; describes the supplementary planes of the UCS: the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP) and the Supplementary Special-purpose Plane (SSP); defines a set of graphic characters used in the written form of languages on a world-wide scale; names the graphic characters of the BMP, SMP, SIP and SSP and establishes their coded representations; prescribes the four-octet (32-bit) canonical form of the UCS: UCS-4; specifies a two-octet (16-bit) BMP form of the UCS: UCS-2; establishes the coded representation of control functions; and establishes the management of future additions to this coded character set. The UCS is a coding system different from that described in ISO/IEC 2022. A given graphic character will be assigned only one code position in ISO/IEC 10646:2003, located either in the BMP or in one of the supplementary planes. NOTE - Version 4.0 of Unicode defines a set of characters, names and coded representations identical to that of ISO/IEC 10646:2003. It additionally provides information on the properties of these characters and on processing algorithms, as well as definitions useful to developers. By defining a consistent way of encoding multilingual text, ISO/IEC 10646:2003 enables the international exchange of data.
The information technology industry gains data stability and better global interoperability. ISO/IEC 10646:2003 has been adopted by new Internet protocols and implemented in operating systems and programming languages. This edition contains more than 95 000 characters from the world's scripts.
Informacijska tehnologija – Univerzalni večoktetni nabor znakov (UCS)
Frequently Asked Questions
ISO/IEC 10646:2003 is a standard published by the International Organization for Standardization (ISO). Its full title is "Information technology - Universal Multiple-Octet Coded Character Set (UCS)". It specifies the Universal Multiple-Octet Coded Character Set (UCS), which is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages of the world as well as additional symbols. This edition covers over 96 000 characters from the world's scripts.
ISO/IEC 10646:2003 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.10 - Coding of character sets. The ICS classification helps identify the subject area and facilitates finding related standards.
ISO/IEC 10646:2003 is linked to the following standards: ISO 14644-16:2019, ISO/IEC 10646:2003/Amd 1:2005, ISO/IEC 10646:2003/Amd 2:2006, ISO/IEC 10646:2003/Amd 4:2008, ISO/IEC 10646:2003/Amd 5:2008, ISO/IEC 10646:2003/Amd 6:2009, ISO/IEC 10646:2003/Amd 7:2010, ISO/IEC 10646:2011, ISO/IEC 10646-1:2000 and ISO/IEC 10646-1:2000/Amd 1:2002. A second relationship is listed with ISO/IEC 10646:2003/Amd 1:2005, ISO/IEC 10646:2003/Amd 2:2006, ISO/IEC 10646:2003/Amd 4:2008, ISO/IEC 10646:2003/Amd 5:2008 and ISO/IEC 10646:2003/Amd 7:2010. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
INTERNATIONAL STANDARD ISO/IEC 10646
First edition
2003-12-15
Information technology — Universal
Multiple-Octet Coded Character Set (UCS)
Technologies de l'information — Jeu universel de caractères codés sur
plusieurs octets (JUC)
Reference number
© ISO/IEC 2003
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
This CD-ROM contains
1) the publication ISO/IEC 10646:2003 in portable document format (PDF), which can be viewed using
Adobe® Acrobat® Reader;
2) text files containing lists of
i) source references for CJK ideographs,
ii) Hangul syllables and mapping information,
iii) alphabetically sorted character names.
Adobe and Acrobat are trademarks of Adobe Systems Incorporated.
This first edition cancels and replaces ISO/IEC 10646-1:2000 and ISO/IEC 10
...
SLOVENIAN STANDARD
1 November 2008
Informacijska tehnologija – Univerzalni večoktetni nabor znakov (UCS)
Information technology -- Universal Multiple-Octet Coded Character Set (UCS)
Technologies de l'information -- Jeu universel de caractères codés sur plusieurs octets (JUC)
This Slovenian standard is identical to: ISO/IEC 10646:2003
ICS: 35.040 Character sets and information coding (Nabori znakov in kodiranje informacij)
2003-01. Slovenian Institute for Standardization (Slovenski inštitut za standardizacijo). Reproduction of this standard, in whole or in part, is not permitted.
© ISO/IEC 2003
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
1 Scope
2 Conformance
3 Normative references
4 Terms and definitions
5 General structure of the UCS
6 Basic structure and nomenclature
7 General requirements for the UCS
8 The Basic Multilingual Plane
9 Supplementary planes
10 Private use groups, planes, and zones
11 Revision and updating of the UCS
12 Subsets
13 Coded representation forms of the UCS
14 Implementation levels
15 Use of control functions with the UCS
16 Declaration of identification of features
17 Structure of the code tables and lists
18 Block names
19 Characters in bi-directional context
20 Special characters
21 Presentation forms of characters
22 Compatibility characters
23 Order of characters
24 Normalization forms
25 Combining characters
26 Special features of individual scripts
27 Source references for CJK Ideographs
28 Character names and annotations
29 Structure of the Basic Multilingual Plane
30 Structure of the Supplementary Multilingual Plane for Scripts and symbols
31 Structure of the Supplementary Ideographic Plane
32 Supplementary Special-purpose Plane
33 Code tables and lists of character names
NOTE The code tables and lists of character names are given on pages 29-1348. They are contained in separate files which are accessed by clicking on the appropriate highlighted text in Clause 33.
Annexes
A (normative) Collections of graphic characters for subsets
B (normative) List of combining characters
C (normative) Transformation format for 16 planes of Group 00 (UTF-16)
D (normative) UCS Transformation Format 8 (UTF-8)
E (informative) Mirrored characters in Arabic bi-directional context
F (informative) Alternate format characters
G (informative) Alphabetically sorted list of character names
H (informative) The use of “signatures” to identify UCS
J (informative) Recommendation for combined receiving/originating devices with internal storage
K (informative) Notations of octet value representations
L (informative) Character naming guidelines
M (informative) Sources of characters
N (informative) External references to character repertoires
P (informative) Additional information on characters
Q (informative) Code mapping table for Hangul syllables
R (informative) Names of Hangul syllables
S (informative) Procedure for the unification and arrangement of CJK Ideographs
T (informative) Language tagging using Tag Characters
U (informative) Usage of musical symbols
iv © ISO/IEC 2003 – All rights reserved
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 10646 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 2, Coded character sets.
This first edition of ISO/IEC 10646 cancels and replaces ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001. It
also incorporates ISO/IEC 10646-1:2000/Amd.1:2002.
Introduction
ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is
applicable to the representation, transmission, interchange, processing, storage, input
and presentation of the written form of the languages of the world as well as additional
symbols.
By defining a consistent way of encoding multilingual text it enables the exchange of
data internationally. The information technology industry gains data stability, greater
global interoperability and data interchange. ISO/IEC 10646 has been widely adopted in
new Internet protocols and implemented in modern operating systems and computer
languages. This edition covers over 95 000 characters from the world’s scripts.
ISO/IEC 10646 contains material which may only be available to users who obtain their
copy in a machine readable format. That material consists of the following printable files:
- CJKU_SR.txt
- CJKC_SR.txt
- Allnames.txt
- HangulX.txt
- HangulSy.txt
INTERNATIONAL STANDARD ISO/IEC 10646: 2003 (E)
Information technology — Universal Multiple-Octet
Coded Character Set (UCS)
1 Scope

ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input, and presentation of the written form of the languages of the world as well as of additional symbols.

This document:
- specifies the architecture of ISO/IEC 10646;
- defines terms used in ISO/IEC 10646;
- describes the general structure of the coded character set;
- specifies the Basic Multilingual Plane (BMP) of the UCS;
- specifies supplementary planes of the UCS: the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP) and the Supplementary Special-purpose Plane (SSP);
- defines a set of graphic characters used in scripts and the written form of languages on a world-wide scale;
- specifies the names for the graphic characters of the BMP, SMP, SIP, SSP and their coded representations;
- specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
- specifies a two-octet (16-bit) BMP form of the UCS: UCS-2;
- specifies the coded representations for control functions;
- specifies the management of future additions to this coded character set.

The UCS is a coding system different from that specified in ISO/IEC 2022. The method to designate UCS from ISO/IEC 2022 is specified in clause 16.2.

A graphic character will be assigned only one code position in the standard, located either in the BMP or in one of the supplementary planes.

NOTE – The Unicode Standard, Version 4.0 includes a set of characters, names, and coded representations that are identical with those in this International Standard. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers.

2 Conformance

2.1 General
Whenever private use characters are used as specified in ISO/IEC 10646, the characters themselves shall not be covered by these conformance requirements.

2.2 Conformance of information interchange
A coded-character-data-element (CC-data-element) within coded information for interchange is in conformance with ISO/IEC 10646 if
a) all the coded representations of graphic characters within that CC-data-element conform to clauses 6 and 7, to an identified form chosen from clause 13 or annex C or annex D, and to an identified implementation level chosen from clause 14;
b) all the graphic characters represented within that CC-data-element are taken from those within an identified subset (see clause 12);
c) all the coded representations of control functions within that CC-data-element conform to clause 15.
A claim of conformance shall identify the adopted form, the adopted implementation level and the adopted subset by means of a list of collections and/or characters.

2.3 Conformance of devices
A device is in conformance with ISO/IEC 10646 if it conforms to the requirements of item a) below, and either or both of items b) and c).
NOTE – The term device is defined (in 4.18) as a component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. A device may be a conventional input/output device, or a process such as an application program or gateway function.
A claim of conformance shall identify the document that contains the description specified in a) below, and shall identify the adopted form(s), the adopted implementation level, the adopted subset (by means of a list of collections and/or characters), and the selection of control functions adopted in accordance with clause 15.
a) Device description: A device that conforms to ISO/IEC 10646 shall be the subject of a description that identifies the means by which the user may supply characters to the device and/or may recognize them when they are made available to the user, as specified, respectively, in sub-clauses b) and c) below.
b) Originating device: An originating device shall allow its user to supply any characters from an adopted subset, and be capable of transmitting their coded representations within a CC-data-element in accordance with the adopted form and implementation level.
c) Receiving device: A receiving device shall be capable of receiving and interpreting any coded representation of characters that are within a CC-data-element in accordance with the adopted form and implementation level, and shall make any corresponding characters from the adopted subset available to the user in such a way that the user can identify them.
Any corresponding characters that are not within the adopted subset shall be indicated to the user. The way used for indicating them need not distinguish them from each other.
NOTE 1 – An indication to the user may consist of making available the same character to represent all characters not in the adopted subset, or providing a distinctive audible or visible signal when appropriate to the type of user.
NOTE 2 – See also annex J for receiving devices with retransmission capability.

3 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO/IEC 2022:1994, Information technology - Character code structure and extension techniques.
ISO/IEC 6429:1992, Information technology - Control functions for coded character sets.
Unicode Standard Annex, UAX#9, The Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17.
Unicode Standard Annex, UAX#15, Unicode Normalization Forms, Version 4.0.0, 2003-04-17.

4 Terms and definitions

For the purposes of this document, the following terms and definitions apply.

4.1 Basic Multilingual Plane (BMP)
Plane 00 of Group 00.

4.2 Block
A contiguous range of code positions to which a set of characters that share common characteristics, such as a script, are allocated. A block does not overlap another block. One or more of the code positions within a block may have no character allocated to them.

4.3 Canonical form
The form with which characters of this coded character set are specified using four octets to represent each character.

4.4 CC-data-element (coded-character-data-element)
An element of interchanged information that is specified to consist of a sequence of coded representations of characters, in accordance with one or more identified standards for coded character sets.

4.5 Cell
The place within a row at which an individual character may be allocated.

4.6 Character
A member of a set of elements used for the organization, control, or representation of data.

4.7 Character boundary
Within a stream of octets the demarcation between the last octet of the coded representation of a character and the first octet of that of the next coded character.

4.8 Coded character
A character together with its coded representation.

4.9 Coded character set
A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation.

4.10 Code table
A table showing the characters allocated to the octets in a code.

4.11 Collection
A set of coded characters which is numbered and named and which consists of those coded characters whose code positions lie within one or more identified ranges.
NOTE – If any of the identified ranges include code positions to which no character is allocated, the repertoire of the collection will change if an additional character is assigned to any of those positions at a future amendment of this International Standard. However it is intended that the collection number and name will remain unchanged in future editions of this International Standard.
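Definition 4.3 fixes four octets per character in the canonical form, 4.1 places the BMP at Plane 00 of Group 00, and 4.7 defines a character boundary within an octet stream. The following minimal Python sketch illustrates how these fit together. It assumes the four octets are read most significant first as group, plane, row and cell; the function names are hypothetical and not taken from the standard.

```python
def canonical_octets(code_position: int) -> bytes:
    """Four-octet canonical form, most significant octet first (assumed order)."""
    return code_position.to_bytes(4, "big")

def split_gprc(code_position: int) -> dict:
    """Label the four canonical octets as group, plane, row and cell."""
    group, plane, row, cell = canonical_octets(code_position)
    return {"group": group, "plane": plane, "row": row, "cell": cell}

def split_stream(octets: bytes) -> list:
    """Recover code positions from a canonical-form octet stream.

    Because every character uses exactly four octets, character
    boundaries fall after every fourth octet.
    """
    if len(octets) % 4 != 0:
        raise ValueError("stream does not end on a character boundary")
    return [int.from_bytes(octets[i:i + 4], "big")
            for i in range(0, len(octets), 4)]

print(split_gprc(0x4E2D))    # a BMP character: group 0, plane 0
print(split_gprc(0x1D11E))   # a character in plane 1
print(split_stream("A\u4e2d".encode("utf-32-be")))
```

Python's `utf-32-be` codec is used in the last line only as a convenient source of four-octet, big-endian test data.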
4.12 Combining character 4.22 Group
A member of an identified subset of the coded charac- A subdivision of the coding space of this coded char-
ter set of ISO/IEC 10646 intended for combination with acter set; of 256 x 256 x 256 cells.
the preceding non-combining graphic character, or
4.23 High-half zone
with a sequence of combining characters preceded by
A set of cells reserved for use in UTF-16 (see annex
a non-combining character (see also 4.14).
C); an RC-element corresponding to any of these cells
NOTE – ISO/IEC 10646 specifies several subset collections
may be used in UTF-16 as the first of a pair of RC-
which include combining characters.
elements which represents a character from a plane
other than the BMP.
4.13 Compatibility character
A graphic character included as a coded character of ISO/IEC 10646 primarily for compatibility with existing coded character sets.
4.14 Composite sequence
A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters (see also 4.12).
NOTE 1 – A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence.
NOTE 2 – A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646.
4.15 Control function
An action that affects the recording, processing, transmission, or interpretation of data, and that has a coded representation consisting of one or more octets.
4.16 Default state
The state that is assumed when no state has been explicitly specified.
4.17 Detailed code table
A code table showing the individual characters, and normally showing a partial row.
4.18 Device
A component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. (It may be an input/output device in the conventional sense, or a process such as an application program or gateway function.)
4.19 Fixed collection
A collection in which every code position within the identified range(s) has a character allocated to it, and which is intended to remain unchanged in future editions of this International Standard.
4.20 Graphic character
A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.
4.21 Graphic symbol
The visual representation of a graphic character or of a composite sequence.
4.24 Interchange
The transfer of character coded data from one user to another, using telecommunication means or interchangeable media.
4.25 Interworking
The process of permitting two or more systems, each employing different coded character sets, meaningfully to interchange character coded data; conversion between the two codes may be involved.
4.26 ISO/IEC 10646-1
A former subdivision of the standard. It is also referred to as Part 1 of ISO/IEC 10646 and contained the specification of the overall architecture and the Basic Multilingual Plane (BMP). There are a First and a Second Edition of ISO/IEC 10646-1.
4.27 ISO/IEC 10646-2
A former subdivision of the standard. It is also referred to as Part 2 of ISO/IEC 10646 and contained the specification of the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP) and the Supplementary Special-purpose Plane (SSP). There is only a First Edition of ISO/IEC 10646-2.
4.28 Low-half zone
A set of cells reserved for use in UTF-16 (see annex C); an RC-element corresponding to any of these cells may be used in UTF-16 as the second of a pair of RC-elements which represents a character from a plane other than the BMP.
4.29 Octet
An ordered sequence of eight bits considered as a unit.
4.30 Plane
A subdivision of a group; of 256 x 256 cells.
4.31 Presentation; to present
The process of writing, printing, or displaying a graphic symbol.
4.32 Presentation form
In the presentation of some scripts, a form of a graphic symbol representing a character that depends on the position of the character relative to other characters.
4.33 Private use plane
A plane within this coded character set; the contents of which is not specified in ISO/IEC 10646 (see clause 10).
4.34 RC-element
A two-octet sequence comprising the R-octet and the C-octet (see clause 6.2) from the four octet sequence (in the canonical form) that corresponds to a cell in the coding space of this coded character set.
4.35 Repertoire
A specified set of characters that are represented in a coded character set.
4.36 Row
A subdivision of a plane; of 256 cells.
4.37 Script
A set of graphic characters used for the written form of one or more languages.
4.38 Supplementary plane
A plane other than Plane 00 of Group 00; a plane that accommodates characters which have not been allocated to the Basic Multilingual Plane.
4.39 Supplementary Multilingual Plane for scripts and symbols (SMP)
Plane 01 of Group 00.
4.40 Supplementary Ideographic Plane (SIP)
Plane 02 of Group 00.
4.41 Supplementary Special-purpose Plane (SSP)
Plane 0E of Group 00.
4.42 Unpaired RC-element
An RC-element in a CC-data element that is either:
• an RC-element from the high-half zone that is not immediately followed by an RC-element from the low-half zone, or
• an RC-element from the low-half zone that is not immediately preceded by an RC-element from the high-half zone.
4.43 User
A person or other entity that invokes the service provided by a device. (This entity may be a process such as an application program if the “device” is a code converter or a gateway function, for example.)
4.44 Zone
A sequence of cells of a code table, comprising one or more rows, either in whole or in part, containing characters of a particular class (for example see clause 8).
5 General structure of the UCS
The general structure of the Universal Multiple-Octet Coded Character Set (referred to hereafter as “this coded character set”) is described in this explanatory clause, and is illustrated in figures 1 and 2. The normative specification of the structure is given in the following clauses.
The value of any octet is expressed in hexadecimal notation from 00 to FF in ISO/IEC 10646 (see annex K).
The canonical form of this coded character set – the way in which it is to be conceived – uses a four-dimensional coding space, regarded as a single entity, consisting of 128 three-dimensional groups.
NOTE 1 – Thus, bit 8 of the most significant octet in the canonical form of a coded character can be used for internal processing purposes within a device as long as it is set to zero within a conforming CC-data-element.
Each group consists of 256 two-dimensional planes. Each plane consists of 256 one-dimensional rows, each row containing 256 cells. A character is located and coded at a cell within this coding space or the cell is declared unused.
In the canonical form, four octets are used to represent each character, and they specify the group, plane, row and cell, respectively. The canonical form consists of four octets since two octets are not sufficient to cover all the characters in the world, and a 32-bit representation follows modern processor architectures.
The four-octet canonical form can be used as a four-octet coded character set, in which case it is called UCS-4.
NOTE 2 – The use of the term “canonical” for this form does not imply any restriction or preference for this form over transformation formats that a conforming implementation may choose for the representation of UCS characters.
ISO/IEC 10646 defines graphic characters and their coded representation for the following planes:
• The Basic Multilingual Plane (BMP, Plane 00 of Group 00). The Basic Multilingual Plane can be used as a two-octet coded character set identified as UCS-2.
• The Supplementary Multilingual Plane for scripts and symbols (SMP, Plane 01 of Group 00).
• The Supplementary Ideographic Plane (SIP, Plane 02 of Group 00).
• The Supplementary Special-purpose Plane (SSP, Plane 0E of Group 00).
Additional supplementary planes may be defined in the future to accommodate additional graphic characters.
The planes that are reserved for private use are specified in clause 10. The contents of the cells in private use planes and zones are not specified in ISO/IEC 10646.
Each character is located within the coded character set in terms of its Group-octet, Plane-octet, Row-octet, and Cell-octet.
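As an informal illustration of the canonical form described in clause 5, the following Python sketch splits a four-octet code position into its group, plane, row and cell octets and joins them again. The names `CanonicalForm`, `split_canonical` and `join_canonical` are hypothetical and are not defined by ISO/IEC 10646; the upper bound 7FFFFFFF is assumed from the statement that there are 128 groups.

```python
from typing import NamedTuple


class CanonicalForm(NamedTuple):
    """The four octets of the canonical form, most significant first."""
    group: int
    plane: int
    row: int
    cell: int


def split_canonical(code_position: int) -> CanonicalForm:
    """Split a code position into its G-, P-, R- and C-octets."""
    # 128 groups means the group octet is limited to 00..7F (assumption
    # drawn from clause 5), so bit 8 of the top octet must be zero.
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("code position outside the 128-group coding space")
    return CanonicalForm(
        group=(code_position >> 24) & 0xFF,
        plane=(code_position >> 16) & 0xFF,
        row=(code_position >> 8) & 0xFF,
        cell=code_position & 0xFF,
    )


def join_canonical(form: CanonicalForm) -> int:
    """Rebuild the code position from its four octets."""
    return (form.group << 24) | (form.plane << 16) | (form.row << 8) | form.cell
```

For example, `split_canonical(0x0001D11E)` yields group 00, plane 01, row D1 and cell 1E, which places the code position in Plane 01 of Group 00 (the SMP).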
Subsets of the coding space may be used in order to give a sub-repertoire of graphic characters.
A UCS Transformation Format (UTF-16) is specified in annex C which can be used to represent characters from 16 supplementary planes of Group 00 (Planes 01 to 10), in addition to the BMP (Plane 00), in a form that is compatible with the two-octet BMP form.
Another UCS Transformation Format (UTF-8) is specified in annex D which can be used to transmit text data through communication systems which are sensitive to octet values for control characters coded according to the 8-bit structure of ISO/IEC 2022, and to ISO/IEC 4873. UTF-8 also avoids the use of octet values according to ISO/IEC 4873 that have special significance during the parsing of file-name character strings in widely-used file-handling systems.
6 Basic structure and nomenclature
6.1 Structure
The Universal Multiple-Octet Coded Character Set as specified in ISO/IEC 10646 shall be regarded as a single entity.
This entire coded character set shall be conceived of as comprising 128 groups of 256 planes. Each plane shall be regarded as containing 256 rows of characters, each row containing 256 cells. In a code table representing the contents of a plane (such as in figure 2), the horizontal axis shall represent the least significant octet, with its smaller value to the left; and the vertical axis shall represent the more significant octet, with its smaller value at the top.
Each axis of the coding space shall be coded by one octet. Within each octet the most significant bit shall be bit 8 and the least significant bit shall be bit 1. Accordingly, the weight allocated to each bit shall be:
bit 8  bit 7  bit 6  bit 5  bit 4  bit 3  bit 2  bit 1
128    64     32     16     8      4      2      1
[Figure 1: the coding space drawn as a stack of groups, Group 00 to Group 7F, each holding Planes 00 to FF; each plane has 256 x 256 cells.]
NOTE – To ensure continued interoperability between the UTF-16 form and other coded representations of the UCS, it is intended that no characters will be allocated to code positions in Planes 11 to FF in Group 00 or any planes in any other groups.
Figure 1 - Entire coding space of the Universal Multiple-Octet Coded Character Set
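The note on Planes 11 to FF follows from how UTF-16 reaches the supplementary planes. Annex C is not reproduced in this excerpt, so the sketch below relies on the widely published UTF-16 arithmetic rather than on text quoted here: a high-half RC-element in D800 to DBFF followed by a low-half RC-element in DC00 to DFFF. The function names are hypothetical.

```python
def to_utf16_units(code_position: int) -> list[int]:
    """Return the one or two 16-bit units that represent a code position."""
    if code_position < 0 or code_position > 0x10FFFF:
        # Planes 11 and above cannot be expressed as a pair of RC-elements.
        raise ValueError("code position is not reachable by UTF-16")
    if 0xD800 <= code_position <= 0xDFFF:
        raise ValueError("S-zone code positions do not represent characters")
    if code_position <= 0xFFFF:
        return [code_position]  # BMP: identical to the two-octet form
    offset = code_position - 0x10000          # 20 bits, planes 01 to 10
    high = 0xD800 + (offset >> 10)            # upper 10 bits, high-half zone
    low = 0xDC00 + (offset & 0x3FF)           # lower 10 bits, low-half zone
    return [high, low]


def from_utf16_pair(high: int, low: int) -> int:
    """Combine a high-half and a low-half unit into a code position."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high-half/low-half pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The largest pair, DBFF followed by DFFF, maps to 0010 FFFF, the last code position of Plane 10, which is why higher planes are out of reach for this form.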
[Figure 2: Group 00 drawn along three axes: Plane-octet (00, 01, ... 0F, 10, ... FF), Row-octet (00 to FF) and Cell-octet (00, 80, FF). Plane 00 is the Basic Multilingual Plane, with the S-zone in rows D8..DF and the private use zone in rows E0..F8, followed by rows F9..FF; the remaining planes are the supplementary planes, with the private use planes at 0F and 10.]
NOTE 1 – Labels “S-zone” and “Private use zone” are specified in clause 8.
NOTE 2 – To ensure continued interoperability between the UTF-16 form and other coded representations of the UCS, it is intended that no characters will be allocated to code positions in Planes 11 to FF in Group 00.
Figure 2 - Group 00 of the Universal Multiple-Octet Coded Character Set
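The two zones labelled in Figure 2 can be checked with simple range tests. The sketch below is only illustrative: it uses the boundaries that clause 8 gives for the S-zone (0000 D800 to 0000 DFFF) and the private use zone (0000 E000 to 0000 F8FF), and the function name `bmp_zone` is hypothetical.

```python
from typing import Optional

# Boundaries as stated in clause 8; Python ranges exclude the end value.
S_ZONE = range(0xD800, 0xE000)            # rows D8..DF
PRIVATE_USE_ZONE = range(0xE000, 0xF900)  # rows E0..F8


def bmp_zone(code_position: int) -> Optional[str]:
    """Name the labelled BMP zone a code position falls in, if any."""
    if not 0 <= code_position <= 0xFFFF:
        raise ValueError("not a BMP code position")
    if code_position in S_ZONE:
        return "S-zone"
    if code_position in PRIVATE_USE_ZONE:
        return "private use zone"
    return None  # outside the two zones labelled in Figure 2
```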
6.2 Coding of characters
In the canonical form of the coded character set, each character within the entire coded character set shall be represented by a sequence of four octets. The most significant octet of this sequence shall be the group-octet. The least significant octet of this sequence shall be the cell-octet. Thus this sequence may be represented as
m.s.                                          l.s.
Group-octet  Plane-octet  Row-octet  Cell-octet
where m.s. means the most significant octet, and l.s. means the least significant octet.
For brevity, the octets may be termed
m.s.                          l.s.
G-octet  P-octet  R-octet  C-octet
Where appropriate, these may be further abbreviated to G, P, R, and C.
The value of any octet shall be represented by two hexadecimal digits, for example: 31 or FE. When a single character is to be identified in terms of the values of its group, plane, row, and cell, this shall be represented such as:
0000 0030 for DIGIT ZERO
0000 0041 for LATIN CAPITAL LETTER A
When referring to characters within an identified plane, the leading four digits (for G-octet and P-octet) may be omitted. For example, within the Plane 00 (BMP), 0030 may be used to refer to DIGIT ZERO.
When referring to characters within planes 00 to 0F, the leading three digits may be omitted. For example, the five-digit value 11100 corresponds to the canonical form 0001 1100 and the corresponding coded character is part of Plane 01.
6.3 Octet order
The sequence of the octets that represent a character, and the most significant and least significant ends of it, shall be maintained as shown above. When serialized as octets, a more significant octet shall precede less significant octets. When not serialized as octets, the order of octets may be specified by agreement between sender and recipient (see clause 16.1 and annex H).
6.4 Naming of characters
ISO/IEC 10646 assigns a unique name to each character. The name of a character either:
a. denotes the customary meaning of the character, or
b. describes the shape of the corresponding graphic symbol, or
c. follows the rule given in clause 28.2 for Chinese/Japanese/Korean (CJK) unified ideographs, or
d. follows the rule given in clause 28.3 for Hangul syllables.
Guidelines to be used for constructing the names of characters in cases a. and b. are given in annex L.
6.5 Short identifiers for code positions (UIDs)
ISO/IEC 10646 defines short identifiers for each code position, including code positions that are reserved. A short identifier for any code position is distinct from a short identifier for any other code position. If a character is allocated at a code position, a short identifier for that code position can be used to refer to the character allocated at that code position.
NOTE 1 – For instance, U+DC00 identifies a code position that is permanently reserved for UTF-16, and U+FFFF identifies a code position that is permanently reserved. U+0025 identifies a code position to which a character is allocated; U+0025 also identifies that character (named PERCENT SIGN).
NOTE 2 – These short identifiers are independent of the language in which this standard is written, and are thus retained in all translations of the text.
The following alternative forms of notation of a short identifier are defined here.
a. The eight-digit form of short identifier shall consist of the sequence of eight hexadecimal digits that represents the code position of the character (see clause 6.2).
b. The four-to-six-digit form of short identifier shall consist of the last four to six digits of the eight-digit form. It is not defined if the eight-digit form is greater than 0010FFFF. Leading zeroes beyond four digits are suppressed.
c. The character “-” (HYPHEN-MINUS) may, as an option, precede the 8-digit form of short identifier.
d. The character “+” (PLUS SIGN) may, as an option, precede the four-to-six-digit form of short identifier.
e. The prefix letter “U” (LATIN CAPITAL LETTER U) may, as an option, precede any of the four forms of short identifier defined in a. to d. above.
f. For the 8 digit forms, the characters SPACE or NO-BREAK SPACE may optionally be inserted before the four last digits.
The capital letters A to F, and U that appear within short identifiers may be replaced by the corresponding small letters.
The full syntax of the notation of a short identifier, in Backus-Naur form, is:
{ U | u } [ {+}(xxxx | xxxxx | xxxxxx) | {-}xxxxxxxx ]
where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f). For example:
-hhhhhhhh +kkkk
Uhhhhhhhh U+kkkk
where hhhhhhhh indicates the eight-digit form and kkkk indicates the four-to-six-digit form.
NOTE 3 – As an example the short identifier for LATIN SMALL LETTER LONG S (see tables for Row 01 in clause 33) may be notated in any of the following forms:
0000017F -0000017F U0000017F U-0000017F
017F +017F U017F U+017F
Any of the capital letters may be replaced by the corresponding small letter.
NOTE 4 – Two special prefixed forms of notation have also been used, in which the letter T (LATIN CAPITAL LETTER T or LATIN SMALL LETTER T) replaces the letter U in the corresponding prefixed forms. The forms of notation that included the prefix letter T indicated that the short identifier refers to a character in ISO/IEC 10646-1 First Edition (before the application of any Amendments), whereas the forms of notation that include the prefix letter U always indicate that the short identifier refers to a character in ISO/IEC 10646 at the most recent state of amendment. Corresponding short identifiers of the form T-xxxxxxxx and U-xxxxxxxx refer to the same character except when xxxxxxxx lies in the range 00003400 to 00004DFF inclusive. Forms of notation that include no prefix letter always indicate a reference to the most recent state of amendment of ISO/IEC 10646, unless otherwise qualified.
6.6 UCS Sequence Identifiers
ISO/IEC 10646 defines an identifier for any sequence of code positions taken from the standard. Such an identifier is known as a UCS Sequence Identifier (USI). For a sequence of n code positions it has the following form:
<UID1, UID2, ..., UIDn>
where UID1, UID2, etc. represent the short identifiers of the corresponding code positions, in the same order as those code positions appear in the sequence. If each of the code positions in such a sequence has a character allocated to it, the USI can be used to identify the sequence of characters allocated at those code positions. The syntax for UID1, UID2, etc. is specified in clause 6.5. A COMMA character (optionally followed by a SPACE character) separates the UIDs. The UCS Sequence Identifier shall include at least two UIDs; it shall begin with a LESS-THAN SIGN and be terminated by a GREATER-THAN SIGN.
NOTE – UCS Sequences Identifiers cannot be used for specification of subset and collection content. They may be used outside this standard to identify: composite sequences for mapping purposes, font repertoire, etc.
7 General requirements for the UCS
The following requirements apply to the entire coded character set.
a. The values of P-, R-, and C-octets used for representing graphic characters shall be in the range 00 to FF. The values of G-octets used for representation of graphic characters shall be in the range 00 to 7F. On any plane, code positions FFFE and FFFF are permanently reserved.
NOTE 1 – These code positions can be used for internal processing uses requiring a numeric value that is guaranteed not to be a coded character.
NOTE 2 – A “permanently reserved” code position cannot be changed by future amendments.
b. Code positions to which a character is not allocated, except for the positions reserved for private use characters or for transformation formats, are reserved for future standardization and shall not be used for any other purpose. Future editions of ISO/IEC 10646 will not allocate any characters to code positions reserved for private use characters or for transformation formats.
c. The same graphic character shall not be allocated to more than one code position. There are graphic characters with similar shapes in the coded character set; they are used for different purposes and have different character names.
8 The Basic Multilingual Plane
The Plane 00 of Group 00 is the Basic Multilingual Plane (BMP). The BMP can be used as a two-octet coded character set in which case it shall be called UCS-2 (see clause 13.1).
NOTE 1 – Since UCS-2 only contains the repertoire of the BMP it is not fully interoperable with UCS-4, UTF-8 and UTF-16.
Code positions 0000 0000 to 0000 001F in the BMP are reserved for control characters, and code position 0000 007F is reserved for the character DELETE (see clause 15). Code positions 0000 0080 to 0000 009F are reserved for control characters.
Code positions 0000 2060 to 0000 206F, 0000 FFF0 to 0000 FFFC, and 000E 0000 to 000E 0FFF are reserved for Alternate Format Characters (see annex F).
NOTE 2 – Unassigned code positions in those ranges may be ignored in normal processing and display.
Code positions 0000 D800 to 0000 DFFF are reserved for the use of UTF-16 (see annex C). These positions are known as the S-zone.
Code positions 0000 E000 to 0000 F8FF are reserved for private use (see clause 10). These positions are known as the private use zone.
In addition to code positions 0000 FFFE and 0000 FFFF (see sub-clause 7.a), code positions 0000 FDD0 to 0000 FDEF are also permanently reserved.
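The short identifier syntax of clause 6.5 and the sequence form of clause 6.6 lend themselves to a small parser and formatter. The Python sketch below is one possible reading of the Backus-Naur form, not a normative implementation: the function names are hypothetical, the optional SPACE of item f. is not handled, and suppression of leading zeroes beyond four digits is not enforced when parsing.

```python
import re
from typing import Iterable

# Optional U/u prefix, then either an optional "+" with 4 to 6 hex digits
# or an optional "-" with exactly 8 hex digits (clause 6.5).
_SHORT_ID = re.compile(
    r"[Uu]?(?:\+?(?P<short>[0-9A-Fa-f]{4,6})|-?(?P<long>[0-9A-Fa-f]{8}))"
)


def parse_short_identifier(text: str) -> int:
    """Return the code position named by a short identifier."""
    match = _SHORT_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a short identifier: {text!r}")
    if match.group("short") is not None:
        value = int(match.group("short"), 16)
        if value > 0x10FFFF:  # four-to-six-digit form undefined above 0010FFFF
            raise ValueError("four-to-six-digit form exceeds 0010FFFF")
        return value
    value = int(match.group("long"), 16)
    if value > 0x7FFFFFFF:  # G-octet is limited to 00..7F (clause 7.a)
        raise ValueError("eight-digit form exceeds 7FFFFFFF")
    return value


def format_short_identifier(code_position: int) -> str:
    """Format as U+kkkk where defined, otherwise as U-hhhhhhhh."""
    if 0 <= code_position <= 0x10FFFF:
        return f"U+{code_position:04X}"
    if code_position <= 0x7FFFFFFF and code_position >= 0:
        return f"U-{code_position:08X}"
    raise ValueError("code position outside the coding space")


def format_usi(code_positions: Iterable[int]) -> str:
    """Format a UCS Sequence Identifier such as <U+0041, U+0301>."""
    uids = [format_short_identifier(cp) for cp in code_positions]
    if len(uids) < 2:
        raise ValueError("a USI shall include at least two UIDs")
    return "<" + ", ".join(uids) + ">"
```

All eight notations listed in NOTE 3 for LATIN SMALL LETTER LONG S parse to the same value, 017F, under this reading.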
NOTE 3 – Code position 0000 FFFE is reserved for “signature” (see annex H). Code positions 0000 FDD0 to 0000 FDEF, and 0000 FFFF can be used for internal processing uses requiring numeric values which are guaranteed not to be coded characters, such as in terminating tables, or signaling end-of-text. Furthermore, since 0000 FFFF is the largest BMP value, it may also be used as the final value in binary or sequential searching index within the context of UCS-2 or UTF-16.
9 Supplementary planes
9.1 Planes accessible by UTF-16
Each code position in Planes 01 to 10 of Group 00 has a unique mapping to a four-octet sequence in accordance with the UTF-16 form of coded representation (see annex C). This form is compatible with the two-octet BMP form of UCS-2 (see clause 13.1).
The planes 01, 02 and 0E of Group 00 are the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP) and the Supplementary Special-purpose Plane (SSP) respectively. Like the BMP, these planes contain graphic characters allocated to code positions. The Planes from 03 to 0D of Group 00 are reserved for future standardization. See clause 10.2 for the definition of Plane 0F and 10 of Group 00.
NOTE – The following table shows the boundary code positions for planes 01, 02 and 0E expressed ...
NOTE – To ensure continued interoperability between the UTF-16 form and other coded representations of the UCS, it is intended that no characters will be allocated to code positions in Planes 11 to FF in Group 00 or any planes in any other groups.
10 Private use planes and zones
10.1 Private use characters
Private use characters are not constrained in any way by ISO/IEC 10646. Private use characters can be used to provide user-defined characters. For example, this is a common requirement for users of ideographic scripts.
NOTE 1 – For meaningful interchange of private use characters, an agreement, independent of ISO/IEC 10646, is necessary between sender and recipient.
Private use characters can be used for dynamically-redefinable character applications.
NOTE 2 – For meaningful interchange of dynamically-redefinable characters, an agreement, independent of ISO/IEC 10646 is necessary between sender and recipient. ISO/IEC 10646 does not specify the techniques for defining or setting up dynamically-redefinable characters.
10.2 Code positions for private use characters
The code positions of Plane 0F and Plane 10 of Group 00 shall be for private use.
...
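Taken together, clause 8 (private use zone) and clause 10.2 (private use planes) suggest a simple membership test. The sketch below covers only the ranges quoted in this excerpt, since clause 10.2 is cut off here and may name further ranges. It also assumes, from sub-clause 7.a, that positions FFFE and FFFF of Planes 0F and 10 stay permanently reserved rather than private use. The function name is hypothetical.

```python
def is_private_use(code_position: int) -> bool:
    """Tell whether a code position is private use per clauses 8 and 10.2."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("code position outside the coding space")
    # Private use zone of the BMP (clause 8).
    if 0xE000 <= code_position <= 0xF8FF:
        return True
    # Plane 0F and Plane 10 of Group 00 (clause 10.2).
    group_and_plane = code_position >> 16
    if group_and_plane in (0x000F, 0x0010):
        # Assumption: FFFE and FFFF remain permanently reserved (7.a).
        return (code_position & 0xFFFF) < 0xFFFE
    return False
```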
INTERNATIONAL STANDARD ISO/IEC 10646
First edition
2003-12-15
Technologies de l'information - Jeu universel de caractères codés sur plusieurs octets (JUC)
Information technology - Universal Multiple-Octet Coded Character Set (UCS)
Reference number
ISO/IEC 10646:2003(F)
© ISO/IEC 2003
This CD-ROM contains:
1) the publication ISO/IEC 10646:2003 in PDF (portable document format), which can be viewed using Adobe® Acrobat® Reader;
2) text files containing the lists of
i) source references for the CJK ideographs,
ii) Hangul syllables and mapping information,
iii) character names sorted in alphabetical order.
Adobe and Acrobat are registered trademarks of Adobe Systems Incorporated.
This first edition cancels and replaces
...











