ISO/IEC 19757-7:2009
Information technology — Document Schema Definition Languages (DSDL) — Part 7: Character Repertoire Description Language (CREPDL)
ISO/IEC 19757 defines a set of Document Schema Definition Languages (DSDL) that can be used to specify one or more validation processes performed against Extensible Markup Language (XML) documents. ISO/IEC 19757-7:2009 specifies a Character Repertoire Description Language (CREPDL); a CREPDL schema describes a character repertoire. ISO/IEC 19757-7:2009 introduces kernels and hulls of repertoires, then specifies the syntax of CREPDL schemas and the semantics of a correct CREPDL schema; the semantics specify when a character is in a repertoire described by a CREPDL schema. ISO/IEC 19757-7:2009 also defines CREPDL processors and their behaviour. Finally, it describes differences between conformant CREPDL processors and provides examples of CREPDL schemas.
Technologies de l'information — Langages de définition de schéma de documents (DSDL) — Partie 7: Langage de description de répertoire de caractères (CREPDL)
INTERNATIONAL STANDARD ISO/IEC 19757-7
First edition
2009-12-15
Reference number: ISO/IEC 19757-7:2009(E)
© ISO/IEC 2009
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Notation
5 Repertoire, kernel, and hull
6 Syntax
6.1 General
6.2 RELAX NG schema
6.3 NVDL script
6.4 Regular expressions
7 Semantics
7.1 General
7.2 char
7.3 union
7.4 intersection
7.5 difference
7.6 ref
7.7 repertoire
8 Validation
Annex A (informative) Differences of Conformant Processors
Annex B (informative) Example CREPDL schemas
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 19757-7 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 34, Document description and processing languages.
ISO/IEC 19757 consists of the following parts, under the general title Information technology — Document
Schema Definition Languages (DSDL):
⎯ Part 1: Overview
⎯ Part 2: Regular-grammar-based validation — RELAX NG
⎯ Part 3: Rule-based validation — Schematron
⎯ Part 4: Namespace-based Validation Dispatching Language (NVDL)
⎯ Part 5: Extensible datatypes
⎯ Part 7: Character Repertoire Description Language (CREPDL)
⎯ Part 8: Document Semantics Renaming Language (DSRL)
⎯ Part 9: Namespace and datatype declaration in Document Type Definitions (DTDs)
Introduction
ISO/IEC 19757 defines a set of Document Schema Definition Languages (DSDL) that can be used to specify
one or more validation processes performed against Extensible Markup Language (XML) documents. A
number of validation technologies are standardized in DSDL to complement those already available as
standards or from industry.
The main objective of ISO/IEC 19757 is to bring together different validation-related technologies to form a
single extensible framework that allows technologies to work in series or in parallel to produce a single or a
set of validation results. The extensibility of DSDL accommodates validation technologies not yet designed or
specified.
This part of ISO/IEC 19757 provides a language for describing character repertoires. Descriptions in this
language may be referenced from schemas. Furthermore, they may also be referenced from forms and
stylesheets.
NOTE At present, no schema languages provide mechanisms for referencing CREPDL schemas.
Descriptions of repertoires need not be exact. Non-exact descriptions are made possible by kernels and hulls,
which provide the lower and upper limits, respectively.
1 Scope
This part of ISO/IEC 19757 specifies a Character Repertoire Description Language (CREPDL); a CREPDL
schema describes a character repertoire. This part of ISO/IEC 19757 introduces kernels and hulls of
repertoires, then specifies the syntax of CREPDL schemas and the semantics of a correct CREPDL schema;
the semantics specify when a character is in a repertoire described by a CREPDL schema. This part of
ISO/IEC 19757 defines CREPDL processors and their behaviour. Finally, it describes differences between
conformant CREPDL processors and provides examples of CREPDL schemas.
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
NOTE Each of the following documents has a unique identifier that is used to cite the document in the text. The
unique identifier consists of the part of the reference up to the first comma.
ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 19757-2, Information technology — Document Schema Definition Language (DSDL) — Part 2:
Regular-grammar-based validation — RELAX NG
ISO/IEC 19757-4, Information technology — Document Schema Definition Languages (DSDL) — Part 4:
Namespace-based Validation Dispatching Language (NVDL)
W3C XML, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-20060816
W3C XML-Names, Namespaces in XML 1.0 (Second Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-names-20060816
W3C XML Schema Part 2, XML Schema Part 2: Datatypes (Second Edition), W3C Recommendation, 28
October 2004, available at http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
IETF RFC 3987, Internationalized Resource Identifiers (IRIs), Internet Standards Track Specification, January
2005, available at http://www.ietf.org/rfc/rfc3987.txt
IANA Charsets, IANA CHARACTER SETS, Internet Assigned Numbers Authority, available at
http://www.iana.org/assignments/character-sets
Unicode, The Unicode Standard, The Unicode Consortium, available at http://www.unicode.org/
CLDR, Unicode Common Locale Data Repository, The Unicode Consortium, available at
http://www.unicode.org/cldr/
3 Terms and definitions
For the purposes of this document, the terms “character” and “repertoire” as defined in ISO/IEC 10646 and the
following apply.
3.1
kernel
set of characters that are guaranteed to be in the repertoire
3.2
hull
set of characters that may be in the repertoire
4 Notation
in(x, A): character x is in the repertoire described by a CREPDL element A
not-in(x, A): character x is not in the repertoire described by a CREPDL element A
unknown(x, A): it is unknown whether character x is in the repertoire described by a CREPDL element A
5 Repertoire, kernel, and hull
A repertoire shall be described by specifying a kernel and hull. Kernels and hulls shall be sets of characters.
A character shall be in a repertoire when it is in the kernel. A sequence of characters shall be in a repertoire
when every one of the characters is in the kernel.
A character shall not be in a repertoire when it is in neither the hull nor the kernel. A sequence of characters
shall not be in a repertoire when at least one of the characters is in neither the kernel nor the hull.
It shall be unknown whether or not a character is in a repertoire when it is in the hull but is not in the kernel. It
shall be unknown whether or not a sequence of characters is in a repertoire when at least one of the
characters is not in the kernel but every one of the characters is in the hull or the kernel.
NOTE 1 Kernel and hull are borrowed from W3C Note-charcol[3]. Some examples in Annex B are also borrowed from it.
NOTE 2 It may be impossible to specify a repertoire exactly, since characters may continue to be added to the
repertoire. However, it is often possible to specify which characters are definitely included and which characters are definitely
excluded. Kernels and hulls help to describe such open repertoires. A kernel is used to specify those characters which are
guaranteed to be in the repertoire, while a hull is used to specify an outer boundary. An example of such open repertoires
is shown in B.4.
NOTE 3 This part of ISO/IEC 19757 can handle sets of characters, but cannot handle sets of sequences of characters.
In other words, CREPDL schemas cannot indicate that a combining character is allowed only when it directly follows some
base character. Likewise, CREPDL schemas cannot handle named sequences, but can only handle characters occurring
in named sequences. It is believed that this part of ISO/IEC 19757 needs this limitation, since implementations become
significantly easier.
NOTE 4 It is possible but not recommended to specify a hull that disallows some character in the corresponding kernel.
Note that the condition that a character is in a repertoire does not mention the hull.
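The rules of this clause can be sketched in Python as follows. This is an illustrative sketch only and not part of the standard: the function names classify_char and classify_sequence, the string results, and the use of plain Python sets for the kernel and hull are assumptions made for the example. For sequences it assumes the reading in which every character must be in the kernel for the sequence to be in the repertoire.

```python
def classify_char(ch, kernel, hull):
    """Return "in", "not-in" or "unknown" for one character.

    kernel and hull are plain Python sets of one-character strings
    (an assumption of this sketch, not a CREPDL data structure).
    """
    if ch in kernel:
        return "in"          # the hull is not consulted (see NOTE 4)
    if ch in hull:
        return "unknown"     # allowed by the hull, not guaranteed by the kernel
    return "not-in"          # in neither the kernel nor the hull


def classify_sequence(text, kernel, hull):
    """Classify a sequence of characters against the same kernel and hull."""
    results = [classify_char(ch, kernel, hull) for ch in text]
    if "not-in" in results:
        return "not-in"      # at least one character is in neither set
    if "unknown" in results:
        return "unknown"     # every character allowed, but not all guaranteed
    return "in"              # every character is in the kernel


kernel = set("abc")
hull = set("abcdef")
print(classify_char("a", kernel, hull))        # in
print(classify_sequence("abd", kernel, hull))  # unknown
print(classify_sequence("abz", kernel, hull))  # not-in
```

Checking the kernel first mirrors NOTE 4: a character in the kernel is in the repertoire even if the hull happens to exclude it.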
6 Syntax
6.1 General
A CREPDL schema shall be an XML document (W3C XML) valid against the NVDL (ISO/IEC 19757-4)
script in 6.3, which in turn relies on the RELAX NG (ISO/IEC 19757-2) schema in 6.2. The elements
allowed in the RELAX NG schema are in the namespace (W3C XML-Names)
http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0. Further constraints on the character
content of the char, kernel or hull elements are given in 6.4.
NOTE 1 W3C XML 1.1[6] shall not be used for representing CREPDL schemas.
NOTE 2 W3C XML specifies that characters in XML documents are either U+0009 (CHARACTER TABULATION),
U+000A (LINE FEED), U+000D (CARRIAGE RETURN), or a character in the ranges from U+0020 to U+D7FF, U+E000 to
U+FFFD, or U+10000 to U+10FFFF. In other words, XML documents cannot contain U+0000, U+0001, U+0002, U+0003,
U+0004, U+0005, U+0006, U+0007, U+0008, U+000B, U+000C, U+000E, U+000F, U+0010, U+0011, U+0012, U+0013,
U+0014, U+0015, U+0016, U+0017, U+0018, U+0019, U+001A, U+001B, U+001C, U+001D, U+001E, or U+001F. Since
CREPDL schemas are represented by XML documents, these characters cannot directly occur in CREPDL schemas.
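The character ranges listed in NOTE 2 can be checked with a few lines of Python. This is a minimal sketch for illustration; the function name is_xml10_char is hypothetical and the ranges are those quoted in the note.

```python
def is_xml10_char(code_point):
    """True if the code point is allowed in an XML 1.0 document."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# The control characters below U+0020 that cannot occur in a CREPDL schema.
excluded_c0 = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print(len(excluded_c0))                         # 29
print(", ".join(f"U+{cp:04X}" for cp in excluded_c0[:3]))
```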
6.2 RELAX NG schema
#$Id: crepdl.rnc 5 2009-05-02 09:48:49Z makoto $
#
# The following permission notice and disclaimer shall be included in all
# copies of this schema ("the Schema"), and derivations of the Schema:
#
# Permission is hereby granted, free of charge in perpetuity, to any
# person obtaining a copy of the Schema, to use, copy, modify, merge and
# distribute free of charge, copies of the Schema for the purposes of
# developing, implementing, installing and using software based on the
# Schema, and to permit persons to whom the Schema is furnished to do so,
# subject to the following conditions:
#
# THE SCHEMA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SCHEMA OR THE USE OR
# OTHER DEALINGS IN THE SCHEMA.
#
# In addition, any modified copy of the Schema shall include the following
# notice:
#
# THIS SCHEMA HAS BEEN MODIFIED FROM THE SCHEMA DEFINED IN ISO/IEC 19757-7,
# AND SHOULD NOT BE INTERPRETED AS COMPLYING WITH THAT STANDARD.
default namespace = "http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0"
start = coll
coll =
  union | intersection | difference | ref | repertoire | char
union = element union { commonAtts, coll+ }
intersection = element intersection { commonAtts, coll+ }
difference = element difference { commonAtts, coll+ }
ref =
  element ref {
    commonAtts,
    attribute href { xsd:anyURI }
  }
repertoire =
  element repertoire {
    commonAtts,
    attribute registry { text },
    attribute version { text }?,
    (attribute name { text } | attribute number { xsd:int })
  }
char =
  element char {
    commonAtts,
    (text
     | element kernel { commonAtts, text }
     | element hull { commonAtts, text }
     | (element kernel { commonAtts, text },
        element hull { commonAtts, text }))
  }
commonAtts =
  attribute minUcsVersion { text }?,
  attribute maxUcsVersion { text }?
# Note that xml:id is allowed, since any foreign attribute is
# allowed by the NVDL script.
The value of a minUcsVersion or maxUcsVersion attribute shall be a string indicating a version number of
the Unicode standard, possibly having leading or trailing whitespace.
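A processor has to turn these attribute values into something comparable. The following Python sketch shows one way to do so; the function name parse_ucs_version is hypothetical, and the assumption that a version number consists of dot-separated decimal integers (such as 4.0 or 5.1.0) is made for the example and is not stated by the schema.

```python
def parse_ucs_version(value):
    """Parse e.g. " 4.0 " into (4, 0); surrounding whitespace is ignored."""
    parts = value.strip().split(".")
    if not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a version number: {value!r}")
    return tuple(int(p) for p in parts)


print(parse_ucs_version(" 4.0 "))                               # (4, 0)
print(parse_ucs_version("3.2") <= parse_ucs_version("4.0"))     # True
```

Tuples compare element by element, so a minimum and a maximum bound can be tested with ordinary comparison operators.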
6.3 NVDL script
schemaType="application/relax-ng-compact-syntax">
NOTE This NVDL script allows foreign elements and attributes everywhere.
6.4 Regular expressions
The character content of a char, kernel or hull element shall be a regular expression that matches either
Char or charClass as specified in W3C XML Schema Part 2.
NOTE 1 Since this part of ISO/IEC 19757 uses regular expressions for representing sets of characters rather than sets
of strings, regular expressions are restricted to Char and charClass.
NOTE 2 The following rules are duplicated from W3C XML Schema Part 2 for information. The semantics of [29]
through [37] depend on the version of Unicode.
[10] Char ::= [^.\?*+()|#x5B#x5D]
[11] charClass ::= charClassEsc | charClassExpr | WildcardEsc
[12] charClassExpr ::= '[' charGroup ']'
[13] charGroup ::= posCharGroup | negCharGroup | charClassSub
[14] posCharGroup ::= ( charRange | charClassEsc )+
[15] negCharGroup ::= '^' posCharGroup
[16] charClassSub ::= ( posCharGroup | negCharGroup ) '-' charClassExpr
[17] charRange ::= seRange | XmlCharIncDash
[18] seRange ::= charOrEsc '-' charOrEsc
[20] charOrEsc ::= XmlChar | SingleCharEsc
[21] XmlChar ::= [^\#x2D#x5B#x5D]
[22] XmlCharIncDash ::= [^\#x5B#x5D]
[23] charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )
[24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
[25] catEsc ::= '\p{' charProp '}'
[26] complEsc ::= '\P{' charProp '}'
[27] charProp ::= IsCategory | IsBlock
[28] IsCategory ::= Letters | Marks | Numbers | Punctuation | Separators | Symbols | Others
[29] Letters ::= 'L' [ultmo]?
[30] Marks ::= 'M' [nce]?
[31] Numbers ::= 'N' [dlo]?
[32] Punctuation ::= 'P' [cdseifo]?
[33] Separators ::= 'Z' [slp]?
[34] Symbols ::= 'S' [mcko]?
[35] Others ::= 'C' [cfon]?
[36] IsBlock ::= 'Is' [a-zA-Z0-9#x2D]+
[37] MultiCharEsc ::= '\' [sSiIcCdDwW]
[37a] WildcardEsc ::= '.'
NOTE 3 Since W3C REC-xpath-functions[4] extends the definition of regular expressions in W3C XML Schema Part 2,
Char and charClass in W3C REC-xpath-functions[4] and those in W3C XML Schema Part 2 differ in three respects.
First, charClass in W3C REC-xpath-functions[4] allows the single character escapes \^ and \$, but that in W3C XML
Schema Part 2 does not. Second, Char in W3C XML Schema Part 2 allows $ and ^, but Char in W3C
REC-xpath-functions[4] does not. Third, Char (production [10]) in W3C XML Schema Part 2 has a known error in which it fails to
disallow the left brace ({) and right brace (}), while Char in W3C REC-xpath-functions[4] disallows them.
Implementations of regular expressions in W3C REC-xpath-functions[4] can safely handle the content of a char, kernel
or hull element if neither $ nor ^ appears as Char (e.g., <char>$</char>).
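Production [10] can be illustrated with a small Python check. This is a sketch for illustration only; the names NOT_CHAR and is_char_production are hypothetical, and the check covers Char alone, not charClass.

```python
# Characters excluded by production [10]: . \ ? * + ( ) | [ ]
NOT_CHAR = set(".\\?*+()|[]")


def is_char_production(text):
    """True if text is exactly one character matched by production [10]."""
    return len(text) == 1 and text not in NOT_CHAR


print(is_char_production("a"))   # True
print(is_char_production("*"))   # False
print(is_char_production("$"))   # True: allowed by W3C XML Schema Part 2
print(is_char_production("{"))   # True: the known error described in NOTE 3
```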
7 Semantics
7.1 General
This clause specifies the semantics of a CREPDL element using three notations: in(x, A), not-in(x, A), and
unknown(x, A), where x is a character and A is a CREPDL element. These notations are introduced in
Clause 4.
7.2 char
First, the semantics of regular expressions occurring in char, kernel, and hull elements shall be as
specified in W3C XML Schema Part 2.
NOTE 1 Since regular expressions in W3C XML Schema Part 2 do not satisfy Level-1 conformance requirements in
UTS #18[7], implementations of this part of ISO/IEC 19757 do not conform to UTS #18[7].
The semantics of <char>...</char> are defined below.
⎯ Case 1: the char element has neither kernel nor hull as a child element.
    It is assumed that this element has a kernel element and a hull element whose contents are identical
    to the character content of this element. The rest is the same as in Case 4.
⎯ Case 2: the char element has a kernel element but does not have a hull element.
    ⎯ in(x, <char>...</char>) when x matches the regular expression specified as the content of the
      kernel element.
    ⎯ not-in(x, <char>...</char>) never holds.
    ⎯ unknown(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element.
⎯ Case 3: the char element has a hull element but does not have a kernel element.
    ⎯ in(x, <char>...</char>) never holds.
    ⎯ not-in(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the hull element.
    ⎯ unknown(x, <char>...</char>) when x matches the regular expression specified as the
      content of the hull element.
⎯ Case 4: the char element has a hull element and a kernel element.
    ⎯ in(x, <char>...</char>) when x matches the regular expression specified as the content of the
      kernel element.
    ⎯ not-in(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element and x does not match the regular expression specified as the
      content of the hull element.
    ⎯ unknown(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element and x matches the regular expression specified as the content of the
      hull element.
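The four cases can be sketched in Python as follows. This is an illustration under stated assumptions, not a conformant processor: the function name char_semantics and its parameters are hypothetical, and Python's re module is used in place of the regular expressions of W3C XML Schema Part 2, so the sketch only applies to simple expressions such as [a-c] on which the two syntaxes agree (for instance, \p{Nd} is not supported by re).

```python
import re


def char_semantics(x, content=None, kernel=None, hull=None):
    """Return "in", "not-in" or "unknown" for character x.

    content is the character content of a char element without child
    elements (Case 1); kernel and hull are the contents of the kernel
    and hull child elements, or None when the child is absent.
    """
    if kernel is None and hull is None:
        kernel = hull = content                  # Case 1 reduces to Case 4
    in_kernel = kernel is not None and re.fullmatch(kernel, x) is not None
    in_hull = hull is not None and re.fullmatch(hull, x) is not None
    if in_kernel:
        return "in"                              # Cases 2 and 4
    if hull is None:
        return "unknown"                         # Case 2: not-in never holds
    return "unknown" if in_hull else "not-in"    # Cases 3 and 4


print(char_semantics("b", content="[a-c]"))               # in
print(char_semantics("z", content="[a-c]"))               # not-in
print(char_semantics("d", kernel="[a-c]", hull="[a-f]"))  # unknown
print(char_semantics("z", kernel="[a-c]"))                # unknown
print(char_semantics("a", hull="[a-f]"))                  # unknown
```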
Since the semantics of regular expressions depend on the version of the Unicode standard, the author of a
CREPDL schema may specify the intended versions by specifying the minUcsVersion and
maxUcsVersion attributes.
EXAMPLE <char minUcsVersion="4.0" maxUcsVersion="4.0">\p{Nd}</char> represents the set of
characters of the category "Nd" in Unicode Version 4.0.
NOTE 2 It is not guaranteed that every version between these two attribute values specifies the same properties for
every character. However, the author is assumed to accept the discrepancies.
If the CREPDL processor cannot use some version between these two attribute values, it should report an
error and may stop normal processing.
When a char element does not explicitly specify the minUcsVersion attribute, the nearest ancestor
element having this attribute is searched. If it is found, its attribute value is used. If not found, there is no lower
bound on Unicode versions. The same applies to maxUcsVersion.
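The search for the nearest ancestor can be sketched with Python's xml.etree.ElementTree. The function name effective_version and the sample schema are assumptions made for this illustration; ElementTree elements carry no parent pointer, so the sketch first builds a child-to-parent map.

```python
import xml.etree.ElementTree as ET

NS = "{http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0}"


def effective_version(root, target, attr):
    """Return the value of attr on target or on its nearest ancestor.

    Returns None when no element on the path to the root has the
    attribute, meaning that there is no bound.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        value = node.get(attr)
        if value is not None:
            return value.strip()   # leading/trailing whitespace is allowed
        node = parents.get(node)
    return None


schema = ET.fromstring(
    '<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0"'
    ' minUcsVersion="3.2">'
    '<char>[a-z]</char>'
    '<char minUcsVersion=" 4.0 ">\\p{Nd}</char>'
    '</union>'
)
first, second = schema.findall(NS + "char")
print(effective_version(schema, first, "minUcsVersion"))   # 3.2 (inherited)
print(effective_version(schema, second, "minUcsVersion"))  # 4.0 (own value)
print(effective_version(schema, first, "maxUcsVersion"))   # None (no bound)
```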
7.3 union
First, define the semantics of union elements <union>A B</union>, which contain two child elements A and
B. A character is in the union repertoire described by this element if and only if it is in the one described by A
or the one described by B. It is not in the union repertoire if and only if it is in neither the one described by A
nor the one described by B.
⎯ in(x, <union>A B</union>) when in(x, A) or in(x, B).
⎯ not-in(x, <union>A B</union>) when not-in(x, A) and not-in(x, B).
⎯ unknown(x, <union>A B</union>), otherwise.
When a union element has one and only one child element, the semantics shall be the same as that of the
child element. When a union element has more than two child elements, the semantics shall be the same as
that of <union>A B</union> where A is the first child and B is the union of the other child elements.
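The union rules amount to a three-valued "or". A minimal Python sketch follows; the function names union2 and union and the string values "in", "not-in" and "unknown" are assumptions of the example, and the input is the list of results already obtained for each child element for one character.

```python
def union2(a, b):
    """Combine the results of two child elements A and B."""
    if a == "in" or b == "in":
        return "in"
    if a == "not-in" and b == "not-in":
        return "not-in"
    return "unknown"


def union(results):
    """Combine one or more child results: first child, then the rest."""
    first, *rest = results
    if not rest:
        return first                      # a single child: same semantics
    return union2(first, union(rest))     # A with the union of the others


print(union(["unknown", "in"]))               # in
print(union(["not-in", "unknown"]))           # unknown
print(union(["not-in", "not-in", "not-in"]))  # not-in
```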
7.4 intersection
First, define the semantics of intersection elements <intersection>A B</intersection>, which contain
two child elements A and B. A character is in the repertoire described by this intersection element if and only if
it is in the one described by A and it is in the one described by B. It is not in this intersection repertoire if and
only if it is not in the one described by A or it is not in the one described by B.
⎯ in(x, <intersection>A B</intersection>) when in(x, A) and in(x, B).
⎯ not-in(x, <intersection>A B</intersection>) when not-in(x, A) or not-in(x, B).
⎯ unknown(x, <intersection>A B</intersection>), otherwise.
When an intersection element has one and only one child element, the semantics shall be the same as
that of the child element. When an intersection element has more than two child elements, the semantics
shall be the same as that of <intersection>A B</intersection> where A is the first child and B is the
intersection of the other child elements.
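The intersection rules amount to a three-valued "and". The Python sketch below is illustrative only; the function names intersection2 and intersection and the string values are assumptions of the example, and the input is the list of per-child results for one character.

```python
def intersection2(a, b):
    """Combine the results of two child elements A and B."""
    if a == "in" and b == "in":
        return "in"
    if a == "not-in" or b == "not-in":
        return "not-in"
    return "unknown"


def intersection(results):
    """Combine one or more child results: first child, then the rest."""
    first, *rest = results
    if not rest:
        return first                                    # a single child
    return intersection2(first, intersection(rest))     # A with the rest


print(intersection(["in", "in"]))                 # in
print(intersection(["in", "unknown"]))            # unknown
print(intersection(["unknown", "in", "not-in"]))  # not-in
```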
7.5 difference
First, define the semantics of difference elements <difference>A B</difference>, which contain two
child elements A and B. A character is in the repertoire described by this difference element if and only if it is
in the one described by A and it is not in the one described by B. It is not in this difference repertoire if and
only if either it is not in the one described by A or it is in the one described by B.
⎯ in(x, <difference>A B</difference>) when in(x, A) and not-in(x, B).
⎯ not-in(x, <difference>A B</difference>) when not-in(x, A) or in(x, B).
⎯ unknown(x, <difference>A B</difference>), otherwise.
When a difference element has one and only one child element, the semantics shall be the same as that
of the child element. When a difference element has more than two child elements, the semantics shall be
the same as that of <difference>A B</difference> where A is the first child and B is the union of the
other child elements.
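The difference rules can be sketched in Python in the same three-valued style. The function names union2, difference2 and difference and the string values are assumptions of this example; the input is the list of per-child results for one character, and with more than two children the first result is compared against the union of all the others.

```python
def union2(a, b):
    """Three-valued union, needed to combine the subtracted children."""
    if a == "in" or b == "in":
        return "in"
    if a == "not-in" and b == "not-in":
        return "not-in"
    return "unknown"


def difference2(a, b):
    """Result for A minus B."""
    if a == "in" and b == "not-in":
        return "in"
    if a == "not-in" or b == "in":
        return "not-in"
    return "unknown"


def difference(results):
    """First child minus the union of the remaining children."""
    first, *rest = results
    if not rest:
        return first                      # a single child: same semantics
    subtracted = rest[0]
    for result in rest[1:]:
        subtracted = union2(subtracted, result)
    return difference2(first, subtracted)


print(difference(["in", "not-in"]))           # in
print(difference(["in", "unknown"]))          # unknown
print(difference(["in", "not-in", "in"]))     # not-in
```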
7.6 ref
Define the semantics of <ref href="iri"/>, where iri is an IRI as sp
...
INTERNATIONAL ISO/IEC
STANDARD 19757-7
First edition
2009-12-15
Information technology — Document
Schema Definition Languages (DSDL) —
Part 7:
Character Repertoire Description
Language (CREPDL)
Technologies de l'information — Langages de définition de schéma de
documents (DSDL) —
Partie 7: Langage de description de répertoire de caractères (CREPDL)
Reference number
ISO/IEC 19757-7:2009(E)
©
ISO/IEC 2009
---------------------- Page: 1 ----------------------
ISO/IEC 19757-7:2009(E)
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO/IEC 2009 – All rights reserved
---------------------- Page: 2 ----------------------
ISO/IEC 19757-7:2009(E)
Contents Page
Foreword .iv
Introduction.v
1 Scope.1
2 Normative references.1
3 Terms and definitions .2
4 Notation .2
5 Repertoire, kernel, and hull .2
6 Syntax.3
6.1 General.3
6.2 RELAX NG schema.3
6.3 NVDL script.4
6.4 Regular expressions .5
7 Semantics.6
7.1 General.6
7.2 char.6
7.3 union.7
7.4 intersection.7
7.5 difference.7
7.6 ref.8
7.7 repertoire.8
8 Validation.9
Annex A (informative) Differences of Conformant Processors.10
Annex B (informative) Example CREPDL schemas.11
Bibliography.15
© ISO/IEC 2009 – All rights reserved iii
---------------------- Page: 3 ----------------------
ISO/IEC 19757-7:2009(E)
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 19757-7 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 34, Document description and processing languages.
ISO/IEC 19757 consists of the following parts, under the general title Information technology — Document
Schema Definition Languages (DSDL):
⎯ Part 1: Overview
⎯ Part 2: Regular-grammar-based validation — RELAX NG
⎯ Part 3: Rule-based validation — Schematron
⎯ Part 4: Namespace-based Validation Dispatching Language (NVDL)
⎯ Part 5: Extensible datatypes
⎯ Part 7: Character Repertoire Description Language (CREPDL)
⎯ Part 8: Document Semantics Renaming Language (DSRL)
⎯ Part 9: Namespace and datatype declaration in Document Type Definitions (DTDs)
iv © ISO/IEC 2009 – All rights reserved
---------------------- Page: 4 ----------------------
ISO/IEC 19757-7:2009(E)
Introduction
ISO/IEC 19757 defines a set of Document Schema Definition Languages (DSDL) that can be used to specify
one or more validation processes performed against Extensible Markup Language (XML) documents. A
number of validation technologies are standardized in DSDL to complement those already available as
standards or from industry.
The main objective of ISO/IEC 19757 is to bring together different validation-related technologies to form a
single extensible framework that allows technologies to work in series or in parallel to produce a single or a
set of validation results. The extensibility of DSDL accommodates validation technologies not yet designed or
specified.
This part of ISO/IEC 19757 provides a language for describing character repertoires. Descriptions in this
language may be referenced from schemas. Furthermore, they may also be referenced from forms and
stylesheets.
NOTE At present, no schema languages provide mechanisms for referencing CREPDL schemas.
Descriptions of repertoires need not be exact. Non-exact descriptions are made possible by kernels and hulls,
which provide the lower and upper limits, respectively.
© ISO/IEC 2009 – All rights reserved v
---------------------- Page: 5 ----------------------
INTERNATIONAL STANDARD ISO/IEC 19757-7:2009(E)
Information technology — Document Schema Definition
Languages (DSDL) —
Part 7:
Character Repertoire Description Language (CREPDL)
1 Scope
This part of ISO/IEC 19757 specifies a Character Repertoire Description Language (CREPDL); a CREPDL
schema describes a character repertoire. This part of ISO/IEC 19757 introduces kernels and hulls of
repertoires, then specifies the syntax of CREPDL schemas and the semantics of a correct CREPDL schema;
the semantics specify when a character is in a repertoire described by a CREPDL schema. This part of
ISO/IEC 19757 defines CREPDL processors and their behaviour. Finally, it describes differences of
conformant CREPDL processors, and provides examples of CREPDL schemas.
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
NOTE Each of the following documents has a unique identifier that is used to cite the document in the text. The
unique identifier consists of the part of the reference up to the first comma.
ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 19757-2, Information technology — Document Schema Definition Language (DSDL) — Part 2:
Regular-grammar-based validation — RELAX NG
ISO/IEC 19757-4, Information technology — Document Schema Definition Languages (DSDL) — Part 4:
Namespace-based Validation Dispatching Language (NVDL)
W3C XML, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-20060816
W3C XML-Names, Namespaces in XML 1.0 (Second Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-names-20060816
W3C XML Schema Part 2, XML Schema Part 2: Datatypes (Second Edition), W3C Recommendation, 28
October 2004, available at http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
IETF RFC 3987, Internationalized Resource Identifiers (IRIs), Internet Standards Track Specification, January
2005, available at http://www.ietf.org/rfc/rfc3987.txt
IANA Charsets, IANA CHARACTER SETS, Internet Assigned Numbers Authority, available at
http://www.iana.org/assignments/character-sets
Unicode, The Unicode Standard, The Unicode Consortium, available at http://www.unicode.org/
CLDR, Unicode Common Locale Data Repository, The Unicode Consortium, available at
http://www.unicode.org/cldr/
3 Terms and definitions
For the purposes of this document, the terms “character” and “repertoire” as defined in ISO/IEC 10646 and the
following apply.
3.1
kernel
set of characters that are guaranteed to be in the repertoire
3.2
hull
set of characters that may be in the repertoire
4 Notation
in(x, A): character x is in the repertoire described by a CREPDL element A
not-in(x, A): character x is not in the repertoire described by a CREPDL element A
unknown(x, A): it is unknown whether character x is in the repertoire described by a CREPDL element A
5 Repertoire, kernel, and hull
A repertoire shall be described by specifying a kernel and hull. Kernels and hulls shall be sets of characters.
A character shall be in a repertoire when it is in the kernel. A sequence of characters shall be in a repertoire
when every character in the sequence is in the kernel.
A character shall not be in a repertoire when it is in neither the hull nor the kernel. A sequence of characters
shall not be in a repertoire when at least one of the characters is in neither the kernel nor the hull.
It shall be unknown whether or not a character is in a repertoire when it is in the hull but is not in the kernel. It
shall be unknown whether or not a sequence of characters is in a repertoire when at least one of the
characters is not in the kernel but every character is in the hull or the kernel.
NOTE 1 Kernel and hull are borrowed from W3C Note-charcol[3]. Some examples in Annex B are also borrowed from it.
NOTE 2 It may be impossible to specify a repertoire exactly, since characters may continue to be added to the
repertoire. However, it is often possible to specify which character is absolutely included, and which character is absolutely
excluded. Kernels and hulls help to describe such open repertoires. A kernel is used to specify those characters which are
guaranteed to be in the repertoire, while a hull is used to specify an outer boundary. An example of such open repertoires
is shown in B.4.
NOTE 3 This part of ISO/IEC 19757 can handle sets of characters, but cannot handle sets of sequences of characters.
In other words, CREPDL schemas cannot indicate that a combining character is allowed only when it directly follows some
base character. Likewise, CREPDL schemas cannot handle named sequences, but can only handle characters occurring
in named sequences. It is believed that this part of ISO/IEC 19757 needs this limitation, since implementations become
significantly easier.
NOTE 4 It is possible but not recommended to specify a hull that disallows some character in the corresponding kernel.
Note that the condition that a character is in a repertoire does not mention the hull.
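The three outcomes described in this clause can be sketched in Python as follows. This is only an illustration under stated assumptions: kernels and hulls are modelled as plain Python sets, the names `Membership`, `classify_char` and `classify_sequence` are hypothetical, the sequence rules are read as requiring every character to qualify, and the treatment of an empty sequence (reported as "in") is a choice of this sketch rather than something the text specifies.

```python
from enum import Enum


class Membership(Enum):
    IN = "in"
    NOT_IN = "not-in"
    UNKNOWN = "unknown"


def classify_char(ch, kernel, hull):
    """Classify one character against a kernel and a hull (both sets)."""
    if ch in kernel:
        # Membership depends on the kernel alone (see NOTE 4).
        return Membership.IN
    if ch in hull:
        return Membership.UNKNOWN
    return Membership.NOT_IN


def classify_sequence(text, kernel, hull):
    """Classify a sequence of characters from its per-character results."""
    results = [classify_char(ch, kernel, hull) for ch in text]
    if any(r is Membership.NOT_IN for r in results):
        return Membership.NOT_IN
    if all(r is Membership.IN for r in results):
        return Membership.IN
    return Membership.UNKNOWN


kernel = set("abc")
hull = set("abcdef")
print(classify_sequence("abd", kernel, hull))
```

Keeping the kernel test first means that a character listed in the kernel but missing from the hull is still reported as "in", which matches NOTE 4.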
6 Syntax
6.1 General
A CREPDL schema shall be an XML document (W3C XML) valid against the NVDL (ISO/IEC 19757-4)
script in 6.3, which in turn relies on the RELAX NG (ISO/IEC 19757-2) schema in 6.2. The elements
allowed in the RELAX NG schema belong to the namespace (W3C XML-Names)
http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0. Further constraints on the character
content of the char, kernel or hull elements are shown in 6.4.
NOTE 1 W3C XML 1.1[6] shall not be used for representing CREPDL schemas.
NOTE 2 W3C XML specifies that characters in XML documents are either U+0009 (CHARACTER TABULATION),
U+000A (LINE FEED), U+000D (CARRIAGE RETURN), or a character in the ranges from U+0020 to U+D7FF, U+E000 to
U+FFFD, or U+10000 to U+10FFFF. In other words, XML documents cannot contain U+0000, U+0001, U+0002, U+0003,
U+0004, U+0005, U+0006, U+0007, U+0008, U+000B, U+000C, U+000E, U+000F, U+0010, U+0011, U+0012, U+0013,
U+0014, U+0015, U+0016, U+0017, U+0018, U+0019, U+001A, U+001B, U+001C, U+001D, U+001E, or U+001F. Since
CREPDL schemas are represented by XML documents, these characters cannot directly occur in CREPDL schemas.
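The character ranges quoted in NOTE 2 can be checked mechanically. The following Python sketch encodes them directly; the function name `is_xml10_char` is hypothetical and the code is only an illustration of the ranges listed above, not part of any CREPDL processor.

```python
def is_xml10_char(code_point):
    """Return True if the code point may appear in an XML 1.0 document."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# Control characters below U+0020 that cannot occur directly in a schema.
excluded_controls = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print(", ".join("U+%04X" % cp for cp in excluded_controls))
```

The printed list corresponds to the code points enumerated in NOTE 2.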
6.2 RELAX NG schema
#$Id: crepdl.rnc 5 2009-05-02 09:48:49Z makoto $
#
# The following permission notice and disclaimer shall be included in all
# copies of this schema ("the Schema"), and derivations of the Schema:
#
# Permission is hereby granted, free of charge in perpetuity, to any
# person obtaining a copy of the Schema, to use, copy, modify, merge and
# distribute free of charge, copies of the Schema for the purposes of
# developing, implementing, installing and using software based on the
# Schema, and to permit persons to whom the Schema is furnished to do so,
# subject to the following conditions:
#
# THE SCHEMA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SCHEMA OR THE USE OR
# OTHER DEALINGS IN THE SCHEMA.
#
# In addition, any modified copy of the Schema shall include the following
# notice:
#
# THIS SCHEMA HAS BEEN MODIFIED FROM THE SCHEMA DEFINED IN ISO/IEC 19757-7,
# AND SHOULD NOT BE INTERPRETED AS COMPLYING WITH THAT STANDARD.
default namespace = "http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0"
start = coll
coll =
  union | intersection | difference | ref | repertoire | char
union = element union { commonAtts, coll+ }
intersection = element intersection { commonAtts, coll+ }
difference = element difference { commonAtts, coll+ }
ref =
  element ref {
    commonAtts,
    attribute href { xsd:anyURI }
  }
repertoire =
  element repertoire {
    commonAtts,
    attribute registry { text },
    attribute version { text }?,
    (attribute name { text } | attribute number { xsd:int })
  }
char =
  element char {
    commonAtts,
    (text
     | element kernel { commonAtts, text }
     | element hull { commonAtts, text }
     | (element kernel { commonAtts, text },
        element hull { commonAtts, text }))
  }
commonAtts =
  attribute minUcsVersion { text }?,
  attribute maxUcsVersion { text }?
# Note that xml:id is allowed, since any foreign attribute is
# allowed by the NVDL script.
The value of a minUcsVersion or maxUcsVersion attribute shall be a string indicating a version number of
the Unicode standard, possibly having leading or trailing whitespace.
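To make the shape of the grammar concrete, the following Python sketch walks an element tree and checks it against the `coll` pattern. It is a rough approximation and not a substitute for RELAX NG or NVDL validation: the helper names `local_name` and `is_coll` are hypothetical, foreign elements are simply skipped, attribute datatypes such as `xsd:int` are not checked, and text mixed with `kernel` or `hull` children is not detected.

```python
import xml.etree.ElementTree as ET

CREPDL_NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0"
SET_OPERATORS = {"union", "intersection", "difference"}


def local_name(elem):
    """Return the local name of a CREPDL element, or None for foreign elements."""
    prefix = "{" + CREPDL_NS + "}"
    return elem.tag[len(prefix):] if elem.tag.startswith(prefix) else None


def is_coll(elem):
    """Approximate check that elem matches the `coll` pattern of the grammar."""
    name = local_name(elem)
    # Foreign elements are allowed everywhere by the NVDL script, so skip them.
    children = [c for c in elem if local_name(c) is not None]
    if name in SET_OPERATORS:
        return len(children) >= 1 and all(is_coll(c) for c in children)
    if name == "ref":
        return "href" in elem.attrib and not children
    if name == "repertoire":
        exactly_one_id = ("name" in elem.attrib) != ("number" in elem.attrib)
        return "registry" in elem.attrib and exactly_one_id and not children
    if name == "char":
        child_names = [local_name(c) for c in children]
        return child_names in ([], ["kernel"], ["hull"], ["kernel", "hull"])
    return False


example = ET.fromstring(
    '<union xmlns="' + CREPDL_NS + '">'
    "<char>[a-z]</char>"
    "<char><kernel>[0-9]</kernel><hull>[0-9a-f]</hull></char>"
    "</union>"
)
print(is_coll(example))
```

The example schema is invented for illustration; it combines a plain `char` element with one that carries separate kernel and hull expressions.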
6.3 NVDL script
schemaType="application/relax-ng-compact-syntax">
NOTE This NVDL script allows foreign elements and attributes everywhere.
6.4 Regular expressions
The character content of a char, kernel or hull element shall be a regular expression that matches either
Char or charClass as specified in W3C XML Schema Part 2.
NOTE 1 Since this part of ISO/IEC 19757 uses regular expressions for representing sets of characters rather than sets
of strings, regular expressions are restricted to Char and charClass.
NOTE 2 The following rules are duplicated from W3C XML Schema Part 2 for information. The semantics of [29]
through [37] depend on the version of Unicode.
[10] Char ::= [^.\?*+()|#x5B#x5D]
[11] charClass ::= charClassEsc | charClassExpr | WildcardEsc
[12] charClassExpr ::= '[' charGroup ']'
[13] charGroup ::= posCharGroup | negCharGroup | charClassSub
[14] posCharGroup ::= ( charRange | charClassEsc )+
[15] negCharGroup ::= '^' posCharGroup
[16] charClassSub ::= ( posCharGroup | negCharGroup )
'-' charClassExpr
[17] charRange ::= seRange | XmlCharIncDash
[18] seRange ::= charOrEsc '-' charOrEsc
[20] charOrEsc ::= XmlChar | SingleCharEsc
[21] XmlChar ::= [^\#x2D#x5B#x5D]
[22] XmlCharIncDash ::= [^\#x5B#x5D]
[23] charClassEsc ::= ( SingleCharEsc | MultiCharEsc
| catEsc | complEsc )
[24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
[25] catEsc ::= '\p{' charProp '}'
[26] complEsc ::= '\P{' charProp '}'
[27] charProp ::= IsCategory | IsBlock
[28] IsCategory ::= Letters | Marks | Numbers
| Punctuation | Separators | Symbols | Others
[29] Letters ::= 'L' [ultmo]?
[30] Marks ::= 'M' [nce]?
[31] Numbers ::= 'N' [dlo]?
[32] Punctuation ::= 'P' [cdseifo]?
[33] Separators ::= 'Z' [slp]?
[34] Symbols ::= 'S' [mcko]?
[35] Others ::= 'C' [cfon]?
[36] IsBlock ::= 'Is' [a-zA-Z0-9#x2D]+
[37] MultiCharEsc ::= '\' [sSiIcCdDwW]
[37a] WildcardEsc ::= '.'
NOTE 3 Since W3C REC-xpath-functions[4] extends the definition of regular expressions in W3C XML Schema Part 2,
Char and charClass in W3C REC-xpath-functions[4] and those in W3C XML Schema Part 2 are different in three points.
First, charClass in W3C REC-xpath-functions[4] allows single character escapes \^ and \$, but that in W3C XML
Schema Part 2 does not. Second, Char in W3C XML Schema Part 2 allows $ and ^, but Char in W3C REC-xpath-
functions[4] does not. Third, Char (production [10]) in W3C XML Schema Part 2 has a known error in which it fails to
disallow the left brace ({) and right brace (}), while Char in W3C REC-xpath-functions[4] disallows them.
Implementations of regular expressions in W3C REC-xpath-functions[4] can safely handle the content of a char, kernel
or hull element if neither $ nor ^ appears as Char (e.g., <char>$</char>).
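As an illustration of productions [25] to [35], the following Python sketch tests a single character against a category escape such as \p{Nd} or \P{Lu}. It rests on several assumptions: the function name `matches_cat_esc` is hypothetical, block escapes (IsBlock) are not handled, and the result reflects whichever Unicode version the `unicodedata` module of the running interpreter was built with, which is the version dependence mentioned in NOTE 2.

```python
import re
import unicodedata

# Productions [29]-[35]: the general-category names accepted inside \p{...}.
IS_CATEGORY = re.compile(
    r"L[ultmo]?|M[nce]?|N[dlo]?|P[cdseifo]?|Z[slp]?|S[mcko]?|C[cfon]?"
)


def matches_cat_esc(ch, char_prop, complement=False):
    """Test one character against \\p{char_prop}, or \\P{char_prop} if complement."""
    if IS_CATEGORY.fullmatch(char_prop) is None:
        raise ValueError("not an IsCategory name: " + char_prop)
    # A one-letter property such as "L" covers every category starting with it.
    matched = unicodedata.category(ch).startswith(char_prop)
    return matched != complement


# The answer depends on the Unicode data this interpreter ships with.
print(unicodedata.unidata_version)
print(matches_cat_esc("5", "Nd"))
```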
7 Semantics
7.1 General
This clause specifies the semantics of a CREPDL element using three notations: in(x, A), not-in(x, A), and
unknown(x, A), where x is a character and A is a CREPDL element. These notations are introduced in
Clause 4.
7.2 char
First, the semantics of regular expressions occurring in char, kernel, and hull elements shall be as
specified in W3C XML Schema Part 2.
NOTE 1 Since regular expressions in W3C XML Schema Part 2 do not satisfy Level-1 conformance requirements in
UTS #18[7], implementations of this part of ISO/IEC 19757 do not conform to UTS #18[7].
The semantics of <char>...</char> are defined below.
⎯ Case 1: the char element has neither kernel nor hull as a child element.
   It is assumed that this element has a kernel element and a hull element whose contents are identical
   to the character content of this element. The rest is the same as in Case 4.
⎯ Case 2: the char element has a kernel element but does not have a hull element.
   ⎯ in(x, <char>...</char>) when x matches the regular expression specified as the content of the
      kernel element.
   ⎯ not-in(x, <char>...</char>) never holds.
   ⎯ unknown(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element.
⎯ Case 3: the char element has a hull element but does not have a kernel element.
   ⎯ in(x, <char>...</char>) never holds.
   ⎯ not-in(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the hull element.
   ⎯ unknown(x, <char>...</char>) when x matches the regular expression specified as the
      content of the hull element.
⎯ Case 4: the char element has a hull element and a kernel element.
   ⎯ in(x, <char>...</char>) when x matches the regular expression specified as the content of the
      kernel element.
   ⎯ not-in(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element and x does not match the regular expression specified as the
      content of the hull element.
   ⎯ unknown(x, <char>...</char>) when x does not match the regular expression specified as the
      content of the kernel element and x matches the regular expression specified as the content of the
      hull element.
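The four cases above can be sketched as a single Python function. This is an illustration under stated assumptions: the name `eval_char` and its keyword parameters are hypothetical, results are returned as the strings "in", "not-in" and "unknown", and Python's `re` module stands in for a W3C XML Schema Part 2 regular expression engine, so only expressions common to both dialects (such as simple character classes) behave as intended.

```python
import re


def eval_char(x, content=None, kernel=None, hull=None):
    """Evaluate character x against a char element given as regex strings."""
    if content is None and kernel is None and hull is None:
        raise ValueError("a char element needs content, a kernel or a hull")
    if kernel is None and hull is None:
        # Case 1: the character content serves as both kernel and hull.
        kernel = hull = content
    in_kernel = kernel is not None and re.fullmatch(kernel, x) is not None
    in_hull = hull is not None and re.fullmatch(hull, x) is not None
    if in_kernel:
        return "in"
    if hull is None:
        # Case 2: without a hull, not-in never holds.
        return "unknown"
    # Cases 3 and 4: outside the kernel, the hull decides.
    return "unknown" if in_hull else "not-in"


print(eval_char("5", kernel="[a-z]", hull="[a-z0-9]"))
```

In Case 3 the kernel is absent, so `in_kernel` is always false and "in" is never returned, as the rules require.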
Since the semantics of regular expressions depend on the version of the Unicode standard, the author of a
CREPDL schema may specify the intended versions by specifying the minUcsVersion and
maxUcsVersion attributes.
EXAMPLE <char minUcsVersion="4.0" maxUcsVersion="4.0">\p{Nd}</char> represents the set of
characters of the category "Nd" in Unicode Version 4.0.
NOTE 2 It is not guaranteed that every version between these two attribute values specifies the same properties for
every character. However, the author is assumed to accept the discrepancies.
If the CREPDL processor cannot use some version between these two attribute values, it should report an
error and may stop normal processing.
When a char element does not explicitly specify the minUcsVersion attribute, the nearest ancestor
element having this attribute is searched. If it is found, its attribute value is used. If not found, there is no lower
bound on Unicode versions. The same applies to maxUcsVersion.
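The nearest-ancestor lookup described above can be sketched with Python's standard `xml.etree.ElementTree`. The helper names `build_parent_map` and `effective_version` are hypothetical, and the example schema is invented for illustration. Surrounding whitespace is stripped because attribute values may carry leading or trailing whitespace; a result of `None` means no bound applies.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def effective_version(elem, attr, parents):
    """Return attr from elem or its nearest ancestor, or None if absent."""
    node = elem
    while node is not None:
        value = node.get(attr)
        if value is not None:
            return value.strip()
        node = parents.get(node)
    return None


root = ET.fromstring(
    '<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/1.0"'
    ' minUcsVersion=" 3.0 ">'
    '<char maxUcsVersion="5.0">[a-z]</char>'
    '<char minUcsVersion="4.0">[0-9]</char>'
    "</union>"
)
parents = build_parent_map(root)
first_char, second_char = list(root)
print(effective_version(first_char, "minUcsVersion", parents))
print(effective_version(second_char, "maxUcsVersion", parents))
```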
7.3 union
First, define the semantics of union elements <union>A B</union>, which contain two child elements A and
B. A character is in the union repertoire described by this element if and only if it is in the one described by A
or the one described by B. It is not in the union repertoire if and only if it is in neither the one described by A
nor the one described by B.
⎯ in(x, <union>A B</union>) when in(x, A) or in(x, B).
⎯ not-in(x, <union>A B</union>) when not-in(x, A) and not-in(x, B).
⎯ unknown(x, <union>A B</union>), otherwise.
When a union element has one and only one child element, the semantics shall be the same as that of the
child element. When a union element has more than two child elements, the semantics shall be the same as
that of <union>A B</union> where A is the first child and B is the union of the other child elements.
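The union rules amount to a small three-valued combinator. The following Python sketch assumes that the result for each child has already been computed as one of the strings "in", "not-in" or "unknown"; the names `union2` and `union` are hypothetical.

```python
IN, NOT_IN, UNKNOWN = "in", "not-in", "unknown"


def union2(a, b):
    """Combine the results of two children of a union element."""
    if a == IN or b == IN:
        return IN
    if a == NOT_IN and b == NOT_IN:
        return NOT_IN
    return UNKNOWN


def union(results):
    """Combine one or more child results, folding from the first child."""
    first, *rest = results
    # One child: its own result. Otherwise: first child with the union of the rest.
    return first if not rest else union2(first, union(rest))


print(union([NOT_IN, UNKNOWN, IN]))
```

The recursion mirrors the rule for more than two children: the first child is combined with the union of the remaining children.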
7.4 intersection
First, define the semantics of intersection elements <intersection>A B</intersection>, which contain
two child elements A and B. A character is in the repertoire described by this intersection element if and only if
it is in the one described by A and it is in the one described by B. It is not in this intersection repertoire if and
only if it is not in the one described by A or it is not in the one described by B.
⎯ in(x, <intersection>A B</intersection>) when in(x, A) and in(x, B).
⎯ not-in(x, <intersection>A B</intersection>) when not-in(x, A) or not-in(x, B).
⎯ unknown(x, <intersection>A B</intersection>), otherwise.
When an intersection element has one and only one child element, the semantics shall be the same as
that of the child element. When an intersection element has more than two child elements, the semantics
shall be the same as that of <intersection>A B</intersection> where A is the first child and B is the
intersection of the other child elements.
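The intersection rules can be sketched in the same three-valued style. As before, this assumes each child has already been evaluated to "in", "not-in" or "unknown"; the function names `intersection2` and `intersection` are hypothetical.

```python
IN, NOT_IN, UNKNOWN = "in", "not-in", "unknown"


def intersection2(a, b):
    """Combine the results of two children of an intersection element."""
    if a == IN and b == IN:
        return IN
    if a == NOT_IN or b == NOT_IN:
        return NOT_IN
    return UNKNOWN


def intersection(results):
    """Combine one or more child results, folding from the first child."""
    first, *rest = results
    # One child: its own result. Otherwise: first child with the
    # intersection of the remaining children.
    return first if not rest else intersection2(first, intersection(rest))


print(intersection([IN, UNKNOWN, IN]))
```

A single "not-in" child is enough to exclude the character, whatever the other children report.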
7.5 difference
First, define the semantics of difference elements <difference>A B</difference>, which contain two
child elements A and B. A character is in the repertoire described by this difference element if and only if it is
in the one described by A and it is not in the one described by B. It is not in this difference repertoire if and
only if either it is not in the one described by A or it is in the one described by B.
⎯ in(x, <difference>A B</difference>) when in(x, A) and not-in(x, B).
⎯ not-in(x, <difference>A B</difference>) when not-in(x, A) or in(x, B).
⎯ unknown(x, <difference>A B</difference>), otherwise.
When a difference element has one and only one child element, the semantics shall be the same as that
of the child element. When a difference element has more than two child elements, the semantics shall be
the same as that of <difference>A B</difference> where A is the first child and B is the union of the
other child elements.
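The difference rules differ from the other two operators in that the remaining children are first combined by union and then subtracted from the first child. The Python sketch below shows this under the same assumptions as before: child results are the strings "in", "not-in" and "unknown", and the names `union2`, `difference2` and `difference` are hypothetical.

```python
from functools import reduce

IN, NOT_IN, UNKNOWN = "in", "not-in", "unknown"


def union2(a, b):
    """Three-valued union, needed to combine the subtracted children."""
    if a == IN or b == IN:
        return IN
    if a == NOT_IN and b == NOT_IN:
        return NOT_IN
    return UNKNOWN


def difference2(a, b):
    """Combine the results of two children of a difference element."""
    if a == IN and b == NOT_IN:
        return IN
    if a == NOT_IN or b == IN:
        return NOT_IN
    return UNKNOWN


def difference(results):
    """First child minus the union of all remaining children."""
    first, *rest = results
    if not rest:
        return first
    return difference2(first, reduce(union2, rest))


print(difference([IN, NOT_IN, UNKNOWN]))
```

Because "unknown" in any subtracted child leaves the union undecided, a character in the first child can only be confirmed as "in" when every other child reports "not-in".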
7.6 ref
Define the semantics of <ref href="iri"/>, where iri is an IRI as sp
...