Language resource management -- Syntactic annotation framework (SynAF) -- Part 2: XML serialization (ISOTiger)

This document describes an XML-conformant serialization of the ISO 24615-1 meta-model, with the
objective of supporting interoperability across language resources or language processing components
in the domain of syntactic annotations. As an extension of ISO 24615-1, this document is also
coordinated with ISO 24612.

Gestion de ressources linguistiques -- Cadre d'annotation syntaxique (SynAF) -- Partie 2: Sérialisation XML (ISOTiger)

Upravljanje z jezikovnimi viri - Ogrodje za skladenjsko označevanje (SynAF) - 2. del: Serializacija XML (nabor Tiger)

Ta dokument opisuje serializacijo metamodela iz standarda ISO 24615-1, skladno z XML, z namenom podpore interoperabilnosti med jezikovnimi viri ali komponentami za jezikovno obdelavo v domeni skladenjskega označevanja. Ta dokument je, kot razširitev standarda ISO 24615-1, usklajen tudi s standardom ISO 24612.

General Information

Status
Published
Publication Date
23-Aug-2018
Current Stage
6060 - National Implementation/Publication (Adopted Project)
Start Date
30-Jul-2018
Due Date
04-Oct-2018
Completion Date
24-Aug-2018

Buy Standard

Standard
SIST ISO 24615-2:2018
English language
17 pages
sale 10% off
Preview
sale 10% off
Preview

e-Library read for
1 day
Standard
ISO 24615-2:2018 - Language resource management -- Syntactic annotation framework (SynAF)
English language
12 pages
sale 15% off
Preview
sale 15% off
Preview
Standard
SIST ISO 24615-2:2018
English language
17 pages
sale 10% off
Preview
sale 10% off
Preview

e-Library read for
1 day

Standards Content (sample)

SLOVENSKI STANDARD
SIST ISO 24615-2:2018
01-september-2018
Upravljanje z jezikovnimi viri - Ogrodje za skladenjsko označevanje (SynAF) - 2.
del: Serializacija XML (nabor Tiger)

Language resource management -- Syntactic annotation framework (SynAF) -- Part 2:

XML serialization (ISOTiger)

Gestion de ressources linguistiques -- Cadre d'annotation syntaxique (SynAF) -- Partie 2:

Sérialisation XML (ISOTiger)
Ta slovenski standard je istoveten z: ISO 24615-2:2018
ICS:
01.020 Terminologija (načela in Terminology (principles and
koordinacija) coordination)
01.140.20 Informacijske vede Information sciences
35.240.30 Uporabniške rešitve IT v IT applications in information,
informatiki, dokumentiranju in documentation and
založništvu publishing
SIST ISO 24615-2:2018 en,fr,de

2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.

---------------------- Page: 1 ----------------------
SIST ISO 24615-2:2018
---------------------- Page: 2 ----------------------
SIST ISO 24615-2:2018
INTERNATIONAL ISO
STANDARD 24615-2
First edition
2018-02
Language resource management —
Syntactic annotation framework
(SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
Gestion de ressources linguistiques — Cadre d'annotation syntaxique
(SynAF) —
Partie 2: Sérialisation XML (vocabulaire Tiger)
Reference number
ISO 24615-2:2018(E)
ISO 2018
---------------------- Page: 3 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2018

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may

be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting

on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address

below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii © ISO 2018 – All rights reserved
---------------------- Page: 4 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Contents Page

Foreword ........................................................................................................................................................................................................................................iv

Introduction ..................................................................................................................................................................................................................................v

1 Scope ................................................................................................................................................................................................................................. 1

2 Normative references ...................................................................................................................................................................................... 1

3 Terms and definitions ..................................................................................................................................................................................... 1

4 Graph structure and meta-model ....................................................................................................................................................... 2

5 Meta-model objects in XML serialization ................................................................................................................................... 3

6 Primary data and terminal representation .............................................................................................................................. 4

6.1 General ........................................................................................................................................................................................................... 4

6.2 Sequential representation ............................................................................................................................................................. 4

6.3 Standoff representation .................................................................................................................................................................. 5

6.4 Typing of nodes and edges ........................................................................................................................................................... 6

7 Annotations................................................................................................................................................................................................................ 6

7.1 General ........................................................................................................................................................................................................... 6

7.2 Domain declaration ............................................................................................................................................................................. 7

7.3 Declaring annotations in an external file ......................................................................................................................... 7

7.4 Feature values .......................................................................................................................................................................................... 8

7.5 Data category references ................................................................................................................................................................ 8

7.6 Using the @type attribute ............................................................................................................................................................ 9

8 Corpus structure ................................................................................................................................................................................................10

Bibliography .............................................................................................................................................................................................................................12

© ISO 2018 – All rights reserved iii
---------------------- Page: 5 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards

bodies (ISO member bodies). The work of preparing International Standards is normally carried out

through ISO technical committees. Each member body interested in a subject for which a technical

committee has been established has the right to be represented on that committee. International

organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.

ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of

electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are

described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the

different types of ISO documents should be noted. This document was drafted in accordance with the

editorial rules of the ISO/IEC Directives, Part 2 (see www .iso .org/ directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of

patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of

any patent rights identified during the development of the document will be in the Introduction and/or

on the ISO list of patent declarations received (see www .iso .org/ patents).

Any trade name used in this document is information given for the convenience of users and does not

constitute an endorsement.

For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and

expressions related to conformity assessment, as well as information about ISO's adherence to the

World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following

URL: www .iso .org/ iso/ foreword .html.

This document was prepared by Technical Committee ISO/TC 37, Terminology and other language and

content resources, Subcommittee SC 4, Language resource management.
A list of all parts in the ISO 24615 series can be found on the ISO website.
iv © ISO 2018 – All rights reserved
---------------------- Page: 6 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Introduction

The need for standardization of syntactic annotation was recognized and addressed in detail with the

publication of ISO 24615-1:2014. As a result of the work on ISO 24615-1:2014, it was anticipated that

such a reference model for syntactic annotation should be associated with a concrete XML serialization

in order to meet the specific needs of such applications as syntactic parsers or syntactic treebanks,

where representations have to be exchanged and reused. Furthermore, such a serialization should be

independent from the theoretical orientation and specific details of any specific annotation scheme.

This document answers this need on the basis of the seminal work carried out on the TigerXML

[3]

format . This starting point was chosen as a reference because it is widely used as a de facto standard

for unrelated XML treebanks, with the advantages in terms of interoperability offered by its XML-

based representations, as opposed to other frequently used formats, in particular, the Penn Treebank

[5] [4]

bracketing format or the CoNLL format for dependency structures (see Reference ).

The document is designed to complement ISO 24615-1:2014 and to coordinate closely with ISO 24610,

ISO 24611, ISO 24612 and ISO 12620.

This document therefore extends ISO 24615-1:2014 with an XML model based upon the Tiger XML

vocabulary for the interchange of syntactically annotated data which is both standardized as well

as language- and theory-independent. The proposed format directly instantiates all features of the

meta-model defined in ISO 24615-1 and defines concrete serialized interfaces to the complementary

ISO 24611 and ISO 12620, which provides the background for the DatCatInfo data category registry.

© ISO 2018 – All rights reserved v
---------------------- Page: 7 ----------------------
SIST ISO 24615-2:2018
---------------------- Page: 8 ----------------------
SIST ISO 24615-2:2018
INTERNATIONAL STANDARD ISO 24615-2:2018(E)
Language resource management — Syntactic annotation
framework (SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
1 Scope

This document describes an XML-conformant serialization of the ISO 24615-1 meta-model, with the

objective of supporting interoperability across language resources or language processing components

in the domain of syntactic annotations. As an extension of ISO 24615-1, this document is also

coordinated with ISO 24612.
2 Normative references

The following documents are referred to in the text in such a way that some or all of their content

constitutes requirements of this document. For dated references, only the edition cited applies. For

undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 12620, Terminology and other language and content resources — Data category specifications

ISO 24610 (all parts), Language resource management — Feature structures

ISO 24611, Language resource management — Morpho-syntactic annotation framework (MAF)

ISO 24612, Language resource management — Linguistic annotation framework (LAF)

ISO 24615-1, Language resource management — Syntactic annotation framework (SynAF) — Part 1:

Syntactic model
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 12620, ISO 24610 (all parts),

ISO 24611, ISO 24612 and ISO 24615-1 and the following apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

— IEC Electropedia: available at http:// www .electropedia .org/
— ISO Online browsing platform: available at http:// www .iso .org/ obp
3.1
domain
class of elements to which a certain set of labels (3.2) can be assigned

Note 1 to entry: Domains can refer generally to the set of all edges, terminal nodes or non-terminal nodes.

3.2
label

unit of annotation consisting of the name of a feature and a value, which together can be applied to

appropriate model elements and add arbitrary feature-value annotations to such elements

© ISO 2018 – All rights reserved 1
---------------------- Page: 9 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
3.3
primary data
initial raw linguistic content that is being encoded
3.4
sequential representation

representation of annotation content where the XML element structure mirrors the sequence of

linguistic objects in the primary source
4 Graph structure and meta-model

In the XML Tiger format, annotations are represented in a graph structure. The graph structure can be

described as G = (V, E, A) with
— a set of nodes V,
— a set of edges E with e = (v ∈ V, v ∈ V) ∈ E,

— a set of annotations A, where an annotation a is defined by a feature-value pair, and

— a function annot: E ∪ V → A.

A graph represents a bundle of interrelated nodes and edges. It is not specified which parts of a primary

text are covered by a single graph, e.g. a sentence, a sub-sentence, a chapter or a whole text. Linguistic

annotations represented by labels can be attached to nodes as well as edges.
The meta-model (see Figure 1) consists of three parts:
a) the structural organization of corpora and associated meta-data;
b) an annotation tagset definition;
c) the linguistic annotation graph.

The structural organization of corpora is represented by a recursively defined corpus element (Corpus)

and its corresponding metadata (Meta). A corpus can contain subcorpora.

The annotation tagset definition is represented by a list of categories (Feature) containing the name of

a category (Feature.name) and a list of category values (FeatureValue). Each FeatureValue contains a

string representation of the value (FeatureValue.value). Together, both elements declare a tagset which

is part of a specific corpus object. Such a tagset declaration is derivable, which means that all categories

defined in a supercorpus object can also be used by its subcorpus objects. Further attributes are used

to declare to which types of nodes and edges a category is applicable. Both elements allow reference

to DatCatInfo entries in compliance with ISO 12620 via Unified Resource Identifiers (URIs) in the

attributes Feature.dcrReference and FeatureValue.dcrReference.

The final part of the meta-model, the linguistic annotation graph, defines a set of elements containing

the primary data and the annotation structure covering the primary data. It consists of the graph

element itself (Graph), two classes of syntactic nodes (Terminal and NonTerminal), an edge element

(Edge) and an annotation element (Annotation) realizing the annot-function and therefore referring

to a feature name and its value. Graph is contained within Segment, which is a grouping mechanism

to aggregate a set of syntactic nodes together. Such a group can have linguistic structural semantics,

corresponding usually to a sentence, but possibly also to a line in a manuscript or other meaningful

segments depending on the application and annotation scheme used by a specific project.

A terminal node (in ISO 24615-1 referred to as T_Node) constitutes the point of reference to the primary

data. This can be a direct reference to a text span within the XML document or an indirect reference to

an object outside the model, e.g. an element contained by a MAF file in compliance with ISO 24611, the

Morpho-syntactic annotation framework. A non-terminal node is an inner node, referring directly or

indirectly to a terminal node within the XML document.
2 © ISO 2018 – All rights reserved
---------------------- Page: 10 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)

An edge shall always have a source and a target node. Both of them can be either a terminal or a non-

terminal node.
Figure 1 — Meta-model for the serialization format
5 Meta-model objects in XML serialization

The names of the XML elements and attributes follow those of the corresponding meta-model elements

and attributes. All XML elements belong to the following namespace: http:// www .clarin .eu/ standards/

ns/ synaf. In this document, unless specified otherwise, all XML elements will be assumed to belong to

this namespace.

Terminal nodes in the XML serialization are represented by the element and are nested together in

a element.

A segment node is represented by the element. The element shall contain one or more

elements, which may be used to express possible multiple annotation graphs alternating within a single

segment or to represent a sequence of subgraphs.

A non-terminal node is represented by the element and is nested in the

element.
© ISO 2018 – All rights reserved 3
---------------------- Page: 11 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)

An edge is represented by the element and its source is given by the surrounding node element.

An element is therefore always the child element of a or an element. The target is

given by the @target attribute and shall refer to a node within the same document.

The example shows the representation of the graph structure.
EXAMPLE Graph structure of a syntactic annotation.













6 Primary data and terminal representation
6.1 General

Terminal nodes and the primary data to which they refer can be defined in two different ways. The

primary data can either be included within the XML document (sequential representation) or they

can be specified in another file and referred to externally (standoff representation). The sequential

representation makes it possible to have all data in just one XML document, whereas the standoff

representation makes it possible to refer to other formats, for instance, a MAF file containing tokens

and wordForms.

Both options are possible and the decision is largely made based on the needs of the corpus project.

6.2 Sequential representation

In a sequential representation, primary data is represented sequentially directly within the

elements of each element in the document. As a result, all the primary data of a document is

grouped together in one single XML document. This improves human readability and basic machine

processing, but reduces representational flexibility. Tokens of the primary data are encoded as values

of the @word attribute of the element as illustrated in Example 1.
EXAMPLE 1 Simple sequential representation of word forms.

INTERNATIONAL ISO
STANDARD 24615-2
First edition
2018-02
Language resource management —
Syntactic annotation framework
(SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
Gestion de ressources linguistiques — Cadre d'annotation syntaxique
(SynAF) —
Partie 2: Sérialisation XML (vocabulaire Tiger)
Reference number
ISO 24615-2:2018(E)
ISO 2018
---------------------- Page: 1 ----------------------
ISO 24615-2:2018(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2018

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may

be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting

on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address

below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii © ISO 2018 – All rights reserved
---------------------- Page: 2 ----------------------
ISO 24615-2:2018(E)
Contents Page

Foreword ........................................................................................................................................................................................................................................iv

Introduction ..................................................................................................................................................................................................................................v

1 Scope ................................................................................................................................................................................................................................. 1

2 Normative references ...................................................................................................................................................................................... 1

3 Terms and definitions ..................................................................................................................................................................................... 1

4 Graph structure and meta-model ....................................................................................................................................................... 2

5 Meta-model objects in XML serialization ................................................................................................................................... 3

6 Primary data and terminal representation .............................................................................................................................. 4

6.1 General ........................................................................................................................................................................................................... 4

6.2 Sequential representation ............................................................................................................................................................. 4

6.3 Standoff representation .................................................................................................................................................................. 5

6.4 Typing of nodes and edges ........................................................................................................................................................... 6

7 Annotations................................................................................................................................................................................................................ 6

7.1 General ........................................................................................................................................................................................................... 6

7.2 Domain declaration ............................................................................................................................................................................. 7

7.3 Declaring annotations in an external file ......................................................................................................................... 7

7.4 Feature values .......................................................................................................................................................................................... 8

7.5 Data category references ................................................................................................................................................................ 8

7.6 Using the @type attribute ............................................................................................................................................................ 9

8 Corpus structure ................................................................................................................................................................................................10

Bibliography .............................................................................................................................................................................................................................12

© ISO 2018 – All rights reserved iii
---------------------- Page: 3 ----------------------
ISO 24615-2:2018(E)
Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards

bodies (ISO member bodies). The work of preparing International Standards is normally carried out

through ISO technical committees. Each member body interested in a subject for which a technical

committee has been established has the right to be represented on that committee. International

organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.

ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of

electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are

described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the

different types of ISO documents should be noted. This document was drafted in accordance with the

editorial rules of the ISO/IEC Directives, Part 2 (see www .iso .org/ directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of

patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of

any patent rights identified during the development of the document will be in the Introduction and/or

on the ISO list of patent declarations received (see www .iso .org/ patents).

Any trade name used in this document is information given for the convenience of users and does not

constitute an endorsement.

For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and

expressions related to conformity assessment, as well as information about ISO's adherence to the

World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following

URL: www .iso .org/ iso/ foreword .html.

This document was prepared by Technical Committee ISO/TC 37, Terminology and other language and

content resources, Subcommittee SC 4, Language resource management.
A list of all parts in the ISO 24615 series can be found on the ISO website.
iv © ISO 2018 – All rights reserved
---------------------- Page: 4 ----------------------
ISO 24615-2:2018(E)
Introduction

The need for standardization of syntactic annotation was recognized and addressed in detail with the

publication of ISO 24615-1:2014. As a result of the work on ISO 24615-1:2014, it was anticipated that

such a reference model for syntactic annotation should be associated with a concrete XML serialization

in order to meet the specific needs of such applications as syntactic parsers or syntactic treebanks,

where representations have to be exchanged and reused. Furthermore, such a serialization should be

independent from the theoretical orientation and specific details of any specific annotation scheme.

This document answers this need on the basis of the seminal work carried out on the TigerXML

[3]

format . This starting point was chosen as a reference because it is widely used as a de facto standard

for unrelated XML treebanks, with the advantages in terms of interoperability offered by its XML-

based representations, as opposed to other frequently used formats, in particular, the Penn Treebank

[5] [4]

bracketing format or the CoNLL format for dependency structures (see Reference ).

The document is designed to complement ISO 24615-1:2014 and to coordinate closely with ISO 24610,

ISO 24611, ISO 24612 and ISO 12620.

This document therefore extends ISO 24615-1:2014 with an XML model based upon the Tiger XML

vocabulary for the interchange of syntactically annotated data which is both standardized as well

as language- and theory-independent. The proposed format directly instantiates all features of the

meta-model defined in ISO 24615-1 and defines concrete serialized interfaces to the complementary

ISO 24611 and ISO 12620, which provides the background for the DatCatInfo data category registry.

© ISO 2018 – All rights reserved v
---------------------- Page: 5 ----------------------
INTERNATIONAL STANDARD ISO 24615-2:2018(E)
Language resource management — Syntactic annotation
framework (SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
1 Scope

This document describes an XML-conformant serialization of the ISO 24615-1 meta-model, with the

objective of supporting interoperability across language resources or language processing components

in the domain of syntactic annotations. As an extension of ISO 24615-1, this document is also

coordinated with ISO 24612.
2 Normative references

The following documents are referred to in the text in such a way that some or all of their content

constitutes requirements of this document. For dated references, only the edition cited applies. For

undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 12620, Terminology and other language and content resources — Data category specifications

ISO 24610 (all parts), Language resource management — Feature structures

ISO 24611, Language resource management — Morpho-syntactic annotation framework (MAF)

ISO 24612, Language resource management — Linguistic annotation framework (LAF)

ISO 24615-1, Language resource management — Syntactic annotation framework (SynAF) — Part 1:

Syntactic model
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 12620, ISO 24610 (all parts),

ISO 24611, ISO 24612 and ISO 24615-1 and the following apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

— IEC Electropedia: available at http:// www .electropedia .org/
— ISO Online browsing platform: available at http:// www .iso .org/ obp
3.1
domain
class of elements to which a certain set of labels (3.2) can be assigned

Note 1 to entry: Domains can refer generally to the set of all edges, terminal nodes or non-terminal nodes.

3.2
label

unit of annotation consisting of the name of a feature and a value, which together can be applied to

appropriate model elements and add arbitrary feature-value annotations to such elements

© ISO 2018 – All rights reserved 1
---------------------- Page: 6 ----------------------
ISO 24615-2:2018(E)
3.3
primary data
initial raw linguistic content that is being encoded
3.4
sequential representation

representation of annotation content where the XML element structure mirrors the sequence of

linguistic objects in the primary source
4 Graph structure and meta-model

In the XML Tiger format, annotations are represented in a graph structure. The graph structure can be

described as G = (V, E, A) with
— a set of nodes V,
— a set of edges E with e = (v ∈ V, v ∈ V) ∈ E,

— a set of annotations A, where an annotation a is defined by a feature-value pair, and

— a function annot: E ∪ V → A.

A graph represents a bundle of interrelated nodes and edges. It is not specified which parts of a primary

text are covered by a single graph, e.g. a sentence, a sub-sentence, a chapter or a whole text. Linguistic

annotations represented by labels can be attached to nodes as well as edges.
The meta-model (see Figure 1) consists of three parts:
a) the structural organization of corpora and associated meta-data;
b) an annotation tagset definition;
c) the linguistic annotation graph.

The structural organization of corpora is represented by a recursively defined corpus element (Corpus)

and its corresponding metadata (Meta). A corpus can contain subcorpora.

The annotation tagset definition is represented by a list of categories (Feature) containing the name of

a category (Feature.name) and a list of category values (FeatureValue). Each FeatureValue contains a

string representation of the value (FeatureValue.value). Together, both elements declare a tagset which

is part of a specific corpus object. Such a tagset declaration is derivable, which means that all categories

defined in a supercorpus object can also be used by its subcorpus objects. Further attributes are used

to declare to which types of nodes and edges a category is applicable. Both elements allow reference

to DatCatInfo entries in compliance with ISO 12620 via Unified Resource Identifiers (URIs) in the

attributes Feature.dcrReference and FeatureValue.dcrReference.

The final part of the meta-model, the linguistic annotation graph, defines a set of elements containing

the primary data and the annotation structure covering the primary data. It consists of the graph

element itself (Graph), two classes of syntactic nodes (Terminal and NonTerminal), an edge element

(Edge) and an annotation element (Annotation) realizing the annot-function and therefore referring

to a feature name and its value. Graph is contained within Segment, which is a grouping mechanism

to aggregate a set of syntactic nodes together. Such a group can have linguistic structural semantics,

corresponding usually to a sentence, but possibly also to a line in a manuscript or other meaningful

segments depending on the application and annotation scheme used by a specific project.

A terminal node (in ISO 24615-1 referred to as T_Node) constitutes the point of reference to the primary

data. This can be a direct reference to a text span within the XML document or an indirect reference to

an object outside the model, e.g. an element contained by a MAF file in compliance with ISO 24611, the

Morpho-syntactic annotation framework. A non-terminal node is an inner node, referring directly or

indirectly to a terminal node within the XML document.
2 © ISO 2018 – All rights reserved
---------------------- Page: 7 ----------------------
ISO 24615-2:2018(E)

An edge shall always have a source and a target node. Both of them can be either a terminal or a non-

terminal node.
Figure 1 — Meta-model for the serialization format
5 Meta-model objects in XML serialization

The names of the XML elements and attributes follow those of the corresponding meta-model elements

and attributes. All XML elements belong to the following namespace: http:// www .clarin .eu/ standards/

ns/ synaf. In this document, unless specified otherwise, all XML elements will be assumed to belong to

this namespace.

Terminal nodes in the XML serialization are represented by the element and are nested together in

a element.

A segment node is represented by the element. The element shall contain one or more

elements, which may be used to express possible multiple annotation graphs alternating within a single

segment or to represent a sequence of subgraphs.

A non-terminal node is represented by the element and is nested in the

element.
© ISO 2018 – All rights reserved 3
---------------------- Page: 8 ----------------------
ISO 24615-2:2018(E)

An edge is represented by the element and its source is given by the surrounding node element.

An element is therefore always the child element of a or an element. The target is

given by the @target attribute and shall refer to a node within the same document.

The example shows the representation of the graph structure.
EXAMPLE Graph structure of a syntactic annotation.













6 Primary data and terminal representation
6.1 General

Terminal nodes and the primary data to which they refer can be defined in two different ways. The

primary data can either be included within the XML document (sequential representation) or they

can be specified in another file and referred to externally (standoff representation). The sequential

representation makes it possible to have all data in just one XML document, whereas the standoff

representation makes it possible to refer to other formats, for instance, a MAF file containing tokens

and wordForms.

Both options are possible and the decision is largely made based on the needs of the corpus project.

6.2 Sequential representation

In a sequential representation, primary data is represented sequentially directly within the

elements of each element in the document. As a result, all the primary data of a document is

grouped together in one single XML document. This improves human readability and basic machine

processing, but reduces representational flexibility. Tokens of the primary data are encoded as values

of the @word attribute of the element as illustrated in Example 1.
EXAMPLE 1 Simple sequential representation of word forms.

SLOVENSKI STANDARD
SIST ISO 24615-2:2018
01-september-2018
8SUDYOMDQMH]MH]LNRYQLPLYLUL2JURGMH]DVNODGHQMVNRR]QDþHYDQMH 6\Q$) 
GHO6HULDOL]DFLMD;0/ QDERU7LJHU

Language resource management -- Syntactic annotation framework (SynAF) -- Part 2:

XML serialization (ISOTiger)

Gestion de ressources linguistiques -- Cadre d'annotation syntaxique (SynAF) -- Partie 2:

Sérialisation XML (ISOTiger)
Ta slovenski standard je istoveten z: ISO 24615-2:2018
ICS:
01.020 7HUPLQRORJLMD QDþHODLQ Terminology (principles and
NRRUGLQDFLMD coordination)
35.060 Jeziki, ki se uporabljajo v Languages used in
informacijski tehniki in information technology
tehnologiji
SIST ISO 24615-2:2018 en,fr,de

2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.

---------------------- Page: 1 ----------------------
SIST ISO 24615-2:2018
---------------------- Page: 2 ----------------------
SIST ISO 24615-2:2018
INTERNATIONAL ISO
STANDARD 24615-2
First edition
2018-02
Language resource management —
Syntactic annotation framework
(SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
Gestion de ressources linguistiques — Cadre d'annotation syntaxique
(SynAF) —
Partie 2: Sérialisation XML (vocabulaire Tiger)
Reference number
ISO 24615-2:2018(E)
ISO 2018
---------------------- Page: 3 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2018

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may

be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting

on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address

below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii © ISO 2018 – All rights reserved
---------------------- Page: 4 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Contents Page

Foreword ........................................................................................................................................................................................................................................iv

Introduction ..................................................................................................................................................................................................................................v

1 Scope ................................................................................................................................................................................................................................. 1

2 Normative references ...................................................................................................................................................................................... 1

3 Terms and definitions ..................................................................................................................................................................................... 1

4 Graph structure and meta-model ....................................................................................................................................................... 2

5 Meta-model objects in XML serialization ................................................................................................................................... 3

6 Primary data and terminal representation .............................................................................................................................. 4

6.1 General ........................................................................................................................................................................................................... 4

6.2 Sequential representation ............................................................................................................................................................. 4

6.3 Standoff representation .................................................................................................................................................................. 5

6.4 Typing of nodes and edges ........................................................................................................................................................... 6

7 Annotations................................................................................................................................................................................................................ 6

7.1 General ........................................................................................................................................................................................................... 6

7.2 Domain declaration ............................................................................................................................................................................. 7

7.3 Declaring annotations in an external file ......................................................................................................................... 7

7.4 Feature values .......................................................................................................................................................................................... 8

7.5 Data category references ................................................................................................................................................................ 8

7.6 Using the @type attribute ............................................................................................................................................................ 9

8 Corpus structure ................................................................................................................................................................................................10

Bibliography .............................................................................................................................................................................................................................12

© ISO 2018 – All rights reserved iii
---------------------- Page: 5 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards

bodies (ISO member bodies). The work of preparing International Standards is normally carried out

through ISO technical committees. Each member body interested in a subject for which a technical

committee has been established has the right to be represented on that committee. International

organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.

ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of

electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are

described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the

different types of ISO documents should be noted. This document was drafted in accordance with the

editorial rules of the ISO/IEC Directives, Part 2 (see www .iso .org/ directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of

patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of

any patent rights identified during the development of the document will be in the Introduction and/or

on the ISO list of patent declarations received (see www .iso .org/ patents).

Any trade name used in this document is information given for the convenience of users and does not

constitute an endorsement.

For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and

expressions related to conformity assessment, as well as information about ISO's adherence to the

World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following

URL: www .iso .org/ iso/ foreword .html.

This document was prepared by Technical Committee ISO/TC 37, Terminology and other language and

content resources, Subcommittee SC 4, Language resource management.
A list of all parts in the ISO 24615 series can be found on the ISO website.
iv © ISO 2018 – All rights reserved
---------------------- Page: 6 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
Introduction

The need for standardization of syntactic annotation was recognized and addressed in detail with the

publication of ISO 24615-1:2014. As a result of the work on ISO 24615-1:2014, it was anticipated that

such a reference model for syntactic annotation should be associated with a concrete XML serialization

in order to meet the specific needs of such applications as syntactic parsers or syntactic treebanks,

where representations have to be exchanged and reused. Furthermore, such a serialization should be

independent from the theoretical orientation and specific details of any specific annotation scheme.

This document answers this need on the basis of the seminal work carried out on the TigerXML

[3]

format . This starting point was chosen as a reference because it is widely used as a de facto standard

for unrelated XML treebanks, with the advantages in terms of interoperability offered by its XML-

based representations, as opposed to other frequently used formats, in particular, the Penn Treebank

[5] [4]

bracketing format or the CoNLL format for dependency structures (see Reference ).

The document is designed to complement ISO 24615-1:2014 and to coordinate closely with ISO 24610,

ISO 24611, ISO 24612 and ISO 12620.

This document therefore extends ISO 24615-1:2014 with an XML model based upon the Tiger XML

vocabulary for the interchange of syntactically annotated data which is both standardized as well

as language- and theory-independent. The proposed format directly instantiates all features of the

meta-model defined in ISO 24615-1 and defines concrete serialized interfaces to the complementary

ISO 24611 and ISO 12620, which provides the background for the DatCatInfo data category registry.

© ISO 2018 – All rights reserved v
---------------------- Page: 7 ----------------------
SIST ISO 24615-2:2018
---------------------- Page: 8 ----------------------
SIST ISO 24615-2:2018
INTERNATIONAL STANDARD ISO 24615-2:2018(E)
Language resource management — Syntactic annotation
framework (SynAF) —
Part 2:
XML serialization (Tiger vocabulary)
1 Scope

This document describes an XML-conformant serialization of the ISO 24615-1 meta-model, with the

objective of supporting interoperability across language resources or language processing components

in the domain of syntactic annotations. As an extension of ISO 24615-1, this document is also

coordinated with ISO 24612.
2 Normative references

The following documents are referred to in the text in such a way that some or all of their content

constitutes requirements of this document. For dated references, only the edition cited applies. For

undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 12620, Terminology and other language and content resources — Data category specifications

ISO 24610 (all parts), Language resource management — Feature structures

ISO 24611, Language resource management — Morpho-syntactic annotation framework (MAF)

ISO 24612, Language resource management — Linguistic annotation framework (LAF)

ISO 24615-1, Language resource management — Syntactic annotation framework (SynAF) — Part 1:

Syntactic model
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 12620, ISO 24610 (all parts),

ISO 24611, ISO 24612 and ISO 24615-1 and the following apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

— IEC Electropedia: available at http:// www .electropedia .org/
— ISO Online browsing platform: available at http:// www .iso .org/ obp
3.1
domain
class of elements to which a certain set of labels (3.2) can be assigned

Note 1 to entry: Domains can refer generally to the set of all edges, terminal nodes or non-terminal nodes.

3.2
label

unit of annotation consisting of the name of a feature and a value, which together can be applied to

appropriate model elements and add arbitrary feature-value annotations to such elements

© ISO 2018 – All rights reserved 1
---------------------- Page: 9 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)
3.3
primary data
initial raw linguistic content that is being encoded
3.4
sequential representation

representation of annotation content where the XML element structure mirrors the sequence of

linguistic objects in the primary source
4 Graph structure and meta-model

In the XML Tiger format, annotations are represented in a graph structure. The graph structure can be

described as G = (V, E, A) with
— a set of nodes V,
— a set of edges E with e = (v ∈ V, v ∈ V) ∈ E,

— a set of annotations A, where an annotation a is defined by a feature-value pair, and

— a function annot: E ∪ V → A.

A graph represents a bundle of interrelated nodes and edges. It is not specified which parts of a primary

text are covered by a single graph, e.g. a sentence, a sub-sentence, a chapter or a whole text. Linguistic

annotations represented by labels can be attached to nodes as well as edges.
The meta-model (see Figure 1) consists of three parts:
a) the structural organization of corpora and associated meta-data;
b) an annotation tagset definition;
c) the linguistic annotation graph.

The structural organization of corpora is represented by a recursively defined corpus element (Corpus)

and its corresponding metadata (Meta). A corpus can contain subcorpora.

The annotation tagset definition is represented by a list of categories (Feature) containing the name of

a category (Feature.name) and a list of category values (FeatureValue). Each FeatureValue contains a

string representation of the value (FeatureValue.value). Together, both elements declare a tagset which

is part of a specific corpus object. Such a tagset declaration is derivable, which means that all categories

defined in a supercorpus object can also be used by its subcorpus objects. Further attributes are used

to declare to which types of nodes and edges a category is applicable. Both elements allow reference

to DatCatInfo entries in compliance with ISO 12620 via Unified Resource Identifiers (URIs) in the

attributes Feature.dcrReference and FeatureValue.dcrReference.

The final part of the meta-model, the linguistic annotation graph, defines a set of elements containing

the primary data and the annotation structure covering the primary data. It consists of the graph

element itself (Graph), two classes of syntactic nodes (Terminal and NonTerminal), an edge element

(Edge) and an annotation element (Annotation) realizing the annot-function and therefore referring

to a feature name and its value. Graph is contained within Segment, which is a grouping mechanism

to aggregate a set of syntactic nodes together. Such a group can have linguistic structural semantics,

corresponding usually to a sentence, but possibly also to a line in a manuscript or other meaningful

segments depending on the application and annotation scheme used by a specific project.

A terminal node (in ISO 24615-1 referred to as T_Node) constitutes the point of reference to the primary

data. This can be a direct reference to a text span within the XML document or an indirect reference to

an object outside the model, e.g. an element contained by a MAF file in compliance with ISO 24611, the

Morpho-syntactic annotation framework. A non-terminal node is an inner node, referring directly or

indirectly to a terminal node within the XML document.
2 © ISO 2018 – All rights reserved
---------------------- Page: 10 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)

An edge shall always have a source and a target node. Both of them can be either a terminal or a non-

terminal node.
Figure 1 — Meta-model for the serialization format
5 Meta-model objects in XML serialization

The names of the XML elements and attributes follow those of the corresponding meta-model elements

and attributes. All XML elements belong to the following namespace: http:// www .clarin .eu/ standards/

ns/ synaf. In this document, unless specified otherwise, all XML elements will be assumed to belong to

this namespace.

Terminal nodes in the XML serialization are represented by the element and are nested together in

a element.

A segment node is represented by the element. The element shall contain one or more

elements, which may be used to express possible multiple annotation graphs alternating within a single

segment or to represent a sequence of subgraphs.

A non-terminal node is represented by the element and is nested in the

element.
© ISO 2018 – All rights reserved 3
---------------------- Page: 11 ----------------------
SIST ISO 24615-2:2018
ISO 24615-2:2018(E)

An edge is represented by the element and its source is given by the surrounding node element.

An element is therefore always the child element of a or an element. The target is

given by the @target attribute and shall refer to a node within the same document.

The example shows the representation of the graph structure.
EXAMPLE Graph structure of a syntactic annotation.













6 Primary data and terminal representation
6.1 General

Terminal nodes and the primary data to which they refer can be defined in two different ways. The

primary data can either be included within the XML document (sequential representation) or they

can be specified in another file and referred to externally (standoff representation). The sequential

representation makes it possible to have all data in just one XML document, whereas the standoff

representation makes it possible to refer to other formats, for instance, a MAF file containing tokens

and wordForms.

Both options are possible and the decision is largely made based on the needs of the corpus project.

6.2 Sequential representation

In a sequential representation, primary data is represented sequentially directly within the

elements of each element in the document. As a result, all the primary data of a document is

grouped together in one single XML document. This improves human readability and basic machine

processing, but reduces representational flexibility. Tokens of the primary data are encoded as values

of the @word attribute of the element as illustrated in Example 1.
EXAMPLE 1 Simple sequential representation of word forms.

Questions, Comments and Discussion

Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.