Language resource management -- Morpho-syntactic annotation framework (MAF)

This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the Text Encoding Initiative (TEI).

Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)

L'ISO 24611:2012 fournit un cadre pour la représentation des annotations des mots-formes dans les textes; ces annotations concernent les segments, leurs relations avec les unités lexicales, et leurs propriétés morphosyntaxiques.
Elle présente un métamodèle pour l'annotation morphosyntaxique qui référence les catégories de données dans le registre des catégories de données ISOCat (DCR tel que défini dans l'ISO 12620). Elle décrit aussi une sérialisation XML pour l'annotation morphosyntaxique, avec les équivalences des lignes directrices de la TEI (Text Encoding Initiative).

Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF)

Ta mednarodni standard zagotavlja ogrodje za predstavitev označevanja besednih oblik v besedilih; to označevanje vključuje žetone, njihov odnos z leksikalnimi enotami in njihove oblikoskladenjske lastnosti. Opisuje metamodel za oblikoskladenjsko označevanje, ki je povezan s sklicevanjem na podatkovne kategorije iz registra kategorij podatkov ISOCat (kot ga določa ISO 12620). Prav tako opisuje serializacijo oblikoskladenjskega označevanja XML z upoštevanjem smernic TEI (iniciativa za zapis besedil).

General Information

Status
Published
Publication Date
06-Jun-2013
Current Stage
6060 - National Implementation/Publication (Adopted Project)
Start Date
31-May-2013
Due Date
05-Aug-2013
Completion Date
07-Jun-2013

Buy Standard

Standard
SIST ISO 24611:2013 (colour on PDF pages 15, 16, 33)
English language
65 pages
Standard
ISO 24611:2012 - Language resource management -- Morpho-syntactic annotation framework (MAF)
English language
58 pages
Standard
ISO 24611:2012 - Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
French language
63 pages

Standards Content (sample)

SLOVENIAN STANDARD
SIST ISO 24611:2013
01-July-2013
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF)
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24611:2013 en,fr,de

2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.

INTERNATIONAL STANDARD ISO 24611
First edition, 2012-11-01
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
Reference number: ISO 24611:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.
ISO copyright office
Case postale 56  CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction

ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the representation of terminological data [Romary, 2001], through which linguistic data models are seen as the combination of a generic data pattern (a meta-model) and a selection of data categories that refine it by providing the descriptors for a specific annotation level. Such models are defined independently of any specific formats, and ensure that an implementer has the necessary conceptual instrument with which to design and compare formats with regard to their degrees of interoperability.

One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable semantics for the various descriptors used, either in the form of formal features and feature values, or directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across various annotation schemas and encoding applications, such a semantics should be implemented as a centralised registry of concepts: we will henceforth refer to these as data categories. As such, data categories should satisfy the following constraints.

- From a technical point of view, they must provide unique, stable references (implemented as persistent identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they are in fact defined in relation to the same data categories (as feature and feature value).

- From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full-text description of the meaning of the descriptor with the expression of specific constraints that bear upon the category.
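
The equivalence criterion in the first constraint can be illustrated with a minimal Python sketch. It assumes two schemas that use different surface labels but point to the same data categories. The identifier strings, label names and helper functions are hypothetical and chosen only for illustration; they are not actual ISOCat persistent identifiers or part of this International Standard.

```python
from __future__ import annotations

# Hypothetical persistent identifiers (assumption: real registry PIDs are not reproduced here).
POS = "pid:example/partOfSpeech"
NOUN = "pid:example/noun"
NUMBER = "pid:example/grammaticalNumber"
SINGULAR = "pid:example/singular"

# Two independent schemas: each maps its own labels to shared data categories.
schema_a = {"pos": POS, "N": NOUN, "num": NUMBER, "sg": SINGULAR}
schema_b = {"cat": POS, "noun": NOUN, "number": NUMBER, "singular": SINGULAR}


def resolve(annotation: dict[str, str], schema: dict[str, str]) -> frozenset:
    """Replace each (feature, value) label pair by its pair of data category identifiers."""
    return frozenset((schema[f], schema[v]) for f, v in annotation.items())


def equivalent(ann_x, schema_x, ann_y, schema_y) -> bool:
    """Two annotations are equivalent when they resolve to the same data categories."""
    return resolve(ann_x, schema_x) == resolve(ann_y, schema_y)


ann_a = {"pos": "N", "num": "sg"}
ann_b = {"cat": "noun", "number": "singular"}
```

Here `equivalent(ann_a, schema_a, ann_b, schema_b)` holds even though the two annotations share no label, because the comparison is made on the identifiers rather than on the labels.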

In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories, encompassing all domains of language resources. This initiative, described in ISO 12620, has led to the implementation of an online environment providing access to all data categories that have been standardized in the context of the various language resource-related activities within ISO, or specifically as part of the maintenance of the data category registry. It also provides access to the various data categories that individual language technology practitioners have defined in the course of their own work and decided to share with the community.

The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are easily inserted and reused without the need for any strong consistency check with the registry at large. The following basic constraints are nonetheless part of the data category model, as defined in ISO 12620:

- simple generic-specific relations, when these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ makes it possible to compare morpho-syntactic annotations based on different descriptive levels of granularity;

- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible value of so-called complex data categories. For instance, it can be used to record that possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]), could be a subset of {/masculine/, /feminine/ and /neutral/};

- language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, it is possible to express explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
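
The three kinds of constraint listed above can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the class `DataCategory`, its field names, the helper functions and the language code "fr" are hypothetical, and the category names are the ones quoted in the text, not an export of the actual registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataCategory:
    name: str
    broader: Optional[str] = None                     # generic-specific relation
    domain: frozenset = frozenset()                   # conceptual domain of a complex category
    by_language: dict = field(default_factory=dict)   # language-specific restrictions


# Hypothetical registry content, taken from the examples in the text.
REGISTRY = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        domain=frozenset({"masculine", "feminine", "neutral"}),
        by_language={"fr": frozenset({"masculine", "feminine"})},
    ),
}


def is_a(name: str, ancestor: str) -> bool:
    """Follow generic-specific links upwards until the ancestor is found or the chain ends."""
    current: Optional[str] = name
    while current is not None:
        if current == ancestor:
            return True
        current = REGISTRY[current].broader
    return False


def allowed_values(name: str, lang: Optional[str] = None) -> frozenset:
    """Return the language-specific restriction if one exists, else the full domain."""
    category = REGISTRY[name]
    return category.by_language.get(lang, category.domain)
```

With this sketch, `is_a("properNoun", "noun")` is true, and `allowed_values("grammaticalGender", "fr")` returns only the masculine and feminine values, while a language without a recorded restriction falls back to the full conceptual domain.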


This International Standard provides a comprehensive framework for the representation of morpho-syntactic (also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical abstraction level over language data (textual or spoken) and can vary enormously in structure and complexity, depending on the language to be annotated and on the characteristics of the annotation tool or annotation scheme being used.

In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word-forms (identifying lexical abstractions associated with groups of tokens). These two levels share two properties: each can be represented either as a simple sequence or as a local graph (for multiple segmentations or ambiguous compounds, for instance), and any n-to-n combination can hold between word-forms and tokens.
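
The n-to-n relation between tokens and word-forms can be pictured with a short Python sketch. The classes `Token` and `WordForm`, their fields and the English examples are assumptions made for illustration; they do not reproduce the XML serialization defined by this International Standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: str
    form: str          # surface segment as found in the source


@dataclass(frozen=True)
class WordForm:
    lemma: str
    tokens: tuple = ()  # ids of the attached tokens: zero, one or several


# Illustrative data (assumption): one contraction and one multi-token name.
TOKENS = [Token("t1", "don't"), Token("t2", "New"), Token("t3", "York")]

WORD_FORMS = [
    WordForm("do", ("t1",)),              # one token, several word-forms
    WordForm("not", ("t1",)),
    WordForm("New York", ("t2", "t3")),   # several contiguous tokens, one word-form
]


def word_forms_of(token_id: str) -> list:
    """Lemmas of all word-forms attached to the given token."""
    return [wf.lemma for wf in WORD_FORMS if token_id in wf.tokens]


def surface_of(word_form: WordForm) -> str:
    """Surface string covered by a word-form, rebuilt from its tokens."""
    by_id = {t.id: t.form for t in TOKENS}
    return " ".join(by_id[i] for i in word_form.tokens)
```

Keeping the token identifiers on the word-form side, rather than nesting one level inside the other, is what allows both the one-to-many and the many-to-one cases to coexist in the same sketch.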

As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]), tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by means of so-called stand-off annotations.
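
As a rough illustration of the stand-off option, the following Python sketch represents tokens as character offsets into an unmodified source string. The regular expression used for segmentation and the function names are assumptions for the example only; this International Standard does not prescribe a tokenization rule or this offset format.

```python
import re

SOURCE = "Language resources matter."


def standoff_tokens(text: str) -> list:
    """Return (start, end) character offsets; the source text itself is left untouched."""
    # Assumed segmentation rule: runs of word characters, or single punctuation marks.
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def token_text(text: str, span: tuple) -> str:
    """Recover the surface form of a stand-off token from its offsets."""
    start, end = span
    return text[start:end]


SPANS = standoff_tokens(SOURCE)
FORMS = [token_text(SOURCE, span) for span in SPANS]
```

Because the spans only point into `SOURCE`, several competing segmentations of the same text can be stored side by side, which inline mark-up makes harder.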

As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text. Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in Monachini and Calzolari, 1994). Such codes may also provide morphological information, including the part of speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.

In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data categories available in ISOCat. A normative annex of this International Standard elicits a core set of data categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual context. However, when implementers of this International Standard find these categories inappropriate in either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in compliance with ISO/TC 37 principles.

Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI) guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is essential, this International Standard will also provide clues about how to articulate the MAF model with TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).

Finally, it should be noted here that this International Standard forms the conceptual basis for the development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be understood according to the token–word-form dichotomy.
1 Scope

This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.

It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the Text Encoding Initiative (TEI).
2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 24610-1, Language resource management -- Feature structures -- Part 1: Feature structure representation
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.

3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
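
To make the DAG and FSA definitions concrete, here is a minimal Python sketch of a lattice with one initial and one final state, an acyclicity check (Kahn's algorithm, chosen here for illustration) and an enumeration of all readings. The edge labels, state numbers and function names are assumptions and are not taken from this International Standard.

```python
from collections import defaultdict, deque

# Transitions as (from_state, to_state, label); state 0 is initial, state 3 is final.
EDGES = [
    (0, 1, "A"),
    (1, 2, "B"),
    (2, 3, "C"),
    (1, 3, "BC"),  # alternative reading that spans two positions
]


def is_acyclic(edges) -> bool:
    """Kahn's algorithm: the graph is a DAG if every state can be removed in topological order."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    states = set()
    for src, dst, _ in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        states.update((src, dst))
    queue = deque(s for s in states if indegree[s] == 0)
    removed = 0
    while queue:
        state = queue.popleft()
        removed += 1
        for nxt in successors[state]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return removed == len(states)


def paths(edges, start, end) -> list:
    """All label sequences from start to end; only call this on an acyclic graph."""
    if start == end:
        return [[]]
    found = []
    for src, dst, label in edges:
        if src == start:
            found.extend([label] + rest for rest in paths(edges, dst, end))
    return found
```

Each path from the initial to the final state corresponds to one complete reading of the lattice, so ambiguity shows up as more than one path.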
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.

3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.

Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.

3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set o
...

INTERNATIONAL STANDARD ISO 24611
First edition
2012-11-01
Language resource management —
Morpho-syntactic annotation framework
(MAF)
Gestion des ressources langagières — Cadre d'annotation
morphosyntaxique (MAF)
Reference number
ISO 24611:2012(E)
ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.
ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction

ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the representation of terminological data [Romary, 2001], through which linguistic data models are seen as the combination of a generic data pattern (a meta-model) and a selection of data categories that refine it by providing the descriptors for a specific annotation level. Such models are defined independently of any specific format, and ensure that an implementer has the necessary conceptual instrument with which to design and compare formats with regard to their degrees of interoperability.

One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable semantics for the various descriptors used, either in the form of formal features and feature values, or directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across various annotation schemas and encoding applications, such a semantics should be implemented as a centralised registry of concepts: we will henceforth refer to these as data categories. Data categories should therefore satisfy the following constraints.

- From a technical point of view, they must provide unique, stable references (implemented as persistent identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they are in fact defined in relation to the same data categories (as feature and feature value).

- From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full-text description of the meaning of the descriptor with the expression of specific constraints that bear upon the category.
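
The first constraint above can be illustrated with a small Python sketch. It assumes, purely for illustration, that each encoding schema maps its own feature and value labels to shared persistent identifiers. The identifier strings, the two schemas and the function names are hypothetical and are not taken from ISOCat or ISO 24619.

```python
from typing import Dict, Tuple

# Hypothetical persistent identifiers standing in for registry entries.
PID_POS = "pid:example/partOfSpeech"
PID_NOUN = "pid:example/noun"
PID_VERB = "pid:example/verb"

Annotation = Tuple[str, str]            # local (feature, value) labels
Schema = Dict[Annotation, Tuple[str, str]]

# Two independently designed schemas that refer to the same data categories.
SCHEMA_A: Schema = {
    ("pos", "N"): (PID_POS, PID_NOUN),
    ("pos", "V"): (PID_POS, PID_VERB),
}
SCHEMA_B: Schema = {
    ("cat", "noun"): (PID_POS, PID_NOUN),
    ("cat", "verb"): (PID_POS, PID_VERB),
}


def resolve(schema: Schema, annotation: Annotation) -> Tuple[str, str]:
    """Map a local (feature, value) pair to its data category identifiers."""
    return schema[annotation]


def same_categories(schema_x: Schema, ann_x: Annotation,
                    schema_y: Schema, ann_y: Annotation) -> bool:
    """Deem two annotations equivalent when they resolve to the same identifiers."""
    return resolve(schema_x, ann_x) == resolve(schema_y, ann_y)
```

Under these assumptions, `("pos", "N")` in the first schema and `("cat", "noun")` in the second are equivalent even though their surface labels differ.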

In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories, encompassing all domains of language resources. This initiative, described in ISO 12620, has led to the implementation of an online environment providing access to all data categories that have been standardized in the context of the various language resource-related activities within ISO, or specifically as part of the maintenance of the data category registry. It also provides access to the various data categories that individual language technology practitioners have defined in the course of their own work and decided to share with the community.

The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective is to facilitate the maintenance of a comprehensive descriptive environment where new categories are easily inserted and reused without the need for any strong consistency check with the registry at large.

Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:

- simple generic-specific relations, when these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ makes it possible to compare morpho-syntactic annotations based on different descriptive levels of granularity;

- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible values of so-called complex data categories. For instance, it can be used to record that possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]), could be a subset of {/masculine/, /feminine/ and /neutral/};

- language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, it is possible to express explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
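
The three kinds of constraint listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ISO 12620 data model. The dictionary layout, the language code `"fr"`, the extra category `commonNoun` and all function names are hypothetical. Only /properNoun/, /noun/, /grammaticalGender/ and its values come from the text.

```python
from typing import Dict, Optional, Set, Tuple

# Generic-specific relations: category -> broader category.
# "commonNoun" is an invented entry added for illustration.
BROADER: Dict[str, str] = {"properNoun": "noun", "commonNoun": "noun"}

# Conceptual domain of a complex data category (values as given in the text).
DOMAIN: Dict[str, Set[str]] = {
    "grammaticalGender": {"masculine", "feminine", "neutral"},
}

# Language-specific restriction of a conceptual domain.
LANGUAGE_DOMAIN: Dict[Tuple[str, str], Set[str]] = {
    ("grammaticalGender", "fr"): {"masculine", "feminine"},
}


def is_subcategory(category: str, ancestor: str) -> bool:
    """True if `category` is `ancestor` or reaches it by following BROADER links."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = BROADER.get(current)
    return False


def allowed_values(category: str, language: Optional[str] = None) -> Set[str]:
    """Return the language-specific domain if one is recorded, else the general one."""
    if language is not None and (category, language) in LANGUAGE_DOMAIN:
        return LANGUAGE_DOMAIN[(category, language)]
    return DOMAIN[category]


def is_valid(category: str, value: str, language: Optional[str] = None) -> bool:
    """Check a value against the applicable conceptual domain."""
    return value in allowed_values(category, language)
```

With these tables, an annotation tagged /properNoun/ can be compared with one tagged /noun/ through `is_subcategory`, and `is_valid("grammaticalGender", "neutral", "fr")` is rejected while the same value is accepted when no language is given.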


This International Standard provides a comprehensive framework for the representation of morpho-syntactic (also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical abstraction level over language data (textual or spoken) and, depending on the language to be annotated, together with the characteristics of the annotation tool or annotation scheme that is being used, can vary enormously in structure and complexity.

In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word-forms (identifying lexical abstractions associated with groups of tokens). These two levels share the following specificities: on the one hand, they can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous compounds; on the other hand, any n-to-n combination can hold between word-forms and tokens.

As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]), tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by means of so-called stand-off annotations.
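
The two levels can be pictured with a short Python sketch. It is an illustration only and does not reproduce the MAF XML serialization. The class and field names, the identifiers `t1` to `t6`, the whitespace tokenizer and the French sample sentence are all assumptions made for this example. Tokens are stored stand-off, as character offsets into the source, and word-forms refer to tokens by identifier, so that several tokens can share one word-form and one token can carry several word-forms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Token:
    """Stand-off token: a character span [start, end) pointing into the source."""
    ident: str
    start: int
    end: int


@dataclass
class WordForm:
    """Lexical abstraction attached to tokens through their identifiers."""
    lemma: str
    token_ids: List[str] = field(default_factory=list)


SOURCE = "la pomme de terre du jardin"


def tokenize(text: str) -> List[Token]:
    """Naive whitespace tokenizer that records character offsets."""
    tokens: List[Token] = []
    position = 0
    for index, piece in enumerate(text.split(" "), start=1):
        tokens.append(Token(f"t{index}", position, position + len(piece)))
        position += len(piece) + 1  # skip the separating space
    return tokens


TOKENS = tokenize(SOURCE)

WORD_FORMS = [
    WordForm("le", ["t1"]),
    WordForm("pomme de terre", ["t2", "t3", "t4"]),  # several tokens, one word-form
    WordForm("de", ["t5"]),                          # one token ("du"), two word-forms
    WordForm("le", ["t5"]),
    WordForm("jardin", ["t6"]),
]


def surface(word_form: WordForm) -> str:
    """Recover the source text covered by a word-form from its token spans."""
    by_id = {token.ident: token for token in TOKENS}
    return " ".join(SOURCE[by_id[i].start:by_id[i].end] for i in word_form.token_ids)
```

Because the tokens only hold offsets, the source text is left untouched, which is the point of a stand-off arrangement. An inline variant would instead wrap each span in mark-up inside the document.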

As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text. Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in Monachini and Calzolari, 1994). Such codes may also provide morphological information, including the part of speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
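
The relation between a compact tag and the feature structure it stands for can be sketched as a simple lookup. The tag codes `NCMS` and `NCFP`, the feature names other than `grammaticalGender`, and the function names below are invented for this illustration and are not defined by this International Standard or by ISOCat.

```python
from typing import Dict

FeatureStructure = Dict[str, str]

# Hypothetical tag library: each compact code refers to a basic feature structure.
TAG_LIBRARY: Dict[str, FeatureStructure] = {
    "NCMS": {
        "partOfSpeech": "noun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCFP": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
}


def expand(tag: str) -> FeatureStructure:
    """Return a copy of the feature structure that a compact tag refers to."""
    return dict(TAG_LIBRARY[tag])


def qualify(lemma: str, tag: str) -> Dict[str, object]:
    """Describe a word-form by its lemma and the expanded content of its tag."""
    return {"lemma": lemma, "features": expand(tag)}
```

Keeping the tag as a key and the features in a shared library means that two tagsets with different codes can still be compared feature by feature.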

In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides a means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data categories available in ISOCat. A normative annex of this International Standard lists a core set of data categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual context. However, when implementers of this International Standard find these categories inappropriate in either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in compliance with ISO/TC 37 principles.

Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant annotation models. Since many existing projects are based on the text encoding initiative (TEI) guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is essential, this International Standard also indicates how to articulate the MAF model with TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).

Finally, it should be noted here that this International Standard forms the conceptual basis for the development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope

This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that relates to a reference to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (text encoding initiative).
2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure representation
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.

3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
3.21
token
non-empty contiguous sequence of graphemes or phonemes in a document

Note 1 to entry: For editorial reasons, some annotation schemes may extend the notion of token to an empty sequence. See the section on token attachment (6.2).
3.22
tokenization
process identifying tokens
3.23
transcription
form resulting from a coherent method of writing down speech sounds
3.24
transliteration
form resulting from the conversion of one script into another, usually through a one-to-one correspondence between characters
3.25
word-form
morpho-syntactic unit
contiguous or non-contiguous linguistic unit identified as corresponding to a lexical entity in a language
Note 1 to entry: Word-forms may have no acoustic or graphic realization, or may correspond to one or more tokens.
3.26
word lattice
set of possible alternative decompositions of a text or speech segment into word-forms
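
As an informal illustration of definitions 3.1 and 3.26, a word lattice can be held as a DAG whose states are positions between tokens and whose edges are labelled with word-forms. The sketch below is an assumption-laden example rather than the MAF representation. The dictionary layout, the function name and the French segment "pomme de terre" are chosen only for the illustration.

```python
from typing import Dict, List, Tuple

# state -> list of (word-form label, target state); the graph must be acyclic.
Lattice = Dict[int, List[Tuple[str, int]]]

LATTICE: Lattice = {
    0: [("pomme", 1), ("pomme de terre", 3)],  # simple word or compound
    1: [("de", 2)],
    2: [("terre", 3)],
    3: [],
}


def decompositions(lattice: Lattice, state: int, final: int) -> List[List[str]]:
    """Enumerate every path from `state` to `final` as a list of word-form labels."""
    if state == final:
        return [[]]
    paths: List[List[str]] = []
    for label, target in lattice.get(state, []):
        for rest in decompositions(lattice, target, final):
            paths.append([label] + rest)
    return paths
```

Calling `decompositions(LATTICE, 0, 3)` yields the two alternative readings of the segment, one as three word-forms and one as a single compound. The recursion terminates only because the graph has no cycles.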

...

SLOVENIAN STANDARD
SIST ISO 24611:2013
01 July 2013
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF)
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS:
01.020 Terminology (principles and coordination)
SIST ISO 24611:2013 en,fr,de

2003-01. Slovenski inštitut za standardizacijo (Slovenian Institute for Standardization). Reproduction of this standard in whole or in part is not permitted.

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction

ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the representation of terminological data [Romary, 2001]. In this strategy, a linguistic data model is seen as the combination of a generic data pattern (a meta-model) with a selection of data categories that refine it and provide the descriptors for the specific annotation level. Such models are defined independently of any specific format, and give an implementer the conceptual instrument needed to design formats and to compare them with regard to their degree of interoperability.

One important aspect of representing any kind of annotation is the capacity to provide clear and reliable semantics for the various descriptors used, either in the form of formal features and feature values, or directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across various annotation schemas and encoding applications, such semantics should be implemented as a centralised registry of concepts, henceforth referred to as data categories. As such, data categories should satisfy the following constraints.

- From a technical point of view, they must provide unique, stable references (implemented as persistent identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they are in fact defined in relation to the same data categories (as feature and feature value).

- From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full text elicitation of the meaning of the descriptor with the expression of specific constraints that bear upon the category.
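
The equivalence criterion in the first constraint can be sketched as follows. This is a minimal illustration only: the two tagsets, their local tag names and the identifier strings are hypothetical placeholders, not real persistent identifiers from any registry.

```python
from typing import Dict

# Hypothetical tagsets: each maps a local tag name to the persistent
# identifier of the data category it stands for (placeholder strings).
TAGSET_A: Dict[str, str] = {"NN": "pid:example/noun", "VB": "pid:example/verb"}
TAGSET_B: Dict[str, str] = {"Subst": "pid:example/noun", "Verbe": "pid:example/verb"}


def equivalent(tag_a: str, tag_b: str,
               tagset_a: Dict[str, str] = TAGSET_A,
               tagset_b: Dict[str, str] = TAGSET_B) -> bool:
    """Two local tags are equivalent when both resolve to the same identifier."""
    pid_a = tagset_a.get(tag_a)
    pid_b = tagset_b.get(tag_b)
    return pid_a is not None and pid_a == pid_b


if __name__ == "__main__":
    print(equivalent("NN", "Subst"))
    print(equivalent("NN", "Verbe"))
```

The local names differ between the two schemas, but the comparison is made on the shared references rather than on the names.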

In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories, encompassing all domains of language resources. This initiative, described in ISO 12620, has led to the implementation of an online environment providing access to all data categories that have been standardized in the context of the various language resource-related activities within ISO, or specifically as part of the maintenance of the data category registry. It also provides access to the various data categories that individual language technology practitioners have defined in the course of their own work and decided to share with the community.

The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are easily inserted and reused without the need for any strong consistency check with the registry at large. Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:

- simple generic-specific relations, when these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ makes it possible to compare morpho-syntactic annotations based on different descriptive levels of granularity;

- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible values of so-called complex data categories. For instance, it can be used to record that the possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be a subset of {/masculine/, /feminine/ and /neutral/};

- language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, it is possible to express explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
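
The three kinds of constraint listed above can be sketched with plain Python dictionaries. The structures, the function names and the language code `"fr"` are assumptions made for this illustration; the category names are those used in the examples above, and nothing here reflects the actual ISOCat data model or interface.

```python
from typing import Dict, Optional, Set, Tuple

# Generic-specific relation: sub-category -> broader category.
BROADER: Dict[str, str] = {"properNoun": "noun"}

# Conceptual domain of a complex data category.
CONCEPTUAL_DOMAIN: Dict[str, Set[str]] = {
    "grammaticalGender": {"masculine", "feminine", "neutral"},
}

# Language-specific restriction of a conceptual domain.
LANGUAGE_DOMAIN: Dict[Tuple[str, str], Set[str]] = {
    ("grammaticalGender", "fr"): {"masculine", "feminine"},
}


def is_subcategory(category: str, ancestor: str) -> bool:
    """Follow generic-specific links upwards until the ancestor is found."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = BROADER.get(current)
    return False


def allowed(category: str, value: str, language: Optional[str] = None) -> bool:
    """Check a value against the language-specific domain, else the general one."""
    general = CONCEPTUAL_DOMAIN.get(category, set())
    if language is None:
        return value in general
    return value in LANGUAGE_DOMAIN.get((category, language), general)


if __name__ == "__main__":
    print(is_subcategory("properNoun", "noun"))
    print(allowed("grammaticalGender", "neutral"))
    print(allowed("grammaticalGender", "neutral", "fr"))
```

With these tables, an annotation using /properNoun/ can be compared with one that only distinguishes /noun/, and a French annotation carrying /neutral/ gender is rejected.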


This International Standard provides a comprehensive framework for the representation of morpho-syntactic (also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical abstraction level over language data (textual or spoken) and, depending on the language to be annotated, together with the characteristics of the annotation tool or annotation scheme that is being used, can vary enormously in structure and complexity.

In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word-forms (identifying lexical abstractions associated with groups of tokens). These two levels share the following specificities: on the one hand, they can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous compounds; on the other hand, any n-to-n combination can stand between word-forms and tokens.

As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]), tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by means of so-called stand-off annotations.
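
The stand-off option can be sketched in Python as tokens that hold only character offsets into an untouched source text. The class and field names, the sample sentence and the naive regular-expression segmentation are assumptions made for this illustration and do not reproduce the MAF serialization.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    ident: str
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source


def tokenize_standoff(text: str) -> List[Token]:
    """Naive segmentation: runs of word characters, or single punctuation marks."""
    matches = re.finditer(r"\w+|[^\w\s]", text)
    return [Token(f"t{i}", m.start(), m.end()) for i, m in enumerate(matches, start=1)]


def surface(text: str, token: Token) -> str:
    """Recover the surface string a stand-off token points to."""
    return text[token.start:token.end]


if __name__ == "__main__":
    source = "The cat sleeps."
    for tok in tokenize_standoff(source):
        print(tok.ident, tok.start, tok.end, surface(source, tok))
```

Because the tokens only point at the source, several competing segmentations can refer to the same text without modifying it, which inline mark-up cannot easily do.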

As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text. Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in Monachini and Calzolari, 1994). Such codes may also provide morphological information about the word-form, including its part of speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
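
As a rough sketch of how a compact tag can stand for a basic feature structure, the following Python fragment attaches a lemma, a tag and a list of token identifiers to a word-form and expands the tag on request. The tag codes, the feature names other than grammaticalGender, and the class layout are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical compact tagset: each code abbreviates a small feature structure.
TAGSET: Dict[str, Dict[str, str]] = {
    "NCMS": {
        "partOfSpeech": "noun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCFS": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "singular",
    },
}


@dataclass
class WordForm:
    lemma: str
    tag: str
    tokens: List[str] = field(default_factory=list)  # identifiers of attached tokens

    def features(self) -> Dict[str, str]:
        """Expand the compact tag into its feature structure (as a copy)."""
        return dict(TAGSET[self.tag])


if __name__ == "__main__":
    # One word-form attached to one token, and one attached to three tokens.
    chat = WordForm(lemma="chat", tag="NCMS", tokens=["t2"])
    compound = WordForm(lemma="pomme de terre", tag="NCFS", tokens=["t4", "t5", "t6"])
    print(chat.features())
    print(len(compound.tokens))
```

Keeping the token identifiers separate from the linguistic description is what allows one word-form to span several tokens, or several word-forms to share one token.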

In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides a means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data categories available in ISOCat. A normative annex of this International Standard elicits a core set of data categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual context. However, when implementers of this International Standard find these categories inappropriate in either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in compliance with ISO/TC 37 principles.

Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI) guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is essential, this International Standard also gives guidance on how to articulate the MAF model with TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).

Finally, it should be noted here that this International Standard forms the conceptual basis for the development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope

This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.

It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the Text Encoding Initiative (TEI).
2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 24610-1, Language resource management -- Feature structures -- Part 1: Feature structure representation
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.

3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure

set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata

graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state

Note 1 to entry: See also DAG (3.1).
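
The relation between entries 3.1 and 3.4 can be illustrated informally: an automaton given as a list of transitions is also a DAG when its transition graph contains no cycle. The sketch below checks this with Kahn's topological-sort algorithm; the state names, labels and the `is_acyclic` function are assumptions made for the illustration.

```python
from collections import deque
from typing import Dict, List, Tuple

Transition = Tuple[str, str, str]  # (source state, label, target state)


def is_acyclic(transitions: List[Transition]) -> bool:
    """Kahn's algorithm: acyclic if every state can be removed in topological order."""
    states = {s for s, _, _ in transitions} | {t for _, _, t in transitions}
    indegree: Dict[str, int] = {state: 0 for state in states}
    successors: Dict[str, List[str]] = {state: [] for state in states}
    for source, _, target in transitions:
        successors[source].append(target)
        indegree[target] += 1
    queue = deque(state for state in states if indegree[state] == 0)
    removed = 0
    while queue:
        state = queue.popleft()
        removed += 1
        for target in successors[state]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    return removed == len(states)


if __name__ == "__main__":
    linear = [("q0", "the", "q1"), ("q1", "cat", "q2")]
    looping = linear + [("q2", "again", "q1")]
    print(is_acyclic(linear))
    print(is_acyclic(looping))
```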
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection

modification or marking of a lexeme that reflects its morpho-syntactic properties

3.7
inflected form
form that a word can take when used in a sentence or a phrase

Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme

Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry

container for managing a set of word-forms and possibly one or more meanings to describe a lexeme

3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme

smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller meaningful units

Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).

3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word

Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.

EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset

comprehensive set of tags used for the morpho-syntactic description of a language

Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.

3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.

Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.

3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
...

INTERNATIONAL STANDARD ISO 24611
First edition
2012-11-01
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
Language resource management -- Morpho-syntactic annotation framework (MAF)
Reference number
ISO 24611:2012(F)
ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012, Published in Switzerland

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Ch. de Blandonnet 8, CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF metamodel
4.1 Overview
4.2 MAF metamodel
5 Segmentation
5.1 General
5.2 Formal description:
5.3 Embedded notation
5.4 Alternative representation for TEI-compliant documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the embedded notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token, one word-form
6.2.2 Several contiguous tokens, one word-form
6.2.3 Several discontiguous tokens, one word-form
6.2.4 No token, one word-form
6.2.5 One token, several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries

7.5 Conception des ensembles de balises ................................................................................................... 21

7.6 Description formelle: ................................................................................................................ 23

8 Gestion des ambiguïtés ............................................................................................................................... 23

8.1 Ambiguïtés du contenu des mots-formes ............................................................................................ 23

8.2 Ambiguïtés lexicales .................................................................................................................................... 24

8.3 Ambiguïtés structurelles ........................................................................................................................... 24

8.3.1 Ambiguïtés structurelles avec des mots-formes ............................................................................... 24

8.3.2 Ambiguïtés structurelles avec les segments ....................................................................................... 25

8.4 Variantes structurées simplement ......................................................................................................... 25


8.4.1 Unambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping in local lattices
8.5.3 Merging local lattices
8.5.4 Removal of
8.6 Formal description: and
Annex A (informative) Example encoded using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on electrotechnical standardization.

The procedures used to develop this document and those intended for its maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of intellectual property rights or similar rights. ISO shall not be held responsible for failing to identify such rights or to give notice of their existence. Details of any intellectual property rights or similar rights identified during the development of the document are given in the Introduction and/or in the list of patent declarations received by ISO (see www.iso.org/brevets).

Any trade name mentioned in this document is given for information, for the convenience of users, and does not constitute an endorsement.

For an explanation of the meaning of ISO-specific terms and expressions related to conformity assessment, and for information about ISO's adherence to the World Trade Organization (WTO) principles on Technical Barriers to Trade (TBT), see the following link: www.iso.org/iso/fr/avant-propos.html.

The committee responsible for this document is ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.

Introduction

ISO/TC 37/SC 4 focuses on the definition of models and formats for representing annotated language resources. To this end, it generalizes the modelling strategy initiated by its sister committee SC 3 for the representation of terminological data [Romary, 2001], according to which linguistic data models are seen as the combination of a generic data pattern (a metamodel), which is then refined by a selection of data categories that provide the descriptors for that specific level of annotation. Such models are defined independently of any specific format and give implementers the conceptual tool needed to design and compare formats according to their degree of interoperability.

To represent any kind of annotation, it is important to provide clear and reliable semantics for the various descriptors used, either in the form of formal feature-value pairs or directly as objects of a representation expressed, for example, in XML. So that these semantics can be shared across annotation schemes and encoding applications, they should be implemented as a centralized registry of concepts; these concepts are referred to here as data categories. As such, data categories should fulfil the following conditions:

- from a technical point of view, they must provide unique and stable references (implemented as persistent identifiers in the sense of ISO 24619), so that the designer of a specific encoding scheme can refer to them in its specification. Two annotations are then considered equivalent when they refer to the same data category (as feature and value);

- from a descriptive point of view, each semantically unique reference should be associated with precise documentation that combines a prose explanation of the meaning of the descriptor with the expression of the specific constraints that bear on the category.

In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories covering all domains of language resources. This initiative, specified in ISO 12620, has led to the implementation of an online environment that, on the one hand, provides access to all the data categories standardized in the context of the various language resource activities within ISO and, on the other hand, specifically supports the maintenance of the data category registry. The system also provides access to the data categories that language technology practitioners have defined in the course of their own work and subsequently shared with the community.

The data category registry, accessible through the ISOCat implementation (www.isocat.org), is merely a space of semantic objects that offers only a limited set of ontological constraints. The aim is to make it easy to maintain an environment in which new categories can readily be inserted and reused without an in-depth check of their consistency with the rest of the registry. Indeed, the basic constraints are intrinsic to the data category model as defined in ISO 12620:

- simple generic-specific relations, where they are useful for precisely identifying interoperability descriptors between data categories. For example, the fact that /properNoun/ is a subcategory of /noun/ makes it possible to compare morpho-syntactic annotations based on different levels of granularity;

- the description of conceptual domains, in the sense of ISO 11179, to identify, when it is known or identifiable, the possible values of a given complex data category. For example, this can be used to record that the possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) form a subset of {/masculine/, /feminine/, /neutral/};

- specific linguistic constraints, either in the form of application notes or as explicit restrictions on the conceptual domains of data categories. For example, it is possible to state explicitly that /grammaticalGender/ in French can take only the two values {/masculine/, /feminine/}.
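The three kinds of constraint listed above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `DataCategory` class, the `REGISTRY` dictionary and the functions `is_a` and `allowed_values` are hypothetical names invented for the illustration and do not reproduce the ISO 12620 data model or any ISOCat interface. Only the category names and the French restriction come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class DataCategory:
    """Hypothetical, simplified stand-in for one registry entry."""
    name: str                                   # e.g. "/properNoun/"
    broader: Optional[str] = None               # generic category, if any
    domain: FrozenSet[str] = frozenset()        # conceptual domain of a complex category
    # Language-specific restrictions of the conceptual domain, keyed by language code.
    restrictions: Dict[str, FrozenSet[str]] = field(default_factory=dict)


REGISTRY = {
    c.name: c
    for c in [
        DataCategory("/noun/"),
        DataCategory("/properNoun/", broader="/noun/"),
        DataCategory(
            "/grammaticalGender/",
            domain=frozenset({"/masculine/", "/feminine/", "/neutral/"}),
            restrictions={"fr": frozenset({"/masculine/", "/feminine/"})},
        ),
    ]
}


def is_a(specific: str, generic: str) -> bool:
    """True if `specific` is `generic` itself or one of its subcategories."""
    current: Optional[str] = specific
    while current is not None:
        if current == generic:
            return True
        current = REGISTRY[current].broader
    return False


def allowed_values(category: str, lang: Optional[str] = None) -> FrozenSet[str]:
    """Conceptual domain of a category, narrowed by a language restriction if one exists."""
    entry = REGISTRY[category]
    return entry.restrictions.get(lang, entry.domain)
```

With these definitions, `is_a("/properNoun/", "/noun/")` holds, so an annotation tagged /properNoun/ can be compared with a coarser scheme that only distinguishes /noun/, and `allowed_values("/grammaticalGender/", "fr")` returns the two-valued French domain.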

This International Standard provides a comprehensive framework for the representation of morpho-syntactic annotations (also referred to as part-of-speech annotations). This level of annotation corresponds to a first level of abstraction over linguistic data (textual or spoken), whose structure and complexity can vary considerably depending on the language being annotated as well as on the characteristics of the annotation tool or annotation scheme used.

To address the complex issues of ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a metamodel that draws a clear distinction between two levels: tokens (representing the surface segmentation of the source) and word-forms (identifying the lexical abstractions associated with groups of tokens). These two levels share the following characteristics: on the one hand, they can be represented as simple sequences and as local graphs, such as multiple segmentations and ambiguous items; on the other hand, any N-to-M combination can relate tokens to word-forms.
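The N-to-M relation between tokens and word-forms can be sketched as follows. The French phrase, the class names `Token` and `WordForm`, and the helper functions are assumptions chosen for illustration; they are not taken from the standard and do not reproduce its serialization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    id: str
    form: str                  # surface string as it appears in the source


@dataclass(frozen=True)
class WordForm:
    lemma: str
    tokens: Tuple[str, ...]    # ids of the tokens this word-form is attached to


# Illustrative phrase "du chemin de fer":
# - the single token "du" realizes two word-forms ("de" + "le");
# - the three tokens "chemin", "de", "fer" realize one compound word-form.
tokens = [Token("t1", "du"), Token("t2", "chemin"), Token("t3", "de"), Token("t4", "fer")]
word_forms = [
    WordForm("de", ("t1",)),
    WordForm("le", ("t1",)),
    WordForm("chemin de fer", ("t2", "t3", "t4")),
]


def word_forms_for(token_id: str) -> List[str]:
    """Lemmas of every word-form attached to the given token."""
    return [w.lemma for w in word_forms if token_id in w.tokens]


def surface(word_form: WordForm) -> str:
    """Surface string covered by a word-form, rebuilt from its tokens."""
    by_id = {t.id: t.form for t in tokens}
    return " ".join(by_id[i] for i in word_form.tokens)
```

Keeping the attachment as a list of token identifiers on the word-form side is one possible design; it lets a word-form cover zero, one or several tokens, and lets several word-forms share a token, without changing the token sequence itself.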

As linguistic segments (sometimes called 'tokens' or 'markables' in the English-language technical literature [e.g. Carletta et al. 1997]), tokens can be embedded in the source document as inline markup, or can refer to it through stand-off annotations.
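The difference between the two modes can be sketched as follows. The sample sentence, the use of character offsets as the addressing scheme, and the element and attribute names in the generated markup are all assumptions made for this illustration; they are placeholders and not the normative MAF serialization. The sketch also assumes that the spans do not overlap.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape

source = "L'ISO publie des normes."

# Stand-off tokens: (id, start, end) character offsets into `source`.
# The source text itself is left untouched.
standoff: List[Tuple[str, int, int]] = [
    ("t1", 0, 2), ("t2", 2, 5), ("t3", 6, 12),
    ("t4", 13, 16), ("t5", 17, 23), ("t6", 23, 24),
]


def resolve(span: Tuple[str, int, int], text: str = source) -> str:
    """Surface string addressed by a stand-off token."""
    _, start, end = span
    return text[start:end]


def to_embedded(text: str, spans: List[Tuple[str, int, int]]) -> str:
    """Rewrite non-overlapping stand-off spans as inline markup around the text."""
    out, pos = [], 0
    for tid, start, end in sorted(spans, key=lambda s: s[1]):
        out.append(escape(text[pos:start]))           # text between tokens
        out.append(f'<token id="{tid}">{escape(text[start:end])}</token>')
        pos = end
    out.append(escape(text[pos:]))                    # trailing text, if any
    return "".join(out)
```

The stand-off form keeps the source immutable and tolerates several competing segmentations of the same text, whereas the embedded form is easier to read but requires modifying the source document.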

As linguistic abstractions, word-forms can be qualified by various linguistic features characterizing the morpho-syntactic properties that are instantiated in the realization of the lexical entry in the annotated text. These properties can take various forms, from the simple indication of a lemma to an explicit reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of tags; these codes refer to basic feature structures (see the examples in Monachini and Calzolari, 1994). They can also provide morphological information, including the part of speech (for example, noun, adjective or verb) and features such as number, gender, person, mood and verb tense.
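The idea that a compact tag abbreviates a basic feature structure can be sketched with a lookup table. The tag codes (`NCMS`, `NCFP`, `V3SPI`), the feature names other than `grammaticalGender`, and the functions `expand` and `subsumes` are hypothetical and chosen only for this illustration; they do not correspond to any tagset defined by the standard.

```python
from typing import Dict

FeatureStructure = Dict[str, str]

# Hypothetical tagset: each compact tag stands for a flat feature structure.
TAGSET: Dict[str, FeatureStructure] = {
    "NCMS": {
        "partOfSpeech": "commonNoun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCFP": {
        "partOfSpeech": "commonNoun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
    "V3SPI": {
        "partOfSpeech": "verb",
        "person": "thirdPerson",
        "grammaticalNumber": "singular",
        "tense": "present",
        "mood": "indicative",
    },
}


def expand(tag: str) -> FeatureStructure:
    """Return a copy of the feature structure that a compact tag abbreviates."""
    return dict(TAGSET[tag])


def subsumes(general: FeatureStructure, specific: FeatureStructure) -> bool:
    """True if every feature-value pair of `general` also appears in `specific`."""
    return all(specific.get(name) == value for name, value in general.items())
```

Expanding tags in this way makes annotations produced with different tagsets comparable feature by feature; for example, `subsumes({"partOfSpeech": "commonNoun"}, expand("NCFP"))` holds even though the query says nothing about gender or number.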

In line with the general modelling strategy of ISO/TC 37, this International Standard (the MAF framework) provides the means of relating morpho-syntactic tags expressed as feature structures (conformant to ISO 24610) to the data categories of ISOCat. A normative annex of this International Standard sets out a core set of data categories that can be used as a reference for most morpho-syntactic annotation tasks in a multilingual context. Nevertheless, users of this International Standard who consider these categories inappropriate in terms of coverage, scope or semantics are invited to use ISOCat to define their own categories in conformity with the principles of ISO/TC 37.

Together with the metamodel, the MAF framework also provides a default XML syntax that can be used to serialize conformant annotation models. Since many existing projects are based on the guidelines of the TEI consortium (Text Encoding Initiative, www.tei-c.org), particularly in the digital humanities, where correct encoding of textual sources is essential, this International Standard also provides information on how to reconcile the MAF model with TEI-conformant encodings. Indeed, the TEI guidelines already offer a wide variety of constructs and mechanisms for handling the many challenges posed by spoken corpora and their annotations (Romary and Witt, 2012).

Finally, it should be noted that this International Standard provides the conceptual basis for the ISO 24614 series of standards on the segmentation of lexical units. All of the general rules and principles defined in ISO 24614-1, as well as the constraints expressed in the complementary parts dealing with specific languages, are to be understood in terms of the token/word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(F)
Language resource management -- Morpho-syntactic annotation framework (MAF)
1 Scope

This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.

It describes a metamodel for morpho-syntactic annotation that references the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 24610-1, Language resource management -- Feature structures -- Part 1: Feature structure representation
3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph containing directed edges and no cycles

Note 1 to entry: Directed acyclic graphs are a subset of finite state automata (3.4).

3.3
feature structure

set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are specified in ISO 24610-1.
3.4
FSA
finite state automaton

graph comprising a set of states, with an initial state and a final state, and a finite set of transitions for moving from one state to another
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation mark.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take in a sentence or clause

Note 1 to entry: An inflected form of a word is associated with a combination of morphological features such as grammatical number or case.
3.8
lemma
lemmatized form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages
...
