SIST ISO 24611:2013
Language resource management -- Morpho-syntactic annotation framework (MAF)
This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
SLOVENIAN STANDARD
SIST ISO 24611:2013
2013-07-01
Language resource management -- Morpho-syntactic annotation framework (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24611:2013 en,fr,de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD ISO 24611
First edition
2012-11-01
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
Reference number: ISO 24611:2012(E)
ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts: we will henceforth refer to these as data categories. As such, data categories should
satisfy the following constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they are in fact defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full-text description of the meaning of the descriptor with the expression of specific constraints that bear upon the category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ makes it possible to compare morpho-syntactic annotations based on different descriptive levels of granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible values of so-called complex data categories. For instance, it can be used to record that the possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be a subset of {/masculine/, /feminine/, /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, it is possible to express explicitly that /grammaticalGender/ in French can only take the two values {/masculine/, /feminine/}.
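The three kinds of constraint listed above can be illustrated with a small sketch. It is not the ISOCat data model or its API: the `DataCategory` class, the `REGISTRY` dictionary and both helper functions are hypothetical names introduced here, and only the category names and values are taken from the examples in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class DataCategory:
    """Hypothetical, much simplified stand-in for a registry entry."""
    name: str
    broader: Optional[str] = None            # generic-specific relation
    domain: FrozenSet[str] = frozenset()     # conceptual domain (complex categories)
    per_language: Dict[str, FrozenSet[str]] = field(default_factory=dict)


# Toy registry holding only the categories mentioned in the text.
REGISTRY = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        domain=frozenset({"masculine", "feminine", "neutral"}),
        per_language={"fr": frozenset({"masculine", "feminine"})},
    ),
}


def is_subcategory(specific: str, generic: str) -> bool:
    """Follow generic-specific links upwards from `specific`."""
    current: Optional[str] = specific
    while current is not None:
        if current == generic:
            return True
        current = REGISTRY[current].broader
    return False


def allowed_values(category: str, language: Optional[str] = None) -> FrozenSet[str]:
    """Return the language-specific restriction if one exists, else the full domain."""
    entry = REGISTRY[category]
    return entry.per_language.get(language, entry.domain)
```

With such a structure, an annotation that uses /properNoun/ can be compared with a coarser one that only uses /noun/, and a French tagset can be checked against the restricted gender domain.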
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following specificities: on the one hand, they
can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-n combination can stand between word forms and tokens.
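As an informal illustration of the token and word-form distinction, the sketch below models the two levels as separate Python objects linked by token identifiers. It does not reproduce the MAF serialization: the class names, field names and identifiers are assumptions made for this example, and the French phrase "du chemin de fer" is chosen only because it contains a contraction and a compound.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """Surface segment of the source text (hypothetical structure)."""
    id: str
    form: str


@dataclass(frozen=True)
class WordForm:
    """Lexical abstraction attached to zero, one or several tokens."""
    lemma: str
    token_ids: Tuple[str, ...]


# Surface segmentation of "du chemin de fer".
tokens = [
    Token("t1", "du"),
    Token("t2", "chemin"),
    Token("t3", "de"),
    Token("t4", "fer"),
]

word_forms = [
    # One token, several word-forms: "du" contracts "de" + "le".
    WordForm("de", ("t1",)),
    WordForm("le", ("t1",)),
    # Several contiguous tokens, one word-form: the compound "chemin de fer".
    WordForm("chemin de fer", ("t2", "t3", "t4")),
]


def word_forms_for(token_id: str) -> List[str]:
    """Lemmas of all word-forms that refer to the given token."""
    return [w.lemma for w in word_forms if token_id in w.token_ids]


def surface(word_form: WordForm) -> str:
    """Surface string covered by a word-form, rebuilt from its tokens."""
    by_id = {t.id: t.form for t in tokens}
    return " ".join(by_id[i] for i in word_form.token_ids)
```

Keeping the link as a tuple of identifiers on the word-form side is one simple way to allow any n-to-n combination without duplicating surface text.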
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information about the word-form, including its part of
speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
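The relation between a compact tag and the feature structure it stands for can be sketched as a simple lookup. The tag codes and the `TAG_LIBRARY` table below are invented for illustration and do not come from any published tagset; the feature names other than grammaticalGender are likewise assumptions.

```python
from typing import Dict

# Hypothetical tag library: each compact code abbreviates a small feature structure.
TAG_LIBRARY: Dict[str, Dict[str, str]] = {
    "NCms": {
        "partOfSpeech": "noun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCfp": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
    "ADJms": {
        "partOfSpeech": "adjective",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
}


def expand(tag: str) -> Dict[str, str]:
    """Return a copy of the feature structure that a compact tag abbreviates."""
    return dict(TAG_LIBRARY[tag])


def agree(tag_a: str, tag_b: str, feature: str) -> bool:
    """True when both tags define the feature and give it the same value."""
    a, b = expand(tag_a), expand(tag_b)
    return feature in a and feature in b and a[feature] == b[feature]
```

Working on the expanded feature structures rather than on the codes themselves is what allows annotations produced with different tagsets to be compared feature by feature.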
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard elicits a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant
annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is
essential, this International Standard also indicates how to articulate the MAF model with TEI-compliant
encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted here that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management -- Feature structures -- Part 1: Feature structure representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set o
...
INTERNATIONAL ISO
STANDARD 24611
First edition
2012-11-01
Language resource management —
Morpho-syntactic annotation framework
(MAF)
Gestion des ressources langagières — Cadre d'annotation
morphosyntaxique (MAF)
Reference number
ISO 24611:2012(E)
ISO 2012
---------------------- Page: 1 ----------------------
ISO 24611:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts: we will henceforth refer to these as data categories. As such, data categories should
satisfy the following constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent
  identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
  them in his or her specification. As a result, two annotations are deemed to be equivalent when they
  are defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
  documentation combining a full-text explanation of the meaning of the descriptor with the expression of
  specific constraints that bear upon the category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
  descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
  makes it possible to compare morpho-syntactic annotations based on different descriptive levels of
  granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
  the possible values of so-called complex data categories. For instance, it can be used to record that
  possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be
  a subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions
  bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
  explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
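The three kinds of constraint listed above can be pictured with a small in-memory model. The following is a minimal sketch and not the ISOCat interface: the `DataCategory` class, the `REGISTRY` dictionary and the `is_a` and `allowed_values` helpers are hypothetical names introduced for illustration. Only the category names (/noun/, /properNoun/, /grammaticalGender/ and its values) are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DataCategory:
    """Hypothetical stand-in for one entry of a data category registry."""
    name: str
    broader: Optional[str] = None                    # generic-specific relation
    domain: Set[str] = field(default_factory=set)    # conceptual domain
    # language code -> restricted conceptual domain for that language
    language_domains: Dict[str, Set[str]] = field(default_factory=dict)


# Illustrative registry content, limited to the examples given in the text.
REGISTRY: Dict[str, DataCategory] = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        domain={"masculine", "feminine", "neutral"},
        language_domains={"fr": {"masculine", "feminine"}},
    ),
}


def is_a(name: str, ancestor: str) -> bool:
    """Follow generic-specific links upwards from `name` until `ancestor` is found."""
    current: Optional[str] = name
    while current is not None:
        if current == ancestor:
            return True
        current = REGISTRY[current].broader
    return False


def allowed_values(name: str, language: Optional[str] = None) -> Set[str]:
    """Return the conceptual domain, narrowed when a language restriction exists."""
    category = REGISTRY[name]
    if language is not None and language in category.language_domains:
        return category.language_domains[language]
    return category.domain
```

With this layout, an annotation tagged /properNoun/ can be compared with one tagged /noun/ through `is_a("properNoun", "noun")`, and `allowed_values("grammaticalGender", "fr")` yields only the two French values.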
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following specificities: on the one hand, they
can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-n combination can stand between word-forms and tokens.
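The separation between the two levels can be illustrated with a short data model. This is a sketch under stated assumptions rather than the MAF serialization: the `Token` and `WordForm` classes, the identifiers `t1` to `t6` and the French sample are all hypothetical and chosen only to show that one token may carry several word-forms and that several tokens may form one word-form.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """Surface segment of the source, identified so that word-forms can point to it."""
    id: str
    form: str


@dataclass(frozen=True)
class WordForm:
    """Lexical abstraction attached to a group of tokens (the group may be empty)."""
    lemma: str
    tokens: Tuple[str, ...]


# Illustrative segmentation of the French string "du jus de pomme de terre".
TOKENS: List[Token] = [
    Token("t1", "du"), Token("t2", "jus"), Token("t3", "de"),
    Token("t4", "pomme"), Token("t5", "de"), Token("t6", "terre"),
]

WORD_FORMS: List[WordForm] = [
    WordForm("de", ("t1",)),                           # one token, two word-forms
    WordForm("le", ("t1",)),
    WordForm("jus", ("t2",)),
    WordForm("de", ("t3",)),
    WordForm("pomme de terre", ("t4", "t5", "t6")),    # three tokens, one word-form
]


def word_forms_by_token(word_forms: List[WordForm]) -> Dict[str, List[str]]:
    """Invert the attachment: token id -> lemmas of the word-forms that use it."""
    index: Dict[str, List[str]] = {}
    for word_form in word_forms:
        for token_id in word_form.tokens:
            index.setdefault(token_id, []).append(word_form.lemma)
    return index


def surface_of(word_form: WordForm, tokens: List[Token]) -> List[str]:
    """Return the surface forms of the tokens attached to `word_form`."""
    by_id = {token.id: token.form for token in tokens}
    return [by_id[token_id] for token_id in word_form.tokens]
```

Keeping the attachment as a list of token identifiers, instead of nesting word-forms inside tokens or the reverse, is what allows any n-to-n combination to be recorded.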
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information about the word-form,
including its part of speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
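The idea that a tag is a code standing for a basic feature structure can be sketched as a lookup table. The tag codes (`NCFS`, `NCFP`, `AFS`) and the feature names other than `grammaticalGender` are assumptions made for this illustration and are not defined by the text above; a real tagset would take its names and values from the data category registry.

```python
from typing import Dict

FeatureStructure = Dict[str, str]

# Hypothetical tag library: each compact code names one flat feature structure.
TAG_LIBRARY: Dict[str, FeatureStructure] = {
    "NCFS": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "singular",
    },
    "NCFP": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
    "AFS": {
        "partOfSpeech": "adjective",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "singular",
    },
}


def expand_tag(code: str) -> FeatureStructure:
    """Return a copy of the feature structure that the tag code refers to."""
    return dict(TAG_LIBRARY[code])


def shared_features(first: str, second: str) -> FeatureStructure:
    """Keep only the feature-value pairs on which the two tags agree."""
    left, right = expand_tag(first), expand_tag(second)
    return {name: value for name, value in left.items() if right.get(name) == value}
```

Expanding tags before comparing them means that two annotation schemes with different codes can still be compared feature by feature.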
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard sets out a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Alongside the meta-model, MAF also provides a default XML syntax that may be used to serialise
MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is
essential, this International Standard also indicates how to articulate the MAF model with
TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that references the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure
representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express
morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
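The relationship between lemma, inflected form, lexical entry and lexicon (3.7 to 3.11) can be sketched as nested containers. This is an illustrative model only: the class names, the `lemmas_for` method, the feature names and the English sample entries are assumptions, not part of the definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexicalEntry:
    """Container for the word-forms and meanings that describe one lexeme."""
    lemma: str
    # inflected form -> morphological features carried by that form
    forms: Dict[str, Dict[str, str]] = field(default_factory=dict)
    senses: List[str] = field(default_factory=list)


@dataclass
class Lexicon:
    """Collection of lexical entries for one language."""
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)

    def lemmas_for(self, form: str) -> List[str]:
        """Return the lemma of every entry that lists `form` as an inflected form."""
        return [entry.lemma for entry in self.entries if form in entry.forms]


# Illustrative content; the feature names are placeholders.
ENGLISH = Lexicon(
    language="en",
    entries=[
        LexicalEntry(
            lemma="mouse",
            forms={
                "mouse": {"grammaticalNumber": "singular"},
                "mice": {"grammaticalNumber": "plural"},
            },
            senses=["small rodent"],
        ),
        LexicalEntry(
            lemma="see",
            forms={"see": {"tense": "present"}, "saw": {"tense": "past"}},
        ),
        LexicalEntry(
            lemma="saw",
            forms={
                "saw": {"grammaticalNumber": "singular"},
                "saws": {"grammaticalNumber": "plural"},
            },
            senses=["cutting tool"],
        ),
    ],
)
```

Looking up "saw" returns two lemmas, which is the kind of lexical ambiguity that an annotation has to be able to record.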
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
3.21
token
non-empty contiguous sequence of graphemes or phonemes in a document
Note 1 to entry: For editorial reasons, some annotation schemes may extend the notion of token to an empty sequence.
See the section on token attachment (6.2).
3.22
tokenization
process identifying tokens
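As an illustration of definitions 3.21 and 3.22, the sketch below identifies tokens as non-empty contiguous character spans and records their offsets in the source. The segmentation rule (a run of word characters, or a single punctuation mark) and the `Token` and `tokenize` names are assumptions made for the example; the definitions themselves do not prescribe any particular rule.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Non-empty contiguous span of the source text."""
    start: int   # offset of the first character
    end: int     # offset just past the last character
    form: str


# Assumed rule: a token is a run of word characters or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[Token]:
    """Identify tokens in `text` and keep their character offsets."""
    return [
        Token(match.start(), match.end(), match.group())
        for match in TOKEN_PATTERN.finditer(text)
    ]
```

Because each token keeps its offsets, the same result can be stored apart from the text and still point back to it.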
3.23
transcription
form resulting from a coherent method of writing down speech sounds
3.24
transliteration
form resulting from the conversion of one script into another, usually through a one-to-one correspondence
between characters
3.25
word-form
morpho-syntactic unit
contiguous or non-contiguous linguistic unit identified as corresponding to a lexical entity in a language
Note 1 to entry: Word-forms may have no acoustic or graphic realization, or may correspond to one or more tokens.
3.26
word lattice
set of possible alternative decompositions of a text or speech segment into word-forms
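A word lattice can be pictured as a directed acyclic graph (3.1) whose edges carry word-forms, each path from the first state to the last being one decomposition. The sketch below is a hypothetical illustration: the edge representation, the `decompositions` function and the French sample are assumptions, not the notation defined by this International Standard.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str]   # (source state, target state, word-form)


def decompositions(edges: List[Edge], start: int, end: int) -> List[List[str]]:
    """Enumerate every path from `start` to `end` as a list of word-forms."""
    outgoing: Dict[int, List[Tuple[int, str]]] = {}
    for source, target, word_form in edges:
        if target <= source:
            # Forward-only edges guarantee that the graph has no cycle.
            raise ValueError("edges must lead to a later state")
        outgoing.setdefault(source, []).append((target, word_form))

    def walk(state: int) -> List[List[str]]:
        if state == end:
            return [[]]
        paths: List[List[str]] = []
        for target, word_form in outgoing.get(state, []):
            for rest in walk(target):
                paths.append([word_form] + rest)
        return paths

    return walk(start)


# Two readings of the segment "pomme de terre": three simple word-forms or one compound.
LATTICE: List[Edge] = [
    (0, 1, "pomme"),
    (1, 2, "de"),
    (2, 3, "terre"),
    (0, 3, "pomme de terre"),
]
```

Calling `decompositions(LATTICE, 0, 3)` returns both readings, so the ambiguity is kept in the annotation and not resolved early.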
...
SLOVENIAN STANDARD
SIST ISO 24611:2013
1 July 2013
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF)
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS: 01.020 Terminology (principles and coordination)
SIST ISO 24611:2013 en,fr,de
2003-01. Slovenski inštitut za standardizacijo (Slovenian Institute for Standardization). Reproduction of this standard in whole or in part is not permitted.
---------------------- Page: 1 ----------------------SIST ISO 24611:2013
---------------------- Page: 2 ----------------------
SIST ISO 24611:2013
INTERNATIONAL ISO
STANDARD 24611
First edition
2012-11-01
Language resource management —
Morpho-syntactic annotation framework
(MAF)
Gestion des ressources langagières — Cadre d'annotation
morphosyntaxique (MAF)
Reference number
ISO 24611:2012(E)
ISO 2012
---------------------- Page: 3 ----------------------
SIST ISO 24611:2013
ISO 24611:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2012 – All rights reserved
---------------------- Page: 4 ----------------------
SIST ISO 24611:2013
ISO 24611:2012(E)
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts: we will henceforth refer to these as data categories. As such, data categories should
satisfy the following constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent
identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they
are in fact defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
documentation combining a full text elicitation of the meaning of the descriptor with the expression of
specific constraints that bear upon the category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
makes it possible to compare morpho-syntactic annotations based on different descriptive levels of
granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
the possible values of so-called complex data categories. For instance, it can be used to record that
possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be
a subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions
bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
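The three kinds of constraint listed above can be illustrated with a small in-memory sketch. This is not the ISOCat data model or API: the `DataCategory` class, the `REGISTRY` dictionary and the helper functions are hypothetical names introduced here only to show how generic-specific relations, conceptual domains and language-specific restrictions might interact, using the /properNoun/, /noun/ and /grammaticalGender/ examples from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class DataCategory:
    """Hypothetical, simplified stand-in for one registry entry."""
    name: str
    broader: Optional[str] = None                 # generic-specific relation
    domain: FrozenSet[str] = frozenset()          # conceptual domain (complex categories)
    per_language: Dict[str, FrozenSet[str]] = field(default_factory=dict)


# Illustrative content only, taken from the examples in the text.
REGISTRY = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        domain=frozenset({"masculine", "feminine", "neutral"}),
        per_language={"fr": frozenset({"masculine", "feminine"})},
    ),
}


def is_a(specific: str, generic: str) -> bool:
    """Follow generic-specific links upward, starting from `specific`."""
    current: Optional[str] = specific
    while current is not None:
        if current == generic:
            return True
        current = REGISTRY[current].broader
    return False


def allowed_values(category: str, language: Optional[str] = None) -> FrozenSet[str]:
    """Return the language-specific restriction if one exists, else the full domain."""
    entry = REGISTRY[category]
    if language is not None and language in entry.per_language:
        return entry.per_language[language]
    return entry.domain
```

With this sketch, an annotation tagged /properNoun/ can be compared with one tagged only /noun/ through `is_a("properNoun", "noun")`, and `allowed_values("grammaticalGender", "fr")` yields the two French values.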
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following specificities: on the one hand, they
can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-n combination can stand between word-forms and tokens.
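The distinction between tokens and word-forms can be sketched as two small record types linked by token identifiers. The class names, field names and the French sample string below are assumptions made for illustration; they are not the MAF serialization. The sample shows one token carrying two word-forms (the contraction "du") and three contiguous tokens carrying one word-form (the compound "pain de mie"), with tokens pointing into the source by character offsets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    ident: str
    start: int  # character offset into the source text
    end: int    # exclusive end offset


@dataclass(frozen=True)
class WordForm:
    lemma: str
    tokens: Tuple[str, ...]  # identifiers of the tokens this word-form spans


# Illustrative data only.
SOURCE = "du pain de mie"
TOKENS = [
    Token("t1", 0, 2),
    Token("t2", 3, 7),
    Token("t3", 8, 10),
    Token("t4", 11, 14),
]
WORD_FORMS = [
    WordForm("de", ("t1",)),                       # one token, several word-forms
    WordForm("le", ("t1",)),
    WordForm("pain de mie", ("t2", "t3", "t4")),   # several tokens, one word-form
]


def surface(word_form: WordForm) -> str:
    """Rebuild the surface string covered by a word-form from its tokens."""
    by_id = {t.ident: t for t in TOKENS}
    return " ".join(SOURCE[by_id[i].start:by_id[i].end] for i in word_form.tokens)


def lemmas_for_token(token_id: str) -> List[str]:
    """List the lemmas of every word-form attached to a given token."""
    return [wf.lemma for wf in WORD_FORMS if token_id in wf.tokens]
```

Keeping the link on the word-form side, as a tuple of token identifiers, is one simple way to allow any combination of tokens and word-forms, including a word-form with no token at all.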
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information, including its part of
speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
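The idea that a compact tag is a code standing for a basic feature structure can be sketched as a lookup table. The tag codes (`NCMS`, `N.m.sg` and so on) and the two tagsets below are invented for illustration, and the feature and value names other than /grammaticalGender/, /masculine/ and /feminine/ are assumed rather than taken from the text. The sketch also shows how two differently coded tags can be judged equivalent once both are expanded to the same features and values.

```python
from typing import Dict

FeatureStructure = Dict[str, str]

# Two hypothetical tagsets using different codes for the same descriptions.
TAGSET_A: Dict[str, FeatureStructure] = {
    "NCMS": {
        "partOfSpeech": "commonNoun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCFP": {
        "partOfSpeech": "commonNoun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
}
TAGSET_B: Dict[str, FeatureStructure] = {
    "N.m.sg": {
        "partOfSpeech": "commonNoun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
}


def expand(tag: str, tagset: Dict[str, FeatureStructure]) -> FeatureStructure:
    """Return a copy of the feature structure that a compact tag stands for."""
    return dict(tagset[tag])


def equivalent(tag_a: str, tag_b: str) -> bool:
    """Compare a TAGSET_A tag with a TAGSET_B tag through their feature structures."""
    return expand(tag_a, TAGSET_A) == expand(tag_b, TAGSET_B)
```

Comparing the expanded structures rather than the codes themselves is what makes annotations from different tools comparable, provided both tagsets name their features and values by the same data categories.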
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides a means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard elicits a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise
MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is
essential, this International Standard also indicates how to articulate the MAF model with
TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted here that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure
representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
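As an illustration of the definition, the following sketch tests whether a set of directed edges has no cycles. It uses Kahn's topological-sort algorithm, which is a standard technique chosen here for the example and not something prescribed by this International Standard; the function name `is_dag` is likewise an assumption.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple


def is_dag(edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if the directed graph given by `edges` contains no cycle."""
    successors: Dict[Hashable, List[Hashable]] = {}
    indegree: Dict[Hashable, int] = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree[dst] = indegree.get(dst, 0) + 1
        indegree.setdefault(src, 0)

    # Repeatedly remove nodes that have no remaining incoming edge.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # If some node was never removed, it lies on or behind a cycle.
    return removed == len(indegree)
```

For example, `is_dag([(0, 1), (1, 2), (0, 2)])` holds, while a pair of edges pointing at each other, such as `[(0, 1), (1, 0)]`, does not.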
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express
morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
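A minimal sketch of such an automaton, assuming integer states and labelled transitions, is given below. It enumerates every label sequence leading from the initial state to the final state, which is how alternative segmentations could be read off a small lattice. The `Transition` alias, the `paths` function and the French sample are illustrative assumptions, and the traversal assumes the automaton is acyclic (a DAG), since a cycle would make the enumeration endless.

```python
from typing import Dict, List, Tuple

Transition = Tuple[int, str, int]  # (source state, label, target state)


def paths(transitions: List[Transition], initial: int, final: int) -> List[List[str]]:
    """List the label sequences of all paths from `initial` to `final`.

    Assumes the transition graph is acyclic.
    """
    outgoing: Dict[int, List[Tuple[str, int]]] = {}
    for src, label, dst in transitions:
        outgoing.setdefault(src, []).append((label, dst))

    results: List[List[str]] = []

    def walk(state: int, labels: List[str]) -> None:
        if state == final:
            results.append(labels)
            return
        for label, dst in outgoing.get(state, []):
            walk(dst, labels + [label])

    walk(initial, [])
    return results


# Illustrative lattice: three separate units, or one compound spanning them.
LATTICE: List[Transition] = [
    (0, "pomme", 1),
    (1, "de", 2),
    (2, "terre", 3),
    (0, "pomme de terre", 3),
]
```

Calling `paths(LATTICE, 0, 3)` returns both readings of the sample, one per path through the lattice.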
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
...
INTERNATIONAL STANDARD ISO 24611
First edition
2012-11-01
Gestion des ressources langagières —
Cadre d'annotation morphosyntaxique
(MAF)
Language resource management — Morpho-syntactic annotation
framework (MAF)
Reference number
ISO 24611:2012(F)
ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an
intranet, without prior written permission. Permission can be requested from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Ch. de Blandonnet 8 CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
copyright@iso.org
www.iso.org
Sommaire Page
Avant-propos ................................................................................................................................................................... v
Introduction ................................................................................................................................................................... vi
1 Domaine d’application ................................................................................................................................... 1
2 Références normatives .................................................................................................................................. 1
3 Termes et définitions ..................................................................................................................................... 1
4 Le métamodèle MAF ....................................................................................................................................... 4
4.1 Vue d’ensemble ................................................................................................................................................ 4
4.2 Métamodèle MAF ............................................................................................................................................. 5
5 Segmentation .................................................................................................................................................... 6
5.1 Aspect général .................................................................................................................................................. 6
5.2 Description formelle: .................................................................................................................... 7
5.3 Notation enchâssée ......................................................................................................................................... 8
5.4 Représentation alternative pour les documents conformes à la TEI ........................................... 8
5.5 Notation déportée ........................................................................................................................................... 9
5.6 Attributs informatifs ................................................................................................................................... 10
5.7 Compléter la notation enchâssée ............................................................................................................ 10
5.7.1 Joindre des segments dans le mode enchâssé .................................................................................... 11
5.7.2 Segments chevauchants ............................................................................................................................. 11
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token, one word-form
6.2.2 Several contiguous tokens, one word-form
6.2.3 Several discontiguous tokens, one word-form
6.2.4 No token, one word-form
6.2.5 One token, several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identifying word-forms within a TEI-conformant document
7 Morpho-syntactic content
7.1 Overview
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Ambiguities in word-form content
8.2 Lexical ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simply structured variants
8.4.1 Unambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping in local lattices
8.5.3 Merging local lattices
8.5.4 Removal of
8.6 Formal description: and
Annex A (informative) Example encoded in the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of intellectual property rights or similar rights. ISO shall not be held responsible for failing to identify such rights or to give notice of their existence. Details of any intellectual property rights or similar rights identified during the development of the document are given in the Introduction and/or in the list of patent declarations received by ISO (see www.iso.org/brevets).
Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.
For an explanation of the meaning of ISO-specific terms and expressions related to conformity assessment, and for information about ISO's adherence to the World Trade Organization (WTO) principles on Technical Barriers to Trade (TBT), see www.iso.org/iso/fr/avant-propos.html.
The committee responsible for this document is ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for representing annotated language resources. To this end, it generalizes the modelling strategy initiated by its sister committee SC 3 for the representation of terminological data [Romary, 2001], according to which linguistic data models are seen as the combination of a generic data pattern (a metamodel) that is then refined by a selection of data categories providing the descriptors for the specific annotation level concerned. Such models are defined independently of any specific format and give the implementer the conceptual tool needed to design formats and to compare them according to their level of interoperability.
To represent any type of annotation, it is important to provide clear and reliable semantics for the various descriptors used, whether they take the form of formal feature-value pairs or appear directly as objects of a representation expressed, for example, in XML. For these semantics to be shared across annotation schemes and encoding applications, they should be implemented as a centralized registry of concepts, which are referred to here as data categories. As such, these data categories should fulfil the following conditions:
- from a technical point of view, they must provide unique and stable references (implemented as persistent identifiers in the sense of ISO 24619), so that the designer of a specific encoding scheme can refer to them in its specification. Two annotations are thus considered equivalent when they refer to the same data category (as feature and value);
- from a descriptive point of view, each semantically unique reference should be associated with precise documentation that combines a prose explanation of the descriptor's meaning with a statement of the specific constraints that apply to the category.
In recent years, ISO has developed a general framework for representing and maintaining such a data category registry covering all domains of language resources. This initiative, specified in ISO 12620, has led to the implementation of an online environment that serves two purposes: providing access to all the data categories standardized within ISO's language resource activities, and supporting the maintenance of the data category registry itself. The system also offers access to the data categories that language technology practitioners have defined in the course of their own work and subsequently shared with the community.
The data category registry, accessible through the ISOCat implementation (www.isocat.org), is merely a space of semantic objects offering only a limited set of ontological constraints. The aim is to make it easy to maintain an environment in which new categories can readily be inserted and reused without an in-depth consistency check against the rest of the registry. The basic constraints are those intrinsic to the data category model defined in ISO 12620:
- simple generic-specific relations, where they help to identify precisely the interoperability descriptors between data categories. For example, the fact that /properNoun/ is a subcategory of /noun/ makes it possible to compare morpho-syntactic annotations based on different levels of granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or identifiable, the possible values of a complex data category. For example, it can be used to record that the possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) can be a subset of {/masculine/, /feminine/, /neutral/};
- language-specific constraints, expressed either as application notes or as explicit restrictions on the conceptual domains of data categories. For example, it can be stated explicitly that /grammaticalGender/ in French takes only the two values {/masculine/, /feminine/}.
This International Standard provides a comprehensive framework for representing morpho-syntactic annotations (also known as part-of-speech annotations). This annotation level is a first level of abstraction over the linguistic data (textual or spoken); its structure and complexity can vary considerably depending on the language being annotated and on the characteristics of the annotation tool or annotation scheme used.
To address the complex issues of ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a metamodel that draws a clear distinction between two levels: tokens (representing the surface segmentation of the source) and word-forms (identifying the lexical abstractions associated with groups of tokens). The two levels share the following characteristics: on the one hand, each can be represented as simple sequences and as local graphs accommodating, for example, multiple segmentations and ambiguous items; on the other hand, any N-to-M combination can relate tokens and word-forms.
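The token / word-form distinction can be sketched in code. The following Python fragment is a minimal illustration only, not the normative MAF data model: the class names `Token` and `WordForm`, their fields, the helper `word_forms_per_token` and the French example are all assumptions made for this sketch. It shows the N-to-M relation with one token carrying two word-forms (French "du", analysed as "de" + "le") and three contiguous tokens realizing one compound word-form ("chemin de fer").

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    """A surface segment of the source text (hypothetical structure)."""
    id: str
    form: str


@dataclass
class WordForm:
    """A lexical abstraction attached to zero or more tokens (hypothetical structure)."""
    lemma: str
    tokens: list = field(default_factory=list)  # ids of the tokens it covers


def word_forms_per_token(word_forms):
    """Invert the word-form -> token links to expose the N-to-M relation."""
    index = {}
    for wf in word_forms:
        for token_id in wf.tokens:
            index.setdefault(token_id, []).append(wf.lemma)
    return index


# Surface segmentation of the phrase "du chemin de fer".
tokens = [
    Token("t1", "du"),
    Token("t2", "chemin"),
    Token("t3", "de"),
    Token("t4", "fer"),
]

# "du" is one token carrying two word-forms;
# "chemin de fer" is three tokens realizing one word-form.
word_forms = [
    WordForm("de", ["t1"]),
    WordForm("le", ["t1"]),
    WordForm("chemin de fer", ["t2", "t3", "t4"]),
]

links = word_forms_per_token(word_forms)
```

Keeping the links on the word-form side, as token identifiers, leaves the token sequence untouched, which is one simple way to accommodate both inline and stand-off tokenization.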
As linguistic segments (the term 'markables' is also used in the technical literature [e.g. Carletta et al. 1997]), tokens can either be embedded in the source document as inline markup or refer to it through stand-off annotation.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterizing the morpho-syntactic properties that are instantiated when the lexical entry is realized in the annotated text. These properties can take various forms, from the simple indication of a lemma to an explicit reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of tags, that is, codes that refer to basic feature structures (see the examples in Monachini and Calzolari, 1994). These codes can also convey morphological information, including the part of speech (e.g. noun, adjective or verb) and features such as number, gender, person, mood and verbal tense.
In line with the general modelling strategy of ISO/TC 37, this International Standard (the MAF framework) provides the means to relate morpho-syntactic tags, expressed as feature structures conforming to ISO 24610, to ISOCat data categories. A normative annex of this International Standard sets out a core set of data categories that can serve as a reference for most morpho-syntactic annotation tasks in a multilingual context. Nevertheless, users of this International Standard who find these categories inadequate in coverage, scope or semantics are invited to use ISOCat to define their own categories in accordance with the principles of ISO/TC 37.
Together with the metamodel, MAF also provides a default XML syntax that can be used to serialize conforming annotation models. Because many existing projects are based on the guidelines of the TEI consortium (Text Encoding Initiative, www.tei-c.org), particularly in the digital humanities, where accurate encoding of textual sources is essential, this International Standard also provides information on how to reconcile the MAF model with TEI-conformant encodings. The TEI guidelines already offer a wide variety of constructs and mechanisms for handling the many challenges posed by spoken corpora and their annotation (Romary and Witt, 2012).
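The link between tags, feature structures and data categories described in this Introduction can be made concrete with a small sketch. The Python below is illustrative only and does not call the ISOCat service or any real API: the dictionaries `PARENT`, `VALUE_DOMAIN` and `LANGUAGE_DOMAIN`, the helper functions, the feature name `partOfSpeech` and the language codes are hypothetical. Only the category names /noun/, /properNoun/ and /grammaticalGender/ and their values come from the text. It shows a generic-specific check and a language-specific restriction of a value domain applied to a tag held as a flat feature structure.

```python
# Generic-specific relation taken from the text: /properNoun/ is a subcategory of /noun/.
PARENT = {"properNoun": "noun"}

# Conceptual domain of /grammaticalGender/ as given in the text.
VALUE_DOMAIN = {"grammaticalGender": {"masculine", "feminine", "neutral"}}

# Language-specific restriction from the text: French allows only two values.
# The language code "fr" is an assumption of this sketch.
LANGUAGE_DOMAIN = {("grammaticalGender", "fr"): {"masculine", "feminine"}}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)
    return False


def is_valid(category, value, language=None):
    """Check a value against the language-specific domain if one exists,
    otherwise against the general conceptual domain."""
    domain = LANGUAGE_DOMAIN.get(
        (category, language), VALUE_DOMAIN.get(category, set())
    )
    return value in domain


# A tag expressed as a flat feature structure (feature name -> value).
tag = {"partOfSpeech": "properNoun", "grammaticalGender": "feminine"}

# A coarse-grained query for nouns still matches the finer-grained tag.
coarse_match = subsumes("noun", tag["partOfSpeech"])

# The same value domain is narrower for French than in general.
french_ok = is_valid("grammaticalGender", tag["grammaticalGender"], "fr")
french_neuter_ok = is_valid("grammaticalGender", "neutral", "fr")
other_neuter_ok = is_valid("grammaticalGender", "neutral", "de")
```

Falling back to the general domain when no language-specific restriction is recorded mirrors the idea that the registry imposes only a limited set of constraints.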
Finally, it should be noted that this International Standard provides the conceptual basis for the ISO 24614 series of standards on the segmentation of lexical units. All the general rules and principles defined in ISO 24614-1, together with the constraints expressed in the complementary parts dealing with specific languages, are to be understood in the light of the token / word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(F)
Language resource management -- Morpho-syntactic annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that references the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotation, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1, Language resource management -- Feature structures -- Part 1: Feature structure representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph containing directed edges and no cycles
Note 1 to entry: Directed acyclic graphs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are specified in ISO 24610-1.
3.4
FSA
finite state automaton
graph comprising a number of states, with an initial state and a final state, and a finite set of transitions from one state to another
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation mark.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take in a sentence or a clause
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features such as grammatical number or case.
3.8
lemma
lemmatized form
conventional form chosen to represent a lexeme
Note 1 à l’article: Dans les langues europ
...