Language resource management -- Linguistic annotation framework (LAF)

This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signal and video. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization serves as a pivot format to allow annotations expressed in one representation format to be mapped onto another.

NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and other related International Standards.

Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)

Upravljanje z jezikovnimi viri - Ogrodje za jezikoslovno označevanje (LAF)

General Information

Status: Published
Publication Date: 06-Jun-2013
Current Stage: 6060 - National Implementation/Publication (Adopted Project)
Start Date: 30-May-2013
Due Date: 04-Aug-2013
Completion Date: 07-Jun-2013

Buy Standard

Standard
SIST ISO 24612:2013
English language
24 pages
Standard
ISO 24612:2012 - Language resource management -- Linguistic annotation framework (LAF)
English language
19 pages

Standards Content (sample)

SLOVENIAN STANDARD
SIST ISO 24612:2013
01-July-2013
Upravljanje z jezikovnimi viri - Ogrodje za jezikoslovno označevanje (LAF)
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
This Slovenian standard is identical to: ISO 24612:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24612:2013 en,fr,de

2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.

INTERNATIONAL STANDARD ISO 24612
First edition
2012-06-15
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
Reference number
ISO 24612:2012(E)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Terms and definitions
3 LAF specification
3.1 Overview
3.2 LAF data model
3.3 LAF architecture
3.4 XML pivot format
3.5 XML elements for the resource header
3.6 Elements in the primary data document header
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction

Effective creation, encoding, processing and management of language resources is facilitated by a single high-level data model that supports analysis and design of both annotation schemes and representation formats. This International Standard is designed to support the development and use of computer applications relying on linguistically annotated resources and the exchange of these resources among different applications.
1 Scope

This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signal and video. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization serves as a pivot format to allow annotations expressed in one representation format to be mapped onto another.

NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and other related International Standards.

2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super- and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segments typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks (e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation. In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets, elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can refer to the same primary document (e.g. two different part of speech annotations for a given text).
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets, audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
3 LAF specification
3.1 Overview

LAF consists of the following.

- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be linked to regions of primary data. Nodes and edges may be associated with feature structures describing linguistic properties of regions of primary data linked to reachable nodes.

3.2 LAF data model

The LAF data model consists of

a) a structure for describing media, consisting of anchors that reference locations in primary data and regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.

The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary data as well as other annotations, in which nodes are associated with feature structures providing the annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the mapping) isomorphic to the LAF data model.

NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated linguistic phenomena).

Figure 1 — LAF data model
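
The three parts of the data model listed in 3.2 can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class and field names (`Region`, `Node`, `Edge`, `Annotation`, `Graph`, `links`, `source`, `target`) are hypothetical and are not defined by LAF, anchors are taken to be character offsets, and feature structures are modelled as nested dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# A feature value is either an atomic string or a nested feature structure.
FeatureValue = Union[str, "FeatureStructure"]
FeatureStructure = Dict[str, FeatureValue]


@dataclass(frozen=True)
class Region:
    """Area of primary data bounded by a non-empty, ordered list of anchors."""
    id: str
    anchors: Tuple[int, ...]  # assumption: character offsets into a text

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


@dataclass
class Node:
    id: str
    links: List[str] = field(default_factory=list)  # ids of linked regions


@dataclass
class Edge:
    id: str
    source: str  # id of the start node
    target: str  # id of the end node


@dataclass
class Annotation:
    label: str
    ref: str  # id of the annotated node or edge
    features: FeatureStructure = field(default_factory=dict)


@dataclass
class Graph:
    regions: Dict[str, Region] = field(default_factory=dict)
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


# One token node linked to one region, carrying a morphosyntactic annotation.
g = Graph()
g.regions["r1"] = Region("r1", (3, 6))
g.nodes["n1"] = Node("n1", links=["r1"])
g.annotations.append(Annotation("tok", "n1", {"pos": "NN", "lemma": "dog"}))
```

Keeping regions, graph structure and annotation content in separate containers mirrors the a), b), c) split above; a real implementation would probably also need media-specific anchor types.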
3.3 LAF architecture
3.3.1 Overview

Language resources conforming to the LAF architecture consist of the following, described in more detail in 3.3.2 to 3.3.5.

- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with some or all of the nodes and/or edges in a directed graph. All nodes reference either a base segmentation document (in which case the node has no outgoing edges) or other nodes in the same or other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and annotations, as well as headers for each primary data document and each annotation document in the collection (see 3.3.5).

It is recommended that whenever possible, each primary data document also be associated with an original artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the original text in the file format of a particular word processor or file viewer).

3.3.2 Primary data

Primary data consists of electronic data in any format, including character (text), image, audio and video. Primary data in a LAF-compliant resource are frozen as “read-only” to preserve the integrity of references to locations within the document or documents. Corrections and modifications to the primary data are treated as annotations and stored in a separate annotation document. Primary data documents containing textual data are encoded in UTF-8 (default) or UTF-16.

In the general case, primary data does not contain markup of any kind. If markup does exist in primary data (e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made between markup and other characters in the data when referring to locations in the document.
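
As a small illustration of the last point, the sketch below computes character offsets for a word in primary data that happens to contain markup. The sample string is made up for this example; the only claim it demonstrates is that tag characters are counted like any other characters.

```python
# Hypothetical primary data that contains markup.
primary = "<p>My dog has fleas</p>"

# Offsets are counted over the whole data stream, tags included.
start = primary.index("dog")
end = start + len("dog")

# "<p>" occupies offsets 0 to 3, so "dog" spans 6..9 rather than 3..6.
print(start, end)          # 6 9
print(primary[start:end])  # dog
```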

3.3.3 Annotation documents

Annotation documents contain linguistic information describing primary data. Annotations are always associated with a node in a graph that references regions defined over primary data, either directly or via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary data. LAF recommends representing each of the linguistic layers defined in language resource management in a separate annotation document for the purposes of exchange.
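
The idea that a node reaches primary data either directly or through other nodes can be sketched as a graph traversal. The function and the two dictionaries below (`links` from node id to region ids, `out_edges` from node id to ordered child node ids) are hypothetical helpers for illustration, not part of LAF.

```python
def reachable_regions(node_id, links, out_edges):
    """Collect region ids linked to node_id directly or via outgoing edges.

    links:     node id -> list of region ids (direct references)
    out_edges: node id -> ordered list of child node ids
    """
    seen = set()
    result = []

    def visit(nid):
        if nid in seen:  # guard against cycles in the graph
            return
        seen.add(nid)
        for region_id in links.get(nid, []):
            if region_id not in result:
                result.append(region_id)
        for child in out_edges.get(nid, []):
            visit(child)

    visit(node_id)
    return result


# Two token nodes reference regions directly; a phrase node is layered on top.
links = {"tok1": ["r1"], "tok2": ["r2"]}
out_edges = {"np1": ["tok1", "tok2"]}
print(reachable_regions("np1", links, out_edges))  # ['r1', 'r2']
```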

The granularity of the annotation (i.e. the smallest information unit to which the annotation applies) is dependent on the application. For example, a single annotation over text may cover a phoneme, word, sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data

Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are located between the base units of the primary data representation.

Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:

                    1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
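
The offsets listed above can be reproduced with a few lines of code. This is a sketch that assumes whitespace-delimited words and anchors placed between characters, so that each (start, end) pair behaves like a half-open slice.

```python
import re

text = "My dog has fleas"

# Each run of non-whitespace characters is one word; anchors sit between
# characters, so (start, end) matches Python's slice convention.
anchors = {m.group(): (m.start(), m.end()) for m in re.finditer(r"\S+", text)}

print(anchors)
# {'My': (0, 2), 'dog': (3, 6), 'has': (7, 10), 'fleas': (11, 16)}
```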

A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and <7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9>, and c = <9, 15>. Two graph nodes can then be created that reference <a, b> and <b, c>, respectively, thereby providing the coverage of regions <5, 9> and <7, 15>. Discontiguous regions are referenced by creating nodes referencing each component region and adding a node that is in turn linked to them.
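
The decomposition of overlapping spans described above can be sketched as follows. The function name `split_overlaps` and its return shape are assumptions made for this illustration; LAF itself does not prescribe an algorithm.

```python
def split_overlaps(spans):
    """Split possibly overlapping (start, end) spans into non-overlapping pieces.

    Returns (pieces, cover), where cover[i] lists the pieces that together
    make up spans[i].
    """
    # Every span boundary is a potential cut point.
    bounds = sorted({b for span in spans for b in span})
    # Keep only the intervals between neighbouring cut points that lie
    # inside at least one of the original spans.
    pieces = [
        (lo, hi)
        for lo, hi in zip(bounds, bounds[1:])
        if any(s <= lo and hi <= e for s, e in spans)
    ]
    cover = [[p for p in pieces if s <= p[0] and p[1] <= e] for s, e in spans]
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
print(pieces)  # [(5, 7), (7, 9), (9, 15)]
print(cover)   # [[(5, 7), (7, 9)], [(7, 9), (9, 15)]]
```

Each entry of `cover` corresponds to one graph node that references the listed sub-regions, which is how the original spans are recovered.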

The media types included in the resource are defined in the resource header. Each medium is associated with one or more anchor types. The header for each primary data document identifies the medium for that document, which in turn indicates the type of anchors used.

In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made between markup and other characters in the data when referring to locations in the document. For primary data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0 Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource header as an XPath expression. References to locations within these XML elements (i.e. XML element content) can be made using standard offsets, which will be computed by including the markup as part of the data stream; in this case, two media types would be associated with the primary document’s file type. See 3.3.5.2 for a full description of anchor and media type definitions in the resource header.

3.3.5 Headers
3.3.5.1 Overview

LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as well as headers for primary data and annotation documents themselves. This set of headers provides all metadata describing the provenance and encoding conventions for the data and its annotations, as well as information required for processing, such as anchor types or relations among primary data and annotation documents in the corpus.

3.3.5.2 Resource header

The resource header describes the resource as a whole, including its contents, file structure and encoding, and establishes definitions that are used in the primary data document and annotation document headers. Among these are the following.

- Categories used to describe primary data documents, typically the domain/subject area of general text.
- File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file types that are referenced and therefore required). The specification of file types enables automatic validation that all required elements of the resource are present.
- Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
- Annotation declarations describing the annotations in the resource, including their names, creator, links to relevant documentation and, optionally, an associated annotation schema.
- Media definitions specifying the media types included in the corpus and file naming conventions for files containing data of that type.
- Anchor types associating anchor type definitions with media types.
- Group definitions providing the names, descriptions and members of user-defined groups of annotations.

3.3.5.3 Primary data document header

Each primary data document is associated with an XML header file containing information describing its contents. Because the primary data document is not an XML document, the LAF primary data header is obligatory and shall be provided as a standalone file.

The primary data document header provides information about the source and contents of the primary data, as well as specifying category definitions and medium type by reference to definitions in the resource header. The primary data document header provides the PID for the primary data document and all associated annotation documents. The primary data document header provides all the information needed to process annotations associated with a given primary data document. It is presumed that this file is loaded first when a document and its annotations are to be processed.

The elements in the primary data document header are given in 3.6.

3.3.5.4 Annotation document header

The annotation document header includes a relevant subset of elements from the primary data header (i.e. those that describe the file contents rather than the provenance of an original text, etc.), together with additional elements that provide or point to information concerning the annotation content categories and dependencies between the annotation document and other documents. The annotation document header is not a separate document, but rather is included at the beginning of the annotation document. The elements in the annotation document header are given in 3.4.3.

3.4 XML pivot format
3.4.1 Overview

The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common target into and out of which other formats can be transduced. Although the LAF pivot format may be used in any context, it is assumed that users will represent annotations using their own formats, which can then be transduced to the LAF pivot format for the purposes of exchange, merging and comparison.

The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.

In GrAF:

- The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
- An annotation is a label and (optionally) a feature structure associated with a node or an edge in the graph.
- A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or another feature structure.
- An atomic feature value is a mapping from one string (the feature name) to another string (the atomic value). GrAF makes no attempt to do typing of feature values.
- Nodes may be associated with regions in the primary document, or connected to other nodes in the same or another annotation document. Nodes are associated with regions by <link> elements. Edges are used to connect (associate) nodes to other nodes.
- An edge represents a relationship between nodes. By default, the set of out edges from a node represents an ordered set of constituents of the annotation associated with the node. Other relationships may be specified by associating an annotation with the edge.

For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
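
To make the content model above more concrete, the following sketch builds a tiny annotation document with Python's standard `xml.etree.ElementTree`. The element names and the namespace come from this clause, and `@targets` on `link` is described in Table 3. The attribute names `label` and `ref` on `a`, and `name` and `value` on `f`, are assumptions made for this illustration, as are all identifier values; the header is left empty for brevity.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize GrAF elements without a prefix (default namespace).
ET.register_namespace("", GRAF_NS)


def q(tag):
    """Return the namespace-qualified name of a GrAF element."""
    return f"{{{GRAF_NS}}}{tag}"


graph = ET.Element(q("graph"))
ET.SubElement(graph, q("graphHeader"))  # header children omitted in this sketch

# A terminal node that refers to region "r1" of a segmentation document.
node = ET.SubElement(graph, q("node"), {f"{{{XML_NS}}}id": "n1"})
ET.SubElement(node, q("link"), {"targets": "r1"})

# An annotation on that node; attribute names here are assumed, see lead-in.
a = ET.SubElement(graph, q("a"), {"label": "tok", "ref": "n1"})
fs = ET.SubElement(a, q("fs"))
ET.SubElement(fs, q("f"), {"name": "pos", "value": "NN"})

xml_text = ET.tostring(graph, encoding="unicode")
print(xml_text)
```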
The root element is defined in Table 1. Required attributes are given in bold.

Table 1 -- Root element for GrAF annotation documents

<graph>
    Root node of the graph.
    Attribute: @xmlns [URL]: namespace declaration for the GrAF schema
    Example: <graph xmlns="http://www.xces.org/ns/GrAF/1.0/">

3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader

Elements of the annotation document header are given in Table 2, and elements to define graphs and annotations are given in Table 3. Required attributes are given in bold.

Table 2 -- Elements of the annotation document header

<graphHeader>
    Bracketing tag for elements of the annotation document header.
<labelsDecl>
    List of the annotation labels used in the document and their frequencies.
<labelUsage>
    Information for individual annotation labels.
    Attributes:
    @label [string]: Element name.
    @occurs [integer]: Number of occurrences in the document.
<dependencies>
    Documents required to process the annotations in this document, which will include a segmentation document and/or any annotation documents directly referenced in this document.
<dependsOn>
    File required to process this annotation.
    Attributes:
    @ann.id [IDREF]: The ID of the document as given in the associated primary data document.
<annotationSpaces>
    Annotation spaces referenced in this document.
<annotationSpace>
    Annotation space used in this document.
    Attributes:
    @as.id [IDREF]: The ID of the annotation space as defined in the resource header.
    @default [yes | no]: Indicates whether or not this annotation space is the default in this document. If the attribute is not present, no is assumed.
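
A header following the content model in 3.4.3 could be assembled as sketched below. The function `build_graph_header` and its parameters are hypothetical; the pairing of attributes with elements follows the order of the content model and of Table 2, which is an inference on our part, and the GrAF namespace is omitted to keep the sketch short.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_graph_header(labels, depends_on, annotation_spaces, default_space=None):
    """Build a graphHeader element (namespace omitted in this sketch).

    labels:            one entry per annotation in the document
    depends_on:        ids of documents this one requires
    annotation_spaces: ids of annotation spaces used
    default_space:     id of the default annotation space, if any
    """
    header = ET.Element("graphHeader")

    # labelsDecl: one labelUsage per distinct label, with its frequency.
    labels_decl = ET.SubElement(header, "labelsDecl")
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(labels_decl, "labelUsage",
                      {"label": label, "occurs": str(count)})

    deps = ET.SubElement(header, "dependencies")
    for ann_id in depends_on:
        ET.SubElement(deps, "dependsOn", {"ann.id": ann_id})

    spaces = ET.SubElement(header, "annotationSpaces")
    for as_id in annotation_spaces:
        attrs = {"as.id": as_id}
        if as_id == default_space:
            attrs["default"] = "yes"  # absent attribute means "no"
        ET.SubElement(spaces, "annotationSpace", attrs)

    return header


header = build_graph_header(["tok", "tok", "s"], ["seg"], ["xces"],
                            default_space="xces")
print(ET.tostring(header, encoding="unicode"))
```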
Table 3 -- Graph and annotation elements in GrAF annotation documents

<roots>
    One or more root elements that identify root nodes in the graph. This element is used when the graph contains either a graph that is a tree or a forest, i.e. more than one graph that is a well-formed tree.
<root>
    The node ID of a root node in the graph. Not all graphs will form a tree, but those that do can use the root element to identify the root node of the tree.
    Attribute:
    @node.id [IDREF]: The ID of the root node.
<region>
    Region in the artefact being annotated, defined as the area bounded by a non-empty, ordered list of anchors. The number of anchors required to bound a region depends on the medium being annotated.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @anchors [string] (alternative to @refs): The anchors that bound this region. The anchors attribute contains a whitespace-delimited list of values that represent the anchor values. Applications are expected to know how to parse the string representation of an anchor into a location in the artefact being annotated. The <region> element shall have either an @anchors attribute or an @refs attribute.
    @refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the region. The <region> element shall have either an @anchors attribute or an @refs attribute.
    @anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If no @anchor.id is specified for the region, the default anchor type for the document (indicated in the resource header) is assumed. If the @refs attribute is used to refer to <anchor> elements, the @anchor.id attribute will be specified on the <anchor> elements and should not be given on <region>.
    Examples:
    <region ... anchors="980 983"/>
    <region ... anchors="10,59 10,173 149,173 149,59"/>
    <region ... anchors="34 42"/>
<anchor>
    Location in the artefact being annotated. How the location is represented is medium-dependent. Applications are required to be able to serialize and de-serialize location values to and from strings appearing in the @value attribute as well as the @anchors attribute on the <region> element.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @value [string]: The offset value of the anchor. How the attribute value is interpreted as a location in the artefact being annotated is medium-dependent.
    @anchor.id [string]: The @xml:id of an anchor type defined in the resource header.
<node>
    Node in the graph. The <node> element is empty when connected by an <edge> element to another node in the graph (i.e. when the node is a non-terminal node). A child <link> element is used when the node refers to a region or regions of primary data (i.e. when the node is a terminal/leaf node).
    Attribute:
    @xml:id [ID]: Unique ID for reference from edges and annotations.
<link>
    Identifies region(s) in a base segmentation document referred to by this node.
    Attribute:
    @targets [IDREFS]: Identifiers of referenced region(s).
<edge>
    Edge in the graph.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes and annotations.
    @from [IDREF]: ID of the start node of the edge.
    @to [IDREF]: ID of the end node of the edge.
<a>
    Annotation information associated with a node
...

INTERNATIONAL ISO
STANDARD 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
ISO 24612:2012(E)
ISO 2012
---------------------- Page: 1 ----------------------
ISO 24612:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,

electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or

ISO's member body in the country of the requester.
ISO copyright office
Case postale 56  CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1  Scope
2  Terms and definitions
3  LAF specification
3.1  Overview
3.2  LAF data model
3.3  LAF architecture
3.4  XML pivot format
3.5  XML elements for the resource header
3.6  Elements in the primary data document header
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.
Introduction

Effective creation, encoding, processing and management of language resources is facilitated by a single high-level data model that supports analysis and design of both annotation schemes and representation formats. This International Standard is designed to support the development and use of computer applications relying on linguistically annotated resources and the exchange of these resources among different applications.
1 Scope

This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signal and video. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization serves as a pivot format to allow annotations expressed in one representation format to be mapped onto another.

NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.

Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation

2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)

Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super- and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segments typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks (e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)

EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in the data.

Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation. In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing the primary data

Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets, elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can refer to the same primary document (e.g. two different part-of-speech annotations for a given text).

2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)

Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets, audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)

2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G

Note to entry: The terms node and vertex are used interchangeably in this document.

2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
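
The definitions in 2.12 to 2.14 can be illustrated with a minimal Python sketch. The node identifiers and the helper function `out_neighbours` are hypothetical names chosen for illustration; they are not defined by this International Standard.

```python
# A graph G as a set of nodes V(G) and a set of edges E(G).
# Node identifiers are hypothetical.
nodes = {"n1", "n2", "n3"}

# Each edge is an ordered pair (u, v); the order gives the direction u -> v.
edges = {("n1", "n2"), ("n1", "n3")}


def out_neighbours(node, edge_set):
    """Return the nodes reached by edges that start at `node`."""
    return sorted(v for (u, v) in edge_set if u == node)


print(out_neighbours("n1", edges))  # nodes reached from n1
print(("n2", "n1") in edges)        # the reversed pair is a different edge
```

Because an edge is an ordered pair, (n1, n2) and (n2, n1) are distinct edges, which is what makes the graph directed.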
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be linked to regions of primary data. Nodes and edges may be associated with feature structures describing linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.

The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary data as well as other annotations, in which nodes are associated with feature structures providing the annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the mapping) isomorphic to the LAF data model.

NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated linguistic phenomena).
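
The three parts of the data model can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a normative implementation: the class names `Region`, `Node` and `Graph`, the method `reachable_regions`, the use of a flat dictionary for the feature structure, and the part-of-speech values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Area of primary data bounded by an ordered list of anchors."""
    anchors: tuple  # e.g. (start, end) character offsets for text


@dataclass
class Node:
    node_id: str
    regions: list = field(default_factory=list)   # links to regions (leaf nodes)
    features: dict = field(default_factory=dict)  # simplified feature structure


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # ordered pairs (from_id, to_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, from_id, to_id):
        self.edges.append((from_id, to_id))

    def reachable_regions(self, node_id):
        """Collect regions linked to the node or to any node reachable from it."""
        seen, stack, found = set(), [node_id], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            found.extend(self.nodes[current].regions)
            stack.extend(to for (frm, to) in self.edges if frm == current)
        return sorted(found, key=lambda region: region.anchors)


g = Graph()
g.add_node(Node("tok1", [Region((0, 2))], {"pos": "PRP$"}))
g.add_node(Node("tok2", [Region((3, 6))], {"pos": "NN"}))
g.add_node(Node("np1", features={"cat": "NP"}))
g.add_edge("np1", "tok1")
g.add_edge("np1", "tok2")
print(g.reachable_regions("np1"))
```

The node `np1` has no region of its own; its annotation applies to the primary data through the regions of the nodes it reaches via edges.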
Figure 1 — LAF data model
3.3 LAF architecture
3.3.1 Overview

Language resources conforming to the LAF architecture consist of the following, described in more detail in 3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with some or all of the nodes and/or edges in a directed graph. All nodes reference either a base segmentation document (in which case the node has no outgoing edges) or other nodes in the same or other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and annotations, as well as headers for each primary data document and each annotation document in the collection (see 3.3.5).

It is recommended that whenever possible, each primary data document also be associated with an original artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data

Primary data consists of electronic data in any format, including character (text), image, audio and video. Primary data in a LAF-compliant resource is frozen as “read-only” to preserve the integrity of references to locations within the document or documents. Corrections and modifications to the primary data are treated as annotations and stored in a separate annotation document. Primary data documents containing textual data are encoded in UTF-8 (default) or UTF-16.

In the general case, primary data does not contain markup of any kind. If markup does exist in primary data (e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made between markup and other characters in the data when referring to locations in the document.
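
The effect of treating markup as part of the data stream can be shown with a short Python sketch. It assumes that offsets are counted in characters of the decoded text; the helper `span_of` and the sample string are hypothetical.

```python
# Primary data that happens to contain markup; the tags are ordinary characters.
primary = "<p>My dog has fleas</p>"


def span_of(text, word):
    """Return (start, end) character offsets of the first occurrence of `word`."""
    start = text.index(word)
    return start, start + len(word)


dog_span = span_of(primary, "dog")
print(dog_span)                            # offsets include the three characters of "<p>"
print(primary[dog_span[0]:dog_span[1]])    # the referenced characters
```

The word "dog" starts at offset 6 here rather than 3, because the characters of the opening tag are counted like any other characters.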

3.3.3 Annotation documents

Annotation documents contain linguistic information describing primary data. Annotations are always associated with a node in a graph that references regions defined over primary data, either directly or via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary data. LAF recommends representing each of the linguistic layers defined in language resource management in a separate annotation document for the purposes of exchange.

The granularity of the annotation (i.e. the smallest information unit to which the annotation applies) is dependent on the application. For example, a single annotation over text may cover a phoneme, word, sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data

Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are located between the base units of the primary data representation.

Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:

                    1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
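
The anchors listed above can be reproduced with a few lines of Python. This is a sketch that assumes whitespace-delimited words and character offsets; the function name `word_anchors` is hypothetical.

```python
import re


def word_anchors(text):
    """Return (word, start, end) for each whitespace-delimited word in `text`."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


anchors = word_anchors("My dog has fleas")
for word, start, end in anchors:
    print(f"{word}: start={start}, end={end}")
```

Each end offset is the position just after the last character of the word, so the anchors fall between characters rather than on them.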

A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and <7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9>, and c = <9, 15>. Two graph nodes can then be created, one referencing a and b and the other referencing b and c, thereby providing the coverage of regions <5, 9> and <7, 15>. Discontiguous regions are referenced by creating nodes referencing each component region and adding a node that is in turn linked to them.
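
The decomposition of overlapping spans described above can be sketched in Python. The function names `split_overlaps` and `cover` are hypothetical, and spans are assumed to be (start, end) pairs of integer offsets.

```python
def split_overlaps(spans):
    """Cut spans at every boundary so that no two resulting spans overlap."""
    cuts = sorted({point for span in spans for point in span})
    pieces = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Keep a piece only if at least one original span covers it.
        if any(start <= lo and hi <= end for (start, end) in spans):
            pieces.append((lo, hi))
    return pieces


def cover(span, pieces):
    """Return the pieces that together make up `span`."""
    start, end = span
    return [p for p in pieces if start <= p[0] and p[1] <= end]


pieces = split_overlaps([(5, 9), (7, 15)])
print(pieces)                   # the finer-grained spans a, b and c
print(cover((5, 9), pieces))    # regions for the first node
print(cover((7, 15), pieces))   # regions for the second node
```

Each original span is then represented by a node that references the pieces returned by `cover`, so no two regions in the base segmentation overlap.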

The media types included in the resource are defined in the resource header. Each medium is associated with one or more anchor types. The header for each primary data document identifies the medium for that document, which in turn indicates the type of anchors used.
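
Since the medium determines how an anchor value is interpreted, an application needs one parser per anchor type. The Python sketch below assumes three hypothetical medium names ("text", "image", "audio") and assumes that anchor values are serialized as whitespace-delimited strings, as in the @anchors attribute described in Table 3; the function name `parse_anchors` is also hypothetical.

```python
def parse_anchors(value, medium):
    """Parse a whitespace-delimited anchor string according to the medium."""
    tokens = value.split()
    if medium == "text":    # assumed: integer character offsets
        return [int(token) for token in tokens]
    if medium == "image":   # assumed: "x,y" coordinate pairs
        return [tuple(int(c) for c in token.split(",")) for token in tokens]
    if medium == "audio":   # assumed: time offsets as decimal numbers
        return [float(token) for token in tokens]
    raise ValueError(f"unknown medium: {medium}")


print(parse_anchors("980 983", "text"))
print(parse_anchors("10,59 10,173 149,173 149,59", "image"))
```

Which medium each serialized value belongs to is an assumption of this sketch; in a LAF resource that information comes from the anchor type definitions in the resource header.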

In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made between markup and other characters in the data when referring to locations in the document. For primary data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0 Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource header as an XPath expression. References to locations within these XML elements (i.e. XML element content) can be made using standard offsets, which will be computed by including the markup as part of the data stream; in this case, two media types would be associated with the primary document’s file type. See 3.3.5.2 for a full description of anchor and media type definitions in the resource header.

3.3.5 Headers
3.3.5.1 Overview

LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as well as headers for the primary data and annotation documents themselves. This set of headers provides all metadata describing the provenance and encoding conventions for the data and its annotations, together with information required for processing, such as anchor types or relations among primary data and annotation documents in the corpus.
3.3.5.2 Resource header

The resource header describes the resource as a whole, including its contents, file structure and encoding, and establishes definitions that are used in the primary data document and annotation document headers. Among these are the following.
- Categories used to describe primary data documents, typically the domain/subject area of general text.
- File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file types that are referenced and therefore required). The specification of file types enables automatic validation that all required elements of the resource are present.
- Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
- Annotation declarations describing the annotations in the resource, including their names, creator, links to relevant documentation and, optionally, an associated annotation schema.
- Media definitions specifying the media types included in the corpus and file naming conventions for files containing data of that type.
- Anchor types associating anchor type definitions with media types.
- Group definitions providing the names, descriptions and members of user-defined groups of annotations.

3.3.5.3 Primary data document header

Each primary data document is associated with an XML header file containing information describing its contents. Because the primary data document is not an XML document, the LAF primary data header is obligatory and shall be provided as a standalone file.

The primary data document header provides information about the source and contents of the primary data, as well as specifying category definitions and medium type by reference to definitions in the resource header. The primary data document header provides the PID for the primary data document and all associated annotation documents. The primary data document header provides all the information needed to process annotations associated with a given primary data document. It is presumed that this file is loaded first when a document and its annotations are to be processed.

The elements in the primary data document header are given in 3.6.
3.3.5.4 Annotation document header

The annotation document header includes a relevant subset of elements from the primary data header (i.e. those that describe the file contents rather than the provenance of an original text, etc.), together with additional elements that provide or point to information concerning the annotation content categories and dependencies between the annotation document and other documents. The annotation document header is not a separate document, but rather is included at the beginning of the annotation document. The elements in the annotation document header are given in 3.4.3.
3.4 XML pivot format
3.4.1 Overview

The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common target into and out of which other formats can be transduced. Although the LAF pivot format may be used in any context, it is assumed that users will represent annotations using their own formats, which can then be transduced to the LAF pivot format for the purposes of exchange, merging and comparison.

The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.

In GrAF:
- The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
- An annotation is a label and (optionally) a feature structure associated with a node or an edge in the graph.
- A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or another feature structure.
- An atomic feature value is a mapping from one string (the feature name) to another string (the atomic value). GrAF makes no attempt to do typing of feature values.
- Nodes may be associated with regions in the primary document, or connected to other nodes in the same or another annotation document. Nodes are associated with regions by <link> elements. Edges are used to connect (associate) nodes to other nodes.
- An edge represents a relationship between nodes. By default, the set of out-edges from a node represents an ordered set of constituents of the annotation associated with the node. Other relationships may be specified by associating an annotation with the edge.
For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
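
The structure described in this subclause can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names for nodes, links, edges and annotations follow Table 3 of this document; the `name` and `value` attributes on the feature element, the identifiers `n1`, `r1`, `e1`, etc., and the helper `q` are assumptions made for illustration. The annotation document header is omitted for brevity, so the result is not a complete GrAF document.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id


def q(tag):
    """Qualify a tag name with the GrAF namespace."""
    return f"{{{GRAF_NS}}}{tag}"


ET.register_namespace("", GRAF_NS)  # emit GrAF as the default namespace
graph = ET.Element(q("graph"))

# Leaf nodes: each links to a region defined in a base segmentation document.
for node_id, region_id in [("n1", "r1"), ("n2", "r2")]:
    node = ET.SubElement(graph, q("node"), {XML_ID: node_id})
    ET.SubElement(node, q("link"), {"targets": region_id})

# Non-terminal node, connected to its constituents by edges.
ET.SubElement(graph, q("node"), {XML_ID: "n3"})
ET.SubElement(graph, q("edge"), {XML_ID: "e1", "from": "n3", "to": "n1"})
ET.SubElement(graph, q("edge"), {XML_ID: "e2", "from": "n3", "to": "n2"})

# Annotation: a label plus an optional feature structure.
a = ET.SubElement(graph, q("a"), {"label": "NP", "ref": "n3"})
fs = ET.SubElement(a, q("fs"))
ET.SubElement(fs, q("f"), {"name": "cat", "value": "NP"})  # assumed attribute names

xml_text = ET.tostring(graph, encoding="unicode")
print(xml_text)
```

The annotation is attached to node `n3` by reference rather than by nesting, which keeps the graph structure and the annotation content separable.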
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
The root element is defined in Table 1. Required attributes are given in bold.
Table 1 — Root element for GrAF annotation documents
<graph>
Root node of the graph.
Attribute:
@xmlns [URL]: namespace declaration for the GrAF schema.
Example:
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader

Elements of the annotation document header are given in Table 2, and elements to define graphs and annotations are given in Table 3. Required attributes are given in bold.
Table 2 — Elements of the annotation document header
<graphHeader>
Bracketing tag for elements of the annotation document header.

<labelsDecl>
List of the annotation labels used in the document and their frequencies.

<labelUsage>
Information for individual annotation labels.
Attributes:
@label [string]: Element name.
@occurs [integer]: Number of occurrences in the document.

<dependencies>
Documents required to process the annotations in this document, which will include a segmentation document and/or any annotation documents directly referenced in this document.

<dependsOn>
File required to process this annotation.
Attribute:
@ann.id [IDREF]: The ID of the document as given in the associated primary data document.

<annotationSpaces>
Annotation spaces referenced in this document.

<annotationSpace>
Annotation space used in this document.
Attributes:
@as.id [IDREF]: The ID of the annotation space as defined in the resource header.
@default [yes | no]: Indicates whether or not this annotation space is the default in this document. If the attribute is not present, no is assumed.
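
The label list with frequencies in the annotation document header can be derived from the labels of the annotations in the document. The Python sketch below assumes the labels are already available as a list of strings; the function name `build_labels_decl` and the sample labels are hypothetical, and the GrAF namespace is left out to keep the example short.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_labels_decl(labels):
    """Build a labelsDecl element with one labelUsage child per distinct label."""
    decl = ET.Element("labelsDecl")
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(decl, "labelUsage", {"label": label, "occurs": str(count)})
    return decl


decl = build_labels_decl(["NN", "VBZ", "NN", "PRP$"])
usage = {u.get("label"): int(u.get("occurs")) for u in decl}
print(ET.tostring(decl, encoding="unicode"))
```

Sorting the labels makes the generated header deterministic, which is convenient when annotation documents are compared or kept under version control.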
Table 3 — Graph and annotation elements in GrAF annotation documents

<roots>
One or more root elements that identify root nodes in the graph. This element is used when the graph is either a tree or a forest, i.e. more than one graph that is a well-formed tree.

<root>
The node ID of a root node in the graph. Not all graphs will form a tree, but those that do can use the root element to identify the root node of the tree.
Attribute:
@node.id [IDREF]: The ID of the root node.

<region>
Region in the artefact being annotated, defined as the area bounded by a non-empty, ordered list of anchors. The number of anchors required to bound a region depends on the medium being annotated.
Attributes:
@xml:id [ID]: Unique ID for reference from nodes in the graph.
@anchors [string] (alternative to @refs): The anchors that bound this region. The @anchors attribute contains a whitespace-delimited list of values that represent the anchor values. Applications are expected to know how to parse the string representation of an anchor into a location in the artefact being annotated. The <region> element shall have either an @anchors attribute or an @refs attribute.
@refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the region. The <region> element shall have either an @anchors attribute or an @refs attribute.
@anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If no @anchor.id is specified for the region, the default anchor type for the document (indicated in the resource header) is assumed. If the @refs attribute is used to refer to <anchor> elements, the @anchor.id attribute will be specified on the <anchor> elements and should not be given on <region>.
Example:
<region ... anchors="980 983"/>
<region ... anchors="10,59 10,173 149,173 149,59"/>
<region ... anchors="34 42"/>

<anchor>
Location in the artefact being annotated. How the location is represented is medium-dependent. Applications are required to be able to serialize and de-serialize location values to and from the strings appearing in the @value attribute of the <anchor> element as well as the @anchors attribute of the <region> element.
Attributes:
@xml:id [ID]: Unique ID for reference from nodes in the graph.
@value [string]: The offset value of the anchor. How the attribute value is interpreted as a location in the artefact being annotated is medium-dependent.
@anchor.id [string]: The @xml:id of an anchor type defined in the resource header.

<node>
Node in the graph. The <node> element is empty when connected by an <edge> element to another node in the graph (i.e. when the node is a non-terminal node). A child <link> element is used when the node refers to a region or regions of primary data (i.e. when the node is a terminal/leaf node).
Attribute:
@xml:id [ID]: Unique ID for reference from edges and annotations.

<link>
Identifies region(s) in a base segmentation document referred to by this node.
Attribute:
@targets [IDREFS]: Identifiers of referenced region(s).
Example

<edge>
Edge in the graph.
Attributes:
@xml:id [ID]: Unique ID for reference from nodes and annotations.
@from [IDREF]: ID of the start node of the edge.
@to [IDREF]: ID of the end node of the edge.
Example

<a>
Annotation information associated with a node or edge. This tag may be empty if the annotation consists of a label only.
Attributes:
@label [string]: The label of the annotation. This may be the string used to identify the annotation as described by the annotation documentation, a category identifier from a data category registry, an identifier from a feature structure library, or any reference to an external annotation specification.
@ref [IDREF]: The ID of the node or edge with which the annotation is associated.
@as [string]: The ID of the annotation space of which this annotation is a part, as defined in the resource header; if no @as attribute is specified, the annotation space designated as the default in the annotation document header is assumed.

<fs>
Feature structure providing additional annotation information. An <a> element may not contain more than one <fs> element. The <fs> element may contain one or more <f> elements.

<f>
Attribute/value pair. In the concise form (given here), the <f> element is empty and includes attributes providing simple name/value pairs. More complex feature structures may be represented according to the specification in ISO 24610-1, which should be consulted for details.
...

SLOVENIAN STANDARD
SIST ISO 24612:2013
1 July 2013
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
This Slovenian standard is identical to: ISO 24612:2012
ICS:
01.020 Terminology (principles and coordination)
SIST ISO 24612:2013 en,fr,de

2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

---------------------- Page: 1 ----------------------
SIST ISO 24612:2013
---------------------- Page: 2 ----------------------
SIST ISO 24612:2013
INTERNATIONAL ISO
STANDARD 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
ISO 24612:2012(E)
ISO 2012
---------------------- Page: 3 ----------------------
SIST ISO 24612:2013
ISO 24612:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,

electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or

ISO's member body in the country of the requester.
ISO copyright office
Case postale 56  CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2012 – All rights reserved
---------------------- Page: 4 ----------------------
SIST ISO 24612:2013
ISO 24612:2012(E)
Contents Page

Foreword ............................................................................................................................................................ iv

Introduction ......................................................................................................................................................... v

1  Scope ...................................................................................................................................................... 1

2  Terms and definitions ........................................................................................................................... 1

3  LAF specification ................................................................................................................................... 3

3.1  Overview ................................................................................................................................................. 3

3.2  LAF data model ...................................................................................................................................... 3

3.3  LAF architecture .................................................................................................................................... 4

3.4  XML pivot format ................................................................................................................................... 6

3.5  XML elements for the resource header ............................................................................................. 11

3.6  Elements in the primary data document header .............................................................................. 16

Bibliography ...................................................................................................................................................... 19

© ISO 2012 – All rights reserved iii
---------------------- Page: 5 ----------------------
SIST ISO 24612:2013
ISO 24612:2012(E)
Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies

(ISO member bodies). The work of preparing International Standards is normally carried out through

ISO technical committees. Each member body interested in a subject for which a technical committee has

been established has the right to be represented on that committee. International organizations, governmental

and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the

International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards

adopted by the technical committees are circulated to the member bodies for voting. Publication as an

International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent

rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content

resources, Subcommittee SC 4, Language resource management.
Introduction

Effective creation, encoding, processing and management of language resources is facilitated by a single

high-level data model that supports analysis and design of both annotation schemes and representation

formats. This International Standard is designed to support the development and use of computer applications

relying on linguistically annotated resources and the exchange of these resources among different

applications.
INTERNATIONAL STANDARD ISO 24612:2012(E)
Language resource management — Linguistic annotation
framework (LAF)
1 Scope

This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic

annotations of language data such as corpora, speech signal and video. The framework includes an abstract

data model and an XML serialization of that model for representing annotations of primary data. The

serialization serves as a pivot format to allow annotations expressed in one representation format to be

mapped onto another.

NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and

other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.

Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of

characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech

annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary

data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun

linguistic information added to primary data (2.1), independent of its representation

2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation

annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)

Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super- and sub-segments, where groups of segments comprise the parts of a larger segment (e.g. contiguous word segments typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks (e.g. timestamps) that note a point in the primary data. In current practice, segmental information may or may not appear in the document containing the primary data itself.
2.6
linguistic annotation

annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)

EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in

the data.

Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.

In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that

segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that

identifies the segment as a word or sentence).
2.7
stand-off annotation

annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing

the primary data

Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,

elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can

refer to the same primary document (e.g. two different part of speech annotations for a given text).

2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)

Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,

audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be

coordinates.
2.10
region

area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)

2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G

Note to entry: The terms node and vertex are used interchangeably in this document.

2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be linked to regions of primary data. Nodes and edges may be associated with feature structures describing linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of

a) a structure for describing media, consisting of anchors that reference locations in primary data and

regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and

c) an annotation structure for representing annotation content with feature structures.

The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary

data as well as other annotations, in which nodes are associated with feature structures providing the

annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the

mapping) isomorphic to the LAF data model.

NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated

linguistic phenomena).
Figure 1 — LAF data model
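The three parts of the data model (media structure, graph structure, annotation structure) can be pictured as plain data types. The following is only an illustrative sketch: LAF defines an abstract model, and every class and field name below is an assumption made for the example, not something the standard prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# All names here are hypothetical; LAF specifies only the abstract model.
Anchor = Tuple[float, ...]                 # medium-dependent position, e.g. (3,) for a text offset
FeatureValue = Union[str, "FeatureStructure"]
FeatureStructure = Dict[str, FeatureValue]  # feature name -> atomic value or nested structure


@dataclass
class Region:
    """Area of primary data bounded by a non-empty, ordered list of anchors."""
    id: str
    anchors: List[Anchor]

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


@dataclass
class Node:
    """Graph node, optionally linked to regions and carrying annotation content."""
    id: str
    regions: List[str] = field(default_factory=list)       # ids of linked regions
    features: FeatureStructure = field(default_factory=dict)


@dataclass
class Edge:
    """Directed edge; the order (source, target) gives its direction."""
    id: str
    source: str
    target: str


@dataclass
class Graph:
    regions: Dict[str, Region] = field(default_factory=dict)
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
```

A node for the word "dog" could then be built as `Node("n1", regions=["r1"], features={"pos": "NN"})`, with `Region("r1", [(3,), (6,)])` supplying the character offsets.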
3.3 LAF architecture
3.3.1 Overview

Language resources conforming to the LAF architecture consist of the following, described in more detail in

3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with some or all of the nodes and/or edges in a directed graph. All nodes reference either a base segmentation document (in which case the node has no outgoing edges) or other nodes in the same or other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and annotations, as well as headers for each primary data document and each annotation document in the collection (see 3.3.5).

It is recommended that whenever possible, each primary data document also be associated with an original

artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the

original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data

Primary data consists of electronic data in any format, including character (text), image, audio and video.

Primary data in a LAF-compliant resource is frozen as “read-only” to preserve the integrity of references to

locations within the document or documents. Corrections and modifications to the primary data are treated as

annotations and stored in a separate annotation document. Primary data documents containing textual data

are encoded in UTF-8 (default) or UTF-16.

In the general case, primary data does not contain markup of any kind. If markup does exist in primary data

(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is

made between markup and other characters in the data when referring to locations in the document.
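As a small illustration of this rule, the sketch below locates a word in primary data that happens to contain markup. The sample string and the helper name `locate` are assumptions made for the example; the point is only that the tag characters are counted like any other characters when offsets are computed.

```python
def locate(data: str, token: str) -> tuple:
    """Return (start, end) character offsets of the first occurrence of token."""
    start = data.index(token)
    return start, start + len(token)


# Hypothetical primary data that happens to contain markup.
data = "<p>My dog</p>"
start, end = locate(data, "dog")
print(start, end, data[start:end])  # 6 9 dog
```

The three characters of the opening tag shift every offset by three compared with the same text without markup.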

3.3.3 Annotation documents

Annotation documents contain linguistic information describing primary data. Annotations are always associated with a node in a graph that references regions defined over primary data, either directly or via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary data. LAF recommends representing each of the linguistic layers defined in language resource management in a separate annotation document for the purposes of exchange.
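The idea of a layered annotation can be sketched as a graph traversal: starting from the annotated node, follow outgoing edges until nodes that link to regions are reached. This is a minimal sketch under assumed data structures (plain dictionaries keyed by node ID); the function and parameter names are hypothetical.

```python
from typing import Dict, List, Set


def reachable_regions(node: str,
                      out_edges: Dict[str, List[str]],
                      links: Dict[str, List[str]]) -> List[str]:
    """Collect region ids linked to `node` or to any node reachable from it."""
    found: List[str] = []
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for region in links.get(current, []):
            if region not in found:
                found.append(region)
        # Push children in reverse so they are visited in edge order.
        stack.extend(reversed(out_edges.get(current, [])))
    return found


# A noun-phrase node layered over two token nodes that link to regions.
out_edges = {"np": ["tok1", "tok2"]}
links = {"tok1": ["r1"], "tok2": ["r2"]}
print(reachable_regions("np", out_edges, links))  # ['r1', 'r2']
```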

The granularity of the annotation — i.e. the smallest information unit to which the annotation applies — is

dependent on the application. For example, a single annotation over text may cover a phoneme, word,

sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a

temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data

Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are

located between the base units of the primary data representation.

Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:

                    1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
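The anchors listed above can be reproduced with a few lines of code. The sketch below assumes that words are maximal runs of non-whitespace characters and that anchors are Python string offsets; the function name `word_anchors` is made up for the example.

```python
import re


def word_anchors(text):
    """Return (word, start, end) triples, with anchors lying between characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for word, start, end in word_anchors("My dog has fleas"):
    print(f"{word}: start={start}, end={end}")
```

Because the anchors sit between characters, the end anchor of one word and the start anchor of the next differ by exactly the width of the intervening space.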

A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be

portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping

regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and

<7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9>, and c = <9, 15>. Two graph nodes can

then be created that reference <a, b> and <b, c>, thereby providing the coverage of regions <5, 9> and

<7, 15>. Discontiguous regions are referenced by creating nodes referencing each component region and

adding a node that is in turn linked to them.
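The decomposition of overlapping spans can be sketched as follows. This is one possible approach, not a procedure defined by LAF: cut at every span boundary, keep the pieces that fall inside at least one original span, and record which pieces cover each original span. The function name is hypothetical.

```python
def split_overlaps(spans):
    """Split possibly overlapping (start, end) spans at every boundary.

    Returns the non-overlapping pieces and, for each original span,
    the ordered list of pieces that together cover it.
    """
    cuts = sorted({point for span in spans for point in span})
    pieces = [(a, b) for a, b in zip(cuts, cuts[1:])
              if any(s <= a and b <= e for s, e in spans)]
    cover = {span: [p for p in pieces if span[0] <= p[0] and p[1] <= span[1]]
             for span in spans}
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
print(pieces)           # [(5, 7), (7, 9), (9, 15)]
print(cover[(5, 9)])    # [(5, 7), (7, 9)]
print(cover[(7, 15)])   # [(7, 9), (9, 15)]
```

Each entry of `cover` corresponds to a graph node that references the listed component regions.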

The media types included in the resource are defined in the resource header. Each medium is associated with

one or more anchor types. The header for each primary data document identifies the medium for that

document, which in turn indicates the type of anchors used.

In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g.

HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made

between markup and other characters in the data when referring to locations in the document. For primary

data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0

Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource

header as an XPath expression. References to locations within these XML elements (i.e. XML element

content) can be made using standard offsets, which will be computed by including the markup as part of the

data stream; in this case, two media types would be associated with the primary document’s file type. See

3.3.5.2 for a full description of anchor and media type definitions in the resource header.

3.3.5 Headers
3.3.5.1 Overview

LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as

well as headers for primary data and annotation documents themselves. This set of headers provides all

metadata describing the provenance and encoding conventions for the data and its annotations, information

required for processing such as anchor types or relations among primary data and annotation documents in

the corpus.
3.3.5.2 Resource header

The resource header describes the resource as a whole, including its contents, file structure and encoding,

and establishes definitions that are used in the primary data document and annotation document headers.

Among these are the following.

- Categories used to describe primary data documents, typically the domain/subject area of general text.
- File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file types that are referenced and therefore required). The specification of file types enables automatic validation that all required elements of the resource are present.
- Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
- Annotation declarations describing the annotations in the resource, including their names, creator, links to relevant documentation and, optionally, an associated annotation schema.
- Media definitions specifying the media types included in the corpus and file naming conventions for files containing data of that type.
- Anchor types associating anchor type definitions with media types.
- Group definitions providing the names, descriptions and members of user-defined groups of annotations.
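The automatic validation enabled by file type dependencies might look like the sketch below. The file suffixes, the dependency table and the function name are all assumptions made for the example; an actual resource header would supply its own file type definitions.

```python
from typing import Dict, List, Set

# Hypothetical file types: file-name suffix -> suffixes of the file types it depends on.
FILE_TYPES: Dict[str, List[str]] = {
    ".txt": [],
    "-seg.xml": [".txt"],
    "-pos.xml": ["-seg.xml"],
}


def missing_dependencies(doc: str, present: Set[str],
                         file_types: Dict[str, List[str]] = FILE_TYPES) -> List[str]:
    """List files required by the files present for `doc` but absent from `present`."""
    missing: List[str] = []
    for suffix, deps in file_types.items():
        if doc + suffix in present:
            for dep in deps:
                needed = doc + dep
                if needed not in present and needed not in missing:
                    missing.append(needed)
    return missing


print(missing_dependencies("a", {"a.txt", "a-pos.xml"}))  # ['a-seg.xml']
```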

3.3.5.3 Primary data document header

Each primary data document is associated with an XML header file containing information describing its

contents. Because the primary data document is not an XML document, the LAF primary data header is

obligatory and shall be provided as a standalone file.

The primary data document header provides information about the source and contents of the primary data,

as well as specifying category definitions and medium type by reference to definitions in the resource header.

The primary data document header provides the PID for the primary data document and all associated

annotation documents. The primary data document header provides all the information needed to process

annotations associated with a given primary data document. It is presumed that this file is loaded first when a

document and its annotations are to be processed.
The elements in the primary data document header are given in 3.6.
3.3.5.4 Annotation document header

The annotation document header includes a relevant subset of elements from the primary data header (i.e.

those that describe the file contents rather than the provenance of an original text, etc.), together with

additional elements that provide or point to information concerning the annotation content categories and

dependencies between the annotation document and other documents. The annotation document header is

not a separate document, but rather is included at the beginning of the annotation document. The elements in

the annotation document header are given in 3.4.3.
3.4 XML pivot format
3.4.1 Overview

The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format

is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common

target into and out of which other formats can be transduced. Although the LAF pivot format may be used in

any context, it is assumed that users will represent annotations using their own formats, which can then be

transduced to the LAF pivot format for the purposes of exchange, merging and comparison.

The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.

In GrAF:

- The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
- An annotation is a label and (optionally) a feature structure associated with a node or an edge in the graph.
- A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or another feature structure.
- An atomic feature value is a mapping from one string (the feature name) to another string (the atomic value). GrAF makes no attempt to do typing of feature values.
- Nodes may be associated with regions in the primary document, or connected to other nodes in the same or another annotation document. Nodes are associated with regions by <link> elements. Edges are used to connect (associate) nodes to other nodes.
- An edge represents a relationship between nodes. By default, the set of out edges from a node represents an ordered set of constituents of the annotation associated with the node. Other relationships may be specified by associating an annotation with the edge.
For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
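Two of the points above, nested feature structures and the default reading of out edges as ordered constituents, can be sketched with ordinary Python containers. The representation (nested dictionaries, edges as pairs of node IDs) and the function names are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple, Union

# Feature name -> atomic string value or nested feature structure (hypothetical encoding).
FS = Dict[str, Union[str, "FS"]]


def atomic_paths(fs, prefix=()):
    """Yield (path, value) for every atomic value in a nested feature structure."""
    for name, value in fs.items():
        if isinstance(value, dict):
            yield from atomic_paths(value, prefix + (name,))
        else:
            yield prefix + (name,), value


def constituents(node: str, edges: List[Tuple[str, str]]) -> List[str]:
    """Targets of the out edges of `node`, in the order the edges are listed."""
    return [to for frm, to in edges if frm == node]


fs = {"cat": "NP", "agr": {"num": "sg", "pers": "3"}}
print(list(atomic_paths(fs)))
print(constituents("np", [("np", "det"), ("s", "np"), ("np", "n")]))  # ['det', 'n']
```

All atomic values are kept as strings, in line with the statement that GrAF does not type feature values.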
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
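A document following this content model could be assembled with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element names and the @xml:id, @targets and @label attributes come from this clause; the @ref attribute on <a> and the @name/@value attributes on <f> are assumptions made for the example, and the header contents are omitted for brevity.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # serialized as xml:id


def q(tag):
    """Qualify a tag name with the GrAF namespace."""
    return f"{{{GRAF_NS}}}{tag}"


ET.register_namespace("", GRAF_NS)   # emit GrAF as the default namespace

graph = ET.Element(q("graph"))
ET.SubElement(graph, q("graphHeader"))            # header contents omitted in this sketch

# A terminal node linked to a region in a base segmentation document.
node = ET.SubElement(graph, q("node"), {XML_ID: "n1"})
ET.SubElement(node, q("link"), {"targets": "r1"})

# Assumption: @ref on <a> and @name/@value on <f> are illustrative attribute names.
a = ET.SubElement(graph, q("a"), {"label": "tok", "ref": "n1"})
fs = ET.SubElement(a, q("fs"))
ET.SubElement(fs, q("f"), {"name": "pos", "value": "NN"})

xml_text = ET.tostring(graph, encoding="unicode")
print(xml_text)
```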
The root element is defined in Table 1. Required attributes are given in bold.
Table 1 — Root element for GrAF annotation documents
<graph>: Root node of the graph.
Attribute @xmlns [URL]: namespace declaration for the GrAF schema
Example
3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader

Elements of the annotation document header are given in Table 2, and elements to define graphs and

annotations are given in Table 3. Required attributes are given in bold.
Table 2 — Elements of the annotation document header
<graphHeader>: Bracketing tag for elements of the annotation document header.
<labelsDecl>: List of the annotation labels used in the document and their frequencies.
<labelUsage>: Information for individual annotation labels.
    Attributes:
    @label [string]: Element name.
    @occurs [integer]: Number of occurrences in the document.
<dependencies>: Documents required to process the annotations in this document, which will include a segmentation document and/or any annotation documents directly referenced in this document.
<dependsOn>: File required to process this annotation.
    Attributes:
    @ann.id [IDREF]: The ID of the document as given in the associated primary data document.
<annotationSpaces>: Annotation spaces referenced in this document.
<annotationSpace>: Annotation space used in this document.
    Attributes:
    @as.id [IDREF]: The ID of the annotation space as defined in the resource header.
    @default [yes | no]: Indicates whether or not this annotation space is the default in this document. If the attribute is not present, no is assumed.
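The label list with its frequencies can be derived mechanically from the annotations in a document. The sketch below counts labels and emits one entry per distinct label using the @label and @occurs attributes described in Table 2; the function name is hypothetical and the GrAF namespace is left out to keep the example short.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def labels_decl(labels):
    """Build a labelsDecl element with one labelUsage child per distinct label."""
    decl = ET.Element("labelsDecl")
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(decl, "labelUsage", {"label": label, "occurs": str(count)})
    return decl


decl = labels_decl(["tok", "tok", "s"])
print(ET.tostring(decl, encoding="unicode"))
```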
Table 3 — Graph and annotation elements in GrAF annotation documents
<roots>: One or more root elements that identify root nodes in the graph. This element is used when the graph contains either a graph that is a tree or a forest, i.e. more than one graph that is a well-formed tree.
<root>: The node ID of a root node in the graph. Not all graphs will form a tree, but those that do can use the root element to identify the root node of the tree.
    Attribute:
    @node.id [IDREF]: The ID of the root node.
<region>: Region in the artefact being annotated, defined as the area bounded by a non-empty, ordered list of anchors. The number of anchors required to bound a region depends on the medium being annotated.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @anchors [string] (alternative to @refs): The anchors that bound this region. The anchors attribute contains a whitespace-delimited list of values that represent the anchor values. Applications are expected to know how to parse the string representation of an anchor into a location in the artefact being annotated. The <region> element shall have either an @anchors attribute or an @refs attribute.
    @refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the region. The <region> element shall have either an @anchors attribute or an @refs attribute.
    @anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If no @anchor.id is specified for the region, the default anchor type for the document (indicated in the resource header) is assumed. If the @refs attribute is used to refer to <anchor> elements, the @anchor.id attribute will be specified on the <anchor> elements and should not be given on <region>.
    Example:
    <region … anchors="980 983"/>
    <region … anchors="10,59 10,173 149,173 149,59"/>
    <region … anchors="34 42"/>
<anchor>: Location in the artefact being annotated. How the location is represented is medium-dependent. Applications are required to be able to serialize and de-serialize location values to and from strings appearing in the @value attribute as well as the @anchors attribute on the <region> element.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @value [string]: The offset value of the anchor. How the attribute value is interpreted as a location in the artefact being annotated is medium-dependent.
    @anchor.id [string]: The @xml:id of an anchor type defined in the resource header.
<node>: Node in the graph. The <node> element is empty when connected by an <edge> element to another node in the graph (i.e. when the node is a non-terminal node). A child <link> element is used when the node refers to a region or regions of primary data (i.e. when the node is a terminal/leaf node).
    Attribute:
    @xml:id [ID]: Unique ID for reference from edges and annotations.
<link>: Identifies region(s) in a base segmentation document referred to by this node.
    Attribute:
    @targets [IDREFS]: Identifiers of referenced region(s).
    Example
<edge>: Edge in the graph.
    Attributes:
    @xml:id [ID]: Unique ID for reference from nodes and annotations.
    @from [IDREF]: ID of the start node of the edge.
    @to [IDREF]: ID of the end node of the edge.
    Example
<a>: Annotation information associated with a node or edge. This tag may be empty if the annotation consists of a label only.

Attributes @label [string]: The label of the annotation. This may be the string us

...
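Table 3 leaves the parsing of the @anchors string to the application. The sketch below shows one possible parser, assuming (from the example values "980 983" and "10,59 10,173 149,173 149,59") that anchors are separated by whitespace and that the components of a multi-dimensional anchor are separated by commas. The comma convention and the function name are assumptions, not requirements stated in the text.

```python
def parse_anchors(value):
    """Split a whitespace-delimited @anchors string into tuples of numbers.

    Assumption: components of one anchor (e.g. x,y image coordinates)
    are comma-separated; a plain offset yields a 1-tuple.
    """
    return [tuple(float(x) if "." in x else int(x) for x in item.split(","))
            for item in value.split()]


print(parse_anchors("980 983"))                       # [(980,), (983,)]
print(parse_anchors("10,59 10,173 149,173 149,59"))   # [(10, 59), (10, 173), (149, 173), (149, 59)]
```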
