ISO 24612:2012
Language resource management — Linguistic annotation framework (LAF)
ISO 24612:2012 specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signal and video. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization serves as a pivot format to allow annotations expressed in one representation format to be mapped onto another.
NOTE The standardization of the linguistic data categories that provide the content of annotations is specified in ISO 12620 and related International Standards.
Standards Content (Sample)
INTERNATIONAL STANDARD ISO 24612
First edition
2012-06-15
Language resource management - Linguistic annotation framework (LAF)
Responsibility for preparing the Russian version lies with GOST R
(Russian Federation) in accordance with Article 18.1 of the ISO Statutes
Reference number
ISO 24612:2012(R)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical, including photocopying and microfilm, without
permission in writing from either ISO at the address below or ISO's member body in the country of
the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 LAF specification
3.1 Overview
3.2 LAF data model
3.3 LAF architecture
3.4 XML pivot format
3.5 XML elements for the resource header
3.6 Elements in the primary data document header
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national
standards bodies (ISO member bodies). The work of preparing International Standards is normally
carried out through ISO technical committees. Each member body interested in a subject for which a
technical committee has been established has the right to be represented on that committee.
International organizations, governmental and non-governmental, in liaison with ISO, also take part
in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on
all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives,
Part 2.
The main task of technical committees is to prepare International Standards. Draft International
Standards adopted by the technical committees are circulated to the member bodies for voting.
Publication as an International Standard requires approval by at least 75 % of the member bodies
casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and
content resources, Subcommittee SC 4, Language resource management.
Introduction
Effective creation, encoding, processing and management of language resources is greatly
facilitated by a single high-level data model that supports the analysis and design of both
annotation schemes and representation formats. This International Standard is intended to support
the development and use of computer applications that rely on linguistically annotated language
resources, and the exchange of such resources among different applications.
INTERNATIONAL STANDARD ISO 24612:2012(R)
Language resource management - Linguistic annotation framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing
linguistic annotations of language data such as text corpora, speech signals and video. The
framework consists of an abstract data model and an XML serialization of that model for
representing annotations of primary data. The serialization serves as a pivot format that allows
annotations expressed in one representation format to be mapped onto another.
NOTE The standardization of the linguistic data categories that provide the content of annotations
is addressed in ISO 12620 and related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
language information represented in electronic form
EXAMPLES Text, image, speech signal.
Note to entry: Objects in primary data are typically referenced by their "locations" in an
electronic file, for example by the location of the characters that make up a sentence or word, or
by the point at which a given event begins or ends (as in speech annotation). More complex data
objects may consist of a list or set of contiguous or non-contiguous locations in the primary data.
2.2
annotate
add linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which an annotation (2.3) is rendered, independent of its content
EXAMPLE XML format, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements appearing in the primary data (2.1)
Note to entry: Such elements include: (1) continuous segments (appearing contiguously in the
primary data); (2) super-segments and sub-segments, where groups of segments make up a larger
segment (e.g. a group of contiguous word segments typically makes up a sentence segment);
(3) discontinuous segments (linking continuous segments); and (4) landmarks (e.g. time stamps) that
mark a specific position in the primary data. In current annotation practice, segmentation
information may or may not appear in the document that contains the primary data.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about segments of primary data (2.1)
EXAMPLE A morphosyntactic annotation in which a part of speech and a lemma are associated with
each segment of the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also
constitutes a linguistic annotation. In current annotation practice, segmentation is often combined,
wherever possible, with the identification of the linguistic role or properties of the segment
(e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or as a sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from the
one that contains the primary data
Note to entry: Stand-off annotations are linked to specific locations in the primary data by
referencing character offsets, elements, etc. Multiple stand-off annotation documents may be
associated with the same primary document (e.g. two different part-of-speech annotations of the
same text).
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) to be annotated (2.2)
Note to entry: The way an anchor is described depends on the medium. For example, text anchors may
be character offsets, audio anchors may be time offsets, video anchors may be time offsets or
frame indices, and image anchors may be coordinates.
2.10
region
area of primary data (2.1) defined by a non-empty ordered list of anchors (2.9)
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) are extracted
2.12
graph
set of nodes (vertices) V(G) together with a set of edges E(G) connecting them
2.13
node
vertex
terminal point of a graph G or point at which its edges meet
Note to entry: The terms node and vertex are used synonymously in this document.
2.14
edge
ordered pair [u,v] of nodes from the node set V(G) of a graph
Note to entry: The order of the nodes determines the direction of the edge.
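As an informal illustration of definitions 2.12 to 2.14 (not part of the standard), the Python
sketch below stores a graph as a set of nodes and a list of ordered pairs, and follows the edge
direction to find the nodes reachable from a given node. The node names and the function name
reachable are hypothetical.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node that can be reached from `start` along directed edges."""
    successors = defaultdict(list)
    for u, v in edges:  # each edge is an ordered pair [u, v]: direction is u -> v
        successors[u].append(v)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical graph: a sentence node over a phrase node over two token nodes.
nodes = {"s", "np", "tok1", "tok2"}
edges = [("s", "np"), ("np", "tok1"), ("np", "tok2")]

print(sorted(reachable(edges, "s")))
print(sorted(reachable(edges, "tok1")))
```

Because edges are ordered pairs, a token node reaches nothing, while the sentence node reaches
every node below it.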
3 LAF specification
3.1 Overview
LAF consists of the following components:
- a data model for linguistic annotations and the data to which they refer;
- an architecture for representing language data and its annotations;
- an XML serialization of the data model, which describes the referential structure of the
  annotations associated with the language data as one or more directed graphs. Nodes of a graph
  may be linked to specific regions of primary data, and nodes together with edges may be
  associated with feature structures that describe the linguistic properties of the regions of
  primary data belonging to the reachable nodes.
3.2 LAF data model
The LAF data model comprises the following parts:
a) a structure describing the medium, consisting of anchors that identify regions of primary data
and their locations,
b) a graph structure consisting of nodes, edges and links to specific regions, and
c) an annotation structure for representing the content of an annotation by means of feature
structures.
Thus, the data model for annotations consists of a directed graph over n-dimensional regions of
primary data and over other annotations, in which the nodes of the graph are associated with
feature structures that provide the annotation content. An annotation is considered to conform to
LAF if its scheme is isomorphic to the LAF data model or can be mapped onto it.
NOTE The linguistic annotation framework does not include specifications of the categories of
annotation content (i.e. the labels that describe the associated linguistic phenomena).
Figure 1 - LAF data model
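The three parts of the data model in 3.2 can be pictured with a minimal Python sketch. It is an
illustration only, not a normative representation: the class and field names (Region, Node, Edge,
Annotation, covered_text) and the sample labels are assumptions, and text regions are assumed to
be bounded by exactly two character-offset anchors.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    id: str
    anchors: tuple  # non-empty ordered list of anchors; here (start, end) offsets


@dataclass
class Node:
    id: str
    regions: list = field(default_factory=list)  # ids of the regions this node links to


@dataclass(frozen=True)
class Edge:
    id: str
    source: str  # id of the node the edge leaves
    target: str  # id of the node the edge enters


@dataclass
class Annotation:
    label: str
    ref: str  # id of the annotated node or edge
    features: dict = field(default_factory=dict)  # feature structure (name -> value)


text = "My dog has fleas"
regions = {"r1": Region("r1", (0, 2)), "r2": Region("r2", (3, 6))}
nodes = {"n1": Node("n1", ["r1"]), "n2": Node("n2", ["r2"]), "n3": Node("n3")}
edges = [Edge("e1", "n3", "n1"), Edge("e2", "n3", "n2")]
annotations = [
    Annotation("tok", "n1", {"pos": "PRP$", "lemma": "my"}),
    Annotation("tok", "n2", {"pos": "NN", "lemma": "dog"}),
    Annotation("NP", "n3"),
]


def covered_text(node_id):
    """Text covered by a node, directly via regions or indirectly via outgoing edges."""
    node = nodes[node_id]
    if node.regions:
        spans = (regions[r].anchors for r in node.regions)
        return " ".join(text[start:end] for start, end in spans)
    children = [e.target for e in edges if e.source == node_id]
    return " ".join(covered_text(child) for child in children)


print(covered_text("n3"))
```

Node n3 has no region of its own; it reaches the primary data only through its outgoing edges,
which is how annotations are layered over other annotations.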
3.3 LAF architecture
3.3.1 General
Language resources that conform to the LAF architecture consist of the components listed below,
which are described in more detail in 3.3.2 to 3.3.5:
- one or more documents containing primary data (see 3.3.2);
- any number of annotation documents containing nodes, edges and the feature structures associated
  with them, all or some of which may belong to a directed graph (digraph); every node references
  either the base segmentation document (in which case the node has no outgoing edges) or, via
  paths in the graph, other nodes in the same or in other documents (see 3.3.3);
- one or more documents defining regions that reference each primary data document and serve as
  the base segmentation for annotations (see 3.3.4);
- a set of headers, including a resource header that describes the collection of primary data
  documents and annotations, as well as a header for each primary document and each annotation in
  the collection (see 3.3.5).
Wherever possible, it is recommended that each primary document be associated with the original
artefact from which the primary data were extracted or adapted for annotation (e.g. the original
text file produced by a particular word processor or document viewer).
3.3.2 Primary data
Primary data are data represented electronically in any format, including text, images, audio and
video. Primary data in LAF-conformant resources are frozen as read-only in order to preserve the
integrity of references to locations within the documents. Corrections and modifications to
primary data are treated as annotations and are recorded in a separate annotation document. Text
in primary documents is encoded in UTF-8 (the default) or UTF-16.
Primary data do not generally contain markup. If markup (such as HTML or XML tags) does appear in
the primary data, annotations that reference it treat it as part of the data stream; no
distinction is made between markup characters and data characters when locations in the document
are referenced.
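The effect of treating markup as part of the data stream can be seen in a small Python sketch.
The sample string is hypothetical, and character offsets are assumed as the anchor type.

```python
# Primary data that happens to contain markup (hypothetical example).
primary = "<p>My dog</p>"

# Offsets are counted over the whole stream, so the tag characters count too.
start = primary.index("dog")
end = start + len("dog")

# The same word in markup-free text sits at a different offset.
plain_start = "My dog".index("dog")

print(start, end, plain_start)
```

An annotation made against the marked-up file therefore cannot be reused unchanged against a
stripped copy of the text, which is one reason primary data are kept read-only.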
3.3.3 Annotation documents
An annotation document contains linguistic information that describes primary data. Annotations
are always associated with a node in the graph that references regions defined over the primary
data, either directly or via a path through reachable nodes. In the latter case, the annotations
are said to be layered over the primary data. LAF recommends that each linguistic layer defined in
language resource management be represented in a separate annotation document, in order to
support interchange.
The granularity of an annotation (i.e. the smallest unit of information to which it applies)
depends on the application. For example, a single annotation of a text may apply to a phoneme,
word, sentence, paragraph, document or an entire corpus; for audio, it may apply to any time
interval, including a single point in time (a time slice, a time stamp, etc.).
3.3.4 Referencing primary data
Specific locations in primary data are referenced directly by means of positions called anchors.
In most cases, these positions lie between the base units of the primary data representation.
The anchor mechanism itself does not depend on the type of medium. A location in a resource can be
identified by specifying the anchors that bound a region of the document. Regions of artefacts
such as images or video can be defined by anchors given as coordinates, frame indices, etc. In
audio data, anchors can cover one or more points on the recording (for example, a "moment" or a
"time interval"). Anchors of this kind are represented by combinations of symbols denoting spatial
and temporal offsets. For example, the English sentence "My dog has fleas" can be divided as shown
below:
                    1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
Here the following anchors are associated with the words:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
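The anchors listed above can be reproduced with a short Python sketch. It assumes whitespace
tokenization and character offsets that lie between characters, so that the end anchor is the
position just after the last character of a word; the function name token_anchors is hypothetical.

```python
import re


def token_anchors(text):
    """Return (word, start, end) for every whitespace-delimited word in `text`."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for word, start, end in token_anchors("My dog has fleas"):
    print(f"{word}: start={start}, end={end}")
```

Because anchors fall between characters, slicing the text from start to end returns exactly the
word, and adjacent words never share a character.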
Regions of a primary data document need not be contiguous (i.e. there may be portions of the
primary data that are not included in any region), but as a rule they must not overlap. Regions
that overlap should be treated as composed of smaller sub-regions. For example, two spans with
boundaries <5, 9> and <7, 15> can be represented as three spans: a = <5, 7>, b = <7, 9> and
c = <9, 15>. Two nodes can then be created in the graph, linked to <a, b> and <b, c> respectively,
to cover the regions <5, 9> and <7, 15>. Discontiguous regions are covered by creating a node for
each component region and adding a node that is connected to each of them in turn.
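The decomposition of overlapping spans described above can be sketched in Python as follows. This
is one possible procedure, not one prescribed by the standard; the function name split_overlaps
is hypothetical and spans are assumed to be (start, end) pairs of character offsets.

```python
def split_overlaps(spans):
    """Split possibly overlapping spans into non-overlapping pieces.

    Returns the list of pieces and, for each original span, the pieces that cover it.
    """
    # Every start or end anchor is a potential piece boundary.
    points = sorted({p for span in spans for p in span})
    candidates = zip(points, points[1:])
    # Keep only pieces that lie inside at least one original span.
    pieces = [
        (a, b) for a, b in candidates
        if any(start <= a and b <= end for start, end in spans)
    ]
    cover = {
        (start, end): [(a, b) for a, b in pieces if start <= a and b <= end]
        for start, end in spans
    }
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
print(pieces)
print(cover)
```

For the spans <5, 9> and <7, 15> this yields the three pieces a, b and c from the text, with the
first span covered by a and b and the second by b and c.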
The media types included in a resource are defined in the resource header. Each medium is
associated with one or more anchor types. The header of each primary data document specifies its
medium, which in turn indicates the type of anchors used.
Primary data do not generally contain markup. If markup (such as HTML or XML tags) does appear in
the primary data, annotations that reference it treat it as part of the data stream; no
distinction is made between markup characters and data characters when locations in the document
are referenced. For primary data that constitute a well-formed XML document, anchors may reference
XML elements using the W3C XPath 2.0 language (www.w3.org/TR/xpath20/), in which case the
associated anchor type is defined in the resource header as an XPath expression. Locations within
these XML elements (i.e. within the XML content of the elements) may be referenced using standard
offsets, which are computed by treating markup characters as part of the continuous data stream;
in this case, two media types must be associated with the file type of the primary document. A
full description of how anchors and media are defined in the resource header is given in 3.3.5.2.
3.3.5 Headers
3.3.5.1 General
LAF defines a resource header for the collection of primary data documents and annotations as a
whole, as well as headers for the primary data documents and for the annotation documents
themselves. This set of headers provides all the metadata describing the sources and encoding
conventions of the annotated data and their annotations, together with the information needed to
process the data, such as the anchor types or the relations that link primary data to the
annotation documents of a corpus.
3.3.5.2 Resource header
The resource header describes the resource as a whole, including its content, file structure and
encoding methods, and provides definitions that are then used in the headers of primary data
documents and annotation documents. These components include:
- categories used to describe the source documents of the primary data, typically relating to the
  domain or subject area of the text;
- file types, comprising conventions for names, media, annotation types and relations (to other
  required file types); specifying file types makes it possible to verify automatically that all
  required components of a resource are present;
- annotation spaces, used to provide a context for annotations and to resolve name conflicts;
- annotation declarations, which describe the annotations present in the resource, giving their
  names, information about their creator, references to relevant documentation and (optionally)
  the associated annotation scheme;
- media definitions, which describe the media types included in the corpus and the naming
  conventions for files containing data of each type;
- anchor types, whose definitions associate them with particular media types;
- group definitions, which assign names, attach descriptions and establish membership for
  user-defined groups of annotations.
3.3.5.3 Primary data document header
Each primary data document is associated with an XML header file containing information about its
content. Because a primary data document is not an XML document, the LAF primary data header is
mandatory and must be provided as a separate stand-alone file.
The primary data document header provides information about the source and content of the data and
specifies the data categories and the medium type by reference to definitions in the resource
header.
The primary data document header gives the identifier of the document and identifies all
annotation documents associated with it. It also provides all the information needed to process
those annotations. This file is expected to be loaded first whenever the document and its
annotations are to be processed. The elements of the primary data document header are described
in 3.6.
3.3.5.4 Annotation document header
The annotation document header contains the relevant subset of the elements of the primary data
header (describing the content of this file rather than the source text or anything else),
together with additional elements that provide, or point to, information about the annotation
content categories and the relations between the annotation document and other types of
documents. The annotation document header is not a separate document; it is a header block placed
at the beginning of the annotation document. Its elements are described in 3.4.3.
3.4 XML pivot format
3.4.1 Overview
LAF provides an XML serialization of the data model, which serves as the pivot format. This format
is intended to act as an intermediate language for converting between the many other formats.
Although the LAF pivot format can be used in any context, users are expected to represent
annotations in their own formats, which can then be mapped to the pivot format for interchange,
merging and comparison.
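The role of a pivot can be sketched in Python. With n user formats, mapping each one to and from a
pivot needs about 2n converters, whereas direct pairwise conversion needs n(n-1). The sketch below
is an illustration under stated assumptions: the pivot is reduced to a list of (start, end, label)
triples, and the two user formats (a tab-delimited one and a bracketed one) and the function names
from_tsv and to_brackets are hypothetical.

```python
def from_tsv(tsv):
    """Tab-delimited text -> pivot: one (start, end, label) triple per line."""
    pivot = []
    for line in tsv.strip().splitlines():
        start, end, label = line.split("\t")
        pivot.append((int(start), int(end), label))
    return pivot


def to_brackets(text, pivot):
    """Pivot -> bracketed notation, e.g. "[NN dog]"."""
    return " ".join(f"[{label} {text[start:end]}]" for start, end, label in sorted(pivot))


text = "My dog has fleas"
tsv = "0\t2\tPRP$\n3\t6\tNN\n7\t10\tVBZ\n11\t16\tNNS"

pivot = from_tsv(tsv)
print(to_brackets(text, pivot))
```

Neither converter knows about the other format; each only reads or writes the pivot, which is what
allows new formats to be added without touching existing converters.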
The serialization of the LAF pivot format is defined by the Graph Annotation Format (GrAF), in
which:
- the underlying data structure is a directed graph consisting of a set of nodes connected by a
  set of edges;
- an annotation is represented by a label and (optionally) a feature structure associated with a
  node or edge of the graph;
- a feature structure is an attribute-value graph (AVG), in which the value of an attribute may be
  an atomic value or another feature structure;
- an atomic feature is a mapping from one string (the feature name) to another (its value); GrAF
  does not provide for typed feature values;
- nodes of the graph may be associated with regions of the primary document or connected to other
  nodes in the same or another annotation document; nodes are linked to regions by means of <link>
  elements; edges of the graph are used to connect nodes to other nodes;
- edges represent relations between nodes; by default, the set of outgoing edges of a node
  represents the constituents of the annotation associated with that node; other relations may be
  described by an annotation associated with the edge in question.
All documents in GrAF format use the namespace http://www.xces.org/ns/GrAF/1.0/ .
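The recursive shape of a feature structure (a feature value is either atomic or another feature
structure) can be illustrated in Python with nested dictionaries. This is only a sketch: the
feature names and the helper flatten are hypothetical, and all atomic values are kept as strings
because GrAF does not type feature values.

```python
def flatten(fs, prefix=""):
    """Map a nested feature structure to dotted feature paths with atomic values."""
    flat = {}
    for name, value in fs.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):  # the value is itself a feature structure
            flat.update(flatten(value, path + "."))
        else:  # the value is atomic
            flat[path] = value
    return flat


# Hypothetical feature structure for a noun token.
fs = {"pos": "NN", "agr": {"num": "sg", "pers": "3"}}

print(flatten(fs))
```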
3.4.2 XML elements for annotation documents
The general model of a stand-off annotation file in GrAF format is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
The root element <graph> is defined in Table 1, in which required attributes are shown in bold.
Table 1 - Root element of GrAF annotation documents
<graph>      Root element of the graph.
Attribute    @xmlns [URL]: namespace declaration for the GrAF schema.
Example      <graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
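As an illustration of the content model above, the following Python sketch builds a very small
annotation document with the standard library. The element names come from the model and the
namespace from 3.4.1; the attributes targets, label, ref, name and value are assumptions that this
excerpt does not confirm, and the content of graphHeader is omitted for brevity, so the result is
not a complete, valid document.

```python
import xml.etree.ElementTree as ET

NS = "http://www.xces.org/ns/GrAF/1.0/"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id
ET.register_namespace("", NS)  # emit the GrAF namespace as the default namespace


def q(tag):
    """Qualify a tag name with the GrAF namespace."""
    return f"{{{NS}}}{tag}"


graph = ET.Element(q("graph"))
ET.SubElement(graph, q("graphHeader"))  # header content omitted in this sketch

# node = link*: a node linked to one region (attribute name is an assumption).
node = ET.SubElement(graph, q("node"), {XML_ID: "n1"})
ET.SubElement(node, q("link"), {"targets": "r1"})

# a = fs?, fs = f+: an annotation with a one-feature feature structure
# (attribute names are assumptions).
a = ET.SubElement(graph, q("a"), {"label": "tok", "ref": "n1"})
fs = ET.SubElement(a, q("fs"))
ET.SubElement(fs, q("f"), {"name": "pos", "value": "NN"})

xml_text = ET.tostring(graph, encoding="unicode")
print(xml_text)
```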
3.4.3 Annotation document header
The general content model of the annotation document header is:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader
The elements of the annotation document header are given in Table 2, and the elements for defining
graphs and annotations in Table 3. Required attributes are shown in bold.
Table 2 - Elements of the annotation document header
<graphHeader>        Wrapper for the elements of the annotation document header.
<labelsDecl>         List of the annotation labels used in the document, with their frequencies.
<labelUsage>         Information about a specific annotation label.
  Attributes         @label [string]: name of the label.
                     @occurs [integer]: number of occurrences in the document.
<dependencies>       Documents required to process the annotations in this document; these must
                     include the segmentation document and/or any documents referenced directly
                     in this document.
<dependsOn>          A file required to process the annotation.
  Attributes         @ann.id [IDREF]: identifier of the document, as given in the associated
                     primary data document.
<annotationSpaces>   Annotation spaces referenced in this document.
<annotationSpace>    An annotation space used in this document.
  Attributes         @as.id [IDREF]: identifier of the annotation space, as defined in the
                     resource header.
                     @default [yes | no]: indicates whether this annotation space is the default
                     for this document; if the attribute is absent, the value no is assumed.
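The header elements in Table 2 can be assembled mechanically from the annotations in a document.
The Python sketch below is an assumption-laden illustration: the function name build_header and
the sample identifiers are hypothetical, the namespace is omitted for brevity, and "yes" is
assumed to be the literal value of @default.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_header(labels, depends_on, spaces, default_space):
    """Build a graphHeader element from label occurrences, dependencies and spaces."""
    header = ET.Element("graphHeader")

    # labelsDecl = labelUsage+ : one entry per label, with its frequency.
    decl = ET.SubElement(header, "labelsDecl")
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(decl, "labelUsage", {"label": label, "occurs": str(count)})

    # dependencies = dependsOn+ : documents needed to process this one.
    deps = ET.SubElement(header, "dependencies")
    for ann_id in depends_on:
        ET.SubElement(deps, "dependsOn", {"ann.id": ann_id})

    # annotationSpaces = annotationSpace+ : at most one is flagged as default here.
    spaces_el = ET.SubElement(header, "annotationSpaces")
    for as_id in spaces:
        attrs = {"as.id": as_id}
        if as_id == default_space:
            attrs["default"] = "yes"  # assumed literal value
        ET.SubElement(spaces_el, "annotationSpace", attrs)
    return header


header = build_header(["tok", "tok", "NP"], ["seg"], ["xces"], "xces")
print(ET.tostring(header, encoding="unicode"))
```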
Table 3 - Graph and annotation elements in GrAF annotation documents
One or more root elements identifying the root nodes of the graph; these elements are used when
the graph is a tree or a forest, i.e. when it consists of several well-formed trees.
Identifier of a root node of the graph; not all graphs are trees, but if the graph is a tree, its
root element can be used to identify the root node of that tree.
  Attribute    @node.id [IDREF]: identifier of the root node.
Region of the artefact being annotated, defined as an area bounded by a non-empty list of
anchors; the number of anchors needed to bound a region depends on the medium being annotated.
  Attributes   @xml:id [ID]: unique identifier, used by links from graph nodes.
               @anchors [string] (alternative to @refs): the anchors that bound this region; the
               attribute contains a space-separated list of anchor values; applications are
               expected to have sufficient information to parse the string representation of an
               anchor into the corresponding location in the annotated artefact; the element
               must have either an @anchors or a @ref attribute.
...
SLOVENIAN STANDARD
SIST ISO 24612:2013
1 July 2013
Language resource management - Linguistic annotation framework (LAF)
This Slovenian standard is identical to: ISO 24612:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24612:2013 en,fr,de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
INTERNATIONAL ISO
STANDARD 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
ISO 24612:2012(E)
©
ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 LAF specification
3.1 Overview
3.2 LAF data model
3.3 LAF architecture
3.4 XML pivot format
3.5 XML elements for the resource header
3.6 Elements in the primary data document header
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has
been established has the right to be represented on that committee. International organizations, governmental
and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
Effective creation, encoding, processing and management of language resources is facilitated by a single
high-level data model that supports analysis and design of both annotation schemes and representation
formats. This International Standard is designed to support the development and use of computer applications
relying on linguistically annotated resources and the exchange of these resources among different
applications.
INTERNATIONAL STANDARD ISO 24612:2012(E)
Language resource management — Linguistic annotation
framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic
annotations of language data such as corpora, speech signal and video. The framework includes an abstract
data model and an XML serialization of that model for representing annotations of primary data. The
serialization serves as a pivot format to allow annotations expressed in one representation format to be
mapped onto another.
NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and
other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of
characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech
annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary
data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super-
and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segments
typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks
(e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in
the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in
the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.
In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that
segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing
the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,
elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can
refer to the same primary document (e.g. two different part of speech annotations for a given text).
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,
audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be
coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations
  associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be
  linked to regions of primary data. Nodes and edges may be associated with feature structures describing
  linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and
regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.
The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary
data as well as other annotations, in which nodes are associated with feature structures providing the
annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the
mapping) isomorphic to the LAF data model.
NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated
linguistic phenomena).
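The three parts of the data model listed above can be pictured with a few plain data classes. This is only an illustrative sketch: the class and field names (Region, Node, Edge, Annotation, links, features) are assumptions made for the example and are not defined by this International Standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Area of primary data bounded by a non-empty, ordered list of anchors."""
    id: str
    anchors: tuple  # e.g. (0, 2) for a span of text

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


@dataclass
class Node:
    """Graph node; 'links' holds the ids of the regions it refers to, if any."""
    id: str
    links: list = field(default_factory=list)


@dataclass(frozen=True)
class Edge:
    """Directed edge: an ordered pair of node ids."""
    id: str
    source: str
    target: str


@dataclass
class Annotation:
    """Label plus an optional feature structure attached to a node or edge."""
    label: str
    ref: str  # id of the annotated node or edge
    features: dict = field(default_factory=dict)  # values: str or nested dict


# Hypothetical content for the text "My dog": two tokens under one noun phrase.
regions = [Region("r1", (0, 2)), Region("r2", (3, 6))]
nodes = [Node("n1", ["r1"]), Node("n2", ["r2"]), Node("n3")]
edges = [Edge("e1", "n3", "n1"), Edge("e2", "n3", "n2")]
annotations = [
    Annotation("tok", "n1", {"pos": "PRP$"}),
    Annotation("tok", "n2", {"pos": "NN"}),
    Annotation("NP", "n3"),
]
```

Keeping the annotation content in a separate class from the nodes mirrors the separation between graph structure and annotation structure in the list above.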
Figure 1 — LAF data model
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the following, described in more detail in
3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with
  some or all of the nodes and/or edges in a directed graph. All nodes reference either a base
  segmentation document (in which case the node has no outgoing edges) or other nodes in the same or
  other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the
  base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and
  annotations, as well as headers for each primary data document and each annotation document in the
  collection (see 3.3.5).
It is recommended that whenever possible, each primary data document also be associated with an original
artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the
original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video.
Primary data in a LAF-compliant resource are frozen as “read-only” to preserve the integrity of references to
locations within the document or documents. Corrections and modifications to the primary data are treated as
annotations and stored in a separate annotation document. Primary data documents containing textual data
are encoded in UTF-8 (default) or UTF-16.
In the general case, primary data does not contain markup of any kind. If markup does exist in primary data
(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is
made between markup and other characters in the data when referring to locations in the document.
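The consequence for offsets can be seen in a small sketch. It assumes that text anchors are character offsets counted in Unicode code points from the start of the document; the sample string is hypothetical.

```python
# Hypothetical primary data that happens to contain markup.
primary = "<p>My dog has fleas</p>"

# The three characters of "<p>" are part of the data stream, so they shift
# every offset that follows them.
start = primary.index("dog")
end = start + len("dog")

print(start, end)          # offsets of "dog" within the full data stream
print(primary[start:end])  # the characters between the two anchors
```

Without the leading tag, the same word would span offsets 3 to 6, as in the example in 3.3.4.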
3.3.3 Annotation documents
Annotation documents contain linguistic information describing primary data. Annotations are always
associated with a node in a graph that references regions defined over primary data, either directly or
via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary
data. LAF recommends representing each of the linguistic layers defined in language resource management
in a separate annotation document for the purposes of exchange.
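Resolving a layered annotation to the primary data amounts to following edges until nodes that refer to regions are reached. The following is a minimal sketch of that traversal; the function name and the two dictionaries (node id to region ids, node id to ordered child node ids) are assumptions made for illustration.

```python
def regions_for(node_id, links, out_edges):
    """Collect the region ids reachable from node_id, depth-first, in edge order."""
    seen = set()
    result = []
    stack = [node_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and shared sub-graphs
        seen.add(current)
        result.extend(links.get(current, []))
        # Push children in reverse so the first out-edge is visited first.
        stack.extend(reversed(out_edges.get(current, [])))
    return result


# Hypothetical graph: a phrase node layered over two token nodes.
links = {"tok1": ["r1"], "tok2": ["r2"]}
out_edges = {"np1": ["tok1", "tok2"]}

print(regions_for("tok1", links, out_edges))  # direct reference
print(regions_for("np1", links, out_edges))   # reference via reachable nodes
```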
The granularity of the annotation — i.e. the smallest information unit to which the annotation applies — is
dependent on the application. For example, a single annotation over text may cover a phoneme, word,
sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a
temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data
Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are
located between the base units of the primary data representation.
Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound
the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying
one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors
that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by
n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:
                    1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
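The anchors listed above can be reproduced with a short sketch. It assumes that words are delimited by whitespace and that anchors are character offsets located between characters; the function name word_anchors is hypothetical.

```python
import re


def word_anchors(text):
    """Return (word, start, end) for every whitespace-delimited word in text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for word, start, end in word_anchors("My dog has fleas"):
    print(f"{word}: start={start}, end={end}")
```

Because the anchors lie between characters, the end anchor of a word is the offset just after its last character, which matches the slicing convention of Python strings.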
A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be
portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping
regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and
<7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9>, and c = <9, 15>. Two graph nodes can
then be created that reference <a, b> and <b, c>, thereby providing the coverage of regions <5, 9> and
<7, 15>. Discontiguous regions are referenced by creating nodes referencing each component region and
adding a node that is in turn linked to them.
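The decomposition of overlapping spans can be sketched as follows. The function name and return shape are assumptions for illustration; spans are treated as (start, end) pairs of text offsets.

```python
def split_overlaps(spans):
    """Cut spans at every boundary point so the resulting pieces never overlap.

    Returns the pieces and, for each original span, the pieces that cover it.
    """
    cuts = sorted({point for span in spans for point in span})
    pieces = [
        (a, b)
        for a, b in zip(cuts, cuts[1:])
        if any(s <= a and b <= e for s, e in spans)  # drop gaps between spans
    ]
    cover = {
        span: [p for p in pieces if span[0] <= p[0] and p[1] <= span[1]]
        for span in spans
    }
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
print(pieces)          # the sub-components a, b and c
print(cover[(5, 9)])   # pieces referenced by the first node
print(cover[(7, 15)])  # pieces referenced by the second node
```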
The media types included in the resource are defined in the resource header. Each medium is associated with
one or more anchor types. The header for each primary data document identifies the medium for that
document, which in turn indicates the type of anchors used.
In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g.
HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made
between markup and other characters in the data when referring to locations in the document. For primary
data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0
Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource
header as an XPath expression. References to locations within these XML elements (i.e. XML element
content) can be made using standard offsets, which will be computed by including the markup as part of the
data stream; in this case, two media types would be associated with the primary document’s file type. See
3.3.5.2 for a full description of anchor and media type definitions in the resource header.
3.3.5 Headers
3.3.5.1 Overview
LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as
well as headers for primary data and annotation documents themselves. This set of headers provides all
metadata describing the provenance and encoding conventions for the data and its annotations, together with
information required for processing, such as anchor types or relations among primary data and annotation
documents in the corpus.
3.3.5.2 Resource header
The resource header describes the resource as a whole, including its contents, file structure and encoding,
and establishes definitions that are used in the primary data document and annotation document headers.
Among these are the following.
- Categories used to describe primary data documents, typically the domain/subject area of general text.
- File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file
  types that are referenced and therefore required). The specification of file types enables automatic
  validation that all required elements of the resource are present.
- Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
- Annotation declarations describing the annotations in the resource, including their names, creator, links to
  relevant documentation and, optionally, an associated annotation schema.
- Media definitions specifying the media types included in the corpus and file naming conventions for files
  containing data of that type.
- Anchor types associating anchor type definitions with media types.
- Group definitions providing the names, descriptions and members of user-defined groups of annotations.
3.3.5.3 Primary data document header
Each primary data document is associated with an XML header file containing information describing its
contents. Because the primary data document is not an XML document, the LAF primary data header is
obligatory and shall be provided as a standalone file.
The primary data document header provides information about the source and contents of the primary data,
as well as specifying category definitions and medium type by reference to definitions in the resource header.
The primary data document header provides the PID for the primary data document and all associated
annotation documents. The primary data document header provides all the information needed to process
annotations associated with a given primary data document. It is presumed that this file is loaded first when a
document and its annotations are to be processed.
The elements in the primary data document header are given in 3.6.
3.3.5.4 Annotation document header
The annotation document header includes a relevant subset of elements from the primary data header (i.e.
those that describe the file contents rather than the provenance of an original text, etc.), together with
additional elements that provide or point to information concerning the annotation content categories and
dependencies between the annotation document and other documents. The annotation document header is
not a separate document, but rather is included at the beginning of the annotation document. The elements in
the annotation document header are given in 3.4.3.
3.4 XML pivot format
3.4.1 Overview
The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format
is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common
target into and out of which other formats can be transduced. Although the LAF pivot format may be used in
any context, it is assumed that users will represent annotations using their own formats, which can then be
transduced to the LAF pivot format for the purposes of exchange, merging and comparison.
The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.
In GrAF:
- The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
- An annotation is a label and (optionally) a feature structure associated with a node or an edge in the
  graph.
- A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or
  another feature structure.
- An atomic feature value is a mapping from one string (the feature name) to another string (the atomic
  value). GrAF makes no attempt to do typing of feature values.
- Nodes may be associated with regions in the primary document, or connected to other nodes in the same
  or another annotation document. Nodes are associated with regions by <link> elements. Edges are used
  to connect (associate) nodes to other nodes.
- An edge represents a relationship between nodes. By default, the set of out-edges from a node represents
  an ordered set of constituents of the annotation associated with the node. Other relationships may be
  specified by associating an annotation with the edge.
For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
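As a rough illustration of this content model, the sketch below parses a small annotation document and checks the order and names of its elements. The document content, the identifier values and the helper names are hypothetical; the element and attribute names are those listed in Tables 2 and 3. The check covers only element names and order, not attributes or the internal structure of the header.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"

# Hypothetical annotation document; region ids r1 and r2 are assumed to be
# defined in a separate base segmentation document.
DOC = f"""<graph xmlns="{GRAF_NS}">
  <graphHeader>
    <labelsDecl><labelUsage label="tok" occurs="2"/></labelsDecl>
    <dependencies><dependsOn ann.id="seg"/></dependencies>
    <annotationSpaces><annotationSpace as.id="demo" default="yes"/></annotationSpaces>
  </graphHeader>
  <node xml:id="n1"><link targets="r1"/></node>
  <node xml:id="n2"><link targets="r2"/></node>
  <node xml:id="n3"/>
  <edge xml:id="e1" from="n3" to="n1"/>
  <edge xml:id="e2" from="n3" to="n2"/>
</graph>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}", 1)[-1]


def follows_content_model(root):
    """Check: graph = graphHeader (node | edge | a | anchor)*, node = link*."""
    children = [local(child.tag) for child in root]
    if local(root.tag) != "graph" or not children or children[0] != "graphHeader":
        return False
    if any(name not in {"node", "edge", "a", "anchor"} for name in children[1:]):
        return False
    return all(
        local(grandchild.tag) == "link"
        for child in root
        if local(child.tag) == "node"
        for grandchild in child
    )


root = ET.fromstring(DOC)
print(follows_content_model(root))
```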
The root element is defined in Table 1. Required attributes are given in bold.
Table 1 — Root element for GrAF annotation documents

<graph>
    Root node of the graph.
    Attribute
        @xmlns [URL]: namespace declaration for the GrAF schema
    Example
        <graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader
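A header following this content model could be assembled as sketched below, using the element and attribute names listed in Table 2. The function name, its parameters and the sample values are assumptions for illustration, and the GrAF namespace is omitted for brevity.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_graph_header(labels, depends_on, annotation_spaces, default_space=None):
    """Build a graphHeader element: labelsDecl, dependencies, annotationSpaces."""
    header = ET.Element("graphHeader")

    # One labelUsage per distinct label, with its number of occurrences.
    labels_decl = ET.SubElement(header, "labelsDecl")
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(labels_decl, "labelUsage", {"label": label, "occurs": str(count)})

    # One dependsOn per document required to process the annotations.
    dependencies = ET.SubElement(header, "dependencies")
    for doc_id in depends_on:
        ET.SubElement(dependencies, "dependsOn", {"ann.id": doc_id})

    # One annotationSpace per referenced annotation space.
    spaces = ET.SubElement(header, "annotationSpaces")
    for space_id in annotation_spaces:
        attrs = {"as.id": space_id}
        if space_id == default_space:
            attrs["default"] = "yes"  # absent attribute means "no"
        ET.SubElement(spaces, "annotationSpace", attrs)

    return header


header = build_graph_header(["tok", "tok", "s"], ["seg"], ["demo"], default_space="demo")
print(ET.tostring(header, encoding="unicode"))
```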
Elements of the annotation document header are given in Table 2, and elements to define graphs and
annotations are given in Table 3. Required attributes are given in bold.
Table 2 — Elements of the annotation document header

<graphHeader>
    Bracketing tag for elements of the annotation document header.

<labelsDecl>
    List of the annotation labels used in the document and their frequencies.

<labelUsage>
    Information for individual annotation labels.
    Attributes
        @label [string]: Element name.
        @occurs [integer]: Number of occurrences in the document.

<dependencies>
    Documents required to process the annotations in this document, which will include a
    segmentation document and/or any annotation documents directly referenced in this
    document.

<dependsOn>
    File required to process this annotation.
    Attribute
        @ann.id [IDREF]: The ID of the document as given in the associated primary data
        document.

<annotationSpaces>
    Annotation spaces referenced in this document.

<annotationSpace>
    Annotation space used in this document.
    Attributes
        @as.id [IDREF]: The ID of the annotation space as defined in the resource header.
        @default [yes | no]: Indicates whether or not this annotation space is the default in this
        document. If the attribute is not present, no is assumed.
Table 3 — Graph and annotation elements in GrAF annotation documents

<roots>
    One or more root elements that identify root nodes in the graph. This element is used
    when the graph is a tree or a forest (i.e. more than one well-formed tree).

<root>
    The node ID of a root node in the graph. Not all graphs will form a tree, but those that do
    can use the root element to identify the root node of the tree.
    Attribute
        @node.id [IDREF]: The ID of the root node.

<region>
    Region in the artefact being annotated, defined as the area bounded by a non-empty,
    ordered list of anchors. The number of anchors required to bound a region depends on
    the medium being annotated.
    Attributes
        @xml:id [ID]: Unique ID for reference from nodes in the graph.
        @anchors [string] (alternative to @refs): The anchors that bound this region. The
        attribute contains a whitespace-delimited list of anchor values. Applications are
        expected to know how to parse the string representation of an anchor into a location
        in the artefact being annotated. The <region> element shall have either an @anchors
        attribute or a @refs attribute.
        @refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the
        region.
        @anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors
        attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If
        no @anchor.id is specified for the region, the default anchor type for the document
        (indicated in the resource header) is assumed. If the @refs attribute is used to refer
        to <anchor> elements, the @anchor.id attribute will be specified on the <anchor>
        elements and should not be given on <region>.
    Example
        <region ... anchors="980 983"/>
        <region ... anchors="10,59 10,173 149,173 149,59"/>
        <region ... anchors="34 42"/>

<anchor>
    Location in the artefact being annotated. How the location is represented is
    medium-dependent. Applications are required to be able to serialize and de-serialize
    location values to and from the strings appearing in the @value attribute as well as the
    @anchors attribute on the <region> element.
    Attributes
        @xml:id [ID]: Unique ID for reference from nodes in the graph.
        @value [string]: The offset value of the anchor. How the attribute value is interpreted as
        a location in the artefact being annotated is medium-dependent.
        @anchor.id [string]: The @xml:id of an anchor type defined in the resource header.

<node>
    Node in the graph. The element is empty when connected by an <edge> element to
    another node in the graph (i.e. when the node is a non-terminal node). A child <link>
    element is used when the node refers to a region or regions of primary data (i.e. when the
    node is a terminal/leaf node).
    Attribute
        @xml:id [ID]: Unique ID for reference from edges and annotations.

<link>
    Identifies region(s) in a base segmentation document referred to by this node.
    Attribute
        @targets [IDREFS]: Identifiers of referenced region(s).

<edge>
    Edge in the graph.
    Attributes
        @xml:id [ID]: Unique ID for reference from nodes and annotations.
        @from [IDREF]: ID of the start node of the edge.
        @to [IDREF]: ID of the end node of the edge.

<a>
    Annotation information associated with a node
...
INTERNATIONAL ISO
STANDARD 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
ISO 24612:2012(E)
©
ISO 2012
---------------------- Page: 1 ----------------------
ISO 24612:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2012 – All rights reserved
---------------------- Page: 2 ----------------------
ISO 24612:2012(E)
Contents Page
Foreword . iv
Introduction . v
1 Scope . 1
2 Terms and definitions . 1
3 LAF specification . 3
3.1 Overview . 3
3.2 LAF data model . 3
3.3 LAF architecture . 4
3.4 XML pivot format . 6
3.5 XML elements for the resource header . 11
3.6 Elements in the primary data document header . 16
Bibliography . 19
© ISO 2012 – All rights reserved iii
---------------------- Page: 3 ----------------------
ISO 24612:2012(E)
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has
been established has the right to be represented on that committee. International organizations, governmental
and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
iv © ISO 2012 – All rights reserved
---------------------- Page: 4 ----------------------
ISO 24612:2012(E)
Introduction
Effective creation, encoding, processing and management of language resources is facilitated by a single
high-level data model that supports analysis and design of both annotation schemes and representation
formats. This International Standard is designed to support the development and use of computer applications
relying on linguistically annotated resources and the exchange of these resources among different
applications.
© ISO 2012 – All rights reserved v
---------------------- Page: 5 ----------------------
INTERNATIONAL STANDARD ISO 24612:2012(E)
Language resource management — Linguistic annotation
framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic
annotations of language data such as corpora, speech signal and video. The framework includes an abstract
data model and an XML serialization of that model for representing annotations of primary data. The
serialization serves as a pivot format to allow annotations expressed in one representation format to be
mapped onto another.
NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and
other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of
characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech
annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary
data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super-
and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segment
typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks
© ISO 2012 – All rights reserved
1
---------------------- Page: 6 ----------------------
ISO 24612:2012(E)
(e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in
the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in
the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.
In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that
segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing
the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,
elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can
refer to the same primary document (e.g. two different part of speech annotations for a given text).
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,
audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be
coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
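As an informal illustration of definitions 2.12 to 2.14 (not part of the standard), a directed graph can be sketched in Python as a set of nodes plus a list of ordered pairs. The class and method names below are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: set = field(default_factory=set)    # V(G)
    edges: list = field(default_factory=list)  # E(G): ordered pairs (u, v)

    def add_edge(self, u, v):
        """Add the directed edge [u, v]; the order gives the direction."""
        self.nodes.update((u, v))
        self.edges.append((u, v))

    def out_edges(self, u):
        """Return the edges that start at node u, in insertion order."""
        return [(a, b) for (a, b) in self.edges if a == u]


g = Graph()
g.add_edge("n1", "n2")
g.add_edge("n1", "n3")
```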
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations
  associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be
  linked to regions of primary data. Nodes and edges may be associated with feature structures describing
  linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and
regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.
The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary
data as well as other annotations, in which nodes are associated with feature structures providing the
annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the
mapping) isomorphic to the LAF data model.
NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated
linguistic phenomena).
Figure 1 — LAF data model
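The three parts of the data model can be sketched, purely for illustration, as plain Python data classes. This is one possible in-memory reading of the model under stated assumptions, not a normative binding: all class and field names are hypothetical, and anchors are simplified to integers.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """Area of primary data bounded by a non-empty, ordered list of anchors."""
    id: str
    anchors: Tuple[int, ...]

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


@dataclass
class Node:
    id: str
    links: List[str] = field(default_factory=list)  # IDs of linked regions


@dataclass
class Edge:
    id: str
    source: str  # ID of the start node
    target: str  # ID of the end node


@dataclass
class Annotation:
    label: str
    ref: str                                      # ID of the annotated node or edge
    features: dict = field(default_factory=dict)  # feature structure


region = Region("r1", (3, 6))
node = Node("n1", links=[region.id])
ann = Annotation("tok", ref=node.id, features={"pos": "NOUN", "lemma": "dog"})
```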
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the following, described in more detail in
3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with
  some or all of the nodes and/or edges in a directed graph. All nodes reference either a base
  segmentation document (in which case the node has no outgoing edges) or other nodes in the same or
  other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the
  base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and
  annotations, as well as headers for each primary data document and each annotation document in the
  collection (see 3.3.5).
It is recommended that whenever possible, each primary data document also be associated with an original
artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the
original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video.
Primary data in a LAF-compliant resource are frozen as “read-only” to preserve the integrity of references to
locations within the document or documents. Corrections and modifications to the primary data are treated as
annotations and stored in a separate annotation document. Primary data documents containing textual data
are encoded in UTF-8 (default) or UTF-16.
In the general case, primary data does not contain markup of any kind. If markup does exist in primary data
(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is
made between markup and other characters in the data when referring to locations in the document.
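A short Python sketch may make the two rules above concrete. It is an illustration only: it assumes that text offsets count characters of the decoded string, and the dictionary layout used for the correction is hypothetical.

```python
# Primary data that happens to contain markup. For offset purposes the tags
# are ordinary characters of the data stream.
primary = "<p>My dog</p>"

# Locate "dog": the three characters of "<p>" are counted like any others.
start = primary.index("dog")
end = start + len("dog")

# A correction is stored as an annotation in a separate document; the
# primary data itself stays untouched. The layout below is hypothetical.
correction = {
    "start": start,
    "end": end,
    "label": "correction",
    "features": {"corrected": "cat"},
}
```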
3.3.3 Annotation documents
Annotation documents contain linguistic information describing primary data. Annotations are always
associated with a node in a graph that references regions defined over primary data, either directly or
via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary
data. LAF recommends representing each of the linguistic layers defined in language resource management
in a separate annotation document for the purposes of exchange.
The granularity of the annotation — i.e. the smallest information unit to which the annotation applies — is
dependent on the application. For example, a single annotation over text may cover a phoneme, word,
sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a
temporal “instant” (timeslot, timestamp, etc.).
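The notion of layering can be sketched as a graph traversal. The following Python function is an informal illustration, not part of the standard; the node and region identifiers and the two lookup tables are hypothetical. It collects the regions that a node references either directly or through reachable nodes.

```python
def reachable_regions(node_id, links, edges):
    """Collect region IDs reachable from node_id, depth-first, in edge order."""
    seen, found, stack = set(), [], [node_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.extend(links.get(current, []))
        # Push targets in reverse so that they are visited in listed order.
        stack.extend(reversed(edges.get(current, [])))
    return found


links = {"tok1": ["r1"], "tok2": ["r2"]}  # terminal nodes -> regions
edges = {"np1": ["tok1", "tok2"]}         # non-terminal node -> child nodes

direct = reachable_regions("tok1", links, edges)
layered = reachable_regions("np1", links, edges)
```

Here the annotation on "np1" is layered over the primary data, since it reaches regions only through the token nodes.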
3.3.4 References to primary data
Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are
located between the base units of the primary data representation.
Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound
the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying
one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors
that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by
n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:
                    1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
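The anchor values listed above can be reproduced with a few lines of Python. This is only an illustrative sketch: whitespace tokenization is an assumption made for the example, and the standard does not prescribe how segments are found.

```python
import re

text = "My dog has fleas"

# Anchors sit between characters, so a word's end anchor minus its start
# anchor equals its length. Tokenizing on whitespace is an assumption.
anchors = {m.group(): (m.start(), m.end()) for m in re.finditer(r"\S+", text)}
```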
A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be
portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping
regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and
<7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9> and c = <9, 15>. Two graph nodes can
then be created that reference <a, b> and <b, c> respectively, thereby providing the coverage of regions
<5, 9> and <7, 15>. Discontiguous regions are referenced by creating nodes referencing each component
region and adding a node that is in turn linked to them.
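The decomposition of overlapping spans can be sketched as follows. The function name and the return layout are hypothetical, and spans are simplified to pairs of integer offsets; this is an illustration, not a prescribed algorithm.

```python
def split_overlaps(spans):
    """Cut spans at every boundary so that no two resulting pieces overlap."""
    cuts = sorted({point for span in spans for point in span})
    # Keep only the pieces that lie inside at least one original span.
    pieces = [
        (lo, hi)
        for lo, hi in zip(cuts, cuts[1:])
        if any(s <= lo and hi <= e for s, e in spans)
    ]
    # For each original span, list the pieces that together cover it.
    cover = {
        (s, e): [p for p in pieces if s <= p[0] and p[1] <= e]
        for s, e in spans
    }
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
```

Each entry of `cover` corresponds to one graph node that references the listed non-overlapping regions.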
The media types included in the resource are defined in the resource header. Each medium is associated with
one or more anchor types. The header for each primary data document identifies the medium for that
document, which in turn indicates the type of anchors used.
In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g.
HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made
between markup and other characters in the data when referring to locations in the document. For primary
data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0
Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource
header as an XPath expression. References to locations within these XML elements (i.e. XML element
content) can be made using standard offsets, which will be computed by including the markup as part of the
data stream; in this case, two media types would be associated with the primary document’s file type. See
3.3.5.2 for a full description of anchor and media type definitions in the resource header.
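The two kinds of reference described above can be contrasted in a small Python sketch. It is illustrative only, and rests on two assumptions: the sample document is invented, and Python's xml.etree.ElementTree module is used, which implements only a limited XPath subset rather than XPath 2.0.

```python
import xml.etree.ElementTree as ET

# Invented primary data that is itself a valid XML document.
raw = "<text><p>My dog has fleas</p><p>So does my cat</p></text>"
doc = ET.fromstring(raw)

# Element-level anchor: a path expression selecting the second paragraph.
element = doc.find("./p[2]")

# Offset-level anchors: computed over the whole data stream, so the
# characters of the markup are counted as well.
start = raw.index("So")
end = start + len("So")
```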
3.3.5 Headers
3.3.5.1 Overview
LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as
well as headers for primary data and annotation documents themselves. This set of headers provides all
metadata describing the provenance and encoding conventions for the data and its annotations, as well as
information required for processing, such as anchor types or relations among primary data and annotation
documents in the corpus.
3.3.5.2 Resource header
The resource header describes the resource as a whole, including its contents, file structure and encoding,
and establishes definitions that are used in the primary data document and annotation document headers.
Among these are the following.
- Categories used to describe primary data documents, typically the domain/subject area of general text.
- File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file
  types that are referenced and therefore required). The specification of file types enables automatic
  validation that all required elements of the resource are present.
- Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
- Annotation declarations describing the annotations in the resource, including their names, creator, links to
  relevant documentation and, optionally, an associated annotation schema.
- Media definitions specifying the media types included in the corpus and file naming conventions for files
  containing data of that type.
- Anchor types associating anchor type definitions with media types.
- Group definitions providing the names, descriptions and members of user-defined groups of annotations.
3.3.5.3 Primary data document header
Each primary data document is associated with an XML header file containing information describing its
contents. Because the primary data document is not an XML document, the LAF primary data header is
obligatory and shall be provided as a standalone file.
The primary data document header provides information about the source and contents of the primary data,
as well as specifying category definitions and medium type by reference to definitions in the resource header.
The primary data document header provides the PID for the primary data document and all associated
annotation documents. The primary data document header provides all the information needed to process
annotations associated with a given primary data document. It is presumed that this file is loaded first when a
document and its annotations are to be processed.
The elements in the primary data document header are given in 3.6.
3.3.5.4 Annotation document header
The annotation document header includes a relevant subset of elements from the primary data header (i.e.
those that describe the file contents rather than the provenance of an original text, etc.), together with
additional elements that provide or point to information concerning the annotation content categories and
dependencies between the annotation document and other documents. The annotation document header is
not a separate document, but rather is included at the beginning of the annotation document. The elements in
the annotation document header are given in 3.4.3.
3.4 XML pivot format
3.4.1 Overview
The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format
is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common
target into and out of which other formats can be transduced. Although the LAF pivot format may be used in
any context, it is assumed that users will represent annotations using their own formats, which can then be
transduced to the LAF pivot format for the purposes of exchange, merging and comparison.
The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.
In GrAF:
- The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
- An annotation is a label and (optionally) a feature structure associated with a node or an edge in the
  graph.
- A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or
  another feature structure.
- An atomic feature value is a mapping from one string (the feature name) to another string (the atomic
  value). GrAF does not type feature values.
- Nodes may be associated with regions in the primary document, or connected to other nodes in the same
  or another annotation document. Nodes are associated with regions by <link> elements. Edges are used
  to connect (associate) nodes to other nodes.
- An edge represents a relationship between nodes. By default, the set of out edges from a node represents
  an ordered set of constituents of the annotation associated with the node. Other relationships may be
  specified by associating an annotation with the edge.
For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
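Two of the points above, nested feature structures and ordered out edges, can be sketched informally in Python. The feature names, values and node identifiers are hypothetical, and a nested dictionary is just one convenient stand-in for an attribute value graph.

```python
# Feature structure: feature names map to atomic strings or to nested
# feature structures. All values are untyped strings.
fs = {"cat": "NP", "agr": {"num": "sg", "pers": "3"}}


def atomic_paths(structure, prefix=()):
    """Yield (path, value) for every atomic value in a feature structure."""
    for name, value in structure.items():
        if isinstance(value, dict):
            yield from atomic_paths(value, prefix + (name,))
        else:
            yield prefix + (name,), value


# Out edges of a node, kept in order: by default they list the
# constituents of the annotation associated with that node.
out_edges = {"np1": ["tok1", "tok2"]}

flat = list(atomic_paths(fs))
```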
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
The root element is defined in Table 1. Required attributes are given in bold.
Table 1 — Root element for GrAF annotation documents
<graph>
  Root node of the graph.
  Attribute:
    @xmlns [URL]: namespace declaration for the GrAF schema
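A top-level check against the content model printed in 3.4.2 can be sketched in Python. This is an illustration under assumptions, not a validator: it inspects only the direct children of the root, it follows the content model exactly as printed (which lists node, edge, a and anchor), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"
ALLOWED = {"node", "edge", "a", "anchor"}  # as listed in the content model


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def check_graph(root):
    """Return a list of problems found at the top level of a document."""
    problems = []
    if root.tag != f"{{{GRAF_NS}}}graph":
        problems.append("root is not a GrAF <graph> element")
    children = list(root)
    if not children or local(children[0].tag) != "graphHeader":
        problems.append("first child is not <graphHeader>")
    for child in children[1:]:
        if local(child.tag) not in ALLOWED:
            problems.append(f"unexpected element <{local(child.tag)}>")
    return problems


sample = ET.fromstring(
    f'<graph xmlns="{GRAF_NS}"><graphHeader/><node/><edge/><a/></graph>'
)
```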
3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader
Elements of the annotation document header are given in Table 2, and elements to define graphs and
annotations are given in Table 3. Required attributes are given in bold.
Table 2 — Elements of the annotation document header
<graphHeader>
  Bracketing tag for elements of the annotation document header.
<labelsDecl>
  List of the annotation labels used in the document and their frequencies.
<labelUsage>
  Information for individual annotation labels.
  Attributes:
    @label [string]: Element name.
    @occurs [integer]: Number of occurrences in the document.
<dependencies>
  Documents required to process the annotations in this document, which will include a
  segmentation document and/or any annotation documents directly referenced in this
  document.
<dependsOn>
  File required to process this annotation.
  Attribute:
    @ann.id [IDREF]: The ID of the document as given in the associated primary data
    document.
<annotationSpaces>
  Annotation spaces referenced in this document.
<annotationSpace>
  Annotation space used in this document.
  Attributes:
    @as.id [IDREF]: The ID of the annotation space as defined in the resource header.
    @default [yes | no]: Indicates whether or not this annotation space is the default in this
    document. If the attribute is not present, no is assumed.
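For illustration, the header elements in Table 2 can be assembled with Python's standard xml.etree.ElementTree module. The sketch below is not part of the standard: the function name, the label values and the identifiers "seg" and "xces" are hypothetical, and the GrAF namespace is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_header(labels, depends_on, spaces, default_space):
    """Build a <graphHeader> element from plain Python values."""
    header = ET.Element("graphHeader")
    decl = ET.SubElement(header, "labelsDecl")
    # One <labelUsage> per distinct label, with its occurrence count.
    for label, count in sorted(Counter(labels).items()):
        ET.SubElement(decl, "labelUsage", {"label": label, "occurs": str(count)})
    deps = ET.SubElement(header, "dependencies")
    for ann_id in depends_on:
        ET.SubElement(deps, "dependsOn", {"ann.id": ann_id})
    ann_spaces = ET.SubElement(header, "annotationSpaces")
    for space in spaces:
        attrs = {"as.id": space}
        if space == default_space:
            attrs["default"] = "yes"
        ET.SubElement(ann_spaces, "annotationSpace", attrs)
    return header


header = build_header(["tok", "tok", "s"], ["seg"], ["xces"], "xces")
xml_text = ET.tostring(header, encoding="unicode")
```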
Table 3 — Graph and annotation elements in GrAF annotation documents
<roots>
  One or more root elements that identify root nodes in the graph. This element is used
  when the graph is either a tree or a forest, i.e. more than one graph that is a
  well-formed tree.
<root>
  The node ID of a root node in the graph. Not all graphs will form a tree, but those that do
  can use the <root> element to identify the root node of the tree.
  Attribute:
    @node.id [IDREF]: The ID of the root node.
<region>
  Region in the artefact being annotated, defined as the area bounded by a non-empty,
  ordered list of anchors. The number of anchors required to bound a region depends on
  the medium being annotated.
  Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @anchors [string] (alternative to @refs): The anchors that bound this region. The
    attribute contains a whitespace-delimited list of anchor values. Applications are
    expected to know how to parse the string representation of an anchor into a location
    in the artefact being annotated. The <region> element shall have either an @anchors
    attribute or a @refs attribute.
    @refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the
    region. The <region> element shall have either an @anchors attribute or a @refs
    attribute.
    @anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors
    attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If
    no @anchor.id is specified for the region, the default anchor type for the document
    (indicated in the resource header) is assumed. If the @refs attribute is used to refer
    to <anchor> elements, the @anchor.id attribute will be specified on the <anchor>
    elements and should not be given on <region>.
  Example:
    anchors="980 983"/>
    anchors="10,59 10,173 149,173 149,59"/>
    anchors="34 42"/>
<anchor>
  Location in the artefact being annotated. How the location is represented is
  medium-dependent. Applications are required to be able to serialize and de-serialize
  location values to and from the strings appearing in the @value attribute as well as in
  the @anchors attribute on the <region> element.
  Attributes:
    @xml:id [ID]: Unique ID for reference from nodes in the graph.
    @value [string]: The offset value of the anchor. How the attribute value is interpreted as
    a location in the artefact being annotated is medium-dependent.
    @anchor.id [string]: The @xml:id of an anchor type defined in the resource header.
<node>
  Node in the graph. The <node> element is empty when connected by an <edge> element to
  another node in the graph (i.e. when the node is a non-terminal node). A child <link>
  element is used when the node refers to a region or regions of primary data (i.e. when the
  node is a terminal/leaf node).
  Attribute:
    @xml:id [ID]: Unique ID for reference from edges and annotations.
<link>
  Identifies region(s) in a base segmentation document referred to by this node.
  Attribute:
    @targets [IDREFS]: Identifiers of referenced region(s).
<edge>
  Edge in the graph.
  Attributes:
    @xml:id [ID]: Unique ID for reference from nodes and annotations.
    @from [IDREF]: ID of the start node of the edge.
    @to [IDREF]: ID of the end node of the edge.
<a>
  Annotation information associated with a node or edge. This tag may be empty if the
  annotation consists of a label only.
  Attributes:
    @label [string]: The label of the annotation. This may be the string used to identify the
    annotation as described by the annotation documentation, a category identifier from a
    data category registry, an identifier from a feature structure library, or any reference to an
    external annotation specification.
    @ref [IDREF]: The ID of the node or edge with which the annotation is associated.
    @as [string]: The ID of the annotation space of which this annotation is a part, as defined
    in the resource header; if no @as attribute is specified, the annotation space designated
    as the default in the annotation document header is assumed.
<fs>
  Feature structure providing additional annotation information. An <a> element may not
  contain more than one <fs> element. The <fs> element may contain one or more <f>
  elements.
<f>
  Attribute/value pair. In the concise form (given here), the <f> element is empty and
  includes attributes providing simple name/value pairs. More complex feature structures
  may be represented according to the specification in ISO 24610-1, which should be
  consulted for details.
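How the elements in Table 3 fit together can be sketched with a small Python reader. This is an illustration under stated assumptions, not a reference implementation: the two sample documents are invented, their headers are omitted, the attribute names name and value on <f> are assumed because the extract above does not show them, and only annotations that refer to nodes (not edges) are handled.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://www.xces.org/ns/GrAF/1.0/"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented base segmentation document (header omitted for brevity).
segmentation = """<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <region xml:id="r1" anchors="3 6"/>
</graph>"""

# Invented annotation document; name/value on <f> are assumed names.
annotations = """<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <node xml:id="n1"><link targets="r1"/></node>
  <a label="tok" ref="n1" as="xces">
    <fs><f name="pos" value="NN"/><f name="lemma" value="dog"/></fs>
  </a>
</graph>"""


def load_regions(xml_text):
    """Map region IDs to tuples of integer anchors."""
    root = ET.fromstring(xml_text)
    return {
        r.get(XML_ID): tuple(int(a) for a in r.get("anchors").split())
        for r in root.findall("g:region", NS)
    }


def load_annotations(xml_text, regions):
    """Return (label, linked regions, features) for each node annotation."""
    root = ET.fromstring(xml_text)
    node_regions = {
        n.get(XML_ID): [
            regions[target]
            for link in n.findall("g:link", NS)
            for target in link.get("targets").split()
        ]
        for n in root.findall("g:node", NS)
    }
    result = []
    for a in root.findall("g:a", NS):
        features = {f.get("name"): f.get("value") for f in a.findall("g:fs/g:f", NS)}
        result.append((a.get("label"), node_regions[a.get("ref")], features))
    return result


regions = load_regions(segmentation)
parsed = load_annotations(annotations, regions)
```

Applied to the text "My dog has fleas", the single annotation resolves to the span from anchor 3 to anchor 6, i.e. the word "dog".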
...
SLOVENIAN STANDARD
SIST ISO 24612:2013
1 July 2013
Upravljanje z jezikovnimi viri - Ogrodje za jezikoslovno označevanje (LAF)
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
This Slovenian standard is identical to: ISO 24612:2012
ICS:
01.020 Terminology (principles and coordination)
SIST ISO 24612:2013 en,fr,de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
INTERNATIONAL STANDARD ISO 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
ISO 24612:2012(E)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 LAF specification
3.1 Overview
3.2 LAF data model
3.3 LAF architecture
3.4 XML pivot format
3.5 XML elements for the resource header
3.6 Elements in the primary data document header
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has
been established has the right to be represented on that committee. International organizations, governmental
and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
Effective creation, encoding, processing and management of language resources is facilitated by a single
high-level data model that supports analysis and design of both annotation schemes and representation
formats. This International Standard is designed to support the development and use of computer applications
relying on linguistically annotated resources and the exchange of these resources among different
applications.
INTERNATIONAL STANDARD ISO 24612:2012(E)
Language resource management — Linguistic annotation
framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic
annotations of language data such as corpora, speech signal and video. The framework includes an abstract
data model and an XML serialization of that model for representing annotations of primary data. The
serialization serves as a pivot format to allow annotations expressed in one representation format to be
mapped onto another.
NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and
other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of
characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech
annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary
data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super-
and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segments
typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks
(e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in
the document containing the primary data itself.
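The four kinds of element named in this note can be illustrated informally with character offsets in Python. The sample text, the variable names and the tuple layout are hypothetical and not part of the standard.

```python
text = "My dog has fleas"

# (1) Continuous segments: one (start, end) pair per word.
word_segments = [(0, 2), (3, 6), (7, 10), (11, 16)]

# (2) Super-segment: a sentence spanning from the first word to the last.
sentence = (word_segments[0][0], word_segments[-1][1])

# (3) Discontinuous segment: a list linking non-adjacent continuous segments.
discontinuous = [word_segments[0], word_segments[2]]

# (4) Landmark: a single point in the primary data.
landmark = (6,)

linked_words = [text[s:e] for s, e in discontinuous]
```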
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in
the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.
In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that
segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing
the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,
elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can
refer to the same primary document (e.g. two different part of speech annotations for a given text).
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,
audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be
coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations
  associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be
  linked to regions of primary data. Nodes and edges may be associated with feature structures describing
  linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and
regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.
The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary
data as well as other annotations, in which nodes are associated with feature structures providing the
annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the
mapping) isomorphic to the LAF data model.
NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated
linguistic phenomena).
Figure 1 — LAF data model
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the following, described in more detail in
3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with
  some or all of the nodes and/or edges in a directed graph. All nodes reference either a base
  segmentation document (in which case the node has no outgoing edges) or other nodes in the same or
  other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the
  base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and
  annotations, as well as headers for each primary data document and each annotation document in the
  collection (see 3.3.5).
It is recommended that whenever possible, each primary data document also be associated with an original
artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the
original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video.
Primary data in a LAF-compliant resource are frozen as “read-only” to preserve the integrity of references to
locations within the document or documents. Corrections and modifications to the primary data are treated as
annotations and stored in a separate annotation document. Primary data documents containing textual data
are encoded in UTF-8 (default) or UTF-16.
In the general case, primary data does not contain markup of any kind. If markup does exist in primary data
(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is
made between markup and other characters in the data when referring to locations in the document.
3.3.3 Annotation documents
Annotation documents contain linguistic information describing primary data. Annotations are always
associated with a node in a graph that references regions defined over primary data, either directly or
via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary
data. LAF recommends representing each of the linguistic layers defined in language resource management
in a separate annotation document for the purposes of exchange.
The granularity of the annotation — i.e. the smallest information unit to which the annotation applies — is
dependent on the application. For example, a single annotation over text may cover a phoneme, word,
sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a
temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data
Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are
located between the base units of the primary data representation.
Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound
the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying
one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors
that refer to one or more points in the medium (e.g. an “instant” or “timestamp”). Anchors are represented by
n-tuples consisting of sets of spatial and temporal offsets. For example, consider the text “My dog has fleas”:
                    1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s|
The anchors for each word are the following:
My: start=0, end=2
dog: start=3, end=6
has: start=7, end=10
fleas: start=11, end=16
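As a rough illustration, the word anchors listed above can be recomputed from the text itself. The following Python sketch assumes that anchors for character data are zero-based offsets located between characters (counted in Unicode code points); the function name word_anchors is hypothetical and is not defined by LAF.

```python
import re


def word_anchors(text):
    """Return (word, start, end) triples for each whitespace-delimited word.

    Anchors are counted between characters, so `end` is the offset of the
    position immediately after the last character of the word.
    """
    # Each match is a maximal run of non-whitespace characters;
    # m.start() and m.end() are the offsets that bound that run.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for word, start, end in word_anchors("My dog has fleas"):
    print(f"{word}: start={start}, end={end}")
```

Run on the example text, this prints the same four start/end pairs as the list above.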
A set of regions defined over a document containing primary data need not be contiguous (i.e. there may be
portions of the primary data not included in any region), but they should not, in general, overlap. Overlapping
regions should be treated as composed of finer-grained sub-components. For example, two spans, <5, 9> and
<7, 15>, can be reconstrued as three spans, a = <5, 7>, b = <7, 9>, and c = <9, 15>. Two graph nodes can
then be created that reference <a, b> and <b, c>, thereby providing the coverage of regions <5, 9> and
<7, 15>. Discontiguous regions are referenced by creating nodes referencing each component region and
adding a node that is in turn linked to them.
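The decomposition of overlapping spans described above can be sketched in Python as follows. This is only one possible approach, assuming that spans are (start, end) pairs of integer offsets; the function name split_overlaps is hypothetical and LAF does not prescribe any particular algorithm.

```python
def split_overlaps(spans):
    """Split possibly overlapping (start, end) spans into non-overlapping pieces.

    Returns (pieces, cover), where `cover` maps each original span to the
    list of pieces that together cover it.
    """
    # Every span boundary is a potential cut point.
    cuts = sorted({point for span in spans for point in span})
    # Keep only the pieces between consecutive cut points that lie inside
    # at least one original span, so uncovered gaps are not turned into regions.
    pieces = [
        (lo, hi)
        for lo, hi in zip(cuts, cuts[1:])
        if any(s <= lo and hi <= e for s, e in spans)
    ]
    cover = {
        (s, e): [(lo, hi) for lo, hi in pieces if s <= lo and hi <= e]
        for s, e in spans
    }
    return pieces, cover


pieces, cover = split_overlaps([(5, 9), (7, 15)])
print(pieces)          # [(5, 7), (7, 9), (9, 15)]
print(cover[(5, 9)])   # [(5, 7), (7, 9)]
print(cover[(7, 15)])  # [(7, 9), (9, 15)]
```

Each entry in cover corresponds to one graph node that references the listed component regions.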
The media types included in the resource are defined in the resource header. Each medium is associated with
one or more anchor types. The header for each primary data document identifies the medium for that
document, which in turn indicates the type of anchors used.
In the general case, primary data does not contain markup of any kind. If markup appears in primary data (e.g.
HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made
between markup and other characters in the data when referring to locations in the document. For primary
data comprising a valid XML document, anchors may reference XML elements using the W3C XPath 2.0
Language (www.w3.org/TR/xpath20/), in which case the associated anchor type is defined in the resource
header as an XPath expression. References to locations within these XML elements (i.e. XML element
content) can be made using standard offsets, which will be computed by including the markup as part of the
data stream; in this case, two media types would be associated with the primary document’s file type. See
3.3.5.2 for a full description of anchor and media type definitions in the resource header.
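The effect of treating markup as part of the data stream can be seen in a small Python sketch. The sample string and the helper name find_region are hypothetical, and offsets are assumed to be counted in characters from the start of the document.

```python
def find_region(data, text):
    """Return the (start, end) offsets of the first occurrence of `text`.

    Offsets are computed over the raw character stream, so any markup
    preceding the match is counted like any other characters.
    """
    start = data.index(text)
    return start, start + len(text)


plain = "My dog has fleas"
marked_up = "<p>My <b>dog</b> has fleas</p>"

print(find_region(plain, "dog"))      # (3, 6)
print(find_region(marked_up, "dog"))  # (9, 12)
```

The same word receives different offsets in the two documents because the tags occupy positions in the stream.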
3.3.5 Headers
3.3.5.1 Overview
LAF defines a header for a resource consisting of a collection of primary data documents and annotations, as
well as headers for the primary data and annotation documents themselves. This set of headers provides all
metadata describing the provenance and encoding conventions for the data and its annotations, together with
information required for processing, such as anchor types or relations among primary data and annotation
documents in the corpus.
3.3.5.2 Resource header
The resource header describes the resource as a whole, including its contents, file structure and encoding,
and establishes definitions that are used in the primary data document and annotation document headers.
Among these are the following.
Categories used to describe primary data documents, typically the domain/subject area of general text.
File types providing their naming conventions, media, annotation type, and dependencies (i.e. other file
types that are referenced and therefore required). The specification of file types enables automatic
validation that all required elements of the resource are present.
Annotation spaces used to provide context for annotations and enable resolution of naming conflicts.
Annotation declarations describing the annotations in the resource, including their names, creator, links to
relevant documentation and, optionally, an associated annotation schema.
Media definitions specifying the media types included in the corpus and file naming conventions for files
containing data of that type.
Anchor types associating anchor type definitions with media types.
Group definitions providing the names, descriptions and members of user-defined groups of annotations.
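The automatic validation enabled by file type dependencies might look like the following Python sketch. It assumes the dependencies have already been read from the resource header into a dictionary; the file type names and the function name missing_dependencies are hypothetical and do not come from the standard.

```python
def missing_dependencies(file_types, present):
    """Return {file_type: [missing required types]} for the types present.

    `file_types` maps each file type name to the file types it depends on;
    `present` is the set of file types actually found for a document.
    """
    missing = {}
    for ftype in present:
        lacking = [dep for dep in file_types.get(ftype, []) if dep not in present]
        if lacking:
            missing[ftype] = lacking
    return missing


# Hypothetical file types, each mapped to the file types it requires.
file_types = {
    "text": [],
    "segmentation": ["text"],
    "pos": ["segmentation"],
}

print(missing_dependencies(file_types, {"text", "pos"}))
```

In this example the part-of-speech file is present but the segmentation it depends on is not, so the check reports it.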
3.3.5.3 Primary data document header
Each primary data document is associated with an XML header file containing information describing its
contents. Because the primary data document is not an XML document, the LAF primary data header is
obligatory and shall be provided as a standalone file.
The primary data document header provides information about the source and contents of the primary data,
as well as specifying category definitions and medium type by reference to definitions in the resource header.
The primary data document header provides the PID for the primary data document and all associated
annotation documents. The primary data document header provides all the information needed to process
annotations associated with a given primary data document. It is presumed that this file is loaded first when a
document and its annotations are to be processed.
The elements in the primary data document header are given in 3.6.
3.3.5.4 Annotation document header
The annotation document header includes a relevant subset of elements from the primary data header (i.e.
those that describe the file contents rather than the provenance of an original text, etc.), together with
additional elements that provide or point to information concerning the annotation content categories and
dependencies between the annotation document and other documents. The annotation document header is
not a separate document, but rather is included at the beginning of the annotation document. The elements in
the annotation document header are given in 3.4.3.
3.4 XML pivot format
3.4.1 Overview
The LAF provides an XML serialization of the data model that is designated as the pivot format. A pivot format
is intended to serve as an “interlingua” for translation among multiple other formats, by providing a common
target into and out of which other formats can be transduced. Although the LAF pivot format may be used in
any context, it is assumed that users will represent annotations using their own formats, which can then be
transduced to the LAF pivot format for the purposes of exchange, merging and comparison.
The graph annotation format (GrAF) specifies the XML serialization of the LAF pivot format.
In GrAF:
The fundamental data structure is a directed graph consisting of a set of nodes and a set of edges.
An annotation is a label and (optionally) a feature structure associated with a node or an edge in the
graph.
A feature structure is an attribute value graph (AVG). The value of a feature may be an atomic value or
another feature structure.
An atomic feature value is a mapping from one string (the feature name) to another string (the atomic
value). GrAF makes no attempt to do typing of feature values.
Nodes may be associated with regions in the primary document, or connected to other nodes in the same
or another annotation document. Nodes are associated with regions by <link> elements. Edges are used
to connect (associate) nodes to other nodes.
An edge represents a relationship between nodes. By default, the set of outgoing edges from a node
represents an ordered set of constituents of the annotation associated with the node. Other relationships
may be specified by associating an annotation with the edge.
For all GrAF documents the namespace http://www.xces.org/ns/GrAF/1.0/ is used.
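The data model listed above can be sketched as a handful of Python data classes. This is an in-memory illustration only, not the GrAF XML serialization; all class, field and method names are hypothetical, and the labels and feature values in the usage lines are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

# A feature value is either an atomic string or a nested feature structure.
FeatureValue = Union[str, "FeatureStructure"]


@dataclass
class FeatureStructure:
    features: Dict[str, FeatureValue] = field(default_factory=dict)


@dataclass
class Annotation:
    label: str
    fs: Optional[FeatureStructure] = None  # the feature structure is optional


@dataclass
class Node:
    id: str
    regions: List[str] = field(default_factory=list)  # IDs of referenced regions
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Edge:
    id: str
    source: str  # ID of the start node
    target: str  # ID of the end node
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Graph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def constituents(self, node_id):
        """Return the end-node IDs of the outgoing edges of a node, in stored order."""
        return [e.target for e in self.edges if e.source == node_id]


g = Graph()
g.nodes["n1"] = Node("n1", regions=["r1"],
                     annotations=[Annotation("tok", FeatureStructure({"pos": "PRP$"}))])
g.nodes["n2"] = Node("n2", regions=["r2"],
                     annotations=[Annotation("tok", FeatureStructure({"pos": "NN"}))])
g.nodes["n3"] = Node("n3", annotations=[Annotation("NP")])
g.edges += [Edge("e1", "n3", "n1"), Edge("e2", "n3", "n2")]
print(g.constituents("n3"))  # ['n1', 'n2']
```

Keeping the edges in a list preserves their order, which matches the default reading of outgoing edges as an ordered set of constituents.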
3.4.2 XML elements for annotation documents
The overall content model of a GrAF standoff annotation file is as follows:
graph = graphHeader (node | edge | a | anchor)*
node = link*
a = fs?
fs = f+
f = atomic | fs
start = graph
The root element is defined in Table 1. Required attributes are given in bold.
Table 1 — Root element for GrAF annotation documents
<graph>
  Root node of the graph.
  Attribute @xmlns [URL]: namespace declaration for the GrAF schema
Example
3.4.3 The annotation document header
The overall content model for the annotation document header is as follows:
graphHeader = labelsDecl, dependencies, annotationSpaces
labelsDecl = labelUsage+
dependencies = dependsOn+
annotationSpaces = annotationSpace+
start = graphHeader
Elements of the annotation document header are given in Table 2, and elements to define graphs and
annotations are given in Table 3. Required attributes are given in bold.
Table 2 — Elements of the annotation document header
<graphHeader>
  Bracketing tag for elements of the annotation document header.
<labelsDecl>
  List of the annotation labels used in the document and their frequencies.
<labelUsage>
  Information for individual annotation labels.
  Attributes
  @label [string]: Element name.
  @occurs [integer]: Number of occurrences in the document.
<dependencies>
  Documents required to process the annotations in this document, which will include a
  segmentation document and/or any annotation documents directly referenced in this
  document.
<dependsOn>
  File required to process this annotation.
  Attribute
  @ann.id [IDREF]: The ID of the document as given in the associated primary data
  document.
<annotationSpaces>
  Annotation spaces referenced in this document.
<annotationSpace>
  Annotation space used in this document.
  Attributes
  @as.id [IDREF]: The ID of the annotation space as defined in the resource header.
  @default [yes | no]: Indicates whether or not this annotation space is the default in this
  document. If the attribute is not present, no is assumed.
Table 3 — Graph and annotation elements in GrAF annotation documents
<roots>
  One or more <root> elements that identify root nodes in the graph. This element is used
  when the graph is either a tree or a forest (i.e. more than one well-formed tree).
<root>
  The node ID of a root node in the graph. Not all graphs will form a tree, but those that do
  can use the <root> element to identify the root node of the tree.
  Attribute
  @node.id [IDREF]: The ID of the root node.
<region>
  Region in the artefact being annotated, defined as the area bounded by a non-empty,
  ordered list of anchors. The number of anchors required to bound a region depends on
  the medium being annotated.
  Attributes
  @xml:id [ID]: Unique ID for reference from nodes in the graph.
  @anchors [string] (alternative to @refs): The anchors that bound this region. The
  attribute contains a whitespace-delimited list of anchor values.
  Applications are expected to know how to parse the string representation of an anchor
  into a location in the artefact being annotated. The <region> element shall have either an
  @anchors attribute or an @refs attribute.
  @refs [IDREFS] (alternative to @anchors): ID references to the anchors that bound the
  region. The <region> element shall have either an @anchors attribute or an @refs
  attribute.
  @anchor.id [IDREF]: The anchor type of the anchors referenced in the @anchors
  attribute. This is the @xml:id of one of the anchorTypes defined in the resource header. If
  no @anchor.id is specified for the region, the default anchor type for the document
  (indicated in the resource header) is assumed. If the @refs
  attribute is used to refer to <anchor> elements, the @anchor.id attribute will be specified
  on the <anchor> elements and should not be given on <region>.
  Example
  <region … anchors="980 983"/>
  <region … anchors="10,59 10,173 149,173 149,59"/>
  <region … anchors="34 42"/>
<anchor>
  Location in the artefact being annotated. How the location is represented is
  medium-dependent. Applications are required to be able to serialize and de-serialize
  location values to and from the strings appearing in the @value attribute of the <anchor>
  element as well as the @anchors attribute of the <region> element.
  Attributes
  @xml:id [ID]: Unique ID for reference from nodes in the graph.
  @value [string]: The offset value of the anchor. How the attribute value is interpreted as
  a location in the artefact being annotated is medium-dependent.
  @anchor.id [string]: The @xml:id of an anchor type defined in the resource header.
<node>
  Node in the graph. The <node> element is empty when connected by an <edge> element to
  another node in the graph (i.e. when the node is a non-terminal node). A child <link>
  element is used when the node refers to a region or regions of primary data (i.e. when the
  node is a terminal/leaf node).
  Attribute
  @xml:id [ID]: Unique ID for reference from edges and annotations.
<link>
  Identifies region(s) in a base segmentation document referred to by this node.
  Attribute
  @targets [IDREFS]: Identifiers of referenced region(s).
  Example
<edge>
  Edge in the graph.
  Attributes
  @xml:id [ID]: Unique ID for reference from nodes and annotations.
  @from [IDREF]: ID of the start node of the edge.
  @to [IDREF]: ID of the end node of the edge.
  Example
<a>
  Annotation information associated with a node or edge. This tag may be empty if the
  annotation consists of a label only.
Attributes @label [string]: The label of the annotation. This may be the string us
...
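The whitespace-delimited @anchors values used for regions in Table 3 could be parsed along the following lines. This Python sketch assumes that every anchor is either a single integer offset or a comma-separated tuple of integers (as in an image coordinate such as 10,59); that reading of the comma, and the function name parse_anchors, are assumptions, since the interpretation of anchor strings is medium-dependent.

```python
def parse_anchors(value):
    """Split an @anchors string into a list of anchors.

    Each whitespace-delimited item becomes a tuple of integers; an item
    such as "10,59" becomes a two-dimensional anchor (assumed convention).
    """
    return [tuple(int(part) for part in item.split(",")) for item in value.split()]


print(parse_anchors("980 983"))
# [(980,), (983,)]
print(parse_anchors("10,59 10,173 149,173 149,59"))
# [(10, 59), (10, 173), (149, 173), (149, 59)]
```

A real application would select the parsing rule from the anchor type declared for the medium rather than hard-coding it.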