SIST ISO 24614-1:2013
Language resource management -- Word segmentation of written texts -- Part 1: Basic concepts and general principles
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU).
Gestion des ressources langagières -- Segmentation des mots dans les textes écrits -- Partie 1: Notions fondamentales et principes généraux
SLOVENIAN STANDARD
SIST ISO 24614-1:2013
1 July 2013
Language resource management -- Word segmentation of written texts -- Part 1: Basic
concepts and general principles
Gestion des ressources langagières -- Segmentation des mots dans les textes écrits --
Partie 1: Notions fondamentales et principes généraux
This Slovenian standard is identical to: ISO 24614-1:2010
ICS:
01.020 Terminology (principles and coordination)
01.140.10 Writing and transliteration
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24614-1:2013 en,fr,de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD ISO 24614-1
First edition
2010-11-01
Language resource management — Word
segmentation of written texts —
Part 1:
Basic concepts and general principles
Gestion des ressources langagières — Segmentation des mots dans
les textes écrits —
Partie 1: Notions fondamentales et principes généraux
Reference number
ISO 24614-1:2010(E)
© ISO 2010
COPYRIGHT PROTECTED DOCUMENT
© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Basic framework for word segmentation
4 General principles of word segmentation
Annex A (informative) Representing word segmentation in XML
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
ISO 24614 consists of the following parts, under the general title Language resource management — Word
segmentation of written texts:
⎯ Part 1: Basic concepts and general principles
⎯ Part 2: Word segmentation for Chinese, Japanese and Korean
Word segmentation for other languages is to form the subject of a future Part 3.
Introduction
Word segmentation is the division of text into linguistic units that carry meaning. For example, “the white
house” can be divided into three meaningful units, “the”, “white” and “house”, when it refers to a house that is
white, whereas “the White House” corresponds to only one meaningful unit when it refers to the residence of
the US President.
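The contrast above can be sketched in a few lines of Python. This is a minimal illustration, not a method prescribed by ISO 24614: it assumes a small hypothetical lexicon of multiword units (`MULTIWORD_UNITS`) and a hypothetical function `segment` that splits on white space and then greedily merges the longest known unit. Matching is case-sensitive, which is what separates “the white house” from “the White House” here.

```python
from typing import List, Set, Tuple

# Hypothetical lexicon of multiword units. The entries are examples quoted in
# this Introduction; the data structure itself is an assumption.
MULTIWORD_UNITS: Set[Tuple[str, ...]] = {
    ("the", "White", "House"),
    ("Cape", "Town"),
    ("take", "care", "of"),
}
MAX_UNIT_LENGTH = max(len(unit) for unit in MULTIWORD_UNITS)


def segment(text: str) -> List[str]:
    """Split on white space, then greedily merge known multiword units."""
    tokens = text.split()
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that longer units win.
        for length in range(min(MAX_UNIT_LENGTH, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in MULTIWORD_UNITS:
                units.append(" ".join(candidate))
                i += length
                break
        else:
            # No multiword unit starts at this token: keep it as a single unit.
            units.append(tokens[i])
            i += 1
    return units


print(segment("the white house"))  # ['the', 'white', 'house']
print(segment("the White House"))  # ['the White House']
```

A lexicon lookup of this kind cannot decide cases that depend on context (for example a sentence-initial “The White House”), so it should be read only as an illustration of the distinction.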
For the purposes of ISO 24614, such meaningful linguistic units are called word segmentation units (WSU).
As demonstrated in the previous example, a WSU can comprise more than one word. A WSU can
consist of a stem and affixes (e.g. “re+work+ing”). It can be a compound word (e.g. “blackboard”), a proper
noun (e.g. “Cape Town”), an idiom (e.g. “It's raining cats and dogs”), or a multiword expression (e.g. “take
care of”). For languages that have spaces between words, such as English, segmenting a text into WSUs is
facilitated by using the spaces as a basis for establishing the boundaries of a WSU, although additional
considerations need to be taken into account for handling abbreviations, punctuation and multiword units of
meaning, among others. For languages that do not have spaces between words, such as Chinese and
Japanese, or for languages that only partially separate words with spaces, such as Thai and Korean, segmenting
a text into WSUs requires a different approach.
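One widely used baseline for text without spaces is dictionary-based forward maximum matching. The sketch below is an illustration of that technique only; this part of ISO 24614 does not prescribe it. The function name `forward_maximum_match` and the toy lexicon are assumptions. The lexicon holds the three Chinese words cited in 2.5, and the input string is simply two of them concatenated, not a natural sentence.

```python
from typing import List, Set

# Toy lexicon (an assumption): the three Chinese words cited in 2.5.
LEXICON: Set[str] = {"伟大", "伟人", "雄伟"}


def forward_maximum_match(text: str, lexicon: Set[str] = LEXICON) -> List[str]:
    """Scan left to right, always taking the longest lexicon entry available."""
    max_length = max((len(word) for word in lexicon), default=1)
    units: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing in the lexicon matches.
            if length == 1 or candidate in lexicon:
                units.append(candidate)
                i += length
                break
    return units


print(forward_maximum_match("雄伟伟大"))  # ['雄伟', '伟大']
```

Greedy matching of this kind can choose the wrong boundary when entries overlap, which is one reason language-specific rules are treated separately.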
Furthermore, word segmentation is complex for languages that are characterized by extensive compounding,
such as Chinese, and for languages that are characterized by extensive agglutination, such as Japanese,
Korean and Hungarian. On the other hand, the fact that Japanese uses multiple scripts is beneficial for
word segmentation.
Even in languages that use spaces, white space alone is not sufficient to segment a text. “Apple pie”, for example, can be understood as a
kind of pie made of apples, so that “apple” and “pie” are treated as two distinct WSUs. Alternatively, it can be
viewed as a single entity due to its collocational and idiomatic properties, and treated as a single WSU.
Segmentation rules can differ between languages, even when applied to equivalent expressions (as
discussed in ISO 24614-2).
Elaborating standards for the rules and methods for word segmentation can facilitate innovation and
development in areas such as language learning and translation. Such standards could also improve language-related
technologies, including spell checking, grammar checking, dictionary lookup, terminology management,
translation memory, information retrieval, information extraction and machine translation. For instance,
translation memory and machine translation technologies that fail to identify “kick the bucket” as a single WSU
would produce a literal rather than idiomatic translation.
This part of ISO 24614 is the first in a series of International Standards targeted at word segmentation in
written languages. It focuses on the basic concepts and general principles of word segmentation that apply to
languages in general. The subsequent parts will, however, focus on the issues specific to particular languages.
INTERNATIONAL STANDARD ISO 24614-1:2010(E)
Language resource management — Word segmentation of
written texts —
Part 1:
Basic concepts and general principles
1 Scope
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and
provides language-independent guidelines to enable written texts to be segmented, in a reliable and
reproducible manner, into word segmentation units (WSU).
NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical
to have a universal definition of what constitutes a word for the purposes of segmenting a text into words. One cannot
simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as
hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word
segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and
Japanese, and for agglutinative languages such as Korean, where some functional word classes are realized as affixes.
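The limitation described in NOTE 1 can be seen with a short Python sketch. Both patterns below are assumptions made for illustration: `NAIVE` treats every run of word characters as a word, while `REFINED` is one hypothetical refinement that keeps dotted abbreviations and hyphenated compounds together. Neither is taken from ISO 24614, and the refined pattern still ignores idioms and expressions that contain symbols or numbers.

```python
import re
from typing import List

# Rule based only on spaces and punctuation: any run of word characters.
NAIVE = re.compile(r"\w+")
# Hypothetical refinement: dotted abbreviations first, then words that may
# contain internal hyphens.
REFINED = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:-\w+)*")


def naive_split(text: str) -> List[str]:
    return NAIVE.findall(text)


def refined_split(text: str) -> List[str]:
    return REFINED.findall(text)


sentence = "The U.S. team built a state-of-the-art tool."
print(naive_split(sentence))
# ['The', 'U', 'S', 'team', 'built', 'a', 'state', 'of', 'the', 'art', 'tool']
print(refined_split(sentence))
# ['The', 'U.S.', 'team', 'built', 'a', 'state-of-the-art', 'tool']
```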
The many applications and fields that need to segment texts into words — and thus to which this part of
ISO 24614 can be applied — include the following.
Translation
Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard
function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is
performed by term extraction tools, which are sometimes provided in terminology management systems and
CAT tools.
Content management
Most content management systems and databases allow for searching by individual words. The content being
searched has to be segmented to permit matching with a search word. Furthermore, search functions require
knowledge of the boundaries of words.
Speech technologies
Text-to-speech systems generate speech based on words and therefore require word segmentation for
lexicon lookup, stress assignment, prosodic pattern assignment, etc.
Computational linguistics
Various natural language processing (NLP) systems must segment text into words in order to carry out their
functions. NLP systems include
⎯ morphosyntactic processors,
⎯ syntactic parsers,
⎯ spellcheckers,
⎯ text classification systems, and
⎯ corpus linguistics annotators.
Lexicography
Lexical resources are often evaluated by size, usually by referring to the number of words.
NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of
language resources is typically achieved by counting the words. However, because NLP applications use different
segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A
reliable, reproducible, standard measure would allow comparable results. This is not to say that applications may not use
their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text
into smaller or larger units compared to another application.
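The effect described in NOTE 2 is easy to reproduce. The sketch below counts the words of one sentence with two simple segmentation methods; both functions are hypothetical stand-ins for application-specific tokenizers, and neither is the standard measure the note calls for.

```python
import re


def count_by_whitespace(text: str) -> int:
    """Count units separated by white space."""
    return len(text.split())


def count_by_word_characters(text: str) -> int:
    """Count maximal runs of word characters, splitting at every punctuation mark."""
    return len(re.findall(r"\w+", text))


sample = "It's a state-of-the-art tool."
print(count_by_whitespace(sample))       # 4
print(count_by_word_characters(sample))  # 8
```

The same sentence is counted as four words by one method and eight by the other, which is why sizes reported by different tools are not directly comparable.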
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
abbreviation
verbal designation formed by omitting words or letters from a longer form and designating the same concept
[ISO 1087-1:2000]
2.2
affix
bound morpheme (2.5) which may be added to a stem (2.22) or a lexeme (2.14)
NOTE Affixes can be classified into several sub-types such as prefix, suffix, infix and circumfix. Affixes can be
derivational or they can be inflectional or agglutinative.
2.3
agglutination
process of concatenating one or more affixes (2.2) to a stem (2.22)
[ISO 24613:2008]
2.4
borrowing
process of word formation in which a linguistic expression is adopted from another language, usually when no
term exists for the new object or concept
2.5
bound morpheme
morpheme (2.18) that appears only together with one or several other morphemes
[ISO 24613:2008]
EXAMPLE 1 Chinese: 伟 means “great,” but cannot stand by itself as a word in text. Instead, it is used as a constituent
element of many words, such as 伟大 (“great”), 伟人 (“giant”), and 雄伟 (“majesty”).
EXAMPLE 2 Korean: the suffix “-e”, which is equivalent to the English preposition “to” — as in “hakkyo-e” (to school)
— is a bound morpheme.
2.6
compound
word (2.23) built from two or more lexemes (2.14)
NOTE 1 Adapted from ISO 24613:2008, definition 3.10.
NOTE 2 A compound may be endocentric if it has a head (i.e. the fundamental part that contains the basic meaning of
the whole compound) and modifiers (which restrict this meaning), or exocentric if it does not have a head. A compound
can be long. There are two main sub-types of compound according to their degree of lexicalization: word compound and
phrasal compound.
2.7
compounding
word formation in which a new word is formed by adjoining at least two lexemes (2.14), in their original forms
or with slight transformations
[ISO 24613:2008]
2.8
derivation
change in the form of a word (2.23) to create a new word (2.23), usually by modifying the stem (2.22) or by
affixation
[ISO 24613:2008]
2.9
free morpheme
morpheme (2.18) that can be used as a word (2.23) by itself
EXAMPLE Given the word “goodness”, “good” is a free morpheme, whereas “-ness” is not. The latter is a bound
morpheme.
2.10
homograph
each of two or more word forms (2.24) or words (2.23) with identical spelling but representing different
concepts (semantic homography) or syntactic functions (syntactic homography)
[ISO 1087-2:2000]
2.11
inflection
process in which a word form (2.24) is made up by adding an affix (2.2) to a stem (2.22)
NOTE Inflection is a grammatical rather than lexical process.
2.12
lemma
conventional form chosen to represent a lexeme (2.14)
[ISO 24613:2008]
EXAMPLE Given a set of word forms such as “find”, “finds”, “found” and “finding” in English, the form “find” is
chosen as a lemma to represent the group of all these word forms.
2.13
lemmatization
process of determining the lemma (2.12) for a given word form (2.24) in a context
EXAMPLE Given the word “found” in English, lemmatization results in “find” as its lemma.
NOTE Adapted from ISO 1087-2:2000, definition 2.19 and ISO 30042:2008, definition 3.14.
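As a minimal sketch of definitions 2.12 and 2.13, lemmatization can be pictured as a lookup from word form to lemma. The table `LEMMA_BY_WORD_FORM` and the function `lemmatize` are assumptions made for illustration. The sketch ignores context, so it cannot tell “found” as a form of “find” from the homograph verb “found” (to establish), which the definition's phrase “in a context” is meant to cover.

```python
from typing import Dict, Optional

# Word forms and lemma from the EXAMPLE in 2.12; the table itself is hypothetical.
LEMMA_BY_WORD_FORM: Dict[str, str] = {
    "find": "find",
    "finds": "find",
    "found": "find",
    "finding": "find",
}


def lemmatize(word_form: str) -> Optional[str]:
    """Return the lemma for a known word form, or None if the form is unknown."""
    return LEMMA_BY_WORD_FORM.get(word_form.lower())


print(lemmatize("found"))  # find
```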
2.14
lexeme
abstract unit generally associated with a set of forms sharing a common meaning
[ISO 24613:2008]
NOTE 1 A lexeme may be a part of another lexeme, as a consequence of derivation and compounding.
NOTE 2 “Form” is defined in ISO 24613 as “sequence of morphs”.
2.15
lexicalization
process of making a linguistic unit function as a word
NOTE Such a linguistic unit can be a single morph, e.g. “laugh”; a sequence of morphs, e.g. “apple pie”; or even a
phrase that forms an idiom, e.g. “kick the bucket”.
2.16
lexicon
list of entries mainly headed by lemmas (2.12) with associated information
2.17
morph
surface form represented by a unique morpheme (2.18)
EXAMPLE In English, the morphs of the plural morpheme “-s” include “-s”, “-en”, and “-NULL” (as in “boys”, “oxen”,
and “sheep”), where “-NULL” has no unique surface form. Thus, the word “boys” consists of the two morphs, “boy” and
“-s”, whereas the morphemes corresponding to the morphs “ox” and “-en” are “ox” and “-s”, respectively.
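The relation between morphs and morphemes in this EXAMPLE can be written down as a small data structure. The sketch below is only one possible encoding: the class `Morph`, the table `ANALYSES` and the two helper functions are assumptions, and the analyses are limited to the three plural forms named in the EXAMPLE.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Morph:
    surface: str   # the morph as labelled in the EXAMPLE
    morpheme: str  # the morpheme that the morph realizes


# Analyses of the three plural forms from the EXAMPLE (hypothetical encoding).
ANALYSES: Dict[str, Tuple[Morph, ...]] = {
    "boys": (Morph("boy", "boy"), Morph("-s", "-s")),
    "oxen": (Morph("ox", "ox"), Morph("-en", "-s")),
    "sheep": (Morph("sheep", "sheep"), Morph("-NULL", "-s")),
}


def morphs(word: str) -> List[str]:
    return [m.surface for m in ANALYSES[word]]


def morphemes(word: str) -> List[str]:
    return [m.morpheme for m in ANALYSES[word]]


print(morphs("oxen"))     # ['ox', '-en']
print(morphemes("oxen"))  # ['ox', '-s']
```

All three words end in the same plural morpheme “-s” even though its morph differs in each.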
2.18
morpheme
smallest unit of meaning expressed by a sequence of phonemes or a sequence of graphemes
[ISO 24613:2008]
NOTE There are two sub-types of morphemes: free morphemes and bound morphemes.
2.19
multiword expression
MWE
lexeme (2.14) made up of a sequence of lexemes that has properties that are not predictable from the
properties of the individual lexemes or their normal mode of combination
[ISO 24613:2008]
NOTE A multiword expression can be a compound (a word compound or phrasal compound), an idiom, a fragment of
a sentence or a sentence (e.g. a proverb or familiar quotation). It is not always possible to specify the part of speech for
the whole MWE span.
2.20
phrasal compound
word (2.23) consisting of two or more lexemes (2.14), the meaning of which is predictable from its constituent
elements
EXAMPLE “Apple pie” in English is a phrasal compound composed of two lexemes, “apple” and “pie”, whose
meanings are preserved in the meaning of the compound.
NOTE 1 Idioms consist of two or more lexical items, but are not phrasal compounds.
NOTE 2 Some linguists regard phrasal compounds as phrases. In practice, however, there is not
always a clear distinction between a word compound and a phrasal compound, or between a phrasal compound and a
phrase, due to the fuzziness of semantic predictability and the degree of lexicalization. Lexico-statistics, word frequency
in particular, plays an important role in this respect.
2.21
reduplication
process in which the entire word (2.23), or part of it, is repeated
2.22
stem
linguistic unit whose form is smaller than or equal to the form of a single lexeme (2.14) and that may be
affected by an inflectional, agglutinative, compositional or derivational process
[ISO 24613:2008]
2.23
word
lexeme (2.14) that has, as a minimal property, a part of speech
[ISO 24613:2008]
2.24
word form
morphosyntactical variant of a given word (2.23)
[ISO 1087-2:2000]
EXAMPLE In English, the strings “find”, “finds”, “found” and “finding” are word forms of the word “find”.
2.25
word segmentation
process of splitting text into a sequence of word segmentation units (2.26)
2.26
word segmentation unit
WSU
word form (2.24) or character string of some other type that is treated as a unit
NOTE A character string that is not a word
...
INTERNATIONAL ISO
STANDARD 24614-1
First edition
2010-11-01
Language resource management — Word
segmentation of written texts —
Part 1:
Basic concepts and general principles
Gestion des ressources langagières — Segmentation des mots dans
les textes écrits —
Partie 1: Notions fondamentales et principes généraux
Reference number
ISO 24614-1:2010(E)
©
ISO 2010
---------------------- Page: 1 ----------------------
ISO 24614-1:2010(E)
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
COPYRIGHT PROTECTED DOCUMENT
© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2010 – All rights reserved
---------------------- Page: 2 ----------------------
ISO 24614-1:2010(E)
Contents Page
Foreword .iv
Introduction.v
1 Scope.1
2 Terms and definitions .2
3 Basic framework for word segmentation.6
4 General principles of word segmentation.10
Annex A (informative) Representing word segmentation in XML.13
Bibliography.14
© ISO 2010 – All rights reserved iii
---------------------- Page: 3 ----------------------
ISO 24614-1:2010(E)
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
ISO 24614 consists of the following parts, under the general title Language resource management — Word
segmentation of written texts:
⎯ Part 1: Basic concepts and general principles
⎯ Part 2: Word segmentation for Chinese, Japanese and Korean
Word segmentation for other languages is to form the subject of a future Part 3.
iv © ISO 2010 – All rights reserved
---------------------- Page: 4 ----------------------
ISO 24614-1:2010(E)
Introduction
Word segmentation is the dividing of text into linguistic units that carry meaning. For example, “the white
house” can be divided into three meaningful units, “the,” “white,” and “house”, when it refers to a house that is
white; whereas “the White House” corresponds to only one meaningful unit when it refers to the residence of
the US President.
For the purposes of ISO 24614, such meaningful linguistic units are called word segmentation units (WSU).
As demonstrated in the previous example, a WSU can be comprised of more than one word. A WSU can
consist of a stem and affixes (e.g. “re+work+ing”). It can be a compound word (e.g. “blackboard”), a proper
noun (e.g. “Cape Town”), an idiom (e.g. “It's raining cats and dogs”), or a multiword expression (e.g. “take
care of”). For languages that have spaces between words, such as English, segmenting a text into WSU is
facilitated by using the spaces as a basis for establishing the boundaries of a WSU, although additional
considerations need to be taken into account for handling abbreviations, punctuation and multiword units of
meaning, among others. For languages that do not have spaces between words, such as Chinese and
Japanese, or for languages that have spaces partially between words, such as Thai and Korean, segmenting
a text into WSU requires a different approach.
Furthermore, word segmentation is complex for languages that are characterized by extensive compounding,
such as Chinese, and for languages that are characterized by extensive agglutination, such as Japanese,
Korean and Hungarian. On the other hand, the fact that Japanese supports multiple scripts is beneficial for
word segmentation.
However, white space alone is not sufficient to segment a text. “Apple pie,” for example, is understood as a
kind of pie made of apples, so “apple” and “pie” are treated as two distinct WSUs. Alternatively, it can be
viewed as a single entity due to its collocational and idiomatic properties, and treated as a single WSU.
Segmentation rules can differ between languages, even when applied to equivalent expressions (as
discussed in ISO 24614-2).
Elaborating standards for the rules and methods for word segmentation can facilitate innovation and
development in areas such as language learning and translation. It could improve language-related
technologies, including spell checking, grammar checking, dictionary lookup, terminology management,
translation memory, information retrieval, information extraction and machine translation. For instance, by
failing to identify “kick the bucket” as a single WSU, translation memory and machine translation technologies
would produce a literal rather than idiomatic translation.
This part of ISO 24614 is the first in a series of International Standards targeted at word segmentation in
written languages. It focuses on the basic concepts and general principles of word segmentation that apply to
languages in general. The subsequent parts will, however, focus on the issues specific to particular languages.
© ISO 2010 – All rights reserved v
---------------------- Page: 5 ----------------------
INTERNATIONAL STANDARD ISO 24614-1:2010(E)
Language resource management — Word segmentation of
written texts —
Part 1:
Basic concepts and general principles
1 Scope
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and
provides language-independent guidelines to enable written texts to be segmented, in a reliable and
reproducible manner, into word segmentation units (WSU).
NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical
to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot
simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as
hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word
segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and
Japanese, and for agglutinative languages, where some functional word classes are realized as affixes, such as Korean.
The many applications and fields that need to segment texts into words — and thus to which this part of
ISO 24614 can be applied — include the following.
Translation
Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard
function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is
performed by term extraction tools, which are sometimes provided in terminology management systems and
CAT tools.
Content management
Most content management systems and databases allow for searching by individual words. The content being
searched has to be segmented to permit matching with a search word. Furthermore, search functions require
knowledge of the boundaries of words.
Speech technologies
Text-to-speech systems generate speech based on words and therefore require word segmentation for
lexicon lookup, stress assignment, prosodic pattern assignment, etc.
Computational linguistics
Various natural language processing (NLP) systems must segment text into words in order to carry out their
functions. NLP systems include
⎯ morphosyntactic processors,
⎯ syntactic parsers,
⎯ spellcheckers,
© ISO 2010 – All rights reserved 1
---------------------- Page: 6 ----------------------
ISO 24614-1:2010(E)
⎯ text classification systems, and
⎯ corpus linguistics annotators.
Lexicography
Lexical resources are often evaluated by size, usually by referring to the number of words.
NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of
language resources is typically achieved by counting the words. However, because NLP applications use different
segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A
reliable, reproducible, standard measure would allow comparable results. This is not to say that applications may not use
their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text
into smaller or larger units compared to another application.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
abbreviation
verbal designation formed by omitting words or letters from a longer form and designating the same concept
[ISO 1087-1:2000]
2.2
affix
bound morpheme (2.5) which may be added to a stem (2.22) or a lexeme (2.14)
NOTE Affixes can be classified into several sub-types such as prefix, suffix, infix and circumfix. Affixes can be
derivational or they can be inflectional or agglutinative.
2.3
agglutination
process of concatenating one or more affixes (2.2) to a stem (2.22)
[ISO 24613:2008]
2.4
borrowing
process of word formation in which a linguistic expression is adopted from another language, usually when no
term exists for the new object or concept
2.5
bound morpheme
morpheme (2.18) that appears only together with one or several other morphemes
[ISO 24613:2008]
EXAMPLE 1 Chinese: 伟 means “great,” but cannot stand by itself as a word in text. Instead, it is used as a constituent
element of many words, such as 伟大 (“great”), 伟人 (“giant”), and 雄伟 (“majesty”).
EXAMPLE 2 Korean: the suffix “-e”, which is equivalent to the English preposition “to” — as in “hakkyo-e” (to school)
— is a bound morpheme.
2 © ISO 2010 – All rights reserved
---------------------- Page: 7 ----------------------
ISO 24614-1:2010(E)
2.6
compound
word (2.23) built from two or more lexemes (2.14)
NOTE 1 Adapted from ISO 24613:2008, definition 3.10.
NOTE 2 A compound may be endocentric if it has a head (i.e. the fundamental part that contains the basic meaning of
the whole compound) and modifiers (which restrict this meaning), or exocentric if it does not have a head. A compound
can be long. There are two main sub-types of compound according to their degree of lexicalization: word compound and
phrasal compound.
2.7
compounding
word formation in which a new word is formed by adjoining at least two lexemes (2.14), in their original forms
or with slight transformations
[ISO 24613:2008]
2.8
derivation
change in the form of a word (2.23) to create a new word (2.23), usually by modifying the stem (2.22) or by
affixation
[ISO 24613:2008]
2.9
free morpheme
morpheme (2.18) that can be used as a word (2.23) by itself
EXAMPLE Given the word “goodness,” “good” is a free morpheme, whereas “-ness” is not. The latter is a bound
morpheme.
2.10
homograph
each of two or more word forms (2.24) or words (2.23) with identical spelling but representing different
concepts (semantic homography) or syntactic functions (syntactic homography)
[ISO 1087-2:2000]
2.11
inflection
process in which a word form (2.24) is made up by adding an affix (2.2) to a stem (2.22)
NOTE Inflection is a grammatical rather than lexical process.
2.12
lemma
conventional form chosen to represent a lexeme (2.14)
[ISO 24613:2008]
EXAMPLE Given a set of word forms such as “find,” “finds,” “found,” and “finding” in English, the form “find” is
chosen as a lemma to represent the group of all these word forms.
2.13
lemmatization
process of determining the lemma (2.12) for a given word form (2.24) in a context
EXAMPLE Given the word “found” in English, lemmatization results in “find” as its lemma.
NOTE Adapted from ISO 1087-2:2000, definition 2.19 and ISO 30042:2008, definition 3.14.
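As a rough illustration of 2.12 and 2.13, lemmatization can be sketched as a table lookup from word form to lemma. The table and the function name below are hypothetical and cover only the forms named in the examples. The sketch also ignores context, although the definition requires it: "found" can equally be a form of the verb "to found", which a real lemmatizer would have to disambiguate.

```python
# Hypothetical toy table: word form -> lemma, taken from the examples above.
LEMMA_TABLE = {
    "find": "find",
    "finds": "find",
    "found": "find",   # context-free assumption; "to found" is ignored here
    "finding": "find",
}


def lemmatize(word_form: str) -> str:
    """Return the lemma for a word form, or the form itself if unknown."""
    return LEMMA_TABLE.get(word_form.lower(), word_form)


print(lemmatize("found"))
```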
2.14
lexeme
abstract unit generally associated with a set of forms sharing a common meaning
[ISO 24613:2008]
NOTE 1 A lexeme may be a part of another lexeme, as a consequence of derivation and compounding.
NOTE 2 “Form” is defined in ISO 24613 as “sequence of morphs”.
2.15
lexicalization
process of making a linguistic unit function as a word
NOTE Such a linguistic unit can be a single morph, e.g. “laugh,” a sequence of morphs, e.g. “apple pie”, or even an
idiomatic phrase, such as “kick the bucket”.
2.16
lexicon
list of entries mainly headed by lemmas (2.12) with associated information
2.17
morph
surface form represented by a unique morpheme (2.18)
EXAMPLE In English, the morphs of the plural morpheme “-s” include “-s”, “-en”, and “-NULL” (as in “boys”, “oxen”,
and “sheep”), where “-NULL” has no unique surface form. Thus, the word “boys” consists of the two morphs, “boy” and
“-s”, whereas the morphemes corresponding to the morphs “ox” and “-en” are “ox” and “-s”, respectively.
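The distinction between morphs (surface forms) and morphemes (the units they realize) in the example above can be made concrete with a small data structure. The table layout and function names are assumptions made for illustration only; the analyses themselves are those given in the example.

```python
from typing import Dict, List, Tuple

# Hypothetical analyses: word form -> list of (morph, morpheme) pairs.
ANALYSES: Dict[str, List[Tuple[str, str]]] = {
    "boys": [("boy", "boy"), ("-s", "-s")],
    "oxen": [("ox", "ox"), ("-en", "-s")],
    "sheep": [("sheep", "sheep"), ("-NULL", "-s")],  # plural reading
}


def morphs(word_form: str) -> List[str]:
    """Surface forms that make up the word form."""
    return [morph for morph, _ in ANALYSES[word_form]]


def morphemes(word_form: str) -> List[str]:
    """Morphemes that the surface forms realize."""
    return [morpheme for _, morpheme in ANALYSES[word_form]]


print(morphs("oxen"), morphemes("oxen"))
```

All three plural forms end in a different morph but share the same final morpheme, “-s”.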
2.18
morpheme
smallest unit of meaning expressed by a sequence of phonemes or a sequence of graphemes
[ISO 24613:2008]
NOTE There are two sub-types of morphemes: free morphemes and bound morphemes.
2.19
multiword expression
MWE
lexeme (2.14) made up of a sequence of lexemes that has properties that are not predictable from the
properties of the individual lexemes or their normal mode of combination
[ISO 24613:2008]
NOTE A multiword expression can be a compound (a word compound or phrasal compound), an idiom, a fragment of
a sentence or a sentence (e.g. a proverb or familiar quotation). It is not always possible to specify the part of speech for
the whole MWE span.
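One common way to treat a multiword expression as a single unit is a lexicon lookup with greedy longest match. The sketch below is an illustration under stated assumptions, not a procedure prescribed by this document: the lexicon contents and the function name are hypothetical, the input is assumed to be already split on white space, and only exact (uninflected) matches are recognized.

```python
from typing import List, Set, Tuple

# Hypothetical MWE lexicon; each entry is a tuple of lower-cased tokens.
MWE_LEXICON: Set[Tuple[str, ...]] = {
    ("kick", "the", "bucket"),
    ("take", "care", "of"),
}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def group_mwes(tokens: List[str]) -> List[str]:
    """Join token runs found in the lexicon, trying the longest run first."""
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try windows of decreasing length, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No lexicon entry starts here: keep the single token.
            units.append(tokens[i])
            i += 1
    return units


print(group_mwes("He will kick the bucket".split()))
```

A lookup of this kind cannot decide whether a matched sequence is meant idiomatically or literally; that decision needs context.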
2.20
phrasal compound
word (2.23) consisting of two or more lexemes (2.14), the meaning of which is predictable from its constituent
elements
EXAMPLE “Apple pie” in English is a phrasal compound composed of two lexemes, “apple” and “pie”, whose
meanings are preserved in the meaning of the compound.
NOTE 1 Idioms consist of two or more lexical items, but do not constitute phrasal compounds.
NOTE 2 A phrasal compound might be regarded as a phrase by some linguists. In practice, however, there is not
always a clear distinction between a word compound and a phrasal compound, or between a phrasal compound and a
phrase, due to the fuzziness of semantic predictability and the degree of lexicalization. Lexico-statistics, word frequency
in particular, will play an important role in this respect.
2.21
reduplication
process in which the entire word (2.23), or part of it, is repeated
2.22
stem
linguistic unit whose form is smaller than or equal to the form of a single lexeme (2.14) and that may be
affected by an inflectional, agglutinative, compositional or derivational process
[ISO 24613:2008]
2.23
word
lexeme (2.14) that has, as a minimal property, a part of speech
[ISO 24613:2008]
2.24
word form
morphosyntactical variant of a given word (2.23)
[ISO 1087-2:2000]
EXAMPLE In English, the strings “find”, “finds”, “found” and “finding” are word forms of the word “find”.
2.25
word segmentation
process of splitting text into a sequence of word segmentation units (2.26)
2.26
word segmentation unit
WSU
word form (2.24) or character string of some other type that is treated as a unit
NOTE A character string that is not a word form may consist of numeric characters, foreign characters, punctuation
marks or some other miscellaneous characters such as Chinese radicals, chemical symbols, such as H₂O, or a mixture of
Latin and numeric characters, such as F16.
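The following minimal sketch shows how a segmenter might keep non-word character strings of the kinds listed in the NOTE together as single WSUs. The regular expression, the function name and the sample sentence are assumptions made for illustration; the chemical formula is written in plain ASCII as "H2O", and runs of letters merely stand in for word forms.

```python
import re
from typing import List

# Alternatives are tried left to right, so mixed strings win over plain words.
WSU_PATTERN = re.compile(
    r"[A-Za-z]+\d+[A-Za-z0-9]*"  # mixed Latin/numeric strings, e.g. F16, H2O
    r"|\d+(?:\.\d+)?"            # numeric strings, optionally with a decimal part
    r"|[^\W\d_]+"                # runs of letters (stand-in for word forms)
    r"|[^\w\s]"                  # a single punctuation mark
)


def segment(text: str) -> List[str]:
    """Split text into a flat sequence of candidate WSUs."""
    return WSU_PATTERN.findall(text)


print(segment("The F16 burned 3.5 tons; H2O vapour followed."))
```

A pattern of this kind only works for scripts that separate words with spaces; it says nothing about text written without such separators.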
2.27
word structure
internal structure of a word (2.23) resulting from the morphological analysis
NOTE In agglutinative languages, such as Korean, Japanese and Turkish, a word may consist of a sequence of
morphemes, with a comparatively high morpheme-per-word ratio, where each affix involved (both derivational and
inflectional) typically expresses a particular grammatical meaning in a clear, one-to-one w
...