Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles

ISO 24614-1:2010 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU). The many applications and fields that need to segment texts into words — and thus to which ISO 24614-1:2010 can be applied — include translation, content management, speech technologies, computational linguistics and lexicography.

Gestion des ressources langagières — Segmentation des mots dans les textes écrits — Partie 1: Notions fondamentales et principes généraux

Upravljanje z jezikovnimi viri - Segmentacija v besede v pisnih besedilih - 1. del: Osnovni pojmi in splošna načela

This part of ISO 24614 presents the basic concepts and general principles of word segmentation and provides language-independent guidelines for the reliable and reproducible segmentation of written texts into word segmentation units.

General Information

Status: Published
Publication Date: 24-Oct-2010

ICS: 01.140.10 - Writing and transliteration

Technical Committee: ISO/TC 37/SC 4 - Language resource management
Drafting Committee: ISO/TC 37/SC 4/WG 6 - Linguistic annotation

Current Stage: 9093 - International Standard confirmed
Start Date: 08-May-2022
Completion Date: 12-Feb-2026

Ref Project: SIST ISO 24614-1:2013 - Language resource management -- Word segmentation of written texts -- Part 1: Basic concepts and general principles

Overview - ISO 24614-1:2010 (Word segmentation, basic concepts)

ISO 24614-1:2010 is an international standard in language resource management that defines the basic concepts and general principles for word segmentation of written texts. It provides language‑independent guidelines to segment text into reproducible units called word segmentation units (WSU). The standard is intended to make segmentation consistent across languages and applications so that processes like counting, lookup and automated processing are reliable and comparable.

Key topics and technical coverage

Scope and objectives: Establishes a universal framework for dividing written text into WSUs for a wide range of languages and scripts.
Terms and definitions: Precise definitions for core linguistic concepts used in segmentation - e.g. morpheme, lexeme, lemma, stem, word form, compound, multiword expression (MWE), WSU.
Basic framework: Conceptual relationships among morphemes, lexemes and WSUs; treatment of compounding, agglutination, and lexicalization.
General principles: Language‑independent rules and considerations (spaces, punctuation, abbreviations, numerics, multiword units, idioms) to achieve reliable and reproducible segmentation.
Annex A (informative): Guidance on representing word segmentation in XML for interoperability and tooling.
Standardization context: Prepared by ISO/TC 37 (Terminology and other language and content resources), Subcommittee SC 4 (Language resource management); Part 2 addresses Chinese, Japanese and Korean (CJK), and a future Part 3 is to cover other languages.

Practical applications

ISO 24614-1 is applicable wherever accurate word boundaries matter:

Translation & localization: consistent word counts, translation memory and CAT tool segmentation.
Natural Language Processing (NLP): preprocessing for morphosyntactic analyzers, parsers, tokenizers, spellcheckers, text classification and corpus annotation.
Speech technologies: lexicon lookup, stress assignment and prosodic pattern assignment in text-to-speech (TTS) systems, all of which depend on consistent lexical units.
Content management & search: indexing and search require clear word boundaries for matching and retrieval.
Lexicography & terminology management: consistent corpus counts and lexicon construction.

Who should use this standard

NLP engineers and data scientists building tokenizers and text pipelines
CAT tool, TMS and CMS developers and integrators
Speech technology vendors (TTS, ASR)
Corpus linguists, lexicographers and terminology managers
Project managers needing reproducible word counts for costing and QA

Related standards

ISO 24614-2 - Word segmentation for Chinese, Japanese and Korean (CJK)
Future ISO 24614‑3 - planned for additional language‑specific rules

Using ISO 24614-1 helps ensure consistent, interoperable word segmentation across tools and languages, improving accuracy in NLP workflows, translation workflows, search and lexicon management.

Standard

SIST ISO 24614-1:2013: English language, 20 pages
ISO 24614-1:2010 - Language resource management -- Word segmentation of written texts: English language, 15 pages

Frequently Asked Questions

What is ISO 24614-1:2010?

ISO 24614-1:2010 is a standard published by the International Organization for Standardization (ISO). Its full title is "Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles".

What is the scope of ISO 24614-1:2010?

ISO 24614-1:2010 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU). The many applications and fields that need to segment texts into words, and thus to which ISO 24614-1:2010 can be applied, include translation, content management, speech technologies, computational linguistics and lexicography.

What ICS categories does ISO 24614-1:2010 belong to?

ISO 24614-1:2010 is classified under the following ICS (International Classification for Standards) categories: 01.140.10 - Writing and transliteration. The ICS classification helps identify the subject area and facilitates finding related standards.

Standards Content (Sample)

SLOVENSKI STANDARD (Slovenian Standard)
1 July 2013
Upravljanje z jezikovnimi viri - Segmentacija v besede v pisnih besedilih - 1. del: Osnovni pojmi in splošna načela
Language resource management -- Word segmentation of written texts -- Part 1: Basic concepts and general principles
Gestion des ressources langagières -- Segmentation des mots dans les textes écrits -- Partie 1: Notions fondamentales et principes généraux
This Slovenian standard is identical to: ISO 24614-1:2010
ICS:
01.020 Terminology (principles and coordination)
01.140.10 Writing and transliteration
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of all or part of this standard is not permitted.

INTERNATIONAL STANDARD ISO 24614-1
First edition
2010-11-01
Language resource management — Word segmentation of written texts —
Part 1: Basic concepts and general principles
Gestion des ressources langagières — Segmentation des mots dans les textes écrits —
Partie 1: Notions fondamentales et principes généraux

© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Basic framework for word segmentation
4 General principles of word segmentation
Annex A (informative) Representing word segmentation in XML
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
ISO 24614 consists of the following parts, under the general title Language resource management — Word
segmentation of written texts:
⎯ Part 1: Basic concepts and general principles
⎯ Part 2: Word segmentation for Chinese, Japanese and Korean
Word segmentation for other languages is to form the subject of a future Part 3.

Introduction
Word segmentation is the division of text into linguistic units that carry meaning. For example, “the white
house” can be divided into three meaningful units, “the”, “white” and “house”, when it refers to a house that is
white, whereas “the White House” corresponds to only one meaningful unit when it refers to the residence of
the US President.
For the purposes of ISO 24614, such meaningful linguistic units are called word segmentation units (WSU).
As demonstrated in the previous example, a WSU can comprise more than one word. A WSU can
consist of a stem and affixes (e.g. “re+work+ing”). It can be a compound word (e.g. “blackboard”), a proper
noun (e.g. “Cape Town”), an idiom (e.g. “It's raining cats and dogs”), or a multiword expression (e.g. “take
care of”). For languages that have spaces between words, such as English, segmenting a text into WSU is
facilitated by using the spaces as a basis for establishing the boundaries of a WSU, although additional
considerations need to be taken into account for handling abbreviations, punctuation and multiword units of
meaning, among others. For languages that do not have spaces between words, such as Chinese and
Japanese, or for languages that have spaces partially between words, such as Thai and Korean, segmenting
a text into WSU requires a different approach.
Furthermore, word segmentation is complex for languages that are characterized by extensive compounding,
such as Chinese, and for languages that are characterized by extensive agglutination, such as Japanese,
Korean and Hungarian. On the other hand, the fact that Japanese supports multiple scripts is beneficial for
word segmentation.
However, white space alone is not sufficient to segment a text. “Apple pie,” for example, is understood as a
kind of pie made of apples, so “apple” and “pie” are treated as two distinct WSUs. Alternatively, it can be
viewed as a single entity due to its collocational and idiomatic properties, and treated as a single WSU.
Segmentation rules can differ between languages, even when applied to equivalent expressions (as
discussed in ISO 24614-2).
Elaborating standards for the rules and methods for word segmentation can facilitate innovation and
development in areas such as language learning and translation. It could improve language-related
technologies, including spell checking, grammar checking, dictionary lookup, terminology management,
translation memory, information retrieval, information extraction and machine translation. For instance, by
failing to identify “kick the bucket” as a single WSU, translation memory and machine translation technologies
would produce a literal rather than idiomatic translation.
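The effect of treating “kick the bucket” as a single WSU can be illustrated with a minimal sketch. It assumes a small, hypothetical lexicon of multiword units (MWE_LEXICON) and uses greedy longest-match merging over white-space tokens, which is one common technique and not a method prescribed by ISO 24614. The function name segment_with_mwes is likewise illustrative.

```python
# Hypothetical mini-lexicon of multiword WSUs (lower-cased token tuples).
# A real system would load a much larger, language-specific resource.
MWE_LEXICON = {
    ("kick", "the", "bucket"),
    ("take", "care", "of"),
    ("cape", "town"),
}
MAX_MWE_LEN = max(len(entry) for entry in MWE_LEXICON)


def segment_with_mwes(text):
    """Split on white space, then greedily merge the longest lexicon match."""
    tokens = text.split()
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, shrinking to a single token.
        for span in range(min(MAX_MWE_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + span])
            if span == 1 or candidate in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + span]))
                i += span
                break
    return units


print(segment_with_mwes("He will kick the bucket in Cape Town"))
# ['He', 'will', 'kick the bucket', 'in', 'Cape Town']
```

A lexicon lookup of this kind cannot by itself tell “the white house” from “the White House”, because that decision depends on context and capitalization, which this sketch ignores.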
This part of ISO 24614 is the first in a series of International Standards targeted at word segmentation in
written languages. It focuses on the basic concepts and general principles of word segmentation that apply to
languages in general. The subsequent parts will, however, focus on the issues specific to particular languages.

INTERNATIONAL STANDARD ISO 24614-1:2010(E)

1 Scope
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and
provides language-independent guidelines to enable written texts to be segmented, in a reliable and
reproducible manner, into word segmentation units (WSU).
NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical
to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot
simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as
hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word
segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and
Japanese, and for agglutinative languages, where some functional word classes are realized as affixes, such as Korean.
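The cases listed in NOTE 1 can be made concrete with a small sketch for English. It keeps known abbreviations, hyphenated compounds and numeric expressions together instead of cutting at every punctuation mark. The abbreviation list, the regular expression and the function name segment are assumptions made for illustration only, and idioms are not handled.

```python
import re

# Hypothetical abbreviation list; a real one is language- and domain-specific.
ABBREVIATIONS = {"Dr.", "e.g.", "etc."}

TOKEN_PATTERN = re.compile(
    r"""
    [$€£]?\d+(?:[.,]\d+)*      # numbers, optionally led by a currency symbol
    | \w+(?:-\w+)*             # words, including hyphenated compounds
    | [^\w\s]                  # any other single punctuation mark
    """,
    re.VERBOSE,
)


def segment(text):
    """Segment space-delimited text while protecting a few special cases."""
    units = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            units.append(chunk)  # keep the trailing period with the abbreviation
        else:
            units.extend(TOKEN_PATTERN.findall(chunk))
    return units


print(segment("Dr. Smith paid $3.50 for a state-of-the-art filter."))
# ['Dr.', 'Smith', 'paid', '$3.50', 'for', 'a', 'state-of-the-art', 'filter', '.']
```

A rule that splits on every space and punctuation mark would instead break “$3.50” and “state-of-the-art” into several pieces and detach the period from “Dr.”.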
The many applications and fields that need to segment texts into words — and thus to which this part of
ISO 24614 can be applied — include the following.
Translation
Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard
function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is
performed by term extraction tools, which are sometimes provided in terminology management systems and
CAT tools.
Content management
Most content management systems and databases allow for searching by individual words. The content being
searched has to be segmented to permit matching with a search word. Furthermore, search functions require
knowledge of the boundaries of words.
Speech technologies
Text-to-speech systems generate speech based on words and therefore require word segmentation for
lexicon lookup, stress assignment, prosodic pattern assignment, etc.
Computational linguistics
Various natural language processing (NLP) systems must segment text into words in order to carry out their
functions. NLP systems include
⎯ morphosyntactic processors,
⎯ syntactic parsers,
⎯ spellcheckers,
⎯ text classification systems, and
⎯ corpus linguistics annotators.
Lexicography
Lexical resources are often evaluated by size, usually by referring to the number of words.
NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of
language resources is typically achieved by counting the words. However, because NLP applications use different
segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A
reliable, reproducible, standard measure would allow comparable results. This is not to say that applications may not use
their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text
into smaller or larger units compared to another application.
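The point of NOTE 2 can be shown with a short sketch in which two plausible counting methods give different totals for the same sentence. Both functions and the sample text are illustrative assumptions; neither is the measure defined by this part of ISO 24614.

```python
import re


def count_by_whitespace(text):
    """Count every white-space-delimited chunk as one word."""
    return len(text.split())


def count_by_alnum_runs(text):
    """Count every run of ASCII letters or digits as one word."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


text = "The state-of-the-art e-mail filter costs $3.50."
print(count_by_whitespace(text))   # 6
print(count_by_alnum_runs(text))   # 11
```

The totals differ by almost a factor of two, which is why word counts produced by different tools are not directly comparable without an agreed segmentation method.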
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
abbreviation
verbal designation formed by omitting words or letters from a longer form and designating the same concept
[ISO 1087-1:2000]
2.2
affix
bound morpheme (2.5) which may be added to a stem (2.22) or a lexeme (2.14)
NOTE Affixes can be classified into several sub-types such as prefix, suffix, infix and circumfix. Affixes can be
derivational or they can be inflectional or agglutinative.
2.3
agglutination
process of concatenating one or more affixes (2.2) to a stem (2.22)
[ISO 24613:2008]
2.4
borrowing
process of word formation in which a linguistic expression is adopted from another language, usually when no
term exists for the new object or concept
2.5
bound morpheme
morpheme (2.18) that appears only together with one or several other morphemes
[ISO 24613:2008]
EXAMPLE 1 Chinese: 伟 means “great,” but cannot stand by itself as a word in text. Instead, it is used as a constituent
element of many words, such as 伟大 (“great”), 伟人 (“giant”), and 雄伟 (“majesty”).
EXAMPLE 2 Korean: the suffix “-e”, which is equivalent to the English preposition “to” — as in “hakkyo-e” (to school)
— is a bound morpheme.
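EXAMPLE 1 suggests that, in unspaced text, a segmenter should prefer lexicon entries such as 伟大 over the lone character 伟. The sketch below uses forward maximum matching, a common dictionary-based technique that is named here as an illustration and is not taken from ISO 24614. The lexicon contains the three words from EXAMPLE 1 plus two assumed entries (的 and 建筑), and the function name is hypothetical.

```python
# Hypothetical lexicon: the words from EXAMPLE 1 plus two assumed entries.
LEXICON = {"伟大", "伟人", "雄伟", "的", "建筑"}
MAX_LEN = max(len(word) for word in LEXICON)


def forward_max_match(text):
    """Scan left to right, always taking the longest lexicon entry available."""
    units = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no entry matches.
            if size == 1 or piece in LEXICON:
                units.append(piece)
                i += size
                break
    return units


print(forward_max_match("雄伟的建筑"))
# ['雄伟', '的', '建筑']
```

Because 伟 is not a lexicon entry on its own, it only surfaces as a unit when no longer match exists, which reflects its status as a bound morpheme.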
2.6
compound
word (2.23) built from two or more lexemes (2.14)
NOTE 1 Adapted from ISO 24613:2008, definition 3.10.
NOTE 2 A compound may be endocentric if it has a head (i.e. the fundamental part that contains the basic meaning of
the whole compound) and modifiers (which restrict this meaning), or exocentric if it does not have a head. A compound
can be long. There are two main sub-types of compound according to their degree of lexicalization: word compound and
phrasal compound.
2.7
compounding
word formation in which a new word is formed by adjoining at least two lexemes (2.14), in their original forms
or with slight transformations
[ISO 24613:2008]
2.8
derivation
change in the form of a word (2.23) to create a new word (2.23), usually by modifying the stem (2.22) or by
affixation
[ISO 24613:2008]
2.9
free morpheme
morpheme (2.18) that can be used as a word (2.23) by itself
EXAMPLE Given the word “goodness,” “good” is a free morpheme, whereas “-ness” is not. The latter is a bound
morpheme.
2.10
homograph
each of two or more word forms (2.24) or words (2.23) with identical spelling but representing different
concepts (semantic homography) or syntactic functions (syntactic homography)
[ISO 1087-2:2000]
2.11
inflection
process in which a word form (2.24) is made up by adding an affix (2.2) to a stem (2.22)
NOTE Inflection is a grammatical rather than lexical process.
2.12
lemma
conventional form chosen to represent a lexeme (2.14)
[ISO 24613:2008]
EXAMPLE Given a set of word forms such as “find,” “finds,” “found,” and “finding” in English, the form “find” is
chosen as a lemma to represent the group of all these word forms.
2.13
lemmatization
process of determining the lemma (2.12) for a given word form (2.24) in a context
EXAMPLE Given the word “found” in English, lemmatization results in “find” as its lemma.
NOTE Adapted from ISO 1087-2:2000, definition 2.19 and ISO 30042:2008, definition 3.14.
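As a minimal sketch of the examples in 2.12 and 2.13, lemmatization can be approximated by a lookup table from word forms to lemmas. The table and the function name lemmatize are hypothetical and cover only the forms quoted in the examples.

```python
# Hypothetical word-form-to-lemma table limited to the forms in the examples.
LEMMA_TABLE = {
    "find": "find",
    "finds": "find",
    "found": "find",
    "finding": "find",
}


def lemmatize(word_form):
    """Return the lemma for a known word form, or the form itself if unknown."""
    return LEMMA_TABLE.get(word_form.lower(), word_form)


print(lemmatize("found"))   # find
print(lemmatize("house"))   # house
```

The definition says “in a context” for a reason: “found” is also a word form of the verb “to found”, so a context-free table like this one cannot resolve such homographs.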
2.14
lexeme
abstract unit generally associated with a set of forms sharing a common meaning
[ISO 24613:2008]
NOTE 1 A lexeme may be a part of another lexeme, as a consequence of derivation and compounding.
NOTE 2 “Form” is defined in ISO 24613 as “sequence of morphs”.
2.15
lexicalization
process of making a linguistic unit function as a word
NOTE Such a linguistic unit can be a single morph, e.g. “laugh,” a sequence of morphs, e.g. “apple pie” or even a
phrase, such as “kick the bucket”, that forms an idiomatic phrase.
2.16
lexicon
list of entries mainly headed by lemmas (2.12) with associated information
2.17
morph
surface form represented by a unique morpheme (2.18)
EXAMPLE In English, the morphs of the plural morpheme “-s” include “-s”, “-en”, and “-NULL” (as in “boys”, “oxen”,
and “sheep”), where “-NULL” has no unique surface form. Thus, the word “boys” consists of the two morphs, “boy” and
“-s”, whereas the morphemes corresponding to the morphs “ox” and “-en” are “ox” and “-s”, respectively.
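The relation between morphs and morphemes in this example can be written down as two small mappings. The data structures and the function name morphemes are illustrative assumptions; the entries come from the EXAMPLE above.

```python
# Morph sequence for each word form, following the EXAMPLE.
MORPHS = {
    "boys": ("boy", "-s"),
    "oxen": ("ox", "-en"),
    "sheep": ("sheep", "-NULL"),
}

# Each plural morph realizes the same plural morpheme, written "-s".
MORPH_TO_MORPHEME = {"-s": "-s", "-en": "-s", "-NULL": "-s"}


def morphemes(word_form):
    """Map each morph of a word form to the morpheme it realizes."""
    return tuple(MORPH_TO_MORPHEME.get(m, m) for m in MORPHS[word_form])


print(morphemes("oxen"))   # ('ox', '-s')
```

Stems such as “ox” are not in the plural table, so they map to themselves.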
2.18
morpheme
smallest unit of meaning expressed by a sequence of phonemes or a sequence of graphemes
[ISO 24613:2008]
NOTE There are two sub-types of morphemes: free morphemes and bound morphemes.
2.19
multiword expression
MWE
lexeme (2.14) made up of a sequence of lexemes that has properties that are not predictable from the
properties of the individual lexemes or their normal mode of combination
[ISO 24613:2008]
NOTE A multiword expression can be a compound (a word compound or phrasal compound), an idiom, a fragment of
a sentence or a sentence (e.g. a proverb or familiar quotation). It is not always possible to specify the part of speech for
the whole MWE span.
2.20
phrasal compound
word (2.23) consisting of two or more lexemes (2.14), the meaning of which is predictable from its constituent
elements
EXAMPLE “Apple pie” in English is a phrasal compound composed of two lexemes, “apple” and “pie”, whose
meanings are preserved in the meaning of the compound.
NOTE 1 Idioms use two or more lexical items, but do not compose a phrasal compound.
NOTE 2 A phrasal compound might be thought of as a phrase by some linguists. In practice, however, there is not
always a clear distinction between a word compound and a phrasal compound, or between a phrasal compound and a
phrase, due to the fuzziness of semantic predictability and the degree of lexicalization. Lexico-statistics — word frequency
in particular — will play an important role in this respect.
2.21
reduplication
process in which the entire word (2.23), or part of it, is repeated
2.22
stem
linguistic unit whose form is smaller than or equal to the form of a single lexeme (2.14) and that may be
affected by an inflectional, agglutinative, compositional or derivational process
[ISO 24613:2008]
2.23
word
lexeme (2.14) that has, as a minimal property, a part of speech
[ISO 24613:2008]
2.24
word form
morphosyntactical variant of a given word (2.23)
[ISO 1087-2:2000]
EXAMPLE In English, the strings “find”, “finds”, “found” and “finding” are word forms of the word “find”.
2.25
word segmentation
process of splitting text into a sequence of word segmentation units (2.26)
2.26
word segmentation unit
WSU
word form (2.24) or character string of some other type that is treated as a unit
NOTE A character string that is not a word
...

INTERNATIONAL ISO
STANDARD 24614-1
First edition
2010-11-01
Language resource management — Word
segmentation of written texts —
Part 1:
Basic concepts and general principles
Gestion des ressources langagières — Segmentation des mots dans
les textes écrits —
Partie 1: Notions fondamentales et principes généraux

Reference number
©
ISO 2010
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2010 – All rights reserved

Contents Page
Foreword .iv
Introduction.v
1 Scope.1
2 Terms and definitions .2
3 Basic framework for word segmentation.6
4 General principles of word segmentation.10
Annex A (informative) Representing word segmentation in XML.13
Bibliography.14

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
ISO 24614 consists of the following parts, under the general title Language resource management — Word
segmentation of written texts:
⎯ Part 1: Basic concepts and general principles
⎯ Part 2: Word segmentation for Chinese, Japanese and Korean
Word segmentation for other languages is to form the subject of a future Part 3.
iv © ISO 2010 – All rights reserved

Introduction
Word segmentation is the dividing of text into linguistic units that carry meaning. For example, “the white
house” can be divided into three meaningful units, “the,” “white,” and “house”, when it refers to a house that is
white; whereas “the White House” corresponds to only one meaningful unit when it refers to the residence of
the US President.
For the purposes of ISO 24614, such meaningful linguistic units are called word segmentation units (WSU).
As demonstrated in the previous example, a WSU can be comprised of more than one word. A WSU can
consist of a stem and affixes (e.g. “re+work+ing”). It can be a compound word (e.g. “blackboard”), a proper
noun (e.g. “Cape Town”), an idiom (e.g. “It's raining cats and dogs”), or a multiword expression (e.g. “take
care of”). For languages that have spaces between words, such as English, segmenting a text into WSU is
facilitated by using the spaces as a basis for establishing the boundaries of a WSU, although additional
considerations need to be taken into account for handling abbreviations, punctuation and multiword units of
meaning, among others. For languages that do not have spaces between words, such as Chinese and
Japanese, or for languages that have spaces partially between words, such as Thai and Korean, segmenting
a text into WSU requires a different approach.
Furthermore, word segmentation is complex for languages that are characterized by extensive compounding,
such as Chinese, and for languages that are characterized by extensive agglutination, such as Japanese,
Korean and Hungarian. On the other hand, the fact that Japanese supports multiple scripts is beneficial for
word segmentation.
However, white space alone is not sufficient to segment a text. “Apple pie,” for example, is understood as a
kind of pie made of apples, so “apple” and “pie” are treated as two distinct WSUs. Alternatively, it can be
viewed as a single entity due to its collocational and idiomatic properties, and treated as a single WSU.
Segmentation rules can differ between languages, even when applied to equivalent expressions (as
discussed in ISO 24614-2).
Elaborating standards for the rules and methods for word segmentation can facilitate innovation and
development in areas such as language learning and translation. It could improve language-related
technologies, including spell checking, grammar checking, dictionary lookup, terminology management,
translation memory, information retrieval, information extraction and machine translation. For instance, by
failing to identify “kick the bucket” as a single WSU, translation memory and machine translation technologies
would produce a literal rather than idiomatic translation.
This part of ISO 24614 is the first in a series of International Standards targeted at word segmentation in
written languages. It focuses on the basic concepts and general principles of word segmentation that apply to
languages in general. The subsequent parts will, however, focus on the issues specific to particular languages.

INTERNATIONAL STANDARD ISO 24614-1:2010(E)

Language resource management — Word segmentation of
written texts —
Part 1:
Basic concepts and general principles
1 Scope
This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and
provides language-independent guidelines to enable written texts to be segmented, in a reliable and
reproducible manner, into word segmentation units (WSU).
NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical
to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot
simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as
hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word
segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and
Japanese, and for agglutinative languages, where some functional word classes are realized as affixes, such as Korean.
The many applications and fields that need to segment texts into words — and thus to which this part of
ISO 24614 can be applied — include the following.
Translation
Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard
function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is
performed by term extraction tools, which are sometimes provided in terminology management systems and
CAT tools.
Content management
Most content management systems and databases allow for searching by individual words. The content being
searched has to be segmented to permit matching with a search word. Furthermore, search functions require
knowledge of the boundaries of words.
Speech technologies
Text-to-speech systems generate speech based on words and therefore require word segmentation for
lexicon lookup, stress assignment, prosodic pattern assignment, etc.
Computational linguistics
Various natural language processing (NLP) systems must segment text into words in order to carry out their
functions. NLP systems include
⎯ morphosyntactic processors,
⎯ syntactic parsers,
⎯ spellcheckers,
⎯ text classification systems, and
⎯ corpus linguistics annotators.
Lexicography
Lexical resources are often evaluated by size, usually by referring to the number of words.
NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of
language resources is typically achieved by counting the words. However, because NLP applications use different
segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A
reliable, reproducible, standard measure would allow comparable results. This does not prevent applications from using
their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text
into smaller or larger units than another application does.
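The effect described in NOTE 2 can be reproduced with a small Python sketch. Both counting functions and the sample sentence are hypothetical and stand in for two applications that use different segmentation methods; neither is a method defined by this part of ISO 24614.

```python
import re

def count_by_whitespace(text):
    """Count units separated by whitespace."""
    return len(text.split())

def count_by_alnum_runs(text):
    """Count runs of ASCII letters or digits; hyphens and punctuation split units."""
    return len(re.findall(r"[A-Za-z0-9]+", text))

sample = "A state-of-the-art text-to-speech system costs $1,500."

print(count_by_whitespace(sample))  # 6
print(count_by_alnum_runs(sample))  # 12
```

The same text yields two different sums, which is why a shared definition of the unit being counted is needed before sizes can be compared.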
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
abbreviation
verbal designation formed by omitting words or letters from a longer form and designating the same concept
[ISO 1087-1:2000]
2.2
affix
bound morpheme (2.5) which may be added to a stem (2.22) or a lexeme (2.14)
NOTE Affixes can be classified into several sub-types such as prefix, suffix, infix and circumfix. Affixes can be
derivational or they can be inflectional or agglutinative.
2.3
agglutination
process of concatenating one or more affixes (2.2) to a stem (2.22)
[ISO 24613:2008]
2.4
borrowing
process of word formation in which a linguistic expression is adopted from another language, usually when no
term exists for the new object or concept
2.5
bound morpheme
morpheme (2.18) that appears only together with one or several other morphemes
[ISO 24613:2008]
EXAMPLE 1 Chinese: 伟 means “great,” but cannot stand by itself as a word in text. Instead, it is used as a constituent
element of many words, such as 伟大 (“great”), 伟人 (“giant”), and 雄伟 (“majesty”).
EXAMPLE 2 Korean: the suffix “-e”, which is equivalent to the English preposition “to” — as in “hakkyo-e” (to school)
— is a bound morpheme.
2.6
compound
word (2.23) built from two or more lexemes (2.14)
NOTE 1 Adapted from ISO 24613:2008, definition 3.10.
NOTE 2 A compound may be endocentric if it has a head (i.e. the fundamental part that contains the basic meaning of
the whole compound) and modifiers (which restrict this meaning), or exocentric if it does not have a head. A compound
can be long. There are two main sub-types of compound according to their degree of lexicalization: word compound and
phrasal compound.
2.7
compounding
word formation in which a new word is formed by adjoining at least two lexemes (2.14), in their original forms
or with slight transformations
[ISO 24613:2008]
2.8
derivation
change in the form of a word (2.23) to create a new word (2.23), usually by modifying the stem (2.22) or by
affixation
[ISO 24613:2008]
2.9
free morpheme
morpheme (2.18) that can be used as a word (2.23) by itself
EXAMPLE Given the word “goodness,” “good” is a free morpheme, whereas “-ness” is not. The latter is a bound
morpheme.
2.10
homograph
each of two or more word forms (2.24) or words (2.23) with identical spelling but representing different
concepts (semantic homography) or syntactic functions (syntactic homography)
[ISO 1087-2:2000]
2.11
inflection
process in which a word form (2.24) is made up by adding an affix (2.2) to a stem (2.22)
NOTE Inflection is a grammatical rather than lexical process.
2.12
lemma
conventional form chosen to represent a lexeme (2.14)
[ISO 24613:2008]
EXAMPLE Given a set of word forms such as “find,” “finds,” “found,” and “finding” in English, the form “find” is
chosen as a lemma to represent the group of all these word forms.
2.13
lemmatization
process of determining the lemma (2.12) for a given word form (2.24) in a context
EXAMPLE Given the word “found” in English, lemmatization results in “find” as its lemma.
NOTE Adapted from ISO 1087-2:2000, definition 2.19 and ISO 30042:2008, definition 3.14.
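A minimal Python sketch of lemmatization by table lookup, using the word forms from the example in 2.12. The table LEMMA_TABLE and the function lemmatize are assumptions made for illustration. The sketch ignores context, whereas the definition above requires it: “found” can also be a form of the verb “to found”, which a lookup table alone cannot resolve.

```python
# Hypothetical lookup table built from the example in 2.12.
LEMMA_TABLE = {
    "find": "find",
    "finds": "find",
    "found": "find",
    "finding": "find",
}

def lemmatize(word_form):
    """Return the lemma for a word form, or the form itself if it is unknown."""
    return LEMMA_TABLE.get(word_form.lower(), word_form)

print(lemmatize("found"))   # find
print(lemmatize("bucket"))  # bucket (not in the table, returned unchanged)
```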
2.14
lexeme
abstract unit generally associated with a set of forms sharing a common meaning
[ISO 24613:2008]
NOTE 1 A lexeme may be a part of another lexeme, as a consequence of derivation and compounding.
NOTE 2 “Form” is defined in ISO 24613 as “sequence of morphs”.
2.15
lexicalization
process of making a linguistic unit function as a word
NOTE Such a linguistic unit can be a single morph, e.g. “laugh,” a sequence of morphs, e.g. “apple pie” or even a
phrase, such as “kick the bucket”, that forms an idiomatic phrase.
2.16
lexicon
list of entries mainly headed by lemmas (2.12) with associated information
2.17
morph
surface form represented by a unique morpheme (2.18)
EXAMPLE In English, the morphs of the plural morpheme “-s” include “-s”, “-en”, and “-NULL” (as in “boys”, “oxen”,
and “sheep”), where “-NULL” has no unique surface form. Thus, the word “boys” consists of the two morphs, “boy” and
“-s”, whereas the morphemes corresponding to the morphs “ox” and “-en” are “ox” and “-s”, respectively.
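The distinction between morph and morpheme in the example above can be pictured as a small data structure. The Python sketch below is illustrative only: the type Morph, the table ANALYSES and both helper functions are assumptions, and “-NULL” is represented by an empty surface string.

```python
from typing import NamedTuple

class Morph(NamedTuple):
    surface: str   # the form as written, e.g. "-en"
    morpheme: str  # the morpheme that the morph realizes, e.g. "-s" (plural)

# Hand-written analyses of the three plural forms from the example.
ANALYSES = {
    "boys": [Morph("boy", "boy"), Morph("-s", "-s")],
    "oxen": [Morph("ox", "ox"), Morph("-en", "-s")],
    "sheep": [Morph("sheep", "sheep"), Morph("", "-s")],  # "-NULL" morph
}

def morphemes(word_form):
    """List the morphemes realized in a word form."""
    return [m.morpheme for m in ANALYSES[word_form]]

def surface_form(word_form):
    """Rebuild the written form by concatenating the morph surfaces."""
    return "".join(m.surface.lstrip("-") for m in ANALYSES[word_form])

print(morphemes("oxen"))      # ['ox', '-s']
print(surface_form("sheep"))  # sheep
```

All three word forms carry the same plural morpheme, while the morphs that realize it differ.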
2.18
morpheme
smallest unit of meaning expressed by a sequence of phonemes or a sequence of graphemes
[ISO 24613:2008]
NOTE There are two sub-types of morphemes: free morphemes and bound morphemes.
2.19
multiword expression
MWE
lexeme (2.14) made up of a sequence of lexemes that has properties that are not predictable from the
properties of the individual lexemes or their normal mode of combination
[ISO 24613:2008]
NOTE A multiword expression can be a compound (a word compound or phrasal compound), an idiom, a fragment of
a sentence or a sentence (e.g. a proverb or familiar quotation). It is not always possible to specify the part of speech for
the whole MWE span.
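One common way to keep a multiword expression together as a single unit is a longest-match lookup against a lexicon of known expressions. The Python sketch below is an assumption made for illustration, not a procedure prescribed by this part of ISO 24614; MWE_LEXICON and merge_mwes are hypothetical names, and the lexicon holds only the idiom used elsewhere in this document.

```python
# Hypothetical lexicon of multiword expressions, stored as lower-case tuples.
MWE_LEXICON = {("kick", "the", "bucket")}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)

def merge_mwes(tokens):
    """Greedily merge the longest lexicon match starting at each position."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to a length of two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword match: keep the single token as its own unit.
            units.append(tokens[i])
            i += 1
    return units

print(merge_mwes("He will kick the bucket".split()))
# ['He', 'will', 'kick the bucket']
```

Because the lookup compares exact forms, an inflected variant such as “kicked the bucket” is not merged; handling it would require lemmatization (2.13) before the lookup.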
2.20
phrasal compound
word (2.23) consisting of two or more lexemes (2.14), the meaning of which is predictable from its constituent
elements
EXAMPLE “Apple pie” in English is a phrasal compound composed of two lexemes, “apple” and “pie”, whose
meanings are preserved in the meaning of the compound.
NOTE 1 Idioms use two or more lexical items, but do not compose a phrasal compound.
NOTE 2 A phrasal compound might be regarded as a phrase by some linguists. In practice, however, there is not
always a clear distinction between a word compound and a phrasal compound, or between a phrasal compound and a
phrase, due to the fuzziness of semantic predictability and the degree of lexicalization. Lexico-statistics — word frequency
in particular — will play an important role in this respect.
2.21
reduplication
process in which the entire word (2.23), or part of it, is repeated
2.22
stem
linguistic unit whose form is smaller than or equal to the form of a single lexeme (2.14) and that may be
affected by an inflectional, agglutinative, compositional or derivational process
[ISO 24613:2008]
2.23
word
lexeme (2.14) that has, as a minimal property, a part of speech
[ISO 24613:2008]
2.24
word form
morphosyntactical variant of a given word (2.23)
[ISO 1087-2:2000]
EXAMPLE In English, the strings “find”, “finds”, “found” and “finding” are word forms of the word “find”.
2.25
word segmentation
process of splitting text into a sequence of word segmentation units (2.26)
2.26
word segmentation unit
WSU
word form (2.24) or character string of some other type that is treated as a unit
NOTE A character string that is not a word form may consist of numeric characters, foreign characters, punctuation
marks or some other miscellaneous characters such as Chinese radicals, chemical symbols, such as H₂O, or a mixture of
Latin and numeric characters, such as F16.
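The kinds of character string listed in the NOTE can be told apart with a simple pattern-based sketch. The Python code below is illustrative only and handles ASCII text; the category labels, the pattern and the function segment are assumptions, and the chemical symbol is written as plain “H2O” because the sketch does not handle subscripts.

```python
import re

# A unit is either a run of ASCII letters and digits or one punctuation character.
WSU_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\w\s]")

def classify(unit):
    """Assign a coarse, hypothetical category to a unit."""
    if unit.isalpha():
        return "alphabetic"
    if unit.isdigit():
        return "numeric"
    if unit.isalnum():
        return "mixed"  # letters and digits together, e.g. F16
    return "punctuation"

def segment(text):
    """Return (unit, category) pairs for an ASCII text."""
    return [(u, classify(u)) for u in WSU_PATTERN.findall(text)]

print(segment("F16 and H2O."))
# [('F16', 'mixed'), ('and', 'alphabetic'), ('H2O', 'mixed'), ('.', 'punctuation')]
```

Mixed strings such as “F16” stay whole here only because the pattern does not break at the boundary between letters and digits; a rule that did so would split them into two units.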
2.27
word structure
internal structure of a word (2.23) resulting from the morphological analysis
NOTE In agglutinative languages, such as Korean, Japanese and Turkish, a word may consist of a sequence of
morphemes, with a comparatively high morpheme-per-word ratio, where each affix involved (both derivational and
inflectional) typically expresses a particular grammatical meaning in a clear, one-to-one w
...
