Management of terminology resources — Terminology extraction

This document specifies methods for extracting candidate terms from text corpora and gives guidance on selecting relevant designations, definitions, concept relations and other terminology-related information.

Gestion des ressources terminologiques — Extraction de terminologie

Upravljanje terminoloških virov - Ekstrakcija terminologije

General Information

Status: Published
Publication Date: 02-Feb-2025
Current Stage: 6060 - International Standard published
Start Date: 03-Feb-2025
Due Date: 03-Feb-2025
Completion Date: 03-Feb-2025

Standard
ISO 5078:2025 - Management of terminology resources — Terminology extraction. Released: 03-Feb-2025
English language
23 pages
Draft
ISO/DIS 5078:2024
English language
29 pages

Standards Content (Sample)


International Standard
ISO 5078
First edition
2025-02
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
Reference number
© ISO 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and terminology extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting texts for a text corpus
4.3.3 Considerations for text corpus creation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Starting the terminology extraction workflow
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminological entries
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.

Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations), from
text corpora has become an increasingly important task carried out in a wide variety of different fields.
Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a range of specialists
including language professionals in general, and terminologists in particular, as well as ontology engineers,
and both information and data scientists. Terminology extraction also serves several purposes that go
beyond the compilation of glossaries or the population of terminology databases, including the identification
of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other fields
such as information retrieval, stands in stark contrast to the rarity of individual documents that provide
definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology management,
their output becomes even more relevant when it is assessed and validated, using both qualitative and
quantitative approaches and criteria for selecting entities such as relevant terms, definitions and concept
relations. This extracted and then validated terminological data supports the building of high-quality
terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— compilation of text corpora (general principles and types of text corpora);
— methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— tool characteristics.
By objectively specifying these aspects, this document provides a reference framework for improving the
performance of terminology extraction tools and optimizing the use of their output.
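
As a rough illustration of the statistical family of methods named in the list above, the following minimal sketch counts word n-grams in a small corpus, skips n-grams that begin or end with a stop word, and keeps those that reach a frequency threshold. It is not taken from this document: the function name `extract_candidate_terms`, the stop-word list and the parameters `max_len` and `min_freq` are illustrative assumptions, and the sample sentences were written for this example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "for", "is", "are", "from"}


def extract_candidate_terms(texts, max_len=3, min_freq=2):
    """Return (candidate, frequency) pairs, most frequent first, then alphabetical."""
    counts = Counter()
    for text in texts:
        # Lower-case word tokens; hyphenated words stay together.
        # Note: this simple tokenizer ignores sentence boundaries.
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that start or end with a stop word.
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    kept = [(term, freq) for term, freq in counts.items() if freq >= min_freq]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical two-sentence corpus, used only to exercise the function.
corpus = [
    "Terminology extraction identifies candidate terms in a text corpus.",
    "A text corpus is compiled before terminology extraction starts.",
]
candidates = extract_candidate_terms(corpus)
```

The result is only a list of candidate terms. In line with the paragraphs above, such output would still need to be filtered and assessed for term eligibility before any term is added to a terminology resource.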

International Standard ISO 5078:2025(en)
Management of terminology resources — Terminology
extraction
1 Scope
This document specifies methods for extracting candidate terms from text corpora and gives guidance on
selecting relevant designations, definitions, concept relations and other terminology-related information.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 704, Terminology work — Principles and methods
ISO 1087, Terminology work and terminology science — Vocabulary
ISO 16642, Management of terminology resources — Terminological markup framework
ISO 26162-1, Management of terminology resources — Terminology databases — Part 1: Design
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
annotation
process of adding metadata (3.10) to segments of language data
[SOURCE: ISO 24617-1:2012, 3.2, modified — “information” replaced by “metadata”; “or that information
itself” deleted.]
3.2
bitext
collection of texts (3.24) in two languages that can be considered translations of each other and that are
segmented and aligned
Note 1 to entry: Bitexts play a key role in training, evaluating and improving localization technologies, such as
translation memories, terminology management tools or machine translation engines.
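
One possible way to picture a bitext as defined in 3.2 is a list of aligned segment pairs in two languages. The sketch below is only an illustration under that assumption: the class names `AlignedSegment` and `Bitext` and the method `find` are hypothetical and are not defined by this document. The sample segments reuse the English and French titles of this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedSegment:
    """One source-language segment paired with its target-language counterpart."""
    source: str
    target: str


@dataclass
class Bitext:
    """Texts in two languages, already segmented and aligned (hypothetical model)."""
    source_lang: str
    target_lang: str
    segments: list[AlignedSegment]

    def find(self, source_term: str) -> list[AlignedSegment]:
        """Return aligned segments whose source side contains the given string."""
        needle = source_term.lower()
        return [seg for seg in self.segments if needle in seg.source.lower()]


bitext = Bitext(
    source_lang="en",
    target_lang="fr",
    segments=[
        AlignedSegment("Management of terminology resources",
                       "Gestion des ressources terminologiques"),
        AlignedSegment("Terminology extraction",
                       "Extraction de terminologie"),
    ],
)
hits = bitext.find("terminology extraction")
```

Because the segments are aligned, looking up a string on the source side also gives the target-language segment in which an equivalent can be sought.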

3.3
candidate term
term candidate
provisional term
string of characters (3.5) that has been collected by means of term extraction (3.20) but has not yet been
selected as a relevant term (3.19) to be considered for inclusion in a terminological data (3.22) collection
[SOURCE: ISO 12616-1:2021, 3.18, modified — “text element to be
...


SLOVENIAN STANDARD
01-October-2024
Upravljanje terminoloških virov - Ekstrakcija terminologije
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
This Slovenian standard is identical to: ISO/DIS 5078
ICS:
01.020 Terminology (principles and coordination)
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

DRAFT INTERNATIONAL STANDARD
ISO/DIS 5078
ISO/TC 37/SC 3
Secretariat: DIN
Voting begins on: 2023-12-05
Voting terminates on: 2024-02-27
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
ICS: 35.240.30; 01.020
This document is circulated as received from the committee secretariat.
THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENT AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
Reference number: ISO/DIS 5078:2023(E)
© ISO 2023


Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and term extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting corpus texts
4.3.3 Kinds of documents to be included in a text corpus
4.3.4 Considerations for text corpus creation
4.3.5 Text corpus citation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Specifying a terminology extraction method
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminology entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the meaning of ISO specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT), see the following URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology,
Subcommittee SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations),
from corpora has become an increasingly important task carried out in a wide variety of different
fields. Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a
range of specialists including language professionals in general, and terminologists in particular, as
well as ontology engineers, and both information and data scientists. Terminology extraction also
serves several purposes that go beyond the compilation of glossaries or the population of terminology
databases, including the identification of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other
fields such as information retrieval, stands in stark contrast to the rarity of individual documents that
provide definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology
management, their output becomes even more relevant when it is assessed and validated, using both
qualitative and quantitative approaches and criteria for selecting entities such as relevant terms,
definitions and concept relations. This validated terminology extraction data supports the building of
high-quality terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— Compilation of corpora (general principles and types of corpora);
— Methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— Criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— Tool characteristics.
By objectively specifying these aspects, this document will provide a reference framework for
im
...
