ISO/DIS 5078
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
Upravljanje terminoloških virov - Ekstrakcija terminologije
SLOVENIAN STANDARD
1 October 2024
This Slovenian standard is identical to: ISO/DIS 5078
ICS:
01.020 Terminology (principles and coordination)
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
DRAFT INTERNATIONAL STANDARD
ISO/TC 37/SC 3 Secretariat: DIN
Voting begins on: 2023-12-05
Voting terminates on: 2024-02-27
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
ICS: 35.240.30; 01.020
This document is circulated as received from the committee secretariat.
THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENT AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY
NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND
USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR
POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT
RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
Reference number: ISO/DIS 5078:2023(E)
© ISO 2023
ISO/DIS 5078:2023(E)
© ISO 2023
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and term extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting corpus texts
4.3.3 Kinds of documents to be included in a text corpus
4.3.4 Considerations for text corpus creation
4.3.5 Text corpus citation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Specifying a terminology extraction method
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminology entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the meaning of ISO-specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT), see the following URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology,
Subcommittee SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations),
from corpora has become an increasingly important task carried out in a wide variety of different
fields. Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a
range of specialists including language professionals in general, and terminologists in particular, as
well as ontology engineers, and both information and data scientists. Terminology extraction also
serves several purposes that go beyond the compilation of glossaries or the population of terminology
databases, including the identification of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other
fields such as information retrieval, stands in stark contrast to the rarity of individual documents that
provide definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology
management, their output becomes even more relevant when it is assessed and validated, using both
qualitative and quantitative approaches and criteria for selecting entities such as relevant terms,
definitions and concept relations. This validated terminology extraction data supports the building of
high-quality terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— Compilation of corpora (general principles and types of corpora);
— Methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— Criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— Tool characteristics.
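As an informal illustration of the statistical family of methods and of candidate-list filtering named in the list above, the following minimal sketch counts word n-grams and keeps those that reach a frequency threshold. It is not taken from this document: the stop-word list, the function names (tokenize, candidate_terms, filter_candidates) and the parameters max_len and min_freq are assumptions chosen for the example, and real extraction tools typically combine such counts with linguistic or neural evidence.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only; real tools use fuller lists.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "are", "for", "to", "from", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic (optionally hyphenated) tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def candidate_terms(text, max_len=3):
    """Count word n-grams (1..max_len) that neither start nor end with a stop word.

    Simplification: sentence boundaries are ignored, so some n-grams span two sentences.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts


def filter_candidates(counts, min_freq=2):
    """Keep candidates at or above the frequency threshold, most frequent first."""
    kept = [(term, freq) for term, freq in counts.items() if freq >= min_freq]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


sample = (
    "Terminology extraction identifies candidate terms in a text corpus. "
    "A text corpus for terminology extraction is compiled from domain texts."
)
for term, freq in filter_candidates(candidate_terms(sample)):
    print(term, freq)
```

On this two-sentence sample, the threshold retains the multiword candidates "terminology extraction" and "text corpus" together with their component words. The output is only a candidate list and would still need the assessment and validation described earlier in this Introduction.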
By objectively specifying these aspects, this document will provide a reference framework for
im
...