Management of terminology resources — Terminology extraction

Gestion des ressources terminologiques — Extraction de terminologie

Upravljanje terminoloških virov - Ekstrakcija terminologije

General Information

Status
Not Published
Current Stage
5020 - FDIS ballot initiated: 2 months. Proof sent to secretariat
Start Date
14-Nov-2024
Due Date
14-Nov-2024
Completion Date
14-Nov-2024

Buy Standard

Draft
ISO/DIS 5078:2024
English language
29 pages
Draft
ISO/FDIS 5078 - Management of terminology resources — Terminology extraction Released:10/31/2024
English language
23 pages
Draft
REDLINE ISO/FDIS 5078 - Management of terminology resources — Terminology extraction Released:10/31/2024
English language
23 pages

Standards Content (Sample)


SLOVENIAN STANDARD
01-October-2024
Upravljanje terminoloških virov - Ekstrakcija terminologije
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
This Slovenian standard is identical to: ISO/DIS 5078
ICS:
01.020 Terminology (principles and coordination)
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

DRAFT INTERNATIONAL STANDARD
ISO/DIS 5078
ISO/TC 37/SC 3 Secretariat: DIN
Voting begins on: 2023-12-05
Voting terminates on: 2024-02-27
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
ICS: 35.240.30; 01.020
Reference number: ISO/DIS 5078:2023(E)
This document is circulated as received from the committee secretariat.
THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENT AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
© ISO 2023

© ISO 2023
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and term extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting corpus texts
4.3.3 Kinds of documents to be included in a text corpus
4.3.4 Considerations for text corpus creation
4.3.5 Text corpus citation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Specifying a terminology extraction method
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminology entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology,
Subcommittee SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations),
from corpora has become an increasingly important task carried out in a wide variety of different
fields. Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a
range of specialists including language professionals in general, and terminologists in particular, as
well as ontology engineers, and both information and data scientists. Terminology extraction also
serves several purposes that go beyond the compilation of glossaries or the population of terminology
databases, including the identification of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other
fields such as information retrieval, stands in stark contrast to the rarity of individual documents that
provide definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology
management, their output becomes even more relevant when it is assessed and validated, using both
qualitative and quantitative approaches and criteria for selecting entities such as relevant terms,
definitions and concept relations. This validated terminology extraction data supports the building of
high-quality terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— Compilation of corpora (general principles and types of corpora);
— Methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— Criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— Tool characteristics.
By objectively specifying these aspects, this document will provide a reference framework for
im
...


FINAL DRAFT International Standard
ISO/TC 37/SC 3
Secretariat: DIN
Voting begins on: 2024-11-14
Voting terminates on: 2025-01-09
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
© ISO 2024
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and terminology extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting texts for a text corpus
4.3.3 Considerations for text corpus creation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Starting the terminology extraction workflow
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminological entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.

Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations), from
text corpora has become an increasingly important task carried out in a wide variety of different fields.
Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a range of specialists
including language professionals in general, and terminologists in particular, as well as ontology engineers,
and both information and data scientists. Terminology extraction also serves several purposes that go
beyond the compilation of glossaries or the population of terminology databases, including the identification
of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other fields
such as information retrieval, stands in stark contrast to the rarity of individual documents that provide
definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology management,
their output becomes even more relevant when it is assessed and validated, using both qualitative and
quantitative approaches and criteria for selecting entities such as relevant terms, definitions and concept
relations. This extracted and then validated terminological data supports the building of high-quality
terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— compilation of text corpora (general principles and types of text corpora);
— methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— tool characteristics.
By objectively specifying these aspects, this document provides a reference framework for improving the
performance of terminology extraction tools and optimizing the use of their output.

FINAL DRAFT International Standard ISO/FDIS 5078:2024(en)
Management of terminology resources — Terminology
extraction
1 Scope
This document specifies methods for extracting candidate terms from text corpora and gives guidance on
selecting relevant designations, definitions, concept relations and other terminology-related information.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 704, Terminology work — Principles and methods
ISO 1087, Terminology work and terminology science — Vocabulary
ISO 16642, Computer applications in terminology — Terminological markup framework
ISO 26162-1
...


Date: 2024-09-25
ISO/TC 37/SC 3/WG 5
Secretariat: DIN
Date: 2024-10-31
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
FDIS stage
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and terminology extraction
4.3 Compilation of text corpora
4.4 Terminology extraction approaches and methods
4.5 Term extraction output
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
Bibliography
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations), from text
corpora has become an increasingly important task carried out in a wide variety of different fields.
Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a range of specialists
including language professionals in general, and terminologists in particular, as well as ontology engineers,
and both information and data scientists. Terminology extraction also serves several purposes that go beyond
the compilation of glossaries or the population of terminology databases, including the identification of
concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other fields such
as information retrieval, stands in stark contrast to the rarity of individual documents that provide definitions,
requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology management,
their output becomes even more relevant when it is assessed and validated, using both qualitative and
quantitative approaches and criteria for selecting entities such as relevant terms, definitions and concept
relations. This extracted and then validated terminological data supports the building
of high-quality terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— compilation of text corpora (general principles and types of text corpora);
— methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic, hybrid
and neural);
— criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— tool characteristics.
By objectively specifying these aspects, this document provides a reference framework for improving the
performance of terminology extraction tools and optimizing the use of their output.
Management of terminology resources — Terminology extraction
1 Scope
This document specifies methods for extracting candidate terms from text corpora and gives guidance on
selecting relevant designations, definitions, concept relations and other terminology-related information.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 704, Terminology work — Principles and methods
ISO 1087, Terminology work and terminology science — Vocabulary
ISO 16642, Computer applications in terminology — Terminological markup framework
ISO 26162-1, Management of terminology resources — Terminology databases — Part 1: Design
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
annotation
process of adding metadata (3.10) to segments of language data
[SOURCE: ISO 24617-1:2012, 3.2, modified — “information” replaced by “metadata”; “or that information
itself” deleted.]
3.2
bitext
collection of texts (3.24) in two languages that can be considered translations of each other and that are
segmented and aligned
Note 1 to entry: Bitexts play a key role in training, evaluating and improving localization technologies, such as translation
memories, terminology management tools or machine translation engines.
3.3
candidate term
term candidate
provisional term
string of characters (3.5) that has been collected by means of term extraction (3.21) but has not yet been
selected as a relevant term (3.20) to be considered for inclusion in a terminological data (3.22) collection
[SOURCE: ISO 12616-1:2021, 3.18, modified — “text element to be documented in the” replaced by “term to
be considered for inclusion in a”.]
3.4
candidate terminological data
string of characters (3.5) that has been collected by means of terminology extraction (3.24) but has not
yet been selected as relevant terminological data (3.23)
3.5
character
unit of textual information represented by one or more bytes
EXAMPLE A single letter, numeral, punctuation mark, diacritic, symbol, ideograph, or space.
[SOURCE: ISO/IEC 14840:1996, 4.10, modified — “textual” added to the definition; example added.]
3.6
collocation
lexically or pragmatically constrained recurrent cooccurrence of at least two lexical units (3.8) which are in a
direct syntactic relation with each other
EXAMPLE “Commit a crime” instead of “do a crime”.
3.7
keyness
quantity proportional to the frequency of a lexical unit (3.8) in a subject-field-specific text corpus (3.26),
relative to a reference corpus (3.16)
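One way to make this definition concrete is a smoothed ratio of relative frequencies. The following is a minimal sketch only, not a formula prescribed by this document: the function name `keyness`, the additive smoothing and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter


def keyness(unit, domain_tokens, reference_tokens, smoothing=1.0):
    """Ratio of the smoothed relative frequency of `unit` in the
    subject-field corpus to its smoothed relative frequency in the
    reference corpus (illustrative formula, an assumption).

    Values above 1 suggest the unit is more typical of the subject field.
    """
    domain_counts = Counter(domain_tokens)
    reference_counts = Counter(reference_tokens)
    # Smoothing avoids division by zero for units absent from the reference corpus.
    domain_rate = (domain_counts[unit] + smoothing) / (len(domain_tokens) + smoothing)
    reference_rate = (reference_counts[unit] + smoothing) / (len(reference_tokens) + smoothing)
    return domain_rate / reference_rate


# Toy corpora (hypothetical data), already tokenized on whitespace.
domain = "the laser printer uses a laser to print".split()
reference = "the cat sat on the mat near the door".split()

laser_score = keyness("laser", domain, reference)
the_score = keyness("the", domain, reference)
```

In this toy data, "laser" scores above 1 and "the" scores below 1, which matches the intuition that subject-field vocabulary stands out against a general reference corpus.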
3.8
lexical unit
meaningful element in the lexicon (3.9) of a language
3.9
lexicon
complete set of meaningful elements in a language
3.10
metadata
data that defines and describes other data
[SOURCE: ISO 24531:2013, 4.32]
3.11
n-gram
sequence of n adjacent tokens (3.28)
Note 1 to entry: Frequently adjacent tokens can be an indicator for termhood (3.22).
Note 2 to entry: The number of adjacent tokens (n) is usually 2, 3 or 4.
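As an illustration of this entry, the sketch below collects n-grams from a token list and counts how often each occurs. It assumes whitespace tokenization and uses hypothetical helper names (`ngrams`, `ngram_counts`); a real extraction tool would apply its own tokenizer and filters.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every sequence of n adjacent tokens, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, sizes=(2, 3, 4)):
    """Count n-grams for the sizes named in Note 2 (2, 3 and 4 by default)."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample text, tokenized on whitespace.
tokens = "laser printer toner for the laser printer".split()
counts = ngram_counts(tokens)
most_frequent = counts.most_common(1)[0]
```

Here the bigram ("laser", "printer") occurs twice and every other n-gram once, so it is the strongest candidate under the frequency heuristic of Note 1.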
3.12
noise
non-relevant search results
Note 1 to entry: In terminology extraction (3.24), “noise” means non-relevant data in the extraction output.
3.13
parsing
process of determining the syntactic structure of a lexical unit (3.8) by decomposing it into elementary
subunits and establishing the relationships among the subunits
[SOURCE: ISO/IEC/IEEE 24765:2017, 3.2818, modified — “parse” replaced by “parsing” as the term; “to
determine” replaced by “process of determining” in the definition; “more” before “elementary” deleted in the
definition; Example deleted.]
3.14
precision
ratio of relevant search results to all search results
Note 1 to entry: In terminology extraction (3.24), “precision” means the ratio of relevant candidate terms (3.3)
retrieved to the total of candidate terms retrieved.
Note 2 to entry: Precision and recall (3.15) generally have an inverse relationship; when one increases, the other
tends to decrease.
3.15
recall
ratio of relevant search results to all relevant items in a set that have been or should have been found from a
search query
Note 1 to entry: In terminology extraction (3.24), “recall” means the relevant candidate terms (3.3) in a text corpus
(3.26).
Note 2 to entry: Recall and precision (3.14) generally have an inverse relationship; when one increases, the other
tends to decrease.
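The two ratios defined in 3.14 and 3.15 can be sketched as set operations. This is an illustrative example only: it assumes that a manually validated list of relevant terms is available as a reference, and the function names and sample data are hypothetical.

```python
def precision(retrieved, relevant):
    """Share of retrieved candidate terms that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # convention chosen here for an empty output
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of relevant terms that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for an empty reference list
    return len(retrieved & relevant) / len(relevant)


# Hypothetical extraction output and validated reference list.
retrieved = ["laser printer", "planet", "the printer", "oil painting"]
relevant = ["laser printer", "planet", "oil painting", "pacemaker", "chemical compound"]

p = precision(retrieved, relevant)  # 3 of 4 retrieved candidates are relevant
r = recall(retrieved, relevant)     # 3 of 5 relevant terms were retrieved
```

Loosening the extraction criteria would typically retrieve more of the five relevant terms (higher recall) while admitting more irrelevant candidates (lower precision), which is the inverse relationship described in Note 2.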
3.16
reference corpus
text corpus (3.26) to which a given text corpus for terminology extraction (3.24) is compared
3.17
relevance
quality of being a successful search result in relation to the search query
3.18
silence
set of relevant search results that have not been found from a search query
Note 1 to entry: In terminology extraction (3.24), “silence” means the set of valid candidate terms (3.3) that are
missing in the extraction results.
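Noise (3.12) and silence can both be read as set differences between the extraction output and a validated list of relevant terms. The sketch below assumes such a validated list exists; the function names and the sample data are hypothetical.

```python
def noise(extracted, relevant):
    """Extracted items that are not relevant."""
    return set(extracted) - set(relevant)


def silence(extracted, relevant):
    """Relevant items that are missing from the extraction output."""
    return set(relevant) - set(extracted)


# Hypothetical extraction output and validated reference list.
extracted = ["laser printer", "planet", "the printer", "oil painting"]
relevant = ["laser printer", "planet", "oil painting", "pacemaker", "chemical compound"]

noisy_items = noise(extracted, relevant)
missed_items = silence(extracted, relevant)
```

With this data, the noise is the single item "the printer", and the silence consists of "pacemaker" and "chemical compound".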
3.19
stop word
word that is not taken into account as a candidate term (3.3)
Note 1 to entry: Typical stop words are function words (e.g. prepositions, articles), brand names and non-special
language words to the specific subject field.
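A stop word list is commonly applied as a simple filter on the candidate list. The following sketch assumes a small, hypothetical English stop word set and case-insensitive matching; actual lists depend on the language and the subject field.

```python
# Hypothetical stop word list: function words only, for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}


def drop_stop_words(candidates, stop_words=STOP_WORDS):
    """Remove candidates that are themselves stop words (case-insensitive)."""
    return [c for c in candidates if c.lower() not in stop_words]


candidates = ["The", "laser printer", "of", "planet", "and"]
filtered = drop_stop_words(candidates)
```

Only whole candidates are compared against the list, so a multiword candidate such as "laser printer" is kept even if one of its tokens were a stop word.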
3.20
term
designation that represents a general concept by linguistic means
EXAMPLE “laser printer”, “planet”, “pacemaker”, “chemical compound”, “¾ time”, “Influenza A virus”, “oil painting”.
...
