Information technology — User interface — Face-to-face speech translation — Part 2: System architecture and functional components

ISO/IEC 20382-2:2017 specifies the functional components of face-to-face speech translation designed to interoperate among multiple translation systems with different languages. It also specifies the speech translation features, general requirements and functionality, thus providing a framework to support a convenient speech translation service in face-to-face situations. This document is applicable to speech translation devices, servers and communication protocols among speech translation servers and clients in a high-level approach. ISO/IEC 20382-2:2017 also defines various system architectures in different environments. ISO/IEC 20382-2:2017 is not applicable to defining speech recognition engines, language translation engines and speech synthesis engines.


General Information

Status
Published
Publication Date
23-Oct-2017
Current Stage
9060 - Close of review
Start Date
03-Jun-2028
Standard
ISO/IEC 20382-2:2017 - Information technology -- User interface -- Face-to-face speech translation
English language
19 pages

Standards Content (Sample)

INTERNATIONAL ISO/IEC
STANDARD 20382-2
First edition
2017-10
Information technology — User
interface — Face-to-face speech
translation —
Part 2:
System architecture and functional
components
Technologies de l'information — Interface utilisateur — Face-à-face
discours traduction —
Partie 2: Architecture du système et des composants fonctionnels
Reference number
ISO/IEC 20382-2:2017(E)
©
ISO/IEC 2017


COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2017, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org


Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 Overview of face-to-face speech translation
4.1 General
4.2 Functional components of F2F speech translation
5 Functional requirements
5.1 General requirements
5.2 Speech recognition requirements
5.3 Language translation requirements
5.4 Speech synthesizer requirements
6 System architectures of F2F speech translation
6.1 General
6.2 Two persons with embedded F2F speech translation devices
6.3 Two persons with remote speech translation functions
6.4 Mixture of 6.2 and 6.3
6.5 Adding one more speaker to F2F speech translation conversation
6.6 Two persons with only one fixed F2F speech translation device
Annex A (informative) History of F2F speech translation
Annex B (informative) An example scenario of F2F speech translation protocol
Bibliography

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form a specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organizations to deal with particular fields of technical
activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non-governmental, in liaison with ISO and IEC also take part in the
work. In the field of information technology, ISO and IEC have established a joint technical committee,
ISO/IEC JTC 1.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for
the different types of documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following
URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 35, User interfaces.
A list of all parts in the ISO/IEC 20382- series can be found on the ISO website.

Introduction
It is important to consider people with special requirements to ensure that they can gain the same
benefits from ICT. One of those special requirements is to help people to avoid language barriers in
global environments. Automatic speech translation systems have existed for a long time, but they have
functional limitations as well as technical ones with regard to usability and accessibility. Annex A
shows a history of face-to-face speech translation.
One reason for these limitations is the diversity of the languages currently used. It is difficult to support
many languages by one or several speech translation systems. A flexible and interoperable standardized
framework is needed to work with all different languages utilizing many speech translation systems
already developed in many countries. Other considerations to make a natural and usable speech
translation service possible include applying users’ characteristics within the system, such as emotion,
speech style, gender type and other attributes. To reflect those characteristics in the output speech
translation, a standardized user interface is required to reflect the input and output data and transfer
them to the user’s device.
This document aims to enable face-to-face speech translation among people with different languages.
The three underlying technologies, i.e. speech recognition, language translation and speech synthesis,
are mature enough to build a speech translation function. There are many face-to-face speech
translation devices and/or services using mobile devices. However, the user needs to learn how to use
the service and needs to use both hands to control the speech translation system. If the user wishes to
use only one hand, which is usually the case, he or she cannot use the current speech translation systems
and/or services. To overcome this usability issue, this document suggests a method that exactly follows
the conversation among people with the same language. The method in this document is hands-free,
and does not require any pre-training. In this sense, this method is the ultimate user interface of face-
to-face speech translation and will open a world without language barriers.
INTERNATIONAL STANDARD ISO/IEC 20382-2:2017(E)
Information technology — User interface — Face-to-face
speech translation —
Part 2:
System architecture and functional components
1 Scope
This document specifies the functional components of face-to-face speech translation designed to
interoperate among multiple translation systems with different languages. It also specifies the speech
translation features, general requirements and functionality, thus providing a framework to support a
convenient speech translation service in face-to-face situations. This document is applicable to speech
translation devices, servers and communication protocols among speech translation servers and
clients in a high-level approach. This document also defines various system architectures in different
environments. This document is not applicable to defining speech recognition engines, language
translation engines and speech synthesis engines.
2 Normative references
There are no normative references in this document.
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
No terms and definitions are listed in this document.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— IEC Electropedia: available at http://www.electropedia.org/
— ISO Online browsing platform: available at http://www.iso.org/obp
3.2 Abbreviated terms
UTF-8 Unicode standard defined in IETF RFC 2279 (1998), UTF-8, a transformation format of ISO/IEC 10646
4 Overview of face-to-face speech translation
4.1 General
A face-to-face (F2F) speech translation system enables users of different languages in a face-to-face
situation to communicate with each other in spoken languages by providing machine-generated
translation results. A face-to-face speech translation system between a speaker and a listener shall
have a speech recognition module, language translation module and a speech synthesizer (TTS: text to
speech) as shown in Figure 1.

Figure 1 — Functional components of F2F speech translation
4.2 Functional components of F2F speech translation
For F2F speech translation, the speaker and the listener shall set up a UI (see ISO/IEC 20382-1).
The functions of each component in Figure 1 are as follows.
1) The speaker speaks a sentence in his/her own language.
2) The speech recognition module recognizes the speech and outputs the corresponding text.
3) The text is translated into another language with the same meaning through the language
translation module.
4) The speech synthesizer generates the corresponding speech in a listener’s language based on the
translated text.
5) Listening to the speech, the listener answers in his/her own language.
6) Steps (2) to (5) continue until the users accomplish their goals.
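The loop above can be sketched in code as follows. The callables `recognize`, `translate` and `synthesize` are hypothetical placeholders for off-the-shelf engines; this document does not define their interfaces.

```python
# Illustrative sketch of the conversation loop in 4.2. The three engine
# interfaces are assumptions, not defined by this document.

def translate_turn(audio, src_lang, dst_lang, recognize, translate, synthesize):
    """Run one speaker turn through the three functional components."""
    text_src = recognize(audio, src_lang)               # step 2: speech -> text
    text_dst = translate(text_src, src_lang, dst_lang)  # step 3: text -> text
    return synthesize(text_dst, dst_lang)               # step 4: text -> speech

def demo():
    # Toy stand-in engines, for demonstration only.
    recognize = lambda audio, lang: audio               # pretend audio is text
    translate = lambda text, s, d: f"[{s}->{d}] {text}"
    synthesize = lambda text, lang: f"speech({text})"
    return translate_turn("hello", "en", "fr", recognize, translate, synthesize)
```

In a real system, steps (2) to (5) alternate between the two users' language directions until the conversation ends.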
5 Functional requirements
5.1 General requirements
This subclause provides general requirements for face-to-face speech translation:
— this document covers three remote services: the remote translation service, the remote speech
recognition service and the remote speech synthesis service; all of these remote services shall
preserve the privacy of the face-to-face speech translation users;

— the translation system should allow the users to start a translation session as naturally as in
everyday conversation;
— the translation system should allow the users to start a translation session as quickly as in
everyday conversation (i.e., not exceeding 2 seconds);
— the speech translation system should work in real time (i.e., not exceeding 2 seconds);
— the translation system should allow users to have a session with multiple users;
— the translation system should allow the users to add additional participants after the session has
started.
5.2 Speech recognition requirements
This subclause provides requirements for the speech recognition module of face-to-face speech translation:
— the speech recognition module shall recognize the speech and provide it as text in the same language;
— the speech recognition module shall accept the most popular speech formats;
— the speech format should be described by a metadata format such as the MIME format;
— the output of the speech recognition module should be written in UTF-8 format (see IETF RFC 2279 (1998)).
NOTE This document does not specify the data format of the speech nor that of the text since there are many
off-the-shelf speech recognition modules with various input and output data formats.
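As a hedged illustration of the metadata suggestion above, the speech format could be declared alongside the audio in a MIME-style record. The field names and the accepted-format list here are assumptions for illustration; this document does not specify them.

```python
# Hypothetical MIME-style metadata describing the speech input (see 5.2).
# All field names are illustrative only.
speech_metadata = {
    "content-type": "audio/wav",   # MIME type of the captured speech
    "sample-rate": 16000,          # sampling rate in Hz
    "channels": 1,                 # mono capture
    "text-encoding": "utf-8",      # encoding of the recognizer's text output
}

def is_supported(meta, accepted=("audio/wav", "audio/ogg")):
    """Check whether a recognizer accepts the declared speech format."""
    return meta.get("content-type") in accepted
```

Declaring the format as metadata lets a client negotiate with heterogeneous recognition modules without fixing one audio codec in the protocol.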
5.3 Language translation requirements
This subclause provides requirements for the language translation module of face-to-face speech translation:
— the language translation module shall translate text from a source language into text in a target
language with the same meaning;
— if there is no direct language translation module between the source language and the target
language, an intermediate language should be used: the source language is first translated into
the intermediate language, and the intermediate language is then translated into the target
language. The intermediate language should be chosen so that overall translation performance is
best; if no performance data are available, it should be chosen from the same language family, or
from languages with the same word order, as the source language or the target language.
NOTE This document does not specify the data formats of the input and output texts since there are many
off-the-shelf language translation modules with various input and output data formats.
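The intermediate-language rule above can be sketched as follows. The table of direct translation modules and the performance scores are hypothetical; only the selection policy follows the text.

```python
# Sketch of the pivot-language selection in 5.3. DIRECT lists hypothetical
# (source, target) pairs for which a direct translation module exists.

DIRECT = {("en", "fr"), ("fr", "en"), ("en", "ko"), ("ko", "en")}

def translation_path(src, dst, scores=None):
    """Return the chain of languages used to translate src into dst."""
    if (src, dst) in DIRECT:
        return [src, dst]                 # direct module available
    # Candidate pivots reachable from src that can also reach dst.
    pivots = [p for (a, p) in DIRECT if a == src and (p, dst) in DIRECT]
    if not pivots:
        raise ValueError(f"no translation path from {src} to {dst}")
    if scores:
        # Prefer the pivot with the best measured end-to-end performance.
        pivots.sort(key=lambda p: scores.get((src, p, dst), 0), reverse=True)
    return [src, pivots[0], dst]
```

For example, with the table above, French-to-Korean translation has no direct module and is routed through English: `translation_path("fr", "ko")` yields `["fr", "en", "ko"]`.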
5.4 Speech synthesizer requirements
This subclause provides requirements for the speech synthesizer of face-to-face speech translation:
— the speech synthesizer shall generate the corresponding speech from text of the same language;
— in face-to-face speech translation the synthesized speech should be as close as possible to that of
the original speaker to increase the natural feel of the conversation. The gender of the synthesized
speech in language B should be the same as that of the user in language A. The natural feeling can
be increased if the base frequency, speed, prosody and/or speech colour of the synthesized speech
are similar to those of the original speaker;
— the text input of the speech synthesizer should be written in UTF-8 format (see IETF RFC 2279 (1998)).

NOTE This document does not specify the data format of the speech nor that of the text since there are many
off-the-shelf speech synthesizers with various input and output data formats.
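One way to carry the speaker characteristics named in 5.4 to a synthesizer is a voice-attribute record passed with each request. The record shape and field names below are assumptions for illustration, not defined by this document.

```python
# Hypothetical voice-attribute record for 5.4: characteristics of the
# original speaker that the synthesizer should approximate.
from dataclasses import dataclass

@dataclass
class VoiceAttributes:
    gender: str            # should match the original speaker (5.4)
    base_frequency: float  # average pitch of the original speech, in Hz
    speed: float           # speaking rate, e.g. syllables per second

def synthesis_request(text: str, lang: str, voice: VoiceAttributes) -> dict:
    """Bundle UTF-8 text, target language and voice hints for a TTS engine."""
    return {"text": text, "lang": lang, "voice": voice}
```

A synthesizer without support for a given hint can simply ignore it; the hints only improve the natural feel of the conversation, they do not change the translated content.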
6 System architectures of F2F speech translation
6.1 General
Figure 2 shows the sequence diagram of face-to-face speech translation.
Figure 2 — Sequence diagram
6.2 Two persons with embedded F2F speech translation devices
The basic system architecture between two persons with embedded F2F speech translation devices is
described in Figure 3.

Figure 3 — System architecture between two persons with embedded F2F speech
translation devices
— In this configuration, the language A speech recognition module and the language A speech
synthesizer are embedded in the mobile device of the user in language A, and the language B speech
recognition module and the language B speech synthesizer are embedded in the mobile device of
the user in language B.
— The A-to-B and B-to-A language translation modules reside in the translation server of the translation
service.
— The data format of (2), (3), (7) and (8) can be any format. For example, one can use Modality
Conversion Markup Language [3].
— One of the mobile devices can be a fixed device with short range wireless communication capability.
Tellers or box offices can use such an architecture.
The following steps are speech translation service steps between two persons with embedded F2F
speech translation devices. Annex B shows an example scenario of face-to-face speech translation
protocol.
1) The user in language A speaks a sentence in language A. The language A speech recognition module
embedded in the mobile device of the user recognizes the speech in language A and outputs the
corresponding text in language A.
2) The text in language A is translated into text in language B with the same meaning through the
A-to-B language translation module in translation server K.
3) The translated text in language B is transferred to the mobile device of the user in language A.
4) The translated text in language B is then transferred through short range wireless communication
to the mobile device of the user in language B.
5) The language B speech synthesizer generates the corresponding speech in language B.
6) After listening to the speech in language B, the user in language B answers in language B. The
language B speech recognition module embedded in the mobile device of the user recognizes this
speech in language B and outputs the corresponding text in language B. This recognized text is
transferred to the B-to-A language translation module residing in translation server G.

7) The text in language B is translated into text in language A with the same meaning through the
B-to-A language translation module residing in translation server G.
8) The translated text in language A is transferred to the mobile device of the user in language B.
9) The translated text in language A is then transferred to the mobile device of the user in language A
through the short range wireless communication.
10) The language A speech synthesizer generates the corresponding speech in language A.
11) Steps (1) to (10) continue until both users accomplish their goals.
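The message flow of the steps above can be sketched as follows. The `Device` interface, the toy engines and the server signature are all hypothetical; only the ordering of the steps follows the text.

```python
# Sketch of the 6.2 message flow: each device embeds its own speech
# recognition and synthesis, sends recognized text to its translation
# server, and relays the translated text to the peer device over the
# short range wireless link.

class Device:
    def __init__(self, lang, server, recognize, synthesize):
        self.lang = lang
        self.server = server          # translation server of this user
        self.recognize = recognize    # embedded speech recognition module
        self.synthesize = synthesize  # embedded speech synthesizer
        self.peer = None              # paired device (short range wireless)

    def speak(self, audio):
        text = self.recognize(audio)                               # steps 1 / 6
        translated = self.server(text, self.lang, self.peer.lang)  # steps 2-3 / 7-8
        self.peer.receive(translated)                              # steps 4 / 9

    def receive(self, text):
        self.last_output = self.synthesize(text)                   # steps 5 / 10

def demo():
    server = lambda text, s, d: f"{s}->{d}:{text}"   # toy translation server
    dev_a = Device("A", server, lambda a: a, lambda t: f"tts({t})")
    dev_b = Device("B", server, lambda a: a, lambda t: f"tts({t})")
    dev_a.peer, dev_b.peer = dev_b, dev_a
    dev_a.speak("hello")              # the user in language A speaks
    return dev_b.last_output          # synthesized speech on device B
```

Note that, as in the text, each direction of the conversation may use a different translation server (server K for A-to-B, server G for B-to-A); the sketch uses one toy server for both.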
6.3 Two persons with remote speech translation functions
The system architecture between two persons with remote F2F speech translation devices is described
in Figure 4.
Figure 4 — System architecture between two persons with remote F2F speech translation devices
— In this configuration, the language A speech synthesizer is embedded in the mobile device of the
user in language A, and the language B speech synthesizer is embedded in the mobile device of the
user in language B.
— The A-to-B and B-to-A language translation modules, the language A speech recognition module and
the language B speech recognition module reside in a remote environment.
— The speech synthesizer can also reside in the remote environment.
— One of the mobile devices can be a fixed device with short range wireless communication capability.
Tellers or box offices can use such an architecture.
The following steps are speech translation service steps between two persons with remote F2F speech
translation devices.
1) The user in language A speaks a sentence in language A.
2) The language A speech recognition module residing in the remote environment recognizes the
speech in language A and outputs the corresponding text in language A.

3) The recognized text in language A is transferred to the mobile device of the user in language A.
4) The text in language A is translated
...
