ETSI TR 102 506 V1.4.1 (2011-08)
Speech and multimedia Transmission Quality (STQ); Estimating Speech Quality per Call
Technical Report
Speech and multimedia Transmission Quality (STQ);
Estimating Speech Quality per Call
Reference
RTR/STQ-00174m
Keywords
quality, speech
ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - NAF 742 C
Non-profit association registered at the
Sous-Préfecture de Grasse (06) N° 7803/88
Important notice
Individual copies of the present document can be downloaded from:
http://www.etsi.org
The present document may be made available in more than one electronic version or in print. In any case of existing or
perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF).
In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network drive
within ETSI Secretariat.
Users of the present document should be aware that the document may be subject to revision or change of status.
Information on the current status of this and other ETSI documents is available at
http://portal.etsi.org/tb/status/status.asp
If you find errors in the present document, please send your comment to one of the following services:
http://portal.etsi.org/chaircor/ETSI_support.asp
Copyright Notification
No part may be reproduced except as authorized by written permission.
The copyright and the foregoing restriction extend to reproduction in all media.
© European Telecommunications Standards Institute 2011.
All rights reserved.
DECT™, PLUGTESTS™, UMTS™ and the ETSI logo are Trade Marks of ETSI registered for the benefit of its Members.
3GPP™ and LTE™ are Trade Marks of ETSI registered for the benefit of its Members and
of the 3GPP Organizational Partners.
GSM® and the GSM logo are Trade Marks registered and owned by the GSM Association.
Contents
Intellectual Property Rights
Foreword
1 Scope
2 References
2.1 Normative references
2.2 Informative references
3 Definitions and abbreviations
3.1 Definitions
3.2 Abbreviations
4 General
5 Call properties
5.1 Call structure
5.2 Call length
5.2.1 Length of utterance (sample)
5.2.2 Number of utterances (samples)
5.3 Call design
6 Call quality on a per sample basis
6.1 Evaluation of the samples
6.2 Mathematical modelling of the call quality
6.2.1 Impact of bad samples towards the end of a call
6.2.2 Impact of a single very bad sample
6.2.3 Applicability of the mathematical model
6.2.4 Validation of the formula
6.2.5 Calculation Example
7 Conclusion
Annex A: Empirical Study from March 2002 on the perceived call quality: PESQ-mobil
A.1 Test concept and speech recordings
A.1.1 Test description of the overall project
A.2 Design of an auditory test methodology to assess the speech material
A.2.1 Structure of the quality assessment
A.2.2 Simulation of a conversation
A.2.3 Assessment on an individual per-sample basis
A.2.4 Distortion types for the voice transmission
A.2.5 Structure of the speech material
A.2.6 Quality of the speech material
A.2.7 Results
A.3 Modelling the overall quality mathematically on basis of the MOS-values
A.3.1 Modelling of Speech Quality by averaging per-sample scores
A.3.2 Modelling of Speech Quality by consideration of the "recency effect"
A.3.3 Modelling of Speech Quality with consideration of a bad sample
A.4 Assessment of the speech material by ITU-T Recommendation P.862
A.4.1 Assessment of the separated speech parts
A.4.2 Result presentation
A.4.3 Usage of the model with the ITU-T Recommendation P.862 results
A.5 The rating of the samples
A.5.1 Rating of the calls
A.5.2 Rating of the utterances
Annex B: Empirical Study on the perceived call quality with English samples (Ericsson AB, 2007)
B.1 Introduction
B.2 Test design
B.3 Test results
B.3.1 Results for 60 seconds calls
B.3.2 Results for 120 seconds calls
B.3.3 Results for the utterances
B.3.4 Correlation Between MOS and P.862.1 for the individual utterances
B.4 Call profiles
B.4.1 Quality profiles for 120 seconds calls
B.4.2 Quality profiles for 60 seconds calls
Annex C: Study on the perceived call quality with German samples (T-Labs™, 2007)
C.1 Introduction
C.2 Test Design
C.2.1 Material
C.2.2 Subjects
C.2.3 Procedure
C.2.4 Results
C.3 Detailed test results 60 seconds calls
C.3.1 Rating of the calls
C.3.2 Rating of the utterances
C.4 Detailed test results 120 seconds calls
C.4.1 Rating of the calls
C.4.2 Rating of the utterances
History
Intellectual Property Rights
IPRs essential or potentially essential to the present document may have been declared to ETSI. The information
pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found
in ETSI SR 000 314: "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in
respect of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web
server (http://ipr.etsi.org).
Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee
can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web
server) which are, or may be, or may become, essential to the present document.
Foreword
This Technical Report (TR) has been produced by ETSI Technical Committee Speech and multimedia Transmission
Quality (STQ).
1 Scope
The present document proposes a way to model measurement results obtained on a per sample basis so that the
perceived end-to-end speech quality per call can be estimated for narrowband circuit switched voice services in mobile networks.
It focuses on the speech (listening) quality of a voice call. The speech quality per call is calculated
separately for each direction of the call. Conversational properties such as talker quality, round trip and other related
metrics are not considered. Speech quality of video telephony is not considered either.
The scenario focuses on test signals between 60 seconds and 120 seconds in duration with alternating
speech/silence periods as described in clause 5. The presented model is based on three studies and may not generalize to
call scenarios other than those used in the underlying studies.
Throughout the present document where ITU-T Recommendation P.862.1 [i.2] (or ITU-T Recommendation P.862 [i.1])
is quoted the same applies to all measurements of listening quality. This can be listening quality scores gained by
auditory tests (MOS-LQS) or objective measurements predicting MOS-LQO according to ITU-T Recommendation
P.800.1 [i.3] covering the relevant network distortions and speech processing components in their scope.
2 References
References are either specific (identified by date of publication and/or edition number or version number) or
non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the
reference document (including any amendments) applies.
Referenced documents which are not found to be publicly available in the expected location might be found at
http://docbox.etsi.org/Reference.
NOTE: While any hyperlinks included in this clause were valid at the time of publication ETSI cannot guarantee
their long term validity.
2.1 Normative references
The following referenced documents are necessary for the application of the present document.
Not applicable.
2.2 Informative references
The following referenced documents are not necessary for the application of the present document but they assist the
user with regard to a particular subject area.
[i.1] ITU-T Recommendation P.862: "Perceptual evaluation of speech quality (PESQ): An objective
method for end-to-end speech quality assessment of narrow-band telephone networks and speech
codecs".
[i.2] ITU-T Recommendation P.862.1: "Mapping function for transforming P.862 raw result scores to
MOS-LQO".
[i.3] ITU-T Recommendation P.800.1: "Mean Opinion Score (MOS) terminology".
[i.4] ETSI TS 102 250 (all parts): "Speech and multimedia Transmission Quality (STQ); QoS aspects
for popular services in mobile networks".
[i.5] ITU-T Recommendation P.862.3: "Application guide for objective quality measurement based on
Recommendations P.862, P.862.1 and P.862.2".
[i.6] ITU-T Recommendation P.800: "Methods for subjective determination of transmission quality".
[i.7] CENELEC EN 60645-2:1997: "Audiometers - Part 2: Equipment for speech audiometry".
[i.8] "Ergebnisbericht (Study) Berkom 'PESQ-mobil'" (in German), J. Berger, T-Systems.
[i.9] ETSI TR 102 506 (V1.1.1): "Speech Processing, Transmission and Quality Aspects (STQ);
Estimating Speech Quality per Call".
3 Definitions and abbreviations
3.1 Definitions
For the purposes of the present document, the following terms and definitions apply:
listening quality: quality as perceived by a user in a listening situation
perceived quality: quality as perceived by a human user
speech quality per call: listening quality as perceived by a user (at the end) of a conversational call
3.2 Abbreviations
For the purposes of the present document, the following abbreviations apply:
ACR Absolute Category Rating
AMR Adaptive Multi Rate
EFR Enhanced Full Rate
FR Full Rate
HR Half Rate
IRS Intermediate Reference System
MOS Mean Opinion Score
NOTE: Commonly used term for quality assessment.
MOS-LQO MOS-Listening Quality from Objective testing
MOS-LQS MOS-Listening Quality from auditory tests (Subjective)
SpQ-C Speech (listening) Quality on Call basis
UMTS Universal Mobile Telecommunications System
VoIP Voice over IP
4 General
The established way of measuring speech quality is measurement on a per sample basis. Much standardization
work has been done by the ITU-T with the P.862 series of documents. Using that established approach, and taking advantage
of the data acquired in that fashion, one can seek to estimate the perceived speech quality of a call.
Current models that average over a large number of single speech samples do not necessarily paint an accurate picture
of customer satisfaction, since a bad sample can be outweighed by a couple of good samples. Averaging over the
calls mitigates the problem but still suffers from the shortcoming that a number of good samples may outweigh a very
bad sample. On the other hand, threshold models that regard a call as fair or poor on the basis of one or two degraded
samples do not take the number of good or excellent samples into account. Models in which a certain percentage of the
samples needs to be degraded for the call to be rated as bad disregard the temporal structure of the call and the relative
timing of the degradation towards the end.
It is worthwhile to model the measurement results to obtain a call quality value that allows the impact of
varying speech quality during a conversation to be understood.
5 Call properties
To determine call properties such as the call length and the sample specifics, existing specifications such as
ITU-T Recommendation P.862 [i.1] and TS 102 250 [i.4] can be drawn upon. On that basis a reference speech quality
sensitive voice call can be characterized. The standard call length for instrumental voice quality testing is defined in
TS 102 250-5 [i.4] and the sample characteristics and evaluation are defined in ITU-T Recommendation P.862 [i.1] and
ITU-T Recommendation P.862.3 [i.5]. The structure of the call still needs to be defined.
5.1 Call structure
Calls, be they mobile originated, mobile terminated or mobile to mobile, can be divided into different groups: short
calls of a couple of seconds that reach an announcement (such as pre-paid account statements, voice boxes or a wrong
destination), and conversations in which the parties exchange a couple of utterances. Assuming that the listening quality
sensitive calls are those in which meaningful utterances are exchanged over a stretch of time, voicemail and speed dials
can be excluded from consideration. The "typical" call is a dialog-like conversation, which is in line with the empirical
findings.
In an idealized dialog the utterances are exchanged and distributed evenly in length and frequency. On each side a
certain period of speech activity is followed by silence for the same length of time. Since the call quality on sample
basis is rated for each side independently it is sufficient in an instrumental or subjective realization to feed one side with
the required sample pattern.
5.2 Call length
The length of the call should give room for a couple of utterances (samples). The call length recommended in
TS 102 250-5 [i.4] is 120 seconds, which is sufficient for this requirement. In fact the average call length is well below
this time. If calls such as those to the mailbox, to pre-paid accounts, to far end voice boxes or to wrong numbers are
excluded from that calculation, the average call time goes up considerably. For practical purposes, however, it is
desirable to use call lengths that are considerably shorter than 120 seconds. The studies in annexes B and C provide
results for calls with a length of 60 seconds.
5.2.1 Length of utterance (sample)
The application guideline for objective speech measurement and the construction of samples for objective quality
measurements is ITU-T Recommendation P.862.3 [i.5]. The typical sample of measurement systems has a length from
5 seconds to 12 seconds with a speech activity of maximum 80 %. Such a sample typically contains leading and trailing
silence and in case of multiple sentences also silence in between. These individual samples and their ratings are the
basis of the call quality assessment. Therefore the speech activity part of the call consists of these samples.
5.2.2 Number of utterances (samples)
Depending on the length of the call in connection with length of the individual utterance it takes from
5 to 12 utterances and silence pairs to fill the different call lengths. From empirical evidence we know that a typical
conversational call contains around 4 utterances from each side so that 5 recurrences of the speech and silent pair can be
recommended. Considering that these values are applicable for short calls, longer calls can accommodate up to
12 speech and silence pairs with an individual sample length of 5 seconds.
5.3 Call design
The conversational call that is to be rated to estimate the call quality should consist of alternating phases of speech
activity and silence. The length of each phase should be 5 seconds to 12 seconds, and the speech/silence pair should
recur 5 to 12 times during the call.
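The call design can be illustrated with a short sketch. The Python fragment below is only an illustration and not part of the recommended method: the helper name build_call_schedule is hypothetical, and it assumes speech and silence phases of equal length, as in the idealized dialog of clause 5.1.

```python
from typing import List, Tuple

Phase = Tuple[str, float, float]  # (kind, start in seconds, end in seconds)


def build_call_schedule(n_pairs: int, phase_len: float) -> List[Phase]:
    """Return alternating speech/silence phases for one side of a call.

    Hypothetical helper: assumes speech and silence phases of equal length.
    """
    if not 5 <= n_pairs <= 12:
        raise ValueError("the speech/silence pair should recur 5 to 12 times")
    if not 5 <= phase_len <= 12:
        raise ValueError("phase length should be 5 seconds to 12 seconds")

    schedule: List[Phase] = []
    start = 0.0
    for _ in range(n_pairs):
        for kind in ("speech", "silence"):
            schedule.append((kind, start, start + phase_len))
            start += phase_len
    return schedule


# Two designs at the ends of the recommended ranges.
short_call = build_call_schedule(n_pairs=5, phase_len=12)
long_call = build_call_schedule(n_pairs=12, phase_len=5)
```

Under these assumptions both designs fill a call of 120 seconds: 5 pairs of 12 seconds phases, or 12 pairs of 5 seconds phases.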
Figure 1: Structure of the speech activity silence pair
Figure 2: Structure of the call with 5 recurrences for one side
Figure 3: Structure of the call 12 recurrences
Figure 4: Structure of the call with 5 recurrences and alternating speech activity
6 Call quality on a per sample basis
In this clause a mathematical model is proposed with which the call quality of a voice call can be estimated.
6.1 Evaluation of the samples
The evaluation of the individual samples is made by end-to-end speech quality measurements. They can be either
evaluated by Listening Only tests according to ITU-T Recommendation P.800 [i.6] or by objective prediction of those
scores, e.g. by the current ITU-T Recommendation P.862.1 [i.2]. The use of objective prediction allows the application
of the proposed model in automated network evaluation tools.
6.2 Mathematical modelling of the call quality
The desired result of the calculation is a MOS value considering the entire call in its structure. A mathematical model is
necessary to aggregate the individual MOS values to one value. Two important effects are taken into account: the
"recency effect" and the effect of a very bad sample in a call.
6.2.1 Impact of bad samples towards the end of a call
The impact of degradations that occur towards the end of a call is considered in the so called "recency effect". The
closer a degradation is to the end of a conversation, the stronger its impact on the overall rating of the
entire call. In the chosen call structure the speech samples are numbered from 1 to n. The weighting is made with an
individual parameter $a_i$, the weighting factor for each sample. A mathematical model here is:

$$MOS_{RE} = \frac{\sum_{i=1}^{n} a_i \, MOS_i}{\sum_{i=1}^{n} a_i}$$

If the time between the end of the last sample containing speech and the middle of sample i is $t_i$, then for
samples with $t_i < 19$ the weighting factor is as follows (n is positive and needs to be between 5 and 12):

$$a_i = \frac{1}{2} \cdot \frac{19 - t_i}{19} + \frac{1}{2}$$

For $t_i \geq 19$ the weighting factor is constant with $a_i = 1/2$. This formula represents the increasing importance of a
sample for the general impression the closer it is located to the end. The sample in this calculation represents the
speech activity part in figures 1 to 5. The time between the samples is the silence.
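The weighting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names recency_weight and mos_recency are hypothetical, times are taken to be in seconds, and the check that n lies between 5 and 12 is left out for brevity.

```python
from typing import Sequence


def recency_weight(t: float) -> float:
    """Weighting factor a_i for a sample whose middle lies t seconds
    before the end of the last sample containing speech."""
    if t < 0:
        raise ValueError("t must not be negative")
    if t >= 19:
        return 0.5  # constant weight for samples far from the end
    return 0.5 * (19 - t) / 19 + 0.5


def mos_recency(mos: Sequence[float], times: Sequence[float]) -> float:
    """Recency-weighted average MOS_RE of the per-sample scores."""
    if len(mos) != len(times) or not mos:
        raise ValueError("need one time value per MOS score")
    weights = [recency_weight(t) for t in times]
    return sum(a * m for a, m in zip(weights, mos)) / sum(weights)
```

The weight rises linearly from 0,5 at 19 seconds before the end to 1 at the very end of the last speech sample, so a sample of constant quality leaves MOS_RE unchanged while a late degradation pulls it down more than an early one.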
6.2.2 Impact of a single very bad sample
The correlation can be significantly improved by additionally taking into account the worst sample of the call.
Empirical evidence shows that one very bad sample strongly deteriorates the impression, in addition to the effect of its
temporal occurrence; it therefore also needs to be taken into account. The model is extended to include the worst sample
in the call:

$$MOS_{SpQ-C} = MOS_{RE} - 0{,}3 \left( \overline{MOS} - \min(MOS_i) \right)$$

Here $\overline{MOS}$ is the arithmetic mean of the per-sample scores, as used in the calculation example in clause 6.2.5.
6.2.3 Applicability of the mathematical model
The formula is developed for conversations with a length between 60 seconds and 120 seconds containing 5 to 12
utterances per analysed direction and with sample and pause lengths of 5 seconds to 12 seconds each.
6.2.4 Validation of the formula
The formula has been validated with modelled conversations with various lengths and different speech sample lengths
in German and English. The scores predicted by the formula show a significant gain in correlation with the subjectively
obtained scores for the Call Quality in comparison with the linear averaging for all tested scenarios.
NOTE: The studies differ in the test groups (e.g. few expert listeners in annex A) and in the test material (different
distortion patterns), hence the range of correlations. See the annexes for details.
Table 1

                                                 Study "Annex B" (English)      Study "Annex C" (German)       Study "Annex A" (German)
                                                 5 seconds samples              5 seconds to 6 seconds samples 12 seconds samples
Call length                                      120 seconds    60 seconds      120 seconds    60 seconds      120 seconds
Lin. Average with MOS-LQS (RMSE)                 92 % (0,66)    88 % (0,63)     83 % (0,51)    85 % (0,49)     57 % (0,84)
CallQuality model with MOS-LQS (RMSE)            98 % (0,21)    97 % (0,22)     93 % (0,31)    94 % (0,26)     84 % (0,37)
CallQuality model with MOS-LQO P.862.1 (RMSE)    97 % (0,32)    96 % (0,33)     84 % (0,42)    89 % (0,35)     80 % (0,43)
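Table 1 reports a correlation and an RMSE between predicted and subjectively obtained call scores. The following sketch shows one way such figures can be computed. It assumes the linear (Pearson) correlation coefficient; the function names and the score lists are made-up illustrations, not data from the studies.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear (Pearson) correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between predicted and target scores."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical per-call scores, for illustration only.
predicted_scores = [4.1, 3.6, 3.2, 2.9, 3.8]
subjective_scores = [4.0, 3.0, 2.4, 2.1, 3.5]

correlation = pearson(predicted_scores, subjective_scores)
error = rmse(predicted_scores, subjective_scores)
```

The two figures complement each other: a predictor can correlate well with the subjective scores and still be systematically too optimistic, which shows up in the RMSE rather than in the correlation.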
6.2.5 Calculation Example
Assume that a conversation with seven 5-second utterances is measured (the white blocks below marked 1 to 7).
Between each utterance there are 5-second (grey) blocks with silence. If there is speech in the other direction it has to
be treated independently. The total length of the measurement is thus 65 seconds.
The speech quality is measured for each of the seven speech samples, and shown in the figure. As can be seen the
quality is high in the beginning of the call (4,0), then drops down to 2,0 and ends at a better level (3,8).
Figure 5: Structure of the call with seven recurrences for one side
According to the weighting formula from clause 6.2.1, the following sample times and weighting factors should be used
for this scenario:
$t_1$ = 62,5 seconds    $a_1$ = 0,5
$t_2$ = 52,5 seconds    $a_2$ = 0,5
$t_3$ = 42,5 seconds    $a_3$ = 0,5
$t_4$ = 32,5 seconds    $a_4$ = 0,5
$t_5$ = 22,5 seconds    $a_5$ = 0,5
$t_6$ = 12,5 seconds    $a_6$ = 0,6711
$t_7$ = 2,5 seconds     $a_7$ = 0,9342
Using these factors, the rest of the calculation is done using the formulas in clauses 6.2.1 and 6.2.2:
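$$MOS_{RE} = \frac{0{,}5 \, (4{,}0 + 2{,}7 + 2{,}0 + 3{,}9 + 3{,}7) + 0{,}6711 \times 3{,}6 + 0{,}9342 \times 3{,}8}{5 \times 0{,}5 + 0{,}6711 + 0{,}9342} = 3{,}4385$$

$$\overline{MOS} = \frac{4{,}0 + 2{,}7 + 2{,}0 + 3{,}9 + 3{,}7 + 3{,}6 + 3{,}8}{7} = 3{,}3857$$

$$\min(MOS_i) = 2{,}0$$

$$MOS_{SpQ-C} = MOS_{RE} - 0{,}3 \left( \overline{MOS} - \min(MOS_i) \right) = 3{,}4385 - 0{,}3 \, (3{,}3857 - 2{,}0) = 3{,}0228$$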
7 Conclusion
The perceived speech quality is not a simple aggregation (average) of the rated samples in a call. The experimental
evidence shows that the impact of degraded speech is not simply outweighed by a longer stretch of good or acceptable
listening quality. For single calls the temporal structure of the call should be considered. Lower listening quality
towards the end of a call has a stronger impact on the overall rating of a call than degraded parts at the beginning.
With the formula presented in clause 6.2.2 it is possible to estimate the perceived (subjective) speech quality of a call
for each side on the basis of (objectively or subjectively) rated samples.
Annex A:
Empirical Study from March 2002 on the perceived call
quality: PESQ-mobil
This annex presents an excerpt of the study "Ergebnisbericht (Study) Berkom 'PESQ-mobil'" (in German), J. Berger,
T-Systems [i.8]. The study covers a wider scope than the evaluation of a model for predicting Speech
Quality per Call; this annex focuses on the topics related to the present document. Because the annex refers to the
study from 2002, the formulas given there and cited in this annex are not up to date with regard to the current version
of the present document.
A.1 Test concept and speech recordings
A.1.1 Test description of the overall project
In automatic measurement systems for speech quality evaluation in practical use, short speech samples (4 seconds to
8 seconds) are transferred over a telephone connection and evaluated with an algorithm. At the end of every call several
measured speech quality values are available, which are usually averaged. These measured quality results are used to
emulate the assessment of a person listening in a dialogue situation and thereby to describe the overall quality of a
telephone call. This overall quality of the complete call is called Speech Quality per Call (SpQ-C).
Problems with this usage:
• The measurement cycle is shorter than an average real phone call.
• The measurement result is based on "speech samples", which are restricted in variability and phonetics
because of their short length.
• Due to different and time varying quality conditions of the connection during a measurement cycle, an average
of these single speech samples is only of limited use for the prediction of the Speech Quality per Call.
The speech quality assessment a person gives after a phone call is strongly shaped by the time at which a
possible distortion occurs. The influence of the different quality states during a call on the overall result reflects both
the time difference between a quality state and the time of the assessment and the loss of semantic content. It can be
assumed that a distortion at the beginning of the call is already forgotten at its end.
To evaluate this effect, a listening situation as natural as possible was designed and test persons assessed the
experienced listening quality. The task of this investigation is to model the assessment of a longer
conversation with varying listening quality. Therefore a conversation was modelled by a series of single "speech
samples". The assessment of the complete modelled conversation at its end by human listeners forms the reference for
the model. These "target values" for the Speech Quality per Call are to be emulated by a weighted average of short
term scores as they could also be derived by an instrumental measurement method. Here it is assumed that this
instrumental method is able to assess a static quality like a human listener.
The intention was to find a mathematical description for the consecutive speech parts in order to calculate an
overall quality score that emulates the assessments of the test persons. The method to develop a model for prediction
of Speech Quality per Call by means of instrumental measures can be divided into three steps:
1) Modelling and assessing simulated conversations in a subjective test (gaining the "target values").
2) Assessing short parts of the conversation (single samples, "per sample scores") subjectively and developing a
model to predict the "target values" by processing these single scores.
3) Replacing the subjective per-sample scores by instrumentally obtained scores in the model obtained in step 2). Here
ITU-T Recommendation P.862 [i.1] was used.
The listening test used samples designed to partly cover the distortions expected in UMTS or VoIP. Distortions of
longer duration and the accumulated occurrence of short distortions are of particular interest.
A.2 Design of an auditory test methodology to assess the
speech material
A.2.1 Structure of the quality assessment
The quality assessment with auditory tests with test persons is separated into two parts:
1) Simulation of a conversation.
2) Assessment of shorter conversation parts without personal activity.
For this study eight employees from T-Systems Nova, Berkom were invited. Before the test started, all persons were
tested for normal hearing. The 6 men and 2 women were aged between 21 years and 45 years and were native German
speakers. All invited employees worked in the quality and acceptance department, so the test environment was well
known to them and they had no problems with their tasks or with the way they had to give their
assessments. None of them had taken part in the development of this test.
A.2.2 Simulation of a conversation
A typical speech situation is a dialogue between two persons, so the situation is divided into parts with listening
activity and parts with speech activity. It is assumed that both persons are interested in the content. For that reason
typical contents of telephone conversations were chosen (e.g. a request for a rental car).
Such a modelled conversation is realized as a series of shorter "utterances", which are separated by pauses for
interaction but are logically connected with regard to the content of the presentation. Instead of speech activity of
their own, the test persons perform a content oriented task as the interaction (e.g. keyword spotting). The speech
material is constructed in such a way that 4 breaks are possible. After each replay of the whole simulated conversation
the test person is asked for an assessment of the complete simulated call.
Experiment 1 corresponds to the automatic test methodology in half duplex. The speech material used consists of
5 speech parts (samples), which correspond to the utterances of one party. The design of the speech material is shown in
figure A.1. After a 12 seconds speech sample there is a 12 seconds pause during which the test person had to perform a
content related task. At the end of this experiment a score for the Speech Quality per Call is obtained.
[Figure: speech samples alternating with pauses (content oriented task), followed by a speech quality assessment on the scale excellent / good / fair / poor / bad]
Figure A.1: Schematic presentation of the speech situation assessment
A.2.3 Assessment on an individual per-sample basis
In the second experiment the test persons listened to the short conversational parts (samples, 12 seconds in length),
which were replayed in random order. This means that the different parts were individually presented and assessed. This
scenario corresponds to an automatic test situation with only uplink or downlink speech samples of a short length. This
matches a simplified test according to ITU-T Recommendation P.800 [i.6] with short speech samples. At the end of
this test, "per-sample" scores for each individual part of the simulated conversations, averaged for each sample, were
available.
A.2.4 Distortion types for the voice transmission
The focus of this research is on the influence of time variable transmission faults on the perceived speech quality at
the end of the call. It is assumed that the time between the distortion and the assessment, together with the
intensity and length of the distortion, has the strongest influence. Based on this, distortion patterns were designed,
which are shown in figure A.2. Each pattern consists of five speech samples and reflects the temporal structure of the
simulated conversation. A distinction was made between distortions perceptible over the complete sample (such as
vocoders) and "bursty" distortions such as interruptions.
[Figure: one panel per distortion pattern, each plotting quality over the speech samples]
Figure A.2: The temporal structure of the ten quality patterns
All these examples are spoken by two different speakers and have different content.
A.2.5 Structure of the speech material
Four modelled conversation examples of longer duration were created, each consisting of a series of five individual
parts (speech samples) that model a real telephone situation; two of them were actually used in the investigation.
Of interest in this evaluation is the influence of the time distance between the occurrence of the distortion and the end of
the transmission ("recency effect") on the overall quality scored at the end. Experiments have shown that the gradient of
the influence decreases with the time distance from the assessment. At larger distances the gradient approaches zero (the
influence of the distortion is constant). The band of 50 seconds to 90 seconds before the end of the transmission is of
interest in this evaluation. This means that the simulated dialogues should have at least this length.
The speech samples used for this auditory evaluation should resemble a natural telephone situation, e.g. renting a car.
Each conversation is constructed out of 5 samples of 12 seconds to 13 seconds with active
speech. After each 12 seconds part, a pause of 12 seconds length is inserted. This results in an overall transmission
length of 110 seconds.
Speech activity
The speech material used in this evaluation consists of small parts of a conversation called speech samples. This means
that one person is speaking and the other one is listening. The term Text describes the content (e.g. car rental), the
subparts 1.1, 1.2, etc. describe the individual phrases in this context. A phrase, spoken by a speaker, forms a
12 seconds speech sample. The speech activity of the simulated dialogues is shown in table A.1.
Table A.1: Activity of speech samples

Speech part                     Speech activity    Speech part                   Speech activity
Text 1.1: female 2, Sample 1    88 %               Text 2.1: male 2, Sample 1    94 %
Text 1.2: female 2, Sample 2    96 %               Text 2.2: male 2, Sample 2    90 %
Text 1.3: female 2, Sample 3    93 %               Text 2.3: male 2, Sample 3    92 %
Text 1.4: female 2, Sample 4    85 %               Text 2.4: male 2, Sample 4    94 %
Text 1.5: female 2, Sample 5    92 %               Text 2.5: male 2, Sample 5    93 %
Together with the implemented pauses an overall speech activity of about 50 % is reached.
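The figure of about 50 % can be checked with a short calculation. The sketch below takes the per-sample activities from table A.1 and assumes a nominal sample length of 12 seconds and the overall transmission length of 110 seconds stated above; the function and variable names are hypothetical.

```python
from typing import Sequence

# Per-sample speech activity from table A.1, as fractions.
female_activity = [0.88, 0.96, 0.93, 0.85, 0.92]
male_activity = [0.94, 0.90, 0.92, 0.94, 0.93]

SAMPLE_LEN_S = 12.0   # nominal sample length (assumption, samples are 12 s to 13 s)
CALL_LEN_S = 110.0    # overall transmission length given in the text


def overall_activity(per_sample: Sequence[float],
                     sample_len: float, call_len: float) -> float:
    """Fraction of the whole simulated call that contains active speech."""
    active_seconds = sum(a * sample_len for a in per_sample)
    return active_seconds / call_len


female_overall = overall_activity(female_activity, SAMPLE_LEN_S, CALL_LEN_S)
male_overall = overall_activity(male_activity, SAMPLE_LEN_S, CALL_LEN_S)
```

Under these assumptions both conversations come out within about one percentage point of 50 %.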
A.2.6 Quality of the speech material
The speech material is transmitted over test calls in a live network. First the material is transmitted over a
connection with the best available call quality in order to achieve the best speech quality in a real network. Then the
connection is influenced to reduce the speech quality. The material is degraded in a way that it covers all necessary
quality states for this test. This test requires the whole range from excellent/good to bad.
A.2.7 Results
In the first part of the test, the test persons listened once to the simulated dialogues (all 10 fault patterns of every
speaker (see clause A.2.6)). The result is an average over 8 individual assessments. The scores for the overall (per call)
quality obtained in the auditory experiment are shown in figure A.3.
[Figure: overall (per call) score, MOS-LQS with Ci95 on a scale from 1.0 to 5.0, for male patterns 1 to 10 and female patterns 1 to 10]
Figure A.3: Auditory MOS "per call" per pattern
In the second part the test persons listened to the separated speech samples twice during the test. This means that the
MOS values represent the average of 16 individual assessments. Here the Ci95 is smaller due to the higher number of
individual results (16) and a smaller inter-individual deviation in the scores.
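The effect of the number of assessments on the Ci95 can be illustrated as follows. The sketch assumes that the Ci95 is the half-width of a 95 % confidence interval computed with the normal approximation (factor 1,96); the study does not state how it was calculated, and the votes below are made-up values on the five-point scale.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci95(votes: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, half-width of the 95 % confidence interval).

    Assumption: normal approximation with the sample standard deviation.
    """
    n = len(votes)
    mos = statistics.mean(votes)
    half_width = 1.96 * statistics.stdev(votes) / math.sqrt(n)
    return mos, half_width


# Hypothetical votes: 8 listeners, each sample presented once or twice.
first_pass = [4, 3, 4, 5, 3, 4, 4, 3]
both_passes = first_pass * 2  # 16 individual assessments

mos_8, ci_8 = mos_with_ci95(first_pass)
mos_16, ci_16 = mos_with_ci95(both_passes)
```

With the same spread of votes, doubling the number of assessments leaves the MOS unchanged and narrows the interval, since the half-width scales with one over the square root of the number of assessments.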
Because of the two separate tests, both an integrative overall quality assessment and an individual speech part
assessment exist.
A.3 Modelling the overall quality mathematically on basis
of the MOS-values
A.3.1 Modelling of Speech Quality by averaging per-sample
scores
Figure A.4 plots the simple arithmetical average of the auditory MOS assessments of the individual speech samples against
the overall quality assessment (Speech Quality per Call). It can easily be seen that a pure average is not suitable
for predicting the Speech Quality per Call.
[Figure: arithmetical average of the per-sample MOS (vertical axis, 1.0 to 5.0) plotted against the overall quality, i.e. the target value (horizontal axis, 1.0 to 5.0); the points for Pattern 1 and Pattern 3 are labelled]
Figure A.4: Arithmetical average of the MOS assessment of the
individual speech parts to the overall quality assessment
Only in the case of static quality over the complete "call", modelled by patterns 1, 2 and 3 in clause A.2.4, does
simple averaging give reliable results. For varying quality, the arithmetical average seems to be too optimistic for the
prediction of Speech Quality per Call.
The linear correlation coefficient is about 57 %. It follows that the arithmetical average should not be used
for describing the Speech Quality per Call.
In the scenarios in which a quality drop within one speech part occurs, the overall quality is below the average. A
possible reason could be that the overall assessment is disproportionately influenced by a strong quality drop in a
longer speech presentation. This degradation is
...







