SIST ETS 300 580-6 E1:2003
European digital cellular telecommunications system (Phase 2); Voice Activity Detection (VAD) (GSM 06.32)
SLOVENIAN STANDARD
SIST ETS 300 580-6 E1:2003
01-December-2003
European digital cellular telecommunications system (Phase 2); Voice Activity Detection (VAD) (GSM 06.32)
This Slovenian standard is identical to: ETS 300 580-6 Edition 1
ICS:
33.070.50 Global System for Mobile Communication (GSM)
SIST ETS 300 580-6 E1:2003 en
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
EUROPEAN TELECOMMUNICATION STANDARD
ETS 300 580-6
September 1994
Source: ETSI TC-SMG Reference: GSM 06.32
ICS: 33.060.30
Key words: European digital cellular telecommunications system, Global System for Mobile communications (GSM)
European digital cellular telecommunications system (Phase 2);
Voice Activity Detection (VAD)
(GSM 06.32)
ETSI
European Telecommunications Standards Institute
ETSI Secretariat
Postal address: F-06921 Sophia Antipolis CEDEX - FRANCE
Office address: 650 Route des Lucioles - Sophia Antipolis - Valbonne - FRANCE
X.400: c=fr, a=atlas, p=etsi, s=secretariat - Internet: secretariat@etsi.fr
Tel.: +33 92 94 42 00 - Fax: +33 93 65 47 16
Copyright Notification: No part may be reproduced except as authorized by written permission. The copyright and the
foregoing restriction extend to reproduction in all media.
© European Telecommunications Standards Institute 1994. All rights reserved.
New presentation - see History box
ETS 300 580-6: September 1994 (GSM 06.32 version 4.0.5)
Whilst every care has been taken in the preparation and publication of this document, errors in content,
typographical or otherwise, may occur. If you have comments concerning its accuracy, please write to
"ETSI Editing and Committee Support Dept." at the address shown on the title page.
Contents
Foreword
0.1 Scope
0.2 Normative references
0.3 Definitions and abbreviations
1 General
2 Functional description
2.1 Overview and principles of operation
2.2 Algorithm description
2.2.1 Adaptive filtering and energy computation
2.2.2 ACF averaging
2.2.3 Predictor values computation
2.2.4 Spectral comparison
2.2.5 Periodicity detection
2.2.6 Threshold adaptation
2.2.7 VAD decision
2.2.8 VAD hangover addition
3 Computational details
3.1 Adaptive filtering and energy computation
3.2 ACF averaging
3.3 Predictor values computation
3.3.1 Schur recursion to compute reflection coefficients
3.3.2 Step-up procedure to obtain the aav1[0..8]
3.3.3 Computation of the rav1[0..8]
3.4 Spectral comparison
3.5 Periodicity detection
3.6 Threshold adaptation
3.7 VAD decision
3.8 VAD hangover addition
3.9 Periodicity updating
4 Digital test sequences
4.1 Test configuration
4.2 Test sequences
Annex 1 (informative): Simplified block filtering operation
Annex 2 (informative): Description of digital test sequences
A2.1 Test sequences
A2.2 File format description
Annex 3 (informative): VAD performance
History
Foreword
This European Telecommunication Standard (ETS) has been produced by the Special Mobile Group
(SMG) Technical Committee (TC) of the European Telecommunications Standards Institute (ETSI).
This ETS specifies the Voice Activity Detection (VAD) for the European digital cellular telecommunications
system (Phase 2).
This ETS corresponds to GSM technical specification GSM 06.32, version 4.0.5.
The specification from which this ETS has been derived was originally based on CEPT documentation,
hence the presentation of this ETS may not be entirely in accordance with the ETSI/PNE rules.
Reference is made within this ETS to GSM Technical Specifications (GSM-TSs) (NOTE).
NOTE: TC-SMG has produced documents which give the technical specifications for the
implementation of the European digital cellular telecommunications system. Historically,
these documents have been identified as GSM Technical Specifications (GSM-TS).
These TSs may have subsequently become I-ETSs (Phase 1), or ETSs (Phase 2),
whilst others may become ETSI Technical Reports (ETRs). GSM-TSs are, for editorial
reasons, still referred to in GSM ETSs.
0.1 Scope
This technical specification specifies the voice activity detector (VAD) to be used in the Discontinuous
Transmission (DTX) as described in GSM 06.31. It also specifies the test methods to be used to verify
that a VAD complies with the technical specification.
The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations or Base
Station Systems.
0.2 Normative references
This ETS incorporates by dated and undated reference, provisions from other publications. These
normative references are cited at the appropriate places in the text and the publications are listed
hereafter. For dated references, subsequent amendments to or revisions of any of these publications apply
to this ETS only when incorporated in it by amendment or revision. For undated references, the latest
edition of the publication referred to applies.
[1] GSM 01.04 (ETR 100): "European digital cellular telecommunication system
(Phase 2); Definitions, abbreviations and acronyms".
[2] GSM 06.10 (ETS 300 580-2): "European digital cellular telecommunication
system (Phase 2); Full rate speech transcoding".
[3] GSM 06.12 (ETS 300 580-4): "European digital cellular telecommunication
system (Phase 2); Comfort noise aspect for full rate speech traffic channels".
[4] GSM 06.31 (ETS 300 580-5): "European digital cellular telecommunication
system (Phase 2); Discontinuous Transmission (DTX) for full rate speech traffic
channel".
0.3 Definitions and abbreviations
Definitions and abbreviations used in this specification are listed in GSM 01.04.
1 General
The function of the VAD is to indicate whether each 20ms frame produced by the speech encoder contains
speech or not. The output is a binary flag which is used by the TX DTX handler defined in GSM 06.31.
The technical specification is organised as follows:
Section 2 describes the principles of operation of the VAD.
In section 3, the computational details necessary for the fixed point implementation of the VAD algorithm
are given. This section uses the same notation as used for computational details in GSM 06.10.
The verification of the VAD is based on the use of digital test sequences. Section 4 defines the input and
output signals and the test configuration, whereas the detailed description of the test sequences is
contained in Annex 2.
The performance of the VAD algorithm is characterised by the amount of audible speech clipping it
introduces and the percentage activity it indicates. These characteristics for the VAD defined in this
technical specification have been established by extensive testing under a wide range of operating
conditions. The results are summarised in Annex 3.
2 Functional description
The purpose of this section is to give the reader an understanding of the principles of operation of the
VAD, whereas the detailed description is given in section 3. In case of discrepancy between the two
descriptions, the detailed description of section 3 shall prevail.
In the following subsections of section 2, a Pascal programming type of notation has been used to
describe the algorithm.
2.1 Overview and principles of operation
The function of the VAD is to distinguish between noise with speech present and noise without speech
present. The biggest difficulty for detecting speech in a mobile environment is the very low speech/noise
ratios which are often encountered. The accuracy of the VAD is improved by using filtering to increase the
speech/noise ratio before the decision is made.
For a mobile environment, the worst speech/noise ratios are encountered in moving vehicles. It has been
found that the noise is relatively stationary for quite long periods in a mobile environment. It is therefore
possible to use an adaptive filter with coefficients obtained during noise, to remove much of the vehicle
noise.
The VAD is basically an energy detector. The energy of the filtered signal is compared with a threshold;
speech is indicated whenever the threshold is exceeded.
The noise encountered in mobile environments may be constantly changing in level. The spectrum of the
noise can also change, and varies greatly over different vehicles. Because of these changes the VAD
threshold and adaptive filter coefficients must be constantly adapted. To give reliable detection the
threshold must be sufficiently above the noise level to avoid noise being identified as speech but not so far
above it that low level parts of speech are identified as noise. The threshold and the adaptive filter
coefficients are only updated when speech is not present. It is, of course, potentially dangerous for a VAD
to update these values on the basis of its own decision. This adaptation therefore only occurs when the
signal seems stationary in the frequency domain but does not have the pitch component inherent in voiced
speech and information tones.
A further mechanism is used to ensure that low level noise (which is often not stationary over long periods)
is not detected as speech. Here, an additional fixed threshold is used.
A VAD hangover period is used to eliminate mid-burst clipping of low level speech. Hangover is only added
to speech-bursts which exceed a certain duration to avoid extending noise spikes.
2.2 Algorithm description
The block diagram of the VAD algorithm is shown in figure 2-1. The individual blocks are described in the
following sections. ACF and N are calculated in the speech encoder.
[Block diagram not reproduced. Blocks: adaptive filtering and energy computation, VAD decision, VAD hangover addition, periodicity detection, threshold adaptation, ACF averaging, predictor values computation, spectral comparison. Signals: ACF, N, pvad, vvad, vad, rvad, thvad, ptch, stat, av0, av1, rav1.]
Figure 2-1: Functional block diagram of the VAD
The global variables shown in the block diagram are described as follows:
- ACF are autocorrelation coefficients which are calculated in the speech encoder defined in GSM
06.10 (section 3.1.4, see also Annex 1). The inputs to the speech encoder are 16 bit 2's
complement numbers, as described in GSM 06.10, section 4.2.0.
- av0 and av1 are averaged ACF vectors.
- rav1 are autocorrelated predictor values obtained from av1.
- rvad are the autocorrelated predictor values of the adaptive filter.
- N is the long term predictor lag value which is obtained every subsegment in the speech coder
defined in GSM 06.10.
- ptch indicates whether the signal has a steady periodic component.
- pvad is the energy in the current frame of the input signal after filtering.
- thvad is an adaptive threshold.
- stat indicates spectral stationarity.
- vvad indicates the VAD decision before hangover is added.
- vad is the final VAD decision with hangover included.
2.2.1 Adaptive filtering and energy computation
The energy pvad is computed as follows:
$$ \mathrm{pvad} = \mathrm{rvad}[0]\,\mathrm{ACF}[0] + 2 \sum_{i=1}^{8} \mathrm{rvad}[i]\,\mathrm{ACF}[i] $$
This corresponds to performing an 8th order block filtering on the input samples to the speech encoder,
after zero offset compensation and pre-emphasis. This is explained in Annex 1.
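The energy computation above can be sketched in a few lines of Python. This is a floating-point illustration only, not the bit-exact fixed-point procedure of section 3.1; the function name `filtered_energy` and the example ACF vector are assumptions made for the sketch, while the rvad values are the initial values listed in table 2-4.

```python
def filtered_energy(rvad, acf):
    """Return pvad = rvad[0]*ACF[0] + 2 * sum over i=1..8 of rvad[i]*ACF[i]."""
    if len(rvad) != 9 or len(acf) != 9:
        raise ValueError("rvad and acf must each hold 9 values (lags 0..8)")
    cross_terms = sum(r * a for r, a in zip(rvad[1:], acf[1:]))
    return rvad[0] * acf[0] + 2 * cross_terms


# Initial rvad values from table 2-4; the ACF vector is invented for illustration.
rvad_init = [6, -4, 1, 0, 0, 0, 0, 0, 0]
acf_example = [1000, 900, 700, 500, 300, 100, 0, -50, -80]
pvad_example = filtered_energy(rvad_init, acf_example)
```

Because only nine multiply-accumulate operations are needed per frame, the filtered energy is obtained without filtering the individual samples.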
2.2.2 ACF averaging
Spectral characteristics of the input signal have to be obtained using blocks that are larger than one 20ms
frame. This is done by averaging the autocorrelation values for several consecutive frames. This averaging
is given by the following equations:
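$$ \mathrm{av0}\{n\}[i] = \sum_{j=0}^{\mathrm{frames}-1} \mathrm{ACF}\{n-j\}[i], \qquad i = 0, \ldots, 8 $$
$$ \mathrm{av1}\{n\}[i] = \mathrm{av0}\{n-\mathrm{frames}\}[i], \qquad i = 0, \ldots, 8 $$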
Where n represents the current frame, n-1 represents the previous frame etc. The values of constants are
given in table 2-1.
Table 2-1. Constants and variables for ACF averaging
======================================================
Constant    Value    Variable           Initial value
------------------------------------------------------
frames      4        previous ACF's,
                     av0 & av1          All set to 0
======================================================
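The averaging of section 2.2.2 can be illustrated with a small stateful helper. This is a sketch under the assumption that plain Python lists stand in for the fixed-point vectors; the class name `AcfAverager` and its method names are hypothetical, while the constant frames = 4 and the all-zero initial state come from table 2-1.

```python
from collections import deque

FRAMES = 4  # constant "frames" from table 2-1


class AcfAverager:
    """Keeps the history needed to produce av0{n} and av1{n} = av0{n-frames}."""

    def __init__(self):
        zero = [0] * 9
        # Last FRAMES ACF vectors, newest last, all initialised to zero.
        self.acf_history = deque([zero] * FRAMES, maxlen=FRAMES)
        # Last FRAMES + 1 av0 vectors, so the oldest entry is av0{n-frames}.
        self.av0_history = deque([zero] * (FRAMES + 1), maxlen=FRAMES + 1)

    def update(self, acf):
        """Feed the ACF[0..8] of the current frame; return (av0, av1)."""
        self.acf_history.append(list(acf))
        av0 = [sum(frame[i] for frame in self.acf_history) for i in range(9)]
        self.av0_history.append(av0)
        av1 = self.av0_history[0]
        return av0, av1
```

With a constant input, av0 reaches its steady value after four frames and av1 follows four frames later, which shows the delay between the two averages.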
2.2.3 Predictor values computation
The filter predictor values aav1 are obtained from the autocorrelation values av1 according to the equation:
$$ a = R^{-1} p $$
where:
- -
R := |av1[0],av1[1],av1[2],av1[3],av1[4],av1[5],av1[6],av1[7]|
|av1[1],av1[0],av1[1],av1[2],av1[3],av1[4],av1[5],av1[6]|
|av1[2],av1[1],av1[0],av1[1],av1[2],av1[3],av1[4],av1[5]|
|av1[3],av1[2],av1[1],av1[0],av1[1],av1[2],av1[3],av1[4]|
|av1[4],av1[3],av1[2],av1[1],av1[0],av1[1],av1[2],av1[3]|
|av1[5],av1[4],av1[3],av1[2],av1[1],av1[0],av1[1],av1[2]|
|av1[6],av1[5],av1[4],av1[3],av1[2],av1[1],av1[0],av1[1]|
|av1[7],av1[6],av1[5],av1[4],av1[3],av1[2],av1[1],av1[0]|
- -
and:
- - - -
p := |av1[1]| a := |aav1[1]|
|av1[2]| |aav1[2]|
|av1[3]| |aav1[3]|
|av1[4]| |aav1[4]|
|av1[5]| |aav1[5]|
|av1[6]| |aav1[6]|
|av1[7]| |aav1[7]|
|av1[8]| |aav1[8]|
- - - -
aav1[0] := -1
av1 is used in preference to av0 as av0 may contain speech.
The autocorrelated predictor values rav1 are then obtained:
$$ \mathrm{rav1}[i] = \sum_{k=0}^{8-i} \mathrm{aav1}[k]\,\mathrm{aav1}[k+i], \qquad i = 0, \ldots, 8 $$
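The two steps of section 2.2.3 can be sketched as follows. This is a floating-point illustration that solves R a = p by ordinary Gaussian elimination; that solver is a stand-in chosen for readability, whereas section 3.3 of the specification obtains the coefficients with a Schur recursion followed by a step-up procedure in fixed-point arithmetic. The function names `solve_linear` and `predictor_values` are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def predictor_values(av1):
    """Return (aav1[0..8], rav1[0..8]) for an averaged ACF vector av1[0..8]."""
    # R is the 8x8 symmetric Toeplitz matrix with R[i][j] = av1[|i - j|].
    R = [[av1[abs(i - j)] for j in range(8)] for i in range(8)]
    p = av1[1:9]
    aav1 = [-1.0] + solve_linear(R, p)  # aav1[0] := -1
    # Autocorrelation of the coefficient sequence, k running from 0 to 8 - i.
    rav1 = [sum(aav1[k] * aav1[k + i] for k in range(9 - i)) for i in range(9)]
    return aav1, rav1
```

As a check, an input of the form av1[k] = 0.5^k (a first-order autoregressive shape) gives aav1[1] = 0.5 and all higher coefficients zero.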
2.2.4 Spectral comparison
The spectra represented by the autocorrelated predictor values rav1 and the averaged autocorrelation
values av0 are compared using the distortion measure dm defined below. This measure is used to produce
a boolean value stat every 20ms, as given by these equations:
$$ \mathrm{dm} = \frac{\mathrm{rav1}[0]\,\mathrm{av0}[0] + 2 \sum_{i=1}^{8} \mathrm{rav1}[i]\,\mathrm{av0}[i]}{\mathrm{av0}[0]} $$
difference := |dm - lastdm|
lastdm := dm
stat := difference < thresh
The values of constants and initial values are given in table 2-2.
Table 2-2. Constants and variables for spectral comparison
=====================================================
Constant: Value: Variable: Initial value:
-----------------------------------------------------
thresh 0.05 lastdm 0
=====================================================
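The spectral comparison can be sketched as a small stateful object that remembers lastdm between frames. This is a floating-point illustration with a hypothetical class name `SpectralComparator`; the handling of av0[0] = 0 is an assumption added to avoid a division by zero and is not taken from the text above.

```python
THRESH = 0.05  # constant "thresh" from table 2-2


class SpectralComparator:
    """Produces the boolean stat once per 20 ms frame."""

    def __init__(self):
        self.lastdm = 0.0  # initial value from table 2-2

    def update(self, rav1, av0):
        if av0[0] == 0:
            # Assumption for this sketch only: treat a zero-energy average as dm = 0.
            dm = 0.0
        else:
            cross_terms = sum(r * a for r, a in zip(rav1[1:], av0[1:]))
            dm = (rav1[0] * av0[0] + 2 * cross_terms) / av0[0]
        difference = abs(dm - self.lastdm)
        self.lastdm = dm
        return difference < THRESH
```

Two identical consecutive frames give a difference of zero, so stat becomes true on the second frame.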
2.2.5 Periodicity detection
The frequency spectrum of mobile noise is relatively stationary over quite long periods. The Inverse Filter
Autocorrelated Predictor coefficients of the adaptive filter rvad are only updated when this stationarity is
detected. Vowel sounds and information tones, however, also have this stationarity. They can be excluded by
detecting their periodicity using the long term predictor lag values (Nj), which are obtained
every subsegment from the speech codec defined in GSM 06.10. Consecutive lag values are compared.
Cases in which one lag value is a factor of the other are catered for; cases in which both lag
values merely share a common factor are not. The latter case is not important for speech input, but it means
that this method of periodicity detection may fail for some sine waves. The boolean variable ptch is updated
every 20ms and is true when periodicity is detected. It is calculated according to the following equation:
ptch := oldlagcount + veryoldlagcount >= nthresh
The following operations are done after the VAD decision, once the current LTP lag values (N0 .. N3)
are available; this reduces the delay of the VAD decision. (N{-1} = N3 of previous segment.)
lagcount := 0
for j := 0 to 3 do
begin
smallag := maximum(Nj,N{j-1}) mod minimum(Nj,N{j-1})
if minimum(smallag,minimum(Nj,N{j-1})-smallag) < lthresh
then increment(lagcount)
end
veryoldlagcount := oldlagcount
oldlagcount := lagcount
The values of constants and initial values are given in table 2-3.
Table 2-3. Constants and variables for periodicity detection
=======================================================
Constant:   Value:   Variable:          Initial value:
-------------------------------------------------------
lthresh     2        oldlagcount        0
nthresh     4        veryoldlagcount    0
                     N3                 40
=======================================================
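The periodicity logic of section 2.2.5 translates almost line for line into Python. The sketch below keeps the ordering described above: ptch is read before the VAD decision, and the counters are updated afterwards once the four lag values of the frame are known. The class name `PeriodicityDetector` is hypothetical, and the code assumes strictly positive lag values (an assumption, in line with the initial N3 of 40).

```python
LTHRESH = 2  # table 2-3
NTHRESH = 4  # table 2-3


class PeriodicityDetector:
    def __init__(self):
        self.oldlagcount = 0
        self.veryoldlagcount = 0
        self.last_lag = 40  # initial N3 from table 2-3

    def ptch(self):
        """True when a steady periodic component has been detected."""
        return self.oldlagcount + self.veryoldlagcount >= NTHRESH

    def update(self, lags):
        """Update the counters with the four LTP lags N0..N3 of the frame."""
        lagcount = 0
        previous = self.last_lag
        for lag in lags:
            small, large = min(lag, previous), max(lag, previous)
            smallag = large % small
            # Counts lags that are equal or near-multiples of each other.
            if min(smallag, small - smallag) < LTHRESH:
                lagcount += 1
            previous = lag
        self.last_lag = previous
        self.veryoldlagcount = self.oldlagcount
        self.oldlagcount = lagcount
```

Because ptch sums the counts of the two previous frames, a single strongly periodic frame keeps ptch true for two subsequent decisions.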
2.2.6 Threshold adaptation
A check is made every 20ms to determine whether the VAD decision threshold (thvad) should be changed.
This adaptation is carried out according to the flowchart shown in figure 2-2. The constants used are given
in table 2-4.
Adaptation takes place in two different situations: firstly whenever ACF[0] is very low and secondly
whenever there is a very high probability that speech is not present.
In the first case, the threshold is adapted if the energy of the input signal is less than pth. The threshold is
set to plev without carrying out any further tests because at these very low levels the effect of the signal
quantization makes it impossible to obtain reliable results from these tests.
In the second case, the decision threshold (thvad) and the adaptive filter coefficients (rvad) are only
updated with the rav1 values when the signal is stationary and has no periodic component. In this situation
there is a very high probability that speech is not present. The stationarity is detected in the frequency
domain, by calculating the spectral difference using consecutive averaged ACF values. If this spectral
difference changes very little over a certain number of frames (adp), and the signal does not have a
periodic component inherent in voiced speech and information tones, then adaptation occurs.
The step-size by which the threshold is adapted is not constant but a proportion of the current value
(determined by constants dec and inc). The adaptation begins by tentatively multiplying the threshold
by a factor of (1-1/dec). If the new threshold is higher than or equal to pvad times fac, the
threshold needed to be decreased and it is left at this new lower level. If, on the other hand, the new
threshold is less than pvad times fac, the threshold needed either to be increased or to be kept
constant. In this case it is set to pvad times fac, unless this would mean multiplying it by more than a factor
of (1+1/inc), in which case it is multiplied by a factor of (1+1/inc). The threshold is never allowed to be
greater than pvad+margin.
Table 2-4. Constants and variables for threshold adaptation
======================================================
Constant:   Value:      Variable:     Initial value:
------------------------------------------------------
pth         300000      adaptcount    0
plev        800000      thvad         1000000
fac         3.0         rvad[0]       6
adp         8           rvad[1]       -4
inc         16          rvad[2]       1
dec         32          rvad[3] to
margin      80000000    rvad[8]       All 0
======================================================
[Flow diagram not reproduced. The sequence of tests and assignments it contains is:]
1) If ACF[0] < pth, then thvad := plev and the procedure ends.
2) If (stat and not ptch) is false, then adaptcount := 0 and the procedure ends; otherwise increment adaptcount.
3) If adaptcount > adp is false, the procedure ends.
4) thvad := thvad - thvad / dec
5) If thvad < pvad * fac, then thvad := min(thvad + thvad / inc, pvad * fac)
6) If thvad > pvad + margin, then thvad := pvad + margin
7) rvad := rav1 and adaptcount := adp + 1, then the procedure ends.
Figure 2-2: Flow diagram for threshold adaptation
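The threshold adaptation of section 2.2.6 can be sketched in floating point as shown below. This is an illustration under the assumption that ordinary floats replace the pseudo-floating point arithmetic of section 3.6; the class name `ThresholdAdapter` and the argument names are hypothetical, while the constants and initial values are those of table 2-4.

```python
PTH = 300000
PLEV = 800000
FAC = 3.0
ADP = 8
INC = 16
DEC = 32
MARGIN = 80000000


class ThresholdAdapter:
    def __init__(self):
        self.adaptcount = 0
        self.thvad = 1000000.0
        self.rvad = [6, -4, 1, 0, 0, 0, 0, 0, 0]

    def update(self, acf0, pvad, rav1, stat, ptch):
        """Run one 20 ms adaptation check; may change thvad and rvad."""
        if acf0 < PTH:
            # Very low input energy: force the threshold, skip all other tests.
            self.thvad = float(PLEV)
            return
        if not (stat and not ptch):
            self.adaptcount = 0
            return
        self.adaptcount += 1
        if self.adaptcount <= ADP:
            return
        # Tentatively lower the threshold, then raise it again if it fell too low.
        self.thvad -= self.thvad / DEC
        if self.thvad < pvad * FAC:
            self.thvad = min(self.thvad + self.thvad / INC, pvad * FAC)
        if self.thvad > pvad + MARGIN:
            self.thvad = pvad + MARGIN
        self.rvad = list(rav1)
        self.adaptcount = ADP + 1
```

Resetting adaptcount to adp + 1 after each adaptation keeps the counter from growing without bound while adaptation continues on every following stationary, non-periodic frame.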
2.2.7 VAD decision
Prior to hangover the VAD decision condition is:
vvad := pvad > thvad
2.2.8 VAD hangover addition
VAD hangover is only added to bursts of speech greater than or equal to burstconst blocks. The boolean
variable vad indicates the decision of the VAD with hangover included. The values of the constants are
given in table 2-5. The hangover algorithm is as follows:
if vvad then increment(burstcount) else burstcount := 0
if burstcount >= burstconst then
begin
hangcount := hangconst;
burstcount := burstconst
end
vad := vvad or (hangcount >= 0)
if hangcount >= 0 then decrement(hangcount)
Table 2-5. Constants and variables for VAD hangover addition
==========================================================
Constant: Value: Variable: Initial value:
----------------------------------------------------------
burstconst 3 burstcount 0
hangconst 5 hangcount -1
==========================================================
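The hangover algorithm above maps directly onto a small Python class. The sketch follows the Pascal-style listing statement by statement; only the class name `Hangover` is an assumption.

```python
BURSTCONST = 3  # table 2-5
HANGCONST = 5   # table 2-5


class Hangover:
    def __init__(self):
        self.burstcount = 0
        self.hangcount = -1

    def update(self, vvad):
        """Return the final vad flag for one frame, given vvad."""
        if vvad:
            self.burstcount += 1
        else:
            self.burstcount = 0
        if self.burstcount >= BURSTCONST:
            # A long enough burst re-arms the hangover counter.
            self.hangcount = HANGCONST
            self.burstcount = BURSTCONST
        vad = vvad or self.hangcount >= 0
        if self.hangcount >= 0:
            self.hangcount -= 1
        return vad
```

Feeding three active frames followed by silence yields five additional frames with vad true, whereas a single isolated active frame receives no hangover.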
3 Computational details
In the next paragraphs, the detailed description of the VAD algorithm follows the preceding high level
description. This detailed description is divided into nine sections which, except for the last one,
correspond to the blocks of figure 2-1 in the high level description of the VAD algorithm.
Those sections are:
1) Adaptive filtering and energy computation;
2) ACF averaging;
3) Predictor values computation;
4) Spectral comparison;
5) Periodicity detection;
6) Threshold adaptation;
7) VAD decision;
8) VAD hangover addition;
9) Periodicity updating.
The VAD algorithm takes as input the following variables of the RPE-LTP encoder (see the detailed
description of the RPE-LTP encoder GSM 06.10):
- L_ACF[0..8], autocorrelation function ( GSM 06.10/4.2.4);
- scalauto, scaling factor to compute the L_ACF[0..8] ( GSM 06.10/4.2.4);
- Nc, LTP lag (one for each sub-segment, GSM 06.10/4.2.11).
So four Nc values are needed for the VAD algorithm.
The VAD computation can start as soon as the L_ACF[0..8] and scalauto variables are known. This means
that the VAD computation can take place after part 4.2.4 of GSM 06.10 (Autocorrelation) of the LPC
analysis section of the RPE-LTP encoder. This scheme reduces the delay in producing the VAD information.
The periodicity updating, which is included in section 2.2.5, is done after the processing of the current
speech encoder frame.
All the arithmetic operations and names of the variables follow the RPE-LTP detailed description. To
increase the precision within the fixed point implementation, a pseudo-floating point representation of some
variables is used. This applies to the following variables (and related constants) of the VAD algorithm:
pvad: Energy of filtered signal;
thvad: Threshold of the VAD decision;
acf0: Energy of input signal.
For the representation of these variables, two integers (16 bits) are needed:
- one for the exponent (e_pvad, e_thvad, e_acf0);
- one for the mantissa (m_pvad, m_thvad, m_acf0).
The value e_pvad is the exponent of the lowest power of 2 that is greater than or equal to the actual value
of pvad, and m_pvad is an integer which is always greater than or equal to 16384 (normalized mantissa).
This means that the pvad value is equal to:
$$ \mathrm{pvad} = 2^{\mathrm{e\_pvad}} \times (\mathrm{m\_pvad} / 32768) $$
This scheme guarantees a large dynamic range for the pvad value and always keeps a precision of 16
bits. All the comparisons are easy to make by comparing the exponents of two variables and the VAD
algorithm needs only one pseudo-floating point addition. All the computations related to the pseudo-floating
point variables require very simple 16 or 32 bits arithmetic operations defined in the detailed description of
the RPE-LTP encoder. This pseudo-floating point arithmetic is only used in section 3.1 and 3.6.
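The pseudo-floating point format can be illustrated with a pair of conversion helpers. This is a sketch that assumes positive integer inputs and truncation of the mantissa; the function names are hypothetical and the code does not reproduce the bit-exact operations of sections 3.1 and 3.6.

```python
def to_pseudo_float(value):
    """Return (exponent, mantissa) with value ~= 2**exponent * mantissa / 32768."""
    if value <= 0:
        raise ValueError("only positive integers are handled in this sketch")
    exponent = value.bit_length()  # smallest e with value < 2**e
    if exponent >= 15:
        mantissa = value >> (exponent - 15)  # truncates the low-order bits
    else:
        mantissa = value << (15 - exponent)
    return exponent, mantissa  # mantissa lies in 16384..32767


def from_pseudo_float(exponent, mantissa):
    """Convert an (exponent, mantissa) pair back to an ordinary float."""
    return 2.0 ** exponent * (mantissa / 32768)


def greater_than(a, b):
    """Compare two normalized pairs: exponent first, mantissa as tie-break."""
    return a > b
```

For example, the initial thvad value of 1000000 from table 2-4 converts to the pair (20, 31250) and back to 1000000 without loss. Because the mantissa is normalized, comparing two values reduces to comparing exponents and, when they are equal, mantissas.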
Table 3-1 gives a list of all the variables of the VAD algorithm that must be initialized in the reset procedure
and kept in memory for processing the subsequent frame of the RPE- LTP encoder. The types (16 or 32
bits) and initial values of all these variables are clearly indicated and their related sub-section is also
mentioned. The bit exact implemen
...

