SIST ETS 300 965 E1:2003
Digital cellular telecommunications system; Voice Activity Detector (VAD) (GSM 06.32 version 5.0.1)
This European Telecommunication Standard (ETS) specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) as described in GSM 06.31. It also specifies the test methods to be used to verify that a VAD complies with the technical specification. The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations (MSs) or Base Station Systems (BSSs).
Digitalni celični telekomunikacijski sistem – Detektor aktivnega govora (VAD) (GSM 06.32, različica 5.0.1)
SLOVENIAN STANDARD
SIST ETS 300 965 E1:2003
01-December-2003
Digitalni celični telekomunikacijski sistem – Detektor aktivnega govora (VAD) (GSM 06.32, različica 5.0.1)
Digital cellular telecommunications system; Voice Activity Detector (VAD) (GSM 06.32 version 5.0.1)
This Slovenian standard is identical to: ETS 300 965 Edition 1
ICS:
33.070.50 Global System for Mobile Communication (GSM)
SIST ETS 300 965 E1:2003 en
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
EUROPEAN TELECOMMUNICATION STANDARD
ETS 300 965
May 1997
Source: ETSI TC-SMG Reference: DE/SMG-110632Q
ICS: 33.020
Key words: Digital cellular telecommunications system, Global System for Mobile communications (GSM)
GLOBAL SYSTEM FOR
MOBILE COMMUNICATIONS
Digital cellular telecommunications system;
Voice Activity Detector (VAD)
(GSM 06.32 version 5.0.1)
ETSI
European Telecommunications Standards Institute
ETSI Secretariat
Postal address: F-06921 Sophia Antipolis CEDEX - FRANCE
Office address: 650 Route des Lucioles - Sophia Antipolis - Valbonne - FRANCE
X.400: c=fr, a=atlas, p=etsi, s=secretariat - Internet: secretariat@etsi.fr
Tel.: +33 4 92 94 42 00 - Fax: +33 4 93 65 47 16
Copyright Notification: No part may be reproduced except as authorized by written permission. The copyright and the
foregoing restriction extend to reproduction in all media.
© European Telecommunications Standards Institute 1997. All rights reserved.
ETS 300 965 (GSM 06.32 version 5.0.1): May 1997
Whilst every care has been taken in the preparation and publication of this document, errors in content,
typographical or otherwise, may occur. If you have comments concerning its accuracy, please write to
"ETSI Editing and Committee Support Dept." at the address shown on the title page.
Contents
Foreword
0.1 Scope
0.2 Normative references
0.3 Abbreviations
1 General
2 Functional description
2.1 Overview and principles of operation
2.2 Algorithm description
2.2.1 Adaptive filtering and energy computation
2.2.2 acf averaging
2.2.3 Predictor values computation
2.2.4 Spectral comparison
2.2.5 Periodicity detection
2.2.6 Information tone detection
2.2.7 Threshold adaptation
2.2.8 VAD decision
2.2.9 VAD hangover addition
3 Computational details
3.1 Adaptive filtering and energy computation
3.2 ACF averaging
3.3 Predictor values computation
3.3.1 Schur recursion to compute reflection coefficients
3.3.2 Step-up procedure to obtain the aav1[0..8]
3.3.3 Computation of the rav1[0..8]
3.4 Spectral comparison
3.5 Periodicity detection
3.6 Threshold adaptation
3.7 VAD decision
3.8 VAD hangover addition
3.9 Periodicity updating
3.10 Tone detection
3.10.1 Windowing
3.10.2 Auto-correlation
3.10.3 Computation of the reflection coefficients
3.10.4 Filter coefficient calculation
3.10.5 Pole Frequency Test
3.10.6 Prediction gain test
4 Digital test sequences
4.1 Test configuration
4.2 Test sequences
Annex A.1 (informative): Simplified block filtering operation
A.1 Simplified block filtering operation
A.2 Description of digital test sequences
A.2.1 Test sequences
A.2.2 File format description
A.3 VAD performance
A.4 Pole frequency calculation
Annex B (normative): Test sequences diskette
History
Foreword
This European Telecommunication Standard (ETS) has been produced by the Special Mobile Group
(SMG) Technical Committee (TC) of the European Telecommunications Standards Institute (ETSI).
This ETS specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX)
for the digital cellular telecommunications system.
A 3,5 inch diskette (annex B) is attached to the back cover of this ETS; the diskette contains test
sequences, as described in clause A.2.
Diskette 1 ETS 300 965, annex B: Test sequences for the GSM Full Rate speech codec;
Test sequences files *.inp, *.cod, *.vad.
The specification from which this ETS has been derived was originally based on CEPT documentation,
hence the presentation of this ETS may not be entirely in accordance with the ETSI/PNE rules.
Transposition dates
Date of adoption: 18 April 1997
Date of latest announcement of this ETS (doa): 31 August 1997
Date of latest publication of new National Standard or endorsement of this ETS (dop/e): 28 February 1998
Date of withdrawal of any conflicting National Standard (dow): 28 February 1998
0.1 Scope
This European Telecommunication Standard (ETS) specifies the Voice Activity Detector (VAD) to be used
in the Discontinuous Transmission (DTX) as described in GSM 06.31. It also specifies the test methods to
be used to verify that a VAD complies with the technical specification.
The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations (MSs) or Base
Station Systems (BSSs).
0.2 Normative references
This ETS incorporates by dated and undated reference, provisions from other publications. These
normative references are cited at the appropriate places in the text and the publications are listed
hereafter. For dated references, subsequent amendments to or revisions of any of these publications
apply to this ETS only when incorporated in it by amendment or revision. For undated references, the
latest edition of the publication referred to applies.
[1] GSM 01.04 (ETR 350): "Digital cellular telecommunications system (Phase 2+);
Abbreviations and acronyms".
[2] GSM 06.10 (ETS 300 961): "Digital cellular telecommunications system; Full
rate speech; Transcoding".
[3] GSM 06.12 (ETS 300 963): "Digital cellular telecommunications system; Full
rate speech; Comfort noise aspect for full rate speech traffic channels".
[4] GSM 06.31 (ETS 300 964): "Digital cellular telecommunications system; Full
rate speech; Discontinuous Transmission (DTX) for full rate speech traffic
channels".
0.3 Abbreviations
Abbreviations used in this ETS are listed in GSM 01.04 [1].
1 General
The function of the VAD is to indicate whether each 20 ms frame produced by the speech encoder
contains speech or not. The output is a binary flag which is used by the TX DTX handler defined in
GSM 06.31 [4].
The ETS is organized as follows:
Clause 2 describes the principles of operation of the VAD.
In clause 3, the computational details necessary for the fixed point implementation of the VAD algorithm
are given. This clause uses the same notation as used for computational details in GSM 06.10.
The verification of the VAD is based on the use of digital test sequences. Clause 4 defines the input and
output signals and the test configuration, whereas the detailed description of the test sequences is
contained in clause A.2.
The performance of the VAD algorithm is characterized by the amount of audible speech clipping it
introduces and the percentage activity it indicates. These characteristics for the VAD defined in this ETS
have been established by extensive testing under a wide range of operating conditions. The results are
summarized in clause A.3.
2 Functional description
The purpose of this clause is to give the reader an understanding of the principles of operation of the
VAD, whereas the detailed description is given in clause 3. In case of discrepancy between the two
descriptions, the detailed description of clause 3 shall prevail.
In the following subclauses of clause 2, a Pascal programming type of notation has been used to describe
the algorithm.
2.1 Overview and principles of operation
The function of the VAD is to distinguish between noise with speech present and noise without speech
present. The biggest difficulty for detecting speech in a mobile environment is the very low speech/noise
ratios which are often encountered. The accuracy of the VAD is improved by using filtering to increase the
speech/noise ratio before the decision is made.
For a mobile environment, the worst speech/noise ratios are encountered in moving vehicles. It has been
found that the noise is relatively stationary for quite long periods in a mobile environment. It is therefore
possible to use an adaptive filter with coefficients obtained during noise, to remove much of the vehicle
noise.
The VAD is basically an energy detector. The energy of the filtered signal is compared with a threshold;
speech is indicated whenever the threshold is exceeded.
The noise encountered in mobile environments may be constantly changing in level. The spectrum of the
noise can also change, and varies greatly over different vehicles. Because of these changes the VAD
threshold and adaptive filter coefficients must be constantly adapted. To give reliable detection the
threshold must be sufficiently above the noise level to avoid noise being identified as speech but not so far
above it that low level parts of speech are identified as noise. The threshold and the adaptive filter
coefficients are only updated when speech is not present. It is, of course, potentially dangerous for a VAD
to update these values on the basis of its own decision. This adaptation therefore only occurs when the
signal seems stationary in the frequency domain but does not have the pitch component inherent in voiced
speech. A tone detector is also used to prevent adaptation during information tones.
A further mechanism is used to ensure that low level noise (which is often not stationary over long
periods) is not detected as speech. Here, an additional fixed threshold is used.
A VAD hangover period is used to eliminate mid-burst clipping of low level speech. Hangover is only
added to speech-bursts which exceed a certain duration to avoid extending noise spikes.
2.2 Algorithm description
The block diagram of the VAD algorithm is shown in figure 2.1. The individual blocks are described in the
following subclauses. ACF, N and sof are calculated in the speech encoder.
[Figure 2.1 (block diagram, not reproduced). Blocks: adaptive filtering and energy computation; VAD
decision; VAD hangover addition; periodicity detection; threshold adaptation; tone detection; predictor
values computation; spectral comparison; ACF averaging. Signals: ACF, N, sof, pvad, vvad, vad, rvad,
thvad, ptch, stat, tone, rav1, av1, av0.]
Figure 2.1: Functional block diagram of the VAD
The global variables shown in the block diagram are described as follows:
- ACF are auto-correlation coefficients which are calculated in the speech encoder defined in
GSM 06.10 (subclause 3.1.4, see also clause A.1). The inputs to the speech encoder are 16 bit 2's
complement numbers, as described in GSM 06.10, subclause 4.2.0.
- av0 and av1 are averaged ACF vectors.
- rav1 are autocorrelated predictor values obtained from av1.
- rvad are the autocorrelated predictor values of the adaptive filter.
- N is the long term predictor lag value which is obtained every sub-segment in the speech coder
defined in GSM 06.10.
- ptch indicates whether the signal has a steady periodic component.
- sof is the offset compensated signal frame obtained in the speech coder defined in GSM 06.10.
- pvad is the energy in the current frame of the input signal after filtering.
- thvad is an adaptive threshold.
- stat indicates spectral stationarity.
- vvad indicates the VAD decision before hangover is added.
- vad is the final VAD decision with hangover included.
2.2.1 Adaptive filtering and energy computation
Pvad is computed as follows:
\[ P_{vad} = \mathrm{rvad}_0 \, \mathrm{acf}_0 + 2 \sum_{i=1}^{8} \mathrm{rvad}_i \, \mathrm{acf}_i \]
This corresponds to performing an 8th order block filtering on the input samples to the speech encoder,
after zero offset compensation and pre-emphasis. This is explained in clause A.1.
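As an illustration only, the energy formula above can be sketched in floating point. The function name
filtered_energy is hypothetical, and the sketch ignores the fixed point scaling that clause 3 specifies;
the initial rvad values are those of table 2.5.

```python
def filtered_energy(rvad, acf):
    """Pvad = rvad[0]*acf[0] + 2 * sum(rvad[i]*acf[i] for i = 1..8).

    Floating point sketch; the normative fixed point details are in clause 3.
    """
    if len(rvad) != 9 or len(acf) != 9:
        raise ValueError("expected 9 values (indices 0..8)")
    return rvad[0] * acf[0] + 2 * sum(r * a for r, a in zip(rvad[1:], acf[1:]))


# Initial adaptive filter values from table 2.5 (rvad[3..8] are all 0).
RVAD_INIT = [6, -4, 1, 0, 0, 0, 0, 0, 0]
```

With these initial values only acf[0], acf[1] and acf[2] contribute to the result.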
2.2.2 acf averaging
Spectral characteristics of the input signal have to be obtained using blocks that are larger than one 20 ms
frame. This is done by averaging the auto-correlation values for several consecutive frames. This
averaging is given by the following equations:
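\[ \mathrm{av0}_i\{n\} = \sum_{j=0}^{\mathrm{frames}-1} \mathrm{acf}_i\{n-j\}, \quad i = 0, \ldots, 8 \]
\[ \mathrm{av1}_i\{n\} = \mathrm{av0}_i\{n-\mathrm{frames}\}, \quad i = 0, \ldots, 8 \]
Where n represents the current frame, n-1 represents the previous frame etc. The values of constants
are given in table 2.1.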
Table 2.1: Constants and variables for ACF averaging
Constant   Value   Variable                    Initial value
frames     4       previous ACF's, av0 & av1   All set to 0
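The averaging equations can be sketched as a small state machine that is called once per 20 ms frame.
The class name AcfAverager and its layout are assumptions made for illustration; only the constant
frames = 4 and the zero initial values come from table 2.1.

```python
from collections import deque

FRAMES = 4  # table 2.1


class AcfAverager:
    """Keeps the last FRAMES ACF vectors and the last FRAMES av0 vectors."""

    def __init__(self):
        # All previous ACFs and av0 values start at zero (table 2.1).
        self.acf_hist = deque([[0] * 9 for _ in range(FRAMES)], maxlen=FRAMES)
        self.av0_hist = deque([[0] * 9 for _ in range(FRAMES)], maxlen=FRAMES)

    def update(self, acf):
        """Return (av0, av1) for the current frame n."""
        # av1{n} = av0{n - FRAMES}: the oldest stored av0, read before pushing.
        av1 = self.av0_hist[0]
        self.acf_hist.append(list(acf))
        # av0{n} = sum of the ACF vectors of frames n, n-1, ..., n-FRAMES+1.
        av0 = [sum(frame[i] for frame in self.acf_hist) for i in range(9)]
        self.av0_hist.append(av0)
        return av0, av1
```

Because av1 lags av0 by four frames, it only begins to reflect the input from the fifth call onwards.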
2.2.3 Predictor values computation
The filter predictor values aav1 are obtained from the auto-correlation values av1 according to the
equation:
\[ a = R^{-1} p \]
where:
- -
R := | av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6], av1[7] |
| av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6] |
| av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5] |
| av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4] |
| av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3] |
| av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2] |
| av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1] |
| av1[7], av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0] |
- -
and:
- - - -
p := |av1[1]| a := |aav1[1]|
|av1[2]| |aav1[2]|
|av1[3]| |aav1[3]|
|av1[4]| |aav1[4]|
|av1[5]| |aav1[5]|
|av1[6]| |aav1[6]|
|av1[7]| |aav1[7]|
|av1[8]| |aav1[8]|
- - - -
aav1[0] := -1
av1 is used in preference to av0 as av0 may contain speech.
The autocorrelated predictor values rav1 are then obtained:
\[ \mathrm{rav1}_i = \sum_{k=0}^{8-i} \mathrm{aav1}_k \, \mathrm{aav1}_{k+i}, \quad i = 0, \ldots, 8 \]
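A floating point sketch of this subclause follows. It solves R a = p with the Levinson-Durbin
recursion, which is swapped in here for readability; the contents list shows that clause 3 uses a
Schur recursion followed by a step-up procedure in fixed point. The function names are hypothetical.

```python
def predictor_values(av1, order=8):
    """Solve R a = p for a[1..order]; returns aav1[0..order] with aav1[0] = -1."""
    if av1[0] <= 0:
        raise ValueError("av1[0] must be positive")
    a = [0.0] * (order + 1)
    err = float(av1[0])  # prediction error energy at the current order
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input; leave the higher-order terms at zero
        acc = av1[m] - sum(a[k] * av1[m - k] for k in range(1, m))
        refl = acc / err
        prev = a[:]
        a[m] = refl
        for k in range(1, m):
            a[k] = prev[k] - refl * prev[m - k]
        err *= 1.0 - refl * refl
    a[0] = -1.0
    return a


def autocorrelate_predictor(aav1):
    """rav1[i] = sum(aav1[k] * aav1[k+i] for k = 0..8-i), i = 0..8."""
    n = len(aav1)
    return [sum(aav1[k] * aav1[k + i] for k in range(n - i)) for i in range(n)]
```

For an input whose auto-correlation decays as 0.5 to the power i, the solution has a single non-zero
coefficient aav1[1] = 0.5, which gives rav1[0] = 1.25 and rav1[1] = -0.5.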
2.2.4 Spectral comparison
The spectra represented by the autocorrelated predictor values rav1 and the averaged auto-correlation
values av0 are compared using the distortion measure dm defined below. This measure is used to
produce a Boolean value stat every 20 ms, as given by these equations:
\[ dm = \frac{\mathrm{rav1}_0 \, \mathrm{av0}_0 + 2 \sum_{i=1}^{8} \mathrm{rav1}_i \, \mathrm{av0}_i}{\mathrm{av0}_0} \]
difference := |dm - lastdm|
lastdm := dm
stat := difference < thresh
The values of constants and initial values are given in table 2.2.
Table 2.2: Constants and variables for spectral comparison
Constant Value Variable Initial value
thresh 0.05 lastdm 0
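The spectral comparison can be sketched as follows. The class name is hypothetical, and the handling
of av0[0] = 0 is an assumption made only to keep the sketch total; this subclause does not say what
happens in that case.

```python
THRESH = 0.05  # table 2.2


class SpectralComparator:
    def __init__(self):
        self.lastdm = 0.0  # initial value from table 2.2

    def update(self, rav1, av0):
        """Return the Boolean stat for the current frame."""
        if av0[0] == 0:
            dm = 0.0  # assumption: treat an all-zero av0 as zero distortion
        else:
            num = rav1[0] * av0[0] + 2 * sum(
                r * a for r, a in zip(rav1[1:], av0[1:])
            )
            dm = num / av0[0]
        difference = abs(dm - self.lastdm)
        self.lastdm = dm
        return difference < THRESH
```

Since stat depends on the change in dm between frames, the first frame after a spectral change reports
false and a repeated spectrum reports true.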
2.2.5 Periodicity detection
The frequency spectrum of mobile noise is relatively stationary over quite long periods. The Inverse Filter
Autocorrelated Predictor coefficients of the adaptive filter rvad are only updated when this stationarity is
detected. Vowel sounds, however, also have this stationarity, but can be excluded by detecting the
periodicity of these sounds using the long term predictor lag values (Nj), which are obtained every
sub-segment from the speech codec defined in GSM 06.10. Consecutive lag values are compared. Cases
in which one lag value is a factor of the other are catered for; cases in which both lag values
have a common factor are not. The latter case is not important for speech input, but this method of
periodicity detection may fail for some sine waves. The Boolean variable ptch is updated every 20 ms
and is true when periodicity is detected. It is calculated according to the following equation:
ptch := oldlagcount + veryoldlagcount >= nthresh
The following operations are done after the VAD decision and when the current LTP lag values (N0..N3)
are available; this reduces the delay of the VAD decision. (N{-1} = N3 of the previous segment.)
lagcount := 0
for j := 0 to 3 do
begin
smallag := maximum(Nj,N{j-1}) mod minimum(Nj,N{j-1})
if minimum(smallag,minimum(Nj,N{j-1})-smallag) < lthresh
then increment(lagcount)
end
veryoldlagcount := oldlagcount
oldlagcount := lagcount
The values of constants and initial values are given in table 2.3.
Table 2.3: Constants and variables for periodicity detection
Constant   Value   Variable          Initial value
lthresh    2       oldlagcount       0
nthresh    4       veryoldlagcount   0
                   N3                40
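The periodicity pseudocode can be sketched as follows. The class and method names are hypothetical;
the constants and initial values are those of table 2.3. The sketch assumes lag values are positive,
which holds for the LTP lags of GSM 06.10.

```python
LTHRESH = 2  # table 2.3
NTHRESH = 4  # table 2.3


class PeriodicityDetector:
    def __init__(self):
        self.oldlagcount = 0
        self.veryoldlagcount = 0
        self.prev_lag = 40  # initial N3 from table 2.3

    def ptch(self):
        """Evaluated before the VAD decision of the current frame."""
        return self.oldlagcount + self.veryoldlagcount >= NTHRESH

    def update(self, lags):
        """Called after the VAD decision with lags = (N0, N1, N2, N3)."""
        lagcount = 0
        prev = self.prev_lag
        for n in lags:
            lo, hi = min(n, prev), max(n, prev)
            smallag = hi % lo
            # Counts equal lags and lags where one is close to a multiple of the other.
            if min(smallag, lo - smallag) < LTHRESH:
                lagcount += 1
            prev = n
        self.prev_lag = prev
        self.veryoldlagcount = self.oldlagcount
        self.oldlagcount = lagcount
```

Splitting the work into ptch() and update() mirrors the text: the flag is read first, and the lag
counters are refreshed only once the lags of the current frame are available.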
2.2.6 Information tone detection
The tone flag is only evaluated in the downlink VAD. In the uplink VAD, tone detection is not performed
and tone = false.
Computation of the tone flag is complex. It is therefore evaluated after the processing of the current
speech encoder frame. In this way transmission of the speech or SID frame is not delayed.
Information tones and environmental noise can be classified by inspecting the short term prediction gain,
information tones resulting in higher prediction gains than environmental noise. Tones can therefore be
detected by comparing the prediction gain to a fixed threshold. By limiting the prediction gain calculation to
a fourth order analysis, information signals consisting of one or two tones can be detected whilst
minimizing the prediction gain for environmental noise.
The prediction gain decision is implemented by comparing the normalized prediction error with a
threshold. This measure is used to evaluate the Boolean variable tone every 20 ms. The signal is
classified as a tone if the prediction error is smaller than the threshold predth. This is equivalent to a
prediction gain threshold of 13,5 dB.
Mobile noise can contain very strong resonances at low frequencies, resulting in a high prediction gain. A
further test is therefore made to determine the pole frequency of a second order analysis of the signal
frame. The signal is classified as noise if the frequency of the pole is less than 385 Hz. The pole
frequency calculation is described in clause A.4.
The algorithm for detecting information tones is as follows:
tone := false
den := a[1]*a[1]
num := 4*a[2] - a[1]*a[1]
if ( num <= 0 )
    return
if (( a[1] < 0 ) AND ( num / den < freqth ))
    return
prederr := product for i = 1 to 4 of (1 - rc[i]*rc[i])
if (prederr < predth)
    tone := true
return
The values of the constants are given in table 2.4. The coefficients a[1..2] are transversal filter coefficients
calculated from rc[1..2]. The calculation of the reflection coefficients rc[1..4] is described below.
The offset compensated signal frame sof[0..159] is multiplied by the Hanning window to give the
windowed frame sofh[0..159]:
\[ \mathrm{sofh}_i = \mathrm{sof}_i \, \mathrm{hann}_i, \quad i = 0, \ldots, 159 \]
where
\[ \mathrm{hann}_i = 0.5 \left( 1 - \cos \frac{2 \pi i}{159} \right), \quad i = 0, \ldots, 159 \]
The auto-correlation acfh[0..4] of the windowed signal frame is then calculated:
\[ \mathrm{acfh}_k = \sum_{i=k}^{159} \mathrm{sofh}_i \, \mathrm{sofh}_{i-k}, \quad k = 0, \ldots, 4 \]
rc[1..4] are then calculated from acfh[0..4] using the Schur recursion described in the RPE-LTP codec.
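The windowing and auto-correlation steps can be sketched in floating point as below. The function
names are hypothetical, and the fixed point scaling of clause 3.10 is not modelled.

```python
import math


def hann_window(n=160):
    """hann[i] = 0.5 * (1 - cos(2*pi*i / (n - 1))), i = 0..n-1."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]


def windowed_autocorrelation(sof, max_lag=4):
    """acfh[k] = sum(sofh[i] * sofh[i-k] for i = k..len-1), k = 0..max_lag."""
    hann = hann_window(len(sof))
    sofh = [s * h for s, h in zip(sof, hann)]
    return [
        sum(sofh[i] * sofh[i - k] for i in range(k, len(sofh)))
        for k in range(max_lag + 1)
    ]
```

The window is zero at both ends of the frame, so the first and last samples do not contribute to acfh.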
Table 2.4: Constants for information tone detection.
Constant Value
freqth 0,0973
predth 0,0158
NOTE: Reflection coefficients are available in the RPE-LTP codec. However, they are
calculated after pre-emphasis using a rectangular window and do not give good tone
detection results.
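The tone decision can be sketched end to end from acfh[0..4]. Two points are assumptions of this
sketch and not statements from the text: the coefficients are obtained with a Levinson-Durbin
recursion in place of the Schur recursion of the RPE-LTP codec, and a[1..2] are taken as the
coefficients of the second order error filter 1 + a[1]z^-1 + a[2]z^-2. The function names are
hypothetical; freqth and predth are from table 2.4.

```python
FREQTH = 0.0973  # table 2.4
PREDTH = 0.0158  # table 2.4


def levinson(acf, order):
    """Return (a, rc): error filter a[0..order] with a[0] = 1, and rc[1..order]."""
    a = [1.0] + [0.0] * order
    rc = [0.0] * (order + 1)  # rc[0] is unused
    err = float(acf[0])
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (for example silence)
        acc = acf[m] + sum(a[k] * acf[m - k] for k in range(1, m))
        rc[m] = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + rc[m] * prev[m - k]
        a[m] = rc[m]
        err *= 1.0 - rc[m] * rc[m]
    return a, rc


def detect_tone(acfh):
    """Pole frequency test followed by the prediction gain test."""
    a, _ = levinson(acfh, 2)
    den = a[1] * a[1]
    num = 4 * a[2] - a[1] * a[1]
    if num <= 0:
        return False  # no complex pole pair
    if a[1] < 0 and num / den < FREQTH:
        return False  # pole below the frequency threshold: treat as noise
    _, rc = levinson(acfh, 4)
    prederr = 1.0
    for i in range(1, 5):
        prederr *= 1.0 - rc[i] * rc[i]
    return prederr < PREDTH
```

Under these assumptions an idealised 1 kHz tone is flagged, while white noise and a 200 Hz resonance
are rejected by the first and second tests respectively.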
2.2.7 Threshold adaptation
A check is made every 20 ms to determine whether the VAD decision threshold (thvad) should be
changed. This adaptation is carried out according to the flowchart shown in figure 2.2. The constants used
are given in table 2.5.
Adaptation takes place in two different situations: firstly whenever ACF[0] is very low and secondly
whenever there is a very high probability that speech and information tones are not present.
In the first case, the threshold is adapted if the energy of the input signal is less than pth. The threshold is
set to plev without carrying out any further tests because at these very low levels the effect of the signal
quantization makes it impossible to obtain reliable results from these tests.
In the second case, the decision threshold (thvad) and the adaptive filter coefficients (rvad) are only
updated with the rav1 values when there is a very high probability that speech and information tones are
not present. Adaptation occurs if the following conditions are met over a number (adp) of signal frames:
- Stationarity is detected in the frequency domain.
- The signal does not contain a periodic component.
- Information tones are not present.
The step-size by which the threshold is adapted is not constant but a proportion of the current value
(determined by constants dec and inc). The adaptation begins by experimentally multiplying the threshold
by a factor of (1-1/dec). If the new threshold is now higher than or equal to Pvad times fac then the
threshold needed to be decreased and it is left at this new lower level. If, on the other hand, the new
threshold level is less than Pvad times fac then the threshold either needed to be increased or kept
constant. In this case it is set to Pvad times fac unless this would mean multiplying it by more than a factor
of (1+1/inc) (in which case it is multiplied by a factor of (1+1/inc)). The threshold is never allowed to be
greater than Pvad+margin.
Table 2.5: Constants and variables for threshold adaptation
Constant   Value        Variable             Initial value
pth        300 000      adaptcount           0
plev       800 000      thvad                1 000 000
fac        3.0          rvad[0]              6
adp        8            rvad[1]              -4
inc        16           rvad[2]              1
dec        32           rvad[3] to rvad[8]   All 0
margin     80 000 000
[Figure 2.2 (flow diagram, rendered here as steps):]
BEGIN
  if ACF[0] < pth then
    thvad := plev (go to END)
  if stat and not ptch and not tone then
    increment(adaptcount)
  else
    adaptcount := 0 (go to END)
  if adaptcount > adp then
    begin
      thvad := thvad - thvad / dec
      if thvad < pvad * fac then
        thvad := minimum(thvad + thvad / inc, pvad * fac)
      if thvad > pvad + margin then
        thvad := pvad + margin
      rvad := rav1
      adaptcount := adp + 1
    end
END
Figure 2.2: Flow diagram for threshold adaptation
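The adaptation flow can be sketched in floating point as below. The class name is hypothetical; the
constants and initial values are those of table 2.5, and the order of the tests follows the flow
diagram and the description in this subclause.

```python
PTH = 300_000
PLEV = 800_000
FAC = 3.0
ADP = 8
INC = 16
DEC = 32
MARGIN = 80_000_000


class ThresholdAdapter:
    def __init__(self):
        self.adaptcount = 0
        self.thvad = 1_000_000.0
        self.rvad = [6, -4, 1, 0, 0, 0, 0, 0, 0]

    def update(self, acf0, pvad, stat, ptch, tone, rav1):
        """One 20 ms adaptation check; updates thvad, rvad and adaptcount."""
        if acf0 < PTH:
            self.thvad = PLEV  # very low input level: no further tests
            return
        if not (stat and not ptch and not tone):
            self.adaptcount = 0
            return
        self.adaptcount += 1
        if self.adaptcount <= ADP:
            return
        # Trial decrease, then a limited increase towards pvad * fac if needed.
        self.thvad -= self.thvad / DEC
        if self.thvad < pvad * FAC:
            self.thvad = min(self.thvad + self.thvad / INC, pvad * FAC)
        if self.thvad > pvad + MARGIN:
            self.thvad = pvad + MARGIN
        self.rvad = list(rav1)
        self.adaptcount = ADP + 1  # keep the counter from growing without bound
```

With adp = 8, the first adaptation happens on the ninth consecutive frame that is stationary,
non-periodic and tone-free.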
2.2.8 VAD decision
Prior to hangover the VAD decision condition is:
vvad := pvad > thvad
2.2.9 VAD hangover addition
VAD hangover is only added to bursts of speech greater than or equal to burstconst blocks. The Boolean
variable vad indicates the decision of the VAD with hangover included. The values of the constants are
given in table 2.6. The hangover algorithm is as follows:
if vvad then increment(burstcount) else burstcount := 0
if burstcount >= burstconst then
begin
hangcount := hangconst;
burstcount := burstconst
end
vad := vvad or (hangcount >= 0)
if hangcount >= 0 then decrement(hangcount)
Table 2.6: Constants and variables for VAD hangover addition
Constant Value Variable Initial value
burstconst 3 burstcount 0
hangconst 5 hangcount -1
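The hangover pseudocode translates directly into the following sketch. The class name is
hypothetical; the constants and initial values are those of table 2.6.

```python
BURSTCONST = 3
HANGCONST = 5


class Hangover:
    def __init__(self):
        self.burstcount = 0
        self.hangcount = -1

    def update(self, vvad):
        """Return vad (decision with hangover) for the current frame."""
        if vvad:
            self.burstcount += 1
        else:
            self.burstcount = 0
        if self.burstcount >= BURSTCONST:
            self.hangcount = HANGCONST
            self.burstcount = BURSTCONST
        vad = vvad or self.hangcount >= 0
        if self.hangcount >= 0:
            self.hangcount -= 1
        return vad
```

Feeding three speech frames followed by silence yields five extra frames with vad true, whereas an
isolated single-frame spike is not extended.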
3 Computational details
In the next paragraphs, the detailed description of the VAD algorithm follows the preceding high level
description. This detailed description is divided into ten sections related to the blocks of figure 2.1
(except periodicity updating) in the high level description of the VAD algorithm.
Those sections are:
1) Adaptive filtering and energy computation;
2) ACF averaging;
3) Predictor values computation;
4) Spectral comparis
...