Digital cellular telecommunications system (Phase 2+) (GSM); Full rate speech; Voice Activity Detector (VAD) for full rate speech traffic channels (GSM 06.32 version 5.0.3)

This European Telecommunication Standard (ETS) specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) as described in GSM 06.31. It also specifies the test methods to be used to verify that a VAD complies with the technical specification. The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations (MS)s or Base Station Systems (BSS)s.

Slovenian title (translated): Digital cellular telecommunications system (Phase 2+); Full rate speech; Voice Activity Detector (VAD) for full rate speech traffic channels (GSM 06.32, version 5.0.3)

General Information

Current Stage
6060 - National Implementation/Publication (Adopted Project)

ETS 300 965 E2:2003
English language
37 pages

Standards Content (Sample)

SLOVENIAN STANDARD
SIST ETS 300 965 E2:2003
01-December-2003

Digital cellular telecommunications system (Phase 2+) (GSM); Full rate speech; Voice Activity Detector (VAD) for full rate speech traffic channels (GSM 06.32 version 5.0.3)

This Slovenian standard is identical to: ETS 300 965 Edition 2
ICS: 33.070.50 Global System for Mobile Communication (GSM)
Language: en

2003-01. Slovenski inštitut za standardizacijo (Slovenian Institute for Standardization). Reproduction of this standard in whole or in part is not permitted.


EUROPEAN     ETS 300 965
TECHNICAL    April 1998
STANDARD     Second Edition

Source: SMG
Reference: RE/SMG-110632QR1
ICS: 33.020
Key words: Digital cellular telecommunications system, Global System for Mobile communications (GSM)

Digital cellular telecommunications system (Phase 2+); Full rate speech; Voice Activity Detector (VAD) for full rate speech traffic channels (GSM 06.32 version 5.0.3)

ETSI
European Telecommunications Standards Institute
ETSI Secretariat
Postal address: F-06921 Sophia Antipolis CEDEX - FRANCE
Office address: 650 Route des Lucioles - Sophia Antipolis - Valbonne - FRANCE
Internet: http://www.etsi.org
Tel.: +33 4 92 94 42 00 - Fax: +33 4 93 65 47 16

Copyright Notification: No part may be reproduced except as authorized by written permission. The copyright and the foregoing restriction extend to reproduction in all media.
© European Telecommunications Standards Institute 1998. All rights reserved.

ETS 300 965 (GSM 06.32 version 5.0.3): April 1998

Whilst every care has been taken in the preparation and publication of this document, errors in content, typographical or otherwise, may occur. If you have comments concerning its accuracy, please write to "ETSI Editing and Committee Support Dept." at the address shown on the title page.

Contents

Foreword
0.1 Scope
0.2 Normative references
0.3 Abbreviations
1 General
2 Functional description
2.1 Overview and principles of operation
2.2 Algorithm description
2.2.1 Adaptive filtering and energy computation
2.2.2 ACF averaging
2.2.3 Predictor values computation
2.2.4 Spectral comparison
2.2.5 Periodicity detection
2.2.6 Information tone detection
2.2.7 Threshold adaptation
2.2.8 VAD decision
2.2.9 VAD hangover addition
3 Computational details
3.1 Adaptive filtering and energy computation
3.2 ACF averaging
3.3 Predictor values computation
3.3.1 Schur recursion to compute reflection coefficients
3.3.2 Step-up procedure to obtain the aav1[0..8]
3.3.3 Computation of the rav1[0..8]
3.4 Spectral comparison
3.5 Periodicity detection
3.6 Threshold adaptation
3.7 VAD decision
3.8 VAD hangover addition
3.9 Periodicity updating
3.10 Tone detection
3.10.1 Windowing
3.10.2 Auto-correlation
3.10.3 Computation of the reflection coefficients
3.10.4 Filter coefficient calculation
3.10.5 Pole Frequency Test
3.10.6 Prediction gain test
4 Digital test sequences
4.1 Test configuration
4.2 Test sequences
Annex A (informative)
A.1 Simplified block filtering operation
A.2 Description of digital test sequences
A.2.1 Test sequences
A.2.2 File format description
A.3 VAD performance
A.4 Pole frequency calculation
Annex B (normative): Test sequences diskette
History

Foreword

This second edition European Telecommunication Standard (ETS) has been produced by the Special Mobile Group (SMG) of the European Telecommunications Standards Institute (ETSI).

This ETS specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) for the digital cellular telecommunications system.

A 3,5 inch diskette (annex B) is attached to the back cover of this ETS. The diskette contains test sequences, as described in clause A.2.

Diskette 1: ETS 300 965, annex B: Test sequences for the GSM Full Rate speech codec; test sequence files *.inp, *.cod, *.vad.

The specification from which this ETS has been derived was originally based on CEPT documentation, hence the presentation of this ETS may not be entirely in accordance with the ETSI/PNE Rules.

Transposition dates

Date of adoption of this ETS:                                                               3 April 1998
Date of latest announcement of this ETS (doa):                                              30 June 1998
Date of latest publication of new National Standard or endorsement of this ETS (dop/e):    31 December 1998
Date of withdrawal of any conflicting National Standard (dow):                              31 December 1998


0.1 Scope

This European Telecommunication Standard (ETS) specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) as described in GSM 06.31. It also specifies the test methods to be used to verify that a VAD complies with the technical specification.

The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations (MS)s or Base Station Systems (BSS)s.

0.2 Normative references

This ETS incorporates by dated and undated reference, provisions from other publications. These normative references are cited at the appropriate places in the text and the publications are listed hereafter. For dated references, subsequent amendments to or revisions of any of these publications apply to this ETS only when incorporated in it by amendment or revision. For undated references, the latest edition of the publication referred to applies.

[1] GSM 01.04 (ETR 350): "Digital cellular telecommunications system (Phase 2+); Abbreviations and acronyms".
[2] GSM 06.10 (ETS 300 961): "Digital cellular telecommunications system (Phase 2+); Full rate speech; Transcoding".
[3] GSM 06.12 (ETS 300 963): "Digital cellular telecommunications system (Phase 2+); Full rate speech; Comfort noise aspect for full rate speech traffic channels".
[4] GSM 06.31 (ETS 300 964): "Digital cellular telecommunications system (Phase 2+); Full rate speech; Discontinuous Transmission (DTX) for full rate speech traffic channels".

0.3 Abbreviations

Abbreviations used in this ETS are listed in GSM 01.04 [1].

1 General

The function of the VAD is to indicate whether each 20 ms frame produced by the speech encoder contains speech or not. The output is a binary flag which is used by the TX DTX handler defined in GSM 06.31 [4].

The ETS is organized as follows:

Clause 2 describes the principles of operation of the VAD.

In clause 3, the computational details necessary for the fixed point implementation of the VAD algorithm are given. This clause uses the same notation as used for computational details in GSM 06.10.

The verification of the VAD is based on the use of digital test sequences. Clause 4 defines the input and output signals and the test configuration, whereas the detailed description of the test sequences is contained in clause A.2.

The performance of the VAD algorithm is characterized by the amount of audible speech clipping it introduces and the percentage activity it indicates. These characteristics for the VAD defined in this ETS have been established by extensive testing under a wide range of operating conditions. The results are summarized in clause A.3.

2 Functional description

The purpose of this clause is to give the reader an understanding of the principles of operation of the VAD, whereas the detailed description is given in clause 3. In case of discrepancy between the two descriptions, the detailed description of clause 3 shall prevail.

In the following subclauses of clause 2, a Pascal programming type of notation has been used to describe the algorithm.

2.1 Overview and principles of operation

The function of the VAD is to distinguish between noise with speech present and noise without speech present. The biggest difficulty for detecting speech in a mobile environment is the very low speech/noise ratios which are often encountered. The accuracy of the VAD is improved by using filtering to increase the speech/noise ratio before the decision is made.

For a mobile environment, the worst speech/noise ratios are encountered in moving vehicles. It has been found that the noise is relatively stationary for quite long periods in a mobile environment. It is therefore possible to use an adaptive filter with coefficients obtained during noise, to remove much of the vehicle noise.

The VAD is basically an energy detector. The energy of the filtered signal is compared with a threshold; speech is indicated whenever the threshold is exceeded.

The noise encountered in mobile environments may be constantly changing in level. The spectrum of the noise can also change, and varies greatly over different vehicles. Because of these changes the VAD threshold and adaptive filter coefficients must be constantly adapted. To give reliable detection the threshold must be sufficiently above the noise level to avoid noise being identified as speech but not so far above it that low level parts of speech are identified as noise. The threshold and the adaptive filter coefficients are only updated when speech is not present. It is, of course, potentially dangerous for a VAD to update these values on the basis of its own decision. This adaptation therefore only occurs when the signal seems stationary in the frequency domain but does not have the pitch component inherent in voiced speech. A tone detector is also used to prevent adaptation during information tones.

A further mechanism is used to ensure that low level noise (which is often not stationary over long periods) is not detected as speech. Here, an additional fixed threshold is used.

A VAD hangover period is used to eliminate mid-burst clipping of low level speech. Hangover is only added to speech-bursts which exceed a certain duration to avoid extending noise spikes.

2.2 Algorithm description

The block diagram of the VAD algorithm is shown in figure 2.1. The individual blocks are described in the following subclauses. ACF, N and sof are calculated in the speech encoder.

[Figure 2.1: Functional block diagram of the VAD. Blocks: Adaptive filtering and energy computation; ACF averaging; Predictor values computation; Spectral comparison; Periodicity detection; Tone detection; Threshold adaptation; VAD decision; VAD hangover addition. Signals: ACF, N, sof, av0, av1, rav1, rvad, pvad, thvad, stat, ptch, tone, vvad, vad.]

The global variables shown in the block diagram are described as follows:

- ACF are auto-correlation coefficients which are calculated in the speech encoder defined in GSM 06.10 (subclause 3.1.4, see also clause A.1). The inputs to the speech encoder are 16 bit 2's complement numbers, as described in GSM 06.10, subclause 4.2.0.
- av0 and av1 are averaged ACF vectors.
- rav1 are autocorrelated predictor values obtained from av1.
- rvad are the autocorrelated predictor values of the adaptive filter.
- N is the long term predictor lag value which is obtained every sub-segment in the speech coder defined in GSM 06.10.
- ptch indicates whether the signal has a steady periodic component.
- sof is the offset compensated signal frame obtained in the speech coder defined in GSM 06.10.
- pvad is the energy in the current frame of the input signal after filtering.
- thvad is an adaptive threshold.
- stat indicates spectral stationarity.
- vvad indicates the VAD decision before hangover is added.
- vad is the final VAD decision with hangover included.

2.2.1 Adaptive filtering and energy computation

Pvad is computed as follows:

$$ Pvad = rvad[0]\,acf[0] + 2\sum_{i=1}^{8} rvad[i]\,acf[i] $$

This corresponds to performing an 8th order block filtering on the input samples to the speech encoder, after zero offset compensation and pre-emphasis. This is explained in clause A.1.

2.2.2 ACF averaging

Spectral characteristics of the input signal have to be obtained using blocks that are larger than one 20 ms frame. This is done by averaging the auto-correlation values for several consecutive frames. This averaging is given by the following equations:

$$ av0\{n\}[i] = \sum_{j=0}^{frames-1} acf\{n-j\}[i]; \quad i = 0..8 $$

$$ av1\{n\}[i] = av0\{n-frames\}[i]; \quad i = 0..8 $$

Where n represents the current frame, n-1 represents the previous frame etc. The values of constants are given in table 2.1.

Table 2.1: Constants and variables for ACF averaging

Constant   Value   Variable                    Initial value
frames     4       previous ACF's, av0 & av1   All set to 0

2.2.3 Predictor values computation

The filter predictor values aav1 are obtained from the auto-correlation values av1 according to the equation:

$$ a = R^{-1} p $$

where:

$$ R = \begin{pmatrix}
av1[0] & av1[1] & av1[2] & av1[3] & av1[4] & av1[5] & av1[6] & av1[7] \\
av1[1] & av1[0] & av1[1] & av1[2] & av1[3] & av1[4] & av1[5] & av1[6] \\
av1[2] & av1[1] & av1[0] & av1[1] & av1[2] & av1[3] & av1[4] & av1[5] \\
av1[3] & av1[2] & av1[1] & av1[0] & av1[1] & av1[2] & av1[3] & av1[4] \\
av1[4] & av1[3] & av1[2] & av1[1] & av1[0] & av1[1] & av1[2] & av1[3] \\
av1[5] & av1[4] & av1[3] & av1[2] & av1[1] & av1[0] & av1[1] & av1[2] \\
av1[6] & av1[5] & av1[4] & av1[3] & av1[2] & av1[1] & av1[0] & av1[1] \\
av1[7] & av1[6] & av1[5] & av1[4] & av1[3] & av1[2] & av1[1] & av1[0]
\end{pmatrix} $$

and:

$$ p = \begin{pmatrix} av1[1] & av1[2] & \cdots & av1[8] \end{pmatrix}^{T}, \qquad
a = \begin{pmatrix} aav1[1] & aav1[2] & \cdots & aav1[8] \end{pmatrix}^{T}, \qquad
aav1[0] = -1 $$

av1 is used in preference to av0 as av0 may contain speech.

The autocorrelated predictor values rav1 are then obtained:

$$ rav1[i] = \sum_{k=0}^{8-i} aav1[k]\,aav1[k+i]; \quad i = 0..8 $$

2.2.4 Spectral comparison

The spectra represented by the autocorrelated predictor values rav1 and the averaged auto-correlation values av0 are compared using the distortion measure dm defined below. This measure is used to produce a Boolean value stat every 20 ms, as given by these equations:

$$ dm = \frac{rav1[0]\,av0[0] + 2\sum_{i=1}^{8} rav1[i]\,av0[i]}{av0[0]} $$

    difference = |dm - lastdm|
    lastdm = dm
    stat = difference < thresh

The values of constants and initial values are given in table 2.2.

Table 2.2: Constants and variables for spectral comparison

Constant   Value   Variable   Initial value
thresh     0.05    lastdm     0

2.2.5 Periodicity detection

The frequency spectrum of mobile noise is relatively stationary over quite long periods. The Inverse Filter Autocorrelated Predictor coefficients of the adaptive filter rvad are only updated when this stationarity is detected. Vowel sounds, however, also have this stationarity, but can be excluded by detecting the periodicity of these sounds using the long term predictor lag values (Nj) which are obtained every sub-segment from the speech codec defined in GSM 06.10. Consecutive lag values are compared. Cases in which one lag value is a factor of the other are catered for; however, cases in which both lag values have a common factor are not. This case is not important for speech input but this method of periodicity detection may fail for some sine waves. The Boolean variable ptch is updated every 20 ms and is true when periodicity is detected. It is calculated according to the following equation:

    ptch = oldlagcount + veryoldlagcount >= nthresh

The following operations are done after the VAD decision and when the current LTP lag values (N0..N3) are available; this reduces the delay of the VAD decision. (N{-1} = N3 of previous segment.)

    lagcount = 0
    for j = 0 to 3 do
        smallag = maximum(Nj,N{j-1}) mod minimum(Nj,N{j-1})
        if minimum(smallag,minimum(Nj,N{j-1})-smallag) < lthresh
            then increment(lagcount)
    veryoldlagcount = oldlagcount
    oldlagcount = lagcount

The values of constants and initial values are given in table 2.3.

Table 2.3: Constants and variables for periodicity detection

Constant   Value   Variable          Initial value
lthresh    2       oldlagcount       0
nthresh    4       veryoldlagcount   0
                   N3                40

2.2.6 Information tone detection

The tone flag is only evaluated in the downlink VAD. In the uplink VAD, tone detection is not performed and tone = false.

Computation of the tone flag is complex. It is therefore evaluated after the processing of the current speech encoder frame. In this way transmission of the speech or SID frame is not delayed.

Information tones and environmental noise can be classified by inspecting the short term prediction gain, information tones resulting in higher prediction gains than environmental noise. Tones can therefore be detected by comparing the prediction gain to a fixed threshold. By limiting the prediction gain calculation to a fourth order analysis, information signals consisting of one or two tones can be detected whilst minimizing the prediction gain for environmental noise.

The prediction gain decision is implemented by comparing the normalized prediction error with a threshold. This measure is used to evaluate the Boolean variable tone every 20 ms. The signal is classified as a tone if the prediction error is smaller than the threshold predth. This is equivalent to a prediction gain threshold of 13,5 dB.

Mobile noise can contain very strong resonances at low frequencies, resulting in a high prediction gain. A further test is therefore made to determine the pole frequency of a second order analysis of the signal frame. The signal is classified as noise if the frequency of the pole is less than 385 Hz. The pole frequency calculation is described in clause A.4.

The algorithm for detecting information tones is as follows:

    tone = false
    den = a[1]*a[1]
    num = 4*a[2] - a[1]*a[1]
    if ( num <= 0 )
        return
    if (( a[1] < 0 ) AND ( num / den < freqth ))
        return
    prederr = MULT (1 - RC[i]*RC[i]), i = 1..4
    if (prederr < predth)
        tone = true
    return
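A minimal Python sketch of the decision part of this algorithm follows, for illustration only. The function name tone_flag and the 1-based list layout are assumptions of the sketch, not part of the ETS; the defaults for freqth and predth are the values listed in table 2.4. It takes the filter coefficients a[1..2] and the reflection coefficients rc[1..4] as already computed, and it uses floating point rather than the fixed point arithmetic of clause 3.

```python
FREQTH = 0.0973   # freqth, table 2.4
PREDTH = 0.0158   # predth, table 2.4


def tone_flag(a, rc, freqth=FREQTH, predth=PREDTH):
    """Return True when the frame is classified as an information tone.

    a  -- sequence where a[1] and a[2] are the second order filter
          coefficients (index 0 is unused, to keep the 1-based notation)
    rc -- sequence where rc[1]..rc[4] are the reflection coefficients
          (index 0 is unused)
    """
    den = a[1] * a[1]
    num = 4 * a[2] - a[1] * a[1]

    # num <= 0 means the second order polynomial has real roots,
    # so there is no pole frequency to test.
    if num <= 0:
        return False

    # Pole frequency test: a low-frequency resonance is treated as noise.
    # den is strictly positive here because a[1] < 0 is checked first.
    if a[1] < 0 and num / den < freqth:
        return False

    # Prediction gain test on the normalized fourth order prediction error.
    prederr = 1.0
    for i in range(1, 5):
        prederr *= 1.0 - rc[i] * rc[i]
    return prederr < predth
```

For example, tone_flag([1.0, -1.95, 0.98], rc) is False for any rc, because num / den is about 0.031 and the pole frequency test rejects the frame before the prediction error is evaluated.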

The values of the constants are given in table 2.4. The coefficients a[1..2] are transversal filter coefficients calculated from rc[1..2]. The calculation of the reflection coefficients rc[1..4] is described below.

The offset compensated signal frame sof[0..159] is multiplied by the Hanning window to give the windowed frame sofh[0..159]:

$$ sofh[i] = sof[i]\,hann[i]; \quad i = 0..159 $$

where

$$ hann[i] = 0.5\left(1 - \cos\left(\frac{2\pi i}{159}\right)\right); \quad i = 0..159 $$

The auto-correlation acfh[0..4] of the windowed signal frame is then calculated:

$$ acfh[k] = \sum_{i=k}^{159} sofh[i]\,sofh[i-k]; \quad k = 0..4 $$

rc[1..4] are then calculated from acfh[0..4] using the Schur recursion described in the RPE-LTP codec.

Table 2.4: Constants for information tone detection

Constant   Value
freqth     0,0973
predth     0,0158

NOTE: Reflection coefficients are available in the RPE-LTP codec. However, they are calculated after pre-emphasis using a rectangular window and do not give good tone detection results.

2.2.7 Threshold adaptation

A check is made every 20 ms to determine whether the VAD decision threshold (thvad) should be changed. This adaptation is carried out according to the flowchart shown in figure 2.2. The constants used are given in table 2.5.

Adaptation takes place in two different situations: firstly whenever ACF[0] is very low and secondly whenever there is a very high probability that speech and information tones are not present.

In the first case, the threshold is adapted if the energy of the input signal is less than pth. The threshold is set to plev without carrying out any further tests because at these very low levels the effect of the signal quantization makes it impossible to obtain reliable results from these tests.

In the second case, the decision threshold (thvad) and the adaptive filter coefficients (rvad) are only updated with the rav1 values when there is a very high probability that speech and information tones are not present. Adaptation occurs if the following conditions are met over a number (adp) of signal frames:

- Stationarity is detected in the frequency domain.
- The signal does not contain a periodic component.
- Information tones are not present.

The step-size by which the threshold is adapted is not constant but a proportion of the current value (determined by constants dec and inc). The adaptation begins by experimentally multiplying the threshold by a factor of (1-1/dec). If the new threshold is now higher than or equal to Pvad times fac then the threshold needed to be decreased and it is left at this new lower level. If, on the other hand, the new threshold level is less than Pvad times fac then the threshold either needed to be increased or kept constant. In this case it is set to Pvad times fac unless this would mean multiplying it by more than a factor of (1+1/inc) (in which case it is multiplied by a factor of (1+1/inc)). The threshold is never allowed to be greater than Pvad+margin.

Table 2.5: Constants and variables for threshold adaptation

Constant   Value        Variable             Initial value
pth        300 000      adaptcount           0
plev       800 000      thvad                1 000 000
fac        3.0          rvad[0]              6
adp        8            rvad[1]              -4
inc        16           rvad[2]              1
dec        32           rvad[3] to rvad[8]   All 0
margin     80 000 000

[Figure 2.2: Flow diagram for threshold adaptation. Decision boxes: ACF[0] < pth ?; stat and not ptch and not tone ?; adaptcount > adp ?; thvad < pvad * fac ?; thvad > pvad + margin ?. Action boxes: thvad = plev; adaptcount = 0; increment adaptcount; thvad = thvad - thvad / dec; thvad = min (thvad + thvad / inc, pvad * fac); thvad = pvad + margin; rvad = rav1; adaptcount = adp + 1.]
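The adaptation described in subclause 2.2.7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the normative procedure: the ordering of the tests is one reading of the prose and of the box labels of figure 2.2, the names ThresholdState and adapt_threshold are hypothetical, and floating point is used instead of the pseudo-floating point format of clause 3. Constants and initial values are those of table 2.5.

```python
from dataclasses import dataclass, field

PTH = 300_000
PLEV = 800_000
FAC = 3.0
ADP = 8
INC = 16
DEC = 32
MARGIN = 80_000_000


@dataclass
class ThresholdState:
    """Hypothetical container for the adaptive variables of table 2.5."""
    thvad: float = 1_000_000.0
    adaptcount: int = 0
    rvad: list = field(default_factory=lambda: [6, -4, 1, 0, 0, 0, 0, 0, 0])


def adapt_threshold(state, acf0, pvad, rav1, stat, ptch, tone):
    """Update thvad, rvad and adaptcount for one 20 ms frame."""
    # First case: very low input energy, no further tests are made.
    if acf0 < PTH:
        state.thvad = PLEV
        return

    # Second case: adapt only on stationary, non-periodic, tone-free frames.
    if not (stat and not ptch and not tone):
        state.adaptcount = 0
        return

    state.adaptcount += 1
    if state.adaptcount <= ADP:
        return  # the conditions have not held for enough frames yet

    # Trial decrease by a factor of (1 - 1/dec).
    state.thvad -= state.thvad / DEC

    # If that went below pvad*fac, raise it again, limited to (1 + 1/inc).
    if state.thvad < pvad * FAC:
        state.thvad = min(state.thvad + state.thvad / INC, pvad * FAC)

    # Never exceed pvad + margin.
    if state.thvad > pvad + MARGIN:
        state.thvad = pvad + MARGIN

    state.rvad = list(rav1)
    state.adaptcount = ADP + 1  # saturate the counter
```

Saturating adaptcount at adp + 1 keeps the counter bounded while the qualifying conditions persist, so adaptation continues on every following frame until one of the conditions fails.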

2.2.8 VAD decision

Prior to hangover the VAD decision condition is:

    vvad = pvad > thvad

2.2.9 VAD hangover addition

VAD hangover is only added to bursts of speech greater than or equal to burstconst blocks. The Boolean variable vad indicates the decision of the VAD with hangover included. The values of the constants are given in table 2.6. The hangover algorithm is as follows:

    if vvad then increment(burstcount) else burstcount = 0
    if burstcount >= burstconst then
        hangcount = hangconst;
        burstcount = burstconst
    vad = vvad or (hangcount >= 0)
    if hangcount >= 0 then decrement(hangcount)

Table 2.6: Constants and variables for VAD hangover addition

Constant     Value   Variable     Initial value
burstconst   3       burstcount   0
hangconst    5       hangcount    -1

3 Computational details

In the next paragraphs, the detailed description of the VAD algorithm follows the preceding high level description. This detailed description is divided in ten clauses related to the blocks of figure 2.1 (except periodicity updating) in the high level description of the VAD algorithm.

Those clauses are:

1) Adaptive filtering and energy computation;
2) ACF averaging;
3) Predictor values computation;
4) Spectral comparison;
5) Periodicity detection;
6) Threshold adaptation;
7) VAD decision;
8) VAD hangover addition;
9) Periodicity updating;
10) Information tone detection.

The VAD algorithm takes as input the following variables of the RPE-LTP encoder (see the detailed description of the RPE-LTP encoder GSM 06.10):

- L_ACF[0..8], auto-correlation function (GSM 06.10/4.2.4);
- scalauto, scaling factor to compute the L_ACF[0..8] (GSM 06.10/4.2.4);
- Nc, LTP lag (one for each sub-segment, GSM 06.10/4.2.11);
- sof, offset compensated signal frame (GSM 06.10/4.2.2).

So four Nc values are needed for the VAD algorithm.

The VAD computation can start as soon as the L_ACF[0..8] and scalauto variables are known. This means that the VAD computation can take place after part 4.2.4 of GSM 06.10 (Auto-correlation) of the LPC analysis clause of the RPE-LTP encoder. This scheme will reduce the delay to yield the VAD information. The periodicity updating (included in subclause 2.2.5) and information tone detection are done after the processing of the current speech encoder frame.
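To make the decision and hangover logic of subclauses 2.2.8 and 2.2.9 concrete, the following Python sketch transcribes the pseudo-code directly. The class name Hangover and the function vad_decision are hypothetical; burstconst = 3, hangconst = 5 and the initial values are read from table 2.6.

```python
BURSTCONST = 3   # burstconst, table 2.6
HANGCONST = 5    # hangconst, table 2.6


def vad_decision(pvad, thvad):
    """VAD decision before hangover (vvad)."""
    return pvad > thvad


class Hangover:
    """Hypothetical holder for the hangover state variables."""

    def __init__(self):
        self.burstcount = 0
        self.hangcount = -1

    def step(self, vvad):
        """Return vad for one frame, given the pre-hangover decision."""
        if vvad:
            self.burstcount += 1
        else:
            self.burstcount = 0

        # A long enough burst (re)arms the hangover counter.
        if self.burstcount >= BURSTCONST:
            self.hangcount = HANGCONST
            self.burstcount = BURSTCONST

        vad = vvad or self.hangcount >= 0
        if self.hangcount >= 0:
            self.hangcount -= 1
        return vad
```

Feeding three speech frames followed by silence keeps vad true for five further frames, whereas a two-frame burst gets no hangover at all:

```python
h = Hangover()
print([h.step(v) for v in [True] * 3 + [False] * 8])
# [True, True, True, True, True, True, True, True, False, False, False]
```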

All the arithmetic operations and names of the variables follow the RPE-LTP detailed description. To increase the precision within the fixed point implementation, a pseudo-floating point representation of some variables is used. This stands for the following variables (and related constants) of the VAD algorithm:

pvad: Energy of filtered signal;
thvad: Threshold of the VAD decision;
acf0: Energy of input signal.

For the representation of these variables, two integers (16 bits) are needed:

- one for the exponent (e_pvad, e_thvad, e_acf0);
- one for the mantissa (m_pvad, m_thvad, m_acf0).

The value e_pvad represents the lowest power of 2 just greater or equal to the actual value of pvad and the m_pvad value represents an integer which is always greater or equal to 16384 (normalized mantissa). It means that the pvad value is equal to:

$$ pvad = 2^{e\_pvad} \cdot (m\_pvad / 32768) $$

This scheme guarantees a large dynamic range for the pvad value and always keeps a precision o
