Information technology — Multimedia content description interface — Part 8: Extraction and use of MPEG-7 descriptions — Amendment 4: Extraction of audio features from compressed formats

Technologies de l'information — Interface de description du contenu multimédia — Partie 8: Extraction et utilisation des descriptions MPEG-7 — Amendement 4: Extraction de caractéristiques audio à partir de formats compressés

General Information

Status
Published
Publication Date
10-Nov-2009
Current Stage
6060 - International Standard published
Start Date
11-Nov-2009
Due Date
28-Apr-2011
Completion Date
28-Apr-2011

Technical report
ISO/IEC TR 15938-8:2002/Amd 4:2009 - Extraction of audio features from compressed formats
English language
25 pages

Standards Content (Sample)


TECHNICAL ISO/IEC
REPORT TR
15938-8
First edition
2002-12-15
AMENDMENT 4
2009-11-15
Information technology — Multimedia
content description interface —
Part 8:
Extraction and use of MPEG-7
descriptions
AMENDMENT 4: Extraction of audio
features from compressed formats
Technologies de l'information — Interface de description du contenu
multimédia —
Partie 8: Extraction et utilisation des descriptions MPEG-7
AMENDEMENT 4: Extraction de caractéristiques audio à partir de
formats compressés
Reference number
ISO/IEC TR 15938-8:2002/Amd.4:2009(E)
©
ISO/IEC 2009
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

©  ISO/IEC 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
In exceptional circumstances, the joint technical committee may propose the publication of a Technical Report
of one of the following types:
⎯ type 1, when the required support cannot be obtained for the publication of an International Standard,
despite repeated efforts;
⎯ type 2, when the subject is still under technical development or where for any other reason there is the
future but not immediate possibility of an agreement on an International Standard;
⎯ type 3, when the joint technical committee has collected data of a different kind from that which is
normally published as an International Standard (“state of the art”, for example).
Technical Reports of types 1 and 2 are subject to review within three years of publication, to decide whether
they can be transformed into International Standards. Technical Reports of type 3 do not necessarily have to
be reviewed until the data they provide are considered to be no longer valid or useful.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
Amendment 4 to ISO/IEC TR 15938-8:2002 was prepared by Joint Technical Committee ISO/IEC JTC 1,
Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia
information.

Information technology — Multimedia content description
interface —
Part 8:
Extraction and use of MPEG-7 descriptions
AMENDMENT 4: Extraction of audio features from compressed
formats
After 4.8.2.2.6, add Clause 5:
5 Direct audio feature extraction from the compressed domain
5.1 Introduction
Owing to efficient MPEG audio compression technologies, such as MPEG-1 Layer III (MP3) [AMD4-1] and
MPEG-2/-4 AAC (AAC) [AMD4-2, AMD4-3], the amount of music stored in personal and institutional archives
has grown significantly in recent years. At the same time, the need for automatic search and retrieval
capabilities for music has increased in order to manage these databases. These search and retrieval
applications are based on low-level features (e.g. those described in the MPEG-7 standard [AMD4-4]) which
are extracted from the digital audio content. In order to search large archives efficiently, faster low-level
feature extraction is needed. This Technical Report describes a method which allows MPEG-7 low-level
features [AMD4-4] to be extracted directly from the compressed domain, by transforming the frequency
representation of MPEG compressed audio files into the DFT domain for feature extraction.
5.2 Conventional feature extraction
The conventional approach to obtaining MPEG-7 features from compressed audio data is to decode the data first
and then to generate the MPEG-7 features from the decoded time signal. However, especially when searching
large libraries of compressed audio files, this approach can become computationally very expensive. Several
works deal with the conversion between subband domain representations, especially in the field of image and
video coding. In [AMD4-5] and [AMD4-6] the conversion between DCT transforms of different sizes is given, with
the drawback that it is restricted to non-lapped transforms. The patent in [AMD4-7] proposes a conversion
method between the MDCT and the DFT domain. It is restricted to the MDCT and the DFT and is therefore not
suitable for our purposes, since we also want to include hybrid filter banks, which are an integral part of
MP3. The architecture presented in [AMD4-8] is not restricted to particular types of filter bank.
Unfortunately, the numbers of subbands of the different filter banks have to be multiples of each other, which
is again unsuitable for our needs. However, that paper serves as the basis for a general conversion method
proposed in [AMD4-9], which can be applied to any maximally decimated filter banks without any condition on
their sizes. Here, a conversion matrix is generated by multiplying the analysis filter bank with a synthesis
filter bank. In principle, the same is done in this Technical Report, although a universal mathematical
description is used, namely the polyphase description introduced in [AMD4-10]. Additionally, the described
method is extended in a practical way to arbitrary resolution translations between synthesis and analysis
filter banks. Furthermore, it is adapted to MP3 and AAC, and exploits some special properties of the so-called
conversion matrix, which is explained in the next section. In [AMD4-11] the problem of generating a
complex-valued spectral representation from a real-valued one is approached from the reverse side. There it
is stated that a desired frequency response can be approximated by means of
a linear combination with constant weighting factors. This approach allows only a coarse approximation, but
it has a very low computational complexity. It inspired the technique referred to here as spectral
approximation. A completely different approach, which works directly on the compressed domain, is also worth
mentioning. It uses the MDCT coefficients as the basis for low-level feature extraction [AMD4-12]. Since no
conversion into the DFT domain is applied, this approach is restricted to the time/frequency resolution
provided by the codec used. It is hence not compatible with existing MPEG-7 feature databases.
5.3 Direct feature extraction
5.3.1 System overview
In order to extract audio features from the compressed domain, we designed a conversion system which
directly converts the given time-frequency representations of MPEG-1 Layer III and MPEG-2/-4 AAC into the
time-frequency representation needed for calculating MPEG-7 compliant features. After applying the
conversion method, the resulting complex-valued spectral coefficients are fed to the feature extraction
algorithm.
Before we elaborate on the direct feature extraction system, it is important to know some details about how
the conventional approach works and how it deals with compressed audio input material. Figure AMD4.1
shows the basic building blocks of the conventional feature extraction process.

Figure AMD4.1 — Basic building blocks of the conventional feature extraction process
First, the compressed input audio material needs to be decoded to PCM audio data. Then, the feature
extraction process, which consists of an analysis stage and a feature calculation stage, applies a window
function to the PCM input samples, followed by an FFT, prior to the feature calculation. Our goal is to
replace the bulk of the computation needed for decoding and analysis by one direct conversion process. In
this context, the bulk of the computation in the decoding process is essentially the synthesis filter
bank of the particular decoder. For MP3, reordering and anti-aliasing operations take place in addition.
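As a rough illustration of the analysis stage described above, the following Python sketch windows already decoded PCM samples and applies an FFT to each frame. It is a minimal sketch, not the reference extraction procedure: the function name `stft_frames`, the Hamming window, the 10 ms hop and the zero-padding to the next power of two are assumptions made here for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def stft_frames(pcm, fs, win_s=0.030, hop_s=0.010):
    """Window decoded PCM samples and return one complex spectrum per frame.

    Assumptions for illustration only: Hamming window, 10 ms hop,
    zero-padding of each frame to the next power of two.
    """
    pcm = np.asarray(pcm, dtype=float)
    win_len = int(round(win_s * fs))          # samples per analysis window
    hop = int(round(hop_s * fs))              # samples between frame starts
    if len(pcm) < win_len:
        raise ValueError("signal is shorter than one analysis window")
    n_fft = 1 << (win_len - 1).bit_length()   # next power of two >= win_len
    window = np.hamming(win_len)
    n_frames = 1 + (len(pcm) - win_len) // hop
    frames = np.stack([pcm[i * hop:i * hop + win_len] for i in range(n_frames)])
    # Real-input FFT: n_fft // 2 + 1 complex bins per frame.
    return np.fft.rfft(frames * window, n=n_fft, axis=1)

# Usage: one second of a 1 kHz test tone at 44,1 kHz.
fs = 44_100
t = np.arange(fs) / fs
spectra = stft_frames(np.sin(2 * np.pi * 1000 * t), fs)
power = np.abs(spectra) ** 2                  # input to a feature calculation stage
```

In the conventional chain this stage only runs after the complete decoder, which is the cost the direct conversion aims to avoid.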
We now take a look at Figure AMD4.2. The synthesis filter bank of the decoder having a transfer function and
the analysis filter bank of the feature extraction process having another transfer function exhibit different
numbers of subbands, K and L respectively.

Figure AMD4.2 — Synthesis filter bank with K subbands followed by an analysis filter bank with L
subbands. Both filter banks are maximally decimated and linear time-invariant
Y_k(m) denotes the subband coefficient of the compressed bitstream of subband k at block m, x(n) is the
decoded time-domain audio signal at time n, and y_i(m) is the subband signal of the desired domain of
subband i at block m.
However, a more efficient and useful representation of maximally decimated filter banks is the so-called
polyphase description introduced by Vaidyanathan [AMD4-10]. The main advantage of the polyphase
description is its mathematical compactness, so that a filter bank can be fully described by a polyphase filter
matrix. The filtering process then reduces to a multiplication of a z-transformed signal vector with a polyphase
filter matrix. Furthermore, a concatenation of different filter banks can be represented by a single
polyphase matrix, which is obtained by multiplying the individual polyphase matrices of these filter banks.
This property enables the construction of a conversion matrix T(z) of size M × M as shown in Figure AMD4.3.
[Block diagram. Conventional path: Y(z) → synthesis polyphase matrix G(z), K × K → X(z) → analysis polyphase
matrix H(z), L × L → Ŷ(z). Direct conversion system: Y(z) → polyphase conversion matrix T(z) = G(z) H(z),
M × M → Ŷ(z).]
Figure AMD4.3 — Block diagram of the conventional transcoding of the direct conversion method
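The statement that a cascade of filter banks collapses into a single polyphase matrix can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure of this Technical Report: polyphase matrices are stored as stacks of coefficient matrices, one per power of z^-1; signal blocks are treated as row vectors; both filter banks use the same block size; and the names `polymat_mul` and `apply_polymat` are hypothetical. Random matrices stand in for real synthesis and analysis filter banks.

```python
import numpy as np

def polymat_mul(G, H):
    """Product of two polynomial matrices in z^-1.

    G has shape (P, K, M) and H has shape (Q, M, L); index 0 is the power of z^-1.
    """
    P, K, _ = G.shape
    Q, _, L = H.shape
    T = np.zeros((P + Q - 1, K, L), dtype=np.result_type(G, H))
    for p in range(P):
        for q in range(Q):
            T[p + q] += G[p] @ H[q]       # coefficients of z^-(p+q) accumulate
    return T

def apply_polymat(blocks, A):
    """Filter a block sequence: out[m] = sum_p blocks[m - p] @ A[p] (zero initial state)."""
    n = blocks.shape[0]
    out = np.zeros((n, A.shape[2]), dtype=np.result_type(blocks, A))
    for p in range(min(A.shape[0], n)):
        out[p:] += blocks[:n - p] @ A[p]
    return out

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 4, 4))                       # stand-in synthesis matrix
H = rng.standard_normal((3, 4, 4)) + 1j * rng.standard_normal((3, 4, 4))  # stand-in analysis matrix
Y = rng.standard_normal((10, 4))                         # ten blocks of subband values

two_step = apply_polymat(apply_polymat(Y, G), H)         # synthesis, then analysis
direct = apply_polymat(Y, polymat_mul(G, H))             # single conversion matrix
assert np.allclose(two_step, direct)
```

Because the product is computed once, offline, only the single matrix has to be applied to the subband data of each block at run time.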
It is evident that M² multiplications are necessary to calculate the desired spectral values when using an
M × M conversion matrix. This is equivalent to a complexity of O(N²) and is, unfortunately, much higher than
that of the conventional method, since the latter uses efficient implementations of the MDCT and FFT
with an overall complexity of O(N log(N)). We found that only a fraction of the values inside a conversion
matrix is necessary for the calculation of audio features which still guarantee a successful identification
of the underlying audio material. This is possible since the most significant values of a conversion matrix
are evenly spread along the main diagonal and decrease quickly with increasing distance from it. The most
important characteristic of a conversion matrix T(z) is that it exhibits a strong similarity to diagonal, and
therefore sparse, matrices. For instance, Figure AMD4.4 shows an example of such a polyphase conversion
matrix, where the white areas correspond to zeros in the matrix. Observe that three images of matrices can
be used, because each corresponds to the coefficients of a different power of z of the polyphase matrix. The
analysis time window is set to 30 ms because this is suitable for many music information retrieval tasks. The
sampling frequency is chosen to be 44,1 kHz (in general it is arbitrary); hence the matrix generates
1024 complex Fourier coefficients as output, whereas it takes 576 real-valued input samples (the content of
one MP3 granule).
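The sizes quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes, as one plausible reading of the text, that the 30 ms window is zero-padded to the next power of two and that only half of the resulting DFT bins are counted. The helper name `conversion_sizes` is hypothetical, and the multiplication count refers to one fully populated coefficient matrix only.

```python
def conversion_sizes(fs_hz=44_100, window_s=0.030, granule_inputs=576):
    """Derive window, FFT and dense-matrix sizes (illustrative assumptions only)."""
    window_samples = int(round(window_s * fs_hz))            # samples in a 30 ms window
    fft_length = 1 << (window_samples - 1).bit_length()      # next power of two
    dft_outputs = fft_length // 2                            # non-redundant complex bins
    dense_mults = granule_inputs * dft_outputs               # one full coefficient matrix
    return {
        "window_samples": window_samples,
        "fft_length": fft_length,
        "dft_outputs": dft_outputs,
        "dense_mults": dense_mults,
    }

sizes = conversion_sizes()
print(sizes)
```

With the default arguments this gives 1323 window samples, an FFT length of 2048, 1024 output coefficients and 589 824 multiplications per dense coefficient matrix, which illustrates why a fully populated conversion matrix is unattractive.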

Figure AMD4.4 — Exemplary complex polyphase conversion matrix for MP3 converting
one granule of 576 real valued subbands into 1024 DFT coefficients. The figure only
shows absolute values.
It can be seen in Figure AMD4.4 that the most significant values are evenly spread along the main diagonal. If
only the coefficients necessary for the desired accuracy are kept, the sparse matrix shown in Figure AMD4.5
is obtained. For clarification, Figure AMD4.5 shows an exemplary STFT spectrum and its approximation using
sparse matrices for direct conversion. For this example, a conversion complexity of about 0,07 % of that of
a fully populated matrix was used. This property makes it possible to approximate a desired spectral
representation by using only the strongest diagonals while omitting the less important ones. Exploiting this property
...
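The idea of keeping only the strongest diagonals can be sketched as follows. This is a toy illustration under stated assumptions, not the conversion matrix of this Technical Report: the matrix `T_toy` is synthetic, with magnitudes that decay exponentially away from the stretched main diagonal and random phases; the fixed band half-width is an arbitrary selection rule; and the helper names `band_mask` and `keep_band` are hypothetical.

```python
import numpy as np

def band_mask(rows, cols, half_width):
    """Boolean mask of the entries close to the (stretched) main diagonal."""
    i = np.arange(rows)[:, None]
    j = np.arange(cols)[None, :]
    centre = i * (cols - 1) / max(rows - 1, 1)   # diagonal position in each row
    return np.abs(j - centre) <= half_width

def keep_band(T, half_width):
    """Zero everything outside the band; also return the fraction of entries kept."""
    mask = band_mask(T.shape[0], T.shape[1], half_width)
    return np.where(mask, T, 0), float(mask.mean())

# Synthetic stand-in for one 576 x 1024 coefficient matrix (assumption, not real data).
rng = np.random.default_rng(0)
rows, cols = 576, 1024
i = np.arange(rows)[:, None]
j = np.arange(cols)[None, :]
dist = np.abs(j - i * (cols - 1) / (rows - 1))
T_toy = np.exp(-dist / 2) * np.exp(2j * np.pi * rng.random((rows, cols)))

T_sparse, kept = keep_band(T_toy, half_width=10)
x = rng.standard_normal(rows)                    # stand-in for one granule of subband values
exact = x @ T_toy
approx = x @ T_sparse
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"kept {kept:.2%} of the entries, relative error {rel_err:.4f}")
```

The trade-off between the number of retained entries and the approximation error depends entirely on how fast the real matrix decays, so the figures printed for this toy matrix say nothing about MP3 or AAC.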



This CD-ROM contains:
1) the publication ISO/IEC TR 15938-8:2002/Amd.4:2009(E) in portable document format (PDF), which
can be viewed using Adobe® Acrobat® Reader;
2) electronic attachments pertinent to extraction of audio features from compressed formats.
Adobe and Acrobat are trademarks of Adobe Systems Incorporated.
©  ISO/IEC 2009
All rights reserved. Unless required for installation or otherwise specified, no part of this CD-ROM may be reproduced, stored in a retrieval
system or transmitted in any form or by any means without prior permission from ISO. Reques
...
