Information technology - Coding of audio-visual objects - Part 3: Audio

ISO/IEC 14496-3:2005 (MPEG-4 Audio) is a new kind of audio International Standard that integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. By standardizing individually sophisticated coding tools as well as a novel, flexible framework for audio synchronization, mixing and downloaded post-production, the developers of the MPEG-4 Audio International Standard have created new technology for a new, interactive world of digital audio. MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a standard that applies to every application requiring the use of advanced sound compression, synthesis, manipulation or playback. The subparts that follow specify the state-of-the-art coding tools in several domains; however, ISO/IEC 14496-3:2005 is more than just the sum of its parts. As the tools described here are integrated with the rest of the MPEG-4 International Standard, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media are enabled. Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a natural feature of systems that depend on the MPEG-4 Audio International Standard.

Technologies de l'information — Codage des objets audiovisuels — Partie 3: Codage audio

General Information

Status
Withdrawn
Publication Date
30-Nov-2005
Withdrawal Date
30-Nov-2005
Current Stage
9599 - Withdrawal of International Standard
Start Date
26-Aug-2009
Completion Date
30-Oct-2025
Ref Project

Relations

Standard
ISO/IEC 14496-3:2005 - Information technology -- Coding of audio-visual objects
English language
1163 pages
sale 15% off
Preview
sale 15% off
Preview

Frequently Asked Questions

ISO/IEC 14496-3:2005 is a standard published by the International Organization for Standardization (ISO). Its full title is "Information technology - Coding of audio-visual objects - Part 3: Audio". This standard covers: ISO/IEC 14496-3:2005 (MPEG-4 Audio) is a new kind of audio International Standard that integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. By standardizing individually sophisticated coding tools as well as a novel, flexible framework for audio synchronization, mixing and downloaded post-production, the developers of the MPEG-4 Audio International Standard have created new technology for a new, interactive world of digital audio. MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a standard that applies to every application requiring the use of advanced sound compression, synthesis, manipulation or playback. The subparts that follow specify the state-of-the-art coding tools in several domains; however, ISO/IEC 14496-3:2005 is more than just the sum of its parts. As the tools described here are integrated with the rest of the MPEG-4 International Standard, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media are enabled. Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a natural feature of systems that depend on the MPEG-4 Audio International Standard.

ISO/IEC 14496-3:2005 (MPEG-4 Audio) is a new kind of audio International Standard that integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. By standardizing individually sophisticated coding tools as well as a novel, flexible framework for audio synchronization, mixing and downloaded post-production, the developers of the MPEG-4 Audio International Standard have created new technology for a new, interactive world of digital audio. MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a standard that applies to every application requiring the use of advanced sound compression, synthesis, manipulation or playback. The subparts that follow specify the state-of-the-art coding tools in several domains; however, ISO/IEC 14496-3:2005 is more than just the sum of its parts. As the tools described here are integrated with the rest of the MPEG-4 International Standard, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media are enabled. Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a natural feature of systems that depend on the MPEG-4 Audio International Standard.

ISO/IEC 14496-3:2005 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC 14496-3:2005 has the following relationships with other standards: It is inter standard links to ISO/IEC 14496-3:2005/Amd 1:2007, ISO/IEC 14496-3:2005/Amd 9:2008, ISO/IEC 14496-3:2005/Amd 8:2008, ISO/IEC 14496-3:2005/Amd 3:2006, ISO/IEC 14496-3:2005/Amd 2:2006, ISO/IEC 14496-3:2005/Amd 5:2007, ISO/IEC 14496-3:2001/Amd 2:2004/Cor 1:2005, ISO/IEC 14496-3:2001/Amd 1:2003/Cor 1:2004, ISO/IEC 14496-3:2009, ISO/IEC 14496-3:2001/Cor 1:2002, ISO/IEC 14496-3:2001/Amd 2:2004, ISO/IEC 14496-3:2001/Amd 6:2005, ISO/IEC 14496-3:2001/Amd 3:2005, ISO/IEC 14496-3:2001/Amd 1:2003, ISO/IEC 14496-3:2001/Cor 2:2004. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

You can purchase ISO/IEC 14496-3:2005 directly from iTeh Standards. The document is available in PDF format and is delivered instantly after payment. Add the standard to your cart and complete the secure checkout process. iTeh Standards is an authorized distributor of ISO standards.

Standards Content (Sample)


INTERNATIONAL ISO/IEC
STANDARD 14496-3
Third edition
2005-12-01
Information technology — Coding of
audio-visual objects —
Part 3:
Audio
Technologies de l'information — Codage des objets audiovisuels —
Partie 3: Codage audio
Reference number
©
ISO/IEC 2005
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

©  ISO/IEC 2005
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO/IEC 2005 – All rights reserved

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission)
form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC
participate in the development of International Standards through technical committees established by the
respective organization to deal with particular fields of technical activity. ISO and IEC technical committees
collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in
liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have
established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International Standards
adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International
Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights.
ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 14496-3 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.

This third edition cancels and replaces the second edition (ISO/IEC 14496-3:2001), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-3:2001/Amd.1:2003, ISO/IEC 14496-3:2001/
Amd.2:2004, ISO/IEC 14496-3:2001/Amd.3:2005 and ISO/IEC 14496-3:2001/Amd.6:2005, and the Technical
Corrigenda ISO/IEC 14496-3:2001/Cor.1:2002 and ISO/IEC 14496-3:2001/Cor.2:2004.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of audio-
visual objects:
 Part 1: Systems
 Part 2: Visual
 Part 3: Audio
 Part 4: Conformance testing
 Part 5: Reference software
 Part 6: Delivery Multimedia Integration Framework (DMIF)
 Part 7: Optimized reference software for coding of audio-visual objects [Technical Report]
 Part 8: Carriage of ISO/IEC 14496 contents over IP networks
 Part 9: Reference hardware description [Technical Report]
 Part 10: Advanced Video Coding
© ISO/IEC 2005 — All rights reserved iii

 Part 11: Scene description and application engine
 Part 12: ISO base media file format
 Part 13: Intellectual Property Management and Protection (IPMP) extensions
 Part 14: MP4 file format
 Part 15: Advanced Video Coding (AVC) file format
 Part 16: Animation Framework eXtension (AFX)
 Part 17: Streaming text format
 Part 18: Font compression and streaming
 Part 19: Synthesized texture stream
 Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
iv © ISO/IEC 2005 — All rights reserved

0 Introduction
0.1 Overview
ISO/IEC 14496-3 (MPEG-4 Audio) is a new kind of audio standard that integrates many different types of audio
coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music,
complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. By
standardizing individually sophisticated coding tools as well as a novel, flexible framework for audio
synchronization, mixing, and downloaded post-production, the developers of the MPEG-4 Audio standard have
created new technology for a new, interactive world of digital audio.
MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single
application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a standard
that applies to every application requiring the use of advanced sound compression, synthesis, manipulation, or
playback. The subparts that follow specify the state-of-the-art coding tools in several domains; however, MPEG-4
Audio is more than just the sum of its parts. As the tools described here are integrated with the rest of the MPEG-4
standard, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks,
and other sorts of new media, are enabled.
Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a natural
feature of systems that depend on the MPEG-4 Audio standard. A system that uses a particular coder — for
example a real-time voice communication system making use of the MPEG-4 speech coding toolset — can easily
share data and development tools with other systems, even in different domains, that use the same tool — for
example a voicemail indexing and retrieval system making use of MPEG-4 speech coding.
The remainder of this Introduction gives a more detailed overview of the capabilities and functioning of MPEG-4
Audio. First a discussion of concepts, that have changed since the MPEG-2 audio standards, is presented. Then
the MPEG-4 Audio toolset is outlined.
0.2 Concepts of MPEG-4 Audio
As with previous MPEG standards, MPEG-4 does not standardize methods for encoding sound. Thus, content
authors are left to their own decisions as to the best method of creating bitstream payloads. At the present time,
methods to automatically convert natural sound into synthetic or multi-object descriptions are not mature; therefore,
most immediate solutions will involve interactively-authoring the content stream in some way. This process is
similar to current schemes for MIDI-based and multi-channel mixdown authoring of soundtracks.
Many concepts in MPEG-4 Audio are different from those in previous MPEG Audio standards. For the benefit of
readers who are familiar with MPEG-1 and MPEG-2, we provide a brief overview here.
0.2.1 Audio storage and transport facilities
In all of the MPEG-4 tools for audio coding, the coding standard ends at the point of constructing access units that
contain the compressed data. The MPEG-4 Systems (ISO/IEC 14496-1) specification describes how to convert
these individually coded access units into elementary streams.
There is no standard transport mechanism of these elementary streams over a channel. This is because the broad
range of applications that can make use of MPEG-4 technology have delivery requirements that are too wide to
easily characterize with a single solution. Rather, what is standardized is an interface (the Delivery Multimedia
Interface Format, or DMIF, specified in ISO/IEC 14496-6) that describes the capabilities of a transport layer and the
communication between transport, multiplex, and demultiplex functions in encoders and decoders. The use of
DMIF and the MPEG-4 Systems specification allows transmission functions that are much more sophisticated than
are possible with previous MPEG standards.
© ISO/IEC 2005 — All rights reserved v

However, LATM and LOAS were defined to provide a low overhead audio multiplex and transport mechanism for
natural audio applications, which do not require sophisticated object-based coding or other functions provided by
MPEG-4 Systems.
The following table gives an overview about the multiplex, storage and transmission formats currently available for
MPEG-4 Audio within the MPEG-4 framework:
Format Functionality defined in Functionality originally Description
MPEG-4: defined in:
M4Mux ISO/IEC 14496-1 - MPEG-4 Multiplex scheme
(normative)
LATM ISO/IEC 14496-3 - Low Overhead Audio Transport
(normative) Multiplex
ADIF ISO/IEC 14496-3 ISO/IEC 13818-7 Audio Data Interchange Format,
(informative) (normative) (AAC only)
MP4FF ISO/IEC 14496-12 - MPEG-4 File Format
(normative)
ADTS ISO/IEC 14496-3 ISO/IEC 13818-7 Audio Data Transport Stream,
(informative) (normative, exemplarily) (AAC only)
LOAS ISO/IEC 14496-3 - Low Overhead Audio Stream, based
(normative, exemplarily) on LATM, three versions are
available:
AudioSyncStream()
EPAudioSyncStream()
AudioPointerStream()
To allow for a user on the remote side of a channel to dynamically control a server streaming MPEG-4 content,
MPEG-4 defines backchannel streams that can carry user interaction information.
0.2.2 MPEG-4 Audio supports low-bitrate coding
Previous MPEG Audio standards have focused primarily on transparent (undetectable) or nearly transparent coding
of high-quality audio at whatever bitrate was required to provide it. MPEG-4 provides new and improved tools for
this purpose, but also standardizes (and has tested) tools that can be used for transmitting audio at the low bitrates
suitable for Internet, digital radio, or other bandwidth-limited delivery. The new tools specified in MPEG-4 are the
state-of-the-art tools that support low-bitrate coding of speech and other audio.
0.2.3 MPEG-4 Audio is an object-based coding standard with multiple tools
Previous MPEG Audio standards provided a single toolset, with different configurations of that toolset specified for
use in various applications. MPEG-4 provides several toolsets that have no particular relationship to each other,
each with a different target function. The profiles of MPEG-4 Audio specify which of these tools are used together
for various applications.
Further, in previous MPEG standards, a single (perhaps multi-channel or multi-language) piece of content was
transmitted. In contrast, MPEG-4 supports a much more flexible concept of a soundtrack. Multiple tools may be
used to transmit several audio objects, and when using multiple tools together an audio composition system is
provided to create a single soundtrack from the several audio substreams. User interaction, terminal capability, and
speaker configuration may be used when determining how to produce a single soundtrack from the component
objects. This capability gives MPEG-4 significant advantages in quality and flexibility when compared to previous
audio standards.
0.2.4 MPEG-4 Audio provides capabilities for synthetic sound
In natural sound coding, an existing sound is compressed by a server, transmitted and decompressed at the
receiver. This type of coding is the subject of many existing standards for sound compression. In contrast, MPEG-4
standardizes a novel paradigm in which synthetic sound descriptions, including synthetic speech and synthetic
vi © ISO/IEC 2005 — All rights reserved

Transmission Storage Multiplex

music, are transmitted and then synthesized into sound at the receiver. Such capabilities open up new areas of
very-low-bitrate but still very-high-quality coding.
0.2.5 MPEG-4 Audio provides capabilities for error robustness
Improved error robustness capabilities for all coding tools are provided through the error-resilient bitstream payload
syntax. This tool supports advanced channel coding techniques, which can be adapted to the special needs of
given coding tools and a given communications channel. This error-resilient bitstream payload syntax is mandatory
for all error resillient object types.
The error protection tool (EP tool) provides unequal error protection (UEP) for MPEG-4 Audio in conjunction with
the error-resilient bitstream payload. UEP is an efficient method to improve the error robustness of source coding
schemes. It is used by various speech and audio coding systems operating over error-prone channels such as
mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal representation are
first grouped into different classes according to their error sensitivity. Then error protection is individually applied to
the different classes, giving better protection to more sensitive bits.
Improved error robustness for AAC is provided by a set of error resilience tools. These tools reduce the perceived
degradation of the decoded audio signal that is caused by corrupted bits in the bitstream payload.
0.2.6 MPEG-4 Audio provides capabilities for scalability
Previous MPEG Audio standards provided a single bitrate, single bandwidth toolset, with different configurations of
that toolset specified for use in various applications. MPEG-4 provides several bitrate and bandwidth options within
a single stream, providing a scalability functionality that permits a given stream to scale to the requirement of
different channels and applications or to be responsive to a given channel that has dynamic throughput
characteristics. The tools specified in MPEG-4 are the state-of-the-art tools providing scalable compression of
speech and audio signals.
0.3 The MPEG-4 Audio tool set
0.3.1 Speech coding tools
0.3.1.1 Overview
Speech coding tools are designed for the transmission and decoding of synthetic and natural speech.
Two types of speech coding tools are provided in MPEG-4. The natural speech tools allow the compression,
transmission, and decoding of human speech, for use in telephony, personal communication, and surveillance
applications. The synthetic speech tool provides an interface to text-to-speech synthesis systems; using synthetic
speech provides very-low-bitrate operation and built-in connection with facial animation for use in low-bitrate video
teleconferencing applications.
0.3.1.2 Natural speech coding
The MPEG-4 speech coding toolset covers the compression and decoding of natural speech sound at bitrates
ranging between 2 and 24 kbit/s. When variable bitrate coding is allowed, coding at even less than 2 kbit/s, for
example an average bitrate of 1.2 kbit/s, is also supported. Two basic speech coding techniques are used: One is a
parametric speech coding algorithm, HVXC (Harmonic Vector eXcitation Coding), for very low bit rates; and the
other is a CELP (Code Excited Linear Prediction) coding technique. The MPEG-4 speech coders target
applications range from mobile and satellite communications, to Internet telephony, to packaged media and speech
databases. It meets a wide range of requirements encompassing bitrate, functionality and sound quality.
MPEG-4 HVXC operates at fixed bitrates between 2.0 kbit/s and 4.0 kbit/s using a bitrate scalability technique. It
also operates at lower bitrates, typically 1.2 - 1.7 kbit/s, using a variable bitrate technique. HVXC provides
communications-quality to near-toll-quality speech in the 100 Hz – 3800 Hz band at 8 kHz sampling rate. HVXC
also allows independent change of speed and pitch during decoding, which is a powerful functionality for fast
access to speech databases. HVXC functionalities including 2.0 - 4.0 kbit/s fixed bitrate modes and a 2.0 kbit/s
maximum variable bitrate mode.
© ISO/IEC 2005 — All rights reserved vii

Error Resilient (ER) HVXC extends operation of the variable bitrate mode to 4.0 kbit/s to allow higher quality
variable rate coding. The ER HVXC therefore provides fixed bitrate modes of 2.0 - 4.0 kbit/s and a variable bitrate
of either less than 2.0 kbit/s or less than 4.0 kbit/s, both in scalable and non-scalable modes. In the variable bitrate
modes, non-speech parts are detected in unvoiced signals, and a smaller number of bits are used for these non-
speech parts to reduce the average bitrate. ER HVXC provides communications-quality to near-toll-quality speech
in the 100 Hz - 3800 Hz band at 8 kHz sampling rate. When the variable bitrate mode is allowed, operation at lower
average bitrate is possible. Coded speech using variable bitrate mode at typical bitrates of 1.5 kbit/s average, and
at typical bitrate of 3.0 kbit/s average has essentially the same quality as 2.0 kbit/s fixed rate and 4.0 kbit/s fixed
rate respectively. The functionality of pitch and speed change during decoding is supported for all modes. ER
HVXC has a bitstream payload syntax with the error sensitivity classes to be used with the EP-Tool, and some
error concealment functionality is supported for use in error-prone channels such as mobile communication
channels. The ER HVXC speech coder target applications range from mobile and satellite communications, to
Internet telephony, to packaged media and speech databases.
MPEG-4 CELP is a well-known coding algorithm with new functionality. Conventional CELP coders offer
compression at a single bit rate and are optimized for specific applications. Compression is one of the
functionalities provided by MPEG-4 CELP, but MPEG-4 also enables the use of one basic coder in multiple
applications. It provides scalability in bitrate and bandwidth, as well as the ability to generate bitstream payloads at
arbitrary bitrates. The MPEG-4 CELP coder supports two sampling rates, namely, 8 kHz and 16 kHz. The
associated bandwidths are 100 Hz – 3800 Hz for 8 kHz sampling and 50 Hz – 7000 Hz for 16 kHz sampling. The
silence compression tool comprises a voice activity detector (VAD), a discontinuous transmission (DTX) unit and a
comfort noise generator (CNG) module. The tool encodes/decodes the input signal at a lower bitrate during the
non-active-voice (silent) frames. During the active-voice (speech) frames, MPEG-4 CELP encoding and decoding
are used.
The silence compression tool reduces the average bitrate thanks to compression at a lower-bitrate for silence. In
the encoder, a voice activity detector is used to distinguish between regions with normal speech activity and those
with silence or background noise. During normal speech activity, the CELP coding is used. Otherwise a silence
insertion descriptor (SID) is transmitted at a lower bitrate. This SID enables a comfort noise generator (CNG) in the
decoder. The amplitude and the spectral shape of this comfort noise are specified by energy and LPC parameters
in methods similar to those used in a normal CELP frame. These parameters are optionally re-transmitted in the
SID and thus can be updated as required.
MPEG has conducted extensive verification testing in realistic listening conditions in order to prove the efficacy of
the speech coding toolset.
0.3.1.3 Text-to-speech interface
Text-to-speech (TTS) capability is becoming a rather common media type and plays an important role in various
multi-media application areas. For instance, by using TTS functionality, multimedia content with narration can be
easily created without recording natural speech. Before MPEG-4, however, there was no way for a multimedia
content provider to easily give instructions to an unknown TTS system. With MPEG-4 TTS Interface, a single
common interface for TTS systems is standardized. This interface allows speech information to be transmitted in
the international phonetic alphabet (IPA), or in a textual (written) form of any language.
The MPEG-4 Hybrid/Multi-Level Scalable TTS Interface is a superset of the conventional TTS framework. This
extended TTS Interface can utilize prosodic information taken from natural speech in addition to input text and can
thus generate much higher-quality synthetic speech. The interface and its bitstream payload format is scalable in
terms of this added information; for example, if some parameters of prosodic information are not available, a
decoder can generate the missing parameters by rule. Normative algorithms for speech synthesis and text-to-
phoneme translation are not specified in MPEG-4, but to meet the goal that underlies the MPEG-4 TTS Interface, a
decoder should fully utilize all the provided information according to the user’s requirements level.
As well as an interface to text-to-speech synthesis systems, MPEG-4 specifies a joint coding method for phonemic
information and facial animation (FA) parameters and other animation parameters (AP). Using this technique, a
single bitstream payload may be used to control both the text-to-speech interface and the facial animation visual
object decoder (see ISO/IEC 14496-2, Annex C). The functionality of this extended TTS thus ranges from
conventional TTS to natural speech coding and its application areas, from simple TTS to audio presentation with
TTS and motion picture dubbing with TTS.
viii © ISO/IEC 2005 — All rights reserved

0.3.2 Audio coding tools
0.3.2.1 Overview
Audio coding tools are designed for the transmission and decoding of recorded music and other audio soundtracks.
0.3.2.2 General audio coding tools
MPEG-4 standardizes the coding of natural audio at bitrates ranging from 6 kbit/s up to several hundred kbit/s per
audio channel for mono, two-channel-, and multi-channel-stereo signals. General high-quality compression is
provided by incorporating the MPEG-2 AAC standard (ISO/IEC 13818-7), with certain improvements, as MPEG-4
AAC. At 64 kbit/s/channel and higher ranges, this coder has been found in verification testing under rigorous
conditions to meet the criterion of “indistinguishable quality” as defined by the European Broadcasting Union.
General audio (GA) coding tools comprise the AAC tool set expanded by alternative quantization and coding
schemes (Twin-VQ and BSAC). The general audio coder uses a perceptual filterbank, a sophisticated masking
model, noise-shaping techniques, channel coupling, and noiseless coding and bit-allocation to provide the
maximum compression within the constraints of providing the highest possible quality. Psychoacoustic coding
standards developed by MPEG have represented the state-of-the-art in this technology since MPEG-1 Audio;
MPEG-4 General Audio coding continues this tradition.
For bitrates ranging from 6 kbit/s to 64 kbit/s per channel, the MPEG-4 standard provides extensions to the GA
coding tools, that allow the content author to achieve the highest quality coding at the desired bitrate. Furthermore,
various bit rate scalability options are available within the GA coder. The low-bitrate techniques and scalability
modes provided within this tool set have also been verified in formal tests by MPEG.
The MPEG-4 low delay coding functionality provides the ability to extend the usage of generic low bitrate audio
coding to applications requiring a very low delay in the encoding / decoding chain (e.g. full-duplex real-time
communications). In contrast to traditional low delay coders based on speech coding technology, the concept of
this low delay coder is based on general perceptual audio coding and is thus suitable for a wide range of audio
signals. Specifically, it is derived from the proven architecture of MPEG-2/4 Advanced Audio Coding (AAC) and all
capabilities for coding of 2 (stereo) or more sound channels (multi-channel) are available within the low delay
coder. It operates at up to 48 kHz sampling rate and uses a frame length of 512 or 480 samples, compared to the
1024 or 960 samples used in standard MPEG-2/4 AAC to enable coding of general audio signals with an
algorithmic delay not exceeding 20 ms. Also the size of the window used in the analysis and synthesis filterbank is
reduced by a factor of 2. No block switching is used to avoid the “look-ahead'” delay due to the block switching
decision. To reduce pre-echo artefacts in the case of transient signals, window shape switching is provided instead.
For non-transient portions of the signal a sine window is used, while a so-called low overlap window is used for
transient portions. Use of the bit reservoir is minimized in the encoder in order to reach the desired target delay. As
one extreme case, no bit reservoir is used at all.
The MPEG-4 BSAC is used in combination with the AAC coding tools and replaces the noiseless coding of the
quantized spectral data and the scalefactors. The MPEG-4 BSAC provides fine grain scalability in steps of 1 kbit/s
per audio channel, i.e. 2 kbit/s steps for a stereo signal. One base layer stream and many small enhancement layer
streams are used. To obtain fine step scalability, a bit-slicing scheme is applied to the quantized spectral data. First
the quantized spectral values are grouped into frequency bands. Each of these groups contains the quantized
spectral values in their binary representation. Then the bits of a group are processed in slices according to their
significance. Thus all most significant bits (MSB) of the quantized values in a group are processed first. These bit-
slices are then encoded using an arithmetic coding scheme to obtain entropy coding with minimal redundancy. In
order to implement fine grain scalability efficiently using MPEG-4 Systems tools, the fine grain audio data can be
grouped into large-step layers and these large-step layers can be further grouped by concatenating large-step
layers from several sub-frames. Furthermore, the configuration of the payload transmitted over an Elementary
Stream (ES) can be changed dynamically (by means of the MPEG-4 backchannel capability) depending on the
environment, such as network traffic or user interaction. This means that BSAC can allow for real-time adjustments
to the quality of service. In addition to fine grain scalablity, it can improve the quality of an audio signal that is
decoded from a stream transmitted over an error-prone channel, such as a mobile communication networks or
Digital Audio Broadcasting (DAB) channel.
MPEG-4 SBR (Spectral Band Replication) is a bandwidth extension tool used in combination with the AAC general
audio codec. When integrated into the MPEG AAC codec, a significant improvement of the performance is
available, which can be used to lower the bitrate or improve the audio quality. This is achieved by replicating the
© ISO/IEC 2005 — All rights reserved ix

highband, i.e. the high frequency part of the spectrum. A small amount of data representing a parametric
description of the highband is encoded and used in the decoding process. The data rate is by far below the data
rate required when using conventional AAC coding of the highband.
0.3.2.3 Parametric audio coding tools
The parametric audio coding tool MPEG-4 HILN (Harmonic and Individual Lines plus Noise) codes non-speech
signals like music at bitrates of 4 kbit/s and higher using a parametric representation of the audio signal. The basic
idea of this technique is to decompose the input signal into audio objects which are described by appropriate
source models and represented by model parameters. Object models for sinusoids, harmonic tones, and noise are
utilized in the HILN coder. HILN allows independent change of speed and pitch during decoding.
The Parametric Audio Coding tools combine very low bitrate coding of general audio signals with the possibility of
modifying the playback speed or pitch during decoding without the need for an effects processing unit. In
combination with the speech and audio coding tools in MPEG-4, improved overall coding efficiency is expected for
applications of object based coding allowing selection and/or switching between different coding techniques.
This approach allows to introduce a more advanced source model than just assuming a stationary signal for the
duration of a frame, which motivates the spectral decomposition used in e.g. the MPEG-4 General Audio Coder. As
known from speech coding, where specialized source models based on the speech generation process in the
human vocal tract are applied, advanced source models can be advantageous, especially for very low bitrate
coding schemes.
Due to the very low target bitrates, only the parameters for a small number of objects can be transmitted. Therefore
a perception model is employed to select those objects that are most important for the perceptual quality of the
signal.
In HILN, the frequency and amplitude parameters are quantized according to the “just noticeable differences”
known from psychoacoustics. The spectral envelope of the noise and the harmonic tones are described using LPC
modeling as known from speech coding. Correlation between the parameters of one frame and those of
consecutive frames is exploited by parameter prediction. Finally, the quantized parameters are entropy coded and
multiplexed to form a bitstream payload.
A very interesting property of this parametric coding scheme arises from the fact that the signal is described in
terms of frequency and amplitude parameters. This signal representation permits speed and pitch change
functionality by simple parameter modification in the decoder. The HILN Parametric Audio Coder can be combined
with MPEG-4 Parametric Speech Coder (HVXC) to form an integrated parametric coder covering a wider range of
signals and bitrates. This integrated coder supports speed and pitch change. Using a speech/music classification
tool in the encoder, it is possible to automatically select the HVXC for speech signals and the HILN for music
signals. Such automatic HVXC/HILN switching was successfully demonstrated and the classification tool is
described in the informative Annex of the MPEG-4 standard.
MPEG-4 SSC, (SinuSoidal Coding) is a parametric coding tool that is capable of full bandwidth high quality audio
coding. The coding tool dissects a monaural or stereo audio signal into a number of different objects that each can
be parameterized efficiently and encoded at a low bit-rate. These objects are, transients: representing dynamic
changes in the temporal domain, sinusoids: representing deterministic components, and noise: representing
components that do not have a clear temporal or spectral localisation. The fourth object, that is only relevant for
stereo input signals, captures the stereo image. As the signal is represented in a parametric domain, independent,
high quality pitch and tempo scaling are possible at low computational cost.
0.3.3 Lossless audio coding tools
MPEG-4 DST (Direct Stream Transfer) provides lossless coding of oversampled audio signals.
0.3.4 Synthesis tools
Synthesis tools are designed for very low bitrate description and transmission, and terminal-side synthesis, of
synthetic music and other sounds.
x © ISO/IEC 2005 — All rights reserved

The MPEG-4 toolset providing general audio synthesis capability is called MPEG-4 Structured Audio, and it is
described in subpart 5 of ISO/IEC 14496-3. MPEG-4 Structured Audio (the SA coder) provides very general
capabilities for the description of synthetic sound, and the normative creation of synthetic sound in the decoding
terminal. High-quality stereo sound can be transmitted at bitrates from 0 kbit/s (no continuous cost) to 2-3 kbit/s for
extremely expressive sound using these tools.
Rather than specify a particular method of synthesis, SA specifies a flexible language for describing methods of
synthesis. This technique allows content authors two advantages. First, the set of synthesis techniques available is
not limited to those that were envisioned as useful by the creators of the standard; any current or future method of
synthesis may be used in MPEG-4 Structured Audio. Second, the creation of synthetic sound from structured
descriptions is normative in MPEG-4, so sound created with the SA coder will sound the same on any terminal.
Synthetic audio is transmitted via a set of instrument modules that can create audio signals under the control of a
score. An instrument is a small network of signal-processing primitives that control the parametric generation of
sound according to some algorithm. Several different instruments may be transmitted and used in a single
Structured Audio bitstream payload. A score is a time-sequenced set of commands that invokes various
instruments at specific times to contribute their output to an overall music performance. The format for the
description of instruments is SAOL, the Structured Audio Orchestra Language. The format for the description of
scores is SASL, the Structured Audio Score Language.
Efficient transmission of sound samples, also called wavetables, for use in sampling synthesis is accomplished by
providing interoperability with the MIDI Manufacturers Association Downloaded Sounds Level 2 (DLS-2) standard,
which is normatively referenced by the Structured Audio standard. By using the DLS-2 format, the simple and
popular technique of wavetable synthesis can be used in MPEG-4 Structured Audio soundtracks, either by itself or
in conjunction with other kinds of synthesis using the more general-purpose tools. To further enable interoperability
with existing content and authoring tools, the popular MIDI (Musical Instrument Digital Interface) control format can
be used instead of, or in addition to, scores in SASL for controlling synthesis.
Through the inclusion of compatibility with MIDI standards, MPEG-4 Structured Audio thus represents a unification
of the current technique for synthetic sound description (MIDI-based wavetable synthesis) with that of the future
(general-purpose algorithmic synthesis). The resulting standard solves problems not only in very-low-bitrate coding,
but also in virtual environments, video games, interactive music, karaoke systems, and many other applications.
0.3.5 Composition tools
Composition tools are designed for object-based coding, interactive functionality, and audiovisual synchronization.
The tools for audio composition, like those for visual composition, are specified in the MPEG-4 Systems standard
(ISO/IEC 14496-1). However, since readers interested in audio functionality are likely to look here first, a brief
overview is provided.
Audio composition is the use of multiple individual “audio objects” and mixing techniques to create a single
soundtrack. It is analogous to the process of recording a soundtrack in a multichannel mix, with each musical
instrument, voice actor, and sound effect on its own channel, and then “mixing down” the multiple channels to a
single channel or single stereo pair. In MPEG-4, the multichannel mix itself may be transmitted, with each audio
source using a different coding tool, and a set of instructions for mixdown also transmitted in the bitstream payload.
As the multiple audio objects are received, they are decoded separately, but not played back to the listener; rather,
the instructions for mixdown are used to prepare a single soundtrack from the “raw material” given in the objects.
This final soundtrack is then played for the listener.
An example serves to illustrate the efficacy of this approach. Suppose, for a certain application, we wish to transmit
the sound of a person speaking in a reverberant environment over stereo background music, at very high quality. A
traditional approach to coding would demand the use of a general audio coding at 32 kbit/s/channel or above; the
sound source is too complex to be well-modeled by a simple model-based coder. However, in MPEG-4 we can
represent the soundtrack as the conjunction of several objects: a speaking person passed through a reverberator
added to a synthetic music track. We transmit the speaker’s voice using the CELP tool at 16 kbit/s, the synthetic
music using the SA tool at 2 kbit/s, and allow a small amount of overhead (only a few hundreds of bytes as a fixed
cost) to describe the stereo mixdown and the reverberation. Using MPEG-4 and an object-based approach thus
allows us to describe in less than 20 kbit/s total a stream that might require 64 kbit/s to transmit with traditional
coding, at equivalent quality.
© ISO/IEC 2005 — All rights reserved xi

Additionally, having such structured soundtrack information present in the decoding terminal allows more
sophisticated client-side interaction to be included. For example, the listener can be allowed (if the content author
desires) to request that the background music be muted. This functionality would not be possible if the music and
speech were coded into the same audio track.
With the MPEG-4 Binary Format for Scenes (BIFS), specified in MPEG-4 Systems, a subset tool called
AudioBIFS allows content authors to describe sound scenes using this object-based framework. Multiple sources
may be mixed and combined, and interactive control provided for their combination. Sample-resolution control over
mixing is provided in this method. Dynamic download of custom signal-processing routines allows the content
author to exactly request a particular, normative, digital filter, reverberator, or other effects-processing routine.
Finally, an interface to terminal-dependent methods of 3-D audio spatialisation is provided for the description of
virtual-reality and other 3-D sound material.
As AudioBIFS is part of the general BIFS specification, the same framework is used to synchronize audio and
video, audio and computer graphics, or audio with other material. Please refer to ISO/IEC 14496-1 (MPEG-4
Systems) for more information on AudioBIFS and other topics in audiovisual synchronization.
0.3.6 Scalability tools
Scalability tools are designed for the creation of bitstream payloads that can be transmitted, without recoding, at
several different bitrates.
Many of the stream types in MPEG-4 are scalable in one manner or another. Several types of scalability in the
standard are discussed below.
Bitrate scalability allows a bitstream payload to be parsed into a bitstream payload of lower bitrate such that the
combination can still be decoded into a meaningful signal. The bitstream payload parsing can occur either during
transmission or in the decoder. Scalability is available within each of the natural audio coding schemes, or by a
combination of different natural audio coding schemes.
Bandwidth scalability is a particular case of bitrate scalability, whereby part of a bitstream payload representing a
part of the frequency spectrum can be discarded during transmission or decoding. This is available for the CELP
speech coder, where an extension layer converts the narrow band base layer speech coder into a wide band
speech coder. Also the general audio coding tools which all operate in the frequency domain offer a very flexible
bandwidth control for the different coding layers.
Encoder complexity scalability allows encoders of different complexity to generate valid and meaningful bitstream
payloads. An example for this is the availability of a high quality and a low complexity excitation module for the
wideband CELP coder allowing to choose between significant lower encoder complexity or optimized coding
quality.
Decoder complexity scalability allows a given bitstream payload to be decoded by decoders of different levels of
complexity. A subtype of decoder complexity scalability is graceful degradation, in which a decoder dynamically
monitors the resources available, and scales down the decoding complexity (and thus the audio quality) when
resources are limited. The Structured Audio decoder allows this type of scalability; a content author may provide
(for example) several different algorithms for the synthesis of piano sounds, and the content itself decides,
depending on available resources, which one to use.
0.3.7 Upstream
Upstream tools are designed for the dynamic control the streaming of the server for bitrate control and quality
feedback control.
The MPEG-4 upstream or backchannel allows a user on a remote side to dynamically control the streaming of
MPEG-4 content from a server. Backchannel streams carrying the user interaction information.
xii © ISO/IEC 2005 — All rights reserved

0.3.8 Error robustness facilities
0.3.8.1 Overview
Error robustness facilities include tools for error resilience as well as for error protection.
The error robustness facilities provide improved performance on error-prone transmission channels. They are
comprised of error resilient bitstream payload reordering, a common error protection tool and codec specific error
resilience tools.
0.3.8.2 Error resilient bitstream payload reordering
Error resilient bitstream payload reordering allows the effective use of advanced channel coding techniques like
unequal error protection (UEP), which can be perfectly adapted to the needs of the different coding tools. The basic
idea is to rearrange the audio frame content depending on its error sensitivity in one or more instances belonging to
different error sensitivity categories (ESC). This rearrangement can be either data element-wise or even bit-wise.
An error resilient bitstream payload frame is build by concatenating these instances.
Audio Bitstream Channel
Encoder formatter Coding
Channel
Audio Bitstream Channel
Decoder unformatter Decoding
The basic principle is depicted in the figure above. A bitstream payload is reordered according to the error
sensitivity of single bitstream payload elements or even single bits. This new arranged bitstream payload is channel
coded, transmitted and channel decoded. Prior to audio decoding, the bitstream payload is rearranged to its
original order.
0.3.8.3 Error protection
The EP tool provides unequal error protection. It receives several classes of bits from the audio coding tools, and
then applies forward error correction codes (FEC) and/or cyclic redundancy codes (CRC) for each class, according
to its error sensitivity.
The error protection tool (EP tool) provides the unequal error protection (UEP) capability to the set of ISO/IEC
14496-3 codecs. Main features of this tool are:
• providing a set of error correcting/detecting codes with wide and small-step scalability, both in performance
and in redundancy
• providing a generic and bandwidth-efficient error protection framework, which covers both fixed-length
frame streams and variable-length frame streams
• providing a UEP configuration control with low overhead.
0.3.8.4 Error resilience tools for AAC
Several tools are provided to increase the error resilience for AAC. These tools improve the perceived audio quality
of the decoded audio signal in case of corrupted bitstream payloads, which may occur e. g. in the presence of noisy
transmission channels.
• The Virtual CodeBooks tool (VCB11) extends the sectioning information of an AAC bitstream payload. This
permits the detection of serious errors within the spectral data of an MPEG-4 AAC bitstream payload.
© ISO/IEC 2005 — All ri
...

Questions, Comments and Discussion

Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.

Loading comments...