Information technology - Coding of audio-visual objects - Part 3: Audio

This document integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. This document standardizes individually sophisticated coding tools to provide a novel, flexible framework for audio synchronization, mixing, and downloaded post-production. This document does not target a single application such as real-time telephony or high-quality audio compression. Rather, it applies to every application requiring the use of advanced sound compression, synthesis, manipulation, or playback. This document specifies the state-of-the-art coding tools in several domains. As the tools it defines are integrated with the rest of the ISO/IEC 14496 series, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media, are enabled.

Technologies de l'information — Codage des objets audiovisuels — Partie 3: Codage audio

General Information

Status
Published
Publication Date
11-Dec-2019
Current Stage
9060 - Close of review
Completion Date
04-Jun-2030


Overview

ISO/IEC 14496-3:2019 (MPEG-4 Audio) is the fifth edition (2019) of the international standard for audio coding within the MPEG‑4 framework. Rather than targeting a single application, it defines a broad, integrated toolset for compressing, synthesizing, manipulating, synchronizing, and presenting audio across use cases - from low‑bitrate speech to high‑quality immersive sound and interactive/virtual‑reality audio. It specifies individual coding tools together with a flexible framework for object‑based audio, mixing, synchronization and post‑production, enabling interoperable implementations across devices and services.

Key topics and technical requirements

  • Multi‑tool architecture: Multiple toolsets are defined (speech coders, general audio coders, parametric and synthetic audio) and combined through profiles to suit different applications.
  • Speech coding: Includes HVXC (very low bitrates, roughly 1.2–4.0 kbit/s; 8 kHz sampling, ~100–3800 Hz band) and CELP (8 kHz and 16 kHz sampling; scalable bitrate and bandwidth).
  • General audio codecs: Advanced natural‑audio codecs such as AAC, TwinVQ and BSAC are specified alongside lossless options (ALS, SLS, DST).
  • Structured and parametric audio: Support for synthetic sound descriptions, Text‑To‑Speech Interface (TTSI), Structured Audio (SA), and parametric schemes (HILN, SSC) for very low bitrate/highly flexible rendering.
  • Error resilience and protection: Defines an error‑resilient bitstream payload syntax plus an EP‑tool for unequal error protection (UEP) and improved robustness on error‑prone channels.
  • Scalability: Bitrate and bandwidth scalability within single streams to adapt to networks with varying throughput.
  • Multiplexing & transport interfaces: Integrates with MPEG‑4 Systems (ISO/IEC 14496‑1) and DMIF (ISO/IEC 14496‑6); provides LATM/LOAS and ADIF/ADTS formats for audio transport and low‑overhead multiplexing.
  • Audio synchronization: Subpart for audio synchronization enables precise timing and mixing across multiple objects/streams.

Practical applications and typical users

  • Audio codec developers and implementers building MPEG‑4 Audio encoders/decoders (AAC, CELP, HVXC, ALS, etc.).
  • Streaming platforms, broadcasters and digital radio services needing scalable or low‑bitrate delivery with error robustness.
  • Game engines, VR/AR platforms and interactive media producers using object‑based audio and dynamic soundtracks.
  • Telecommunications and VoIP providers leveraging the speech toolset for low‑bandwidth voice services.
  • Content authors and post‑production engineers who require standardized frameworks for mixing, synchronization and downloaded post‑production.

Related standards

  • ISO/IEC 14496‑1 (MPEG‑4 Systems) - system-level multiplexing/presentation
  • ISO/IEC 14496‑6 (DMIF) - delivery multimedia interface
  • ISO/IEC 14496‑12 (MP4 File Format) - storage
  • LATM / LOAS, ADIF, ADTS - transport/multiplex formats referenced in the standard

ISO/IEC 14496-3:2019 is essential for anyone implementing advanced audio compression, interactive sound systems, or interoperable multimedia services that require state‑of‑the‑art coding, synchronization, and error‑resilient delivery.

Standard

ISO/IEC 14496-3:2019 - Information technology — Coding of audio-visual objects — Part 3: Audio. Released: 12/12/2019

English language
1443 pages

Frequently Asked Questions

ISO/IEC 14496-3:2019 is a standard published by the International Organization for Standardization (ISO). Its full title is "Information technology - Coding of audio-visual objects - Part 3: Audio". This standard covers: This document integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. This document standardizes individually sophisticated coding tools to provide a novel, flexible framework for audio synchronization, mixing, and downloaded post-production. This document does not target a single application such as real-time telephony or high-quality audio compression. Rather, it applies to every application requiring the use of advanced sound compression, synthesis, manipulation, or playback. This document specifies the state-of-the-art coding tools in several domains. As the tools it defines are integrated with the rest of the ISO/IEC 14496 series, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media, are enabled.

ISO/IEC 14496-3:2019 is classified under the following ICS (International Classification for Standards) categories: 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC 14496-3:2019 has the following relationships with other standards: it links to ISO/IEC 14496-3:2009/Amd 4:2013, ISO/IEC 14496-3:2009/Amd 7:2018, ISO/IEC 14496-3:2009/Amd 3:2012, ISO/IEC 14496-3:2009/Cor 3:2012, ISO/IEC 14496-3:2009/Amd 5:2015, ISO/IEC 14496-3:2009/Amd 1:2009, ISO/IEC 14496-3:2009/Amd 2:2010, ISO/IEC 14496-3:2009/Amd 6:2017 and ISO/IEC 14496-3:2009. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

You can purchase ISO/IEC 14496-3:2019 directly from iTeh Standards. The document is available in PDF format and is delivered instantly after payment. Add the standard to your cart and complete the secure checkout process. iTeh Standards is an authorized distributor of ISO standards.

Standards Content (Sample)


INTERNATIONAL STANDARD ISO/IEC 14496-3
Fifth edition
2019-12
Information technology — Coding of
audio-visual objects —
Part 3:
Audio
Technologies de l'information — Codage des objets audiovisuels —
Partie 3: Codage audio
Reference number
© ISO/IEC 2019
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents
Foreword           iv
0 Introduction           v
1 Scope            1
2 Normative references          1
3 Terms and definitions          2
4 Abbreviated terms          21
5 Structure of this document         23
Subpart 1: Main           24
Subpart 2: Speech coding — HVXC        169
Subpart 3: Speech coding — CELP        320
Subpart 4: General Audio coding (GA) — AAC, TwinVQ, BSAC     483
Subpart 5: Structured Audio (SA)        895
Subpart 6: Text To Speech Interface (TTSI)       1043
Subpart 7: Parametric Audio Coding — HILN       1053
Subpart 8: Parametric coding for high quality audio — SSC     1112
Subpart 9: MPEG-1/2 Audio in MPEG-4       1231
Subpart 10: Lossless coding of oversampled audio — DST     1244
Subpart 11: Audio lossless coding — ALS       1281
Subpart 12: Scalable lossless coding — SLS       1355
Subpart 13: Audio Synchronization        1429

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of
document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights. Details of any
patent rights identified during the development of the document will be in the Introduction and/or on the ISO
list of patent declarations received (see www.iso.org/patents) or the IEC list of patent declarations received
(see http://patents.iec.ch).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fifth edition cancels and replaces the fourth edition (ISO/IEC 14496-3:2009), which has been technically
revised. It incorporates the Amendments ISO/IEC 14496-3:2009/Amd.1:2009, ISO/IEC 14496-3:2009/Amd.2:2010,
ISO/IEC 14496-3:2009/Amd.3:2012, ISO/IEC 14496-3:2009/Amd.4:2013, ISO/IEC 14496-3:2009/Amd.4:2013/Cor.1:2015,
ISO/IEC 14496-3:2009/Amd.5:2015, ISO/IEC 14496-3:2009/Amd.6:2017 and ISO/IEC 14496-3:2009/Amd.7:2018,
as well as the Technical Corrigenda ISO/IEC 14496-3:2009/Cor.1:2009, ISO/IEC 14496-3:2009/Cor.2:2011,
ISO/IEC 14496-3:2009/Cor.3:2012, ISO/IEC 14496-3:2009/Cor.4:2012, ISO/IEC 14496-3:2009/Cor.5:2015,
ISO/IEC 14496-3:2009/Cor.6:2015 and ISO/IEC 14496-3:2009/Cor.7:2015.
A list of all parts in the ISO/IEC 14496 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.

0 Introduction
0.1 Overview
ISO/IEC 14496-3 (MPEG-4 Audio) is a new kind of audio standard that integrates many different types of
audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with
music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality
content. By standardizing individually sophisticated coding tools as well as a novel, flexible framework for
audio synchronization, mixing, and downloaded post-production, the developers of the MPEG-4 Audio
standard have created new technology for a new, interactive world of digital audio.
MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single
application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a
standard that applies to every application requiring the use of advanced sound compression, synthesis,
manipulation, or playback. The subparts that follow specify the state-of-the-art coding tools in several
domains; however, MPEG-4 Audio is more than just the sum of its parts. As the tools described here are
integrated with the rest of the MPEG-4 standard, exciting new possibilities for object-based audio coding,
interactive presentation, dynamic soundtracks, and other sorts of new media, are enabled.
Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a
natural feature of systems that depend on the MPEG-4 Audio standard. A system that uses a particular coder
— for example a real-time voice communication system making use of the MPEG-4 speech coding toolset —
can easily share data and development tools with other systems, even in different domains, that use the same
tool — for example a voicemail indexing and retrieval system making use of MPEG-4 speech coding.
The remainder of this clause gives a more detailed overview of the capabilities and functioning of MPEG-4
Audio. First, a discussion of concepts that have changed since the MPEG-2 audio standards is presented.
Then the MPEG-4 Audio toolset is outlined.
0.2 Concepts of MPEG-4 Audio
0.2.1 General
As with previous MPEG standards, MPEG-4 does not standardize methods for encoding sound. Thus, content
authors are left to their own decisions as to the best method of creating bitstream payloads. At the present
time, methods to automatically convert natural sound into synthetic or multi-object descriptions are not mature;
therefore, most immediate solutions will involve interactively authoring the content stream in some way. This
process is similar to current schemes for MIDI-based and multi-channel mixdown authoring of soundtracks.
Many concepts in MPEG-4 Audio are different from those in previous MPEG Audio standards. For the benefit
of readers who are familiar with MPEG-1 and MPEG-2 we provide a brief overview here.
0.2.2 Audio storage and transport facilities
In all of the MPEG-4 tools for audio coding, the coding standard ends at the point of constructing access units
that contain the compressed data. The MPEG-4 Systems (ISO/IEC 14496-1) specification describes how to
convert these individually coded access units into elementary streams.
There is no standard transport mechanism of these elementary streams over a channel. This is because the
broad range of applications that can make use of MPEG-4 technology has delivery requirements that are too
wide to easily characterize with a single solution. Rather, what is standardized is an interface (the Delivery
Multimedia Interface Format, or DMIF, specified in ISO/IEC 14496-6) that describes the capabilities of a
transport layer and the communication between transport, multiplex, and demultiplex functions in encoders
and decoders. The use of DMIF and the MPEG-4 Systems specification allows transmission functions that are
much more sophisticated than are possible with previous MPEG standards.
However, LATM and LOAS were defined to provide a low overhead audio multiplex and transport mechanism
for natural audio applications, which do not require sophisticated object-based coding or other functions
provided by MPEG-4 Systems.

Table 0.1 gives an overview of the multiplex, storage and transmission formats currently available for
MPEG-4 Audio within the MPEG-4 framework:
Table 0.1 – MPEG-4 Audio multiplex, storage and transmission formats

  Format  Functionality defined in MPEG-4            Functionality originally defined in       Description
  M4Mux   ISO/IEC 14496-1 (normative)                -                                         MPEG-4 Multiplex scheme
  LATM    ISO/IEC 14496-3 (normative)                -                                         Low Overhead Audio Transport Multiplex
  ADIF    ISO/IEC 14496-3 (informative)              ISO/IEC 13818-7 (normative)               Audio Data Interchange Format (AAC only)
  MP4FF   ISO/IEC 14496-12 (normative)               -                                         MPEG-4 File Format
  ADTS    ISO/IEC 14496-3 (informative)              ISO/IEC 13818-7 (normative, exemplarily)  Audio Data Transport Stream (AAC only)
  LOAS    ISO/IEC 14496-3 (normative, exemplarily)   -                                         Low Overhead Audio Stream, based on LATM;
                                                                                               three versions are available:
                                                                                               AudioSyncStream(), EPAudioSyncStream(),
                                                                                               AudioPointerStream()
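As an illustration of how lightweight these transport formats are, the following sketch parses the widely documented 7-byte ADTS header (syncword, profile, sampling-frequency index, channel configuration, frame length). It is a minimal, non-conformant reader for orientation only; the field names follow common usage rather than the normative syntax tables.

```python
# Minimal sketch of an ADTS header parser; not a conformant implementation.
SAMPLING_FREQUENCIES = {0: 96000, 1: 88200, 2: 64000, 3: 48000, 4: 44100,
                        5: 32000, 6: 24000, 7: 22050, 8: 16000, 9: 12000,
                        10: 11025, 11: 8000, 12: 7350}

def parse_adts_header(data: bytes) -> dict:
    """Parse the first ADTS header in `data` and return its key fields."""
    assert len(data) >= 7, "ADTS fixed+variable header is 7 bytes (without CRC)"
    bits = int.from_bytes(data[:7], "big")          # the 56 header bits

    def field(offset: int, width: int) -> int:      # big-endian bit extraction
        return (bits >> (56 - offset - width)) & ((1 << width) - 1)

    assert field(0, 12) == 0xFFF, "missing ADTS syncword"
    return {
        "mpeg_version": 2 if field(12, 1) else 4,   # ID bit: 1 = MPEG-2
        "crc_present": field(15, 1) == 0,           # protection_absent inverted
        "profile": field(16, 2),                    # 0 = Main, 1 = LC, ...
        "sampling_rate": SAMPLING_FREQUENCIES[field(18, 4)],
        "channel_configuration": field(21, 3),
        "frame_length_bytes": field(30, 13),        # includes the header itself
        "buffer_fullness": field(43, 11),
        "raw_data_blocks": field(54, 2) + 1,
    }
```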
To allow for a user on the remote side of a channel to dynamically control a server streaming MPEG-4
content, MPEG-4 defines backchannel streams that can carry user interaction information.
0.2.3 MPEG-4 Audio supports low-bitrate coding
Previous MPEG Audio standards have focused primarily on transparent (undetectable) or nearly transparent
coding of high-quality audio at whatever bitrate was required to provide it. MPEG-4 provides new and
improved tools for this purpose, but also standardizes (and has tested) tools that can be used for transmitting
audio at the low bitrates suitable for Internet, digital radio, or other bandwidth-limited delivery. The new tools
specified in MPEG-4 are the state-of-the-art tools that support low-bitrate coding of speech and other audio.
0.2.4 MPEG-4 Audio is an object-based coding standard with multiple tools
Previous MPEG Audio standards provided a single toolset, with different configurations of that toolset
specified for use in various applications. MPEG-4 provides several toolsets that have no particular relationship
to each other, each with a different target function. The profiles of MPEG-4 Audio specify which of these tools
are used together for various applications.
Further, in previous MPEG standards, a single (perhaps multi-channel or multi-language) piece of content was
transmitted. In contrast, MPEG-4 supports a much more flexible concept of a soundtrack. Multiple tools may
be used to transmit several audio objects, and when using multiple tools together an audio composition
system is provided to create a single soundtrack from the several audio substreams. User interaction, terminal
capability, and speaker configuration may be used when determining how to produce a single soundtrack from
the component objects. This capability gives MPEG-4 significant advantages in quality and flexibility when
compared to previous audio standards.
0.2.5 MPEG-4 Audio provides capabilities for synthetic sound
In natural sound coding, an existing sound is compressed by a server, transmitted and decompressed at the
receiver. This type of coding is the subject of many existing standards for sound compression. In contrast,
MPEG-4 standardizes a novel paradigm in which synthetic sound descriptions, including synthetic speech and
synthetic music, are transmitted and then synthesized into sound at the receiver. Such capabilities open up
new areas of very-low-bitrate but still very-high-quality coding.

0.2.6 MPEG-4 Audio provides capabilities for error robustness
Improved error robustness capabilities for all coding tools are provided through the error resilient bitstream
payload syntax. This tool supports advanced channel coding techniques, which can be adapted to the special
needs of given coding tools and a given communications channel. This error resilient bitstream payload syntax
is mandatory for all error resilient object types.
The error protection tool (EP tool) provides unequal error protection (UEP) for MPEG-4 Audio in conjunction
with the error resilient bitstream payload. UEP is an efficient method to improve the error robustness of source
coding schemes. It is used by various speech and audio coding systems operating over error-prone channels
such as mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal
representation are first grouped into different classes according to their error sensitivity. Then error protection
is individually applied to the different classes, giving better protection to more sensitive bits.
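The following toy sketch illustrates the UEP principle just described: bits are grouped into error sensitivity classes, and each class receives its own CRC length and channel-code rate. The class names and numbers are invented for the example; the normative EP-tool syntax is far more detailed.

```python
# Toy illustration of unequal error protection; classes are invented.
from dataclasses import dataclass

@dataclass
class ProtectionClass:
    name: str
    crc_bits: int      # stronger CRC for more sensitive bits
    code_rate: float   # lower rate = more redundancy

CLASSES = [
    ProtectionClass("ESC0: headers, global gain", crc_bits=16, code_rate=1 / 3),
    ProtectionClass("ESC1: pitch, coarse spectrum", crc_bits=8, code_rate=1 / 2),
    ProtectionClass("ESC2: fine spectrum", crc_bits=0, code_rate=1.0),
]

def protected_size(class_bits: list[int]) -> int:
    """Channel bits needed when each class is protected individually."""
    total = 0
    for bits, cls in zip(class_bits, CLASSES):
        total += round((bits + cls.crc_bits) / cls.code_rate)
    return total

# An 80-bit frame split 20/30/30 across the three classes:
print(protected_size([20, 30, 30]))   # -> 214 channel bits
```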
Improved error robustness for AAC is provided by a set of error resilience tools. These tools reduce the
perceived degradation of the decoded audio signal that is caused by corrupted bits in the bitstream payload.
0.2.7 MPEG-4 Audio provides capabilities for scalability
Previous MPEG Audio standards provided a single bitrate, single bandwidth toolset, with different
configurations of that toolset specified for use in various applications. MPEG-4 provides several bitrate and
bandwidth options within a single stream, providing a scalability functionality that permits a given stream to
scale to the requirements of different channels and applications, or to be responsive to a given channel that has
dynamic throughput characteristics. The tools specified in MPEG-4 are the state-of-the-art tools providing
scalable compression of speech and audio signals.
0.3 The MPEG-4 Audio tool set
0.3.1 Speech coding tools
0.3.1.1 Overview
Speech coding tools are designed for the transmission and decoding of synthetic and natural speech.
Two types of speech coding tools are provided in MPEG-4. The natural speech tools allow the compression,
transmission, and decoding of human speech, for use in telephony, personal communication, and surveillance
applications. The synthetic speech tool provides an interface to text-to-speech synthesis systems; using
synthetic speech provides very-low-bitrate operation and built-in connection with facial animation for use in
low-bitrate video teleconferencing applications.
0.3.1.2 Natural speech coding
The MPEG-4 speech coding toolset covers the compression and decoding of natural speech sound at bitrates
ranging between 2 and 24 kbit/s. When variable bitrate coding is allowed, coding at even less than 2 kbit/s, for
example an average bitrate of 1.2 kbit/s, is also supported. Two basic speech coding techniques are used:
One is a parametric speech coding algorithm, HVXC (Harmonic Vector eXcitation Coding), for very low bit
rates; and the other is a CELP (Code Excited Linear Prediction) coding technique. The target applications of
the MPEG-4 speech coders range from mobile and satellite communications to Internet telephony, packaged
media and speech databases. The toolset meets a wide range of requirements encompassing bitrate,
functionality and sound quality.
MPEG-4 HVXC operates at fixed bitrates between 2.0 kbit/s and 4.0 kbit/s using a bitrate scalability technique.
It also operates at lower bitrates, typically 1.2 - 1.7 kbit/s, using a variable bitrate technique. HVXC provides
communications-quality to near-toll-quality speech in the 100 Hz – 3800 Hz band at 8 kHz sampling rate.
HVXC also allows independent change of speed and pitch during decoding, which is a powerful functionality
for fast access to speech databases. HVXC functionalities include 2.0 - 4.0 kbit/s fixed bitrate modes and a
2.0 kbit/s maximum variable bitrate mode.
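For a feel for the payload sizes involved, a back-of-the-envelope calculation follows; it assumes the commonly cited 20 ms HVXC frame length, which is an assumption here rather than a value stated in this overview.

```python
# Rough per-frame payload sizes, assuming 20 ms frames at 8 kHz sampling.
FRAME_MS = 20
for kbps in (2.0, 4.0):
    bits_per_frame = kbps * 1000 * FRAME_MS / 1000
    print(f"{kbps} kbit/s -> {bits_per_frame:.0f} bits per 20 ms frame")
# 2.0 kbit/s -> 40 bits per frame; 4.0 kbit/s -> 80 bits per frame
```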
Error Resilient (ER) HVXC extends operation of the variable bitrate mode to 4.0 kbit/s to allow higher quality
variable rate coding. The ER HVXC therefore provides fixed bitrate modes of 2.0 - 4.0 kbit/s and a variable
bitrate of either less than 2.0 kbit/s or less than 4.0 kbit/s, both in scalable and non-scalable modes. In the
variable bitrate modes, non-speech parts are detected in unvoiced signals, and a smaller number of bits are
used for these non-speech parts to reduce the average bitrate. ER HVXC provides communications-quality to
near-toll-quality speech in the 100 Hz - 3800 Hz band at 8 kHz sampling rate. When the variable bitrate mode
is allowed, operation at a lower average bitrate is possible. Coded speech using the variable bitrate mode at
typical average bitrates of 1.5 kbit/s and 3.0 kbit/s has essentially the same quality as the 2.0 kbit/s and
4.0 kbit/s fixed rates, respectively. The functionality of pitch and speed change during
decoding is supported for all modes. ER HVXC has a bitstream payload syntax with the error sensitivity
classes to be used with the EP-Tool, and some error concealment functionality is supported for use in error-
prone channels such as mobile communication channels. The target applications of the ER HVXC speech
coder range from mobile and satellite communications to Internet telephony, packaged media and speech
databases.
MPEG-4 CELP is a well-known coding algorithm with new functionality. Conventional CELP coders offer
compression at a single bit rate and are optimized for specific applications. Compression is one of the
functionalities provided by MPEG-4 CELP, but MPEG-4 also enables the use of one basic coder in multiple
applications. It provides scalability in bitrate and bandwidth, as well as the ability to generate bitstream
payloads at arbitrary bitrates. The MPEG-4 CELP coder supports two sampling rates, namely, 8 kHz and
16 kHz. The associated bandwidths are 100 Hz – 3800 Hz for 8 kHz sampling and 50 Hz – 7000 Hz for 16
kHz sampling. The silence compression tool comprises a voice activity detector (VAD), a discontinuous
transmission (DTX) unit and a comfort noise generator (CNG) module. The tool encodes/decodes the input
signal at a lower bitrate during the non-active-voice (silent) frames. During the active-voice (speech) frames,
MPEG-4 CELP encoding and decoding are used.
The silence compression tool reduces the average bitrate by coding silence at a lower bitrate.
In the encoder, a voice activity detector is used to distinguish between regions with normal speech activity and
those with silence or background noise. During normal speech activity, the CELP coding is used. Otherwise a
silence insertion descriptor (SID) is transmitted at a lower bitrate. This SID enables a comfort noise generator
(CNG) in the decoder. The amplitude and the spectral shape of this comfort noise are specified by energy and
LPC parameters in methods similar to those used in a normal CELP frame. These parameters are optionally
re-transmitted in the SID and thus can be updated as required.
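The decision loop of the silence compression tool can be mimicked in a few lines. The sketch below uses a deliberately crude energy-based VAD with an invented threshold; real detectors are far more elaborate.

```python
# Schematic DTX loop: CELP frames during speech, a SID at the onset of
# silence, and no transmission while silence continues.
import numpy as np

def classify_frames(frames, threshold=1e-3):
    """Yield ('SPEECH', frame), ('SID', None) or ('DTX', None) per frame."""
    sid_sent = False
    for frame in frames:
        if np.mean(frame ** 2) > threshold:    # crude energy-based VAD
            sid_sent = False
            yield "SPEECH", frame              # would go to the CELP encoder
        elif not sid_sent:
            sid_sent = True
            yield "SID", None                  # energy + LPC for comfort noise
        else:
            yield "DTX", None                  # nothing transmitted
```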
MPEG has conducted extensive verification testing in realistic listening conditions in order to prove the
efficacy of the speech coding toolset.
0.3.1.3 Text-to-speech interface
Text-to-speech (TTS) capability is becoming a rather common media type and plays an important role in
various multi-media application areas. For instance, by using TTS functionality, multimedia content with
narration can be easily created without recording natural speech. Before MPEG-4, however, there was no way
for a multimedia content provider to easily give instructions to an unknown TTS system. With MPEG-4 TTS
Interface, a single common interface for TTS systems is standardized. This interface allows speech
information to be transmitted in the international phonetic alphabet (IPA), or in a textual (written) form of any
language.
The MPEG-4 Hybrid/Multi-Level Scalable TTS Interface is a superset of the conventional TTS framework.
This extended TTS Interface can utilize prosodic information taken from natural speech in addition to input text
and can thus generate much higher-quality synthetic speech. The interface and its bitstream payload format are
scalable in terms of this added information; for example, if some parameters of prosodic information are not
available, a decoder can generate the missing parameters by rule. Normative algorithms for speech synthesis
and text-to-phoneme translation are not specified in MPEG-4, but to meet the goal that underlies the MPEG-4
TTS Interface, a decoder should fully utilize all the provided information according to the user’s requirements
level.
As well as an interface to text-to-speech synthesis systems, MPEG-4 specifies a joint coding method for
phonemic information and facial animation (FA) parameters and other animation parameters (AP). Using this
technique, a single bitstream payload may be used to control both the text-to-speech interface and the facial
animation visual object decoder (see ISO/IEC 14496-2, Annex C). The functionality of this extended TTS thus
ranges from conventional TTS to natural speech coding and its application areas, from simple TTS to audio
presentation with TTS and motion picture dubbing with TTS.

0.3.2 Audio coding tools
0.3.2.1 Overview
Audio coding tools are designed for the transmission and decoding of recorded music and other audio
soundtracks.
0.3.2.2 General audio coding tools
MPEG-4 standardizes the coding of natural audio at bitrates ranging from 6 kbit/s up to several hundred kbit/s
per audio channel for mono, two-channel, and multi-channel stereo signals. General high-quality
compression is provided by incorporating the MPEG-2 AAC standard (ISO/IEC 13818-7), with certain
improvements, as MPEG-4 AAC. At 64 kbit/s/channel and higher ranges, this coder has been found in
verification testing under rigorous conditions to meet the criterion of “indistinguishable quality” as defined by
the European Broadcasting Union.
General audio (GA) coding tools comprise the AAC tool set expanded by alternative quantization and coding
schemes (Twin-VQ and BSAC). The general audio coder uses a perceptual filterbank, a sophisticated
masking model, noise-shaping techniques, channel coupling, and noiseless coding and bit-allocation to
provide the maximum compression within the constraints of providing the highest possible quality.
Psychoacoustic coding standards developed by MPEG have represented the state-of-the-art in this
technology since MPEG-1 Audio; MPEG-4 General Audio coding continues this tradition.
For bitrates ranging from 6 kbit/s to 64 kbit/s per channel, the MPEG-4 standard provides extensions to the
GA coding tools that allow the content author to achieve the highest quality coding at the desired bitrate.
Furthermore, various bit rate scalability options are available within the GA coder. The low-bitrate techniques
and scalability modes provided within this tool set have also been verified in formal tests by MPEG.
The MPEG-4 low delay coding functionality provides the ability to extend the usage of generic low bitrate
audio coding to applications requiring a very low delay in the encoding / decoding chain (e.g. full-duplex real-
time communications). In contrast to traditional low delay coders based on speech coding technology, the
concept of this low delay coder is based on general perceptual audio coding and is thus suitable for a wide
range of audio signals. Specifically, it is derived from the proven architecture of MPEG-2/4 Advanced Audio
Coding (AAC) and all capabilities for coding of 2 (stereo) or more sound channels (multi-channel) are
available within the low delay coder. To enable coding of general audio signals with an algorithmic delay not
exceeding 20 ms at 48 kHz, it uses a frame length of 512 or 480 samples (compared to the 1024 or 960
samples used in standard MPEG-2/4 AAC). Also the size of the window used in the analysis and synthesis
filterbank is reduced by a factor of 2. No block switching is used to avoid the “look-ahead” delay due to the
block switching decision. To reduce pre-echo artefacts in the case of transient signals, window shape
switching is provided instead. For non-transient portions of the signal a sine window is used, while a so-called
low overlap window is used for transient portions. Use of the bit reservoir is minimized in the encoder in order
to reach the desired target delay. As one extreme case, no bit reservoir is used at all.
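The frame-length choice can be checked with simple arithmetic. Assuming a 50 %-overlap MDCT filterbank whose algorithmic delay is roughly two frame lengths (one frame of buffering plus one frame of overlap), and ignoring bit reservoir and implementation delays:

```python
# Approximate algorithmic delay for different frame lengths at 48 kHz.
SAMPLE_RATE = 48_000
for frame_len in (1024, 512, 480):
    delay_ms = 2 * frame_len / SAMPLE_RATE * 1000
    print(f"{frame_len:4d} samples -> ~{delay_ms:.1f} ms")
# 1024 -> ~42.7 ms (standard AAC), 512 -> ~21.3 ms, 480 -> 20.0 ms
```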
The MPEG-4 BSAC is used in combination with the AAC coding tools and replaces the noiseless coding of
the quantized spectral data and the scalefactors. The MPEG-4 BSAC provides fine grain scalability in steps of
1 kbit/s per audio channel, i.e. 2 kbit/s steps for a stereo signal. One base layer stream and many small
enhancement layer streams are used. To obtain fine step scalability, a bit-slicing scheme is applied to the
quantized spectral data. First the quantized spectral values are grouped into frequency bands. Each of these
groups contains the quantized spectral values in their binary representation. Then the bits of a group are
processed in slices according to their significance. Thus all most significant bits (MSB) of the quantized values
in a group are processed first. These bit-slices are then encoded using an arithmetic coding scheme to obtain
entropy coding with minimal redundancy. In order to implement fine grain scalability efficiently using MPEG-4
Systems tools, the fine grain audio data can be grouped into large-step layers and these large-step layers can
be further grouped by concatenating large-step layers from several sub-frames. Furthermore, the
configuration of the payload transmitted over an Elementary Stream (ES) can be changed dynamically (by
means of the MPEG-4 backchannel capability) depending on the environment, such as network traffic or user
interaction. This means that BSAC can allow for real-time adjustments to the quality of service. In addition to
fine grain scalability, it can improve the quality of an audio signal that is decoded from a stream transmitted
over an error-prone channel, such as a mobile communication network or Digital Audio Broadcasting (DAB)
channel.
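The bit-slicing idea can be demonstrated in a few lines: quantized values are emitted bit plane by bit plane, most significant plane first, so a truncated stream still yields a coarse reconstruction. The arithmetic-coding stage is omitted from this toy sketch.

```python
# Toy bit-slicing: emit MSB planes first, reconstruct from what arrived.
def bit_slices(values, bits=4):
    """One slice per bit plane, most significant plane first."""
    return [[(v >> plane) & 1 for v in values]
            for plane in range(bits - 1, -1, -1)]

def reconstruct(slices, bits=4):
    """Rebuild values from however many planes were received."""
    vals = [0] * len(slices[0])
    for i, plane in enumerate(slices):
        weight = 1 << (bits - 1 - i)
        vals = [v + b * weight for v, b in zip(vals, plane)]
    return vals

q = [5, 12, 3, 9]                  # quantized spectral values (4-bit)
planes = bit_slices(q)
print(reconstruct(planes[:2]))     # 2 MSB planes received -> [4, 12, 0, 8]
print(reconstruct(planes))         # all planes received   -> [5, 12, 3, 9]
```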

MPEG-4 SBR (Spectral Band Replication) is a bandwidth extension tool used in combination with the AAC
general audio codec. When integrated into the MPEG AAC codec, a significant improvement of the
performance is available, which can be used to lower the bitrate or improve the audio quality. This is achieved
by replicating the highband, i.e. the high frequency part of the spectrum. A small amount of data representing
a parametric description of the highband is encoded and used in the decoding process. This data rate is far
below that required when using conventional AAC coding of the highband.
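Conceptually, the replication can be pictured as follows: the decoded lowband spectrum is copied upward and shaped band-wise to a transmitted coarse envelope. Real SBR operates on a QMF filterbank with much richer control data; this sketch captures only the core idea, and all names are illustrative.

```python
# Conceptual highband replication: copy the lowband, shape it to a coarse
# transmitted envelope, and append it above the lowband.
import numpy as np

def replicate_highband(low_spectrum, envelope):
    """Extend a magnitude spectrum using an envelope-shaped lowband copy."""
    patch = low_spectrum.copy()                         # copy lowband upward
    bands = np.array_split(np.arange(patch.size), envelope.size)
    for target_rms, idx in zip(envelope, bands):
        current_rms = np.sqrt(np.mean(patch[idx] ** 2)) + 1e-12
        patch[idx] *= target_rms / current_rms          # match coarse envelope
    return np.concatenate([low_spectrum, patch])
```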
0.3.2.3 Parametric audio coding tools
The parametric audio coding tool MPEG-4 HILN (Harmonic and Individual Lines plus Noise) codes non-
speech signals like music at bitrates of 4 kbit/s and higher using a parametric representation of the audio
signal. The basic idea of this technique is to decompose the input signal into audio objects which are
described by appropriate source models and represented by model parameters. Object models for sinusoids,
harmonic tones, and noise are utilized in the HILN coder. HILN allows independent change of speed and pitch
during decoding.
The Parametric Audio Coding tools combine very low bitrate coding of general audio signals with the
possibility of modifying the playback speed or pitch during decoding without the need for an effects processing
unit. In combination with the speech and audio coding tools in MPEG-4, improved overall coding efficiency is
expected for applications of object based coding allowing selection and/or switching between different coding
techniques.
This approach makes it possible to introduce a more advanced source model than just assuming a stationary
signal for the duration of a frame, the assumption that motivates the spectral decomposition used in, e.g., the MPEG-4 General Audio
Coder. As known from speech coding, where specialized source models based on the speech generation
process in the human vocal tract are applied, advanced source models can be advantageous, especially for
very low bitrate coding schemes.
Due to the very low target bitrates, only the parameters for a small number of objects can be transmitted.
Therefore a perception model is employed to select those objects that are most important for the perceptual
quality of the signal.
In HILN, the frequency and amplitude parameters are quantized according to the “just noticeable differences”
known from psychoacoustics. The spectral envelope of the noise and the harmonic tones are described using
LPC modeling as known from speech coding. Correlation between the parameters of one frame and those of
consecutive frames is exploited by parameter prediction. Finally, the quantized parameters are entropy coded
and multiplexed to form a bitstream payload.
A very interesting property of this parametric coding scheme arises from the fact that the signal is described in
terms of frequency and amplitude parameters. This signal representation permits speed and pitch change
functionality by simple parameter modification in the decoder. The HILN Parametric Audio Coder can be
combined with MPEG-4 Parametric Speech Coder (HVXC) to form an integrated parametric coder covering a
wider range of signals and bitrates. This integrated coder supports speed and pitch change. Using a
speech/music classification tool in the encoder, it is possible to automatically select the HVXC for speech
signals and the HILN for music signals. Such automatic HVXC/HILN switching was successfully demonstrated
and the classification tool is described in the informative Annex of the MPEG-4 standard.
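The speed and pitch flexibility of a parametric representation is easy to demonstrate: resynthesis simply reinterprets the frequency and amplitude parameters. The sketch below is a generic sum-of-sinusoids synthesizer, not the HILN decoder; the parameter layout is illustrative.

```python
# Resynthesis from (frequency, amplitude) parameters; pitch and speed are
# plain parameter modifications, with no effects processing needed.
import numpy as np

def synthesize(partials, duration_s, fs=8000, pitch=1.0, speed=1.0):
    """partials: list of (frequency_hz, amplitude) pairs -> mono signal."""
    t = np.arange(int(duration_s / speed * fs)) / fs    # speed scales time
    out = np.zeros_like(t)
    for freq, amp in partials:
        out += amp * np.sin(2 * np.pi * freq * pitch * t)  # pitch scales freqs
    return out

tone = [(440.0, 0.5), (880.0, 0.25), (1320.0, 0.12)]    # a harmonic tone
octave_up_half_speed = synthesize(tone, 1.0, pitch=2.0, speed=0.5)
```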
MPEG-4 SSC (SinuSoidal Coding) is a parametric coding tool that is capable of full bandwidth high quality
audio coding. The coding tool dissects a monaural or stereo audio signal into a number of different objects
that each can be parameterized efficiently and encoded at a low bit-rate. These objects are transients,
representing dynamic changes in the temporal domain; sinusoids, representing deterministic components; and
noise, representing components that do not have a clear temporal or spectral localisation. The fourth object,
that is only relevant for stereo input signals, captures the stereo image. As the signal is represented in a
parametric domain, independent, high quality pitch and tempo scaling are possible at low computational cost.
0.3.3 Lossless audio coding tools
MPEG-4 DST (Direct Stream Transfer) provides lossless coding of oversampled audio signals.

MPEG-4 ALS (Audio Lossless Coding) provides lossless coding of digital audio signals. Input signals can be
integer PCM data with 8 to 32-bit word length or 32-bit IEEE floating-point data. Up to 65536 channels are
supported.
MPEG-4 SLS (Scalable Lossless Coding) is a tool used in combination with optional MPEG-4 General Audio
coding tools to provide fine-grain scalability up to numerically lossless coding of digital audio waveforms.
0.3.4 Synthesis tools
Synthesis tools are designed for very low bitrate description and transmission, and terminal-side synthesis, of
synthetic music and other sounds.
The MPEG-4 toolset providing general audio synthesis capability is called MPEG-4 Structured Audio, and it
is described in subpart 5 of ISO/IEC 14496-3. MPEG-4 Structured Audio (the SA coder) provides very general
capabilities for the description of synthetic sound, and the normative creation of synthetic sound in the
decoding terminal. High-quality stereo sound can be transmitted at bitrates from 0 kbit/s (no continuous cost)
to 2-3 kbit/s for extremely expressive sound using these tools.
Rather than specify a particular method of synthesis, SA specifies a flexible language for describing methods
of synthesis. This technique allows content authors two advantages. First, the set of synthesis techniques
available is not limited to those that were envisioned as useful by the creators of the standard; any current or
future method of synthesis may be used in MPEG-4 Structured Audio. Second, the creation of synthetic sound
from structured descriptions is normative in MPEG-4, so sound created with the SA coder will sound the same
on any terminal.
Synthetic audio is transmitted via a set of instrument modules that can create audio signals under the control
of a score. An instrument is a small network of signal-processing primitives that control the parametric
generation of sound according to some algorithm. Several different instruments may be transmitted and used
in a single Structured Audio bitstream payload. A score is a time-sequenced set of commands that invokes
various instruments at specific times to contribute their output to an overall music performance. The format for
the description of instruments is SAOL, the Structured Audio Orchestra Language. The format for the
description of scores is SASL, the Structured Audio Score Language.
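The orchestra/score split can be mimicked in ordinary code. The sketch below is a conceptual analogue in Python, not SAOL or SASL syntax: instruments are parameterized signal generators, and the score is a time-ordered list of invocations mixed into a single output.

```python
# A toy "orchestra" (instrument functions) driven by a toy "score".
import numpy as np

FS = 22050

def beep(freq, dur):                        # an instrument: params -> signal
    t = np.arange(int(dur * FS)) / FS
    return 0.3 * np.sin(2 * np.pi * freq * t) * np.exp(-3 * t)

ORCHESTRA = {"beep": beep}
SCORE = [(0.0, "beep", (440.0, 0.5)),       # (start_s, instrument, params)
         (0.5, "beep", (660.0, 0.5))]

def render(score, total_s=1.5):
    out = np.zeros(int(total_s * FS))
    for start, name, params in score:
        sig = ORCHESTRA[name](*params)
        i = int(start * FS)
        out[i:i + sig.size] += sig          # events mix additively
    return out
```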
Efficient transmission of sound samples, also called wavetables, for use in sampling synthesis is
accomplished by providing interoperability with the MIDI Manufacturers Association Downloadable Sounds
Level 2 (DLS-2) standard, which is normatively referenced by the Structured Audio standard. By using the
DLS-2 format, the simple and popular technique of wavetable synthesis can be used in MPEG-4 Structured
Audio soundtracks, either by itself or in conjunction with other kinds of synthesis using the more general-
purpose tools. To further enable interoperability with existing content and authoring tools, the popular MIDI
(Musical Instrument Digital Interface) control format can be used instead of, or in addition to, scores in SASL
for controlling synthesis.
Through the inclusion of compatibility with MIDI standards, MPEG-4 Structured Audio thus represents a
unification of the current technique for synthetic sound description (MIDI-based wavetable synthesis) with that
of the future (general-purpose algorithmic synthesis). The resulting standard solves problems not only in very-
low-bitrate coding, but also in virtual environments, video games, interactive music, karaoke systems, and
many other applications.
0.3.5 Composition tools
Composition tools are designed for object-based coding, interactive functionality, and audiovisual
synchronization.
The tools for audio composition, like those for visual composition, are specified in the MPEG-4 Systems
standard (ISO/IEC 14496-1). However, since readers interested in audio functionality are likely to look here
first, a brief overview is provided.
Audio composition is the use of multiple individual “audio objects” and mixing techniques to create a single
soundtrack. It is analogous to the process of recording a soundtrack in a multichannel mix, with each musical
instrument, voice actor, and sound effect on its own channel, and then “mixing down” the multiple channels to
a single channel or single stereo pair. In MPEG-4, the multichannel mix itself may be transmitted, with each
audio source using a different coding tool, and a set of instructions for mixdown also transmitted in the
bitstream payload. As the multiple audio objects are received, they are decoded separately, but not played
back to the listener; rather, the instructions for mixdown are used to prepare a single soundtrack from the “raw
material” given in the objects. This final soundtrack is then played for the listener.
An example serves to illustrate the efficacy of this approach. Suppose, for a certain application, we wish to
transmit the sound of a person speaking in a reverberant environment over stereo background music, at very
high quality. A traditional approach to coding would demand the use of a general audio coding at 32
kbit/s/channel or above; the sound source is too complex to be well-modeled by a simple model-based coder.
However, in MPEG-4 we can represent the soundtrack as the conjunction of several objects: a speaking
person passed through a reverberator added to a synthetic music track. We transmit the speaker’s voice using
the CELP tool at 16 kbit/s, the synthetic music using the SA tool at 2 kbit/s, and allow a small amount of
overhead (only a few hundreds of bytes as a fixed cost) to describe the stereo mixdown and the reverberation.
Using MPEG-4 and an object-based approach thus allows us to describe in less than 20 kbit/s total a stream
that might require 64 kbit/s to transmit with traditional coding, at equivalent quality.
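Making the arithmetic of this example explicit (the per-object rates come from the text; the overhead size and clip length are assumptions used only to amortize the fixed cost):

```python
# Object-based total versus a traditional single-stream encoding.
speech_kbps = 16.0          # CELP-coded voice
music_kbps = 2.0            # Structured Audio track
overhead_bytes = 300        # assumed one-off scene/mixdown description
clip_seconds = 60           # assumed clip length for amortization

total_kbps = speech_kbps + music_kbps + overhead_bytes * 8 / 1000 / clip_seconds
print(f"object-based total: {total_kbps:.2f} kbit/s vs. 64 kbit/s traditional")
# -> about 18 kbit/s, i.e. "less than 20 kbit/s total"
```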
Additionally, having such structured soundtrack information present in the decoding terminal allows more
sophisticated client-side interaction to be included. For example, the listener can be allowed (if the content
author desires) to request that the background music be muted. This functionality would not be possible if the
music and speech were coded into the same audio track.
With the MPEG-4 Binary Format for Scenes (BIFS), specified in MPEG-4 Systems, a subset tool called
AudioBIFS allows content authors to describe sound scenes using this object-based framework. Multiple
sources may be mixed and combined, and interactive control provided for their combination. Sample-
resolution control over mixing is provided in this method. Dynamic download of custom signal-processing
routines allows the content author to exactly request a particular, normative, digital filter, reverberator, or other
effects-processing routine. Finally, an interface to terminal-dependent methods of 3-D audio spatialisation is
provided for the description of virtual-reality and other 3-D sound material.
As AudioBIFS is part of the general BIFS specification, the same framework is used to synchronize audio and
video, audio and computer graphics, or audio with other material. Please refer to ISO/IEC 14496-11 (MPEG-4
Scene description and application engine) for more information on AudioBIFS and other topics in audiovisual
synchronization.
0.3.6 Scalability tools
Scalability tools are designed for the creation of bitstream payloads that can be transmitted, without recoding,
at several different bitrates.
Many of the stream types in MPEG-4 are scalable in one manner or another. Several types of scalability in the
standard are discussed below.
Bitrate scalability allows a bitstream payload to be parsed into a bitstream payload of lower bitrate such that
the combination can still be decoded into a meaningful signal. The bitstream payload parsing can occur either
during transmission or in the decoder. Scalability is available within each of the natural audio coding schemes,
or by a combination of different natural audio coding schemes.
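A schematic of this kind of scalability follows: the payload is a base layer plus enhancement layers, and a transmitter or decoder simply drops trailing layers to fit the channel. The layer sizes are illustrative only.

```python
# Truncating a layered payload to a byte budget while keeping it decodable.
def truncate_to_budget(layers, budget_bytes):
    """Keep the base layer and as many whole enhancement layers as fit."""
    kept, used = [], 0
    for i, layer in enumerate(layers):
        if used + len(layer) > budget_bytes and i > 0:
            break                          # enhancement layers are droppable
        kept.append(layer)
        used += len(layer)
    return b"".join(kept)                  # still a decodable payload

payload = [b"B" * 40, b"E" * 20, b"E" * 20]    # base + two enhancements
print(len(truncate_to_budget(payload, 65)))    # -> 60 (base + one layer)
```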
Bandwidth scalability is a particular case of bitrate scalability, whereby part of a bitstream payload
representing a part of the frequency spectrum can be discarded during transmission or decoding. This is
available for the CELP speech coder, where an extension layer converts the narrow band base layer speech
coder into a wide band speech coder. Also, the general audio coding tools, which all operate in the frequency
domain, offer very flexible bandwidth control for the different coding layers.
Encoder complexity scalability allows encoders of different complexity to generate valid and meaningful
bitstream payloads. An example of this is the availability of a high-quality and a low-complexity excitation
module for the wideband CELP coder, allowing a choice between significantly lower encoder complexity and
optimized coding quality.
Decoder complexity scalability allows a given bitstream payload to be decoded by decoders of different levels
of complexity. A subtype of decoder complexity scalability is graceful degradation, in which a decoder
dynamically monitors the resources available, and scales down the decoding complexity (and thus the audio
quality) when resources are limited. The Structured Audio decoder allows this type of scalability; a content
author may provide (for example) several different algorithms for the synthesis of piano sounds, and the
content itself decides, depending on available resources, which one to use.
0.3.7 Upstream
Upstream tools are designed for dynamic control of server streaming, for example for bitrate control and
quality feedback.
The MPEG-4 upstream or backchannel allows a user on the remote side to dynamically control the streaming
of MPEG-4 content from a server. Backchannel streams carry the user interaction information.
0.3.8 Error robustness facilities
0.3.8.1 Overview
Error robustness facilities include tools for error resilience as well as for error protection.
The error robustness facilities provide improved performance on error-prone transmission channels. They
comprise error resilient bitstream payload reordering, a common error protection tool and codec-specific
error resilience tools.
0.3.8.2 Error resilient bitstream payload reordering
Error resilient bitstream payload reordering allows the effective use of advanced channel coding techniques
like unequal error protection (UEP), which can be perfectly adapted to the needs of the different coding tools.
The basic idea is to rearrange the audio frame content depending on its error sensitivity in one or more
instances belonging to different error sensitivity categories (ESC). This rearrangement can be either data
element-wise or even bit-wise. An error resilient bitstream payload frame is built by concatenating these
instances.
Figure 0.1 – Basic principle of error resilient bitstream handling (audio encoder → bitstream formatter → channel coding → channel → channel decoding → bitstream unformatter → audio decoder)
The basic principle is depicted in Figure 0.1. A bitstream payload is reordered according to the error sensitivity
of single bitstream payload elements or even single bits. This newly arranged bitstream payload is channel
coded, transmitted and channel decoded. Prior to audio decoding, the bitstream payload is rearranged to its
original order.
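The reordering and its inverse can be sketched as follows. The frame elements and their class assignments are invented for the illustration; a real implementation works at the bit level against the normative syntax.

```python
# Reorder payload elements by error sensitivity class and restore them.
FRAME = [("window_seq", 0), ("scalefactors", 1),    # (element, ESC class)
         ("global_gain", 0), ("spectral_data", 2)]

def reorder(frame):
    """Concatenate per-class instances: ESC0 first, then ESC1, ..."""
    return sorted(frame, key=lambda e: e[1])

def restore(reordered, original_layout=FRAME):
    """Invert the reordering using the known frame layout."""
    by_name = {name: (name, esc) for name, esc in reordered}
    return [by_name[name] for name, _ in original_layout]

sent = reorder(FRAME)                 # what goes to the channel coder
assert restore(sent) == FRAME         # back in codec order before decoding
```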
0.3.8.3 Error protection
The EP tool provides unequal error protection. It receives several classes of bits from the audio coding tools,
and then applies forward error correction codes (FEC) and/or cyclic redundancy codes (CRC) for each class,
according to its error sensitivity.
The error protection tool (EP tool) provides the unequal error protection (UEP) capability to the set of ISO/IEC
14496-3 codecs. Ma
...


ISO/IEC 14496-3:2019 covers the coding of audio-visual objects and delivers state-of-the-art technology by integrating many different audio coding approaches. Its scope is very broad, spanning natural and synthetic sound, low-bitrate and high-quality delivery, speech and music, complex and simple soundtracks, and traditional as well as interactive and virtual-reality content. Its strength lies in standardizing sophisticated coding tools individually while providing an innovative, flexible framework for audio synchronization, mixing, and downloaded post-production. Notably, it does not target a single application such as real-time telephony or high-quality audio compression, but applies to any application requiring advanced sound compression, synthesis, manipulation, or playback. Integrated with the rest of the ISO/IEC 14496 series, its coding tools open up possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and new media formats, laying the groundwork for richer, more immersive experiences in modern digital media. In short, ISO/IEC 14496-3:2019 addresses every aspect of audio coding and stands as a key standard meeting a wide range of technical requirements.

ISO/IEC 14496-3:2019 stands out for its integrative approach to audio coding, encompassing a vast range of sound types, from natural to synthetic sound, and from low-bitrate to high-quality delivery. Its scope is particularly interesting because it is not limited to a single application, such as real-time telephony or high-quality audio compression, but applies to a multitude of applications requiring advanced techniques for sound compression, synthesis, manipulation, or playback. The standard puts forward innovative, sophisticated coding tools, standardized to offer a new, flexible framework for audio synchronization, mixing, and downloaded post-production. It also covers diverse sound contexts, including speech, music, and both complex and simple soundtracks, while addressing traditional as well as interactive and virtual-reality content. One of its major assets is its ability to integrate with the rest of the ISO/IEC 14496 series, which opens exciting possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other forms of new media. This interconnection makes it possible to explore new creative and technical avenues, confirming the standard's relevance in a constantly evolving technological landscape. In summary, as a reference in information technology and the coding of audio-visual objects, ISO/IEC 14496-3:2019 represents a fundamental pillar for the future development of complex and varied audio solutions, consolidating its essential role in the industry.

The ISO/IEC 14496-3:2019 standard offers a comprehensive framework for coding audio-visual objects, specifically focusing on the audio component. The scope of this standard is particularly noteworthy as it addresses a wide variety of audio types, including natural sounds, synthetic sounds, speech, music, and the intricacies of soundtracks that can range from complex to simple. This versatility illustrates the standard’s strength in catering to diverse applications, from lower bitrate delivery to high-quality audio compression. A significant advantage of ISO/IEC 14496-3:2019 is its holistic approach, as it does not limit itself to specific applications like real-time telephony, making it highly relevant for any application needing advanced sound compression, synthesis, manipulation, or playback. The standard facilitates a flexible framework for audio synchronization and mixing, promoting innovation in audio engineering and production. The integration of sophisticated coding tools within this standard exemplifies its forward-thinking design, allowing it to adapt seamlessly within the broader ISO/IEC 14496 series. This interconnectedness opens new avenues for object-based audio coding, enabling dynamic soundtracks and enhancing interactive presentations. Such capabilities are increasingly essential in contemporary media environments, where immersive experiences are paramount. Overall, the ISO/IEC 14496-3:2019 standard remarkably aligns with the demands of modern audio production and consumption, making it a pivotal reference for professionals in the field. Its comprehensive coverage and innovative coding tools uniquely position it within the evolving landscape of audio-visual technologies.

ISO/IEC 14496-3:2019 offers a comprehensive, integrative solution for the coding of audio-visual objects, covering a wide variety of applications and technologies. The standard is particularly relevant because it integrates many different forms of audio coding, including the combination of natural and synthetic sound, low-bitrate transmission, and high-quality audio output. It accommodates both speech and music, complex soundscapes and simple audio material, which underlines its versatility. Its strengths lie in standardizing advanced sound compression while providing a flexible framework for audio synchronization, mixing, and post-production, with particular flexibility for integration into interactive and virtual-reality content. By standardizing a range of sophisticated coding tools, the standard enables an innovative approach to object-based audio coding and the design of dynamic soundtracks. Moreover, the integration of the defined tools with the entire ISO/IEC 14496 series opens up many new possibilities relevant to interactive presentations and new media formats. As a result, the standard is highly relevant not just for a specific application such as real-time telephony or high-quality audio compression, but for all applications requiring advanced audio processing. Overall, ISO/IEC 14496-3:2019 is a groundbreaking document that lays the foundation for the future development and application of audio-visual technologies, and it can be regarded as an indispensable tool for professionals in the industry.

ISO/IEC 14496-3:2019 is a document aimed at standardizing audio representation in information technology and has contributed to recent advances in audio coding. It integrates diverse audio coding methods, covering natural and synthetic sound, low-bitrate and high-quality delivery, speech and music, complex soundtracks and simple audio, and traditional as well as interactive and virtual-reality content. The standard's strength lies in standardizing specialized coding tools individually, providing a new framework for audio synchronization, mixing, and downloaded post-production. It addresses the requirements of advanced audio compression, synthesis, manipulation, and playback across a variety of applications, rather than being limited to a specific field such as real-time telephony or high-quality audio compression. ISO/IEC 14496-3:2019 is broadly applicable to any application where audio is required, and through integration with the other standards in the same series, its advanced coding tools enable object-based audio coding, interactive presentations, dynamic soundtracks, and other new media. It is clear that this standard plays a very important role in the innovation of audio technology.