ISO/IEC 14496-1:2010
Information technology — Coding of audio-visual objects — Part 1: Systems
ISO/IEC 14496-1:2010 specifies system level functionalities for the communication of interactive audio-visual scenes, i.e. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content).
INTERNATIONAL STANDARD ISO/IEC 14496-1
Fourth edition
2010-06-01

Information technology — Coding of audio-visual objects —
Part 1:
Systems

Reference number: ISO/IEC 14496-1:2010(E)
© ISO/IEC 2010
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents

Foreword ... iv
0 Introduction ... vi
1 Scope ... 1
2 Normative references ... 1
3 Additional references ... 2
4 Terms and definitions ... 2
5 Abbreviated terms ... 10
6 Conventions ... 11
7 Streaming Framework ... 11
8 Syntactic Description Language ... 99
9 Profiles ... 110
Annex A (informative) Time Base Reconstruction ... 112
Annex B (informative) Registration procedure ... 115
Annex C (informative) The QoS Management Model for ISO/IEC 14496 Content ... 119
Annex D (informative) Conversion Between Time and Date Conventions ... 120
Annex E (informative) Graphical Representation of Object Descriptor and Sync Layer Syntax ... 122
Annex F (informative) Elementary Stream Interface ... 130
Annex G (informative) Upstream Walkthrough ... 132
Annex H (informative) Scene and Object Description Carrousel ... 137
Annex I (normative) Usage of ITU-T Recommendation H.264 | ISO/IEC 14496-10 AVC ... 138
Annex J (informative) Patent statements ... 141
Bibliography ... 144
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
ISO/IEC 14496-1 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fourth edition cancels and replaces the third edition (ISO/IEC 14496-1:2004), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-1:2004/Amd.1:2005,
ISO/IEC 14496-1:2004/Amd.2:2007, ISO/IEC 14496-1:2004/Amd.3:2007 and Technical Corrigenda
ISO/IEC 14496-1:2004/Cor.1:2006 and ISO/IEC 14496-1:2004/Cor.2:2007.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of
audio-visual objects:
⎯ Part 1: Systems
⎯ Part 2: Visual
⎯ Part 3: Audio
⎯ Part 4: Conformance testing
⎯ Part 5: Reference software
⎯ Part 6: Delivery Multimedia Integration Framework (DMIF)
⎯ Part 7: Optimized reference software for coding of audio-visual objects
⎯ Part 8: Carriage of ISO/IEC 14496 contents over IP networks
⎯ Part 9: Reference hardware description
⎯ Part 10: Advanced Video Coding
⎯ Part 11: Scene description and application engine
⎯ Part 12: ISO base media file format
⎯ Part 13: Intellectual Property Management and Protection (IPMP) extensions
⎯ Part 14: MP4 file format
⎯ Part 15: Advanced Video Coding (AVC) file format
⎯ Part 16: Animation Framework eXtension (AFX)
⎯ Part 17: Streaming text format
⎯ Part 18: Font compression and streaming
⎯ Part 19: Synthesized texture stream
⎯ Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
⎯ Part 21: MPEG-J Graphics Framework eXtensions (GFX)
⎯ Part 22: Open Font Format
⎯ Part 23: Symbolic Music Representation
⎯ Part 24: Audio and systems interaction
⎯ Part 25: 3D Graphics Compression Model
⎯ Part 26: Audio conformance
⎯ Part 27: 3D Graphics conformance
0 Introduction
0.1 Overview
ISO/IEC 14496 specifies a system for the communication of interactive audio-visual scenes. This specification
includes the following elements.
a) The coded representation of natural or synthetic, two-dimensional (2D) or three-dimensional (3D) objects
that can be manifested audibly and/or visually (audio-visual objects) (specified in Parts 2, 3, 10, 11, 16,
19, 20, 23 and 25 of ISO/IEC 14496).
b) The coded representation of the spatio-temporal positioning of audio-visual objects as well as their
behavior in response to interaction (scene description, specified in Parts 11 and 20 of ISO/IEC 14496).
c) The coded representation of information related to the management of data streams (synchronization,
identification, description and association of stream content, specified in this Part and in Part 24 of
ISO/IEC 14496).
d) A generic interface to the data stream delivery layer functionality (specified in Part 6 of ISO/IEC 14496).
e) An application engine for programmatic control of the player: the format and delivery of downloadable Java byte
code, as well as its execution lifecycle and behavior through APIs (specified in Parts 11 and 21 of
ISO/IEC 14496).
f) A file format to contain the media information of an ISO/IEC 14496 presentation in a flexible, extensible
format to facilitate interchange, management, editing, and presentation of the media, specified in Part 12
(ISO File Format), Part 14 (MP4 File Format) and Part 15 (AVC File Format) of ISO/IEC 14496.
g) The coded representation of font data and of information related to the management of text streams and
font data streams (specified in Parts 17, 18 and 22 of ISO/IEC 14496).
The overall operation of a system communicating audio-visual scenes can be paraphrased as follows:
At the sending terminal, the audio-visual scene information is compressed, supplemented with
synchronization information and passed to a delivery layer that multiplexes it into one or more coded binary
streams that are transmitted or stored. At the receiving terminal, these streams are demultiplexed and
decompressed. The audio-visual objects are composed according to the scene description and
synchronization information and presented to the end user. The end user may have the option to interact with
this presentation. Interaction information can be processed locally or transmitted back to the sending terminal.
ISO/IEC 14496 defines the syntax and semantics of the bitstreams that convey such scene information, as
well as the details of their decoding processes.
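As an informative illustration only, the following Java sketch paraphrases the receiver-side flow just described: demultiplexing, stream-specific decoding, then composition and presentation. All class and method names are hypothetical assumptions; ISO/IEC 14496 specifies bitstream syntax and terminal behavior, not a programming API.

    // Hypothetical sketch of the receiving-terminal flow; names are illustrative,
    // not defined by ISO/IEC 14496.
    import java.util.ArrayList;
    import java.util.List;

    public class ReceiverPipeline {

        record ElementaryStream(int id, byte[] payload) {}   // one demultiplexed stream
        record DecodedObject(int id) {}                      // one decoded AV object

        // Demultiplex the delivery-layer byte stream into elementary streams.
        static List<ElementaryStream> demultiplex(byte[] mux) {
            List<ElementaryStream> out = new ArrayList<>();
            // ... parsing of the concrete multiplex (e.g. M4Mux) would go here ...
            return out;
        }

        // Decode each stream with its stream-specific decoder (audio, visual, ...).
        static DecodedObject decode(ElementaryStream es) {
            return new DecodedObject(es.id());
        }

        // Compose decoded objects according to the scene description, then present.
        static void composeAndPresent(List<DecodedObject> objects) {
            objects.forEach(o -> System.out.println("presenting object " + o.id()));
        }

        public static void main(String[] args) {
            List<DecodedObject> decoded = new ArrayList<>();
            for (ElementaryStream es : demultiplex(new byte[0])) decoded.add(decode(es));
            composeAndPresent(decoded);
        }
    }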
This part of ISO/IEC 14496 specifies the following tools.
⎯ A terminal model for time and buffer management.
⎯ A coded representation of metadata for the identification, description and logical dependencies of the
elementary streams (object descriptors and other descriptors).
⎯ A coded representation of descriptive audio-visual content information [object content information (OCI)].
⎯ An interface to intellectual property management and protection (IPMP) systems.
⎯ A coded representation of synchronization information (sync layer – SL).
⎯ A multiplexed representation of individual elementary streams in a single stream (M4Mux).
These various elements are described functionally in this clause and specified in the normative clauses that
follow.
0.2 Architecture
The information representation specified in ISO/IEC 14496 describes the means to create an interactive
audio-visual scene in terms of coded audio-visual information and associated scene description information.
The entity that composes and sends, or receives and presents such a coded representation of an interactive
audio-visual scene is generically referred to as an “audio-visual terminal” or just “terminal”. This terminal may
correspond to a stand-alone application or be part of an application system.
[Figure 1 (diagram): from top to bottom, display and user interaction with the interactive audio-visual scene; composition and rendering; the compression layer (scene description information, AV object data, object descriptor information), which exchanges elementary streams over the elementary stream interface; the sync layer (SL), which exchanges SL-packetized streams over the DMIF Application Interface; and the delivery layer (M4Mux, MPEG-2 TS/PES, RTP/UDP/IP, AAL2/ATM, H223/PSTN, DAB mux, ...), which carries the multiplexed streams over the transmission/storage medium. An upstream information path runs in the reverse direction.]

Figure 1 — The ISO/IEC 14496 Terminal Architecture
The basic operations performed by such a receiver terminal are as follows. Information that allows access to
content complying with ISO/IEC 14496 is provided as initial session set up information to the terminal. Part 6
of ISO/IEC 14496 defines the procedures for establishing such session contexts as well as the interface to the
delivery layer that generically abstracts the storage or transport medium. The initial set up information allows
the terminal, in a recursive manner, to locate one or more elementary streams that are part of the coded content
representation. Some of these elementary streams may be grouped together using the multiplexing tool
described in ISO/IEC 14496-1.
Elementary streams contain the coded representation of audio or visual data, scene description information,
user interaction data, or text or font data. Elementary streams may themselves also convey information to
identify streams, to describe logical dependencies between streams, or to describe information related to the
content of the streams. Each elementary stream contains only one type of data.
Elementary streams are decoded using their respective stream-specific decoders. The audio-visual objects
are composed according to the scene description information and presented by the terminal's presentation
device(s). All these processes are synchronized according to the systems decoder model (SDM) using the
synchronization information provided at the synchronization layer.
These basic operations are depicted in Figure 1, and are described in more detail below.
0.3 Terminal Model: Systems Decoder Model
The systems decoder model provides an abstract view of the behavior of a terminal complying with
ISO/IEC 14496-1. Its purpose is to enable a sending terminal to predict how the receiving terminal will behave
in terms of buffer management and synchronization when reconstructing the audio-visual information that
comprises the presentation. The systems decoder model includes a systems timing model and a systems
buffer model, which are described briefly in the following subclauses.
0.3.1 Timing Model
The timing model defines the mechanisms through which a receiving terminal establishes a notion of time that
enables it to process time-dependent events. This model also allows the receiving terminal to establish
mechanisms to maintain synchronization both across and within particular audio-visual objects as well as with
user interaction events. In order to facilitate these functions at the receiving terminal, the timing model
requires that the transmitted data streams contain implicit or explicit timing information. Two sets of timing
information are defined in ISO/IEC 14496-1: clock references and time stamps. The former convey the
sending terminal's time base to the receiving terminal, while the latter convey a notion of relative time for
specific events such as the desired decoding or composition time for portions of the encoded audio-visual
information.
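As an informative sketch of how these two kinds of timing information interact, a receiver might recover the sender's object time base (OTB) from clock references and then test time stamps against it. All field and method names below are illustrative assumptions, not the normative syntax.

    // Informative sketch: recovering the sender's time base from object clock
    // references (OCR) and checking decoding time stamps (DTS) against it.
    public class TimingModel {
        private long lastOcr;           // last clock reference, in OTB ticks
        private long arrivalAtLastOcr;  // local time (nanoseconds) when it arrived
        private final long ticksPerSecond;

        TimingModel(long ticksPerSecond) { this.ticksPerSecond = ticksPerSecond; }

        void onClockReference(long ocr, long localArrivalNanos) {
            lastOcr = ocr;
            arrivalAtLastOcr = localArrivalNanos;
        }

        // Current estimate of the sender's time base, extrapolated from the last OCR.
        long estimatedOtbTicks(long localNowNanos) {
            long elapsed = localNowNanos - arrivalAtLastOcr;
            return lastOcr + elapsed * ticksPerSecond / 1_000_000_000L;
        }

        // An access unit whose DTS is not in the future is due for decoding.
        boolean dueForDecoding(long dts, long localNowNanos) {
            return dts <= estimatedOtbTicks(localNowNanos);
        }
    }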
0.3.2 Buffer Model
The buffer model enables the sending terminal to monitor and control the buffer resources that are needed to
decode each elementary stream in a presentation. The required buffer resources are conveyed to the
receiving terminal by means of descriptors at the beginning of the presentation. The terminal can then decide
whether or not it is capable of handling this particular presentation. The buffer model allows the sending
terminal to specify when information may be removed from these buffers and enables it to schedule data
transmission so that the appropriate buffers at the receiving terminal do not overflow or underflow.
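The following Java sketch (informative; the class shape and names are assumptions) shows the bookkeeping implied by this model: data entering a decoding buffer must never exceed the signalled capacity, and an access unit can be removed only if it has fully arrived.

    // Informative sketch of decoding-buffer bookkeeping. The sender can run the
    // same model to schedule transmission without overflow or underflow.
    public class DecodingBuffer {
        private final int capacity;   // signalled to the receiver in a descriptor
        private int occupancy;        // bytes currently buffered

        DecodingBuffer(int capacity) { this.capacity = capacity; }

        // Data arriving from the sync layer; overflow violates the buffer model.
        void receive(int bytes) {
            if (occupancy + bytes > capacity)
                throw new IllegalStateException("buffer model violated: overflow");
            occupancy += bytes;
        }

        // At its decoding time, an access unit is removed from the buffer.
        void decodeAccessUnit(int bytes) {
            if (bytes > occupancy)
                throw new IllegalStateException("buffer model violated: underflow");
            occupancy -= bytes;
        }

        int freeSpace() { return capacity - occupancy; }
    }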
0.4 Multiplexing of Streams: The Delivery Layer
The term delivery layer is used as a generic abstraction of any existing transport protocol stack that may be
used to transmit and/or store content complying with ISO/IEC 14496. The functionality of this layer is not
within the scope of ISO/IEC 14496-1, and only the interface to this layer is considered. This interface is the
DMIF Application Interface (DAI) specified in ISO/IEC 14496-6. The DAI defines not only an interface for the
delivery of streaming data, but also for signaling information required for session and channel set up as well
as tear down. A wide variety of delivery mechanisms exist below this interface, with some of them indicated in
Figure 1. These mechanisms serve for transmission as well as storage of streaming data, i.e., a file is
considered to be a particular instance of a delivery layer. For applications where the desired transport facility
does not fully address the needs of a service according to the specifications in ISO/IEC 14496, a simple
multiplexing tool (M4Mux) with low delay and low overhead is defined in ISO/IEC 14496-1.
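The normative M4Mux syntax is specified in the streaming framework clause of this part of ISO/IEC 14496. Purely as an illustration of why such a tool can have low overhead, a minimal channel-tagged multiplex could look like the following toy format; this is an assumption for illustration, not the actual M4Mux syntax.

    // Toy multiplex, NOT the normative M4Mux syntax: each SL packet is tagged
    // with a channel number and a length, then interleaved into one stream.
    import java.io.ByteArrayOutputStream;

    public class SimpleMux {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();

        // One multiplexed unit: [channel (1 byte)][length (2 bytes, big-endian)][payload].
        void writePacket(int channel, byte[] slPacket) {
            out.write(channel & 0xFF);
            out.write((slPacket.length >> 8) & 0xFF);
            out.write(slPacket.length & 0xFF);
            out.writeBytes(slPacket);   // payload must be < 65536 bytes in this toy format
        }

        byte[] toBytes() { return out.toByteArray(); }
    }

Three bytes of header per packet is the kind of overhead such a scheme implies; the normative tool additionally addresses framing, configuration and error resilience.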
0.5 Synchronization of Streams: The Sync Layer
Elementary streams are the basic abstraction for any streaming data source. Elementary streams are
conveyed as sync layer-packetized (SL-packetized) streams at the DMIF Application Interface. This
packetized representation additionally provides timing and synchronization information, as well as
fragmentation and random access information. The sync layer (SL) extracts this timing information to enable
synchronized decoding and, subsequently, composition of the elementary stream data.
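Informally, an SL packet can be pictured as a small header plus a payload fragment. The following record is a simplified, non-normative subset of the SL packet header information; the actual header layout is configurable per elementary stream.

    // Simplified, non-normative view of an SL packet; field names are illustrative.
    public record SLPacket(
            boolean accessUnitStart,    // first fragment of an access unit
            boolean randomAccessPoint,  // decoding may (re)start here
            long decodingTimeStamp,     // DTS in object time base ticks, if present
            long compositionTimeStamp,  // CTS in object time base ticks, if present
            byte[] payload) {}          // fragment of the elementary stream data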
0.6 The Compression Layer
The compression layer receives data in its encoded format and performs the necessary operations to decode
this data. The decoded information is then used by the terminal's composition, rendering and presentation
subsystems.
0.6.1 Object Description Framework
The purpose of the object description framework is to identify and describe elementary streams and to
associate them appropriately to an audio-visual scene description. Object descriptors serve to gain access to
ISO/IEC 14496 content. Object content information and the interface to intellectual property management and
protection systems are also part of this framework.
An object descriptor is a collection of one or more elementary stream descriptors that provide the
configuration and other information for the streams that relate to either an audio-visual object, or text or font
data, or a scene description. Object descriptors are themselves conveyed in elementary streams. Each object
descriptor is assigned an identifier (object descriptor ID), which is unique within a defined name scope. This
identifier is used to associate audio-visual objects in the scene description with a particular object descriptor,
and thus the elementary streams related to that particular object.
Elementary stream descriptors include information about the source of the stream data, in the form of a unique
numeric identifier (the elementary stream ID) or a URL pointing to a remote source for the stream. Elementary
stream descriptors also include information about the encoding format, configuration information for the
decoding process and the sync layer packetization, as well as quality of service requirements for the
transmission of the stream and intellectual property identification. Dependencies between streams can also be
signaled within the elementary stream descriptors. This functionality may be used, for example, in scalable
audio or visual object representations to indicate the logical dependency of a stream containing enhancement
information on a stream containing the base information. It can also be used to describe alternative
representations for the same content (e.g. the same speech content in various languages).
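The linkage described in the two preceding paragraphs can be summarized by the following informative data-structure sketch; the normative descriptors carry many more fields, and all names here are illustrative assumptions.

    import java.util.List;
    import java.util.Optional;

    // An ES descriptor identifies a stream and its decoder configuration, and can
    // signal a dependency on another stream or point to a remote source.
    record ESDescriptor(
            int esId,                         // unique elementary stream ID
            Optional<Integer> dependsOnEsId,  // e.g. enhancement layer -> base layer
            Optional<String> url,             // remote source instead of local data
            String decoderConfig) {}          // encoding format / decoder setup

    // An object descriptor groups the ES descriptors of the streams that belong
    // to one audio-visual object, text or font data, or a scene description.
    record ObjectDescriptor(
            int objectDescriptorId,           // unique within its name scope
            List<ESDescriptor> esDescriptors) {}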
0.6.1.1 Intellectual Property Management and Protection
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists of
a set of tools that permits an ISO/IEC 14496 terminal to support IPMP functionality. This functionality is
provided by two complementary technologies, supporting different levels of interoperability.
a) The IPMP framework as defined in 7.2.3, consists of a normative interface that permits an ISO/IEC 14496
terminal to host one or more IPMP Systems. The IPMP interface consists of IPMP elementary streams
and IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream. IPMP
elementary streams carry time variant IPMP information that can be associated to multiple object
descriptors. The IPMP System itself is a non-normative component that provides intellectual property
management and protection functions for the terminal. The IPMP System uses the information carried by
the IPMP elementary streams and descriptors to make protected ISO/IEC 14496 content available to the
terminal.
b) The IPMP framework extension, as specified in ISO/IEC 14496-13 allows, in addition to the functionality
specified in ISO/IEC 14496-1, a finer granularity of governance. ISO/IEC 14496-13 provides normative
support for individual IPMP components, referred to as IPMP Tools, to be normatively placed at identified
points of control within the terminal systems model. Additionally, ISO/IEC 14496-13 provides normative
support for secure communications to be performed between IPMP Tools. ISO/IEC 14496-1 also
specifies specific normative extensions at the Systems level to support the IPMP functionality described
in ISO/IEC 14496-13.
An application may choose not to use an IPMP System, thereby offering no management and protection
features.
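As an informative sketch of the division of labour described above (all names are illustrative assumptions): the carriage of IPMP information is normative, while the component that interprets it is supplied with the terminal.

    // IPMP descriptors are carried normatively (in object descriptor streams);
    // how a protected stream is actually made available is left to a
    // non-normative IPMP System or Tool. Names below are illustrative.
    record IPMPDescriptor(int ipmpDescriptorId, int ipmpSystemType, byte[] data) {}

    interface IPMPSystem {
        // Supplied with the terminal; uses descriptor data (and IPMP elementary
        // streams, not shown) to make protected content available.
        byte[] unprotect(byte[] protectedAccessUnit, IPMPDescriptor descriptor);
    }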
0.6.1.2 Object Content Information
Object content information (OCI) descriptors convey descriptive information about audio-visual objects. The
main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors,
language descriptors, textual descriptors, and descriptors about the creation of the content. OCI descriptors
can be included directly in the related object descriptor or elementary stream descriptor or, if the information
is time variant, carried in an elementary stream of its own. An OCI stream is organized as a sequence of small,
synchronized entities called events, each of which contains a set of OCI descriptors. OCI streams can be
associated with multiple object descriptors.
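An informative sketch of the event organization just described follows; the fields shown are an illustrative subset, not the normative descriptor syntax.

    import java.util.List;

    // One OCI event: a small, synchronized entity carrying a set of descriptors
    // that become valid at a given time. Field names are illustrative.
    record OCIEvent(
            long startTime,                  // when this description becomes valid
            long duration,
            List<String> keywords,           // keyword descriptors
            String language,                 // language descriptor
            String contentClassification) {} // content classification descriptor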
0.6.2 Scene Description Streams
Scene description addresses the organization of audio-visual objects in a scene, in terms of both spatial and
temporal attributes. This information allows the composition and rendering of individual audio-visual objects
after the respective decoders have reconstructed the streaming data for them. For visual data,
ISO/IEC 14496-11 does not mandate particular composition algorithms. Hence, visual composition is
implementation dependent. For audio data, the composition process is defined in a normative manner in
ISO/IEC 14496-11 and ISO/IEC 14496-3.
The scene description is represented using a parametric approach (BIFS - Binary Format for Scenes). The
description consists of an encoded hierarchy (tree) of nodes with attributes and other information (including
event sources and targets). Leaf nodes in this tree correspond to elementary audio-visual data, whereas
intermediate nodes group this material to form audio-visual objects, and perform grouping, transformation, and
other such operations on audio-visual objects (scene description nodes). The scene description can evolve
over time by using scene description updates.
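The tree structure and its updates can be illustrated by the following non-normative Java model; BIFS itself is a binary encoding of such a hierarchy, and the node names here are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of a scene description tree: grouping nodes contain children;
    // leaf nodes reference elementary audio-visual data.
    class SceneNode {
        final String type;   // e.g. "Transform2D", "AudioSource" (illustrative)
        final List<SceneNode> children = new ArrayList<>();
        SceneNode(String type) { this.type = type; }
    }

    class Scene {
        final SceneNode root = new SceneNode("Group");

        // A scene description update: insert a node under an existing parent,
        // letting the scene evolve over time.
        void insertNode(SceneNode parent, SceneNode child) {
            parent.children.add(child);
        }
    }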
In order to facilitate active user involvement with the presented audio-visual information, ISO/IEC 14496-11
provides support for user and object interactions. Interactivity mechanisms are integrated with the scene
description information, in the form of linked event sources and targets (routes) as well as sensors (special
nodes that can trigger events based on specific conditions). These event sources and targets are part of
scene description nodes, and thus allow close coupling of dynamic and interactive behavior with the specific
scene at hand. ISO/IEC 14496-11, however, does not specify a particular user interface or a mechanism that
maps user actions (e.g., keyboard key presses or mouse movements) to such events.
Such an interactive environment may not need an upstream channel, but ISO/IEC 14496 also provides means
for client-server interactive sessions with the ability to set up upstream elementary streams and associate
them to specific downstream elementary streams.
0.6.3 Audio-visual Streams
The coded representations of audio and visual information are described in ISO/IEC 14496-3 (Audio),
ISO/IEC 14496-2 (Visual) and ISO/IEC 14496-10 (Advanced Video Coding). The reconstructed
audio-visual data are made available to the composition process for potential use during the scene rendering.
0.6.4 Upchannel Streams
Downchannel elementary streams may require upchannel information to be transmitted from the receiving
terminal to the sending terminal (e.g., to allow for client-server interactivity). Figure 1 indicates the flowpath for
an elementary stream from the receiving terminal to the sending terminal. The content of upchannel streams
is specified in the same part of the specification that defines the content of the downstream data. For example,
upchannel control streams for video downchannel elementary streams are defined in ISO/IEC 14496-2.
0.6.5 Interaction Streams
The coded representation of user interaction information is outside the scope of ISO/IEC 14496. This
information shall, however, be translated into scene modifications, and the modifications made available to the
composition process for potential use during scene rendering.
0.6.6 Text and Font data Streams
Scene description often contains information presented in textual format. The audio-visual data encoded in the
scene may also be accompanied by supplemental text information such as subtitles. In order to enable time-
based updates of text data and to ensure the intended text appearance and layout, both elementary streams carrying
timed text information and font data are used. The coded representation of the timed text stream is described
in ISO/IEC 14496-17. The font data format and encoded representation of font data stream are described in
ISO/IEC 14496-18 (font data stream) and ISO/IEC 14496-22 (font data format).
0.7 Application Engine
MPEG-J is a programmatic system (as opposed to a conventional parametric system) that specifies
API(s) for interoperation of MPEG-4 media players with Java code. By combining MPEG-4 media and safe
executable code, content creators may embed complex control and data processing mechanisms with their
media data to intelligently manage the operation of the audio-visual session. The parametric MPEG-4 System
forms the Presentation Engine while the MPEG-J subsystem controlling the Presentation Engine forms the
Application Engine.
The Java application is delivered as a separate elementary stream to the MPEG-4 terminal. There it will be
directed to the MPEG-J run-time environment, from which the MPEG-J program has access to the
various components and required data of the MPEG-4 player to control it.
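The shape of such a program can be suggested by the following non-normative sketch. The real MPEG-J interfaces are specified in ISO/IEC 14496-11; the names below are illustrative stand-ins, not the normative API.

    // Illustrative only: an applet-like MPEG-J entry point receiving access to a
    // scene-graph interface. The normative API names differ (see ISO/IEC 14496-11).
    public class MpegJSketch {

        interface SceneGraph {   // stand-in for the scene graph API
            void setNodeField(int nodeId, String fieldName, Object value);
        }

        // Byte code arrives in its own elementary stream; the terminal's MPEG-J
        // run-time environment then invokes the application with player access.
        static void run(SceneGraph scene) {
            scene.setNodeField(42, "translation", new float[] { 0f, 0f });
        }
    }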
In addition to the basic packages of the language (java.lang, java.io, java.util), a few categories of APIs have
been defined for different scopes. For the Scene graph API the objective is to provide access to the scene
graph specified in ISO/IEC 14496-11: to inspect the graph, to alter nodes and their fields, and to add and
remove nodes within the graph. The Resource API is used for regulation of performance: it provides a
centralized facility for managing resources. This is used when the program execution is contingent upon the
terminal configuration and
...