ISO/IEC 14496-1:2010
Information technology — Coding of audio-visual objects — Part 1: Systems
ISO/IEC 14496-1:2010 specifies system level functionalities for the communication of interactive audio-visual scenes, i.e. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content).
INTERNATIONAL STANDARD ISO/IEC 14496-1
Fourth edition
2010-06-01
Information technology — Coding of
audio-visual objects —
Part 1:
Systems
Technologies de l'information — Codage des objets audiovisuels —
Partie 1: Systèmes
Reference number: ISO/IEC 14496-1:2010(E)
© ISO/IEC 2010
© ISO/IEC 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
0 Introduction
1 Scope
2 Normative references
3 Additional references
4 Terms and definitions
5 Abbreviated terms
6 Conventions
7 Streaming Framework
8 Syntactic Description Language
9 Profiles
Annex A (informative) Time Base Reconstruction
Annex B (informative) Registration procedure
Annex C (informative) The QoS Management Model for ISO/IEC 14496 Content
Annex D (informative) Conversion Between Time and Date Conventions
Annex E (informative) Graphical Representation of Object Descriptor and Sync Layer Syntax
Annex F (informative) Elementary Stream Interface
Annex G (informative) Upstream Walkthrough
Annex H (informative) Scene and Object Description Carrousel
Annex I (normative) Usage of ITU-T Recommendation H.264 | ISO/IEC 14496-10 AVC
Annex J (informative) Patent statements
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
ISO/IEC 14496-1 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fourth edition cancels and replaces the third edition (ISO/IEC 14496-1:2004), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-1:2004/Amd.1:2005,
ISO/IEC 14496-1:2004/Amd.2:2007, ISO/IEC 14496-1:2004/Amd.3:2007 and Technical Corrigenda
ISO/IEC 14496-1:2004/Cor.1:2006 and ISO/IEC 14496-1:2004/Cor.2:2007.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of
audio-visual objects:
⎯ Part 1: Systems
⎯ Part 2: Visual
⎯ Part 3: Audio
⎯ Part 4: Conformance testing
⎯ Part 5: Reference software
⎯ Part 6: Delivery Multimedia Integration Framework (DMIF)
⎯ Part 7: Optimized reference software for coding of audio-visual objects
⎯ Part 8: Carriage of ISO/IEC 14496 contents over IP networks
⎯ Part 9: Reference hardware description
⎯ Part 10: Advanced Video Coding
⎯ Part 11: Scene description and application engine
⎯ Part 12: ISO base media file format
⎯ Part 13: Intellectual Property Management and Protection (IPMP) extensions
⎯ Part 14: MP4 file format
⎯ Part 15: Advanced Video Coding (AVC) file format
⎯ Part 16: Animation Framework eXtension (AFX)
⎯ Part 17: Streaming text format
⎯ Part 18: Font compression and streaming
⎯ Part 19: Synthesized texture stream
⎯ Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
⎯ Part 21: MPEG-J Graphics Framework eXtensions (GFX)
⎯ Part 22: Open Font Format
⎯ Part 23: Symbolic Music Representation
⎯ Part 24: Audio and systems interaction
⎯ Part 25: 3D Graphics Compression Model
⎯ Part 26: Audio conformance
⎯ Part 27: 3D Graphics conformance
0 Introduction
0.1 Overview
ISO/IEC 14496 specifies a system for the communication of interactive audio-visual scenes. This specification
includes the following elements.
a) The coded representation of natural or synthetic, two-dimensional (2D) or three-dimensional (3D) objects
that can be manifested audibly and/or visually (audio-visual objects) (specified in Parts 2, 3, 10, 11, 16,
19, 20, 23 and 25 of ISO/IEC 14496).
b) The coded representation of the spatio-temporal positioning of audio-visual objects as well as their
behavior in response to interaction (scene description, specified in Parts 11 and 20 of ISO/IEC 14496).
c) The coded representation of information related to the management of data streams (synchronization,
identification, description and association of stream content, specified in this Part and in Part 24 of
ISO/IEC 14496).
d) A generic interface to the data stream delivery layer functionality (specified in Part 6 of ISO/IEC 14496).
e) An application engine for programmatic control of the player: format, delivery of downloadable Java byte
code as well as its execution lifecycle and behavior through APIs (specified in Parts 11 and 21 of
ISO/IEC 14496).
f) A file format to contain the media information of an ISO/IEC 14496 presentation in a flexible, extensible
format to facilitate interchange, management, editing, and presentation of the media specified in Part 12
(ISO File Format), Part 14 (MP4 File Format) and Part 15 (AVC File Format) of ISO/IEC 14496.
g) The coded representation of font data and of information related to the management of text streams and
font data streams (specified in Parts 17, 18 and 22 of ISO/IEC 14496).
The overall operation of a system communicating audio-visual scenes can be paraphrased as follows:
At the sending terminal, the audio-visual scene information is compressed, supplemented with
synchronization information and passed to a delivery layer that multiplexes it into one or more coded binary
streams that are transmitted or stored. At the receiving terminal, these streams are demultiplexed and
decompressed. The audio-visual objects are composed according to the scene description and
synchronization information and presented to the end user. The end user may have the option to interact with
this presentation. Interaction information can be processed locally or transmitted back to the sending terminal.
ISO/IEC 14496 defines the syntax and semantics of the bitstreams that convey such scene information, as
well as the details of their decoding processes.
This part of ISO/IEC 14496 specifies the following tools.
⎯ A terminal model for time and buffer management.
⎯ A coded representation of metadata for the identification, description and logical dependencies of the
elementary streams (object descriptors and other descriptors).
⎯ A coded representation of descriptive audio-visual content information [object content information (OCI)].
⎯ An interface to intellectual property management and protection (IPMP) systems.
⎯ A coded representation of synchronization information (sync layer – SL).
⎯ A multiplexed representation of individual elementary streams in a single stream (M4Mux).
These various elements are described functionally in this clause and specified in the normative clauses that
follow.
0.2 Architecture
The information representation specified in ISO/IEC 14496 describes the means to create an interactive
audio-visual scene in terms of coded audio-visual information and associated scene description information.
The entity that composes and sends, or receives and presents such a coded representation of an interactive
audio-visual scene is generically referred to as an “audio-visual terminal” or just “terminal”. This terminal may
correspond to a stand-alone application or be part of an application system.
[Figure 1 is a block diagram of the terminal architecture. From top to bottom: display and user interaction with the interactive audio-visual scene; the compression layer (composition and rendering, scene description information, AV object data and object descriptor information, exchanged as elementary streams across the elementary stream interface); the sync layer (SL-packetized streams, exchanged across the DMIF Application Interface); and the delivery layer (M4Mux streams carried over transport stacks such as MPEG-2 TS/PES, RTP/UDP/IP, AAL2/ATM, H223/PSTN and DAB) down to the transmission/storage medium.]
Figure 1 — The ISO/IEC 14496 Terminal Architecture
The basic operations performed by such a receiver terminal are as follows. Information that allows access to
content complying with ISO/IEC 14496 is provided as initial session set up information to the terminal. Part 6
of ISO/IEC 14496 defines the procedures for establishing such session contexts as well as the interface to the
delivery layer that generically abstracts the storage or transport medium. The initial set up information allows
the terminal, in a recursive manner, to locate one or more elementary streams that are part of the coded content
representation. Some of these elementary streams may be grouped together using the multiplexing tool
described in ISO/IEC 14496-1.
Elementary streams contain the coded representation of either audio or visual data or scene description
information, user interaction data, or text or font data. Elementary streams may themselves convey
information to identify streams, to describe logical dependencies between streams, or to describe information
related to the content of the streams. Each elementary stream contains only one type of data.
Elementary streams are decoded using their respective stream-specific decoders. The audio-visual objects
are composed according to the scene description information and presented by the terminal's presentation
device(s). All these processes are synchronized according to the systems decoder model (SDM) using the
synchronization information provided at the synchronization layer.
These basic operations are depicted in Figure 1, and are described in more detail below.
0.3 Terminal Model: Systems Decoder Model
The systems decoder model provides an abstract view of the behavior of a terminal complying with
ISO/IEC 14496-1. Its purpose is to enable a sending terminal to predict how the receiving terminal will behave
in terms of buffer management and synchronization when reconstructing the audio-visual information that
comprises the presentation. The systems decoder model includes a systems timing model and a systems
buffer model, which are described briefly in the following subclauses.
0.3.1 Timing Model
The timing model defines the mechanisms through which a receiving terminal establishes a notion of time that
enables it to process time-dependent events. This model also allows the receiving terminal to establish
mechanisms to maintain synchronization both across and within particular audio-visual objects as well as with
user interaction events. In order to facilitate these functions at the receiving terminal, the timing model
requires that the transmitted data streams contain implicit or explicit timing information. Two sets of timing
information are defined in ISO/IEC 14496-1: clock references and time stamps. The former convey the
sending terminal's time base to the receiving terminal, while the latter convey a notion of relative time for
specific events such as the desired decoding or composition time for portions of the encoded audio-visual
information.
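Annex A gives an informative treatment of time base reconstruction. As a rough, non-normative illustration of the idea, the sketch below shows how a receiver might map incoming clock references onto its local clock and turn decoding or composition time stamps into local deadlines. All names are hypothetical, and the clock resolution, which is in fact configurable per stream, is left abstract.

```cpp
// Non-normative sketch: rebuilding the sender's time base from clock
// references (OCR) and converting time stamps (DTS/CTS) into local deadlines.
// A real terminal would also smooth clock drift rather than snap the offset.
#include <cstdint>

struct TimeBase {
    int64_t offset = 0;   // estimated sender-clock minus local-clock, in ticks
    bool valid = false;

    // Called when a clock reference arrives; localNow is the receiver's
    // clock sampled at the arrival instant, in the same (assumed) units.
    void onClockReference(int64_t ocr, int64_t localNow) {
        offset = ocr - localNow;
        valid = true;
    }

    // Convert a stream time stamp into a deadline on the local clock.
    int64_t toLocalTime(int64_t timeStamp) const {
        return timeStamp - offset;
    }
};
```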
0.3.2 Buffer Model
The buffer model enables the sending terminal to monitor and control the buffer resources that are needed to
decode each elementary stream in a presentation. The required buffer resources are conveyed to the
receiving terminal by means of descriptors at the beginning of the presentation. The terminal can then decide
whether or not it is capable of handling this particular presentation. The buffer model allows the sending
terminal to specify when information may be removed from these buffers and enables it to schedule data
transmission so that the appropriate buffers at the receiving terminal do not overflow or underflow.
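As a non-normative illustration, the following sketch models the bookkeeping for a single decoding buffer: delivery fills the buffer, and at the decoding time stamp the access unit is removed instantaneously, as the systems decoder model idealizes. The class name and byte-granular accounting are assumptions for illustration only; the sending terminal performs the same bookkeeping to verify that its schedule respects the buffer size it advertised.

```cpp
// Non-normative sketch: occupancy bookkeeping for one decoding buffer.
#include <cstdint>
#include <stdexcept>

class DecodingBufferModel {
public:
    explicit DecodingBufferModel(uint32_t sizeBytes) : size_(sizeBytes) {}

    // An access unit (or fragment of one) arrives in the buffer.
    void deliver(uint32_t bytes) {
        if (fill_ + bytes > size_)
            throw std::runtime_error("buffer overflow: schedule invalid");
        fill_ += bytes;
    }

    // At its decoding time stamp the access unit is removed instantaneously
    // (the systems decoder model idealizes decoding as taking zero time).
    void decode(uint32_t bytes) {
        if (bytes > fill_)
            throw std::runtime_error("buffer underflow: data arrived late");
        fill_ -= bytes;
    }

private:
    uint32_t size_;
    uint32_t fill_ = 0;
};
```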
0.4 Multiplexing of Streams: The Delivery Layer
The term delivery layer is used as a generic abstraction of any existing transport protocol stack that may be
used to transmit and/or store content complying with ISO/IEC 14496. The functionality of this layer is not
within the scope of ISO/IEC 14496-1, and only the interface to this layer is considered. This interface is the
DMIF Application Interface (DAI) specified in ISO/IEC 14496-6. The DAI defines not only an interface for the
delivery of streaming data, but also for signaling information required for session and channel set up as well
as tear down. A wide variety of delivery mechanisms exist below this interface, with some of them indicated in
Figure 1. These mechanisms serve for transmission as well as storage of streaming data, i.e., a file is
considered to be a particular instance of a delivery layer. For applications where the desired transport facility
does not fully address the needs of a service according to the specifications in ISO/IEC 14496, a simple
multiplexing tool (M4Mux) with low delay and low overhead is defined in ISO/IEC 14496-1.
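As a non-normative illustration of such a low-overhead multiplex, the sketch below demultiplexes a stream of simple packets, each tagged with a channel index and a payload length. The two-byte header layout shown is an assumption made for illustration only; the normative M4Mux syntax, including MuxCode mode, is specified in the body of this part of ISO/IEC 14496.

```cpp
// Non-normative sketch: splitting concatenated multiplex packets into
// per-channel payloads, assuming a one-byte index and one-byte length.
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Payload = std::vector<uint8_t>;

std::map<uint8_t, std::vector<Payload>> demux(const uint8_t* data, size_t len) {
    std::map<uint8_t, std::vector<Payload>> channels;
    size_t pos = 0;
    while (pos + 2 <= len) {
        uint8_t index  = data[pos];      // channel number (assumed simple mode)
        uint8_t length = data[pos + 1];  // payload length in bytes
        pos += 2;
        if (pos + length > len) break;   // truncated packet: stop parsing
        channels[index].emplace_back(data + pos, data + pos + length);
        pos += length;
    }
    return channels;
}
```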
0.5 Synchronization of Streams: The Sync Layer
Elementary streams are the basic abstraction for any streaming data source. Elementary streams are
conveyed as sync layer-packetized (SL-packetized) streams at the DMIF Application Interface. This
packetized representation additionally provides timing and synchronization information, as well as
fragmentation and random access information. The sync layer (SL) extracts this timing information to enable
synchronized decoding and, subsequently, composition of the elementary stream data.
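The exact SL packet header syntax is configurable per elementary stream, so no single layout can be shown here. Purely as an illustration, the structure below lists the kind of information a parsed SL packet header can make available to the terminal; the field names and types are hypothetical.

```cpp
// Non-normative sketch: information a parsed SL packet header may carry.
// Which fields are present, and their sizes, are configured per stream.
#include <cstdint>
#include <optional>

struct SLPacketHeader {
    bool accessUnitStart = false;    // first fragment of an access unit
    bool accessUnitEnd = false;      // last fragment of an access unit
    bool randomAccessPoint = false;  // decodable without prior stream history
    std::optional<uint32_t> sequenceNumber;        // for loss detection
    std::optional<uint64_t> objectClockReference;  // OCR, if carried here
    std::optional<uint64_t> decodingTimeStamp;     // DTS
    std::optional<uint64_t> compositionTimeStamp;  // CTS
};
```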
0.6 The Compression Layer
The compression layer receives data in its encoded format and performs the necessary operations to decode
this data. The decoded information is then used by the terminal's composition, rendering and presentation
subsystems.
0.6.1 Object Description Framework
The purpose of the object description framework is to identify and describe elementary streams and to
associate them appropriately to an audio-visual scene description. Object descriptors serve to gain access to
ISO/IEC 14496 content. Object content information and the interface to intellectual property management and
protection systems are also part of this framework.
An object descriptor is a collection of one or more elementary stream descriptors that provide the
configuration and other information for the streams that relate to either an audio-visual object, or text or font
data, or a scene description. Object descriptors are themselves conveyed in elementary streams. Each object
descriptor is assigned an identifier (object descriptor ID), which is unique within a defined name scope. This
identifier is used to associate audio-visual objects in the scene description with a particular object descriptor,
and thus the elementary streams related to that particular object.
Elementary stream descriptors include information about the source of the stream data, in the form of a unique
numeric identifier (the elementary stream ID) or a URL pointing to a remote source for the stream. Elementary
stream descriptors also include information about the encoding format, configuration information for the
decoding process and the sync layer packetization, as well as quality of service requirements for the
transmission of the stream and intellectual property identification. Dependencies between streams can also be
signaled within the elementary stream descriptors. This functionality may be used, for example, in scalable
audio or visual object representations to indicate the logical dependency of a stream containing enhancement
information on a stream containing the base information. It can also be used to describe alternative
representations for the same content (e.g. the same speech content in various languages).
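As a non-normative illustration of this containment, the sketch below models an object descriptor as a collection of elementary stream descriptors. The field names loosely follow the descriptors named above, but the layout shown is not the coded syntax.

```cpp
// Non-normative sketch: the logical containment of descriptors.
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct ESDescriptor {
    uint16_t esId = 0;                        // unique within the name scope
    std::optional<std::string> url;           // remote source, if any
    std::optional<uint16_t> dependsOnEsId;    // e.g. enhancement -> base layer
    uint8_t objectTypeIndication = 0;         // encoding format
    std::vector<uint8_t> decoderSpecificInfo; // decoder configuration
    // ... sync layer configuration, QoS requirements, IPI, language, etc.
};

struct ObjectDescriptor {
    uint16_t objectDescriptorId = 0;          // referenced from the scene
    std::vector<ESDescriptor> esDescriptors;  // streams for this object
};
```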
0.6.1.1 Intellectual Property Management and Protection
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists of
a set of tools that permits an ISO/IEC 14496 terminal to support IPMP functionality. This functionality is
provided by the following two complementary technologies, supporting different levels of
interoperability.
a) The IPMP framework, as defined in 7.2.3, consists of a normative interface that permits an ISO/IEC 14496
terminal to host one or more IPMP Systems. The IPMP interface consists of IPMP elementary streams
and IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream. IPMP
elementary streams carry time variant IPMP information that can be associated with multiple object
descriptors. The IPMP System itself is a non-normative component that provides intellectual property
management and protection functions for the terminal. The IPMP System uses the information carried by
the IPMP elementary streams and descriptors to make protected ISO/IEC 14496 content available to the
terminal.
b) The IPMP framework extension, as specified in ISO/IEC 14496-13 allows, in addition to the functionality
specified in ISO/IEC 14496-1, a finer granularity of governance. ISO/IEC 14496-13 provides normative
support for individual IPMP components, referred to as IPMP Tools, to be normatively placed at identified
points of control within the terminal systems model. Additionally, ISO/IEC 14496-13 provides normative
support for secure communications to be performed between IPMP Tools. ISO/IEC 14496-1 also
specifies specific normative extensions at the Systems level to support the IPMP functionality described
in ISO/IEC 14496-13.
An application may choose not to use an IPMP System, thereby offering no management and protection
features.
0.6.1.2 Object Content Information
Object content information (OCI) descriptors convey descriptive information about audio-visual objects. The
main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors,
language descriptors, textual descriptors, and descriptors about the creation of the content. OCI descriptors
can be included directly in the related object descriptor or elementary stream descriptor or, if they are time
variant, they may be carried in an elementary stream of their own. An OCI stream is organized as a sequence
of small, synchronized entities called events that contain a set of OCI descriptors. OCI streams can be
associated with multiple object descriptors.
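As a non-normative illustration, an OCI stream can be thought of as the following sequence of timed events; the structures and field names are hypothetical.

```cpp
// Non-normative sketch: an OCI stream as a sequence of timed events,
// each carrying a set of descriptive OCI descriptors.
#include <cstdint>
#include <vector>

struct OCIDescriptor {
    uint8_t tag = 0;                // e.g. keyword, rating, language, ...
    std::vector<uint8_t> payload;   // descriptor-specific contents
};

struct OCIEvent {
    uint16_t eventId = 0;
    uint64_t startTime = 0;         // when the description becomes valid
    uint64_t duration = 0;
    std::vector<OCIDescriptor> descriptors;
};
```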
0.6.2 Scene Description Streams
Scene description addresses the organization of audio-visual objects in a scene, in terms of both spatial and
temporal attributes. This information allows the composition and rendering of individual audio-visual objects
after the respective decoders have reconstructed the streaming data for them. For visual data,
ISO/IEC 14496-11 does not mandate particular composition algorithms. Hence, visual composition is
implementation dependent. For audio data, the composition process is defined in a normative manner in
ISO/IEC 14496-11 and ISO/IEC 14496-3.
The scene description is represented using a parametric approach (BIFS - Binary Format for Scenes). The
description consists of an encoded hierarchy (tree) of nodes with attributes and other information (including
event sources and targets). Leaf nodes in this tree correspond to elementary audio-visual data, whereas
intermediate nodes group this material to form audio-visual objects, and perform grouping, transformation, and
other such operations on audio-visual objects (scene description nodes). The scene description can evolve
over time by using scene description updates.
In order to facilitate active user involvement with the presented audio-visual information, ISO/IEC 14496-11
provides support for user and object interactions. Interactivity mechanisms are integrated with the scene
description information, in the form of linked event sources and targets (routes) as well as sensors (special
nodes that can trigger events based on specific conditions). These event sources and targets are part of
scene description nodes, and thus allow close coupling of dynamic and interactive behavior with the specific
scene at hand. ISO/IEC 14496-11, however, does not specify a particular user interface or a mechanism that
maps user actions (e.g., keyboard key presses or mouse movements) to such events.
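As a non-normative illustration of this model, the sketch below represents a scene as a tree of nodes with fields, together with routes that wire event sources to event targets. It mirrors the structure described above; it is not the BIFS coded representation, which is specified in ISO/IEC 14496-11.

```cpp
// Non-normative sketch: a scene tree with routes for interactivity.
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string type;  // e.g. a grouping/transform node or a media leaf node
    std::vector<std::pair<std::string, std::string>> fields;  // name -> value
    std::vector<std::unique_ptr<Node>> children;  // intermediate grouping
};

struct Route {          // connects an event source field to a target field
    Node* sourceNode;   std::string sourceField;
    Node* targetNode;   std::string targetField;
};

struct Scene {
    std::unique_ptr<Node> root;  // leaf nodes reference elementary AV data
    std::vector<Route> routes;   // interactivity wiring (with sensor nodes)
};
```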
Such an interactive environment may not need an upstream channel, but ISO/IEC 14496 also provides means
for client-server interactive sessions with the ability to set up upstream elementary streams and associate
them to specific downstream elementary streams.
0.6.3 Audio-visual Streams
The coded representation of audio and visual information are described in ISO/IEC 14496-3 (Audio) and
ISO/IEC 14496-2 (Visual) and ISO/IEC 14496-10 (Advanced Video Coding) respectively. The reconstructed
audio-visual data are made available to the composition process for potential use during the scene rendering.
0.6.4 Upchannel Streams
Downchannel elementary streams may require upchannel information to be transmitted from the receiving
terminal to the sending terminal (e.g., to allow for client-server interactivity). Figure 1 indicates the flowpath for
an elementary stream from the receiving terminal to the sending terminal. The content of upchannel streams
is specified in the same part of the specification that defines the content of the downstream data. For example,
upchannel control streams for video downchannel elementary streams are defined in ISO/IEC 14496-2.
0.6.5 Interaction Streams
The coded representation of user interaction information is not in the scope of ISO/IEC 14496, but this
information shall be translated into scene modifications, and the modifications made available to the
composition process for potential use during the scene rendering.
0.6.6 Text and Font data Streams
Scene description often contains information presented in textual format. The audio-visual data encoded in the
scene may also be accompanied by supplemental text information such as subtitles. In order to enable time-
based updates of text data and to ensure consistent text appearance and layout, elementary streams carrying both
timed text information and font data are used. The coded representation of the timed text stream is described
in ISO/IEC 14496-17. The font data format and encoded representation of font data stream are described in
ISO/IEC 14496-18 (font data stream) and ISO/IEC 14496-22 (font data format).
0.7 Application Engine
MPEG-J is a programmatic system (as opposed to a conventional parametric system) that specifies
API(s) for interoperation of MPEG-4 media players with Java code. By combining MPEG-4 media and safe
executable code, content creators may embed complex control and data processing mechanisms with their
media data to intelligently manage the operation of the audio-visual session. The parametric MPEG-4 System
forms the Presentation Engine while the MPEG-J subsystem controlling the Presentation Engine forms the
Application Engine.
The Java application is delivered as a separate elementary stream to the MPEG-4 terminal. There it will be
directed to the MPEG-J run time environment, from where the MPEG-J program will have access to the
various components and required data of the MPEG-4 player to control it.
In addition to the basic packages of the language (java.lang, java.io, java.util) a few categories of APIs have
been defined for different scopes. For the Scene graph API the objective is to provide access to the scene
graph specified in ISO/IEC 14496-11: to inspect the graph, to alter nodes and their fields, and to add and
remove nodes within the graph. The Resource API is used for regulation of performance: it provides a
centralized facility for managing resources. This is used when the program execution is contingent upon the
terminal configuration and its capabilities, both static (that do not change during execution) and dynamic.
The Decoder API allows control of the decoders that are present in the terminal. The Net API provides a way to
interact with the network, compliant with the MPEG-4 DMIF Application Interface. Complex applications
and enhanced interactivity are possible with these basic packages. The architecture of MPEG-J is presented
in more detail in ISO/IEC 14496-11.
0.8 Extensible MPEG-4 Textual Format (XMT)
The Extensible MPEG-4 Textual (XMT) format is a textual representation of the multimedia content described
in ISO/IEC 14496 using the Extensible Markup Language (XML). XMT is designed to facilitate the creation
and maintenance of MPEG-4 multimedia content, whether by human authors or by automated machine
programs. XMT is specified in ISO/IEC 14496-11.
The textual representation of MPEG-4 content has high-level abstractions, XMT-O, that allow authors to
exchange their content easily with other authors or authoring tools, while at the same time preserving
semantic intent. XMT also has low-level textual representations, XMT-A, covering the full scope and function
of MPEG-4. The high-level XMT-O is designed to facilitate interoperability with the Synchronized Multimedia
Integration Language (SMIL) 2.0, a recommendation from the W3C Consortium, and also with the Extensible 3D
(X3D) specification, developed by the Web3D Consortium as the next generation of the Virtual Reality Modeling
Language (VRML).
The XMT language has grammars that are specified using the W3C XML Schema language. The grammars
contain rules for element placement and attribute values, etc. These rules for XMT, defined using the Schema
language, follow the binary coding rules defined in ISO/IEC 14496-11 and help ensure that the textual
representation can be coded into correct binary according to ISO/IEC 14496-11 coding rules.
All constructs in the ISO/IEC 14496 specification have their parallel in the XMT textual format. For the Visual
and Audio parts, XMT provides a means to reference external media streams of either pre-encoded or raw
audiovisual binary content. While XMT does not contain a textual format for audiovisual media, it does contain
hints in a textual format that allow an XMT tool to encode and embed the audiovisual media into a complete
MPEG-4 presentation.
0.9 Patent Rights
The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC)
draw attention to the fact that it is claimed that compliance with this document may involve the use of a patent.
The ISO and IEC take no position concerning the evidence, validity and scope of this patent right.
The holder of this patent right has assured the ISO and IEC that he is willing to negotiate licences under
reasonable and non-discriminatory terms and conditions with applicants throughout the world. In this respect,
the statement of the holder of this patent right is registered with the ISO and IEC. Information may be obtained
from the companies listed in Annex J.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights other than those identified in Annex J. ISO and IEC shall not be held responsible for identifying any or
all such patent rights.
INTERNATIONAL STANDARD ISO/IEC 14496-1:2010(E)
Information technology — Coding of audio-visual objects —
Part 1:
Systems
1 Scope
This part of ISO/IEC 14496 specifies system level functionalities for the communication of interactive audio-
visual scenes, i.e. the coded representation of information related to the management of data streams
(synchronization, identification, description and association of stream content).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 639-2:1998, Codes for the representation of names of languages — Part 2: Alpha-3 code
ISO/IEC 10646-1:2000, Information technology — Universal Multiple-Octet Coded Character Set (UCS) —
Part 1: Architecture and Basic Multilingual Plane
ISO/IEC 11172-2:1993, Information technology — Coding of moving pictures and associated audio for digital
storage media at up to about 1,5 Mbit/s — Part 2: Video
ISO/IEC 11172-3:1993, Information technology — Coding of moving pictures and associated audio for digital
storage media at up to about 1,5 Mbit/s — Part 3: Audio
ISO/IEC 13818-3:1998, Information technology — Generic coding of moving pictures and associated audio
information — Part 3: Audio
ISO/IEC 13818-7:2006, Information technology — Generic coding of moving pictures and associated audio
information — Part 7: Advanced Audio Coding (AAC)
ISO/IEC 14496-2:2004, Information technology — Coding of audio-visual objects — Part 2: Visual
ISO/IEC 14496-10:2009, Information technology — Coding of audio-visual objects — Part 10: Advanced
Video Coding
ISO/IEC 14496-15:2004, Information technology — Coding of audio-visual objects — Part 15: Advanced
Video Coding (AVC) file format
ISO/IEC 14496-16:2006, Information technology — Coding of audio-visual objects — Part 16: Animation
Framework eXtension (AFX)
ISO/IEC 14496-18:2004, Information technology — Coding of audio-visual objects — Part 18: Font
compression and streaming
ISO/IEC 13818-2:2000, Information technology — Generic coding of moving pictures and associated audio
information — Part 2: Video
ISO/IEC 10918-1:1994, Information technology — Digital compression and coding of continuous-tone still
images — Part 1: Requirements and guidelines
ANSI/SMPTE 291M:1996, Television — Ancillary Data Packet and Space Formatting
SMPTE 315M:1999, Television — Camera Positioning Information Conveyed by Ancillary Data Packets
W3C Recommendation: 28 October 2004 — XML Schema, http://www.w3.org/TR/xmlschema-0/
3 Additional references
For additional references see the Bibliography.
4 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
4.1
access unit
AU
smallest individually accessible portion of data within an elementary stream to which unique timing
information can be attributed
4.2
alpha map
representation of the transparency parameters associated with a texture map
4.3
audio-visual object
representation of a natural or synthetic object that has an audio and/or visual manifestation
NOTE The representation corresponds to a node or a group of nodes in the BIFS scene description. Each audio-
visual object is associated with zero or more elementary streams using one or more object descriptors.
4.4
audio-visual scene
AV scene
set of audio-visual objects together with scene description information that defines their spatial and temporal
attributes including behaviors resulting from object and user interactions
4.5
AVC parameter set
sequence parameter set or a picture parameter set
4.6
AVC access unit
access unit made up of NAL Units as defined in ISO/IEC 14496-10 with the structure defined in
ISO/IEC 14496-15:2004, 5.2.3
4.7
AVC parameter set access unit
access unit made up only of sequence parameter set NAL units or picture parameter set NAL units to be
applied at the same time stamp
4.8
AVC parameter set elementary stream
elementary stream made up only of AVC parameter set access units
4.9
AVC video elementary stream
elementary stream containing access units made up of NAL units for coded picture data
4.10
binary format for scene
BIFS
coded representation of a parametric scene description format as specified in ISO/IEC 14496-11
4.11
buffer model
model that defines how a terminal complying with ISO/IEC 14496 manages the buffer resources that are
needed to decode a presentation
4.12
byte aligned
position in a coded bit stream at a distance that is a multiple of 8 bits from the first bit in the stream
4.13
clock reference
special time stamp that conveys a reading of a time base
4.14
composition
process of applying scene description information in order to identify the spatio-temporal attributes and
hierarchies of audio-visual objects
4.15
composition memory
CM
random access memory that contains composition units
4.16
composition time stamp
CTS
indication of the nominal composition time of a composition unit
4.17
composition unit
CU
individually accessible portion of the output that a decoder produces from access units
4.18
compression layer
layer of a system according to the specifications in ISO/IEC 14496 that translates between the coded
representation of an elementary stream and its decoded representation; it incorporates the decoders
4.19
control point
point on a given elementary stream in a terminal where IPMP Processing on stream data is carried out
4.20
decoder
entity that translates between the coded representation of an elementary stream and its decoded
representation
4.21
decoding buffer
DB
buffer at the input of a decoder that contains access units
4.22
decoder configuration
configuration of a decoder for processing its elementary stream data by using information contained in its
elementary stream descriptor
4.23
decoding time stamp
DTS
indication of the nominal decoding time of an access unit
4.24
delivery layer
generic abstraction for delivery mechanisms (computer networks, etc.) able to store or transmit a number of
multiplexed elementary streams or M4Mux streams
4.25
descriptor
data structure that is used to describe particular aspects of an elementary stream or a coded audio-visual
object
4.26
DMIF application interface
DAI
interface specified in ISO/IEC 14496-6 used to model the exchange of SL-packetized stream data and
associated control information between the sync layer and the delivery layer
4.27
elementary stream
ES
consecutive flow of mono-media data from a single source entity to a single destination entity on the
compression layer
4.28
elementary stream descriptor
structure contained in object descriptors that describes the encoding format, initialization information, sync
layer configuration, and other descriptive information about the content carried in an elementary stream
4.29
elementary stream interface
ESI
conceptual interface modeling the exchange of elementary stream data and associated control information
between the compression layer and the sync layer
4.30
M4Mux channel
FMC
label to differentiate between data belonging to different constituent streams within one M4Mux stream
NOTE A sequence of data in one M4Mux channel within a M4Mux stream corresponds to one single SL-packetized
stream.
4.31
M4Mux packet
smallest data entity managed by the M4Mux tool consisting of a header and a payload
4.32
M4Mux stream
sequence of M4Mux Packets with data from one or more SL-packetized streams that are each identified by
their own M4Mux channel
4.33
M4Mux tool
tool that allows the interleaving of data from multiple data streams
4.34
graphics profile
profile that specifies the permissible set of graphical elements of the BIFS tool that may be used in a scene
description stream
NOTE BIFS comprises both graphical and scene description elements.
4.35
inter
mode for coding parameters that uses previously coded parameters to construct a prediction
4.36
interaction stream
elementary stream that conveys user interaction information
4.37
intra
mode for coding parameters that does not make reference to previously coded parameters to perform the
encoding
4.38
initial object descriptor
special object descriptor that allows the receiving terminal to gain initial access to portions of content encoded
according to ISO/IEC 14496 and that conveys profile and level information to describe the complexity of the
content
4.39
intellectual property identification
IPI
unique identification of one or more elementary streams corresponding to parts of one or more audio-visual
objects
4.40
intellectual property management and protection system
IPMP system
generic term for mechanisms and tools to manage and protect intellectual property
NOTE This part of ISO/IEC 14496 defines the interface to such systems as well as the following.
⎯ The provision for the identification of IPMP tools either through the use of a registration authority or through the use
of a functional description of the IPMP tools' capabilities in a parametric fashion.
⎯ Controlling the time of instantiation of IPMP tools either by the inclusion of references to the required IPMP tools or at
the request of already instantiated IPMP tools.
⎯ Providing secure messaging between IPMP tools and the terminal and between IPMP tools and the user.
⎯ Notification of the instantiation of IPMP tools to IPMP tools requesting such notification.
⎯ Interaction between IPMP tools, and/or the terminal and the user.
⎯ The carriage of IPMP tools within the bitstream.
4.41
IPMP information
information directed to a given IPMP Tool to enable, assist or facilitate its operation
4.42
IPMP system
monolithic IPMP protection scheme that requires implementation-dependent access to protected streams at
required control points and must provide any intra-communication within an IPMP System on an
implementation basis
NOTE In this standard, the term “IPMP System” is used in some cases to indicate either an actual IPMP
System or a combination of IPMP Tools whose combination provides the functionality of an IPMP System. In cases where
the distinction is important the proper respective terms are used.
4.43
IPMP tool
module that performs (one or more) IPMP functions such as authentication, decryption, watermarking
NOTE Conceptually, one or more IPMP tools are combined to perform the functionality of an IPMP system.
IPMP tools, as opposed to IPMP systems, are normatively identified as to which control points they function at as well as
are provided normative methods for secure communicati
...








