ISO/IEC 14496-11:2015
Information technology — Coding of audio-visual objects — Part 11: Scene description and application engine
ISO/IEC 14496-11:2015 specifies: 1. the coded representation of the spatio-temporal positioning of audio-visual objects as well as their behavior in response to interaction (scene description); 2. the Extensible MPEG-4 Textual (XMT) format, a textual representation of the multimedia content described in ISO/IEC 14496 using the Extensible Markup Language (XML); and 3. a system level description of an application engine (format, delivery, lifecycle, and behavior of downloadable Java byte code applications).
Technologies de l'information — Codage des objets audiovisuels — Partie 11: Description de scène et moteur d'application
Standards Content (Sample)
INTERNATIONAL STANDARD ISO/IEC 14496-11
Second edition
2015-11-01
Information technology — Coding of audio-visual objects —
Part 11:
Scene description and application engine
Technologies de l'information — Codage des objets audiovisuels —
Partie 11: Description de scène et moteur d'application
Reference number
ISO/IEC 14496-11:2015(E)
© ISO/IEC 2015
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
© ISO/IEC 2015
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents Page
Foreword . v
0 Introduction . vii
0.1 Scene Description . vii
0.2 Extensible MPEG-4 Textual Format . ix
0.3 MPEG-J . ix
1 Scope . 1
2 Normative references . 1
3 Additional reference . 2
4 Terms and definitions . 2
5 Abbreviations and Symbols . 7
6 Conventions . 7
7 MPEG-4 Systems Node Semantics . 8
7.1 Scene Description . 8
7.2 Node Semantics . 24
7.3 Informative: Differences Between MPEG-4 Scripts and ECMA Scripts . 181
7.4 Informative: FlexTime behavior . 182
7.5 Informative: Implementation of MaterialKey node . 183
7.6 Informative: Example implementation of spatial audio processing (perceptual approach) . 184
7.7 Informative: MPEG-4 Audio TTS application with Facial Animation . 188
7.8 Informative: 3D Mesh Coding in BIFS scenes . 188
7.9 Profiles. 189
7.10 Metric information for resident fonts . 220
7.11 Font metrics for SANS SERIF font (Albany) . 221
7.12 Font metrics for SERIF font (Thorndale) . 227
7.13 Font metrics for TYPEWRITER font (Cumberland) . 234
8 BIFS . 242
8.1 Introduction . 242
8.2 Decoding tables, data structures and associated functions . 242
8.3 Quantization . 247
8.4 Compensation process . 257
8.5 BIFS Configuration . 258
8.6 BIFS Command Syntax . 262
8.7 BIFS Scene . 274
8.8 BIFS-Anim . 305
8.9 Interpolator compression . 310
8.10 Definition of bodySceneGraph nodes . 349
8.11 Adaptive Arithmetic Decoder for BIFS-Anim . 357
8.12 Informative : Adaptive Arithmetic Encoder for BIFS-Anim . 359
8.13 View Dependent Object Scalability . 360
9 The Extensible MPEG-4 Textual Format . 381
9.1 Introduction . 381
9.2 XMT-A Format . 381
9.3 XMT-Ω Format . 433
9.4 XMT-C Modules . 478
9.5 XMT Schemas . 486
9.6 Informative: XMT/X3D Compatibility . 486
9.7 Informative: The usage of XMT-A BitWrapper element in authoring side . 487
10 MPEG-J . 500
10.1 Architecture . 500
10.2 MPEG-J Session . 502
10.3 Delivery of MPEG-J Data . 503
10.4 MPEG-J API List . 506
10.5 Informative: Starting the Java Virtual Machine . 512
10.6 Informative: Examples of MPEG-J API usage . 513
Annex A (normative) Curve-based animators . 522
Annex B (normative) Procedural textures algorithms . 525
Annex C (informative) Text Processing in BIFS . 530
Annex D (informative) Patent statements . 532
Bibliography . 533
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 14496-11 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information Technology,
Subcommittee SC 29, Coding of Audio, Picture, Multimedia and Hypermedia Information.
This second edition cancels and replaces the first edition, which has been technically revised.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of
audio-visual objects:
— Part 1: Systems
— Part 2: Visual
— Part 3: Audio
— Part 4: Conformance testing
— Part 5: Reference software
— Part 6: Delivery Multimedia Integration Framework (DMIF)
— Part 7: Optimized reference software for coding of audio-visual objects [Technical Report]
— Part 8: Carriage of ISO/IEC 14496 contents over IP networks
— Part 9: Reference hardware description [Technical Report]
— Part 10: Advanced Video Coding
— Part 11: Scene description and application engine
— Part 12: ISO base media file format
— Part 13: Intellectual Property Management and Protection (IPMP) extensions
— Part 14: MP4 file format
— Part 15: Advanced Video Coding (AVC) file format
— Part 16: Animation Framework eXtension (AFX)
— Part 17: Streaming text format
— Part 18: Font compression and streaming
— Part 19: Synthesized texture stream
— Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
— Part 21: MPEG-J GFX
Introduction
0.1 Scene Description
0.1.1 Overview
ISO/IEC 14496 addresses the coding of audio-visual objects of various types: natural video and audio objects as well as
textures, text, 2- and 3-dimensional graphics, and also synthetic music and sound effects. To reconstruct a multimedia
scene at the terminal, it is hence not sufficient to transmit the raw audio-visual data to a receiving terminal. Additional
information is needed in order to combine this audio-visual data at the terminal and construct and present to the end-user
a meaningful multimedia scene. This information, called scene description, determines the placement of audio-visual
objects in space and time and is transmitted together with the coded objects as illustrated in Figure 1. Note that the scene
description only describes the structure of the scene. The action of assembling these objects in the same representation
space is called composition. The action of transforming these audio-visual objects from a common representation space to
a specific presentation device (i.e. speakers and a viewing window) is called rendering.
[Figure: an object-based multimedia scene. Audio-visual objects (voice, audiovisual presentation, sprite, 2D background, 3D objects) are placed in a scene coordinate system (x, y, z); multiplexed downstream and upstream control/data paths connect the terminal; video and audio compositors feed a display and speaker; a projection plane, hypothetical viewer, user input and user events complete the picture.]
Figure 1 — An example of an object-based multimedia scene
Independent coding of different objects may achieve higher compression, and also brings the ability to manipulate content
at the terminal. The behaviors of objects and their response to user inputs can thus also be represented in the scene
description.
The scene description framework used in this part of ISO/IEC 14496 is based largely on ISO/IEC 14772-1:1997 (Virtual
Reality Modeling Language — VRML).
0.1.2 Composition and Rendering
ISO/IEC 14496-11 defines the syntax and semantics of bitstreams that describe the spatio-temporal relationships of audio-
visual objects. For visual data, particular composition algorithms are not mandated since they are implementation-
dependent; for audio data, subclause 7.1.1.2.13 and the semantics of the AudioBIFS nodes normatively define the
composition process. The manner in which the composed scene is presented to the user is not specified for audio or
visual data. The scene description representation is termed “BInary Format for Scenes” (BIFS).
0.1.3 Scene Description
In order to facilitate the development of authoring, editing and interaction tools, scene descriptions are coded
independently from the audio-visual media that form part of the scene. This permits modification of the scene without
having to decode or process in any way the audio-visual media. The following clauses detail the scene description
capabilities that are provided by ISO/IEC 14496-11.
0.1.3.1 Grouping of audio-visual objects
A scene description follows a hierarchical structure that can be represented as a graph. Nodes of the graph form audio-visual objects, as illustrated in Figure 2. The structure is not necessarily static; nodes may be added, deleted or modified.
[Figure: the example scene as a tree. The scene node has children person, 2D background, furniture and audiovisual presentation; the person comprises a voice and a sprite, and the furniture a globe and a desk.]
Figure 2 — Logical structure of example scene
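The hierarchy can be made concrete with a short, informative sketch. The fragment below uses the X3D-style XML syntax on which the XMT-A format of 0.2 is based; the element names and the file URL are illustrative only and do not constitute a normative XMT-A document. It groups a visual child and an audio child under a single parent node, mirroring the person branch of Figure 2.

<!-- Illustrative scene-graph fragment: a "person" grouping node whose
     children are a visual object and an audio object ("voice"). -->
<Group DEF="PERSON">
  <Transform translation="0 1.6 0">
    <Shape>
      <Appearance><Material diffuseColor="0.8 0.6 0.5"/></Appearance>
      <Sphere radius="0.12"/>
    </Shape>
  </Transform>
  <Sound intensity="1">
    <!-- In an MPEG-4 scene the audio source would normally reference an
         elementary stream through an object descriptor; a plain file URL
         is used here purely for illustration. -->
    <AudioClip url="voice.wav" loop="true"/>
  </Sound>
</Group>

Adding, deleting or replacing such nodes at run time is what the non-static structure mentioned above refers to; in BIFS this is carried out with the command mechanisms of 8.6.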
0.1.3.2 Spatio-temporal positioning of objects
Audio-visual objects have both a spatial and a temporal extent. Complex audio-visual objects are constructed by
combining appropriate scene description nodes to build up the scene graph. Audio-visual objects may be located in 2D or
3D space. Each audio-visual object has a local co-ordinate system. A local co-ordinate system is one in which the audio-
visual object has a pre-defined (but possibly varying) spatio-temporal location and scale (size and orientation). Audio-
visual objects are positioned in a scene by specifying a co-ordinate transformation from the object’s local co-ordinate
system into another co-ordinate system defined by a parent node in the scene graph.
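As an informative sketch of this transformation chain (again in illustrative X3D-style XML rather than a normative XMT-A document), a child object defined in its own local co-ordinate system is placed in the scene through its parent's transform:

<!-- The outer Transform defines the desk's co-ordinate system within the
     scene; the inner Transform places the globe relative to the desk. -->
<Transform DEF="DESK" translation="2 0 0" rotation="0 1 0 0.5">
  <Transform DEF="GLOBE" translation="0 0.5 0" scale="0.3 0.3 0.3">
    <Shape>
      <Appearance><Material diffuseColor="0.2 0.4 0.9"/></Appearance>
      <Sphere radius="1"/>
    </Shape>
  </Transform>
</Transform>

Because the globe's local origin (0, 0.5, 0) lies on the desk's rotation axis, it maps to (2, 0.5, 0) in the desk's parent co-ordinate system; changing only the outer transform re-positions the whole subtree without touching the child nodes.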
0.1.3.3 Attributes of audio-visual objects
Scene description nodes expose a set of parameters through which aspects of their appearance and behavior can be
controlled.
EXAMPLE The volume of a sound; the color of a synthetic visual object; the source of a streaming video.
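In scene description terms, each of these attributes is simply a field of a node. The informative fragment below (illustrative X3D-style XML; in an MPEG-4 scene the video url would normally reference an object descriptor rather than a media file) shows the three attributes of the example as field values:

<Group>
  <!-- Volume of a sound: the intensity field of the Sound node. -->
  <Sound intensity="0.7">
    <AudioClip url="narration.wav"/>
  </Sound>
  <!-- Color of a synthetic object and source of a streaming video:
       fields of the Material and MovieTexture nodes. -->
  <Shape>
    <Appearance>
      <Material diffuseColor="1 0 0"/>
      <MovieTexture url="movie.mp4" loop="true"/>
    </Appearance>
    <Box size="2 1.5 0.01"/>
  </Shape>
</Group>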
0.1.3.4 Behavior of audio-visual objects
ISO/IEC 14496-11 provides tools for enabling dynamic scene behavior and user interaction with the presented content.
User interaction can be separated into two major categories: client-side and server-side. Client-side interaction is an
integral part of the scene description described herein. Server-side interaction is not dealt with.
Client-side interaction involves content manipulation that is handled locally at the end-user’s terminal. It consists of the
modification of attributes of scene objects according to specified user actions.
EXAMPLE A user can click on a scene to start an animation or video sequence. The facilities for describing such
interactive behavior are part of the scene description, thus ensuring the same behavior in all terminals conforming to
ISO/IEC 14496-11.
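The example can be sketched informatively as follows (illustrative X3D/VRML-style XML using ISO/IEC 14772-1 node and field names; not a normative XMT-A document): a sensor node detects the click and a ROUTE carries the resulting event to the field of the node whose attribute is to change.

<!-- Clicking on the shape starts the movie: the TouchSensor's touchTime
     event is routed into the MovieTexture's startTime field. -->
<Transform>
  <TouchSensor DEF="CLICK"/>
  <Shape>
    <Appearance>
      <MovieTexture DEF="MOVIE" url="trailer.mp4" loop="false"/>
    </Appearance>
    <Box size="4 3 0.01"/>
  </Shape>
</Transform>
<ROUTE fromNode="CLICK" fromField="touchTime" toNode="MOVIE" toField="startTime"/>

Because the sensor, the route and the target field are all part of the scene description itself, every conforming terminal interprets the interaction in the same way.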
0.2 Extensible MPEG-4 Textual Format
0.2.1 Overview
The Extensible MPEG-4 Textual format (XMT) is a framework (illustrated in Figure 3) for representing MPEG-4 scene description using a textual syntax. XMT allows content authors to exchange their content with other authors, tools or service providers, and facilitates interoperability with both Extensible 3D (X3D), developed by the Web3D Consortium, and the Synchronized Multimedia Integration Language (SMIL) from the W3C.
[Figure: the XMT framework. XMT is shown alongside SMIL, SVG, X3D and MPEG-7; XMT content can be parsed for a SMIL player, passed to a VRML browser, or compiled to an MPEG-4 representation (e.g. mp4) for an MPEG-4 player.]
Figure 3 — Overview of the XMT Framework
0.2.2 Interoperability of XMT
The XMT format can be interchanged between SMIL players, VRML players and MPEG-4 players. The format can be parsed and played directly by a W3C SMIL player, preprocessed to Web3D X3D and played back by a VRML player, or compiled to an MPEG-4 representation such as MP4, which can then be played by an MPEG-4 player. Figure 3 gives a graphical description of this interoperability.
0.2.3 Two-tier Architecture: XMT-A and XMT-Ω Formats
The XMT framework consists of two levels of textual syntax and semantics: the XMT-A format and the XMT-Ω format, abbreviated below as A and Ω, respectively, and used interchangeably where no confusion can arise.
XMT-A is an XML-based representation of MPEG-4 content that contains a subset of X3D, together with an MPEG-4 extension to X3D that represents MPEG-4-specific features. XMT-A provides a straightforward, one-to-one mapping between the textual and binary formats.
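As an informative illustration of that one-to-one character (a sketch only; the normative element and attribute definitions are those of the XMT schemas referenced in 9.5), the same node is shown in the VRML-style textual notation of Clause 7 and in the XML form on which XMT-A is built:

<!-- VRML-style textual form (Clause 7 notation):
       Transform {
         translation 1 0 0
         children [ Shape { geometry Box { size 2 2 2 } } ]
       }
     Field-for-field XML counterpart: -->
<Transform translation="1 0 0">
  <Shape>
    <Box size="2 2 2"/>
  </Shape>
</Transform>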
XMT-Ω is a high-level abstraction of MPEG-4 features whose design is based on the W3C SMIL. Since there is no deterministic mapping between the two formats, the XMT provides a default mapping from Ω to A, and it also gives content authors an escape mechanism from Ω to A.
In addition, an XMT-C (Common) section contains the definition of elements and attributes that may be used within either XMT-A or XMT-Ω.
0.3 MPEG-J
0.3.1 Overview
MPEG-J is a flexible programmatic control system that represents an audio-visual session in a manner that allows the session to adapt to the operating characteristics of the terminal on which it is presented. Two important characteristics are supported: first, the capability to allow graceful degradation under limited or time-varying resources, and second, the ability to respond to user interaction and provide enhanced multimedia functionality.
More specifically, Clause 10 normatively defines:
— the format and delivery of Java byte code, by specifying the MPEG-J stream format and the delivery mechanism of such a stream (Java byte code and associated data);
— the MPEG-J Session and the MPEG-J application lifecycle; and
— the interactions and behavior of byte code through the specification of Java APIs.
0.3.2 Organization of the MPEG-J specification
10.1 gives an overall architecture of the MPEG-J system. MPEG-J Session start-up is walked through in 10.2. The delivery of MPEG-J data to the terminal is specified in 10.3. 10.4 specifies the different categories of APIs that a program in the form of Java bytecode would use. 10.5 is an informative subclause on starting the Java Virtual Machine. The electronic annex attached to this document lists the normative MPEG-J APIs in HTML format. 10.6 illustrates the usage of the MPEG-J APIs through a few examples.
The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) draw
attention to the fact that it is claimed that compliance with this document may involve the use of patents.
The ISO and IEC take no position concerning the evidence, validity and scope of these patent rights.
The holders of these patent rights have assured the ISO and IEC that they are willing to negotiate licences under reasonable and non-discriminatory terms and conditions with applicants throughout the world. In this respect, the statements of the holders of these patent rights are registered with the ISO and IEC. Information may be obtained from the companies listed in Annex D.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights other
than those identified in Annex D. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
INTERNATIONAL STANDARD ISO/IEC 14496-11:2015(E)
Information technology — Coding of audio-visual objects —
Part 11:
Scene description and application engine
1 Scope
This part of ISO/IEC 14496 specifies:
1. the coded representation of the spatio-temporal positioning of audio-visual objects as well as their behavior in
response to interaction (scene description);
2. the Extensible MPEG-4 Textual (XMT) format, a textual representation of the multimedia content described in
ISO/IEC 14496 using the Extensible Markup Language (XML); and
3. a system level description of an application engine (format, delivery, lifecycle, and behavior of downloadable Java byte
code applications).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated references, only the
edition cited applies. For undated references, the latest edition of the referenced document (including any amendments)
applies.
ISO 639-2:1998, Codes for the representation of names of languages — Part 2: Alpha-3 code
ISO 3166-1:1997, Codes for the representation of names of countries and their subdivisions — Part 1: Country codes
ISO 9613-1:1993, Acoustics — Attenuation of sound during propagation outdoors — Part 1: Calculation of the absorption
of sound by the atmosphere
ISO/IEC 11172-2:1993, Information technology — Coding of moving pictures and associated audio for digital storage
media at up to about 1,5 Mbit/s — Part 2: Video
ISO/IEC 11172-3:1993, Information technology — Coding of moving pictures and associated audio for digital storage
media at up to about 1,5 Mbit/s — Part 3: Audio
ISO/IEC 13818-3:1998, Information technology — Generic coding of moving pictures and associated audio information —
Part 3: Audio
ISO/IEC 13818-7:2004, Information technology — Generic coding of moving pictures and associated audio information —
Part 7: Advanced Audio Coding (AAC)
ISO/IEC 14496-2:2004, Information technology — Coding of audio-visual objects — Part 2: Visual
ISO/IEC 14772-1:1997, Information technology — Computer graphics and image processing — The Virtual Reality
Modeling Language — Part 1: Functional specification and UTF-8 encoding
ISO/IEC 14772-1:1997/Amd.1:2003, Information technology — Computer graphics and image processing — The Virtual
Reality Modeling Language — Part 1: Functional specification and UTF-8 encoding — Amendment 1: Enhanced
interoperability
ISO/IEC 16262:2002, Information technology — ECMAScript language specification
ISO/IEC 13818-2:2000, Information technology — Generic coding of moving pictures and associated audio information —
Part 2: Video
ISO/IEC 10918-1:1994, Information technology — Digital compression and coding of continuous-tone still images:
Requirements and guidelines
IEEE Std 754-1985, Standard for Binary Floating-Point Arithmetic
Addison-Wesley, September 1996, The Java Language Specification, by James Gosling, Bill Joy and Guy Steele, ISBN 0-201-63451-1
Addison-Wesley, September 1996, The Java Virtual Machine Specification, by T. Lindholm and F. Yellin, ISBN 0-201-63452-X
Addison-Wesley, July 1998, The Java Class Libraries, Second Edition, Volume 1, by Patrick Chan, Rosanna Lee and Douglas Kramer, ISBN 0-201-31002-3
Addison-Wesley, July 1998, The Java Class Libraries, Second Edition, Volume 2, by Patrick Chan and Rosanna Lee, ISBN 0-201-31003-1
Addison-Wesley, May 1996, The Java Application Programming Interface, Volume 1: Core Packages, by J. Gosling, F. Yellin and the Java Team, ISBN 0-201-63453-8
DAVIC 1.4.1 specification Part 9: Information Representation
ANSI/SMPTE 291M-1996, Television — Ancillary Data Packet and Space Formatting
SMPTE 315M-1999, Television — Camera Positioning Information Conveyed by Ancillary Data Packets
3 Additional reference
ISO/IEC 13522-6:1998, Information technology — Coding of multimedia and hypermedia information — Part 6: Support for
enhanced interactive applications. This reference contains the full normative references to Java APIs and the Java Virtual
Machine as described in the normative references above.
4 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
4.1
Access Unit (AU)
individually accessible portion of data within an elementary stream
NOTE An access unit is the smallest data entity to which timing information can be attributed.
4.2
Alpha map
representation of the transparency parameters associated with a texture map
4.3
audio-visual object
representation of a natural or synthetic object that has an audio and/or visual manifestation
NOTE The representation corresponds to a node or a group of nodes in the BIFS scene description. Each audio-
visual object is associated with zero or more elementary streams using one or more object descriptors.
4.4
audio-visual scene (AV scene)
set of audio-visual objects together with scene description information that defines their spatial and temporal attributes
including behaviors resulting from object and user interactions
4.5
Binary Format for Scene (BIFS)
coded representation of a parametric scene description format
4.6
buffer model
model that defines how a terminal complying with ISO/IEC 14496 manages the buffer resources that are needed to
decode a presentation
4.7
byte aligned
position in a coded bit stream at a distance of a multiple of 8 bits from the first bit in the stream
4.8
clock reference
special time stamp that conveys a reading of a time base
4.9
composition
process of applying scene description information in order to identify the spatio-temporal attributes and hierarchies of
audio-visual objects
4.10
Composition Memory (CM)
random access memory that contains composition units
...