Information technology — Coding of audio-visual objects — Part 2: Visual

ISO/IEC 14496-2:2004 provides the following elements related to the encoded representation of visual information:

— specification of video coding tools, object types and profiles, including the capability to encode rectangular and arbitrarily shaped video objects, to define scalable bitstreams, and to use error-resilient encoding tools;

— specification of coding tools, object types and profiles for mapping still textures into visual scenes;

— specification of coding tools, object types and profiles for human face and body animation based on face/body models and additional semantic parameters; and

— specification of coding tools, object types and profiles for animation of 2D warping grids with uniform and irregular topology.

The Visual specification defines the bitstream syntax, bitstream semantics and the related decoding process. It does not specify encoders, which can be optimized differently in each implementation.

Technologies de l'information — Codage des objets audiovisuels — Partie 2: Codage visuel

General Information

Status: Published
Publication Date: 23-May-2004
Current Stage: 9093 - International Standard confirmed
Completion Date: 23-Jun-2021

Buy Standard

Standard
ISO/IEC 14496-2:2004 - Information technology -- Coding of audio-visual objects
English language
706 pages

Standards Content (Sample)

INTERNATIONAL ISO/IEC
STANDARD 14496-2
Third edition
2004-06-01


Information technology — Coding of
audio-visual objects — Part 2: Visual
Technologies de l'information — Codage des objets audiovisuels —
Partie 2: Codage visuel




Reference number
ISO/IEC 14496-2:2004(E)
© ISO/IEC 2004

PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.


©  ISO/IEC 2004
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviations and symbols
4.1 Arithmetic operators
4.2 Logical operators
4.3 Relational operators
4.4 Bitwise operators
4.5 Conditional operators
4.6 Assignment
4.7 Mnemonics
4.8 Constants
5 Conventions
5.1 Method of describing bitstream syntax
5.2 Definition of functions
5.3 Reserved, forbidden and marker_bit
5.4 Arithmetic precision
6 Visual bitstream syntax and semantics
6.1 Structure of coded visual data
6.2 Visual bitstream syntax
6.3 Visual bitstream semantics
7 The visual decoding process
7.1 Video decoding process
7.2 Higher syntactic structures
7.3 VOP reconstruction
7.4 Texture decoding
7.5 Shape decoding
7.6 Motion compensation decoding
7.7 Interlaced video decoding
7.8 Sprite decoding
7.9 Generalized scalable decoding
7.10 Still texture object decoding
7.11 Mesh object decoding
7.12 FBA object decoding
7.13 3D Mesh Object Decoding
7.14 NEWPRED mode decoding
7.15 Output of the decoding process
7.16 Video object decoding for the studio profile
7.17 The FGS decoding process
8 Visual-Systems Composition Issues
8.1 Temporal Scalability Composition
8.2 Sprite Composition
8.3 Mesh Object Composition
8.4 Spatial Scalability composition
9 Profiles and Levels
9.1 Visual Object Types
9.2 Visual Profiles
9.3 Visual Profiles@Levels
Annex A (normative) Coding transforms
Annex B (normative) Variable length codes and arithmetic decoding
Annex C (normative) Face and body object decoding tables and definitions
Annex D (normative) Video buffering verifier
Annex E (informative) Features supported by the algorithm
Annex F (informative) Preprocessing and postprocessing
Annex G (normative) Profile and level indication and restrictions


Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission)
form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC
participate in the development of International Standards through technical committees established by the
respective organization to deal with particular fields of technical activity. ISO and IEC technical committees
collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in
liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have
established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International Standards
adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International
Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights.
ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 14496-2 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This third edition cancels and replaces the second edition (ISO/IEC 14496-2:2001), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-2:2001/Amd. 1:2002,
ISO/IEC 14496-2:2001/Amd. 2:2002 and ISO/IEC 14496-2:2001/Amd. 3:2003.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of audio-
visual objects:
— Part 1: Systems
— Part 2: Visual
— Part 3: Audio
— Part 4: Conformance testing
— Part 5: Reference software
— Part 6: Delivery Multimedia Integration Framework (DMIF)
— Part 7: Optimized reference software for coding of audio-visual objects
— Part 8: Carriage of ISO/IEC 14496 content over IP networks
— Part 9: Reference hardware description
— Part 10: Advanced video coding
— Part 11: Scene description and application engine
— Part 12: ISO base media file format
— Part 13: Intellectual Property Management and Protection (IPMP) extensions
— Part 14: MP4 file format
— Part 15: Advanced Video Coding (AVC) file format
— Part 16: Animation framework extension (AFX)
— Part 17: Streaming text format
— Part 18: Font compression and streaming
— Part 19: Synthesized texture stream

Introduction
Purpose
This part of ISO/IEC 14496 was developed in response to the growing need for a coding method that can facilitate access to visual objects in natural and synthetic moving pictures, and to associated natural or synthetic sound, for applications such as digital storage media, the Internet, and various forms of wired or wireless communication.
The use of ISO/IEC 14496 means that motion video can be manipulated as a form of computer data and can be
stored on various storage media, transmitted and received over existing and future networks and distributed on
existing and future broadcast channels.
Application
The applications of ISO/IEC 14496 include, but are not limited to, the areas listed below:
IMM Internet Multimedia
IVG Interactive Video Games
IPC Interpersonal Communications (videoconferencing, videophone, etc.)
ISM Interactive Storage Media (optical disks, etc.)
MMM Multimedia Mailing
NDB Networked Database Services (via ATM, etc.)
RES Remote Emergency Systems
RVS Remote Video Surveillance
WMM Wireless Multimedia
Profiles and levels
ISO/IEC 14496 is intended to be generic in the sense that it serves a wide range of applications, bitrates,
resolutions, qualities and services. Furthermore, it allows a number of modes of coding of both natural and
synthetic video in a manner facilitating access to individual objects in images or video, referred to as content based
access. Applications should cover, among other things, digital storage media, content based image and video
databases, internet video, interpersonal video communications, wireless video etc. In the course of creating
ISO/IEC 14496, various requirements from typical applications have been considered, necessary algorithmic
elements have been developed, and they have been integrated into a single syntax. Hence ISO/IEC 14496 will
facilitate the bitstream interchange among different applications.
This part of ISO/IEC 14496 includes one or more complete decoding algorithms as well as a set of decoding tools.
Moreover, the various tools of this part of ISO/IEC 14496, as well as those derived from ISO/IEC 13818-2:2000, can
be combined to form other decoding algorithms. Considering the practicality of implementing the full syntax of this
part of ISO/IEC 14496, however, a limited number of subsets of the syntax are also stipulated by means of “profile”
and “level”.
A “profile” is a defined subset of the entire bitstream syntax specified by this part of ISO/IEC 14496. Within the
bounds imposed by the syntax of a given profile it is still possible to require a very large variation in the
performance of encoders and decoders depending upon the values taken by parameters in the bitstream.
In order to deal with this problem “levels” are defined within each profile. A level is a defined set of constraints
imposed on parameters in the bitstream. These constraints may be simple limits on numbers. Alternatively they
may take the form of constraints on arithmetic combinations of the parameters.
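As an illustration of how such constraints combine, a conformance check can test both a simple limit and an arithmetic combination of bitstream parameters. The sketch below is hypothetical: the function name and the numeric limits are invented for illustration and are not values taken from this part of ISO/IEC 14496.

```python
# Illustrative sketch (not normative): validating bitstream parameters
# against a level's constraints. The limits below are hypothetical
# placeholders, not values from the standard.

def check_level(width, height, frame_rate, bitrate,
                max_mb_per_sec=11880, max_bitrate=384_000):
    """Return True if the parameters satisfy the (hypothetical) level.

    Levels constrain both simple quantities (bitrate) and arithmetic
    combinations of parameters (macroblocks processed per second).
    """
    mb_per_frame = (width // 16) * (height // 16)   # 16x16 macroblocks
    mb_per_sec = mb_per_frame * frame_rate
    return mb_per_sec <= max_mb_per_sec and bitrate <= max_bitrate

# A QCIF stream at 15 fps fits these toy limits; a CIF stream at a
# higher bitrate does not.
print(check_level(176, 144, 15, 64_000))   # True
print(check_level(352, 288, 30, 512_000))  # False
```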
Object based coding syntax
Video object
A video object in a scene is an entity that a user is allowed to access (seek, browse) and manipulate (cut and
paste). The instances of video objects at a given time are called video object planes (VOPs). The encoding process
generates a coded representation of a VOP as well as composition information necessary for display. Further, at
the decoder, a user may interact with and modify the composition process as needed.
The full syntax allows coding of rectangular as well as arbitrarily shaped video objects in a scene. Furthermore, the
syntax supports both nonscalable coding and scalable coding. Thus it becomes possible to handle normal
scalabilities as well as object based scalabilities. The scalability syntax enables the reconstruction of useful video
from pieces of a total bitstream. This is achieved by structuring the total bitstream in two or more layers, starting
from a standalone base layer and adding a number of enhancement layers. The base layer can be coded using a
non-scalable syntax, or in the case of picture based coding, even using a syntax of a different video coding
standard.
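A minimal sketch of the layered idea described above: the base layer decodes on its own, and each enhancement layer refines the result. The additive residual used here is a toy stand-in for the actual spatial, temporal, or fine-granularity enhancement tools; all names are illustrative.

```python
# Illustrative sketch of scalable reconstruction: the base layer is
# usable alone, and any prefix of the enhancement layers improves it.

def reconstruct(base, enhancements):
    """base: list of sample values; enhancements: lists of residuals."""
    out = list(base)
    for layer in enhancements:          # apply in order; stop at any point
        out = [s + r for s, r in zip(out, layer)]
    return out

base = [100, 102, 98]
enh1 = [2, -1, 1]
enh2 = [1, 0, -1]
print(reconstruct(base, []))            # [100, 102, 98]  (base alone is valid)
print(reconstruct(base, [enh1, enh2]))  # [103, 101, 98]  (progressively refined)
```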
To ensure the ability to access individual objects, it is necessary to achieve a coded representation of their shape. A
natural video object consists of a sequence of 2D representations (at different points in time) referred to here as
VOPs. For efficient coding of VOPs, both temporal redundancies as well as spatial redundancies are exploited.
Thus a coded representation of a VOP includes representation of its shape, its motion and its texture.
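The three components named above can be pictured as one record per plane. The field names in this sketch are illustrative only; they are not the syntax elements defined by the standard.

```python
# Conceptual sketch of what a coded VOP carries: its shape, its
# motion, and its texture. Field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class CodedVOP:
    time: float                                   # sampling instant of this plane
    shape: list = field(default_factory=list)     # coded alpha/shape blocks
    motion: list = field(default_factory=list)    # per-macroblock motion vectors
    texture: list = field(default_factory=list)   # coded texture (DCT) blocks

vop = CodedVOP(time=0.04, motion=[(1, 0)], texture=[b"\x12\x34"])
print(vop.time)   # 0.04
```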
FBA object
The FBA object is a collection of nodes in a scene graph which are animated by the FBA (Face and Body
Animation) object bitstream. The FBA object is controlled by two separate bitstreams. The first bitstream, called
BIFS, contains instances of Body Definition Parameters (BDPs) in addition to Facial Definition Parameters (FDPs),
and the second bitstream, FBA bitstream, contains Body Animation Parameters (BAPs) together with Facial
Animation Parameters (FAPs).
A 3D (or 2D) face object is a representation of the human face that is structured for portraying the visual
manifestations of speech and facial expressions adequate to achieve visual speech intelligibility and the recognition
of the mood of the speaker. A face object is animated by a stream of face animation parameters (FAP) encoded for
low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications.
The FAPs manipulate key feature control points in a mesh model of the face to produce animated visemes for the
mouth (lips, tongue, teeth), as well as animation of the head and facial features like the eyes. FAPs are quantised
with careful consideration for the limited movements of facial features, and then prediction errors are calculated and
coded arithmetically. The remote manipulation of a face model in a terminal with FAPs can accomplish lifelike
visual scenes of the speaker in real-time without sending pictorial or video details of face imagery every frame.
A simple streaming connection can be made to a decoding terminal that animates a default face model. A more
complex session can initialize a custom face in a more capable terminal by downloading face definition parameters
(FDP) from the encoder. Thus specific background images, facial textures, and head geometry can be portrayed.
The composition of specific backgrounds, face 2D/3D meshes, texture attribution of the mesh, etc. is described in
ISO/IEC 14496-1:2001. The FAP stream for a given user can be generated at the user’s terminal from video/audio,
or from text-to-speech. FAPs can be encoded at bitrates of 2 kbit/s to 3 kbit/s at the necessary speech rates. Optional
temporal DCT coding provides further compression efficiency in exchange for delay. Using the facilities of
ISO/IEC 14496-1:2001, a composition of the animated face model and synchronized, coded speech audio (low-
bitrate speech coder or text-to-speech) can provide an integrated low-bandwidth audio/visual speaker for broadcast
applications or interactive conversation.
Limited scalability is supported. Face animation achieves its efficiency by employing very concise motion animation
controls in the channel, while relying on a suitably equipped terminal for rendering of moving 2D/3D faces with non-
normative models held in local memory. Models stored and updated for rendering in the terminal can be simple or
complex. To support speech intelligibility, the normative specification of FAPs intends for their selective or complete
use as signaled by the encoder. A masking scheme provides for selective transmission of FAPs according to what
parts of the face are naturally active from moment to moment. A further control in the FAP stream allows face
animation to be suspended while leaving face features in the terminal in a defined quiescent state for higher overall
efficiency during multi-point connections.
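The masking scheme can be pictured as a per-FAP flag vector: only the FAPs flagged active are present in the stream, while the others keep their previous state in the terminal. Indices and names in this sketch are illustrative.

```python
# Sketch of selective FAP transmission via a mask: unflagged FAPs are
# simply not sent, and the terminal retains their previous values.

def apply_fap_mask(previous, mask, transmitted):
    """previous: full FAP vector; mask: per-FAP booleans;
    transmitted: values only for the masked-in FAPs, in order."""
    it = iter(transmitted)
    return [next(it) if active else old
            for old, active in zip(previous, mask)]

prev = [10, 20, 30, 40]
mask = [True, False, False, True]      # only FAPs 0 and 3 are active now
print(apply_fap_mask(prev, mask, [11, 44]))  # [11, 20, 30, 44]
```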
A body model is a representation of a virtual human or human-like character that allows portraying body
movements adequate to achieve nonverbal communication and general actions. A body model is animated by a
stream of body animation parameters (BAP) encoded for low-bitrate transmission in broadcast and dedicated
interactive communications. The BAPs manipulate independent degrees of freedom in the skeleton model of the
body to produce animation of the body parts. The BAPs are quantised considering the joint limitations, and
prediction errors are calculated and coded arithmetically. Similar to the face, the remote manipulation of a body
model in a terminal with BAPs can accomplish lifelike visual scenes of the body in real-time without sending
pictorial and video details of the body every frame.
The BAPs, if correctly interpreted, will produce reasonably similar high-level results in terms of body posture and animation on different body models, without the need to initialize or calibrate the model. The BDP set defines the parameters used to transform the default body into a customized body, optionally with its body surface, body dimensions, and texture.
The body definition parameters (BDP) allow the encoder to replace the local model of a more capable terminal.
BDP parameters include body geometry, calibration of body parts, degrees of freedom, and optionally deformation
information.
The FBA Animation specification is defined in ISO/IEC 14496-1:2001 and this part of ISO/IEC 14496. This clause is
intended to facilitate finding the various parts of the specification. As a rule of thumb, the FAP and BAP specification is found in
Part 2, and the FDP and BDP specification in Part 1. However, this is not a strict rule. For an overview of
FAPs/BAPs and their interpretation, read subclauses “6.1.5.2 Facial animation parameter set”, “6.1.5.3 Facial
animation parameter units”, “6.1.5.4 Description of a neutral face” as well as the Table C.1. The viseme parameter
is documented in subclause “7.12.3 Decoding of the viseme parameter fap 1” and the Table C.5 in Annex C. The
expression parameter is documented in subclause “7.12.4 Decoding of the expression parameter fap 2” and the
Table C.3. FBA bitstream syntax is found in subclauses “6.2.10 FBA Object”, semantics in “6.3.10 FBA Object”,
and subclause “7.12 FBA object decoding” explains in more detail the FAP/BAP decoding process. FAP/BAP
masking and interpolation are explained in subclauses “6.3.11.1 FBA Object Plane”, “7.12.1.1 Decoding of FBA” and
“7.12.5 FBA masking”. The FIT interpolation scheme is documented in subclause “7.2.5.3.2.4 FIT” of
ISO/IEC 14496-1:2001. The FDPs and BDPs and their interpretation are documented in subclause “7.2.5.3.2.6
FDP” of ISO/IEC 14496-1:2001. In particular, the FDP feature points are documented in Figure C-1. Details on
body models are documented in Annex C.
Mesh object
A 2D mesh object is a representation of a 2D deformable geometric shape, with which synthetic video objects may
be created during a composition process at the decoder, by spatially piece-wise warping of existing video object
planes or still texture objects. The instances of mesh objects at a given time are called mesh object planes (mops).
The geometry of mesh object planes is coded losslessly. Temporally and spatially predictive techniques and
variable length coding are used to compress 2D mesh geometry. The coded representation of a 2D mesh object
includes representation of its geometry and motion.
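The temporally predictive idea can be sketched simply: each node's position in the current mesh object plane is predicted from the previous plane, and only the deltas are coded (losslessly, with variable length codes). The record layout below is illustrative, not the standard's syntax.

```python
# Illustrative sketch: reconstructing a mesh object plane (mop) from
# the previous plane plus per-node motion deltas. Small integer deltas
# are what make variable length coding effective.

def decode_mop(prev_nodes, deltas):
    """prev_nodes: list of (x, y) node positions; deltas: list of (dx, dy)."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(prev_nodes, deltas)]

prev = [(0, 0), (16, 0), (8, 12)]        # a single triangle's nodes
deltas = [(1, 0), (0, -1), (2, 2)]       # small motions between planes
print(decode_mop(prev, deltas))          # [(1, 0), (16, -1), (10, 14)]
```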
3D Mesh Object
The 3D Mesh Object is a 3D polygonal model that can be represented as an IndexedFaceSet or Hierarchical 3D
Mesh node in BIFS. It is defined by the position of its vertices (geometry), by the association between each face
and its sustaining vertices (connectivity), and optionally by colours, normals, and texture coordinates (properties).
Properties do not affect the 3D geometry, but influence the way the model is shaded. 3D mesh coding (3DMC)
addresses the efficient coding of 3D mesh objects. It comprises a basic method and several options. The basic
3DMC method operates on manifold models and features incremental representation of single-resolution 3D models.
The model may be triangular or polygonal; the latter are triangulated for coding purposes and are fully recovered
in the decoder. Options include: (a) support for computational graceful degradation control; (b) support for
non-manifold models; (c) support for error resilience; and (d) quality scalability via hierarchical transmission of levels of
detail with implicit support for smooth transition between consecutive levels. The compression of application-
specific geometry streams (Face Animation Parameters) and generalized animation parameters (BIFS Anim) are
currently addressed elsewhere in this part of ISO/IEC 14496.
In 3DMC, the compression of the connectivity of the 3D mesh (e.g. how edges, faces, and vertices relate) is
lossless, whereas the compression of the other attributes (such as vertex coordinates, normals, colours, and
texture coordinates) may be lossy.
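The three ingredients named above (geometry, connectivity, properties) can be pictured as one structure. In line with the lossless/lossy split just described, only the vertex coordinates and properties would pass through quantisation; the face indices round-trip exactly. Field names are illustrative, not the BIFS node definition.

```python
# Conceptual sketch of a 3D mesh object's contents. Not the normative
# IndexedFaceSet structure; names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Mesh3D:
    vertices: list                                # [(x, y, z), ...] quantised (lossy)
    faces: list                                   # [[i, j, k, ...], ...] index lists (lossless)
    colours: list = field(default_factory=list)   # optional properties (may be lossy)

tetra = Mesh3D(
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)],
    faces=[[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]],
)
print(len(tetra.faces))  # 4
```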
Single Resolution Mode
The incremental representation of a single resolution 3D model is based on the Topological Surgery scheme. For
manifold triangular 3D meshes, the Topological Surgery representation decomposes the connectivity of each
connected component into a simple polygon and a vertex graph. All the triangular faces of the 3D mesh are
connected in the simple polygon forming a triangle tree, which is a spanning tree in the dual graph of the 3D mesh.
Figure 0-1 shows an example of a triangular 3D mesh, its dual graph, and a triangle tree. The vertex graph
identifies which pairs of boundary edges of the simple polygon are associated with each other to reconstruct the
connectivity of the 3D mesh. The triangle tree does not fully describe the triangulation of the simple polygon. The
missing information is recorded as a marching edge.

Figure 0-1 — A triangular 3D mesh (A), its dual graph (B), and a triangle tree (C)
For manifold 3D meshes, the connectivity is represented in a similar fashion. The polygonal faces of the 3D mesh
are connected in a simple polygon forming a face tree. The faces are triangulated, and which edges of the resulting
triangular 3D mesh are edges of the original 3D mesh is recorded as a sequence of polygon_edge bits. The face
tree is also a spanning tree in the dual graph of the 3D mesh, and the vertex graph is always composed of edges of
the original 3D mesh.
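The spanning-tree step described above can be sketched as follows: build the dual graph (one node per face, an edge between faces that share a mesh edge) and walk it breadth-first to obtain a face tree. This only illustrates the structure; the standard's traversal order and encoding rules are more specific, and the function name is invented.

```python
# Sketch of building a face tree: a spanning tree of the dual graph,
# as used by the Topological Surgery representation.
from collections import deque

def face_tree(faces):
    """faces: lists of vertex indices. Returns (parent, child) face pairs."""
    edge_to_faces = {}
    for fi, f in enumerate(faces):
        for a, b in zip(f, f[1:] + f[:1]):          # walk this face's edges
            edge_to_faces.setdefault(frozenset((a, b)), []).append(fi)
    adj = {fi: set() for fi in range(len(faces))}   # dual graph adjacency
    for shared in edge_to_faces.values():
        for i in shared:
            for j in shared:
                if i != j:
                    adj[i].add(j)
    seen, tree, q = {0}, [], deque([0])             # BFS from face 0
    while q:
        u = q.popleft()
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                tree.append((u, v))
                q.append(v)
    return tree

# Two triangles sharing edge (1, 2) give a one-edge face tree.
print(face_tree([[0, 1, 2], [1, 3, 2]]))  # [(0, 1)]
```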
The vertex coordinates and optional properties of the 3D mesh (normals, colours, and texture coordinates) are
quantised, predicted as a function of decoded ancestors with respect to the order of traversal, and the errors are
entropy encoded.
Incremental Representation
When a 3D mesh is downloaded over networks with limited bandwidth (e.g. PSTN), it may be desired to begin
decoding and rendering the 3D mesh before it has all been received. Moreover, content providers may wish to
control such incremental representation to present the most important data first. The basic 3DMC method supports
this by interleaving the data such that each triangle may be reconstructed as it is received. Incremental
representation is also facilitated by the options of hierarchical transmission for quality scalability and partitioning for
error resilience.
Hierarchical Mode
An example of a 3D mesh represented in hierarchical mode is illustrated in Figure 0-2. The hierarchical mode
allows the decoder to show progressively better approximations of the model as data are received. The hierarchical
3D mesh decomposition can also be organized in the decoder as layered detail, and view-dependent expansion of
this detail can be subsequently accomplished during a viewer's interactio
...

This CD-ROM contains the publication ISO/IEC 14496-2:2004 in portable document format (PDF), which can
be viewed using Adobe® Acrobat® Reader.

Adobe and Acrobat are trademarks of Adobe Systems Incorporated.
