ISO/IEC 14496-2:2004
Information technology — Coding of audio-visual objects — Part 2: Visual
ISO/IEC 14496-2:2004 provides the following elements related to the encoded representation of visual information:
— specification of video coding tools, object types and profiles, including the capability to encode rectangular-based and arbitrary-shaped video objects, the capability to define scalable bitstreams, and error-resilient encoding tools;
— specification of coding tools, object types and profiles for mapping of still textures into visual scenes;
— specification of coding tools, object types and profiles for human face and body animation based on face/body models and additional semantic parameters; and
— specification of coding tools, object types and profiles for animation of 2D warping grids with uniform and irregular topology.
The Visual specification contains definitions of the bitstream syntax, bitstream semantics and the related decoding process. It does not specify the encoders, which can be optimized in different implementations.
INTERNATIONAL STANDARD ISO/IEC 14496-2
Third edition
2004-06-01
Information technology — Coding of
audio-visual objects — Part 2: Visual
Technologies de l'information — Codage des objets audiovisuels —
Partie 2: Codage visuel
© ISO/IEC 2004
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but
shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In
downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat
accepts no liability in this area.
Adobe is a trademark of Adobe Systems Incorporated.
Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation
parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In
the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
© ISO/IEC 2004
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
1 Scope ........................................................... 1
2 Normative references ............................................ 1
3 Terms and definitions ........................................... 1
4 Abbreviations and symbols ....................................... 13
4.1 Arithmetic operators .......................................... 13
4.2 Logical operators ............................................. 14
4.3 Relational operators .......................................... 14
4.4 Bitwise operators ............................................. 15
4.5 Conditional operators ......................................... 15
4.6 Assignment .................................................... 15
4.7 Mnemonics ..................................................... 15
4.8 Constants ..................................................... 15
5 Conventions ..................................................... 16
5.1 Method of describing bitstream syntax ......................... 16
5.2 Definition of functions ....................................... 17
5.3 Reserved, forbidden and marker_bit ............................ 18
5.4 Arithmetic precision .......................................... 19
6 Visual bitstream syntax and semantics ........................... 19
6.1 Structure of coded visual data ................................ 19
6.2 Visual bitstream syntax ....................................... 38
6.3 Visual bitstream semantics .................................... 135
7 The visual decoding process ..................................... 236
7.1 Video decoding process ........................................ 237
7.2 Higher syntactic structures ................................... 238
7.3 VOP reconstruction ............................................ 238
7.4 Texture decoding .............................................. 239
7.5 Shape decoding ................................................ 250
7.6 Motion compensation decoding .................................. 274
7.7 Interlaced video decoding ..................................... 297
7.8 Sprite decoding ............................................... 306
7.9 Generalized scalable decoding ................................. 313
7.10 Still texture object decoding ................................ 323
7.11 Mesh object decoding ......................................... 347
7.12 FBA object decoding .......................................... 352
7.13 3D Mesh Object Decoding ...................................... 358
7.14 NEWPRED mode decoding ........................................ 384
7.15 Output of the decoding process ............................... 385
7.16 Video object decoding for the studio profile ................. 385
7.17 The FGS decoding process ..................................... 427
8 Visual-Systems Composition Issues ............................... 429
8.1 Temporal Scalability Composition .............................. 429
8.2 Sprite Composition ............................................ 430
8.3 Mesh Object Composition ....................................... 431
8.4 Spatial Scalability composition ............................... 432
9 Profiles and Levels ............................................. 432
9.1 Visual Object Types ........................................... 432
9.2 Visual Profiles ............................................... 436
9.3 Visual Profiles@Levels ........................................ 437
Annex A (normative) Coding transforms ............................. 441
Annex B (normative) Variable length codes and arithmetic decoding  451
Annex C (normative) Face and body object decoding tables and definitions  547
Annex D (normative) Video buffering verifier ...................... 580
Annex E (informative) Features supported by the algorithm ......... 589
Annex F (informative) Preprocessing and postprocessing ............ 599
Annex G (normative) Profile and level indication and restrictions  625
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission)
form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC
participate in the development of International Standards through technical committees established by the
respective organization to deal with particular fields of technical activity. ISO and IEC technical committees
collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in
liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have
established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International Standards
adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International
Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights.
ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 14496-2 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This third edition cancels and replaces the second edition (ISO/IEC 14496-2:2001), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-2:2001/Amd. 1:2002,
ISO/IEC 14496-2:2001/Amd. 2:2002 and ISO/IEC 14496-2:2001/Amd. 3:2003.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of audio-
visual objects:
— Part 1: Systems
— Part 2: Visual
— Part 3: Audio
— Part 4: Conformance testing
— Part 5: Reference software
— Part 6: Delivery Multimedia Integration Framework (DMIF)
— Part 7: Optimized reference software for coding of audio-visual objects
— Part 8: Carriage of ISO/IEC 14496 content over IP networks
— Part 9: Reference hardware description
— Part 10: Advanced video coding
— Part 11: Scene description and application engine
— Part 12: ISO base media file format
— Part 13: Intellectual Property Management and Protection (IPMP) extensions
— Part 14: MP4 file format
— Part 15: Advanced Video Coding (AVC) file format
— Part 16: Animation framework extension (AFX)
— Part 17: Streaming text format
— Part 18: Font compression and streaming
— Part 19: Synthesized texture stream
Introduction
Purpose
This part of ISO/IEC 14496 was developed in response to the growing need for a coding method that can facilitate access to visual objects in natural and synthetic moving pictures and associated natural or synthetic sound for various applications such as digital storage media, the internet and various forms of wired or wireless communication.
The use of ISO/IEC 14496 means that motion video can be manipulated as a form of computer data and can be
stored on various storage media, transmitted and received over existing and future networks and distributed on
existing and future broadcast channels.
Application
The applications of ISO/IEC 14496 include, but are not limited to, the areas listed below:
IMM Internet Multimedia
IVG Interactive Video Games
IPC Interpersonal Communications (videoconferencing, videophone, etc.)
ISM Interactive Storage Media (optical disks, etc.)
MMM Multimedia Mailing
NDB Networked Database Services (via ATM, etc.)
RES Remote Emergency Systems
RVS Remote Video Surveillance
WMM Wireless Multimedia
Profiles and levels
ISO/IEC 14496 is intended to be generic in the sense that it serves a wide range of applications, bitrates,
resolutions, qualities and services. Furthermore, it allows a number of modes of coding of both natural and
synthetic video in a manner facilitating access to individual objects in images or video, referred to as content-based access. Applications cover, among other things, digital storage media, content-based image and video databases, internet video, interpersonal video communications, wireless video, etc. In the course of creating ISO/IEC 14496, various requirements from typical applications have been considered, the necessary algorithmic elements have been developed, and they have been integrated into a single syntax. Hence ISO/IEC 14496 will facilitate bitstream interchange among different applications.
This part of ISO/IEC 14496 includes one or more complete decoding algorithms as well as a set of decoding tools.
Moreover, the various tools of this part of ISO/IEC 14496, as well as those derived from ISO/IEC 13818-2:2000, can
be combined to form other decoding algorithms. Considering the practicality of implementing the full syntax of this
part of ISO/IEC 14496, however, a limited number of subsets of the syntax are also stipulated by means of “profile”
and “level”.
A “profile” is a defined subset of the entire bitstream syntax that is defined by this part of ISO/IEC 14496. Within the
bounds imposed by the syntax of a given profile it is still possible to require a very large variation in the
performance of encoders and decoders depending upon the values taken by parameters in the bitstream.
In order to deal with this problem “levels” are defined within each profile. A level is a defined set of constraints
imposed on parameters in the bitstream. These constraints may be simple limits on numbers. Alternatively they
may take the form of constraints on arithmetic combinations of the parameters.
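The relationship can be illustrated with a small sketch in Python. The limit values below are hypothetical placeholders chosen only for the example, not the normative limits of Annex G.

from dataclasses import dataclass

@dataclass
class Level:
    name: str
    max_mb_per_sec: int   # hypothetical limit on decoded macroblocks per second
    max_bitrate: int      # hypothetical limit in bits per second
    max_vbv_bits: int     # hypothetical buffer-size limit in bits

def conforms(level: Level, mb_per_sec: int, bitrate: int, vbv_bits: int) -> bool:
    # A level is a set of constraints on bitstream parameters: some are
    # simple numeric limits, others constrain combinations of parameters.
    return (mb_per_sec <= level.max_mb_per_sec
            and bitrate <= level.max_bitrate
            and vbv_bits <= level.max_vbv_bits)

# Placeholder numbers, for illustration only:
toy_level = Level("toy-L1", max_mb_per_sec=1485, max_bitrate=64_000, max_vbv_bits=163_840)
print(conforms(toy_level, mb_per_sec=99 * 15, bitrate=48_000, vbv_bits=100_000))  # True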
Object based coding syntax
Video object
A video object in a scene is an entity that a user is allowed to access (seek, browse) and manipulate (cut and
paste). The instances of video objects at a given time are called video object planes (VOPs). The encoding process
generates a coded representation of a VOP as well as composition information necessary for display. Further, at
the decoder, a user may interact with and modify the composition process as needed.
The full syntax allows coding of rectangular as well as arbitrarily shaped video objects in a scene. Furthermore, the
syntax supports both nonscalable coding and scalable coding. Thus it becomes possible to handle normal
scalabilities as well as object based scalabilities. The scalability syntax enables the reconstruction of useful video
from pieces of a total bitstream. This is achieved by structuring the total bitstream in two or more layers, starting
from a standalone base layer and adding a number of enhancement layers. The base layer can be coded using a
non-scalable syntax, or in the case of picture based coding, even using a syntax of a different video coding
standard.
To ensure the ability to access individual objects, it is necessary to achieve a coded representation of their shape. A natural video object consists of a sequence of 2D representations (at different points in time) referred to here as VOPs. For efficient coding of VOPs, both temporal and spatial redundancies are exploited. Thus a coded representation of a VOP includes representation of its shape, its motion and its texture.
FBA object
The FBA object is a collection of nodes in a scene graph which are animated by the FBA (Face and Body
Animation) object bitstream. The FBA object is controlled by two separate bitstreams. The first bitstream, called
BIFS, contains instances of Body Definition Parameters (BDPs) in addition to Facial Definition Parameters (FDPs),
and the second bitstream, FBA bitstream, contains Body Animation Parameters (BAPs) together with Facial
Animation Parameters (FAPs).
A 3D (or 2D) face object is a representation of the human face that is structured for portraying the visual
manifestations of speech and facial expressions adequate to achieve visual speech intelligibility and the recognition
of the mood of the speaker. A face object is animated by a stream of face animation parameters (FAP) encoded for
low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications.
The FAPs manipulate key feature control points in a mesh model of the face to produce animated visemes for the
mouth (lips, tongue, teeth), as well as animation of the head and facial features like the eyes. FAPs are quantised
with careful consideration for the limited movements of facial features, and then prediction errors are calculated and
coded arithmetically. The remote manipulation of a face model in a terminal with FAPs can accomplish lifelike
visual scenes of the speaker in real-time without sending pictorial or video details of face imagery every frame.
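The quantise-predict-code chain described above can be sketched minimally as follows. This is illustrative only: the real codec uses per-parameter step sizes and arithmetic coding of the residuals, and the step size and initial predictor here are assumptions.

def encode_fap_track(values, step):
    prev_q = 0                       # assumed initial predictor
    residuals = []
    for v in values:
        q = round(v / step)          # quantisation tuned to limited feature motion
        residuals.append(q - prev_q) # prediction error, entropy coded in the real codec
        prev_q = q
    return residuals

def decode_fap_track(residuals, step):
    prev_q, out = 0, []
    for r in residuals:
        prev_q += r
        out.append(prev_q * step)    # reconstructed parameter value
    return out

track = [0.0, 0.4, 0.9, 1.1, 1.0]
res = encode_fap_track(track, step=0.1)
print(res, decode_fap_track(res, step=0.1))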
A simple streaming connection can be made to a decoding terminal that animates a default face model. A more
complex session can initialize a custom face in a more capable terminal by downloading face definition parameters
(FDP) from the encoder. Thus specific background images, facial textures, and head geometry can be portrayed.
The composition of specific backgrounds, face 2D/3D meshes, texture attribution of the mesh, etc. is described in
ISO/IEC 14496-1:2001. The FAP stream for a given user can be generated at the user’s terminal from video/audio, or from text-to-speech. FAPs can be encoded at bitrates up to 2-3 kbit/s at the necessary speech rates. Optional
temporal DCT coding provides further compression efficiency in exchange for delay. Using the facilities of
ISO/IEC 14496-1:2001, a composition of the animated face model and synchronized, coded speech audio (low-
bitrate speech coder or text-to-speech) can provide an integrated low-bandwidth audio/visual speaker for broadcast
applications or interactive conversation.
Limited scalability is supported. Face animation achieves its efficiency by employing very concise motion animation
controls in the channel, while relying on a suitably equipped terminal for rendering of moving 2D/3D faces with non-
normative models held in local memory. Models stored and updated for rendering in the terminal can be simple or
complex. To support speech intelligibility, the normative specification of FAPs provides for their selective or complete use as signaled by the encoder. A masking scheme provides for selective transmission of FAPs according to which parts of the face are naturally active from moment to moment. A further control in the FAP stream allows face
animation to be suspended while leaving face features in the terminal in a defined quiescent state for higher overall
efficiency during multi-point connections.
A body model is a representation of a virtual human or human-like character that allows portraying body
movements adequate to achieve nonverbal communication and general actions. A body model is animated by a
stream of body animation parameters (BAP) encoded for low-bitrate transmission in broadcast and dedicated
interactive communications. The BAPs manipulate independent degrees of freedom in the skeleton model of the
body to produce animation of the body parts. The BAPs are quantised considering the joint limitations, and
prediction errors are calculated and coded arithmetically. Similar to the face, the remote manipulation of a body
model in a terminal with BAPs can accomplish lifelike visual scenes of the body in real-time without sending
pictorial and video details of the body every frame.
The BAPs, if correctly interpreted, will produce reasonably similar high-level results in terms of body posture and animation on different body models, without the need to initialize or calibrate the model. The BDP set defines the parameters that transform the default body into a customized body, optionally with its body surface, body dimensions, and texture.
The body definition parameters (BDP) allow the encoder to replace the local model of a more capable terminal.
BDP parameters include body geometry, calibration of body parts, degrees of freedom, and optionally deformation
information.
The FBA Animation specification is defined in ISO/IEC 14496-1:2001 and this part of ISO/IEC 14496. This clause is intended to facilitate finding the various parts of the specification. As a rule of thumb, the FAP and BAP specification is found in Part 2, and the FDP and BDP specification in Part 1. However, this is not a strict rule. For an overview of
FAPs/BAPs and their interpretation, read subclauses “6.1.5.2 Facial animation parameter set”, “6.1.5.3 Facial
animation parameter units”, “6.1.5.4 Description of a neutral face” as well as the Table C.1. The viseme parameter
is documented in subclause “7.12.3 Decoding of the viseme parameter fap 1” and the Table C.5 in Annex C. The
expression parameter is documented in subclause “7.12.4 Decoding of the expression parameter fap 2” and the
Table C.3. FBA bitstream syntax is found in subclauses “6.2.10 FBA Object”, semantics in “6.3.10 FBA Object”,
and subclause “7.12 FBA object decoding” explains in more detail the FAP/BAP decoding process. FAP/BAP
masking and interpolation is explained in subclauses “6.3.11.1 FBA Object Plane”, “7.12.1.1 Decoding of FBA”,
“7.12.5 FBA masking” . The FIT interpolation scheme is documented in subclause “7.2.5.3.2.4 FIT” of
ISO/IEC 14496-1:2001. The FDPs and BDPs and their interpretation are documented in subclause “7.2.5.3.2.6
FDP” of ISO/IEC 14496-1:2001. In particular, the FDP feature points are documented in Figure C-1. Details on
body models are documented in Annex C.
Mesh object
A 2D mesh object is a representation of a 2D deformable geometric shape, with which synthetic video objects may
be created during a composition process at the decoder, by spatially piece-wise warping of existing video object
planes or still texture objects. The instances of mesh objects at a given time are called mesh object planes (mops).
The geometry of mesh object planes is coded losslessly. Temporally and spatially predictive techniques and
variable length coding are used to compress 2D mesh geometry. The coded representation of a 2D mesh object
includes representation of its geometry and motion.
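The piece-wise warping idea can be sketched as follows: each triangle of the reference mesh maps affinely onto the corresponding deformed triangle, which is convenient to express through barycentric coordinates. This is an illustrative sketch, not the normative composition procedure.

def barycentric(p, a, b, c):
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    d = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    u = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / d
    v = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / d
    return u, v, 1.0 - u - v

def warp_point(p, src_tri, dst_tri):
    # Weights computed in the reference triangle reposition the point
    # inside the deformed triangle: a piece-wise affine warp.
    u, v, w = barycentric(p, *src_tri)
    return tuple(u * d0 + v * d1 + w * d2
                 for d0, d1, d2 in zip(*dst_tri))

src = [(0, 0), (1, 0), (0, 1)]
dst = [(0, 0), (2, 0), (0, 1)]   # triangle stretched horizontally
print(warp_point((0.25, 0.25), src, dst))   # -> (0.5, 0.25)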
3D Mesh Object
The 3D Mesh Object is a 3D polygonal model that can be represented as an IndexedFaceSet or Hierarchical 3D
Mesh node in BIFS. It is defined by the position of its vertices (geometry), by the association between each face
and its sustaining vertices (connectivity), and optionally by colours, normals, and texture coordinates (properties).
Properties do not affect the 3D geometry, but influence the way the model is shaded. 3D mesh coding (3DMC) addresses the efficient coding of 3D mesh objects. It comprises a basic method and several options. The basic 3DMC method operates on manifold models and features incremental representation of a single-resolution 3D model. The model may be triangular or polygonal; polygonal models are triangulated for coding purposes and are fully recovered in the decoder. Options include: (a) support for computational graceful degradation control; (b) support for non-manifold models; (c) support for error resilience; and (d) quality scalability via hierarchical transmission of levels of detail, with implicit support for smooth transition between consecutive levels. The compression of application-specific geometry streams (Face Animation Parameters) and generalized animation parameters (BIFS Anim) is addressed elsewhere in this part of ISO/IEC 14496.
In 3DMC, the compression of the connectivity of the 3D mesh (e.g. how edges, faces, and vertices relate) is
lossless, whereas the compression of the other attributes (such as vertex coordinates, normals, colours, and
texture coordinates) may be lossy.
Single Resolution Mode
The incremental representation of a single resolution 3D model is based on the Topological Surgery scheme. For
manifold triangular 3D meshes, the Topological Surgery representation decomposes the connectivity of each
connected component into a simple polygon and a vertex graph. All the triangular faces of the 3D mesh are
connected in the simple polygon forming a triangle tree, which is a spanning tree in the dual graph of the 3D mesh.
Figure 0-1 shows an example of a triangular 3D mesh, its dual graph, and a triangle tree. The vertex graph
identifies which pairs of boundary edges of the simple polygon are associated with each other to reconstruct the
connectivity of the 3D mesh. The triangle tree does not fully describe the triangulation of the simple polygon. The
missing information is recorded as a marching edge.
Figure 0-1 — A triangular 3D mesh (A), its dual graph (B), and a triangle tree (C)
For manifold polygonal 3D meshes, the connectivity is represented in a similar fashion. The polygonal faces of the 3D mesh
are connected in a simple polygon forming a face tree. The faces are triangulated, and which edges of the resulting
triangular 3D mesh are edges of the original 3D mesh is recorded as a sequence of polygon_edge bits. The face
tree is also a spanning tree in the dual graph of the 3D mesh, and the vertex graph is always composed of edges of
the original 3D mesh.
The vertex coordinates and optional properties of the 3D mesh (normals, colours, and texture coordinates) are
quantised, predicted as a function of decoded ancestors with respect to the order of traversal, and the errors are
entropy encoded.
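A minimal sketch of that chain, assuming uniform quantisation inside a known coordinate range and a simple previous-value predictor (the standard allows richer prediction from several decoded ancestors):

def quantise(coords, lo, hi, bits):
    # Uniform quantisation of coordinates to the given bit depth; this is
    # where geometry coding becomes lossy, while connectivity stays lossless.
    scale = (2 ** bits - 1) / (hi - lo)
    return [round((c - lo) * scale) for c in coords]

def predict_residuals(q):
    # Each value predicted from the previously decoded one; the residuals
    # are what the real coder entropy-encodes.
    res, prev = [], 0
    for v in q:
        res.append(v - prev)
        prev = v
    return res

xs = [0.10, 0.12, 0.15, 0.90]
q = quantise(xs, lo=0.0, hi=1.0, bits=10)
print(q, predict_residuals(q))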
Incremental Representation
When a 3D mesh is downloaded over networks with limited bandwidth (e.g. PSTN), it may be desired to begin
decoding and rendering the 3D mesh before it has all been received. Moreover, content providers may wish to
control such incremental representation to present the most important data first. The basic 3DMC method supports
this by interleaving the data such that each triangle may be reconstructed as it is received. Incremental
representation is also facilitated by the options of hierarchical transmission for quality scalability and partitioning for
error resilience.
Hierarchical Mode
An example of a 3D mesh represented in hierarchical mode is illustrated in Figure 0-2. The hierarchical mode
allows the decoder to show progressively better approximations of the model as data are received. The hierarchical
3D mesh decomposition can also be organized in the decoder as layered detail, and view-dependent expansion of
this detail can be subsequently accomplished during a viewer's interaction with the 3D model. Downloadable scalability of 3D models allows decoders with widely varying rendering performance to use the content, without necessarily making repeated requests to the encoder for specific versions of the content and without the latencies of updating a view-dependent model from the encoder. This quality scalability of 3D meshes is complementary to scalable still texture and supports bitstream scalability and downloading of quality hierarchies of hybrid imagery to the decoder.
Figure 0-2 — A hierarchy of levels of detail
The hierarchical representation of a 3D model is based on the Progressive Forest Split scheme. In the Progressive
Forest Split representation, a manifold 3D mesh is represented as a base 3D mesh followed by a sequence of
forest split operations. The base 3D mesh is encoded as a single resolution 3D mesh using the Topological
Surgery scheme. Each forest split operation is composed of a forest in the graph of the current 3D mesh, and a
sequence of simple polygons. Figure 0-3 illustrates the method. To prevent visual artifacts that may occur while switching from one level of detail to the next, the data structure supports smooth transition between consecutive levels of detail in the form of linear interpolation of vertex positions.
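A minimal sketch of such an interpolated transition, assuming corresponding vertices in the two levels of detail are already paired:

def lerp_vertices(coarse, fine, t):
    # t = 0.0 shows the coarse level, t = 1.0 the refined level; intermediate
    # values give the smooth transition that avoids popping artifacts.
    return [tuple((1.0 - t) * c + t * f for c, f in zip(cv, fv))
            for cv, fv in zip(coarse, fine)]

coarse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
fine   = [(0.0, 0.2, 0.0), (1.0, 0.1, 0.3)]
print(lerp_vertices(coarse, fine, t=0.5))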
Figure 0-3 — The Forest Split operation. A 3D mesh with a forest of edges marked (A). The tree loops
resulting from cutting the 3D mesh through the forest edges (B). Each tree loop is filled by a simple
polygon composed of polygonal faces (C).
Error Resilience for 3D mesh object
If the 3D mesh is partitioned into independent parts, it may be possible to perform more efficient data transmission
in an error-prone environment, e.g., an IP network or datacasting service in a broadcast TV network. It must be
possible to resynchronize after a channel error, and continue data transmission and rendering from that point
instead of starting over from scratch. Even with the presence of channel errors, the decoder can start decoding and
rendering from the next partition that is received intact from the channel.
Flexible partitioning methods can be used to organize the data, such that it fits the underlying network packet
structure more closely, and overhead is reduced to the minimum. To allow flexible partitioning, several connected components may be merged into one partition, whereas a large connected component may be divided into several independent partitions. Merging and dividing of connected components using different partition types can be done
at any point in the 3D mesh object.
Stitching for Non-Manifold and Non-Orientable Meshes
The connectivity of a non-manifold and non-orientable 3D mesh is represented as a manifold 3D mesh and a
sequence of stitches. Each stitch describes how to identify one or more pairs of vertices along two linear paths of
edges, and each one of these paths is contained in one of the vertex graphs that span the connected components
of the manifold 3D mesh.
Encoder and Decoder Block Diagrams
High level block diagrams of a general 3D polygonal model encoder and decoder are shown in Figure 0-4. They
consist of a 3D mesh connectivity (de)coder, geometry (de)coder, property (de)coder, and entropy (de)coding
blocks. Connectivity, vertex position, and property information are extracted from the 3D mesh model described in VRML or MPEG-4 BIFS format. The connectivity (de)coder is used for an efficient representation of the association
between each face and its sustaining vertices. The geometry (de)coder is used for a lossy or lossless compression
of vertex coordinates. The property (de)coder is used for a lossy or lossless compression of colour, normal, and
texture coordinate data.
[Figure 0-4 — General block diagram of the 3D mesh compression scheme: (a) the 3D mesh encoder passes geometry, connectivity and property data from the 3D mesh model through the geometry, connectivity and property coders and entropy coding, and multiplexes the result into the compressed MPEG system bitstream; (b) the 3D mesh decoder demultiplexes the bitstream, applies entropy decoding and the geometry, connectivity and property decoders, and hands the reconstructed 3D mesh model to composition and rendering.]
Overview of the object based nonscalable syntax
The coded representation defined in the non-scalable syntax achieves a high compression ratio while preserving
good image quality. Further, when access to individual objects is desired, the shape of objects also needs to be
coded, and depending on the bandwidth available, the shape information can be coded in a lossy or lossless
fashion.
The compression algorithm employed for texture data is not lossless as the exact sample values are not preserved
during coding. Obtaining good image quality at the bitrates of interest demands very high compression, which is not
achievable with intra coding alone. The need for random access, however, is best satisfied with pure intra coding. The choice of techniques is based on the need to balance high image quality and compression ratio with the requirement for random access to the coded bitstream.
A number of techniques are used to achieve high compression. The algorithm first uses block-based motion
compensation to reduce the temporal redundancy. Motion compensation is used both for causal prediction of the
current VOP from a previous VOP, and for non-causal, interpolative prediction from past and future VOPs. Motion
vectors are defined for each 16-sample by 16-line region of a VOP or each 8-sample by 8-line region of a VOP as required. The prediction error is further compressed using the discrete cosine transform (DCT) to remove spatial correlation before it is quantised in an irreversible process that discards the less important information. Finally, the shape information, motion vectors and quantised DCT information are encoded using variable length codes.
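As an illustration of the motion-compensation step, the following sketch performs a full search minimising the sum of absolute differences (SAD) for a 16×16 block. The search strategy is an encoder choice and is not specified by this part of ISO/IEC 14496; this is one simple possibility.

def sad(cur, ref, bx, by, dx, dy, n=16):
    # Sum of absolute differences between the current block and a candidate
    # block displaced by (dx, dy) in the reference VOP.
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))

def motion_search(cur, ref, bx, by, radius=4, n=16):
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, n)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if 0 <= bx + dx and bx + dx + n <= w and 0 <= by + dy and by + dy + n <= h:
                c = sad(cur, ref, bx, by, dx, dy, n)
                if c < best_cost:
                    best_cost, best = c, (dx, dy)
    return best, best_cost   # motion vector and residual energy

ref = [[(3 * x + 7 * y) % 256 for x in range(32)] for y in range(32)]
cur = [[ref[y][(x - 2) % 32] for x in range(32)] for y in range(32)]  # shifted right by 2
print(motion_search(cur, ref, bx=8, by=8))   # -> ((-2, 0), 0)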
To preserve lossless quality, or to restrict the maximum bit count of block data, block-based DPCM coding can be used in the Studio Profiles.
Temporal processing
Because of the conflicting requirements of random access and highly efficient compression, three main VOP types are defined. Intra coded VOPs (I-VOPs) are coded without reference to other pictures. They provide access points to the coded sequence where decoding can begin, but are coded with only moderate compression. Predictive coded VOPs (P-VOPs) are coded more efficiently using motion compensated prediction from a past intra or predictive coded VOP and are generally used as a reference for further prediction. Bidirectionally-predictive
coded VOPs (B-VOPs) provide the highest degree of compression but require both past and future reference VOPs
for motion compensation. Bidirectionally-predictive coded VOPs are never used as references for prediction (except in the case that the resulting VOP is used as a reference for a scalable enhancement layer). The
organisation of the three VOP types in a sequence is very flexible. The choice is left to the encoder and will depend
on the requirements of the application.
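The reordering implied by B-VOPs can be sketched as follows; the particular I/B/B/P pattern is only an example of an encoder's choice.

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

def coded_order(display):
    # B-VOPs need both their past and future references decoded first, so
    # they are held back until the next reference VOP has been emitted.
    out, pending_b = [], []
    for vop in display:
        if vop.startswith("B"):
            pending_b.append(vop)
        else:
            out.append(vop)          # I- or P-VOP (a reference) goes first
            out.extend(pending_b)
            pending_b = []
    return out + pending_b

print(coded_order(display))   # ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']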
Coding of Shapes
In natural video scenes, VOPs are generated by segmentation of the scene according to some semantic meaning.
For such scenes, the shape information is thus binary (binary shape). Shape information is also referred to as
alpha plane. The binary alpha plane is coded on a macroblock basis by a coder which uses the context information,
motion compensation and arithmetic coding. For Studio Profiles particularly, HHC binary alpha block coding is
used.
For coding the shape of a VOP, a bounding rectangle is first created and is extended to multiples of 16×16 blocks
with extended alpha samples set to zero. Shape coding is then initiated on a 16×16 block basis; these blocks are
also referred to as binary alpha blocks.
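The size rounding can be sketched as below; the standard's anchoring of the rectangle is more detailed than this illustration, and the extended alpha samples are zero (transparent).

def extended_bounds(x0, y0, x1, y1):
    # Grow the tight bounding box so that width and height are multiples
    # of 16; 16x16 binary alpha blocks then tile the extended rectangle.
    w, h = x1 - x0, y1 - y0
    w16 = (w + 15) // 16 * 16
    h16 = (h + 15) // 16 * 16
    return x0, y0, x0 + w16, y0 + h16

print(extended_bounds(3, 5, 40, 30))   # -> (3, 5, 51, 37): a 48x32 region, six blocks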
Coding interlaced video
Each frame of interlaced video consists of two fields which are separated by one field-period. Studio Profiles allow
either the frame to be encoded as a VOP or the two fields to be encoded as two VOPs. Frame encoding or field
encoding can be adaptively selected on a frame-by-frame basis. Frame encoding is typically preferred when the
video scene contains significant detail with limited motion. Field encoding, in which the second field can be
predicted from the first, works better when there is fast movement.
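The frame/field relationship can be sketched as follows, with the top field on the even lines and the bottom field on the odd lines:

def split_fields(frame):
    top = frame[0::2]      # even lines, one field-period before the bottom field
    bottom = frame[1::2]   # odd lines
    return top, bottom

frame = [[y * 10 + x for x in range(4)] for y in range(4)]
top, bottom = split_fields(frame)
print(top)     # lines 0 and 2
print(bottom)  # lines 1 and 3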
Motion representation - macroblocks
The choice of 16×16 blocks (referred to as macroblocks) for the motion-compensation unit is a result of the trade-
off between the coding gain provided by using motion information and the overhead needed to represent it. Each
macroblock can further be subdivided into 8×8 blocks for motion estimation and compensation depending on the overhead that can be afforded. In order to encode highly active scenes at a higher VOP rate, a Reduced Resolution VOP tool is provided. When this tool is used, the size of the macroblock used for motion compensation decoding is 32×32 pixels and the size of a block is 16×16 pixels.
For Studio Profiles particularly, in frame encoding, the prediction from the previous reference frame can itself be
either frame-based or field-based.
Depending on the type of the macroblock, motion vector information and other side information is encoded with the
compressed prediction error in each macroblock. The motion vectors are differenced with respect to a prediction
value and coded using variable length codes. The maximum length of the motion vectors allowed is decided at the
encoder. It is the responsibility of the encoder to calculate appropriate motion vectors. The specification does not
specify how this should be done.
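A sketch of the differencing step, using a median-of-three-neighbours predictor in the spirit of the standard's motion-vector prediction (the candidate selection rules are simplified here):

def median3(a, b, c):
    return sorted((a, b, c))[1]

def mv_difference(mv, left, above, above_right):
    # Each component is predicted by the median of three neighbouring
    # candidates; only the difference receives a variable length code.
    px = median3(left[0], above[0], above_right[0])
    py = median3(left[1], above[1], above_right[1])
    return mv[0] - px, mv[1] - py

print(mv_difference((3, -1), left=(2, 0), above=(4, -2), above_right=(1, 0)))  # (1, -1)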
Spatial redundancy reduction
Both source VOPs and prediction error VOPs have significant spatial redundancy. This part of ISO/IEC 14496
uses a block-based DCT method with optional visually weighted quantisation, and run-length coding. After motion
compensated prediction or interpolation, the resulting prediction error is split into 8×8 blocks. These are
transformed into the DCT domain where they can be weighted before being quantised. After quantisation many of
the DCT coefficients are zero in value and so two-dimensional run-length and variable length coding is used to
encode the remaining DCT coefficients efficiently.
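The pipeline can be sketched end to end: 8×8 DCT, uniform quantisation (a stand-in for the optional visually weighted matrices), zigzag scan and (run, level) pair formation. Illustrative only; the normative transforms are specified in Annex A.

import math

N = 8

def dct2(block):
    # Orthonormal 8x8 2D DCT-II, written out directly for clarity.
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0
    return [[0.25 * c(u) * c(v) * sum(
                block[y][x]
                * math.cos((2 * x + 1) * u * math.pi / 16)
                * math.cos((2 * y + 1) * v * math.pi / 16)
                for y in range(N) for x in range(N))
             for u in range(N)] for v in range(N)]

def zigzag_order():
    # Diagonal-by-diagonal scan, alternating direction between diagonals.
    return sorted(((x, y) for y in range(N) for x in range(N)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 else p[0]))

def run_length(coeffs):
    pairs, run = [], 0
    for level in coeffs:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))  # (zero run, level) pairs receive VLCs
            run = 0
    return pairs

block = [[128 + (10 if x < 4 else -10) for x in range(N)] for y in range(N)]
f = dct2(block)
q = 16  # uniform quantiser step, illustrative
coeffs = [round(f[y][x] / q) for (x, y) in zigzag_order()]
print(run_length(coeffs))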
Chrominance formats
This part of ISO/IEC 14496 currently supports the 4:2:0 chrominance format. Studio Profiles additionally support the 4:2:2 and 4:4:4 chrominance formats.
RGB color components
Studio Profiles support coding of RGB color components. The resolution of each component shall be identical when
input data is treated as RGB color components.
Pixel depth
This part of ISO/IEC 14496 supports pixel depths between 4 and 12 bits in luminance and chrominance planes.
Studio Profiles support 8, 10 and 12 bits in luminance and chrominance or RGB planes.
Generalized scalability
The scalability tools in this part of ISO/IEC 14496 are designed to support applications beyond those supported by single-layer video. The major applications of scalability include internet video, wireless video, multi-quality video services, video database browsing, etc. In some of these applications, normal scalabilities on a picture basis, such as those in ISO/IEC 13818-2:2000, may be employed; in others, object-based scalabilities may be necessary. Both categories of scalability are enabled by this part of ISO/IEC 14496.
Although a simple solution to scalable video is the simulcast technique that is based on transmission/storage of
multiple independently coded reproductions of video, a more efficient alternative is scalable video coding, in which
the bandwidth allocated to a given reproduction of video can be partially re-utilised in coding of the next
reproduction of video. In scalable video coding, it is assumed that given a coded bitstream, decoders of various
complexities can decode and display appropriate reproductions of coded video. A scalable video encoder is likely
to have increased complexity when compared to a single layer encoder. However, this part of ISO/IEC 14496
provides several different forms of scalabilities that address non-overlapping applications with corresponding
complexities.
The basic scalability tools offered are temporal scalability and spatial scalability. Moreover, combinations of these
basic scalability tools are also supported and are referred to as hybrid scalability. In the case of basic scalability,
two layers of video referred to as the lower layer and the enhancement layer are allowed, whereas in hybrid
scalability up to four layers are supported.
Object based Temporal scalability
Temporal scalability is a tool intended for use in a range of diverse video applications, from video databases, internet video and wireless video to multiview/stereoscopic coding of video. Furthermore, it may also provide a
migration path from current lower temporal resolution video systems to higher temporal resolution systems of the
future.
Temporal scalability involves partitioning of VOPs into layers, where the lower layer is coded by itself to provide the
basic temporal rate and the enhancement layer is coded with temporal prediction with respect to the lower layer.
These layers when decoded and temporally multiplexed yield full temporal resolution. The lower temporal
resolution systems may only decode the lower layer to provide basic temporal resolution whereas enhanced
systems of the future may support both layers. Furthermore, temporal scalability has use in bandwidth-constrained networked applications where adaptation to frequent changes in allowed throughput is necessary. An additional
advantage of temporal scalability is its ability to provide resilience to transmission errors as the more important
data of the lower layer can be sent over a channel with better error performance, whereas the less critical
enhancement layer can be sent over a channel with poor error performance. Object based temporal scalability can
also be employed to allow graceful control of picture quality by controlling the temporal rate of each video object
under the constraint of a given bit-budget.
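Decoded base and enhancement VOPs are then temporally multiplexed, which amounts to a merge by time stamp, as in this sketch:

base = [(0, "I"), (4, "P"), (8, "P")]    # (time, VOP type): basic temporal rate
enhancement = [(2, "B"), (6, "B")]       # predicted with respect to the lower layer

def multiplex(base, enh):
    # Merging the two layers by time stamp restores full temporal resolution.
    return sorted(base + enh, key=lambda v: v[0])

print(multiplex(base, enhancement))
# [(0, 'I'), (2, 'B'), (4, 'P'), (6, 'B'), (8, 'P')]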
Object based Spatial scalability
Spatial scalability is a tool intended for use in video applications involving multi-quality video services, video databases ...