ISO/IEC 15938-13:2015
Information technology — Multimedia content description interface — Part 13: Compact descriptors for visual search
1.1 Organization of the document
ISO/IEC 15938-13:2015 is organized as follows. Clauses 2-3 specify the terms, abbreviations, symbols and conventions used in the document. Clause 4 specifies the binary representation syntax and descriptor component semantics for a CDVS image descriptor. Clause 5 specifies the extraction and encoding process for a CDVS image descriptor. Annexes A-J specify information relevant to the encoding process of Clause 5. Annex K contains an informative description of the decoding process of a CDVS image descriptor.
1.2 Overview of compact descriptors for visual search
This part of the MPEG-7 standard specifies an image description tool designed to enable efficient and interoperable visual search applications, allowing visual content matching in images. Visual content matching includes matching of views of objects, landmarks, and printed documents, while being robust to partial occlusions as well as changes in viewpoint, camera parameters, and lighting conditions.
Technologies de l'information — Interface de description du contenu multimédia — Partie 13: Descripteurs compacts pour recherche visuelle
General Information
- Status: Published
- Publication Date: 24-Aug-2015
- Current Stage: 9020 - International Standard under periodical review
- Start Date: 15-Apr-2026
- Completion Date: 15-Apr-2026
Overview
ISO/IEC 15938-13:2015 - "Information technology - Multimedia content description interface - Part 13: Compact descriptors for visual search" (CDVS) is an international standard within the MPEG‑7 family that specifies a compact, interoperable image description tool for visual search. The standard defines the binary syntax, semantics, extraction and encoding processes for compact descriptors for visual search (CDVS) to enable robust visual content matching across images (objects, landmarks, printed documents) under changes in viewpoint, scale, occlusion and lighting.
Key topics and technical requirements
The standard is organized into clauses and normative and informative annexes, and specifies the following technical building blocks:
- Binary representation syntax and descriptor semantics (Clause 4) - the format and meaning of CDVS image descriptor components.
- Extraction and encoding process (Clause 5) - end‑to‑end procedure for producing an image descriptor from an input image, including required steps and options.
- Interest point detection - scale‑space construction, detection of scale‑space extrema, coordinate refinement to subpixel precision, duplicate elimination and orientation assignment.
- Local feature selection and description - selection probabilities, local region descriptors computed from cell histograms and local feature descriptor construction.
- Descriptor aggregation and compression - methods to aggregate local descriptors into a compact global descriptor, PCA projection, GMM parameters, quantization and compression strategies for both descriptors and location information.
- Encoding order and descriptor length targets - six average image descriptor lengths are specified: 512, 1024, 2048, 4096, 8192 and 16384 bytes.
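- Annexes A–J - Annex A (informative) outlines the CDVS encoder organization; normative Annexes B–J provide implementation parameters such as coordinate refinement coefficients, feature selection probabilities, PCA matrices, GMM parameters, quantization thresholds and coding model probabilities.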
- Informative decoding description (Annex K) - guidance on decoding CDVS image descriptors.
Practical applications
CDVS is designed for applications that require compact, standardised image representations for matching and retrieval, including:
- Mobile visual search and image-based product lookup
- Landmark recognition and location‑based services
- Large‑scale image retrieval and deduplication in cloud/search engines
- Augmented reality (AR) and visual navigation systems
- Document image matching and OCR preprocessing
- Video keyframe indexing and surveillance analytics
Who should use this standard
- Software and hardware implementers building interoperable visual search engines and libraries
- Developers of mobile apps, cloud image search, AR toolkits, and multimedia platforms
- Researchers and system integrators needing a standardized compact image descriptor format
- Standards bodies and product teams requiring conformance to MPEG‑7/CDVS conventions
Related standards
ISO/IEC 15938 (MPEG‑7) Parts 1–12 define broader MPEG‑7 description tools (Systems, Visual, Audio, Profiles, Query Format, etc.). ISO/IEC 15938‑13:2015 complements these by providing the compact visual descriptor specification for interoperable visual search.
Keywords: ISO/IEC 15938-13:2015, CDVS, compact descriptors for visual search, MPEG‑7, image descriptor, visual search, interest point detection, descriptor compression.
Frequently Asked Questions
ISO/IEC 15938-13:2015 is a standard published by the International Organization for Standardization (ISO). Its full title is "Information technology - Multimedia content description interface - Part 13: Compact descriptors for visual search".
ISO/IEC 15938-13:2015 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.
Standards Content (Sample)
INTERNATIONAL STANDARD ISO/IEC 15938-13
First edition
2015-09-01
Information technology — Multimedia
content description interface —
Part 13:
Compact descriptors for visual search
Technologies de l’information — Interface de description du
contenu multimédia —
Partie 13: Descripteurs compacts pour recherche visuelle
Reference number
© ISO/IEC 2015
© ISO/IEC 2015, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
ii © ISO/IEC 2015 – All rights reserved
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Symbols and abbreviated terms
3.1 General
3.2 Abbreviations
3.3 Arithmetic operations
3.4 Logical operators
3.5 Relational operators
3.6 Bitwise operators
3.7 Assignment
3.8 Mnemonics
3.9 Constants
3.10 Functions
4 CDVS syntax
4.1 Binary representation syntax
4.2 Descriptor component semantics
5 CDVS encoding
5.1 General
5.2 Original image preprocessing
5.3 Interest point detection
5.3.1 Introduction
5.3.2 Scale space construction
5.3.3 Detection of scale-space extrema
5.3.4 Coordinate refinement to subpixel precision
5.3.5 Transformation of coordinates and scale to the converted image resolution
5.3.6 Elimination of duplicates
5.3.7 Orientation Assignment
5.3.8 Interest point characteristics
5.4 Local feature selection
5.4.1 Operation
5.4.2 Descriptor components
5.5 Local feature description
5.6 Local feature descriptor aggregation
5.6.1 Operation
5.6.2 Descriptor components
5.7 Local feature descriptor compression
5.7.1 Operation
5.7.2 Descriptor components
5.8 Local feature location compression
5.8.1 Operation
5.8.2 Descriptor components
5.9 Encoding order of compressed local feature descriptors and relevance bits
5.10 Computation of the number of compressed local feature descriptors at different image descriptor lengths
Annex A (informative) CDVS encoder organization
Annex B (normative) Coefficients for coordinate refinement
Annex C (normative) Probability values for the feature selection
Annex D (normative) PCA projection matrix for local feature descriptor aggregation
Annex E (normative) GMM parameters for local feature descriptor aggregation
Annex F (normative) Gaussian function selection parameters for local feature descriptor aggregation
Annex G (normative) Bit selection masks for local feature descriptor aggregation
Annex H (normative) Scalar quantization thresholds for local feature descriptor compression
Annex I (normative) Histogram count arithmetic coding model probabilities
Annex J (normative) Histogram map arithmetic coding model probabilities
Annex K (informative) CDVS decoding
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical
activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the
work. In the field of information technology, ISO and IEC have established a joint technical committee,
ISO/IEC JTC 1.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity
assessment, as well as information about ISO’s adherence to the WTO principles in the Technical
Barriers to Trade (TBT) see the following URL: Foreword - Supplementary information
The committee responsible for this document is ISO/IEC JTC 1, Information technology, SC 29, Coding of
audio, picture, multimedia and hypermedia information.
ISO/IEC 15938 consists of the following parts, under the general title Information technology —
Multimedia content description interface:
— Part 1: Systems
— Part 2: Description definition language
— Part 3: Visual
— Part 4: Audio
— Part 5: Multimedia description schemes
— Part 6: Reference software
— Part 7: Conformance testing
— Part 8: Extraction and use of MPEG-7 descriptions
— Part 9: Profiles and levels
— Part 10: Schema definition
— Part 11: MPEG-7 profile schemas
— Part 12: Query format
— Part 13: Compact descriptors for visual search
Introduction
This International Standard, also known as “Multimedia Content Description Interface,” provides a
standardized set of technologies for describing multimedia content. It addresses a broad spectrum of
multimedia applications and requirements by providing a metadata system for describing the features
of multimedia content.
The following are specified in this International Standard:
— Description schemes (DS) describe entities or relationships pertaining to multimedia content.
Description schemes specify the structure and semantics of their components, which may be
Description Schemes, descriptors, or datatypes.
— Descriptors (D) describe features, attributes, or groups of attributes of multimedia content.
— Datatypes are the basic reusable datatypes employed by description schemes and descriptors.
— Systems tools support delivery of descriptions, multiplexing of descriptions with multimedia
content, synchronization, file format, and so forth.
This International Standard is subdivided into 13 parts:
— Part 1 — Systems: specifies the tools for preparing descriptions for efficient transport and storage,
compressing descriptions, and allowing synchronization between content and descriptions.
— Part 2 — Description definition language: specifies the language for defining the International
Standard set of description tools (DSs, Ds, and datatypes) and for defining new description tools.
— Part 3 — Visual: specifies the description tools pertaining to visual content.
— Part 4 — Audio: specifies the description tools pertaining to audio content.
— Part 5 — Multimedia description schemes: specifies the generic description tools pertaining to
multimedia including audio and visual content.
— Part 6 — Reference software: provides a software implementation of the International Standard.
— Part 7 — Conformance testing: specifies the guidelines and procedures for testing conformance
of implementations of the International Standard.
— Part 8 — Extraction and use of MPEG-7 descriptions: provides guidelines and examples of the
extraction and use of descriptions.
— Part 9 — Profiles and levels: provides guidelines and standard profiles.
— Part 10 — Schema definition: specifies the schema using description definition language.
— Part 11 — Profile Schemas: listing of profile schemas using description definition language.
— Part 12 — Query format: contains the tools of the MPEG Query Format (MPQF).
— Part 13 — Compact descriptors for visual search: specifies an image description tool for visual
search applications.
INTERNATIONAL STANDARD ISO/IEC 15938-13:2015(E)
Information technology — Multimedia content
description interface —
Part 13:
Compact descriptors for visual search
1 Scope
The structure of this part of ISO/IEC 15938 is as follows. Clauses 2 and 3 specify the terms,
abbreviations, symbols, and conventions used in the International Standard. Clause 4 specifies the
binary representation syntax and descriptor component semantics for a CDVS image descriptor.
Clause 5 specifies the extraction and encoding process for a CDVS image descriptor. Annexes A-J specify
information relevant to the encoding process of Clause 5. Annex K contains an informative description
of the decoding process of a CDVS image descriptor.
This part of the MPEG-7 standard specifies an image description tool designed to enable efficient and
interoperable visual search applications, allowing visual content matching in images. Visual content
matching includes matching of views of objects, landmarks, and printed documents, while being robust
to partial occlusions as well as changes in viewpoint, camera parameters, and lighting conditions.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
image descriptor
descriptor extracted from one image
2.2
image descriptor length
size of an image descriptor in bytes
Note 1 to entry: This International Standard specifies six average (i.e. over a large number of images) image
descriptor lengths, i.e. 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, 8192 bytes, and 16384 bytes, and the
encoding process for each image descriptor length.
2.3
original image
input image to the image descriptor encoder
2.4
converted image
image which is a spatially resampled version of the original image and from which the image
descriptor is extracted
2.5
pixel
indexable element of the original image or the converted image, comprising spatial coordinates and a
luminance value
2.6
interest point
point in an image showing detection stability under local and global perturbations in the image domain,
including perspective transformations, changes in image scale, and illumination variations
2.7
local region
area in an image in the neighbourhood of an interest point, used to generate local feature descriptors
2.8
cell
each of the 4x4 subdivisions of a local region
2.9
cell histogram
histogram of gradients computed from the cell
2.10
local feature descriptor
descriptor of a local region, computed from the cell histograms
2.11
global descriptor
aggregation of local feature descriptors into a compact representation of the image
2.12
compressed local feature descriptor
compressed representation of a local feature descriptor
2.13
interest point coordinate
horizontal and vertical pixel coordinates indicating the position of an interest point in the converted
image resolution, rounded to the nearest integer
2.14
location quantization factor
size of the blocks of the spatial grid superimposed on top of the converted image in order to obtain
quantized interest point coordinates’ values
2.15
histogram map
binary representation of the converted image scaled down by the location quantization factor, indicating
whether each bin generated through the superimposition of the spatial grid on top of the converted
image is populated with at least one interest point
2.16
histogram count
vector indicating the number of interest points that populate each non-empty bin generated through
the superimposition of a spatial grid on top of the converted image
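The relationship between definitions 2.14 to 2.16 can be sketched as follows. This is only an illustration under stated assumptions: the function name `histogram_map_and_count`, the ceiling division used for partially covered blocks, and the row-major ordering of the counts are all hypothetical choices made here; the normative scanning and coding procedure is the one specified in 5.8.

```python
def histogram_map_and_count(points, width, height, factor):
    """Build a histogram map and histogram count from interest point coordinates.

    points: iterable of integer (x, y) coordinates in the converted image.
    factor: location quantization factor, i.e. the block size of the spatial grid.
    """
    # Assumption: blocks that only partially overlap the image still get a bin.
    map_w = -(-width // factor)
    map_h = -(-height // factor)
    counts = [[0] * map_w for _ in range(map_h)]
    for x, y in points:
        counts[y // factor][x // factor] += 1
    # Histogram map: 1 where a bin holds at least one interest point, else 0.
    hist_map = [[1 if c > 0 else 0 for c in row] for row in counts]
    # Histogram count: number of interest points in each non-empty bin.
    # Assumption: row-major order; the normative scan order is defined in 5.8.
    hist_count = [c for row in counts for c in row if c > 0]
    return hist_map, hist_count


points = [(0, 0), (1, 2), (5, 1), (7, 7)]
hist_map, hist_count = histogram_map_and_count(points, width=8, height=8, factor=4)
print(hist_map)    # [[1, 1], [0, 1]]
print(hist_count)  # [2, 1, 1]
```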
3 Symbols and abbreviated terms
3.1 General
NOTE The mathematical operators used in this part of ISO/IEC 15938 are similar to those used in the C
programming language. Unless otherwise indicated, all the arithmetic operations are performed with real
values. Numbering and counting conventions generally begin from 0.
3.2 Abbreviations
CDVS Compact Descriptors for Visual Search
LoG Laplacian-of-Gaussian
MPEG Moving Picture Experts Group
MPEG-7 ISO/IEC 15938
3.3 Arithmetic operations
+ Addition
- Subtraction (as a binary operator) or negation (as a unary operator)
++ Increment by 1, i.e. x++ is equivalent to x=x+1
-- Decrement by 1, i.e. x-- is equivalent to x=x-1
+= Increment by value, i.e. x+=y is equivalent to x=x+y
-= Decrement by value, i.e. x-=y is equivalent to x=x-y
* Multiplication (in binary representation syntax and pseudo-code) or convolution
(elsewhere)
× Multiplication
· Multiplication
/ Division
÷ Division
% Modulo operator
3.4 Logical operators
|| Logical OR
˅ Logical OR
&& Logical AND
˄ Logical AND
! Logical NOT
3.5 Relational operators
> Greater than
>= Greater than or equal to
≥ Greater than or equal to
< Less than
<= Less than or equal to
≤ Less than or equal to
== Equal to
!= Not equal to
3.6 Bitwise operators
| OR
& AND
3.7 Assignment
= Assignment operator
← Assignment operator
3.8 Mnemonics
The following mnemonics are defined to describe the different data types used in the coded bitstream.
bslbf Bit string, left bit first, where “left” is the order in which bits are written in the bitstream.
uimsbf Unsigned integer, most significant bit first.
vlclbf Variable length code, left bit first, where “left” refers to the order in which the VLC
codes are written in the bitstream and where the byte order of multibyte words is
most significant byte first.
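As an illustration of the bslbf and uimsbf mnemonics, the following minimal sketch reads fields from a byte buffer, leftmost bit first. The class `BitReader` and its method names are hypothetical helpers introduced here and are not defined by this International Standard.

```python
class BitReader:
    """Reads bits from a bytes object, leftmost (most significant) bit first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position, counted from the start of the buffer

    def read_bit(self) -> int:
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1
        self.pos += 1
        return bit

    def read_uimsbf(self, n: int) -> int:
        """Unsigned integer of n bits, most significant bit first."""
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value

    def read_bslbf(self, n: int) -> str:
        """Bit string of n bits, left bit first, returned as '0'/'1' characters."""
        return "".join(str(self.read_bit()) for _ in range(n))


reader = BitReader(bytes([0b10100000, 0b11111111]))
first = reader.read_bslbf(3)    # '101'
second = reader.read_uimsbf(8)  # bits 00000111, i.e. 7
```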
3.9 Constants
π 3.141 592 653 58…
e 2.718 281 828 45…
3.10 Functions
log_n( ) Base-n logarithm
max( ) Maximum value in argument list
min( ) Minimum value in argument list
sgn( ) Sign function, i.e. sgn(x) = -1, 0 or +1 when x < 0, x == 0 or x > 0, respectively
| | Absolute value of a scalar or norm of a vector
⌊ ⌋ Floor function which returns the maximum integer number less than or equal to the given real number
⌈ ⌉ Ceiling function which returns the minimum integer number greater than or equal to the given real number
↓2×2 Downsamples an image by keeping only the even rows and even columns of the image, without anti-alias filtering
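The sign and downsampling functions listed above can be sketched in a few lines. This is an informal illustration: the names `sgn` and `downsample_2x2` are chosen here, the image is assumed to be a list of rows, and rows and columns are assumed to be counted from 0 (in line with the counting convention of 3.1), so that "even" means indices 0, 2, 4 and so on.

```python
import math


def sgn(x):
    """Sign function: -1, 0 or +1 for x < 0, x == 0 or x > 0."""
    return (x > 0) - (x < 0)


def downsample_2x2(image):
    """Keep only the even rows and even columns (0-based), with no anti-alias filtering."""
    return [row[::2] for row in image[::2]]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
small = downsample_2x2(image)
print(small)                                  # [[0, 2], [8, 10]]
print(sgn(-3.5), sgn(0), sgn(2))              # -1 0 1
print(math.floor(-1.5), math.ceil(-1.5))      # -2 -1
```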
4 CDVS syntax
4.1 Binary representation syntax
CDVSDescriptor {                                              Number of bits   Mnemonics
  VersionID                                                   3                bslbf
  ModeID                                                      8                uimsbf
  GlobalHasBitSelection                                       1                bslbf
  GlobalHasVariance                                           1                bslbf
  RelevanceBitsPresent                                        1                bslbf
  ReservedBits                                                2                bslbf
  OriginalImageXResolution                                    16               uimsbf
  OriginalImageYResolution                                    16               uimsbf
  NumberOfLocalDescriptors                                    16               uimsbf
  if(NumberOfLocalDescriptors>0) {
    for(k=0; k<NumberOfGlobalFunctions; k++) {
      GlobalFunctionPresent[k]                                1                bslbf
    }
    if(GlobalHasBitSelection) {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionMeanVector[k]                         24               bslbf
        }
      }
    }
    else {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionMeanVector[k]                         32               bslbf
        }
      }
    }
    if(GlobalHasVariance) {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionVarianceVector[k]                     32               bslbf
        }
      }
    }
    HistogramCountSize                                        16               uimsbf
    HistogramMapSizeX                                         16               uimsbf
    HistogramMapSizeY                                         16               uimsbf
    HistogramCount (arithmetically coded block; see 5.8)      >=0              vlclbf
    HistogramMap (arithmetically coded block; see 5.8)        >=0              vlclbf
    NumberOfElementGroups                                     6                uimsbf
    for(k=0; k<NumberOfLocalDescriptors; k++) {
      for(n=0; n<(4*NumberOfElementGroups); n++) {
        LocalDescriptorElements[k][n]                         1-2              vlclbf
      }
    }
    if(RelevanceBitsPresent) {
      for(k=0; k<NumberOfLocalDescriptors; k++) {
        RelevanceBits[k]                                      1                bslbf
      }
    }
    BitStuffing                                               0-7              vlclbf
  }
}
VersionID = 1
NumberOfGlobalFunctions = 512
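The fixed-length fields at the start of the syntax add up to 64 bits, so they can be read from the first 8 bytes of a descriptor. The sketch below is an informal illustration, not a conformant decoder: the function name `parse_cdvs_header` is hypothetical, bslbf fields are returned as integers for convenience, and the example bytes are made up for demonstration.

```python
# Field names and bit widths taken from the binary representation syntax in 4.1.
HEADER_FIELDS = [
    ("VersionID", 3),
    ("ModeID", 8),
    ("GlobalHasBitSelection", 1),
    ("GlobalHasVariance", 1),
    ("RelevanceBitsPresent", 1),
    ("ReservedBits", 2),
    ("OriginalImageXResolution", 16),
    ("OriginalImageYResolution", 16),
    ("NumberOfLocalDescriptors", 16),
]


def parse_cdvs_header(data: bytes) -> dict:
    """Parse the fixed-length leading fields of a CDVSDescriptor bitstream."""
    total_bits = sum(width for _, width in HEADER_FIELDS)  # 64 bits
    if len(data) * 8 < total_bits:
        raise ValueError("buffer is shorter than the fixed header")
    bits = int.from_bytes(data[: total_bits // 8], "big")
    header = {}
    remaining = total_bits
    for name, width in HEADER_FIELDS:
        remaining -= width
        header[name] = (bits >> remaining) & ((1 << width) - 1)
    return header


# Made-up example: VersionID 1, ModeID 3, bit selection on, variance off,
# relevance bits present, 640x480 original image, 250 local descriptors.
example = bytes.fromhex("2074028001e000fa")
header = parse_cdvs_header(example)
print(header["ModeID"], header["OriginalImageXResolution"])  # 3 640
```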
4.2 Descriptor component semantics
VersionID
This descriptor component specifies the CDVSDescriptor version. In this International Standard, VersionID = 1.
ModeID
This descriptor component specifies the image descriptor length. There are six image descriptor
lengths, and their corresponding ModeID values are shown in Table 1 below.
Table 1 — ModeID values for the six image descriptor lengths
Image descriptor length ModeID
512 bytes 1
1024 bytes 2
2048 bytes 3
4096 bytes 4
8192 bytes 5
16384 bytes 6
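The lengths in Table 1 double with each ModeID step, so the mapping can be written compactly. The helper name `descriptor_length_bytes` below is a hypothetical convenience and not part of the standard; the values it returns are exactly those of Table 1.

```python
def descriptor_length_bytes(mode_id: int) -> int:
    """Return the average image descriptor length in bytes for a ModeID from Table 1."""
    if not 1 <= mode_id <= 6:
        raise ValueError("Table 1 defines ModeID values 1 to 6 only")
    return 512 << (mode_id - 1)  # 512, 1024, ..., 16384


lengths = [descriptor_length_bytes(m) for m in range(1, 7)]
print(lengths)  # [512, 1024, 2048, 4096, 8192, 16384]
```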
GlobalHasBitSelection
This descriptor component specifies whether bit selection is applied or not to the
GlobalFunctionMeanVector of each of the Gaussian functions which are present in the global
descriptor of an image descriptor. If GlobalHasBitSelection == 1 then bit selection is applied, and if
GlobalHasBitSelection == 0 then bit selection is not applied. More details are provided in 5.6.
GlobalHasVariance
This descriptor component specifies whether the GlobalFunctionVarianceVector of each of the Gaussian
functions which are present in the global descriptor of an image descriptor appears in the bitstream
or not. If GlobalHasVariance == 1 then GlobalFunctionVarianceVector appears in the bitstream, and if
GlobalHasVariance == 0 then GlobalFunctionVarianceVector does not appear in the bitstream. More
details are provided in 5.6.
RelevanceBitsPresent
This descriptor component specifies if a relevance bit for each compressed local feature descriptor
is present in the bitstream. If RelevanceBitsPresent == 1 then the relevance bits are present in the
bitstream, and if RelevanceBitsPresent == 0 then the relevance bits are not present in the bitstream.
More details are provided in 5.4.
ReservedBits
This descriptor component comprises two bits which are reserved for future use and they shall
both be set to 0.
OriginalImageXResolution
This descriptor component specifies the width (in pixels) of the original image.
OriginalImageYResolution
This descriptor component specifies the height (in pixels) of the original image.
NumberOfLocalDescriptors
This descriptor component specifies the number of compressed local feature descriptors which are
present in the bitstream. More details are provided in 5.10. NumberOfLocalDescriptors == 0 indicates
that no local features were identified in the image.
NumberOfGlobalFunctions
This descriptor component specifies the maximum number of Gaussian functions used in the global
descriptor and has a value NumberOfGlobalFunctions = 512. More details are provided in 5.6.
GlobalFunctionPresent
This descriptor component specifies a 1-D array of size NumberOfGlobalFunctions indicating which
Gaussian functions are present in the global descriptor of a particular image descriptor. If a Gaussian
function is present in the global descriptor the corresponding value in the array is 1, otherwise it is 0.
More details are provided in 5.6.
GlobalFunctionMeanVector
This descriptor component specifies a 1-D array of size equal to the number of Gaussian functions
which are present in the global descriptor, i.e. those Gaussian functions with a corresponding value of
1 in GlobalFunctionPresent. Each entry in the array is the binarized mean vector of the corresponding
global descriptor Gaussian function, and the length of each vector is 24 bits if GlobalHasBitSelection
== 1 and 32 bits if GlobalHasBitSelection == 0. More details are provided in 5.6.
GlobalFunctionVarianceVector
This descriptor component specifies a 1-D array of size equal to the number of Gaussian functions
which are present in the global descriptor, i.e. those Gaussian functions with a corresponding value of 1
in GlobalFunctionPresent. Each entry in the array is the binarized variance vector of the corresponding
global descriptor Gaussian function. More details are provided in 5.6.
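Taken together, the semantics of GlobalFunctionPresent, GlobalFunctionMeanVector and GlobalFunctionVarianceVector determine how many bits the global descriptor occupies in the bitstream. The following sketch works through that arithmetic; the function name `global_descriptor_bits` and the example inputs are hypothetical, while the bit widths (1, 24 or 32, and 32) are those given in 4.1.

```python
NUMBER_OF_GLOBAL_FUNCTIONS = 512  # value given in 4.1 and 4.2


def global_descriptor_bits(present, has_bit_selection, has_variance):
    """Count the bits used by the global descriptor fields of a CDVSDescriptor.

    present: sequence of 0/1 flags, one per Gaussian function (GlobalFunctionPresent).
    """
    if len(present) != NUMBER_OF_GLOBAL_FUNCTIONS:
        raise ValueError("expected one flag per Gaussian function")
    n_present = sum(present)
    mean_bits = 24 if has_bit_selection else 32
    total = NUMBER_OF_GLOBAL_FUNCTIONS  # one GlobalFunctionPresent bit each
    total += n_present * mean_bits      # GlobalFunctionMeanVector entries
    if has_variance:
        total += n_present * 32         # GlobalFunctionVarianceVector entries
    return total


flags = [1] * 100 + [0] * 412  # made-up example: 100 Gaussian functions present
compact = global_descriptor_bits(flags, has_bit_selection=True, has_variance=False)
full = global_descriptor_bits(flags, has_bit_selection=False, has_variance=True)
print(compact, full)  # 2912 6912
```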
HistogramCountSize
This descriptor component specifies the histogram count vector length for location coding. More details
are provided in 5.8.
HistogramMapSizeX
This descriptor component specifies the horizontal x resolution of the histogram map for location
coding. More details are provided in 5.8.
HistogramMapSizeY
This descriptor component specifies the vertical y resolution of the histogram map for location coding.
More details are provided in 5.8.
HistogramCount
This descriptor component specifies a vector for location coding, containing the number of non-zero
elements for each non-null block of the histogram map. More details are provided in 5.8.
HistogramMap
This descriptor component specifies a 2D-array for location coding, containing a block representation
of the converted image. Each block can assume a binary value, indicating the occurrence or not of
interest points within that block. The array is scanned according to a procedure described in 5.8. The
scanning terminates when all the non-null elements of the Histogram Map are encoded. More details
are provided in 5.8.
NumberOfElementGroups
This descriptor component specifies the number of element groups in each compressed local feature
descriptor. Each element group contains four elements and the number of elements in each compressed
local feature descriptor is given by 4×NumberOfElementGroups. More details are provided in 5.7.
LocalDescriptorElements
This descriptor component specifies a 2-D array of compressed local feature descriptor elements.
The size of the first dimension is NumberOfLocalDescriptors and the size of the second dimension is
4×NumberOfElementGroups. LocalDescriptorElements[k][n] is the n-th element of the k-th compressed
local feature descriptor. For each compressed local feature descriptor, its elements are ordered as
described in 5.7.
The compressed local feature descriptors themselves are ordered as described in 5.9.
RelevanceBits
This descriptor component specifies a 1-D array of size NumberOfLocalDescriptors indicating which
compressed local feature descriptors correspond to the top 300 local features as determined in 5.4. If
the k-th local feature is one of the top 300 local features, then RelevanceBits[k] is set to 1, otherwise it is
set to 0. If NumberOfLocalDescriptors < 300, then all the values in RelevanceBits are set to 1. More details
are provided in 5.4.
The relevance bits are ordered in the same order as the descriptors in LocalDescriptorElements, as
described in 5.9.
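The rule for setting RelevanceBits can be sketched as follows. This is a simplified illustration: it assumes each encoded descriptor carries a relevance rank (0 for the most relevant feature), which is a hypothetical stand-in for the outcome of the feature selection in 5.4, and the function name `relevance_bits` is not defined by the standard.

```python
TOP_FEATURES = 300  # number of top local features named in the RelevanceBits semantics


def relevance_bits(feature_ranks):
    """Return one relevance bit per encoded descriptor.

    feature_ranks[k] is the assumed relevance rank (0 = most relevant) of the
    k-th compressed local feature descriptor, in encoding order.
    """
    if len(feature_ranks) < TOP_FEATURES:
        return [1] * len(feature_ranks)  # fewer than 300 descriptors: all bits are 1
    return [1 if rank < TOP_FEATURES else 0 for rank in feature_ranks]


few = relevance_bits([2, 0, 1])
many = relevance_bits(list(reversed(range(400))))
print(few)        # [1, 1, 1]
print(sum(many))  # 300
```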
BitStuffing
This descriptor component specifies stuffing bits (a sequence of ‘1’s) to align the descriptor to a
byte boundary.
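The number of stuffing bits follows directly from the length of the preceding bitstream. A minimal sketch, using a hypothetical helper name `stuffing_bits` and representing bits as a string of characters:

```python
def stuffing_bits(bit_length: int) -> str:
    """Return the '1' bits needed to pad a bitstream of bit_length bits to a byte boundary."""
    return "1" * (-bit_length % 8)  # between 0 and 7 bits


print(repr(stuffing_bits(64)))  # ''
print(repr(stuffing_bits(61)))  # '111'
print(repr(stuffing_bits(65)))  # '1111111'
```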
5 CDVS encoding
5.1 General
This clause specifies the encoder operations for computing an image descriptor. A simplified diagram of
a complete CDVS encoder implementing these encoding operations is presented in informative Annex A.
5.2 Original image preprocessing
The original image is a luminance raster image containing values in the interval [0, 255] where
increasing values correspond to increasing luminance. The exact mapping of luminance values within
this interval is beyond the scope of the standard. If at least one of the dimensions of the original image
is greater than 640 pixels then the original image shall be spatially resampled, maintaining the aspect
ratio, so that the largest of the vertical and horizontal image dimensions is equal to 640 pixels, to obtain
a converted image J(x, y), in which $x \in \{0, \ldots, X-1\}$ and $y \in \{0, \ldots, Y-1\}$ are the horizontal and vertical
pixel coordinates respectively, X and Y the pixel horizontal and vertical image dimensions respectively,
and with coordinates (0,0) located at the top left corner of the image. For this resampling operation, a
Lanczos filter with a = 3 should be used. If both the dimensions of the original image are no greater
than 640 pixels, then no spatial resampling is performed and the content of the converted image shall
be the same as the content of the original image.
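The size rule of 5.2 can be sketched as follows. This is only an illustration of how the converted image dimensions might be derived: the function name `converted_size` is hypothetical, and rounding the shorter side to the nearest integer is an assumption, since this clause does not state a rounding rule. The Lanczos resampling itself is not shown.

```python
def converted_size(width: int, height: int, max_side: int = 640) -> tuple:
    """Return (X, Y) of the converted image for an original of width x height."""
    longest = max(width, height)
    if longest <= max_side:
        # Both dimensions are no greater than 640: no resampling is performed.
        return width, height
    scale = max_side / longest
    # Assumption: nearest-integer rounding, never below one pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With these assumptions a 1280 × 960 original maps to 640 × 480, while a 640 × 480 original is left unchanged.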
5.3 Interest point detection
5.3.1 Introduction
This operation is performed using the ALP (A Low-degree Polynomial) detector. In order to find interest
points, ALP approximates the result of the LoG filtering by means of polynomials, used to find extrema
in the scale space and to refine the spatial position of the detected points.
5.3.2 Scale space construction
Let g denote the Gaussian kernel in two dimensions with positive scale parameter σ
$$g(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (1)$$
The filtering operations shall be done at 4 scales with values for the σ parameter in an exponentially
increasing sequence
$$\sigma_k = \sigma_0 \cdot 2^{k/2}, \quad k = 0, \ldots, 3 \qquad (2)$$
as provided in Table 2 below.
Table 2 — Values of the scale parameter
k    $\sigma_k$
0    1,600000
1    2,262742
2    3,200000
3    4,525483
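The values of Table 2 can be reproduced with a short Python sketch. It assumes that equation (2) has the form σ_k = σ_0 · 2^(k/2) with σ_0 = 1,6, which is the reading that matches the tabulated values. The names `SIGMA_0` and `scale` are illustrative.

```python
SIGMA_0 = 1.6  # first entry of Table 2

def scale(k: int) -> float:
    """Scale parameter of the k-th filtering operation, k = 0..3 (assumed form)."""
    return SIGMA_0 * 2.0 ** (k / 2.0)

scales = [scale(k) for k in range(4)]
```

Every second scale doubles, so σ_2 is exactly twice σ_0.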
Interest points shall be identified by means of the scale-normalized Laplacian-of-Gaussian (LoG) kernel,
which is realized as the convolution
$$h(\cdot, \cdot, \sigma) = \sigma^2 \cdot \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} * g(\cdot, \cdot, \sigma) \qquad (3)$$
where g in this case is a truncated and spatially discrete Gaussian function, with width equal to
$2 \lceil 4 \cdot \sigma \rceil + 1$, where $\lceil \cdot \rceil$ denotes the ceiling function.
For the converted image J(x, y) in which $x \in \{0, \ldots, X-1\}$ and $y \in \{0, \ldots, Y-1\}$ are the horizontal and
vertical pixel coordinates respectively, X and Y the pixel horizontal and vertical image dimensions
respectively, and with coordinates (0,0) located at the top left corner of the image J(x, y), scale space
shall be constructed as follows.
The image shall be processed in a scale space representation obtained by Gaussian blur with different
scale factors σ. The scale space shall be structured in a number Q of octaves,
$$Q = \max\{\lfloor \log_2(\max\{X, Y\}) \rfloor - 3,\ 1\} \qquad (4)$$
with $\lfloor \cdot \rfloor$ denoting the floor function.
For each octave in scale space, 4 images shall be produced by filtering of a first image I with a Gaussian
kernel. In any octave, these images shall be obtained by the following filtering operations
$$\begin{aligned} I_0 &= I \\ I_1 &= I_0 * g(\delta_1) \\ I_2 &= I_0 * g(\delta_2) \\ I_3 &= I_0 * g(\delta_3) \end{aligned} \qquad (5)$$
with the parameter $\delta_n = \sqrt{\sigma_n^2 - \sigma_0^2}$ for n = 1,…,3. The first image in the first octave shall be obtained as
$$I = J * g(\sigma_0) \qquad (6)$$
and in all other octaves the first image shall be obtained by downsampling
$$I = \downarrow_{2 \times 2}\left(I_2^{prev}\right) \qquad (7)$$
where $I_2^{prev}$ denotes image $I_2$ in the previous octave. Anti-alias filtering shall not be applied since the
downsampling is applied to images which are already low-pass filtered.
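The incremental blur and the downsampling step can be sketched in Python. The sketch assumes that δ_n is the square root of σ_n² − σ_0² (the usual rule for cascading Gaussian filters) and represents an image as a list of rows. The names `SIGMAS`, `incremental_blur` and `downsample_2x2` are illustrative only.

```python
import math

SIGMAS = [1.6, 2.262742, 3.2, 4.525483]  # Table 2

def incremental_blur(n: int) -> float:
    """Blur delta_n that takes an image at scale sigma_0 to scale sigma_n (assumed form)."""
    return math.sqrt(SIGMAS[n] ** 2 - SIGMAS[0] ** 2)

def downsample_2x2(image):
    """Keep only the even rows and even columns (counting from 0), no anti-alias filter."""
    return [row[::2] for row in image[::2]]
```

Under this assumption δ_1 is approximately 1,6 and δ_2 approximately 2,771. Halving the resolution of an image blurred to σ_2 = 2·σ_0 yields an image whose blur, measured on the new pixel grid, is again σ_0.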
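Additionally, in any octave 4 images shall be produced by scale-normalized Laplacian filtering of the
Gaussian-filtered images
$$\begin{aligned} L_0 &= \sigma_0^2 \cdot I_0 * f, & L_1 &= \sigma_1^2 \cdot I_1 * f \\ L_2 &= \sigma_2^2 \cdot I_2 * f, & L_3 &= \sigma_3^2 \cdot I_3 * f \end{aligned} \qquad (8)$$
where $f = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$ is the discrete Laplacian operator.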
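A minimal Python sketch of the scale-normalized Laplacian filtering of equation (8) is given here for illustration. It assumes an image stored as a list of rows and leaves border pixels at zero, which is an assumption since border handling is not described in this subclause. The function name is hypothetical.

```python
F = [[0, 1, 0],
     [1, -4, 1],
     [0, 1, 0]]  # discrete Laplacian operator f

def scale_normalized_laplacian(image, sigma):
    """Return sigma^2 * (image * f) for interior pixels; borders stay 0 (assumption)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += F[dy + 1][dx + 1] * image[y + dy][x + dx]
            out[y][x] = sigma * sigma * acc
    return out
```

As a check, the image x² + y² has a discrete Laplacian of 4 at every interior pixel, so with σ = 1,6 the interior response is 4 · 1,6² = 10,24.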
5.3.3 Detection of scale-space extrema
For each octave, two intervals in scale are defined. One is the outer interval Ω and it shall contain a
smaller one called the inner interval Ω.
The outer interval has the lowest and highest scales $\sigma_0$ and $\sigma_3$ as boundaries, $\Omega = [1.6, 4.525483]$, and
the inner interval has the boundaries $\Omega = [1.7, 4.0]$.
For each pixel (x, y) in the image, a polynomial approximation to the scale-space function
$$p(x, y, \sigma) = \alpha_3(x, y)\,\sigma^3 + \alpha_2(x, y)\,\sigma^2 + \alpha_1(x, y)\,\sigma + \alpha_0(x, y) \qquad (9)$$
shall be searched for a local extremum over the outer interval Ω. The coefficients shall be obtained by
computing weighted sums of the images $L_0, \ldots, L_3$
$$\begin{aligned} \alpha_3(x, y) &= \sum_{k=0}^{K-1} a_k \cdot L_k(x, y) \\ \alpha_2(x, y) &= \sum_{k=0}^{K-1} b_k \cdot L_k(x, y) \\ \alpha_1(x, y) &= \sum_{k=0}^{K-1} c_k \cdot L_k(x, y) \\ \alpha_0(x, y) &= \sum_{k=0}^{K-1} d_k \cdot L_k(x, y) \end{aligned} \qquad (10)$$
where the coefficients $a_k$, $b_k$, $c_k$, $d_k$, corresponding to the 4 predefined scales $\sigma_k$, k = 0,…,3 (so that K = 4), are listed in
Table 3.
Table 3 — Coefficients for the equations for polynomial approximation
k    $a_k$      $b_k$      $c_k$      $d_k$
0    -0,2464    2,5021     -8,2007    8,6432
1    0,4934     -4,5636    12,9824    -10,8424
2    -0,2717    2,0108     -4,0449    2,1204
3    0,0140     0,1549     -1,0565    1,3886
In this manner, the polynomial approximation is obtained by filtering the original image with a
weighted sum of Laplacian-of-Gaussian filters
$$\sum_{k=0}^{3} \sigma_k^2 \cdot f * \left(a_k \sigma^3 + b_k \sigma^2 + c_k \sigma + d_k\right) \cdot g(\sigma_k) \qquad (11)$$
where each of the 4 weights is a polynomial in σ, as illustrated in Figure 1.
Figure 1 — Polynomial weights for approximating the scale-space function
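The weighted sums of equation (10) can be sketched for a single pixel as follows. The coefficient values are those of Table 3. The names `TABLE_3`, `poly_coefficients` and `poly_value` are illustrative, and the sketch assumes that the four Laplacian responses L_0,…,L_3 at the pixel are already available.

```python
# Rows k = 0..3 of Table 3 as (a_k, b_k, c_k, d_k).
TABLE_3 = [
    (-0.2464, 2.5021, -8.2007, 8.6432),
    (0.4934, -4.5636, 12.9824, -10.8424),
    (-0.2717, 2.0108, -4.0449, 2.1204),
    (0.0140, 0.1549, -1.0565, 1.3886),
]

def poly_coefficients(responses):
    """Return (alpha3, alpha2, alpha1, alpha0) from the responses L_0..L_3 at one pixel."""
    return tuple(
        sum(row[i] * value for row, value in zip(TABLE_3, responses))
        for i in range(4)
    )

def poly_value(coeffs, sigma):
    """Evaluate p(sigma) = alpha3*sigma^3 + alpha2*sigma^2 + alpha1*sigma + alpha0."""
    a3, a2, a1, a0 = coeffs
    return ((a3 * sigma + a2) * sigma + a1) * sigma + a0
```

A pixel whose only non-zero response is L_0 = 1 simply receives the first row of Table 3 as its polynomial coefficients.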
The coefficients are computed by minimizing the approximation error
$$\left\| f * g(\sigma) - \sum_{k=0}^{3} \left(a_k \sigma^3 + b_k \sigma^2 + c_k \sigma + d_k\right) \cdot f * g(\sigma_k) \right\| \qquad (12)$$
over a set of scales contained within the outer interval. Figure 2 depicts a Laplacian-of-Gaussian filter
with σ = 2.5 and its approximation.
Figure 2 — Exact and approximated Laplacian-of-Gaussian filters at scale 2.5
A tentative scale $\sigma^*(x, y)$ shall be associated to each pixel location x,y as the most extreme over the
outer interval Ω,
$$\sigma^*(x, y) = \arg\max_{\sigma \in \Omega} p(x, y, \sigma) \qquad (13)$$
or
$$\sigma^*(x, y) = \arg\min_{\sigma \in \Omega} p(x, y, \sigma) \qquad (14)$$
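whichever of the two alternatives has the greatest absolute value. Therefore, for all pixels x,y such that
$$\alpha_2^2(x, y) - 3\,\alpha_1(x, y)\,\alpha_3(x, y) > 0 \qquad (15)$$
only the ones that satisfy
$$\sigma^*(x, y) = \frac{-\alpha_2(x, y) + \sqrt{\alpha_2^2(x, y) - 3\,\alpha_1(x, y)\,\alpha_3(x, y)}}{3\,\alpha_3(x, y)} \in \Omega \qquad (16)$$
or
$$\sigma^*(x, y) = \frac{-\alpha_2(x, y) - \sqrt{\alpha_2^2(x, y) - 3\,\alpha_1(x, y)\,\alpha_3(x, y)}}{3\,\alpha_3(x, y)} \in \Omega \qquad (17)$$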
shall be considered. Among the considered pixels, those with positive solutions $\sigma^*$ larger than the
polynomial at any boundary of the outer interval
$$p(x, y, \sigma^*) > \max\{p(x, y, \sigma_0),\ p(x, y, \sigma_3)\} \qquad (18)$$
as well as all pixels with negative solutions $\sigma^*$ smaller than the polynomial at any boundary of the
outer interval
$$p(x, y, \sigma^*) < \min\{p(x, y, \sigma_0),\ p(x, y, \sigma_3)\} \qquad (19)$$
shall be accepted as candidates, forming triples $\{x, y, \sigma^*(x, y)\}$. The other candidates are eliminated
from further processing in the present octave. This mechanism also eliminates solutions that are not
local extrema, i.e. with second derivative equal to zero.
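Conditions (15) to (17) amount to finding the stationary points of the cubic p in σ, that is, the roots of 3·α_3·σ² + 2·α_2·σ + α_1 = 0, and keeping those inside the outer interval. The following Python sketch illustrates this for one pixel. The function name is hypothetical, and returning an empty list when α_3 is zero is an assumption, since that case is not addressed here.

```python
import math

def stationary_scales(alpha3, alpha2, alpha1, lo=1.6, hi=4.525483):
    """Return the stationary scales of the cubic that lie within [lo, hi]."""
    disc = alpha2 * alpha2 - 3.0 * alpha1 * alpha3  # left-hand side of (15)
    if alpha3 == 0.0 or disc <= 0.0:
        return []  # no pair of real stationary points (alpha3 == 0 is an assumption)
    root = math.sqrt(disc)
    candidates = [
        (-alpha2 + root) / (3.0 * alpha3),  # equation (16)
        (-alpha2 - root) / (3.0 * alpha3),  # equation (17)
    ]
    return [s for s in candidates if lo <= s <= hi]
```

For example, a cubic whose derivative is 3(σ − 2,5)(σ − 10) has α_3 = 1, α_2 = −18,75 and α_1 = 75, and only the stationary point at 2,5 falls inside the outer interval.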
Those candidates for which the solutions $\sigma^*$ are within the inner interval, $\sigma^*(x, y) \in \Omega$, shall be
subjected to further processing, the other candidates being eliminated from further processing in the
present octave.
Thereafter, any remaining candidate $\{x, y, \sigma^*(x, y)\}$ is eliminated from further processing in the
present octave if the absolute value of the polynomial is below a first threshold equal to 0.4, i.e.
$$\left| p(x, y, \sigma^*(x, y)) \right| < \theta \quad \text{with } \theta = 0.4 \qquad (20)$$
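Thereafter, any remaining candidate $\{x, y, \sigma^*(x, y)\}$ is eliminated from further processing in the
present octave if the second derivative of the scale space function with regard to σ is below a second
threshold set to 0.4, i.e.
$$\left| \frac{\partial^2}{\partial \sigma^2}\, p(x, y, \sigma^*(x, y)) \right| = \left| 6\,\alpha_3(x, y)\,\sigma^* + 2\,\alpha_2(x, y) \right| < \theta_2 \quad \text{with } \theta_2 = 0.4 \qquad (21)$$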
Thereafter, any remaining candidate $\{x, y, \sigma^*(x, y)\}$ is eliminated from further processing in the
present octave if the polynomial value $p(x, y, \sigma^*(x, y))$ is surpassed by the polynomial value of any
remaining candidates among its 8-neighbours. Specifically, for any $m \in \{-1, 0, 1\}$ and any $n \in \{-1, 0, 1\}$
excluding the combination $(m, n) = (0, 0)$, if
$$p(x, y, \sigma^*(x, y)) \le p(x + m, y + n, \sigma);\ \sigma \in \Omega \quad \text{when } p(x, y, \sigma^*(x, y)) \text{ is a maximum} \qquad (22)$$
or
$$p(x, y, \sigma^*(x, y)) \ge p(x + m, y + n, \sigma);\ \sigma \in \Omega \quad \text{when } p(x, y, \sigma^*(x, y)) \text{ is a minimum} \qquad (23)$$
then the candidate is eliminated.
The remaining candidates $\{x, y, \sigma^*(x, y)\}$ are input to the next processing step.
5.3.4 Coordinate refinement to subpixel precision.
For the position refinement 9 pre-defined positions shall be used: all 9 combinations of $u \in \{-1, 0, 1\}$ and
$v \in \{-1, 0, 1\}$, corresponding to shifts of the LoG kernels in the xy plane.
Firstly, candidates at local edges in the polynomial $p(x, y, \sigma^*(x, y))$ shall be eliminated by the
following test:
The 3 × 3 pixels around any candidate are computed at the scale $\sigma^*(x, y)$ of the candidate
$$P(x, y, \sigma^*(x, y)) = \begin{pmatrix} p(x-1, y-1, \sigma^*(x, y)) & p(x, y-1, \sigma^*(x, y)) & p(x+1, y-1, \sigma^*(x, y)) \\ p(x-1, y, \sigma^*(x, y)) & p(x, y, \sigma^*(x, y)) & p(x+1, y, \sigma^*(x, y)) \\ p(x-1, y+1, \sigma^*(x, y)) & p(x, y+1, \sigma^*(x, y)) & p(x+1, y+1, \sigma^*(x, y)) \end{pmatrix} \qquad (24)$$
For these pixels, three quantities shall be computed
$$\begin{aligned} p_{xx} &= P_{21} - 2P_{22} + P_{23}, \\ p_{yy} &= P_{12} - 2P_{22} + P_{32}, \\ p_{xy} &= \frac{P_{11} + P_{33} - P_{31} - P_{13}}{4} \end{aligned} \qquad (25)$$
where P is shorthand for $P(x, y, \sigma^*(x, y))$ and $P_{ij}$ denotes the element in row i and column j of P. The
candidate is eliminated if the following quantity ρ exceeds a threshold equal to 12, i.e.
$$\rho = \frac{(p_{xx} + p_{yy})^2}{p_{xx} \cdot p_{yy} - p_{xy}^2} > \theta \quad \text{with } \theta = 12 \qquad (26)$$
This number is the ratio of the squared trace of the Hessian (at the scale and location of the interest
point) and the determinant of the same Hessian. It is related to the ratio of principal curvatures r as
$\rho = (r + 1)^2 / r$.
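The edge test can be sketched in Python under stated assumptions. The sketch takes ρ as the squared trace of the Hessian divided by its determinant, as described in the paragraph above, and assumes the customary factor 1/4 in the mixed derivative of equation (25). Returning infinity for a zero determinant is a further assumption, and the function names are hypothetical.

```python
import math

def edge_ratio(P):
    """Return rho for a 3x3 matrix P given as rows; P[i-1][j-1] is element P_ij."""
    p_xx = P[1][0] - 2.0 * P[1][1] + P[1][2]
    p_yy = P[0][1] - 2.0 * P[1][1] + P[2][1]
    p_xy = (P[0][0] + P[2][2] - P[2][0] - P[0][2]) / 4.0  # assumed 1/4 factor
    det = p_xx * p_yy - p_xy * p_xy
    if det == 0.0:
        return math.inf  # assumption: a degenerate Hessian is treated as an edge
    return (p_xx + p_yy) ** 2 / det

def is_edge(P, theta=12.0):
    """True when the candidate would be eliminated by the test rho > theta."""
    return edge_ratio(P) > theta
```

An isotropic blob such as [[1, 2, 1], [2, 4, 2], [1, 2, 1]] gives ρ = 4, which corresponds to r = 1, while a ridge with a curvature ratio of 10 gives ρ = 12,1 and is rejected.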
For each remaining candidate $(x, y, \sigma^*)$, a polynomial approximation to the scale-space function in the
displacement parameters u,v
$$\begin{aligned} q(u, v; x, y, \sigma^*) = {} & \beta_5(x, y, \sigma^*)\,u^2 + \beta_4(x, y, \sigma^*)\,v^2 + \beta_3(x, y, \sigma^*)\,u v \\ & + \beta_2(x, y, \sigma^*)\,u + \beta_1(x, y, \sigma^*)\,v + \beta_0(x, y, \sigma^*) \end{aligned} \qquad (27)$$
shall be searched for a local extremum. The coefficients are derived from the matrix $P(x, y, \sigma^*(x, y))$
associated to the candidate $(x, y, \sigma^*)$, as in the previous equations. Any coefficient is a weighted sum,
in which the correspondence between term number k and the row number i and column number j is
given by Table 4, found below.
Table 4 — Mapping from term number k to row number i(k) and column number j(k)
k i(k) j(k)
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
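The mapping of Table 4 runs down the rows of each column in turn, which can be expressed in closed form. The Python sketch below shows the mapping and a weighted sum over the nine elements of P. The function names are illustrative, and the weights passed in the usage example are placeholders, not values from Annex B.

```python
def term_to_row_col(k: int) -> tuple:
    """Return (i(k), j(k)) of Table 4 for term number k = 1..9."""
    if not 1 <= k <= 9:
        raise ValueError("k must be in 1..9")
    return (k - 1) % 3 + 1, (k - 1) // 3 + 1

def weighted_sum(weights, P):
    """Sum of weights[k-1] * P_{i(k), j(k)} over k = 1..9; P is a list of 3 rows."""
    total = 0.0
    for k, weight in enumerate(weights, start=1):
        i, j = term_to_row_col(k)
        total += weight * P[i - 1][j - 1]
    return total
```

With a placeholder weight vector that selects only k = 2, the sum returns the element in row 2 and column 1 of P.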
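The coefficients for the candidate $(x, y, \sigma^*)$ are thus given by weighted sums with K=9
$$\begin{aligned} \beta_5(x, y, \sigma^*) &= \sum_{k=1}^{K} a_k \cdot P_{i(k),j(k)} \\ \beta_4(x, y, \sigma^*) &= \sum_{k=1}^{K} b_k \cdot P_{i(k),j(k)} \\ \beta_3(x, y, \sigma^*) &= \sum_{k=1}^{K} c_k \cdot P_{i(k),j(k)} \\ \beta_2(x, y, \sigma^*) &= \sum_{k=1}^{K} d_k \cdot P_{i(k),j(k)} \\ \beta_1(x, y, \sigma^*) &= \sum_{k=1}^{K} e_k \cdot P_{i(k),j(k)} \\ \beta_0(x, y, \sigma^*) &= \sum_{k=1}^{K} f_k \cdot P_{i(k),j(k)} \end{aligned} \qquad (28)$$
where $P_{i(k),j(k)}$ is shorthand for element i(k),j(k) of the matrix $P(x, y, \sigma^*(x, y))$.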
The coefficients $a_k$, $b_k$, $c_k$, $d_k$, $e_k$, $f_k$ are dependent on scale; there are 4 sets and the one corresponding
to the nearest neighbor to the scale $\sigma^*$ among $\sigma_0, \ldots, \sigma_3$ shall be used. Normative Annex B provides the
coefficient sets.
NOTE The coefficients are distinct from those contained in Table 3.
The polynomial q may be written (in shorthand, omitting the variables $x, y, \sigma^*$) as
$$q(u, v) = \sum_{k=1}^{K} \left(a_k u^2 + b_k v^2 + c_k u v + d_k u + e_k v + f_k\right) \cdot P_{i(k),j(k)} \qquad (29)$$
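The local extremum of the quadratic q(u, v) can be located by setting both partial derivatives to zero, which gives a 2 × 2 linear system in u and v. The Python sketch below solves that system. It is an illustration of the ordinary stationary-point computation and is not claimed to be the procedure specified in the remainder of this subclause. The function name is hypothetical.

```python
def quadratic_extremum(b5, b4, b3, b2, b1):
    """Stationary point (u, v) of q = b5*u^2 + b4*v^2 + b3*u*v + b2*u + b1*v + b0.

    Solves 2*b5*u + b3*v = -b2 and b3*u + 2*b4*v = -b1 by Cramer's rule.
    Returns None when the system is singular (assumed handling).
    """
    det = 4.0 * b5 * b4 - b3 * b3
    if det == 0.0:
        return None
    u = (b3 * b1 - 2.0 * b4 * b2) / det
    v = (b3 * b2 - 2.0 * b5 * b1) / det
    return u, v
```

For q(u, v) = −(u − 0,2)² − (v + 0,1)² + const, the coefficients are β_5 = β_4 = −1, β_3 = 0, β_2 = 0,4 and β_1 = −0,2, and the stationary point is (0,2, −0,1).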
The coefficients form a polynomial that provides an interpolation between shifted kernels. Indeed, the
coefficients are computed to minimize the approximation error in
K
()au ++bv cuvd++ue vf+⋅
k
...
INTERNATIONAL STANDARD
ISO/IEC 15938-13
First edition
2015-09-01
Information technology — Multimedia
content description interface —
Part 13:
Compact descriptors for visual search
Technologies de l’information — Interface de description du
contenu multimédia —
Partie 13: Descripteurs compacts pour recherche visuelle
Reference number
© ISO/IEC 2015
© ISO/IEC 2015, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Symbols and abbreviated terms
3.1 General
3.2 Abbreviations
3.3 Arithmetic operations
3.4 Logical operators
3.5 Relational operators
3.6 Bitwise operators
3.7 Assignment
3.8 Mnemonics
3.9 Constants
3.10 Functions
4 CDVS syntax
4.1 Binary representation syntax
4.2 Descriptor component semantics
5 CDVS encoding
5.1 General
5.2 Original image preprocessing
5.3 Interest point detection
5.3.1 Introduction
5.3.2 Scale space construction
5.3.3 Detection of scale-space extrema
5.3.4 Coordinate refinement to subpixel precision
5.3.5 Transformation of coordinates and scale to the converted image resolution
5.3.6 Elimination of duplicates
5.3.7 Orientation Assignment
5.3.8 Interest point characteristics
5.4 Local feature selection
5.4.1 Operation
5.4.2 Descriptor components
5.5 Local feature description
5.6 Local feature descriptor aggregation
5.6.1 Operation
5.6.2 Descriptor components
5.7 Local feature descriptor compression
5.7.1 Operation
5.7.2 Descriptor components
5.8 Local feature location compression
5.8.1 Operation
5.8.2 Descriptor components
5.9 Encoding order of compressed local feature descriptors and relevance bits
5.10 Computation of the number of compressed local feature descriptors at different image descriptor lengths
Annex A (informative) CDVS encoder organization
Annex B (normative) Coefficients for coordinate refinement
Annex C (normative) Probability values for the feature selection
Annex D (normative) PCA projection matrix for local feature descriptor aggregation
Annex E (normative) GMM parameters for local feature descriptor aggregation
Annex F (normative) Gaussian function selection parameters for local feature descriptor aggregation
Annex G (normative) Bit selection masks for local feature descriptor aggregation
Annex H (normative) Scalar quantization thresholds for local feature descriptor compression
Annex I (normative) Histogram count arithmetic coding model probabilities
Annex J (normative) Histogram map arithmetic coding model probabilities
Annex K (informative) CDVS decoding
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical
activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the
work. In the field of information technology, ISO and IEC have established a joint technical committee,
ISO/IEC JTC 1.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity
assessment, as well as information about ISO’s adherence to the WTO principles in the Technical
Barriers to Trade (TBT) see the following URL: Foreword - Supplementary information
The committee responsible for this document is ISO/IEC JTC 1, Information technology, SC 29, Coding of
audio, picture, multimedia and hypermedia information.
ISO/IEC 15938 consists of the following parts, under the general title Information technology —
Multimedia content description interface:
— Part 1: Systems
— Part 2: Description definition language
— Part 3: Visual
— Part 4: Audio
— Part 5: Multimedia description schemes
— Part 6: Reference software
— Part 7: Conformance testing
— Part 8: Extraction and use of MPEG-7 descriptions
— Part 9: Profiles and levels
— Part 10: Schema definition
— Part 11: MPEG-7 profile schemas
— Part 12: Query format
— Part 13: Compact descriptors for visual search
Introduction
This International Standard, also known as “Multimedia Content Description Interface,” provides a
standardized set of technologies for describing multimedia content. It addresses a broad spectrum of
multimedia applications and requirements by providing a metadata system for describing the features
of multimedia content.
The following are specified in this International Standard:
— Description schemes (DS) describe entities or relationships pertaining to multimedia content.
Description schemes specify the structure and semantics of their components, which may be
Description Schemes, descriptors, or datatypes.
— Descriptors (D) describe features, attributes, or groups of attributes of multimedia content.
— Datatypes are the basic reusable datatypes employed by description schemes and descriptors.
— Systems tools support delivery of descriptions, multiplexing of descriptions with multimedia
content, synchronization, file format, and so forth.
This International Standard is subdivided into 13 parts:
— Part 1 — Systems: specifies the tools for preparing descriptions for efficient transport and storage,
compressing descriptions, and allowing synchronization between content and descriptions.
— Part 2 — Description definition language: specifies the language for defining the International
Standard set of description tools (DSs, Ds, and datatypes) and for defining new description tools.
— Part 3 — Visual: specifies the description tools pertaining to visual content.
— Part 4 — Audio: specifies the description tools pertaining to audio content.
— Part 5 — Multimedia description schemes: specifies the generic description tools pertaining to
multimedia including audio and visual content.
— Part 6 — Reference software: provides a software implementation of the International Standard.
— Part 7 — Conformance testing: specifies the guidelines and procedures for testing conformance
of implementations of the International Standard.
— Part 8 — Extraction and use of MPEG-7 descriptions: provides guidelines and examples of the
extraction and use of descriptions.
— Part 9 — Profiles and levels: provides guidelines and standard profiles.
— Part 10 — Schema definition: specifies the schema using description definition language.
— Part 11 — Profile Schemas: listing of profile schemas using description definition language.
— Part 12 — Query format: contains the tools of the MPEG Query Format (MPQF).
— Part 13 — Compact descriptors for visual search: specifies an image description tool for visual
search applications.
INTERNATIONAL STANDARD ISO/IEC 15938-13:2015(E)
Information technology — Multimedia content
description interface —
Part 13:
Compact descriptors for visual search
1 Scope
The structure of this part of ISO/IEC 15938 is as follows. Clauses 2 and 3 specify the terms,
abbreviations, symbols, and conventions used in the International Standard. Clause 4 specifies the
binary representation syntax and descriptor component semantics for a CDVS image descriptor.
Clause 5 specifies the extraction and encoding process for a CDVS image descriptor. Annexes A-J specify
information relevant to the encoding process of Clause 5. Annex K contains an informative description
of the decoding process of a CDVS image descriptor.
This part of the MPEG-7 standard specifies an image description tool designed to enable efficient and
interoperable visual search applications, allowing visual content matching in images. Visual content
matching includes matching of views of objects, landmarks, and printed documents, while being robust
to partial occlusions as well as changes in viewpoint, camera parameters, and lighting conditions.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
image descriptor
descriptor extracted from one image
2.2
image descriptor length
size of an image descriptor in bytes
Note 1 to entry: This International Standard specifies six average (i.e. over a large number of images) image
descriptor lengths, i.e. 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, 8192 bytes, and 16384 bytes, and the
encoding process for each image descriptor length.
2.3
original image
input image to the image descriptor encoder
2.4
converted image
image which is a spatially resampled version of the original image and from which the image
descriptor is extracted
2.5
pixel
indexable element of the original image or the converted image, comprising spatial coordinates and a
luminance value
2.6
interest point
point in an image showing detection stability under local and global perturbations in the image domain,
including perspective transformations, changes in image scale, and illumination variations
2.7
local region
area in an image in the neighbourhood of an interest point, used to generate local feature descriptors
2.8
cell
each of the 4x4 subdivisions of a local region
2.9
cell histogram
histogram of gradients computed from the cell
2.10
local feature descriptor
descriptor of a local region, computed from the cell histograms
2.11
global descriptor
aggregation of local feature descriptors into a compact representation of the image
2.12
compressed local feature descriptor
compressed representation of a local feature descriptor
2.13
interest point coordinate
horizontal and vertical pixel coordinates indicating the position of an interest point in the converted
image resolution, rounded to the nearest integer
2.14
location quantization factor
size of the blocks of the spatial grid superimposed on top of the converted image in order to obtain
quantized interest point coordinates’ values
2.15
histogram map
binary representation of the converted image scaled down by the location quantization factor, indicating
whether each bin generated through the superimposition of the spatial grid on top of the converted
image is populated with at least one interest point
2.16
histogram count
vector indicating the number of interest points that populate each non-empty bin generated through
the superimposition of a spatial grid on top of the converted image
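Definitions 2.14 to 2.16 can be illustrated with a small Python sketch that builds a histogram map and a histogram count from integer interest point coordinates. The function name is hypothetical, and the row-major ordering of the count vector is an assumption made only for this illustration; the normative scanning order is the one described in 5.8.

```python
def histogram_map_and_count(points, width, height, block):
    """Return (histogram_map, histogram_count) for integer (x, y) points.

    block plays the role of the location quantization factor.
    """
    cols = -(-width // block)   # ceiling division
    rows = -(-height // block)
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        counts[y // block][x // block] += 1
    hist_map = [[1 if c else 0 for c in row] for row in counts]
    # Assumption: non-empty bins are listed in row-major order.
    hist_count = [c for row in counts for c in row if c]
    return hist_map, hist_count
```

For an 8 × 8 image with a quantization factor of 4 and points at (0,0), (1,1), (5,0) and (7,7), the map is [[1, 1], [0, 1]] and the count vector is [2, 1, 1].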
3 Symbols and abbreviated terms
3.1 General
NOTE The mathematical operators used in this part of ISO/IEC 15938 are similar to those used in the C
programming language. Unless otherwise indicated, all the arithmetic operations are performed with real
values. Numbering and counting conventions generally begin from 0.
3.2 Abbreviations
CDVS Compact Descriptors for Visual Search
LoG Laplacian-of-Gaussian
MPEG Moving Picture Experts Group
MPEG-7 ISO/IEC 15938
3.3 Arithmetic operations
+ Addition
- Subtraction (as a binary operator) or negation (as a unary operator)
++ Increment by 1, i.e. x++ is equivalent to x=x+1
-- Decrement by 1, i.e. x-- is equivalent to x=x-1
+= Increment by value, i.e. x+=y is equivalent to x=x+y
-= Decrement by value, i.e. x-=y is equivalent to x=x-y
* Multiplication (in binary representation syntax and pseudo-code) or convolution
(elsewhere)
× Multiplication
· Multiplication
/ Division
÷ Division
% Modulo operator
3.4 Logical operators
|| Logical OR
˅ Logical OR
&& Logical AND
˄ Logical AND
! Logical NOT
3.5 Relational operators
> Greater than
>= Greater than or equal to
≥ Greater than or equal to
< Less than
<= Less than or equal to
≤ Less than or equal to
== Equal to
!= Not equal to
3.6 Bitwise operators
| OR
& AND
3.7 Assignment
= Assignment operator
← Assignment operator
3.8 Mnemonics
The following mnemonics are defined to describe the different data types used in the coded bitstream.
bslbf Bit string, left bit first, where “left” is the order in which bits are written in the bit-
stream.
uimsbf Unsigned integer, most significant bit first.
vlclbf Variable length code, left bit first, where “left” refers to the order in which the VLC
codes are written in the bitstream and where the byte order of multibyte words is
most significant byte first.
3.9 Constants
π 3.141 592 653 58…
e 2.718 281 828 45…
3.10 Functions
$\log_n(\ )$ Base-n logarithm
max( ) Maximum value in argument list
min( ) Minimum value in argument list
sgn( ) Sign function, i.e. sgn(x) = -1, 0 or +1 when x < 0, x == 0 or x > 0, respectively
$|\cdot|$ Absolute value of scalar or a vector norm
$\lfloor \cdot \rfloor$ Floor function which returns the maximum integer number less than or equal to the
given real number
$\lceil \cdot \rceil$ Ceiling function which returns the minimum integer number greater than or equal to
the given real number
$\downarrow_{2 \times 2}$ Downsamples an image by keeping only the even rows and even columns of the
image, without anti-alias filtering
4 CDVS syntax
4.1 Binary representation syntax
CDVSDescriptor {                                            Number of bits   Mnemonics
  VersionID                                                 3                bslbf
  ModeID                                                    8                uimsbf
  GlobalHasBitSelection                                     1                bslbf
  GlobalHasVariance                                         1                bslbf
  RelevanceBitsPresent                                      1                bslbf
  ReservedBits                                              2                bslbf
  OriginalImageXResolution                                  16               uimsbf
  OriginalImageYResolution                                  16               uimsbf
  NumberOfLocalDescriptors                                  16               uimsbf
  if(NumberOfLocalDescriptors>0) {
    for(k=0; k<NumberOfGlobalFunctions; k++) {
      GlobalFunctionPresent[k]                              1                bslbf
    }
    if(GlobalHasBitSelection) {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionMeanVector[k]                       24               bslbf
        }
      }
    }
    else {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionMeanVector[k]                       32               bslbf
        }
      }
    }
    if(GlobalHasVariance) {
      for(k=0; k<NumberOfGlobalFunctions; k++) {
        if(GlobalFunctionPresent[k]) {
          GlobalFunctionVarianceVector[k]                   32               bslbf
        }
      }
    }
    HistogramCountSize                                      16               uimsbf
    HistogramMapSizeX                                       16               uimsbf
    HistogramMapSizeY                                       16               uimsbf
    HistogramCount (arithmetically coded block; see 5.8)    >=0              vlclbf
    HistogramMap (arithmetically coded block; see 5.8)      >=0              vlclbf
    NumberOfElementGroups                                   6                uimsbf
    for(k=0; k<NumberOfLocalDescriptors; k++) {
      for(n=0; n<(4*NumberOfElementGroups); n++) {
        LocalDescriptorElements[k][n]                       1-2              vlclbf
      }
    }
    if(RelevanceBitsPresent) {
      for(k=0; k<NumberOfLocalDescriptors; k++) {
        RelevanceBits[k]                                    1                bslbf
      }
    }
    BitStuffing                                             0-7              vlclbf
  }
}
VersionID = 1
NumberOfGlobalFunctions = 512
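The fixed-length fields at the start of the syntax table add up to 64 bits (3 + 8 + 1 + 1 + 1 + 2 + 16 + 16 + 16). The following Python sketch packs them most significant bit first, purely as an illustration. The function name `pack_header` and its parameter names are hypothetical, and the sketch covers only this fixed part of the descriptor.

```python
def pack_header(mode_id, x_res, y_res, n_local,
                bit_selection=0, has_variance=0, relevance_present=0,
                version_id=1):
    """Pack the first nine syntax elements of CDVSDescriptor into 8 bytes."""
    fields = [
        (version_id, 3), (mode_id, 8),
        (bit_selection, 1), (has_variance, 1), (relevance_present, 1),
        (0, 2),  # ReservedBits, both set to 0
        (x_res, 16), (y_res, 16), (n_local, 16),
    ]
    value = 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit in %d bits" % width)
        value = (value << width) | field  # earlier fields end up in higher bits
    return value.to_bytes(8, "big")
```

For VersionID = 1, ModeID = 1, all flags cleared, a 640 × 480 original and no local descriptors, the packed bytes are 20 20 02 80 01 E0 00 00 in hexadecimal.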
4.2 Descriptor component semantics
VersionID
This descriptor component specifies the CDVSDescriptor version. In this International Standard, VersionID = 1.
ModeID
This descriptor component specifies the image descriptor length. There are six image descriptor
lengths, and their corresponding ModeID values are shown in Table 1 below.
Table 1 — ModeID values for the six image descriptor lengths
Image descriptor length ModeID
512 bytes 1
1024 bytes 2
2048 bytes 3
4096 bytes 4
8192 bytes 5
16384 bytes 6
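Since each image descriptor length in Table 1 is twice the previous one, the mapping can be written compactly. The Python sketch below is an illustration only, with hypothetical function names.

```python
def descriptor_length(mode_id: int) -> int:
    """Image descriptor length in bytes for a ModeID of Table 1."""
    if not 1 <= mode_id <= 6:
        raise ValueError("ModeID must be in 1..6")
    return 512 << (mode_id - 1)

def mode_id_for_length(length_bytes: int) -> int:
    """Inverse lookup; raises ValueError for a length not listed in Table 1."""
    for mode_id in range(1, 7):
        if descriptor_length(mode_id) == length_bytes:
            return mode_id
    raise ValueError("length is not one of the six image descriptor lengths")
```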
GlobalHasBitSelection
This descriptor component specifies whether bit selection is applied or not to the
GlobalFunctionMeanVector of each of the Gaussian functions which are present in the global
descriptor of an image descriptor. If GlobalHasBitSelection == 1 then bit selection is applied, and if
GlobalHasBitSelection == 0 then bit selection is not applied. More details are provided in 5.6.
GlobalHasVariance
This descriptor component specifies whether the GlobalFunctionVarianceVector of each of the Gaussian
functions which are present in the global descriptor of an image descriptor appears in the bitstream
or not. If GlobalHasVariance == 1 then GlobalFunctionVarianceVector appears in the bitstream, and if
GlobalHasVariance == 0 then GlobalFunctionVarianceVector does not appear in the bitstream. More
details are provided in 5.6.
RelevanceBitsPresent
This descriptor component specifies if a relevance bit for each compressed local feature descriptor
is present in the bitstream. If RelevanceBitsPresent == 1 then the relevance bits are present in the
bitstream, and if RelevanceBitsPresent == 0 then the relevance bits are not present in the bitstream.
More details are provided in 5.4.
ReservedBits
This descriptor component comprises two bits which are reserved for future use and they shall
both be set to 0.
OriginalImageXResolution
This descriptor component specifies the width (in pixels) of the original image.
OriginalImageYResolution
This descriptor component specifies the height (in pixels) of the original image.
NumberOfLocalDescriptors
This descriptor component specifies the number of compressed local feature descriptors which are
present in the bitstream. More details are provided in 5.10. NumberOfLocalDescriptors == 0 indicates
that no local features were identified in the image.
NumberOfGlobalFunctions
This descriptor component specifies the maximum number of Gaussian functions used in the global
descriptor and has a value NumberOfGlobalFunctions = 512. More details are provided in 5.6.
GlobalFunctionPresent
This descriptor component specifies a 1-D array of size NumberOfGlobalFunctions indicating which
Gaussian functions are present in the global descriptor of a particular image descriptor. If a Gaussian
function is present in the global descriptor the corresponding value in the array is 1, otherwise it is 0.
More details are provided in 5.6.
GlobalFunctionMeanVector
This descriptor component specifies a 1-D array of size equal to the number of Gaussian functions
which are present in the global descriptor, i.e. those Gaussian functions with a corresponding value of
1 in GlobalFunctionPresent. Each entry in the array is the binarized mean vector of the corresponding
global descriptor Gaussian function, and the length of each vector is 24 bits if GlobalHasBitSelection
== 1 and 32 bits if GlobalHasBitSelection == 0. More details are provided in 5.6.
GlobalFunctionVarianceVector
This descriptor component specifies a 1-D array of size equal to the number of Gaussian functions
which are present in the global descriptor, i.e. those Gaussian functions with a corresponding value of 1
in GlobalFunctionPresent. Each entry in the array is the binarized variance vector of the corresponding
global descriptor Gaussian function. More details are provided in 5.6.
HistogramCountSize
This descriptor component specifies the histogram count vector length for location coding. More details
are provided in 5.8.
HistogramMapSizeX
This descriptor component specifies the horizontal x resolution of the histogram map for location
coding. More details are provided in 5.8.
HistogramMapSizeY
This descriptor component specifies the vertical y resolution of the histogram map for location coding.
More details are provided in 5.8.
HistogramCount
This descriptor component specifies a vector for location coding, containing the number of non-zero
elements for each non-null block of the histogram map. More details are provided in 5.8.
HistogramMap
This descriptor component specifies a 2D-array for location coding, containing a block representation
of the converted image. Each block can assume a binary value, indicating the occurrence or not of
interest points within that block. The array is scanned according to a procedure described in 5.8. The
scanning terminates when all the non-null elements of the histogram map are encoded. More details
are provided in 5.8.
NumberOfElementGroups
This descriptor component specifies the number of element groups in each compressed local feature
descriptor. Each element group contains four elements and the number of elements in each compressed
local feature descriptor is given by 4×NumberOfElementGroups. More details are provided in 5.7.
LocalDescriptorElements
This descriptor component specifies a 2-D array of compressed local feature descriptor elements.
The size of the first dimension is NumberOfLocalDescriptors and the size of the second dimension is
4×NumberOfElementGroups. LocalDescriptorElements[k][n] is the n-th element of the k-th compressed
local feature descriptor. For each compressed local feature descriptor, its elements are ordered as
described in 5.7.
The compressed local feature descriptors themselves are ordered as described in 5.9.
RelevanceBits
This descriptor component specifies a 1-D array of size NumberOfLocalDescriptors indicating which
compressed local feature descriptors correspond to the top 300 local features as determined in 5.4. If
the k-th local feature is one of the top 300 local features, then RelevanceBits[k] is set to 1, otherwise it is
set to 0. If NumberOfLocalDescriptors<300, then all the values in RelevanceBits are set to 1. More details
are provided in 5.4.
The relevance bits are ordered in the same order as the descriptors in LocalDescriptorElements, as
described in 5.9.
BitStuffing
This descriptor component specifies stuffing bits (a sequence of ‘1’s) to align the descriptor to a
byte boundary.
5 CDVS encoding
5.1 General
This clause specifies the encoder operations for computing an image descriptor. A simplified diagram of
a complete CDVS encoder implementing these encoding operations is presented in informative Annex A.
5.2 Original image preprocessing
The original image is a luminance raster image containing values in the interval [0, 255] where
increasing values correspond to increasing luminance. The exact mapping of luminance values within
this interval is beyond the scope of the standard. If at least one of the dimensions of the original image
is greater than 640 pixels then the original image shall be spatially resampled, maintaining the aspect
ratio, so that the largest of the vertical and horizontal image dimensions is equal to 640 pixels, to obtain
a converted image J(x, y), in which $x \in \{0, \ldots, X-1\}$ and $y \in \{0, \ldots, Y-1\}$ are the horizontal and vertical
pixel coordinates respectively, X and Y the pixel horizontal and vertical image dimensions respectively,
and with coordinates (0,0) located at the top left corner of the image. For this resampling operation, a
Lanczos filter with a = 3 should be used. If both the dimensions of the original image are no greater
than 640 pixels, then no spatial resampling is performed and the content of the converted image shall
be the same as the content of the original image.
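The size rule of 5.2 can be illustrated with a minimal Python sketch. The function name `converted_size` is hypothetical, and rounding the shorter side to the nearest integer is an assumption, since the text above only fixes the longer side at 640 pixels and requires the aspect ratio to be maintained. The resampling itself (the Lanczos filter with a = 3) is not shown.

```python
MAX_DIMENSION = 640  # largest allowed image side, as stated in 5.2


def converted_size(width, height, max_dim=MAX_DIMENSION):
    """Return (X, Y), the dimensions of the converted image J.

    Rounding of the shorter side is an assumption of this sketch.
    """
    largest = max(width, height)
    if largest <= max_dim:
        # Both dimensions are no greater than 640: no resampling.
        return width, height
    scale = max_dim / largest
    if width >= height:
        return max_dim, max(1, round(height * scale))
    return max(1, round(width * scale)), max_dim
```

For example, a 1280 × 960 original would map to a 640 × 480 converted image under these assumptions, while a 600 × 400 original is left unchanged.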
5.3 Interest point detection
5.3.1 Introduction
This operation is performed using the ALP (A Low-degree Polynomial) detector. In order to find interest
points, ALP approximates the result of the LoG filtering by means of polynomials, used to find extrema
in the scale space and to refine the spatial position of the detected points.
5.3.2 Scale space construction
Let g denote the Gaussian kernel in two dimensions with positive scale parameter σ
$$g(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \tag{1}$$
The filtering operations shall be done at 4 scales with values for the σ parameter in an exponentially
increasing sequence
$$\sigma_k = \sigma_0 \cdot 2^{k/2}, \quad k = 0, \ldots, 3 \tag{2}$$
as provided in Table 2 below.
Table 2 — Values of the scale parameter
k $\sigma_k$
0 1,600000
1 2,262742
2 3,200000
3 4,525483
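As a quick check of Table 2, the scale sequence of equation (2) can be reproduced in a few lines of Python. The names `SIGMA_0`, `scale` and `SCALES` are illustrative only and do not appear in the standard.

```python
SIGMA_0 = 1.6  # first entry of Table 2


def scale(k, sigma_0=SIGMA_0):
    """Scale parameter sigma_k = sigma_0 * 2^(k/2) from equation (2)."""
    return sigma_0 * 2.0 ** (k / 2.0)


# The four scales used within one octave, k = 0..3.
SCALES = [scale(k) for k in range(4)]
```

Rounded to six decimals, `SCALES` agrees with the values listed in Table 2.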
Interest points shall be identified by means of the scale-normalized Laplacian-of-Gaussian (LoG) kernel,
which is realized as the convolution
$$h(\cdot,\cdot,\sigma) = \sigma^{2} \cdot \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \ast g(\cdot,\cdot,\sigma) \tag{3}$$
where g in this case is a truncated and spatially discrete Gaussian function, with width equal to
$2 \cdot \lceil 4\sigma \rceil + 1$, where $\lceil\cdot\rceil$ denotes the ceiling function.
For the converted image J(x, y) in which $x \in \{0, \ldots, X-1\}$ and $y \in \{0, \ldots, Y-1\}$ are the horizontal and
vertical pixel coordinates respectively, X and Y the pixel horizontal and vertical image dimensions
respectively, and with coordinates (0,0) located at the top left corner of the image J(x, y), scale space
shall be constructed as follows.
The image shall be processed in a scale space representation obtained by Gaussian blur with different
scale factors σ. The scale space shall be structured in a number Q of octaves,
$$Q = \max\{\lfloor \log_{2}(\max\{X, Y\}) \rfloor - 3,\ 1\} \tag{4}$$
with $\lfloor\cdot\rfloor$ denoting the floor function.
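The octave count of equation (4) can be sketched as follows. The function name `number_of_octaves` is hypothetical. The sketch uses the integer bit length to obtain the floor of the base-2 logarithm, which avoids floating-point rounding for exact powers of two; this is an implementation choice of the sketch, not something the text prescribes.

```python
def number_of_octaves(X, Y):
    """Number of octaves Q for a converted image of X by Y pixels."""
    # For a positive integer n, floor(log2(n)) equals n.bit_length() - 1.
    floor_log2 = max(X, Y).bit_length() - 1
    return max(floor_log2 - 3, 1)
```

A 640 × 480 converted image would thus be processed in 6 octaves, and very small images in a single octave.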
For each octave in scale space, 4 images shall be produced by filtering of a first image I with a Gaussian
kernel. In any octave, these images shall be obtained by the following filtering operations
$$\begin{aligned} I_0 &= I \\ I_1 &= I_0 \ast g(\delta_1) \\ I_2 &= I_0 \ast g(\delta_2) \\ I_3 &= I_0 \ast g(\delta_3) \end{aligned} \tag{5}$$
with the parameter $\delta_n = \sqrt{\sigma_n^{2} - \sigma_0^{2}}$ for n = 1,…,3. The first image in the first octave shall be obtained as
$$I = J \ast g(\sigma_0) \tag{6}$$
and in all other octaves the first image shall be obtained by downsampling
$$I = \downarrow_{2\times}\!\left(I_2^{\text{prev}}\right) \tag{7}$$
where $I_2^{\text{prev}}$ denotes image $I_2$ in the previous octave. Anti-alias filtering shall not be applied since the
downsampling is applied to images which are already low-pass filtered.
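The incremental blur parameters follow from the fact that cascaded Gaussian filters add in variance. The short sketch below computes them from the scales of Table 2; the names `SCALES` and `incremental_sigmas` are illustrative assumptions.

```python
import math

# Scales sigma_0..sigma_3 of Table 2, regenerated from sigma_0 = 1.6.
SCALES = [1.6 * 2.0 ** (k / 2.0) for k in range(4)]


def incremental_sigmas(scales=SCALES):
    """Return [delta_1, delta_2, delta_3] with delta_n = sqrt(sigma_n^2 - sigma_0^2)."""
    sigma_0 = scales[0]
    return [math.sqrt(s * s - sigma_0 * sigma_0) for s in scales[1:]]
```

Since the third scale is twice the first, halving the resolution of the corresponding image yields an image whose blur, measured in pixels of the new octave, is again the first scale. This is consistent with the statement that no additional anti-alias filtering is applied.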
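Additionally, in any octave 4 images shall be produced by scale-normalized Laplacian filtering of the
Gaussian-filtered images
$$\begin{aligned} L_0 &= \sigma_0^{2} \cdot I_0 \ast f, & L_1 &= \sigma_1^{2} \cdot I_1 \ast f \\ L_2 &= \sigma_2^{2} \cdot I_2 \ast f, & L_3 &= \sigma_3^{2} \cdot I_3 \ast f \end{aligned} \tag{8}$$
where $f = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$ is the discrete Laplacian operator.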
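A minimal pure-Python sketch of the scale-normalized Laplacian filtering is given below. Images are assumed to be lists of rows of floats, and the border handling (replicating the nearest edge pixel) is an assumption of the sketch, since the text above does not specify it. The function names are illustrative.

```python
def laplacian(image):
    """Apply the 3x3 discrete Laplacian f to a 2-D image (list of rows)."""
    rows, cols = len(image), len(image[0])

    def at(r, c):
        # Replicate border pixels (an assumption of this sketch).
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return image[r][c]

    return [[at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1)
             - 4.0 * at(r, c)
             for c in range(cols)]
            for r in range(rows)]


def scale_normalized_log(gaussian_images, scales):
    """Return the images L_k = sigma_k^2 * (I_k convolved with f)."""
    return [[[s * s * value for value in row] for row in laplacian(image)]
            for image, s in zip(gaussian_images, scales)]
```

Because f is symmetric, convolution and correlation coincide, so the sketch does not need to flip the kernel.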
5.3.3 Detection of scale-space extrema
For each octave, two intervals in scale are defined. One is the outer interval $\Omega_{\text{out}}$ and it shall contain a
smaller one called the inner interval $\Omega_{\text{in}}$.
The outer interval has the lowest and highest scales $\sigma_0$ and $\sigma_3$ as boundaries, $\Omega_{\text{out}} = [1.6, 4.525483]$, and
the inner interval has the boundaries $\Omega_{\text{in}} = [1.7, 4.0]$.
For each pixel (x, y) in the image, a polynomial approximation to the scale-space function
$$p(x, y, \sigma) = \alpha_3(x, y)\,\sigma^{3} + \alpha_2(x, y)\,\sigma^{2} + \alpha_1(x, y)\,\sigma + \alpha_0(x, y) \tag{9}$$
shall be searched for a local extremum over the outer interval $\Omega_{\text{out}}$. The coefficients shall be obtained by
computing weighted sums of the images $L_0, \ldots, L_3$ (K = 4)
$$\begin{aligned} \alpha_3(x, y) &= \sum_{k=0}^{K-1} a_k \cdot L_k(x, y) \\ \alpha_2(x, y) &= \sum_{k=0}^{K-1} b_k \cdot L_k(x, y) \\ \alpha_1(x, y) &= \sum_{k=0}^{K-1} c_k \cdot L_k(x, y) \\ \alpha_0(x, y) &= \sum_{k=0}^{K-1} d_k \cdot L_k(x, y) \end{aligned} \tag{10}$$
where the coefficients $a_k$, $b_k$, $c_k$, $d_k$, corresponding to the 4 predefined scales $\sigma_k$, k = 0,…,3, are listed in
Table 3.
Table 3 — Coefficients for the equations for polynomial approximation
k $a_k$ $b_k$ $c_k$ $d_k$
0 -0,2464 2,5021 -8,2007 8,6432
1 0,4934 -4,5636 12,9824 -10,8424
2 -0,2717 2,0108 -4,0449 2,1204
3 0,0140 0,1549 -1,0565 1,3886
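The weighted sums of equation (10) can be sketched per pixel as follows, using the values of Table 3. The names `TABLE_3`, `polynomial_coefficients` and `evaluate` are illustrative and not part of the standard; the input is assumed to be the four LoG responses at a single pixel.

```python
# Rows k = 0..3 of Table 3, each holding (a_k, b_k, c_k, d_k).
TABLE_3 = [
    (-0.2464, 2.5021, -8.2007, 8.6432),
    (0.4934, -4.5636, 12.9824, -10.8424),
    (-0.2717, 2.0108, -4.0449, 2.1204),
    (0.0140, 0.1549, -1.0565, 1.3886),
]


def polynomial_coefficients(L):
    """Map the responses (L_0, L_1, L_2, L_3) at one pixel to
    (alpha_3, alpha_2, alpha_1, alpha_0) as in equation (10)."""
    return tuple(sum(row[j] * L_k for row, L_k in zip(TABLE_3, L))
                 for j in range(4))


def evaluate(alphas, sigma):
    """Evaluate the cubic p of equation (9) at scale sigma (Horner form)."""
    a3, a2, a1, a0 = alphas
    return ((a3 * sigma + a2) * sigma + a1) * sigma + a0
```

In a full detector the same sums would be taken over whole images rather than single pixels; the per-pixel form is used here only to keep the sketch short.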
In this manner, the polynomial approximation is obtained by filtering the original image with a
weighted sum of Laplacian-of-Gaussian filters
$$\sum_{k=0}^{3} \sigma_k^{2} \cdot f \ast \left(a_k\sigma^{3} + b_k\sigma^{2} + c_k\sigma + d_k\right) \cdot g(\sigma_k) \tag{11}$$
where each of the 4 weights is a polynomial in σ, as illustrated in Figure 1.
Figure 1 — Polynomial weights for approximating the scale-space function
The coefficients are computed by minimizing the approximation error
$$f \ast g(\sigma) - \sum_{k=0}^{3} \left(a_k\sigma^{3} + b_k\sigma^{2} + c_k\sigma + d_k\right) \cdot f \ast g(\sigma_k) \tag{12}$$
over a set of scales contained within the outer interval. Figure 2 depicts a Laplacian-of-Gaussian filter
with σ = 2.5 and its approximation.
Figure 2 — Exact and approximated Laplacian-of-Gaussian filters at scale 2.5
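A tentative scale $\sigma^{*}(x, y)$ shall be associated to each pixel location (x, y) as the most extreme over the
outer interval $\Omega_{\text{out}}$,
$$\sigma^{*}(x, y) = \arg\max_{\sigma \in \Omega_{\text{out}}} p(x, y, \sigma) \tag{13}$$
or
$$\sigma^{*}(x, y) = \arg\min_{\sigma \in \Omega_{\text{out}}} p(x, y, \sigma) \tag{14}$$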
whichever of the two alternatives has the greatest absolute value. Therefore, for all pixels (x, y) such that
$$\alpha_2(x, y)^{2} - 3\,\alpha_1(x, y)\,\alpha_3(x, y) > 0 \tag{15}$$
only the ones for which
$$\sigma^{*}(x, y) = \frac{-\alpha_2(x, y) + \sqrt{\alpha_2(x, y)^{2} - 3\,\alpha_1(x, y)\,\alpha_3(x, y)}}{3\,\alpha_3(x, y)} \in \Omega_{\text{out}} \tag{16}$$
or
$$\sigma^{*}(x, y) = \frac{-\alpha_2(x, y) - \sqrt{\alpha_2(x, y)^{2} - 3\,\alpha_1(x, y)\,\alpha_3(x, y)}}{3\,\alpha_3(x, y)} \in \Omega_{\text{out}} \tag{17}$$
shall be considered. Among the considered pixels, those with positive solutions $\sigma^{*}$ larger than the
polynomial at any boundary of the outer interval
$$p(x, y, \sigma^{*}) > \max\{p(x, y, \sigma_0),\ p(x, y, \sigma_3)\} \tag{18}$$
as well as all pixels with negative solutions $\sigma^{*}$ smaller than the polynomial at any boundary of the
outer interval
$$p(x, y, \sigma^{*}) < \min\{p(x, y, \sigma_0),\ p(x, y, \sigma_3)\} \tag{19}$$
shall be accepted as candidates, forming triples $\{x, y, \sigma^{*}(x, y)\}$. The other candidates are eliminated
from further processing in the present octave. This mechanism also eliminates solutions that are not
local extrema, i.e. with second derivative equal to zero.
Those candidates for which the solutions $\sigma^{*}$ are within the inner interval, $\sigma^{*}(x, y) \in \Omega_{\text{in}}$, shall be
subjected to further processing, the other candidates being eliminated from further processing in the
present octave.
Thereafter, any remaining candidate $\{x, y, \sigma^{*}(x, y)\}$ is eliminated from further processing in the
present octave if the absolute value of the polynomial is below a first threshold equal to 0.4, i.e.
$$\left| p(x, y, \sigma^{*}(x, y)) \right| < \theta \quad \text{with } \theta = 0.4 \tag{20}$$
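The per-pixel selection of a tentative scale, from the discriminant test of (15) up to the first threshold of (20), can be sketched as below. This is an illustrative reading of the text, not the reference implementation: the function and constant names are hypothetical, and returning no candidate when the cubic coefficient is exactly zero is an assumption made to avoid a division by zero.

```python
import math

SIGMA_LOW, SIGMA_HIGH = 1.6, 4.525483  # boundaries of the outer interval
INNER_LOW, INNER_HIGH = 1.7, 4.0       # boundaries of the inner interval
THETA = 0.4                            # first threshold, equation (20)


def p(alphas, sigma):
    """Cubic polynomial of equation (9) at one pixel."""
    a3, a2, a1, a0 = alphas
    return ((a3 * sigma + a2) * sigma + a1) * sigma + a0


def tentative_scale(alphas):
    """Return sigma* for one pixel, or None if the pixel is eliminated."""
    a3, a2, a1, a0 = alphas
    disc = a2 * a2 - 3.0 * a1 * a3                 # condition (15)
    if a3 == 0.0 or disc <= 0.0:                   # a3 == 0: assumption
        return None
    root = math.sqrt(disc)
    low_value, high_value = p(alphas, SIGMA_LOW), p(alphas, SIGMA_HIGH)
    best = None
    for sigma in ((-a2 + root) / (3.0 * a3),       # equation (16)
                  (-a2 - root) / (3.0 * a3)):      # equation (17)
        if not SIGMA_LOW <= sigma <= SIGMA_HIGH:
            continue
        value = p(alphas, sigma)
        is_max = value > max(low_value, high_value)    # condition (18)
        is_min = value < min(low_value, high_value)    # condition (19)
        if (is_max or is_min) and (best is None or abs(value) > abs(best[1])):
            best = (sigma, value)
    if best is None:
        return None
    sigma, value = best
    if not INNER_LOW <= sigma <= INNER_HIGH:       # inner interval test
        return None
    if abs(value) < THETA:                         # threshold test (20)
        return None
    return sigma
```

As a usage example, the cubic with coefficients (0.1, -1.875, 7.5, 0.0) has stationary points at 2.5 and 10; only the first lies in the outer interval, it is a maximum that exceeds both boundary values, and it survives the inner interval and threshold tests.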
Thereafter, any remaining candidate $\{x, y, \sigma^{*}(x, y)\}$ is eliminated from further processing in the
present octave if the second derivative of the scale space function with regard to σ is below a second
threshold set to 0.4, i.e.
$$\frac{\partial^{2}}{\partial\sigma^{2}}\, p(x, y, \sigma^{*}(x, y)) = 6\,\alpha_3(x, y)\,\sigma^{*} + 2\,\alpha_2(x, y) < \theta_2 \quad \text{with } \theta_2 = 0.4 \tag{21}$$
Thereafter, any remaining candidate $\{x, y, \sigma^{*}(x, y)\}$ is eliminated from further processing in the
present octave if the polynomial value $p(x, y, \sigma^{*}(x, y))$ is surpassed by the polynomial value of any
remaining candidates among its 8-neighbours. Specifically, for any $m \in \{-1, 0, 1\}$ and any $n \in \{-1, 0, 1\}$
excluding the combination $(m, n) = (0, 0)$, if
$$p(x, y, \sigma^{*}(x, y)) \le p(x+m, y+n, \sigma);\ \sigma \in \Omega \quad \text{when } p(x, y, \sigma^{*}(x, y)) \text{ is a maximum} \tag{22}$$
or
$$p(x, y, \sigma^{*}(x, y)) \ge p(x+m, y+n, \sigma);\ \sigma \in \Omega \quad \text{when } p(x, y, \sigma^{*}(x, y)) \text{ is a minimum} \tag{23}$$
then the candidate is eliminated.
The remaining candidates $\{x, y, \sigma^{*}(x, y)\}$ are input to the next processing step.
5.3.4 Coordinate refinement to subpixel precision
For the position refinement 9 pre-defined positions shall be used: all 9 combinations of $u \in \{-1, 0, 1\}$ and
$v \in \{-1, 0, 1\}$, corresponding to shifts of the LoG kernels in the xy plane.
Firstly, candidates at local edges in the polynomial $p(x, y, \sigma^{*}(x, y))$ shall be eliminated by the
following test:
The 3 × 3 pixels around any candidate are computed at the scale $\sigma^{*}(x, y)$ of the candidate
$$P(x, y, \sigma^{*}(x, y)) = \begin{bmatrix} p(x-1, y-1, \sigma^{*}(x, y)) & p(x, y-1, \sigma^{*}(x, y)) & p(x+1, y-1, \sigma^{*}(x, y)) \\ p(x-1, y, \sigma^{*}(x, y)) & p(x, y, \sigma^{*}(x, y)) & p(x+1, y, \sigma^{*}(x, y)) \\ p(x-1, y+1, \sigma^{*}(x, y)) & p(x, y+1, \sigma^{*}(x, y)) & p(x+1, y+1, \sigma^{*}(x, y)) \end{bmatrix} \tag{24}$$
For these pixels, three quantities shall be computed
$$\begin{aligned} p_{xx} &= P_{21} - 2P_{22} + P_{23}, \\ p_{yy} &= P_{12} - 2P_{22} + P_{32}, \\ p_{xy} &= \frac{P_{11} + P_{33} - P_{31} - P_{13}}{4} \end{aligned} \tag{25}$$
where P is shorthand for $P(x, y, \sigma^{*}(x, y))$ and $P_{ij}$ denotes the element in row i and column j of P. The
candidate is eliminated if the following quantity ρ exceeds a threshold equal to 12, i.e.
$$\rho = \frac{(p_{xx} + p_{yy})^{2}}{p_{xx} \cdot p_{yy} - p_{xy}^{2}} > \theta \quad \text{with } \theta = 12 \tag{26}$$
This number is the ratio of the squared trace of the Hessian (at the scale and location of the interest
point) and the determinant of the same Hessian. It is related to the ratio of principal curvatures r as
$\rho = (r + 1)^{2}/r$.
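The edge test can be sketched in Python as follows, taking ρ as the squared trace of the Hessian divided by its determinant, as the text above describes it. The 3 × 3 neighbourhood P is assumed to be a list of three rows, so that the element in row i and column j is `P[i-1][j-1]`. The division of the mixed difference by 4 is the usual central-difference normalisation and is assumed here. Treating a non-positive determinant as an edge is also an assumption of the sketch, made to avoid dividing by zero. The function names are hypothetical.

```python
import math

EDGE_THRESHOLD = 12.0  # threshold stated for the ratio rho


def edge_ratio(P):
    """Squared trace over determinant of the 2x2 Hessian estimated from P."""
    p_xx = P[1][0] - 2.0 * P[1][1] + P[1][2]
    p_yy = P[0][1] - 2.0 * P[1][1] + P[2][1]
    p_xy = (P[0][0] + P[2][2] - P[2][0] - P[0][2]) / 4.0  # 1/4: assumption
    det = p_xx * p_yy - p_xy * p_xy
    if det <= 0.0:
        # Degenerate or saddle-like neighbourhood (assumption: reject).
        return math.inf
    return (p_xx + p_yy) ** 2 / det


def is_edge(P):
    """True if the candidate would be eliminated by the edge test."""
    return edge_ratio(P) > EDGE_THRESHOLD
```

For an isotropic blob both principal curvatures are equal (r = 1), so the ratio equals 4 and the candidate is kept; a straight ridge has one vanishing curvature and is rejected.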
For each remaining candidate $(x, y, \sigma^{*})$, a polynomial approximation to the scale-space function in the
displacement parameters u, v
$$\begin{aligned} q(u, v; x, y, \sigma^{*}) ={} & \beta_5(x, y, \sigma^{*})\,u^{2} + \beta_4(x, y, \sigma^{*})\,v^{2} + \beta_3(x, y, \sigma^{*})\,uv \\ & + \beta_2(x, y, \sigma^{*})\,u + \beta_1(x, y, \sigma^{*})\,v + \beta_0(x, y, \sigma^{*}) \end{aligned} \tag{27}$$
shall be searched for a local extremum. The coefficients are derived from the matrix $P(x, y, \sigma^{*}(x, y))$
associated to the candidate $(x, y, \sigma^{*})$, as in the previous equations. Any coefficient is a weighted sum,
in which the correspondence between term number k and the row number i and column number j is
given by Table 4, found below.
Table 4 — Mapping from term number k to row number i(k) and column number j(k)
k i(k) j(k)
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
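The coefficients for the candidate $(x, y, \sigma^{*})$ are thus given by weighted sums with K = 9
$$\begin{aligned} \beta_5(x, y, \sigma^{*}) &= \sum_{k=1}^{K} a_k \cdot P_{i(k),j(k)} \\ \beta_4(x, y, \sigma^{*}) &= \sum_{k=1}^{K} b_k \cdot P_{i(k),j(k)} \\ \beta_3(x, y, \sigma^{*}) &= \sum_{k=1}^{K} c_k \cdot P_{i(k),j(k)} \\ \beta_2(x, y, \sigma^{*}) &= \sum_{k=1}^{K} d_k \cdot P_{i(k),j(k)} \\ \beta_1(x, y, \sigma^{*}) &= \sum_{k=1}^{K} e_k \cdot P_{i(k),j(k)} \\ \beta_0(x, y, \sigma^{*}) &= \sum_{k=1}^{K} f_k \cdot P_{i(k),j(k)} \end{aligned} \tag{28}$$
where $P_{i(k),j(k)}$ is shorthand for element i(k), j(k) of the matrix $P(x, y, \sigma^{*}(x, y))$.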
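The weighted sums of equation (28), together with the term ordering of Table 4, can be sketched as below. The actual weights are those of the normative Annex B and are not reproduced here: `coefficient_set` is a hypothetical dictionary mapping the letters 'a' to 'f' to lists of nine weights, and any values passed to it in an example are placeholders only.

```python
# Table 4: term number k = 1..9 mapped to (row i, column j), both 1-based.
TABLE_4 = [(1, 1), (2, 1), (3, 1),
           (1, 2), (2, 2), (3, 2),
           (1, 3), (2, 3), (3, 3)]


def weighted_sum(weights, P):
    """Sum of w_k * P[i(k)][j(k)] over k = 1..9, following Table 4."""
    return sum(w * P[i - 1][j - 1] for w, (i, j) in zip(weights, TABLE_4))


def beta_coefficients(coefficient_set, P):
    """Return (beta_5, beta_4, beta_3, beta_2, beta_1, beta_0) as in (28).

    coefficient_set maps 'a'..'f' to nine weights each (hypothetical layout).
    """
    return tuple(weighted_sum(coefficient_set[name], P) for name in "abcdef")
```

Table 4 enumerates the neighbourhood column by column, so term k = 2 refers to row 2, column 1 of P rather than row 1, column 2.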
The coefficients $a_k$, $b_k$, $c_k$, $d_k$, $e_k$, $f_k$ are dependent on scale; there are 4 sets and the one corresponding
to the nearest neighbor to the scale $\sigma^{*}$ among $\sigma_0, \ldots, \sigma_3$ shall be used. Normative Annex B provides the
coefficient sets.
NOTE The coefficients are distinct from those contained in Table 3.
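Selecting the coefficient set amounts to a nearest-neighbour lookup among the four scales of Table 2. The sketch below assumes that "nearest" is measured by absolute difference in σ; the text does not state the distance measure, so this is an assumption, and the names are illustrative.

```python
# The four predefined scales sigma_0..sigma_3 of Table 2.
SCALES = (1.600000, 2.262742, 3.200000, 4.525483)


def nearest_scale_index(sigma_star, scales=SCALES):
    """Index k of the predefined scale closest to the candidate scale."""
    return min(range(len(scales)), key=lambda k: abs(scales[k] - sigma_star))
```

With this measure, a candidate at scale 2.5 would use the set for k = 1, and a candidate at scale 4.0 the set for k = 3.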
The polynomial q may be written (in shorthand, omitting the variables $x, y, \sigma^{*}$) as
$$q(u, v) = \sum_{k=1}^{K} \left(a_k u^{2} + b_k v^{2} + c_k uv + d_k u + e_k v + f_k\right) \cdot P_{i(k),j(k)} \tag{29}$$
The coefficients form a polynomial that provides an interpolation between shifted kernels. Indeed, the
coefficients are computed to minimize the approximation error in
K
()au ++bv cuvd++ue vf+⋅
k
...







