ISO/IEC TR 15938-8:2002/Amd 3:2007
Information technology — Multimedia content description interface — Part 8: Extraction and use of MPEG-7 descriptions — Amendment 3: Technologies for digital photo management using MPEG-7 visual tools
INTERNATIONAL STANDARD ISO/IEC TR 15938-8
First edition 2002-12-15
AMENDMENT 3 2007-12-15
Information technology — Multimedia
content description interface —
Part 8:
Extraction and use of MPEG-7
descriptions
AMENDMENT 3: Technologies for digital
photo management using MPEG-7 visual
tools
Technologies de l'information — Interface de description du contenu
multimédia —
Partie 8: Extraction et utilisation des descriptions MPEG-7
AMENDEMENT 3: Technologies pour la gestion des photos
numériques à l'aide des outils visuels MPEG-7
Reference number: ISO/IEC TR 15938-8:2002/Amd.3:2007(E)
© ISO/IEC 2007
© ISO/IEC 2007
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
In exceptional circumstances, the joint technical committee may propose the publication of a Technical Report
of one of the following types:
— type 1, when the required support cannot be obtained for the publication of an International Standard,
despite repeated efforts;
— type 2, when the subject is still under technical development or where for any other reason there is the
future but not immediate possibility of an agreement on an International Standard;
— type 3, when the joint technical committee has collected data of a different kind from that which is
normally published as an International Standard (“state of the art”, for example).
Technical Reports of types 1 and 2 are subject to review within three years of publication, to decide whether
they can be transformed into International Standards. Technical Reports of type 3 do not necessarily have to
be reviewed until the data they provide are considered to be no longer valid or useful.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
Amendment 3 to ISO/IEC TR 15938-8:2002 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
Information technology — Multimedia content description
interface —
Part 8:
Extraction and use of MPEG-7 descriptions
AMENDMENT 3: Technologies for digital photo management using
MPEG-7 visual tools
Add after subclause 4.2.3.3:
4.2.3.4 Dominant Color Temperature
4.2.3.4.1 General
This subclause provides an advanced use scenario for the Dominant Color descriptor. The Dominant Color Temperature is a variation of Dominant Color that is suited to implementing retrieval based on perceptual similarity. An image usually has one or a few dominant color temperatures that users perceive when they look at it. Dominant Color Temperatures enable users to search for images in scenarios such as query by example or query by value, and to browse images by their color temperature. They are useful for users who want to find images that look similar in terms of color temperature, rather than images that have similar color regions.
4.2.3.4.2 Use scenario
Dominant Color Temperatures can be used in query-by-example and query-by-value search scenarios. Examples of such queries are depicted in Figure AMD3.1. In a query by example, a user inputs an example image or draws a colored sketch (query by sketch), and the search application returns the images most similar in color temperature. In a query by value, a user chooses a temperature value, and the system retrieves the images whose apparent color temperature is closest to the user's choice.
Figure AMD3.1 — Examples of image retrieval using Dominant Color Temperatures: a) query by
example; b) query by color temperature value given in kelvins
4.2.3.4.3 Feature extraction
The Dominant Color Temperature, which consists of a maximum of eight pairs of color temperature and
percentage, is obtained by the following steps.
1. Get RGB color values and percentages of dominant colors from a Dominant Color descriptor instance.
2. Convert each dominant color value from RGB to color temperature using the relevant method
specified in the feature extraction method of Color Temperature descriptor [subclause 6.9.1.1]. The
number of obtained color temperatures cannot, therefore, exceed the number of dominant colors in
the Dominant Color descriptor instance. The colors that do not have significant color temperature
(colors having luminance values below the luminance threshold specified in the extraction method of
the Color Temperature descriptor) should be omitted.
3. Use the obtained color temperatures and their percentages given by the Dominant Color descriptor
instance in queries: query by example, query by color temperature value, ranking search results, and
others.
4.2.3.4.4 Similarity matching
The similarity is based on a distance function defined as the integral of the absolute difference between two percentage distributions of dominant color temperature. The percentage distributions of dominant color temperature are obtained first, in the following steps:

1. Convert the color temperature values $T_i$ of the Dominant Color Temperature description to the reciprocal megakelvin scale: $RT_i\,[\mathrm{MK}^{-1}] = 1\,000\,000 / T_i\,[\mathrm{K}]$.

2. Sort, in ascending order, the dominant color temperatures expressed in the reciprocal scale.

3. Create the percentage distribution of dominant color temperature $D(RT)$ using the following equations:

$D(RT) = 0$ for $RT < RT_0$;

$D(RT) = p_0 + p_1 + \dots + p_{i-1}$ for $RT_{i-1} \le RT < RT_i$, $1 \le i \le n-1$;

$D(RT) = p_0 + p_1 + \dots + p_{n-1}$ for $RT \ge RT_{n-1}$;

where:
$n$ — number of dominant color temperatures;
$RT_0, RT_1, \dots, RT_{n-1}$ — sorted dominant color temperatures;
$p_0, p_1, \dots, p_{n-1}$ — percentages.
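As an illustration of steps 1 to 3, the following minimal Python sketch (not part of this Technical Report) builds the cumulative distribution from a description given as pairs of color temperature $T_i$ in kelvins and percentage $p_i$; the function name and data layout are illustrative assumptions.

# Illustrative sketch: build the cumulative percentage distribution D(RT)
# from (T_i [K], p_i) pairs of a Dominant Color Temperature description.
def build_distribution(temps_k, percentages):
    """Return sorted breakpoints RT_i [MK^-1] and cumulative values D(RT_i)."""
    rts = [1_000_000.0 / t for t in temps_k]   # step 1: reciprocal megakelvin scale
    pairs = sorted(zip(rts, percentages))      # step 2: ascending RT order
    breakpoints, cumulative, total = [], [], 0.0
    for rt, p in pairs:
        total += p                             # step 3: D steps up by p_i at RT_i
        breakpoints.append(rt)
        cumulative.append(total)
    return breakpoints, cumulative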
Figure AMD3.2 shows an example of a dominant color temperature distribution.

Figure AMD3.2 — Example of cumulative dominant color temperature distribution (percentages p [%] plotted against RT [MK⁻¹] between RT_min and RT_max)
The proposed distance function is given by the following equation, an integral of the difference between two color temperature distributions:

$$\mathrm{dist} = \int_{RT_{min}}^{RT_{max}} \left| D_1(RT) - D_2(RT) \right| \, dRT$$
This expression is equivalent to the geometrical area bounded by the two distributions. An example of
distance calculation is depicted in Figure AMD3.3, where the distribution distances are shown graphically on
distribution diagrams.
Figure AMD3.3 — Example of distance calculation
The distance function presented in the above equation can be efficiently implemented using the following steps:

1. Input: two percentage distributions of dominant color temperature:
   RT1, D1 — tables of temperature and percentage distribution for image 1,
   RT2, D2 — tables of temperature and percentage distribution for image 2;
2. Initialize: dist = 0, $x_1 = RT_{min}$;
3. Take the next minimum temperature value $t_{curr}$ from tables RT1, RT2, and let $x_2 = t_{curr}$;
4. Find in D1, D2 the lower bound $y_1$ and the upper bound $y_2$ of the rectangle corresponding to the current $x_1$, $x_2$ coordinates;
5. $\mathrm{dist} = \mathrm{dist} + (x_2 - x_1)(y_2 - y_1)$;
6. $x_1 = x_2$;
7. If all values from tables D1, D2 have been taken, return dist; else go to step 3.

The tables used as input to the algorithm above are obtained from the percentage distributions of dominant color temperature in the following way: $RTX[i] = RT_i$, $DX[i] = D(RT_i)$, for $0 \le i \le n$, where X stands for image 1 or 2.
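The following minimal Python sketch (not part of this Technical Report) implements these steps by sweeping the merged breakpoints of the two step functions; build_distribution is the illustrative helper sketched above, and all names are assumptions.

def step_value(rt, breakpoints, cumulative):
    """Evaluate the cumulative step function D at rt."""
    value = 0.0
    for b, c in zip(breakpoints, cumulative):
        if rt >= b:
            value = c                      # D holds the last step reached
    return value

def dominant_color_temperature_distance(dist1, dist2):
    """Area between two cumulative distributions (the integral distance)."""
    bp1, cum1 = dist1                      # output of build_distribution
    bp2, cum2 = dist2
    xs = sorted(set(bp1) | set(bp2))       # merged breakpoints; both functions
    total = 0.0                            # are constant between them
    for x1, x2 in zip(xs, xs[1:]):
        diff = abs(step_value(x1, bp1, cum1) - step_value(x1, bp2, cum2))
        total += (x2 - x1) * diff          # rectangle contribution (step 5)
    return total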
In the case of query by color temperature value, the same distance function can be used by assuming that the query value given by the user is a single dominant color temperature with a percentage of 100 %. However, in this case the distance function can be simplified to the following:

$$\Delta RT = \sum_{i=0}^{n-1} \left| RT_{REF} - RT_i \right| \, p_i$$

where $RT_{REF}$ is the value of the query color temperature, $RT_i$ are the dominant color temperatures, $p_i$ are the percentages, and $n$ is the number of dominant color temperatures in the image.
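A one-line sketch of this simplified form (again illustrative, with assumed names):

def query_by_value_distance(rt_ref, breakpoints, percentages):
    """Distance for query by value: the query is a single temperature RT_REF
    carrying 100 % percentage."""
    return sum(abs(rt_ref - rt) * p for rt, p in zip(breakpoints, percentages))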
4.2.3.4.5 Condition of usage
The same restrictions apply as for the Dominant Color descriptor. Additionally, Dominant Color Temperatures cannot be used for very dark images in which all dominant colors have luminance values below the luminance threshold specified in the extraction method of the Color Temperature descriptor.
Add after subclause 4.7:
4.8 High-level use scenarios
4.8.1 Content-based image retrieval
4.8.1.1 General
Content-based image retrieval provides an efficient and easy way of managing and retrieving digital images from very large collections. There are two representative methods in content-based image retrieval. One is query by example, in which a user selects an image similar to those expected as the query. The other is query by sketch, in which a user draws a sketch and uses it as the query. Since a seed picture is needed in the former scenario, some mechanism is required to assist users in finding the query image itself. One possible solution is to combine text-based image retrieval or query by sketch as a pre-processing step for query by example.
4.8.1.2 Query within region of interest (ROI)
4.8.1.2.1 General
This subclause provides a usage scenario that enables users to dynamically retrieve photographs with a similar Region of Interest (such as the background) in image space. Region-based image retrieval can be implemented by partitioning an image into several small regions and assigning a StillRegionFeatureDS to each of them. In practice, however, such an approach is difficult, as it requires prior segmentation, which is often subjective and may depend on the particular query. ROI-based photo retrieval gives users the benefit of defining the ROI when making a query. Although query by example is very useful for image retrieval, one may want to retrieve photos with similar backgrounds only: if the scenery is well known or quite beautiful, people tend to take pictures with the same background but different persons. For such photos, it is more efficient to retrieve by matching the background regions only. In this scenario, the user selects the particular region to retrieve and sends it to the system as the query.
4.8.1.2.2 Use Scenario
Figure AMD3.4 shows the flow of the proposed query method. The user first selects a query image. In the query image, the user selects a ROI by selecting local regions (shown in blue). The ROI is used as the query for retrieval. Figure AMD3.5 shows an example of image retrieval within a ROI.

Figure AMD3.4 — Flow of query by ROI (a query image with the ROI selected in blue, matched against the database)
Figure AMD3.5 — Retrieval of image within ROI
4.8.1.2.3 Tools to be used
StillRegionFeatureDS or VideoSegmentFeatureDS is used for this scenario. Among the several elements
included in these DSs, the Edge Histogram descriptor and the Color Layout descriptor should be instantiated
to implement the functionality of ROI-based retrieval. For video retrieval, shots are extracted from the video
sequence and for each shot, localized features from the specific region are extracted. Then, the ROI is used
as a query for video retrieval.
4.8.1.2.4 Feature Extraction
ROI-based retrieval can be implemented by extracting localized features from the specified region. The extraction process of the localized features from the instances of the two mandatory description tools, Color Layout and Edge Histogram, is described in this subclause; Figure AMD3.6 illustrates the process. From the Edge Histogram descriptor, one can obtain a localized edge distribution for each of the 4 × 4 local rectangular regions. From the Color Layout descriptor, one can obtain an 8 × 8 region-based DCT: by performing inverse quantization and taking the 8 × 8 inverse DCT (as described in subclause 4.2.5.2.3), average color values for the 8 × 8 local rectangular regions are obtained. Feature extraction for each descriptor is defined in ISO/IEC 15938-3, MPEG-7 Visual. As in Figure AMD3.6, a combination of the Edge Histogram descriptor and the Color Layout descriptor can be used for the rectangular region-based query by ROI.
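As a minimal sketch of the inverse DCT step (not part of this Technical Report), the block-average colors can be recovered from an 8 × 8 array of dequantized Color Layout coefficients with SciPy; the dequantization itself, defined in ISO/IEC 15938-3, is assumed to have been done already.

import numpy as np
from scipy.fftpack import idct

def cld_to_block_colors(coeff_8x8):
    """coeff_8x8: 8 x 8 dequantized DCT coefficients of one channel (Y, Cb or Cr).
    Returns the 8 x 8 grid of average colors, one value per local block."""
    # 2-D inverse DCT = 1-D IDCT along rows, then along columns
    return idct(idct(np.asarray(coeff_8x8, dtype=float), axis=0, norm='ortho'),
                axis=1, norm='ortho')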
Figure AMD3.6 — 4 × 4 block-based “Query-by-ROI” with Edge Histogram descriptor and Color Layout descriptor (a combined feature for each 4 × 4 block: the 4 × 4 EHD together with block-average colors from the 8 × 8 IDCT of the CLD in the spatial domain)
4.8.1.2.5 Similarity Matching
For the Color Layout descriptor, we can take an 8 × 8 inverse DCT of the quantized DCT coefficients of Y, Cb and Cr. Then we have representative color values for the 8 × 8 blocks of the image. These block-wise color values are combined with the edge histogram bins for each of the 4 × 4 image regions (see Figure AMD3.7). Thus, each rectangular image region of the (4 × 4) Edge Histogram descriptor blocks includes 4 (2 × 2) color blocks obtained by the inverse 8 × 8 DCT of the Color Layout descriptor. A combination of the color and edge information in each of the (4 × 4) rectangular image regions then forms a feature vector for the rectangular region-based similarity matching.

Figure AMD3.7 — Parameter value example of EHD

Figure AMD3.7 shows an example of parameter values when the EHD is used for matching blocks. When the total number of images is N, the j-th (j = 0, 1, 2, …, 15) block of the i-th (i = 0, 1, 2, …, N) image has five types (0°, 45°, 90°, 135°, non-directional) of edge value. If we denote these edge values by k (k = 0, 1, 2, 3, 4), the parameter value $H_{ij}[k]$ is the k-th edge value of the j-th block of the i-th image. For a query image Q, the edge value of the selected sub-image is $H^Q[k]$. The local distance $LD^{EHD}_{ij}$ of the Edge Histogram is as follows:

$$LD^{EHD}_{ij} = \sum_{k=0}^{4} \left| H_{ij}[k] - H^{Q}[k] \right| \quad \text{(AMD1)}$$
Figure AMD3.8 — Parameters of the inverse DCT Color Layout descriptor

Figure AMD3.8 shows the parameters of the inverse DCT Color Layout descriptor. The total number of images is N. We group the 8 × 8 image blocks into 4 × 4 blocks; each newly grouped block consists of 4 sub-blocks, labelled β (β = 0, 1, 2, 3). Each of Y, Cb and Cr is labelled α (α = 0, 1, 2): α = 0 for Y, α = 1 for Cb, α = 2 for Cr. The parameter value $C_{ij\beta}[\alpha]$ represents color value α of sub-block β of the j-th block (sub-image) of the i-th image. $C^{Q}_{\beta}[\alpha]$ are the parameter values of the query image Q. The local distance of the Color Layout descriptor for the j-th sub-image of the i-th image is obtained as follows:

$$LD^{CLD}_{ij} = \sum_{\alpha=0}^{2} \sum_{\beta=0}^{3} \left| C^{Q}_{\beta}[\alpha] - C_{ij\beta}[\alpha] \right| \quad \text{(AMD2)}$$

The distance values are then scaled so that the maximum values of $LD^{EHD}_{ij}$ (AMD1) and $LD^{CLD}_{ij}$ (AMD2) over all images in the database are normalised to 1. The combined distance $CD_{ij}$ is obtained from these normalised distances, $ND^{EHD}_{ij}$ and $ND^{CLD}_{ij}$, as follows:

$$CD_{ij} = ND^{EHD}_{ij} + ND^{CLD}_{ij} \quad \text{(AMD3)}$$
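A compact NumPy sketch of equations (AMD1) to (AMD3) follows. It is illustrative only, not part of this Technical Report; the array shapes are assumptions: H holds the EHD bins as (N images, 16 blocks, 5 bins), C holds the IDCT block-average colors as (N, 16, 4 sub-blocks, 3 channels), and hq, cq hold the corresponding features of the query's selected sub-image.

import numpy as np

def combined_roi_distance(H, C, hq, cq):
    """Return CD[i, j] of (AMD3) from the local EHD and CLD distances."""
    ld_ehd = np.abs(H - hq).sum(axis=-1)         # (AMD1): sum over k
    ld_cld = np.abs(cq - C).sum(axis=(-2, -1))   # (AMD2): sum over alpha, beta
    nd_ehd = ld_ehd / ld_ehd.max()               # scale database maxima to 1
    nd_cld = ld_cld / ld_cld.max()
    return nd_ehd + nd_cld                       # (AMD3)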
4.8.1.2.6 Condition of Use
To use ROI-based retrieval, the regions shall be rectangular, and the grid of rectangular regions is restricted to 4 × 4.
4.8.1.2.7 DDL instance examples
xmlns="urn:mpeg:mpeg7:schema:2004">
xsi:type="StillRegionFeatureType">
40
34
30
16 12 15 12 17
12 17
12 14
2 6 4 4 2 1 7 5 3 2 1 6 4 2 2 2 5 4
5 3 1 5 5 6 5 2 6 5 4 4 1 6 4 4 4 0 6 3 5
2 1 5 5 6 6 4 2 3 6 7 3 2 5 5 7 3 2 4 4 7
1 5 6 4 6 1 5 7 4 5 1 6 4 6 5 1 3 4 7 6
4.8.2 Grouping Technologies
4.8.2.1 Situation-based clustering
4.8.2.1.1 General
A simple but very effective structure is to group images by the occasion on which they were taken. This is natural for the user, since they will often remember the context of the situation much better than a date, time or explicit label attached to the picture. It is possible to automatically cluster images into such “situations” by using MPEG-7 visual descriptions together with the time stamp of each image. Based on the assumption that each situation is contiguous in time, the organisational structure can be represented by the time-sequence of images, with a flag or marker to indicate the boundaries between situations (cf. Figure AMD3.9). This provides the user with a simple, intuitive and effective means to browse through their collection, without placing any additional burden on them to spend time organising it. Two methods are presented for situation-based clustering.

Boundaries (vertical bars) are inserted between adjacent images in the sequence to denote the grouping.

Figure AMD3.9 — One representation of the grouped sequence of images
4.8.2.1.2 Use scenario
This kind of clustering can easily be implemented in traditional photo-browsing software applications. For the
user, it is very simple to use – the extraction and matching of MPEG-7 descriptors and detection of the
boundaries is fully automatic, so the tool is essentially “one-click”. Of course, some users may choose to
adjust and refine the automatic output to match their individual preferences. This process would still be far
easier than organising all the photos manually.
The clustering information can be used to access and manipulate the image content in a variety of ways:
• Browsing:
o Display a cluster of images per page, or
o Display a single thumbnail / icon for each cluster
• Annotation
o User can easily assign a single label to all the images in a cluster
• Sharing:
o User can select images by cluster and…
o Print
o Copy
o Upload to website
4.8.2.1.3 Method 1: Simple Linear Clustering
4.8.2.1.3.1 General
This method achieves good clustering performance with minimal complexity. The additional computation (after
extracting and matching MPEG-7 visual descriptors) consists of a simple weighted linear summation. It is
therefore well-suited to applications where MPEG-7 descriptors have been extracted from images but
resources are not available for higher-level processing (for example, in low-complexity devices). The input
parameters to the algorithm are also simple and therefore easy to adapt - for example, to different applications
or user preferences.
4.8.2.1.3.2 Tools to be used
Six tools defined in ISO/IEC 15938-3 shall be instantiated in StillRegionFeatureDS:
• Dominant Color (DC)
• Scalable Color (SC)
• Color Layout (CL)
• Color Structure (CS)
• Homogeneous Texture (HT)
• Edge Histogram (EH)
Also, capturing date/time information should be included. If an image is encoded in the Exif file format (JEITA CP-3451), this information can be obtained from the Exif header:
• EXIF DateTime tag (ID 36867)
Alternatively, the same information can be captured using:
• CreationInformation/CreationCoordinates/Date (mpeg7:TimeType)
4.8.2.1.3.3 Clustering Algorithm
The images are ordered by their time stamps and each potential boundary in the sequence is evaluated in
turn. To determine the presence or absence of a boundary, a number of pair-wise comparisons are made
amongst images lying in a window either side of the transition. This neighbourhood and the comparisons used
are illustrated in Figure AMD3.10.
Figure AMD3.10 — Neighbourhood comparisons evaluated to determine if a boundary is present (images j−2 to j+3, with the candidate boundary between j and j+1)
Comparison of images consists of computing the descriptor distances (by the respective methods suggested in ISO/IEC TR 15938-8) and calculating the time difference. The latter is measured on a logarithmic scale, to compress the range of this feature and allow meaningful comparisons. Time distance is therefore defined as:

$$D_T(i, i+1) = \ln\left(10^{-5} + (T_{i+1} - T_i)\right)$$

The unit of time used for $T_i$ is days. The natural logarithm is applied to normalise the range of time distances, since potential time differences vary over several orders of magnitude. After this transformation, the variation of the time distance is comparable to the remaining features. The constant $10^{-5}$ sets the minimum scale of the distance (just under one second, in this case) and also ensures that $\ln(0)$ does not occur.
The input to the algorithm includes the first-, second- and third-order distances in a short time interval around the boundary to be tested. Here “first-order” refers to the difference, for any given feature, between two images that are adjacent in the sequence, i.e. $D_F(i, i+1)$. A second-order distance is the difference between two images that are separated by one other image, i.e. $D_F(i, i+2)$. Similarly, a third-order distance is the difference between two images that are separated by two other images, i.e. $D_F(i, i+3)$. The total measurement of difference between images j and j+1 is now:

$$D(j, j+1) = \sum_{F} \left\{ \sum_{i=-2}^{2} \alpha_{Fi} D_F(j+i, j+i+1) + \sum_{i=-2}^{1} \beta_{Fi} D_F(j+i, j+i+2) + \sum_{i=-2}^{0} \gamma_{Fi} D_F(j+i, j+i+3) \right\}$$

This is a summation over a set of 12 distance measurements for each of 6 visual features (the outer summation being over the set of features, F). For the time difference, only the first-order distances are used, adding 5 more distance measurements, to give a total set of 77 numbers. These are weighted by 77 weights $\alpha, \beta, \gamma$, the recommended values of which are given in Table AMD3.1.
Table AMD3.1 — Weights for distance calculation (feature distances are first normalized for zero mean and unit variance)

         Dominant   Scalable   Color     Color      Homogeneous   Edge        Time
         Color      Color      Layout    Structure  Texture       Histogram
α_2      0,0583     0,2598     0,4546    0,2661     -0,0718       0,4890      2,8952
α_1      0,1976     -0,0077    -0,0986   -0,3279    -0,1370       -0,3108     -0,2911
α_0      -0,0425    0,1117     0,0543    0,0594     -0,0089       -0,1642     -0,0035
α_-1     0,2718     -0,1835    -0,0640   -0,1153    0,0102        -0,3534     -0,3748
α_-2     0,0085     -0,0259    -0,0539   -0,1419    -0,0725       0,0951      -0,0786
β_1      -0,0249    -0,0107    0,4662    0,3828     0,0567        -0,2351     —
β_0      0,1718     -0,0788    -0,0086   0,2190     0,2653        0,2186      —
β_-1     0,0958     -0,2618    -0,0520   -0,0652    -0,0496       0,1157      —
β_-2     0,2785     0,0072     -0,3648   -0,1872    -0,0611       0,1439      —
γ_0      -0,1955    0,1203     -0,0767   -0,0567    0,0148        0,1178      —
γ_-1     0,0324     0,1808     -0,2327   0,2665     0,0167        0,2029      —
γ_-2     -0,1199    0,0196     0,0477    0,1841     0,0288        0,1436      —
The output, D, is a real-valued indicator of boundary confidence, i.e. higher values of D indicate a stronger belief that there is a boundary at the candidate position. A binary boundary indicator is obtained by comparing this number to a threshold. The threshold may be adjusted for sensitivity depending on the image collection, the particular application, or the preferences of the user. A value of around 3.35 is recommended to produce a good balance between false positives (mistakenly detected boundaries) and false negatives (missed boundaries).
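A minimal Python sketch of this boundary test follows (not part of this Technical Report): dists is assumed to map each visual feature to a normalised distance function D_F(a, b), times holds capture times in days, and ALPHA, BETA, GAMMA and alpha_time hold the Table AMD3.1 weights keyed by offset i.

import math

def time_distance(times, i):
    """D_T(i, i+1) = ln(1e-5 + time difference in days)."""
    return math.log(1e-5 + (times[i + 1] - times[i]))

def boundary_score(dists, times, j, ALPHA, BETA, GAMMA, alpha_time):
    """Weighted sum of the 77 distance measurements around boundary j / j+1."""
    d = 0.0
    for F, D in dists.items():                                 # 6 visual features
        d += sum(ALPHA[F][i] * D(j + i, j + i + 1) for i in range(-2, 3))
        d += sum(BETA[F][i] * D(j + i, j + i + 2) for i in range(-2, 2))
        d += sum(GAMMA[F][i] * D(j + i, j + i + 3) for i in range(-2, 1))
    # time feature: first-order distances only (5 more measurements)
    d += sum(alpha_time[i] * time_distance(times, j + i) for i in range(-2, 3))
    return d                     # boundary if this exceeds the threshold (~3.35)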
4.8.2.1.4 Method 2: Clustering based on Visual Semantic Hints
4.8.2.1.4.1 General
The proposed method achieves good clustering performance on similar situations based upon visual semantic hints. When the visual semantic hints are used for adaptive feature selection, they help to reduce computational complexity while achieving reasonable clustering performance. For example, a low-performance device such as a mobile phone, which can extract only a limited number of MPEG-7 descriptors, can apply this method while maintaining reasonable clustering performance.
4.8.2.1.4.2 Tools to be used
Seven tools defined in ISO/IEC 15938-3 shall be instantiated in StillRegionFeatureDS:
• Dominant Color (DC)
• Scalable Color (SC)
• Color Layout (CL)
• Color Structure (CS)
• Homogeneous Texture (HT)
• Texture Browsing (TB)
• Edge Histogram (EH)
Also, capturing date/time information should be included. If an image is encoded in the Exif file format (JEITA CP-3451), this information can be obtained from the Exif header:
• EXIF DateTime tag (ID 36867)
Alternatively, the same information can be captured using:
• CreationInformation/CreationCoordinates/Date (mpeg7:TimeType)
4.8.2.1.4.3 Semantic Hint Extraction
We determine the weight of each visual feature using ‘visual semantic hints’, which are automatically extracted from a series of photos, in order to improve the situation/view-based photo clustering performance. The visual semantic hints of an image represent the visual characteristics that are perceived by the human visual system. The visual semantic hints used are as follows:

1) Colorfulness (CoF) hint: represents the degree of visual sensation according to the purity of the colors in a photo. Figure AMD3.11 shows some example photos with a high degree of colorfulness.

Figure AMD3.11 — Example photos with Colorfulness semantics
To extract the CoF semantics, we utilize the Scalable Color descriptor:

$$F_{SCD} = \{f_1, f_2, f_3, \dots, f_j, \dots, f_{N_{SCD}}\}, \quad N_{SCD} \in \{16, 32, 64, 128, 256\}$$

where $f_j$ represents the number of colors belonging to each HSV bin, i.e. it can be obtained by inverting the Haar transform coefficients.

Whether a photo has CoF semantics is determined using the magnitude of the HSV bins. The colorfulness of a photo is represented by averaging the magnitudes of the HSV bins; the power of this magnitude represents the CoF semantics. For simplicity, the CoF semantics of a photo is set to either high or low. It is defined as

$$CoF = \begin{cases} \text{high}, & \dfrac{1}{N_{SCD}} \displaystyle\sum_{j=1}^{N_{SCD}} f_j > th_{CoF} \\ \text{low}, & \text{otherwise} \end{cases}$$

where $th_{CoF}$ is a threshold to detect whether the power of the magnitude is sufficient to represent the CoF semantics. The threshold $th_{CoF}$ was heuristically set to the average of the ‘colorfulness’ values in a given database; here $th_{CoF}$ was set to 13.56.
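A minimal sketch of the CoF decision (illustrative, not part of this Technical Report):

def colorfulness_hint(scd_bins, th_cof=13.56):
    """'high' if the mean magnitude of the (inverse Haar transformed)
    Scalable Color bins exceeds th_CoF, else 'low'."""
    return "high" if sum(scd_bins) / len(scd_bins) > th_cof else "low"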
2) Color Coherence (CoC) hint: represents the degree of visual sensation according to the spatial coherency of the colors in a photo. Figure AMD3.12 shows some example photos with a high degree of color coherency.

Figure AMD3.12 — Example photos with Color Coherence semantics
To extract the CoC semantics, we utilize the Dominant Color descriptor:

$$F_{DCD} = \{(c_j, p_j, u_j), s\}, \quad j = 1, 2, 3, \dots, N_{DCD}$$

where $N_{DCD}$ is the number of dominant colors. Each dominant color value $c_j$ is a vector of the corresponding color space component values. The percentage $p_j$ (normalized to a value between 0.0 and 1.0) is the fraction of pixels in the image corresponding to $c_j$. The optional color variance $u_j$ describes the variation of the color values of the pixels in a cluster around the corresponding representative color. The spatial coherency $s$ is a single number that represents the overall spatial homogeneity of the dominant colors in the image.

Whether a photo has CoC semantics is determined using the percentage of each dominant color and the spatial coherency. To capture the human visual perception of the CoC semantics of a photo, the CoC semantics is captured by the number of dominant colors covering a large portion of the image and by a high spatial coherency. For simplicity, the CoC semantics of a photo is set to either high or low. It is defined as

$$CoC = \begin{cases} \text{high}, & \displaystyle\sum_{j=1}^{N_{DCD}} B\left(p_j > th^{1}_{CoC}\right) > th^{2}_{CoC} \ \text{and} \ s > th^{3}_{CoC} \\ \text{low}, & \text{otherwise} \end{cases}$$

where $B(p_j > th^{1}_{CoC})$ is 1 if $(p_j > th^{1}_{CoC})$ is true and 0 otherwise. $th^{1}_{CoC}$ is a threshold to detect whether each dominant color takes a large portion of the image region, $th^{2}_{CoC}$ is a threshold to detect whether there is a sufficient number of dominant colors that take a large portion of the image region, and $th^{3}_{CoC}$ is a threshold to detect whether the dominant colors have spatially high coherence over the image region. The thresholds $th^{1}_{CoC}$, $th^{2}_{CoC}$ and $th^{3}_{CoC}$ were heuristically set to 0.5, 7.0 and 0.31, respectively.
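A minimal sketch of the CoC decision (illustrative), taking the dominant color percentages p_j and the spatial coherency s from a Dominant Color descriptor instance:

def color_coherence_hint(percentages, s, th1=0.5, th2=7.0, th3=0.31):
    """'high' if enough dominant colors each cover a large portion of the
    image (sum of B(p_j > th1) exceeds th2) and spatial coherency s is high."""
    large = sum(1 for p in percentages if p > th1)
    return "high" if large > th2 and s > th3 else "low"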
3) Level of Detail (LoD) hint: represents the degree of visual sensation of objects in a photo appearing more or less detailed. It is an important semantic since, for example, photos of mountains generally have a higher level of detail than close-up photos of a human face. Figure AMD3.13 shows some example photos with the LoD semantics; high-LoD photos contain more detail, such as edges, than low-LoD photos.

Figure AMD3.13 — Example photos with Level of Detail semantics: (a) photos with high Level of Detail semantics and (b) photos with low Level of Detail semantics

The basic idea for measuring the LoD is that photos with high LoD semantics have much detail in their content, since neighbouring pixel values change abruptly and frequently; this means the photo has many high-frequency components. Photos with low LoD semantics, on the other hand, have relatively little detail in their content.

Nowadays, JPEG compression is very popular for digital photographs, since it does not degrade image quality much while drastically reducing the file size of the photo. In order to measure the LoD semantics of a photo, we define ‘a relative compression ratio per pixel’. In general, photos with high LoD semantics have a lower compression ratio than photos with low LoD semantics, since they have relatively less spatial redundancy between pixels.

In JPEG compression, loss is caused by quantization, and each photo may have been compressed with a different quantization table. Thus, before extracting the LoD semantics, all photos to be clustered should be decompressed and then recompressed with the same quantization table.
The file size of the JPEG-compressed photo is obtained from its file header, and the file size is normalized by the total number of pixels. The LoD semantics is defined as

$$LoD = \begin{cases} \text{high}, & \left. \dfrac{f_{fs}}{f_{iw} \times f_{ih} \times f_{cd}} \right|_{Q} > th_{LoD} \\ \text{low}, & \text{otherwise} \end{cases}$$

where $f_{fs}$ is the file size of the photo, $f_{iw}$ is the image width, $f_{ih}$ is the image height, and $f_{cd}$ is the color depth of the photo. $Q$ is the given quantization table; the baseline JPEG quantization table, as shown in Figure AMD3.14, is recommended. The threshold $th_{LoD}$ was heuristically set to the average of the ‘level of detail’ values in a given database; here $th_{LoD}$ was set to 0.499.
(a) Luminance quantization table:

16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99

(b) Chrominance quantization table:

17 18 24 47 99 99 99 99
18 21 26 66 99 99 99 99
24 26 56 99 99 99 99 99
47 66 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99

Figure AMD3.14 — Baseline JPEG quantization tables for the LoD semantic hint
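A minimal sketch of the LoD decision using Pillow (illustrative, not part of this Technical Report; Pillow's 'web_low' quantization preset is used here as a stand-in for the fixed baseline tables of Figure AMD3.14):

import os
from PIL import Image

def level_of_detail_hint(path, th_lod=0.499, tmp="recompressed.jpg"):
    """Recompress with one fixed quantization setting, then threshold the
    relative compression ratio per pixel f_fs / (f_iw * f_ih * f_cd)."""
    img = Image.open(path).convert("RGB")
    img.save(tmp, "JPEG", qtables="web_low")   # same tables for every photo
    f_fs = os.path.getsize(tmp)                # file size after recompression
    f_iw, f_ih, f_cd = img.width, img.height, 3
    return "high" if f_fs / (f_iw * f_ih * f_cd) > th_lod else "low"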
4) Homogeneous Texture (HoT) hint: represents the degree of visual sensation according to homogeneous texture in a photo. The HoT semantic hint expresses how regularly the textures of objects in a photo repeatedly form clumped regions. Figure AMD3.15 shows some example photos with highly regular texture patterns.

Figure AMD3.15 — Example photos with Homogeneous Texture semantic hint
To extract the HoT semantic hint, we utilize the Texture Browsing descriptor:

$$F_{TBD} = \{f_r, f_{d1}, f_{d2}, f_{s1}, f_{s2}\}$$

where $f_r$ is the regularity or structure of the texture, $f_{d1}$ and $f_{d2}$ are the two dominant directions of the texture, and $f_{s1}$ and $f_{s2}$ are the two dominant scales capturing the coarseness of the texture.

The homogeneity of the texture of a photo is represented by the regularity component $f_r$. For simplicity, the HoT semantic hint of a photo is set to either high or low. It is defined as

$$HoT = \begin{cases} \text{high}, & f_r > th_{HoT} \\ \text{low}, & \text{otherwise} \end{cases}$$

where $th_{HoT}$ is a threshold to detect sufficient regularity (or homogeneity) of the texture in the image region. The threshold $th_{HoT}$ was heuristically set to the average of the ‘homogeneous texture’ values in a given database; here $th_{HoT}$ was set to 2.
5) Heterogeneous Texture (HeT) hint: represents the degree of visual sensation of how continuous or strong the boundaries in a photo are. Figure AMD3.16 shows some example photos with strong boundaries.

Figure AMD3.16 — Example photos with Heterogeneous Texture semantic hint

To extract the HeT semantic hint, we utilize the Edge Histogram descriptor:

$$F_{EHD} = \{f_1, f_2, f_3, \dots, f_j, \dots, f_{N_{EHD}}\}, \quad N_{EHD} = 80$$

where $f_j$ represents the magnitude of an edge bin.

The heterogeneous texture of a photo is represented by averaging the magnitudes of the edge bins; the power of this magnitude represents the HeT semantic hint. For simplicity, the HeT semantic hint of a photo is set to either high or low. It is defined as

$$HeT = \begin{cases} \text{high}, & \dfrac{1}{N_{EHD}} \displaystyle\sum_{j=1}^{N_{EHD}} f_j > th_{HeT} \\ \text{low}, & \text{otherwise} \end{cases}$$

where $th_{HeT}$ is a threshold to detect whether the power of the magnitude is sufficient to represent the HeT semantic hint. The threshold $th_{HeT}$ was heuristically set to the average of the ‘heterogeneous texture’ values in a given database; here $th_{HeT}$ was set to 227.98.
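Minimal sketches of the two texture hints (illustrative, not part of this Technical Report): HoT thresholds the Texture Browsing regularity component f_r, HeT the mean Edge Histogram bin magnitude.

def homogeneous_texture_hint(f_r, th_hot=2):
    """'high' if the texture regularity f_r exceeds th_HoT, else 'low'."""
    return "high" if f_r > th_hot else "low"

def heterogeneous_texture_hint(ehd_bins, th_het=227.98):
    """'high' if the mean edge-bin magnitude exceeds th_HeT, else 'low'."""
    return "high" if sum(ehd_bins) / len(ehd_bins) > th_het else "low"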
4.8.2.1.4.4 Hierarchical Thresholding to Detect Situation Change
In a sequential series of photos, two adjacent photos exhibit a variety of time and visual differences. The situation change is detected with hierarchical thresholding. Situation changes are more frequent at higher levels of the hierarchy, i.e. a situation group can be divided into two or more groups at the next higher level; the finest situation changes are made at the highest level. Figure AMD3.17 shows an illustration of the hierarchical situation detection (1st, 2nd, …, r-th hierarchy).

Figure AMD3.17 — Hierarchical situation change detection
Given multiple features, similarity distances between photos are measured. The time difference between the i-th and the j-th photos is measured as

$$D_{time}(i, j) = \frac{\log\left\{ F_{time}(i) - F_{time}(j) + C_{time} \right\}}{D_{time\_max}}$$

where $\log\{F_{time}(i) - F_{time}(j) + C_{time}\}$ is a time scale function and $C_{time}$ is a constant that avoids a zero input to the log scale function. $D_{time\_max}$ is the maximum time difference in the series of input photos to be clustered. The time difference is non-linear, and the value of the time similarity distance increases as the time difference increases. The value of the time similarity distance is scaled so that it is less sensitive to large time differences and is consistent within the same situation.
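A minimal sketch of this normalised time distance (illustrative; the constant c stands for C_time, and its value of 1.0 is an assumption, since it is not given here):

import math

def time_similarity_distance(t_i, t_j, d_time_max, c=1.0):
    """D_time(i, j): log-scaled time difference normalised by the maximum
    time difference in the series."""
    return math.log(abs(t_i - t_j) + c) / d_time_max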
Content-based similarity between the i-th and the j-th photos is also defined as

$$D_{content}(i, j) = \{ D_f(i, j) \mid f \in F \} = \{ \Theta\{ F'_f(i) - F'_f(j) \} \mid f \in F \}$$

where $\Theta$ is a similarity measurement function, such as an L1 or L2 norm distance measure, for a given low-level feature $f$.
Given the inter-photo similarity distances, the similarity distance between the photos belonging to two adjacent situations is measured. Figure AMD3.18 shows an example of determining whether the situation of the i-th photo in $S_{(r-1)}$ has changed or not, where $S_{(r-1)}$ is the situation change result of the (r−1)-th hierarchy.

Figure AMD3.18 — Inter-situation similarity measurement

As shown in Figure AMD3.18, assume that a situation group is made from the (i−n)-th photo to the (i+m+1)-th photo in the (r−1)-th hierarchy. Then, inter-situation similarity is measured in a comparison bound, defined as

$$B_r(i) = [b_{min}, b_{max}]$$

where $b_{min}$ is the photo on the minimum bound and $b_{max}$ is the photo on the maximum bound. Initially, the comparison bound is the situation change boundary in the (r−1)-th hierarchy; e.g. in Figure AMD3.18, $b_{min}$ is the (i−n)-th photo and $b_{max}$ is the (i+m+1)-th photo.
The comparison bound $B_r(i)$ is updated by finding the two photos most similar to the i-th photo within the bound. This avoids unnecessary comparisons with photos that are not similar to the i-th photo. Given the i-th photo, the minimum bound $b'_{min}$ is updated to the most similar one among the photos taken prior to the i-th photo. Similarly, the maximum bound $b'_{max}$ is updated to the most similar one among the photos taken after the i-th photo. The updated bound is measured as

$$B'_r(i) = [b'_{min}, b'_{max}] = \left[ \arg\min_j \{ D(i, j) \mid b_{min} \le j < i \}, \ \arg\min_j \{ D(i, j) \mid i < j \le b_{max} \} \right]$$
Given the bound $B'_r(i)$, the inter-situation similarity is measured. Similarity between two photos in the bound is measured as

$$D_f(j_1, j_2) = \frac{v_f(i) \times D_f(j_1, j_2)}{\sum_{f \in F} v_f(i)}$$

where $v_f(i)$ is the importance value of the feature $f$. The importance value for each feature was heuristically set as shown in Table AMD3.2; the default importance value for each feature is 1.0.

Table AMD3.2 — An importance value table for each feature and semantic hints

        v_HTD = 1.5   v_EHD = 2.7   v_CLD = 3.2   v_CSD = 2.0   v_SCD = 1.5   v_DCD = 2.5
HoT     High          -             -             -             -             -
HeT     -             High          -             -             High          -
LoD     -             -             Low           High          -             Low
CoF     -             -             -             High          High          -
CoC     -             -             High          -             -             High
The inter-situation similarity $Z_r(i)$ is finally measured from three terms of inter-photo similarity, as follows:

$$Z_r(i) = \alpha \cdot D_f(i, b'_{min}) - \beta \cdot D_f(i, b'_{max}) + \gamma \cdot \frac{\displaystyle\sum_{j=b'_{min}}^{i-1} \sum_{k=i}^{b'_{max}} \left\{ \sum_{f \in F} D_f(j, k) \right\}}{M}$$

where $\alpha$, $\beta$ and $\gamma$ are importance values of each term; without any prior knowledge, $\alpha$, $\beta$ and $\gamma$ can be set to 1.0. The first term is the similarity distance between the i-th photo and the minimum bound $b'_{min}$. The second term is the similarity distance between the i-th photo and the maximum bound $b'_{max}$. The third term is the sum of the similarity distances between the group of photos taken prior to the i-th photo and the group of photos taken after it.

If there is a situation change at the i-th photo, the first term is relatively lower than in the no-change case, and the second and third terms are relatively higher than the first term.
In the r-th hierarchy, the inter-situation similarity is determined by

$$S_r(i) = \begin{cases} \text{true}, & Z_r(i) > th_r \\ \text{false}, & \text{otherwise} \end{cases}$$

where $th_r$ is the threshold for determining a situation change at the i-th photo. If the inter-situation similarity $Z_r(i)$ of the i-th photo is greater than the threshold, the i-th photo is regarded as a situation change.

The threshold value decreases as the hierarchy level increases; a lower threshold produces more situation boundaries. The threshold is defined as

$$th_r = th_{init} - \Delta th_r$$
where $th_{init}$ is the initial threshold at the first hierarchy and $\Delta th_r$ is the variation of the threshold in the r-th hierarchy. $th_{init}$ and $\Delta th_r$ were set to 0.7 and 0.02, respectively.

Situation change detection is finished when the threshold meets the following condition:

$$th_r < th_{stop}$$

where $th_{stop}$ is the minimum criterion to stop the situation change detection
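The overall hierarchical loop can be sketched as follows (illustrative, not part of this Technical Report): z_score(i, boundaries) is an assumed helper computing Z_r(i) from the weighted feature distances and the current boundaries, and the stop threshold value is an assumption, since th_stop is not given in this sample.

def hierarchical_situation_detection(n_photos, z_score,
                                     th_init=0.7, d_th=0.02, th_stop=0.5):
    """Repeatedly lower the threshold th_r and add situation boundaries
    where Z_r(i) exceeds it; stop once th_r falls below th_stop."""
    boundaries = set()             # photo indices starting a new situation
    th_r = th_init
    while th_r >= th_stop:
        for i in range(1, n_photos):
            if z_score(i, boundaries) > th_r:   # S_r(i) = true
                boundaries.add(i)
        th_r -= d_th               # th_r = th_init - delta_th_r
    return boundaries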
...