Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content

This document provides information about optimizations for encoders and receiving systems for conducting machine analysis tasks on coded video content. It provides a concept-level overview of recent practices and provides comments on technical aspects and cautions to be taken when interpreting the results. This document describes technologies that have recently been studied and have demonstrated benefits to coding efficiency for some machine analysis tasks.

Technologies de l'information — Intelligence artificielle pour le multimédia — Partie 3: Optimisation des codeurs et des systèmes de réception pour l'analyse automatique de contenus vidéo codés

General Information

Status: Published
Publication Date: 14-Jun-2026

ICS: 35.040.40 - Coding of audio, video, multimedia and hypermedia information; 35.240.01 - Application of information technology in general

Technical Committee: ISO/IEC JTC 1/SC 29 - Coding of audio, picture, multimedia and hypermedia information
Drafting Committee: ISO/IEC JTC 1/SC 29 - Coding of audio, picture, multimedia and hypermedia information

Current Stage: 6060 - International Standard published
Start Date: 15-Jun-2026
Completion Date: 15-Jun-2026

Technical report

ISO/IEC TR 23888-3:2026 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content

Release Date:15-Jun-2026

English language (21 pages)

Overview

ISO/IEC TR 23888-3:2026 is an international technical report developed by ISO and IEC that addresses the optimization of video encoders and receiving systems for machine analysis of coded video content. As machine learning and artificial intelligence (AI) methods become critical in multimedia applications, video pipelines need to be optimized not just for human viewing but also for automated machine analysis. This document delivers a concept-level overview of recent practices and highlights technical aspects and cautions to consider when interpreting the results of encoder optimizations for video content consumed by AI systems. The technologies described in this report have demonstrated benefits to coding efficiency for some machine analysis tasks.

Key Topics

Video Pipeline Optimization: The report discusses end-to-end systems for processing coded video content, encompassing pre-processing, encoding, and post-processing optimizations targeted at machine analysis tasks.
Evaluation Methodologies: The report describes metrics commonly used to benchmark optimizations for machine consumption: bit rate, peak signal-to-noise ratio (PSNR), mean average precision (mAP), multiple object tracking accuracy (MOTA), and Bjøntegaard delta rate (BD-rate).
Pre-processing Techniques: Methods such as region of interest (RoI) detection, foreground/background differentiation, temporal and spatial subsampling, and noise filtering are covered for improving encoder input for AI analysis.
Encoding Enhancements: Adaptive encoding strategies, including RoI-based quantization parameter adaptation and adjustments for temporal layers, are emphasized to balance efficiency with analysis accuracy.
Post-processing and Metadata: Efficient post-processing methods and the use of metadata to support downstream machine analysis are detailed, including the use of supplemental enhancement information (SEI) messages relevant to machine vision applications.

Applications

The optimizations detailed in ISO/IEC TR 23888-3:2026 are relevant across several high-impact sectors:

Surveillance Systems: Large-scale sensor networks benefit from encoding optimizations that reduce bandwidth while enabling reliable real-time object detection and tracking by AI algorithms.
Intelligent Transportation: Connected vehicles and smart infrastructure rely on efficient coded video content for machine-based interpretation, facilitating interoperability and reduced network load.
Industrial Automation: Automated inspection and visual analysis in manufacturing environments are streamlined using video coding tailored for machine consumption, enhancing throughput and reliability.
General AI Multimedia Analysis: Machine learning pipelines that process video data for object detection, segmentation, or tracking can leverage these optimizations to accelerate workflows and reduce resource consumption.

Related Standards

ISO/IEC TR 23888-3:2026 is part of a broader standards ecosystem. Key related documents include:

ISO/IEC 23090-3 (VVC) / ITU-T H.266: The Versatile Video Coding standard, relevant for the baseline decoding of video bitstreams.
ISO/IEC 23008-2 (HEVC) / ITU-T H.265: High Efficiency Video Coding, widely adopted in modern streaming and broadcasting.
ISO/IEC 14496-10 (AVC) / ITU-T H.264: Advanced Video Coding, foundational in digital video.
ISO/IEC 23002-7 / ITU-T H.274: Specifies supplemental enhancement information (SEI) messages critical for improved machine analysis of coded bitstreams.
ISO/IEC TR 23888-1: Offers more extensive guidance on use cases and machine-centered video analysis within this standards series.

Practical Value

Implementing the guidelines and technologies described in ISO/IEC TR 23888-3:2026 helps organizations:

Maximize video coding efficiency for machine-oriented workflows;
Enable reliable and accurate AI analysis of video streams in diverse deployment scenarios;
Comply with evolving international standards in AI for multimedia content;
Stay current with best practices, boosting system performance while managing computational and network costs.

By following this report, stakeholders in AI video analysis, intelligent transportation, surveillance, and industrial automation can optimize their systems for future-ready, machine-driven operations.

Frequently Asked Questions

What is ISO/IEC TR 23888-3:2026?

ISO/IEC TR 23888-3:2026 is a technical report published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Its full title is "Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content". Its scope reads: This document provides information about optimizations for encoders and receiving systems for conducting machine analysis tasks on coded video content. It provides a concept-level overview of recent practices and provides comments on technical aspects and cautions to be taken when interpreting the results. This document describes technologies that have recently been studied and have demonstrated benefits to coding efficiency for some machine analysis tasks.

What ICS categories does ISO/IEC TR 23888-3:2026 belong to?

ISO/IEC TR 23888-3:2026 is classified under the following ICS (International Classification for Standards) categories: 35.040.40 - Coding of audio, video, multimedia and hypermedia information; 35.240.01 - Application of information technology in general. The ICS classification helps identify the subject area and facilitates finding related standards.

How can I access ISO/IEC TR 23888-3:2026?

ISO/IEC TR 23888-3:2026 is available in PDF format for immediate download after purchase. The document can be added to your cart and obtained through the secure checkout process. Digital delivery ensures instant access to the complete standard document.

Standards Content (Sample)

Technical Report
ISO/IEC TR 23888-3
First edition
2026-06
Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
Technologies de l'information — Intelligence artificielle pour le multimédia — Partie 3: Optimisation des codeurs et des systèmes de réception pour l'analyse automatique de contenus vidéo codés
Reference number
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
© ISO/IEC 2026 – All rights reserved
Contents
Foreword
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Overview
5.1 General overview
5.2 Use cases and applications
6 Evaluation methodology
6.1 General
6.2 Bit rate
6.3 PSNR
6.4 mAP
6.5 MOTA
6.6 BD-rate
7 Pre-processing technologies
7.1 Region of interest-based methods
7.2 Foreground and background processing
7.3 Temporal subsampling
7.4 Spatial subsampling
7.5 Noise filtering
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
8.2 Quantization step adjustment for temporal layers
8.3 Chroma QP offset setting
9 Post-processing technologies
9.1 Temporal resampling
9.2 Spatial resampling
9.3 Enhancement post-filtering
10 Metadata
10.1 General
10.2 Neural-network post-filter SEI message
10.3 Annotated regions SEI message
10.4 Object mask information SEI message
10.5 Encoder optimization information SEI message
10.6 Packed regions information SEI message
Annex A (informative) Software implementation examples
Annex B (informative) Combined software implementation examples
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T (as ITU-T H.Sup.MACVC).
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.
Technical Report ISO/IEC TR 23888-3:2026(en)
Information technology — Artificial intelligence for
multimedia —
Part 3:
Optimization of encoders and receiving systems for machine
analysis of coded video content
1 Scope
This document provides information about optimizations for encoders and receiving systems for conducting
machine analysis tasks on coded video content. It provides a concept-level overview of recent practices
and provides comments on technical aspects and cautions to be taken when interpreting the results. This
document describes technologies that have recently been studied and have demonstrated benefits to coding
efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC 23002-7
and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
machine consumption
operation of a machine analysis task such as object detection, segmentation or object tracking
4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
IoU intersection over union
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PRI packed regions information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams (Rec.
ITU-T H.274 | ISO/IEC 23002-7)
VTM Reference software for versatile video coding (Rec. ITU-T H.266.2 | ISO/IEC 23090-16:2025)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
$Y'C_BC_R$ colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as $Y'C_BC_R$
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This document
describes technologies for optimization of encoders and receiving systems, such as pre-processing, encoding
and post-processing for machine consumption. The decoding process, on the other hand, is fully specified
in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T H.265 |
ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10 Advanced
Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded video are
fully specified by the given input bitstream.
Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine
consumption can be found in Clause 6. Descriptions of pre-processing technologies can be found in Clause 7.
Encoder optimization technologies are described in Clause 8 and post-processing technologies are described
in Clause 9. Metadata that is useful for machine consumption is described in Clause 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested technologies and their combinations are listed in Annex A and Annex B, respectively.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on
the server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles
are expected to play a significant role in future transport systems and the tremendous number of
vehicles emphasizes the need to reduce the amount of data being transmitted between them to avoid
overloading the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1 [1].
6 Evaluation methodology
6.1 General
A set of assessment metrics is used for the evaluation of encoder and receiving systems optimization
technologies for machine consumption. An overview of the evaluation framework is shown in Figure 2. Here the
input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video is
used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.
Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as frame
rate and the number of total frames. The following formula is applied to calculate the bit rate:
$$\mathrm{bitRate} = \frac{8 \cdot \mathrm{fileSizeInBytes} \cdot \mathrm{fps}}{\mathrm{numFrames} \cdot 1000}$$
where fileSizeInBytes is the size of a file measured in number of bytes, fps is the number of frames per second
and numFrames is the total number of frames in the file.
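The bit rate formula above can be sketched in Python as follows. This is a minimal illustration, not part of the report; the function name and the example figures are hypothetical. Because the file size is converted to bits and the result is divided by 1000, the value comes out in kilobits per second.

```python
def bit_rate(file_size_in_bytes: int, fps: float, num_frames: int) -> float:
    """Bit rate as 8 * fileSizeInBytes * fps / (numFrames * 1000)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return 8 * file_size_in_bytes * fps / (num_frames * 1000)


# Hypothetical example: a 1 500 000-byte bitstream holding 300 frames at 30 fps.
example_bit_rate = bit_rate(1_500_000, 30, 300)
```

Doubling the file size doubles the result, while doubling the number of frames at the same frame rate halves it, because the same data is then spread over twice the duration.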
6.3 PSNR
Encoding for video distribution is ordinarily performed in the $Y'C_BC_R$ domain (nicknamed YUV herein for
brevity and ease of typing). For standard-dynamic-range video, the distortion metric primarily used in the
video coding standardization community has been peak signal to noise ratio (PSNR). The following two
formulae are used to calculate PSNR:
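$$\mathrm{MSE} = \frac{1}{m \cdot n} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} \left( x(i,j) - y(i,j) \right)^2$$

$$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\left( 255 \cdot 2^{\mathrm{bitDepth}-8} \right)^2}{\mathrm{MSE}} \right)$$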
where x(i,j) is the decoded sample value of a certain colour component, y(i,j) is the corresponding original
sample value, and bitDepth is the bit depth of the input video. It is a common practice to calculate PSNR
values for each of the colour components Y, U and V. Information on how to interpret or combine PSNR values
of colour components can be found in ISO/IEC TR 23002-8 [2].
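The two PSNR formulae can be sketched in plain Python for a single colour component. This sketch assumes that each plane is a list of rows of integer sample values, that n and m are the plane's dimensions, and that the peak value 255 * 2^(bitDepth - 8) is squared before being divided by the MSE, as is usual for PSNR. The function names are hypothetical.

```python
import math


def mse(decoded, original):
    """Mean squared error between two equally sized sample planes."""
    n = len(original)       # number of rows
    m = len(original[0])    # number of samples per row
    total = sum(
        (decoded[i][j] - original[i][j]) ** 2
        for i in range(n)
        for j in range(m)
    )
    return total / (m * n)


def psnr(decoded, original, bit_depth=8):
    """PSNR in dB for one colour component; infinite when the planes match."""
    error = mse(decoded, original)
    if error == 0:
        return math.inf
    peak = 255 * 2 ** (bit_depth - 8)  # 255 for 8-bit video, 1020 for 10-bit
    return 10 * math.log10(peak ** 2 / error)
```

In practice the function would be called once per colour component (Y, U and V), giving three PSNR values per picture.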
6.4 mAP
The performance of object detection and segmentation tasks is measured by mean average precision (mAP),
described as follows.
For a given category of object, true positive $TP(T_{IoU})$, false positive $FP(T_{IoU})$, true negative $TN(T_{IoU})$ and
false negative $FN(T_{IoU})$ are defined with an IoU threshold $T_{IoU}$ for that category, where true positive is a
case that an object is detected by the model and it is also a part of the ground truth; false positive is a case
that an object is detected by the model but it is not a part of the ground truth; true negative is a case that
an object is not detected by the model and it is also not a part of the ground truth; false negative is a case
that an object is not detected by the model but it is a part of the ground truth.
Then, recall of the given IoU threshold is defined as the proportion of all true positive cases in all true
positive and false negative cases corresponding to that IoU threshold:
$$\mathrm{recall}(T_{IoU}) = \frac{TP(T_{IoU})}{TP(T_{IoU}) + FN(T_{IoU})}$$
The precision of the given IoU threshold is the proportion of all true positive cases in all detected cases, i.e. all true positive and false positive cases:
$$\mathrm{precision}(T_{IoU}) = \frac{TP(T_{IoU})}{TP(T_{IoU}) + FP(T_{IoU})}$$
It is possible for a neural network of detection or segmentation to obtain several pairs of recall and precision
values with different confidence levels. For each recall value $r$ in the pairs, let $p(r)$ take the maximum
precision value among all precision values for which the corresponding recall values are at or above the given recall
value $r$:
$$p(r) = \max_{\tilde{r}:\, \tilde{r} \ge r} \mathrm{precision}(\tilde{r})$$
Average precision (AP) of a given category of object is defined as the average value of $p(r)$ for all recall
values provided by the neural network, which characterizes the area under the entire precision-recall curve.
The mAP is defined as the average of AP scores of all categories within a range of IoU thresholds.
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the IoU between the detected bounding box and
the ground truth bounding box is at least 0.5. Sometimes this variant of the mAP metric is also referred
to as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are
calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches
the upper bound value 0.95. Once all ten scores are determined, the average of these scores is calculated
to produce the final mAP.
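The definitions in this clause can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any particular evaluation toolkit. Boxes are assumed to be (x1, y1, x2, y2) tuples. Each detection is assumed to have already been matched against the ground truth at one IoU threshold and labelled as true or false positive. One recall and precision pair is taken per detection when detections are ranked by confidence. All names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, num_ground_truth):
    """AP for one category at one IoU threshold.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects, i.e. TP + FN.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    pairs = []  # (recall, precision) after each ranked detection
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        pairs.append((tp / num_ground_truth, tp / (tp + fp)))
    if not pairs:
        return 0.0
    # p(r): best precision among all pairs whose recall is at least r.
    p_values = [max(p for r2, p in pairs if r2 >= r) for r, _ in pairs]
    return sum(p_values) / len(p_values)


# The ten IoU thresholds of mAP@[0.5:0.05:0.95].
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]
```

The mAP would then be obtained by averaging `average_precision` over all object categories and, for the mAP@[0.5:0.05:0.95] variant, over the ten thresholds in `IOU_THRESHOLDS`, with the true or false positive labels recomputed for each threshold.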
6.5 MOTA
Object tracking performance is measured by multiple object tracking accuracy (MOTA). This metric accounts
for all object configuration errors made by the tracker, namely false positives, misses (false negatives) and
mismatches, over all frames. The calculation of MOTA is as follows:
$$\mathrm{MOTA} = 1 - \frac{\sum_{t} \left( FN_t + FP_t + mme_t \right)}{\sum_{t} g_t}$$
where $FN_t$, $FP_t$, $mme_t$ and $g_t$ are the number of false negatives, the number of false positives, the number
of mismatch errors (ID switches between two successive frames) and the number of objects in the ground
truth, respectively, at time $t$.
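The MOTA formula can be sketched in Python as follows, assuming the per-frame counts have already been produced by a tracker evaluation step. The function name and the tuple layout are hypothetical.

```python
def mota(per_frame_counts):
    """MOTA from a sequence of (fn, fp, mme, g) tuples, one tuple per frame."""
    counts = list(per_frame_counts)
    errors = sum(fn + fp + mme for fn, fp, mme, _ in counts)
    ground_truth_objects = sum(g for _, _, _, g in counts)
    if ground_truth_objects == 0:
        raise ValueError("no ground-truth objects in the sequence")
    return 1 - errors / ground_truth_objects
```

Because the errors are summed over all frames before dividing, frames with many ground-truth objects weigh more than sparse frames. The value can also become negative when the tracker makes more errors than there are ground-truth objects.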
6.6 BD-rate
To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate
(BD-rate) metric [2] is used. Instead of using PSNR as the distortion metric as is typical for human vision
performance evaluation, machine consumption distortion metrics, e.g., mAP and MOTA, are used in machine
BD-rate calculation.
The distortion measurement of machine consumption (e.g., mAP and MOTA) can sometimes be non-
monotonic to the bit rate due to the characteristics of the machine analysis task and possible limitations of
machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus valid
BD-rate calculation.
$$f(x) = b_0 x^3 + b_1 x^2 + b_2 x + b_3$$
For a given polynomial function in the above formula, $b_0$, $b_1$, $b_2$ and $b_3$ are coefficients of the function, $x$ is
the input (bit rate) and $f(x)$ is the output (quality). The following two constraints are invoked to ensure its
monotonicity and convexity:
— the first order derivative of the polynomial shown below is positive in the given $x$ range
$$f'(x) = 3 b_0 x^2 + 2 b_1 x + b_2$$
— the second order derivative of the polynomial shown below is negative in the given $x$ range
$$f''(x) = 6 b_0 x + 2 b_1$$
Parameters $b_0$, $b_1$, $b_2$, $b_3$ in the polynomial function are solved by sequential least squares programming
(SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimal quality value of the fitted curve no smaller than the minimal
quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum
quality value of the original curve.
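The two derivative constraints can be reduced to three inequalities evaluated at the ends of the bit rate range, which is a convenient form to hand to a constrained solver. The second derivative is linear in x, so it is negative over the whole range exactly when it is negative at both end points. The first derivative is then decreasing, so it is positive over the whole range exactly when it is positive at the upper end. The following Python sketch checks a candidate set of coefficients in this way. The function names are hypothetical, and the fitting step itself is not shown.

```python
def constraint_values(b, x_min, x_max):
    """Return (f''(x_min), f''(x_max), f'(x_max)) for f(x) = b0*x^3 + b1*x^2 + b2*x + b3."""
    b0, b1, b2, _b3 = b
    second_at_min = 6 * b0 * x_min + 2 * b1
    second_at_max = 6 * b0 * x_max + 2 * b1
    first_at_max = 3 * b0 * x_max ** 2 + 2 * b1 * x_max + b2
    return second_at_min, second_at_max, first_at_max


def satisfies_constraints(b, x_min, x_max):
    """True if f' > 0 and f'' < 0 everywhere on [x_min, x_max]."""
    second_at_min, second_at_max, first_at_max = constraint_values(b, x_min, x_max)
    return second_at_min < 0 and second_at_max < 0 and first_at_max > 0
```

A solver would typically receive the negated second-derivative values and the first-derivative value as quantities that must stay positive, while minimizing the squared error between the fitted curve and the measured rate and quality points.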
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is
analysed in some way and then the encoder can optimize the encoding towards machine consumption based
on the analysis results. The analysis can be done using various methods, e.g., neural networks. An example
of a pipeline that can be used for RoI-based approaches is shown in Figure 3.
Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each
object includes the index of the picture in which the object can be found and the position of the object in
the picture. Some networks can provide more information than this and the encoder can choose to select a
subset of all objects by filtering based on, for example, the class of an object or the estimated likelihood of an
object of the described class being at the described position. In a similar approach, a segmentation network
can be used where the object is not described by a bounding box but by a segmentation mask indicating
exactly which samples the segmentation network estimates to belong to the object. The list produced
during the analysis can then be used by the encoder, for example, to separate foreground and background
with the purpose of encoding the foreground at a better quality and the background at a lower quality. One
such encoding method is described in 8.1. In this example, the analysis does not change the input video, but
directly forwards it to the encoder.
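The object list and the filtering step described above can be sketched in Python as follows. The record layout, the field names and the threshold are assumptions made for illustration. An actual detection network defines its own output format.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    picture_index: int   # index of the picture containing the object
    box: tuple           # hypothetical layout: (x, y, width, height) in samples
    object_class: str    # class label reported by the network
    likelihood: float    # estimated likelihood, between 0 and 1


def select_rois(objects, allowed_classes, min_likelihood):
    """Keep only objects of the wanted classes with a sufficient likelihood."""
    return [
        obj for obj in objects
        if obj.object_class in allowed_classes and obj.likelihood >= min_likelihood
    ]
```

The encoder would then treat the boxes of the selected objects as foreground and everything else as background.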
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as
subsampling the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data.
The network produces a list of objects segmented with the object shapes in the current picture. The object
shapes and positions could be represented, for example, by segmentation masks. More information such as
the object class or the estimated likelihood of the object segment could also be provided by the network to
identify the objects. Based on the object information, it is possible to derive spatial complexity and temporal
complexity for the different segments, and then RoI-based pre-processing of the input video can be adapted
based on the spatial and temporal complexity. The spatial complexity here indicates the averaged object
size which can be calculated by dividing the percentage of the area covered by the objects by the total
number of the objects. Temporal complexity indicates the content changes between two pictures which can
be calculated by various methods, for example, by taking the mean absolute difference of the collocated
samples in two pictures.
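The two complexity measures can be sketched in Python as follows. The sketch assumes that the segmentation masks have been merged into one binary mask per picture and that pictures are lists of rows of sample values. The function names are hypothetical.

```python
def spatial_complexity(object_mask, num_objects):
    """Percentage of the picture area covered by objects, divided by the object count."""
    total_samples = sum(len(row) for row in object_mask)
    if num_objects == 0 or total_samples == 0:
        return 0.0
    covered = sum(1 for row in object_mask for value in row if value)
    return 100.0 * covered / total_samples / num_objects


def temporal_complexity(picture_a, picture_b):
    """Mean absolute difference of the collocated samples of two pictures."""
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(picture_a, picture_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)
```

A large spatial complexity value indicates few, large objects, while a large temporal complexity value indicates strong content changes between the two pictures.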
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the
input video resolution.
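Both background treatments can be sketched in plain Python as follows. The sketch assumes a single sample plane stored as a list of rows and a binary foreground mask of the same size. A fixed 3x3 binomial kernel ([1, 2, 1] / 4 in each direction) stands in for the Gaussian filter here. This is a simplification, since the text above lets the window size depend on the video resolution. All names are hypothetical.

```python
def blur3x3(picture):
    """Separable [1, 2, 1] / 4 low-pass filter; border samples are repeated."""
    height, width = len(picture), len(picture[0])
    taps = ((-1, 1), (0, 2), (1, 1))  # (offset, weight); the weights sum to 4

    def clamp(value, upper):
        return max(0, min(value, upper))

    horizontal = [
        [sum(w * row[clamp(x + d, width - 1)] for d, w in taps) / 4
         for x in range(width)]
        for row in picture
    ]
    return [
        [sum(w * horizontal[clamp(y + d, height - 1)][x] for d, w in taps) / 4
         for x in range(width)]
        for y in range(height)
    ]


def process_background(picture, foreground_mask, constant=None):
    """Keep foreground samples; replace background by a constant or a blurred value."""
    smoothed = None if constant is not None else blur3x3(picture)
    result = []
    for y, row in enumerate(picture):
        out_row = []
        for x, sample in enumerate(row):
            if foreground_mask[y][x]:
                out_row.append(sample)          # foreground is left untouched
            elif constant is not None:
                out_row.append(constant)        # background "eliminated"
            else:
                out_row.append(smoothed[y][x])  # background low-pass filtered
        result.append(out_row)
    return result
```

Passing a `constant` removes the background entirely, whereas omitting it keeps a smoothed version of the background that is cheaper to encode but still carries some context around the objects.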
Moreover, extracted features can reveal importance information of the input video. In other words,
compared with binary classification of foreground and background, these extracted features can provide
importance information at a finer granularity. Therefore, such extracted features can be used to determine
how to process foreground and background differently. In one implementation example, a feature map is
extracted by a feature extraction network, and based on the feature map, the parameters of a Gaussian
smoothing filter are adapted and then the adaptive filtering is applied to the picture. As the background area
and foreground area have different features and even within the background or foreground area, different
regions can have different features, the Gaussian smoothing filter can be controlled at a finer granularity,
which finally results in a more efficient pre-processing.
An implementation example with more detailed description can be found in A.2.
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the
video at a lower frame rate. One example is to remove every other frame from the input video and encode
the video at half frame rate. This can be done in a dynamic manner, for example by evaluating the motion
between two or more frames and if there is only little motion, a frame can be removed. In some cases, if
the receiving system requires a specific frame rate, a corresponding post-processing technology that up-
samples the video to the full frame rate can be applied.
An implementation example with more detailed description can be found in A.4.
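Fixed and dynamic temporal subsampling can be sketched in Python as follows. The motion measure used here, the mean absolute difference against the last kept frame, and the threshold are assumptions chosen for illustration. The text above only requires that the motion between frames is evaluated in some way. Frames are assumed to be flat lists of sample values, and all names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference of the collocated samples of two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def halve_frame_rate(frames):
    """Fixed subsampling: keep every other frame, starting with the first."""
    return frames[::2]


def drop_low_motion_frames(frames, threshold):
    """Dynamic subsampling: return the indices of the frames to encode.

    A frame is dropped when it differs too little from the last kept frame.
    """
    if not frames:
        return []
    kept = [0]  # the first frame is always kept
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[index]) >= threshold:
            kept.append(index)
    return kept
```

The list of kept indices would also need to be available to the receiving system if the dropped frames are to be reconstructed at their original positions by temporal up-sampling.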
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the averaged object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also be dependent on the
picture types. For example, depending on whether the input video is captured by a regular camera as natural
scenes, or is captured by an infrared sensor as thermal images, different spatial subsampling methods can be
applied.
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose
to encode some pictures of the input video at different resolutions. For example, based on the analysis of
keyframes, without encoding an intra picture at the specified changed resolution, inter predictions can be
made from all allowed pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a scale
...
