ISO/IEC DTR 42106
Information technology — Artificial intelligence (AI) — Overview of differentiated benchmarking of AI system quality characteristics
FINAL DRAFT Technical Report
ISO/IEC JTC 1/SC 42
Secretariat: ANSI
Voting begins on: 2025-12-22
Voting terminates on: 2026-02-16

Information technology — Artificial intelligence (AI) — Overview of differentiated benchmarking of AI system quality characteristics

RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.

IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
© ISO/IEC 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.

ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org

Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Overview of relevant benchmarking methods
  4.1 Review of benchmarking definitions
  4.2 Types of benchmarking
  4.3 Metrics, measures and criteria
5 Benchmarking AI systems
  5.1 Benchmarking AI system quality
  5.2 Context of use
  5.3 Complex adaptive systems
  5.4 Limitations in benchmarking AI systems
6 Frameworks for differentiated benchmarking
  6.1 AI management frameworks
  6.2 Classification-based frameworks
  6.3 Levels of specification
7 Feasibility analysis
  7.1 Case Example 1: On-job training recommendation system
  7.2 Case Example 2: User intent recognition
  7.3 Case Example 3: Generation of clinical pathways
Annex A (informative) Sample levels of specification
Annex B (informative) Descriptions of measures
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 42, Artificial intelligence.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.
Introduction
Artificial intelligence (AI) systems are diverse in nature and heterogeneous in terms of their potential
impact on consumers and third-parties. For example, a company’s use of disposable income estimates using
a linear regression model to serve ads targeted at different socio-economic profiles reflects a system with
low complexity and potential impact. Conversely, a bank’s use of disposable income estimates using a large
neural network model to make housing loan decisions reflects a system with both high complexity and high
impact.
Benchmarking is often used to compare quality characteristics of software systems against reference or
target values. Given the diverse nature and heterogeneous impact of AI systems, such reference or target
values can differ widely across systems, and across deployments of similar systems in different contexts of use.
To accommodate this diversity and heterogeneity, differentiated benchmarking can target different quality
characteristics at different reference values. This document reviews AI management frameworks for their
ability to offer guidance for differentiated benchmarking of AI system quality characteristics.
By evaluating frameworks for specifying differing levels of benchmarks and benchmarking, commensurate with the expected social impact of AI systems, this document identifies gaps in current frameworks. When filled, these gaps can yield guidance for differentiated benchmarking of AI systems, helping to rationalize the standardization implementation effort of AI providers while maintaining system trustworthiness for AI customers and AI partners.
FINAL DRAFT Technical Report ISO/IEC DTR 42106:2025(en)
Information technology — Artificial intelligence (AI) —
Overview of differentiated benchmarking of AI system quality
characteristics
1 Scope
This document provides an overview of conceptual frameworks for graded benchmarking of artificial
intelligence (AI) system quality characteristics. The aim is to examine the feasibility of using differentiated
benchmarking of quality characteristics based on the complexity and context of use of an AI system.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 22989, Information technology — Artificial intelligence — Artificial intelligence concepts and
terminology
ISO/IEC 23053, Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML)
ISO/IEC 25059, Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE)
— Quality model for AI systems
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO/IEC 22989, ISO/IEC 23053,
ISO/IEC 25059 and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— IEC Electropedia: available at https://www.electropedia.org/
— ISO Online browsing platform: available at https://www.iso.org/obp
3.1
benchmark
reference point against which comparisons can be made

Note 1 to entry: For AI system benchmarking, an AI system quality characteristic (ISO/IEC 25059[1]) is the object of comparison.

[SOURCE: ISO/IEC 29155-1:2017[2], 3.2, modified — Note 1 to entry has been replaced.]

3.2
benchmarking
activity of comparing objects of interest to each other or against a benchmark (3.1) to evaluate characteristic(s)

Note 1 to entry: For AI system benchmarking, the object of interest is an AI system quality characteristic (ISO/IEC 25059[1]).

[SOURCE: ISO/IEC 29155-1:2017[2], 3.3, modified — Note 1 to entry has been replaced.]
4 Overview of relevant benchmarking methods
4.1 Review of benchmarking definitions
When searching for "benchmarking" in the ISO 1), IEC 2) and ITU 3) terminology databases, there were 76 results from the ISO OBP and none from the IEC and ITU-T databases. After deleting irrelevant terms and definitions and merging identical definitions together, 14 definitions were collected.
These 14 definitions include several instances that define "benchmark" and "benchmarking" as a pair,
with the definition of benchmarking relying on the paired definition of benchmark (e.g. the pairs {D3, D5},
{D13, D14}, {D4, D2}). A clustered view of the objects of interest for each of these definitions of benchmark/
benchmarking is given in Table 1.
Table 1 — Clustered objects and characteristics relevant for benchmark and benchmarking

Cluster 1 (benchmark)
Object: reference point/tool/method (metric against; any standard or reference; point of fixed location; permanent mark)
Characteristics: comparisons can be made; process, performance or quality can be measured; other characteristics can be measured
Sources: ISO 41011:2024[3], 3.8.5; ISO 14031:2021[4], 3.4.8; ISO/IEC 29155-1:2017[2], 2.1 and 3.2; ISO 17258:2015[5], 3.1; ISO 14050:2020[6], 3.2.15; ISO 21678:2020[7], 3.2; ISO 21931-1:2022[8], 3.2.16; ISO/IEC/IEEE 24765:2017[9], 3.362; ISO/TS 18667:2018[10], 3.1.1; ISO 20468-1:2018[11], 3.1.2; ISO 13053-2:2011[12], 2.1; ISO/IEC 25040[13]

Cluster 2 (benchmarking)
Object: activity of comparing, evaluating and analysing (activity of comparing or evaluating; comparative evaluation or analysis; activity of measurement and analysis)
Characteristics: objects of interest compared to each other or against a benchmark, or characteristic; similar operational practices; practices an organization can use to search for and compare practices inside and outside the organization, with the aim of improving its performance
Sources: ISO/IEC 29155-1:2017[2], 3.2 and 3.3; ISO/IEC 18520:2019[14], 3.1.1; ISO/TR 24514:2018[15], 3.1; ISO 14644-16:2019[16], 3.3.1; ISO 10010:2022[17], 3.4; ISO 10014:2021[18], 3.8; ISO 30400:2022[19], 3.1.18 and 3.17; ISO 32210:2022[20], 3.34; ISO/IEC/IEEE 24765:2017[9], 3.362 and 3.363; ISO/IEC TS 25058[21]

Cluster 3 (benchmarking)
Object: process of comparing processes, performances or quality against practices
Characteristics: of the same nature, under the same circumstances and with similar measures
Sources: ISO 41011:2024[3], 3.8.5.1

Cluster 4 (benchmark)
Object: single value
Characteristics: used for orientation
Sources: ISO 17258:2015[5], 3.2; ISO 24523:2017[22], 3.2; ISO 24513:2019[23], 3.7.1.1.2; ISO/TR 24514:2018[15], 3.1
Core concepts about "benchmarking" are reflected in the repetition of words and phrases across these 14
definitions. Among these, "comparisons" (10 times), “performances” (7 times), "can be measured" (6 times),
and "practices" (5 times) are the most frequently used core concepts.
1) ISO Online browsing platform (OBP): available at https://www.iso.org/obp
2) IEC Electropedia (IEV): available at https://www.electropedia.org/
3) ITU-T Terms and Definitions: available at https://www.itu.int/br_tsb_terms/#/
The terms “process” (4 times), “organization” (4 times), “a reference point” (3 times), “metric against” (3 times), “evaluate” (3 times) and “standard” (3 times) also contribute to an understanding of the basic concept of “benchmarking”.
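As a non-normative illustration of the frequency analysis described above, the following sketch counts occurrences of recurring phrases across a set of definition texts. The definition strings and phrase list here are abbreviated placeholders, not the actual 14 definitions.

```python
# Non-normative sketch of the phrase-frequency analysis described above.
# The definition texts and phrase list are abbreviated placeholders.
from collections import Counter

definitions = [
    "reference point against which comparisons can be made",
    "standard against which performance can be measured",
    "activity of comparing practices to improve performance",
]
phrases = ["comparisons", "performance", "can be measured", "practices"]

# Count how often each phrase occurs across all definition texts.
counts = Counter()
for text in definitions:
    for phrase in phrases:
        counts[phrase] += text.count(phrase)

print(counts.most_common())  # phrases ranked by total occurrences
```

Applied to the full set of 14 definitions, such a tally yields the counts reported above (e.g. "comparisons" appearing 10 times).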
4.2 Types of benchmarking
From the review of uses of benchmarking in standardization literature, it is evident that there are primarily two types of benchmarking in existing definitions of ISO deliverables: benchmarking can be intended in terms of an activity and components (ISO/IEC 29155-1[24]) or for processes (ISO 41011:2024[3]) as objects of benchmarking.
With regards to benchmarking activity and components, the focus lies on comparing objects of interest to each other or against a benchmark to evaluate characteristics (ISO/IEC 29155-1[24]). Such activities have characteristics of similar operational practices, similar attributes, and processes or performance that are comparable. The benchmark refers to a reference point against which comparisons can be made. For the relevant stakeholder, a reference point can be any of the following:
— a tool for performance improvement through systematic search and adaptation of leading practice;
— a standard against which results can be measured or evaluated;
— a method for comparing the performance of organizations in a market segment;
— a test procedure that can be used to compare systems or components to each other or to a standard.
ISO/IEC 29155-1[24] relies for product quality evaluation on ISO/IEC 25000[25], and in this perspective ISO/IEC 25040[13] defines the context for using benchmarks as quality criteria.
With regards to benchmarking processes, the focus lies on comparing any combination of processes, performances and quality against practices of the same nature, under the same circumstances and with similar measures. Its special considerations are the systematic process for identifying, becoming acquainted with, and adopting successful practices of benchmarking partners. This concept is used in the domain of facility management (ISO 41011:2024[3]).
Within this document, the concept of benchmarking is used to focus on the activity of comparing objects of interest against a benchmark to evaluate characteristics, and the concept of benchmark to focus on a reference point against which comparisons can be made. Such concepts are widely used in the domain of information technology project performance benchmarking frameworks in systems and software engineering, as defined in ISO/IEC 25040[13] and ISO/IEC TS 25058[21].
Therefore, the definitions of benchmark and benchmarking given in this document are adapted from ISO/IEC 29155-1:2017[2], 3.2 and ISO/IEC 29155-1:2017[2], 3.3 respectively, to reflect the emphasis on product benchmarking most clearly.
4.3 Metrics, measures and criteria
AI system functional correctness is measured using a vast array of quantitative metrics. Additionally, a
number of measures, such as loss functions, are relevant for measuring functional reliability during model
training, but not during actual system usage. In addition, some criteria are used for model size determination,
model selection, model training time and so forth, but are not directly reported as functional correctness
indices.
This document describes metrics, measures and criteria used for common tasks performed by AI systems.
The given list is not comprehensive, but is intended to provide a useful overview of tasks and their
corresponding measures.
Table 2 — Metrics and measures for common AI system tasks

Task | Measures
Classification (ISO/IEC TS 4213[26]) | Accuracy; cross validation; precision; recall; confusion matrix; ROC (receiver operating characteristic) curve
Regression[27] | Root mean squared error (RMSE)
Prediction ranking[27] | Spearman’s rank correlation
Localisation (bounding box around an object)[28] | Intersection over union (IoU)
Object detection[29][28] | Mean average precision (mAP)
Image semantic segmentation[29] | Mean intersection over union (mIoU)
Time-series forecasting[30] | MSE, MAE, MAPE (mean absolute percentage error), mean absolute scaled error (MASE)
POS (part-of-speech) tagging[31] | Accuracy
Named entity recognition[31] | Precision, recall, F1 score
Dependency parsing[31] | Labelled attachment score (LAS), unlabelled attachment score (UAS), label accuracy score (LS)
Information retrieval[31] | PR curve, interpolated precision, mean average precision
Summarisation[32] | ROUGE (F1 score from the n-gram precision and recall)
Mathematical descriptions of measures used in Table 2 are given in Annex B.
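As a non-normative illustration, several of the measures in Table 2 can be computed as follows. The function names and example values are chosen for this sketch only; the normative mathematical descriptions are those given in Annex B and the cited documents.

```python
# Non-normative sketch of several measures from Table 2.
# Function names and example values are chosen for this illustration only.
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean squared error, as used for regression tasks."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) bounding boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)   # p = r = f1 = 0.8
print(p, r, f1)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))      # sqrt(4/3)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))             # 1/7
```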
Additionally, several criteria are used for determining the efficacy of model training procedures, without a
direct correlation with the functional correctness of the model. Some such criteria include:
— model size,
— training time,
— convergence rate.
It is notable that, whereas AI system quality encompasses multiple characteristics and sub-characteristics, as described in ISO/IEC 25059[1], existing metrics are mainly relevant to measurement of functional suitability, and measures of several other important characteristics, such as reliability, maintainability, usability and security, appear not to be addressed. The consequences of this imbalance are reviewed further in 5.4.
5 Benchmarking AI systems
5.1 Benchmarking AI system quality
Benchmarking the quality characteristics of AI systems is crucial for several reasons. Firstly, it allows
measurement and comparison of the quality characteristics of different AI models objectively, providing
valuable insights into their strengths and weaknesses. By benchmarking quality characteristics such
as accuracy, efficiency, reliability, and robustness, stakeholders can identify areas for improvement and
innovation, driving advancements in AI technology. Additionally, benchmarking facilitates standardization
and transparency within the AI ecosystem, enabling stakeholders to make informed decisions about
which models are most suitable for their specific needs. Furthermore, benchmarking helps to establish
benchmarks against which future AI systems can be evaluated, fostering a continuous cycle of improvement
and innovation. In addition to this, benchmarks can be used in AI management system controls.
Several methods exist for benchmarking AI systems, each tailored to measure specific quality characteristics. IEEE 2937[33] provides formalized methods for benchmarking of hardware-related metrics of AI server systems, emphasizing the measurement of training time, power consumption and inference latency. To measure the functional correctness and suitability of AI systems, the general approach is the use of reference datasets and evaluation metrics, where AI models are tested on established reference datasets, such as ImageNet[34] for image classification or MNIST[35] for handwritten digit recognition. These datasets come with predefined training and testing datasets, enabling consistent evaluation across different methods. Another method involves organizing competitions and challenges, such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)[36] or the Common Objects in Context (COCO) challenge[37], where researchers and developers submit their AI models to compete against each other on specific tasks. These competitions provide a platform for rigorous evaluation and comparison of AI systems in diverse scenarios. Most relevant for this document, organizations like the National Institute of Standards and Technology (NIST), with the FRVT framework[38], and the AI Benchmarking Initiative (AIBench)[39] have developed reference methodologies and benchmarks for evaluating AI systems in specific domains, promoting transparency and reproducibility in AI research.
While current approaches for benchmarking AI systems are valuable, they also have several limitations. An important limitation is dataset bias, where the quality of AI models can be skewed due to biases present in the training data. This can lead to overfitting to specific datasets and poor generalization to real-world scenarios[40]. Leader-board competitions, the most common form of accuracy benchmarking for AI systems, are particularly sensitive to dataset decay, and require careful handling to prevent overfitting to held-out data[41]. Another challenge is the proliferation of evaluation metrics across domains, making it difficult to compare the quality of AI models across different tasks. Reference datasets and competitions typically focus on narrow tasks or domains, limiting the scope of evaluation and potentially overlooking important aspects of AI systems, such as ethical considerations and societal impact. Moreover, the reproducibility of benchmarking results can be challenging, particularly when details about model architectures, hyperparameters, and training procedures are not adequately documented.
5.2 Context of use
Some software quality standards have historically co-evolved with reliability engineering, which in turn
historically focused on the maintenance and upkeep of mechanical systems. For such mechanical systems,
component reliability tends to correlate well with nearly all desirable quality metrics, such as functional
correctness, safety and resilience. For the most part, since software systems also have conceptually
enumerable input-output characteristics, approaches rooted in reliability engineering have translated well
to them.
However, this historical provenance of software quality standards systematically under-emphasises the role
of context of use on the quality characteristics of software-based systems. This is a significant limitation,
as the context of use offers considerable information about the possible hazards of a system’s use, which
is necessary to design appropriate requirements for the system. As Nancy Leveson observes, “system
and software requirements development are necessarily a system engineering problem, not a software
engineering problem.”[42]
It is therefore helpful to consider AI systems from a sociotechnical perspective, ensuring that the degree of
quality assurance is aligned with the degree of quality expected of the system based on the context of use.
5.3 Complex adaptive systems
NIST AI 100-1:2023[43], Annex B summarizes key aspects in which risks from AI systems are different from risks from traditional software systems. These differences include the following:
a) Data used in model training is not always representative of the context of use of the system.
b) It is possible that real ground truth data does not exist, or is not available.
c) Data distributions can drift over time, and become detached from the original context in which the
system was trained.
d) Use of pre-trained models limits the controllability of data quality and bias mitigation strategies.
In addition to these risks to system correctness, multiple additional sociotechnical considerations apply for
other quality characteristics of AI systems, such as:
e) Humans interacting with AI systems can change their behaviours to work around the narrow intelligence of the systems that replace human operators.
f) AI systems can be subjected to data poisoning and spoofing attacks, reducing their effectiveness when deployed.
g) Human operators working alongside AI decision support systems can become overconfident and accept system suggestions by default.
h) Human operators working alongside AI decision support systems can mistrust and ignore AI system suggestions.
i) AI system integration into legacy IT systems can expand the cybersecurity threat envelope of the
existing system in ways that are difficult to detect with an audit of the two systems in isolation.
This list of considerations is not comprehensive, and is presented primarily to emphasize the thematic point
that AI systems are best validated with a sociotechnical systems approach, accounting for the fact that they
interact with users and third parties in complex ways, and that other entities adapt to being interacted with
by AI systems in ways that are not always foreseeable. Thus, new methods and approaches for benchmarking
complex and adaptive systems can be proposed.
5.4 Limitations in benchmarking AI systems
While benchmarking is already a challenging activity for simpler systems, involving multiple facets of data,
processes and measurements, benchmarking AI systems poses novel challenges that are to be addressed
with care. In particular, it is challenging to benchmark AI systems due to the following factors:
— AI systems are applied in a variety of sectors and contexts of use, each with different sources of risk and uncertainty. Benchmarking such systems can either be adaptive to these differences, or be sufficiently comprehensive to address all of them. For example, object detection models are frequently benchmarked using mean average precision (mAP) across object classes, described for instance in IEEE 2937[33]. However, there are several contexts of use (e.g. object recognition for driverless vehicles) wherein misidentification of some classes of objects (e.g. pedestrians at risk) is of greater importance than that of other objects (e.g. street signs). In such contexts, mAP can exaggerate the functional suitability of the system, since low-importance classes are more likely to be natively encountered in the data environment than high-importance classes.
— Designers report AI system quality characteristics using a variety of metrics which makes comparisons
across metrics infeasible. For example, classification models for healthcare frequently report functional
correctness using an F1 score or the area under an ROC curve. However, such measures assume the
availability of plentiful clinical resources to act upon model predictions. In reality, clinicians are only
able to act upon a limited number of inputs from such classification models, thus favouring evaluation
metrics drawing upon the recommender systems literature, such as mean reciprocal rank, top-k
precision, etc. These metrics are mutually incommensurable, making it difficult to assess the true value
of such systems in use[44].
— Reference datasets used for benchmarking can contain noise, imbalance, and bias in unknown quantities.
Quality characteristic measurements inherit these problems in the form of fragility, inaccuracy and
algorithmic bias respectively. Examples of AI algorithms perpetuating societal and demographic biases
abound in the academic literature, and a vast literature on fairness in machine learning has emerged in
response to this problem[45].
— Evaluating very large models requires specialized techniques and infrastructure, which are not equally accessible under resource constraints. Particularly for large language models, the computational and energy requirements necessary to train models are very large, and inaccessible to most public institutions[46], with the evaluations of such models also not homogeneous[47][48].
— Many AI applications involve interaction with humans, and the nature of this interaction changes reflexively as humans adapt to the use of the system. For example, automation of actions in cockpits has been shown to be associated with atrophy of flying skills in human pilots[49], and similar deficits are anticipated in the use case of driverless cars[50]. Benchmarking AI systems in such contexts is predicated on careful consideration of human factors and user experience, which adds considerable complexity to any possible evaluation.
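The first limitation above can be made concrete with a small, non-normative sketch. The per-class average precision (AP) values and importance weights below are invented for illustration; the sketch shows how an unweighted mAP can mask poor performance on a high-importance class such as pedestrian detection.

```python
# Non-normative sketch: unweighted vs importance-weighted averaging of
# per-class average precision (AP). Class names, AP values and importance
# weights are hypothetical, invented for this illustration.

def mean_ap(ap_by_class):
    """Plain mAP: the unweighted mean of per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)

def weighted_ap(ap_by_class, weights):
    """Importance-weighted mean of per-class AP values."""
    total = sum(weights[c] for c in ap_by_class)
    return sum(ap_by_class[c] * weights[c] for c in ap_by_class) / total

ap = {"street_sign": 0.95, "vehicle": 0.90, "pedestrian": 0.40}
importance = {"street_sign": 1.0, "vehicle": 2.0, "pedestrian": 10.0}

print(round(mean_ap(ap), 3))                  # 0.75: plain mAP looks acceptable
print(round(weighted_ap(ap, importance), 3))  # 0.519: weighting exposes the weakness
```

Here the plain mAP of 0.75 hides an AP of only 0.40 on the hypothetical high-importance pedestrian class, whereas the weighted average reflects it.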
These inter-related problems are caused by the fact that modern AI systems are developed using very large datasets and very large models, with downstream sociotechnical considerations not clearly known at the time of system benchmarking. While it is possible to develop comprehensive benchmarking references that accommodate the large scale and complexity of AI systems in use, the application of such standards is predicated on high levels of expertise and resource allocation.
Alternatively, it is possible to conceive of approaches for differentiated benchmarking of AI systems, such that quality characteristics of systems are benchmarked at different levels, and with different degrees of standardization, adaptive to sociotechnical considerations of where such a system lies on a spectrum of potential for harm. In this way, the conformity burden can scale rationally with the harm potential of AI systems, simultaneously enabling innovation while maintaining safety.
6 Frameworks for differentiated benchmarking
6.1 AI management frameworks
AI management frameworks can potentially be used for differentiated benchmarking. For example, ISO/IEC 42001[51] specifies requirements for establishing, maintaining and improving AI management systems within organizations.
The AI Risk Management Framework (AI RMF)[52] is another representative example. This framework utilizes a descriptive methodology, offering flexibility in implementation. It focuses on assessing hazards, exposures, and vulnerabilities associated with AI systems, allowing organizations to manage risks effectively across various use cases and sectors.
ALTAI[53] is a procedural framework that can be used as a tool for management frameworks. It attempts to cover all principles and stages of AI implementation, and attempts to offer region-agnostic and sector-agnostic perspectives. ALTAI follows a procedural approach, emphasizing trustworthiness in AI systems. It assesses hazards, exposures, and vulnerabilities to ensure the ethical and trustworthy deployment of AI technologies.
The Algorithm Impact Assessment Tool (AIA)[54] addresses planning, requirements analysis, design, and testing stages. The framework takes a procedural approach, ensuring that AI implementations are region-agnostic and sector-agnostic. AIA assesses hazards, exposures, and vulnerabilities without specifying particular domains.
IEEE 7010[55] follows a descriptive approach. While specific focus areas are not explicitly mentioned, the practices are designed to be region-agnostic and sector-agnostic. The framework provides guidance on assessing hazards, exposures, and vulnerabilities associated with autonomous and intelligent systems, emphasizing their impact on human well-being.
6.2 Classification-based frameworks
Classification is a very common form of standardization activity[56]. The standardization of IT systems, in particular, seems to lend itself well to classification-based frameworks, as is evidenced by the subcategories of NIST's cybersecurity framework, which enable an organization to standardize the processes relevant for its specific needs[57]. The judgment of relevance provides the source of differentiation in the standardization process in such frameworks, with the most common frame of relevance judgment being risk or impact assessment.
© ISO/IEC 2025 – All rights reserved
A number of frameworks for the risk or impact assessment of AI systems already exist. Some of these
frameworks use risk-based classification to differentiate benchmarking treatment for various AI products.
Several such frameworks are reviewed below.
The German Data Ethics Commission has created a guidance document[58] describing five criticality classes
for AI systems, based on their potential for harm to, for example, physical and psychological well-being,
finances and data, as well as through manipulation of information and negative forms of nudging. Based on
this guidance, regulation classes for AI systems can vary depending on the jurisdiction and the specific
regulations in place. The document describes these five regulation classes with corresponding duties for
responsible parties, such as providers and manufacturers, as well as the concerns that justify the
placement of an AI system into a specific class.
a) Class 1: No or minimal potential for harm:
Duties: correctness checks, transparency, system analyses in cases of suspicion.
Concerns: potential for unexpected or unintended consequences.
b) Class 2: Low risk:
Duties: risk assessment, transparency obligations, and basic safety standards.
Concerns: undue risks to individuals or society.
c) Class 3: Moderate risk:
Duties: oversight, risk assessments, third-party audits, and adherence to specific industry standards.
Concerns: harm to individuals, privacy violations, diffusion of accountability, fairness in AI decision-
making.
d) Class 4: High risk:
Duties: thorough risk assessments, continuous monitoring, robust fail-safes, independent audits,
conformity with strict safety and security standards, and regular reporting to regulatory authorities.
Concerns: significant harm to individuals, society, or critical infrastructure, as well as negative ethical,
legal, and social implications, including discrimination, bias, and lack of transparency.
e) Class 5: Forbidden:
Duties: withdrawal of the product from the market by supervisory authorities.
Concerns: extreme potential for harm, including threats to human life, national security, or global
stability. Immediate detection of products in this class, as well as clarity in accountability, are
indispensable.
The German Data Ethics Commission framework can be adapted in several ways, for instance by making
higher-risk criticality classes additionally subject to the duties of lower-risk classes, or by otherwise
adjusting specific duties and tying the terms "transparency" and "system analyses" to specific definitions.
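The cumulative adaptation mentioned above, in which a higher criticality class also inherits the duties of all lower classes, can be sketched as a simple lookup. The duty lists below paraphrase the five classes described in this clause; the data structure and function names are illustrative assumptions, not part of the framework itself.

```python
# Illustrative sketch of a cumulative duty lookup, paraphrasing the five
# German Data Ethics Commission criticality classes. The structure and
# names are assumptions for illustration only.

DUTIES_BY_CLASS = {
    1: ["correctness checks", "transparency", "system analyses in cases of suspicion"],
    2: ["risk assessment", "transparency obligations", "basic safety standards"],
    3: ["oversight", "third-party audits", "adherence to industry standards"],
    4: ["continuous monitoring", "robust fail-safes", "independent audits",
        "regular reporting to regulatory authorities"],
    5: ["withdrawal of the product from the market"],
}

def cumulative_duties(criticality_class: int) -> list[str]:
    """Return the duties for a class, including those of all lower classes."""
    if criticality_class not in DUTIES_BY_CLASS:
        raise ValueError(f"unknown criticality class: {criticality_class}")
    duties: list[str] = []
    for level in range(1, criticality_class + 1):
        duties.extend(DUTIES_BY_CLASS[level])
    return duties
```

For example, `cumulative_duties(3)` returns the Class 1 and Class 2 duties followed by the Class 3 duties, reflecting the "higher classes inherit lower-class duties" adaptation.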
The EU AI Act also adopts a risk-based classification approach, categorizing AI systems into different risk
levels based on their potential impact on rights, safety and societal values[59]. High-risk systems are expected
to be subject to stricter requirements and oversight, and providers of high-risk systems are expected to meet
additional requirements related to data quality, documentation, transparency and traceability throughout
the AI system's life cycle. The Act also provides for conformity assessment to verify conformity with the
requirements set forth in the regulation.
The Automated Decision-Making Systems in the Public Sector: An Impact Assessment Tool for Public
Authorities is designed for public authorities in Germany[60]. It provides guidelines for assessing hazards,
exposures, and vulnerabilities associated with AI implementations in the public sector. This framework
follows a tiered procedural approach, wherein systems that exceed a set threshold score in a first-level
checklist during evaluation are taken through a more extensive set of controls than systems that do not.
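The tiered procedure described above can be sketched as a two-stage evaluation: a first-level checklist produces a score, and only systems whose score exceeds a set threshold are routed through the extended control set. The checklist items, threshold value, and function names below are hypothetical placeholders, not content of the actual tool.

```python
# Hypothetical sketch of a tiered impact assessment: stage 1 scores a
# first-level checklist; stage 2 (extended controls) applies only when
# the score exceeds a set threshold. All items and values are invented.

FIRST_LEVEL_CHECKLIST = [
    "affects legal rights of individuals",
    "processes sensitive personal data",
    "operates without human review",
]

THRESHOLD = 1  # hypothetical: more than one 'yes' answer triggers extended controls

def first_level_score(answers: dict[str, bool]) -> int:
    """Count 'yes' answers to the first-level checklist."""
    return sum(1 for item in FIRST_LEVEL_CHECKLIST if answers.get(item, False))

def required_controls(answers: dict[str, bool]) -> str:
    """Route the system to basic or extended controls based on its score."""
    if first_level_score(answers) > THRESHOLD:
        return "extended controls"
    return "basic controls"
```

The design point is that differentiation happens once, at the threshold: systems below it carry only the basic burden, which is what makes the approach "tiered" rather than uniform.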
The Artificial Intelligence Impact Assessment framework provides guidelines for assessing the impact of
AI technologies[61]. The framework is designed to be region-agnostic and sector-agnostic. It also follows a
tiered procedural approach, offering organizations guidance on assessing the hazards, exposures,
vulnerabilities, and mitigation risks associated with AI implementations based on the perceived criticality
of the deployment.
The Model Rules on Impact Assessment of Algorithmic Decision-Making Systems[62] are designed to be
region-agnostic and applicable to the public sector. This framework also follows a tiered procedural approach,
guiding organizations in conducting impact assessments of the hazards, exposures, and vulnerabilities
associated with algorithmic decision-making systems based on the perceived criticality of the deployment.
Classification-based approaches to risk and impact assessment have the advantage of being commonly
known, easily reproducible, easily documentable, and intuitive to work with for regulatory and governance
bodies. However, there are also several limitations to such approaches.
For example, risk-matrix-based approaches to risk assessment presuppose a utilitarian view of risk, such that
there exists an implicit acceptance of severe risks provided the likelihood of such risks is acceptably low[63].
It is, however, well known that people systematically underestimate the probability of unlikely events in
decisions they make from experience[64]. Therefore, analysts' likelihood estimates inevitably understate
expected risk when risk-matrix approaches are applied in the determination of risk[65].
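The utilitarian view criticized here treats risk as expected harm, the product of likelihood and severity, so a systematic underestimate of the likelihood of a rare event propagates directly into an understated risk score. A minimal numeric illustration, with all values hypothetical:

```python
# Expected-risk view underlying risk matrices: risk = likelihood x severity.
# A rare, severe hazard, with purely hypothetical values:
true_likelihood = 0.01        # actual probability of the hazard occurring
estimated_likelihood = 0.001  # analyst's experience-based underestimate
severity = 1000.0             # harm magnitude in arbitrary units

true_risk = true_likelihood * severity            # 10.0
estimated_risk = estimated_likelihood * severity  # 1.0

# A tenfold underestimate of likelihood understates expected risk tenfold,
# which can move the hazard into a lower cell of the risk matrix.
assert estimated_risk < true_risk
```

Because the matrix cell (and hence the prescribed treatment) is a function of the product, any bias in the likelihood axis translates one-for-one into a bias in the assigned risk class.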
Additionally, classification is inherently a unidimensional approach to standardization, which is reasonable
in cases where the dimension along which risk or impact is expected to vary is clearly understood, but not
in cases where the dimensionality of risk variability itself is complex and multi-dimensional. Risk-based
classification of AI systems fundamentally inherits this defect.
6.3 Levels of specification
Risk-based classification schemes can guide entities towards specific actions appropriate for a relevant
context-of-use, but such schemes are based on a relatively unidimensional evaluation of context and risk.
The complex sociotechnical nature of AI system deployments can be more effectively addressed by adapting
generalized classification-based approaches, such as the use of levels of specification.
Software specifications offer guidance for ensuring that programs actually do implement the logic they
are expected to implement. These specifications can vary in their level of detail, with standard modules
specified mostly as flowcharts, and critical subunits specified with additional information about data ranges,
exception possibilities, etc. Repurposing this software idiom for the task of benchmarking AI systems, levels
of specification can be created to specify the set of benchmarking actions appropriate for AI systems with
different potentials for harm.
[Figure 1 depicts levels mapping quality characteristics to applicable standard clauses, e.g. "Standard ISO XXX Clause 5.17 is applicable" and "Standard ISO XXY Clause 8.9 is applicable".]
Figure 1 — Schematic illustration of levels of specification
A levels-based approach to benchmarking, as illustrated in Figure 1, can specify the set of benchmarks or
benchmarking procedures helpful for establishing quality characteristics for AI systems with a particular
level of harm potential. The levels approach is therefore a generalization of the classification-based
approach, seeking to map AI systems with different harm potentials to benchmarks targeted at an
enumerated set of quality characteristics.
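The mapping from harm potential to an enumerated set of quality characteristics and applicable benchmarking actions, shown schematically in Figure 1, can be sketched as a lookup table. The level names, characteristics, and actions below are placeholders for illustration, not actual requirements.

```python
# Illustrative sketch of levels of specification: each level of harm
# potential maps to quality characteristics and the benchmarking actions
# applicable to them. All entries are invented placeholders.

LEVELS = {
    "low": {
        "functional correctness": ["accuracy on reference dataset"],
    },
    "moderate": {
        "functional correctness": ["accuracy on reference dataset"],
        "robustness": ["perturbation testing"],
    },
    "high": {
        "functional correctness": ["accuracy on reference dataset",
                                   "third-party replication"],
        "robustness": ["perturbation testing", "adversarial testing"],
        "transparency": ["documentation audit"],
    },
}

def benchmarking_actions(level: str) -> dict[str, list[str]]:
    """Return the benchmark set specified for a given harm-potential level."""
    return LEVELS[level]
```

Unlike a flat risk class, each level carries a multi-characteristic specification, which is what lets the approach address more than one dimension of risk at a time.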
The construction of AI specification levels is based on:
a) designing property-action matrices,
...
ISO/IEC JTC 1/SC 42
ISO/IEC CD TR 42106(en)
Secretariat: ANSI
Date: 2025-12-03
Information technology — Artificial intelligence (AI) — Overview of
differentiated benchmarking of AI system quality characteristics
DTR stage
ISO/IEC DTR 42106:2025(en)
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication
may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying,
or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO
at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: + 41 22 749 01 11
E-mail: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword . v
Introduction . vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Overview of relevant benchmarking methods . 2
4.1 Review of benchmarking definitions . 2
4.2 Types of benchmarking . 3
4.3 Metrics, measures and criteria . 4
5 Benchmarking AI systems . 5
5.1 Benchmarking AI system quality . 5
5.2 Context of use . 6
5.3 Complex adaptive systems . 6
5.4 Limitations in benchmarking AI systems . 7
6 Frameworks for differentiated benchmarking . 8
6.1 AI Management frameworks . 8
6.2 Classification-based frameworks . 9
6.3 Levels of specification . 11
7 Feasibility analysis . 13
7.1 Case Example 1: On-job training recommendation system . 13
7.2 Case Example 2: User intent recognition . 14
7.3 Case Example 3: Generation of clinical pathways . 15
Annexes . 16
Annex A (informative) Definitions of benchmarking . 17
Annex B (informative) Sample levels of specification . 19
Annex C (informative) Descriptions of measures . 1
Bibliography . 3
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members
of ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of
document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC
Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the use of
(a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any claimed
patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not received
notice of (a) patent(s) which may be required to implement this document. However, implementers are
cautioned that this may not represent the latest information, which may be obtained from the patent database
available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held responsible for
identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 42, Artificial intelligence.
Any feedback or questions on this document should be directed to the user's national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-committees.
Introduction
Artificial intelligence (AI) systems are diverse in nature and heterogeneous in terms of their potential
impact on consumers and third parties. For example, a company's use of disposable income estimates from a
linear regression model to serve ads targeted at different socio-economic profiles reflects a system with low
complexity and low potential impact. Conversely, a bank's use of disposable income estimates from a large
neural network model to make housing loan decisions reflects a system with both high complexity and high impact.
Benchmarking is often used to compare quality characteristics of software systems against reference or target
values. Given the diverse nature and heterogeneous impact of AI systems, such reference or target values can
differ widely across systems, and across deployments of similar systems in different contexts of use. To
accommodate this diversity and heterogeneity, differentiated benchmarking can target different quality
characteristics at different reference values. This document reviews AI management frameworks for their
ability to offer guidance for differentiated benchmarking of AI system quality characteristics.
By evaluating frameworks for specifying differing levels of benchmarks and benchmarking, commensurate
with expected social impact of AI systems, this document identifies gaps in current frameworks that, when
filled, can yield guidance for differentiated benchmarking of AI systems, which can help rationalize the
standardization implementation effort of AI providers, while maintaining system trustworthiness for AI
customers and AI partners.
Information technology — Artificial intelligence (AI) — Overview of
differentiated benchmarking of AI system quality characteristics
1 Scope
This document provides an overview of conceptual frameworks for graded benchmarking of artificial
intelligence (AI) system quality characteristics. The aim is to examine the feasibility of using differentiated
benchmarking of quality characteristics based on the complexity and context of use of an AI system.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
EN ISO/IEC 22989, Information technology — Artificial intelligence — Artificial intelligence concepts and
terminology (ISO/IEC 22989:2022)
EN ISO/IEC 23053, Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML) (ISO/IEC
23053:2022)
EN ISO/IEC 25059, Software engineering — Systems and software Quality Requirements and Evaluation
(SQuaRE) — Quality model for AI systems (ISO/IEC 25059:2023)
3 Terms and definitions
For the purposes of this document, the terms and definitions given in EN ISO/IEC 22989, EN ISO/IEC 23053,
EN ISO/IEC 25059 and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— IEC Electropedia: available at https://www.electropedia.org/
— ISO Online browsing platform: available at https://www.iso.org/obp
3.1
benchmark
reference point against which comparisons can be made
Note 1 to entry: For AI system benchmarking, an AI system quality characteristic (ISO/IEC 25059) is the
object of comparison.
[SOURCE: ISO/IEC 29155-1, 3.2, modified — Note 1 to entry has been replaced.]
3.2
benchmarking
activity of comparing objects of interest to each other or against a benchmark (3.1) to evaluate
characteristic(s)
Note 1 to entry: For AI system benchmarking, the object of interest is an AI system quality characteristic
(ISO/IEC 25059).
[SOURCE: ISO/IEC 29155-1, 3.3, modified — Note 1 to entry has been replaced.]
4 Overview of relevant benchmarking methods
4.1 Review of benchmarking definitions
When searching for "benchmarking" in the ISO, IEC and ITU terminology databases, there were 76 results
from the ISO OBP and none from the IEC and ITU-T databases. After deleting irrelevant terms and definitions
and merging identical definitions, 14 definitions were collected (see Table A.1 in Annex A).
— ISO Online browsing platform (OBP): available at https://www.iso.org/obp
— IEC Electropedia (IEV): available at https://www.electropedia.org/
— ITU-T Terms and Definitions: available at https://www.itu.int/br_tsb_terms/#/
These 14 definitions include several instances that define "benchmark" and "benchmarking" as a pair, with
the definition of benchmarking relying on the paired definition of benchmark (e.g. the pairs {D3, D5}, {D13,
D14}, {D4, D2}). A clustered view of the objects of interest for each of these definitions of
benchmark/benchmarking is given in Table 1.
Table 1 — Clustered objects and characteristics relevant for benchmark and benchmarking
Cluster 1
Object: a reference point/tool/method (metric against; any standard or reference; point of fixed location;
permanent mark) (benchmark)
Characteristics: comparisons can be made; process, performance or quality can be measured; other
characteristics can be measured
Sources: ISO 41011, 3.8.5; ISO 14031, 3.4.8; ISO/IEC 29155-1, 2.1 and 3.2; ISO 17258, 3.1; ISO 14050,
3.2.15; ISO 21678, 3.2; ISO 21931-1, 3.2.16; ISO/IEC/IEEE 24765, 3.362; ISO/TS 18667, 3.1.1;
ISO 20468-1, 3.1.2; ISO 13053-2, 2.1; ISO/IEC 25040
Cluster 2
Object: activity of comparing, evaluating and analysing (activity of comparing or evaluating; comparative
evaluation or analysis; activity of measurement and analysis)
Characteristics: objects of interest compared to each other or against a benchmark or characteristic;
similar operational practices that an organization can use to search for and compare practices inside and
outside the organization, with the aim of improving its performance
Sources: ISO/IEC 29155-1, 3.2 and 3.3; ISO/IEC 18520, 3.1.1; ISO/TR 24514, 3.1; ISO 14644-16, 3.3.1;
ISO 10010, 3.4; ISO 10014, 3.8; ISO 30400, 3.1.18 and 3.17; ISO 32210, 3.34; ISO/IEC/IEEE 24765, 3.362
and 3.363; ISO/IEC TS 25058
Cluster 3
Object: process of comparing processes, performances or quality against practices
Characteristics: of the same nature, under the same circumstances and with similar measures
Sources: ISO 41011, 3.8.5.1
Cluster 4
Object: single value (benchmark)
Characteristics: used for orientation
Sources: ISO 17258, 3.2; ISO 24523, 3.2; ISO 24513, 3.7.1.1.2; ISO/TR 24514, 3.1
Core concepts about "benchmarking" are reflected in the repetition of words and phrases across these 14
definitions. Among these, "comparisons" (10 times), "performances" (7 times), "can be measured" (6 times),
and "practices" (5 times) are the most frequently used core concepts.
The terms "process" (4 times), "organization" (4 times), "a reference point" (3 times), "metric against" (3
times), "evaluate" (3 times), and "standard" (3 times) also contribute to an understanding of the basic
concept of "benchmarking".
4.2 Types of benchmarking
From the review of uses of benchmarking in the standardization literature, it is evident that there are
primarily two types of benchmarking in existing definitions in ISO deliverables: benchmarking can be
intended in terms of an activity and components (ISO/IEC 29155-1) or of processes (ISO 41011) as objects
of benchmarking.
With regards to benchmarking activity and components, the focus lies on comparing objects of interest to
each other or against a benchmark to evaluate characteristics (ISO/IEC 29155-1). Such activities have
characteristics of similar operational practice, similar attributes, processes or performance that are
comparable. The benchmark refers to a reference point against which comparisons can be made. For the
relevant stakeholder, a reference point can be any of the following:
— a tool for performance improvement through systematic search and adaptation of leading practice;
— a standard against which results can be measured or evaluated;
— a method for comparing the performance of organizations in a market segment;
— a test procedure that can be used to compare systems or components to each other or to a standard.
ISO/IEC 29155-1 relies on the ISO/IEC 25000 series for product quality evaluation, and in this perspective
ISO/IEC 25040 defines the context for using benchmarks as quality criteria.
With regards to benchmarking processes, the focus lies on comparing any combination of processes,
performances and quality against practices of the same nature, under the same circumstances and with
similar measures. Its special considerations are the systematic process for identifying, becoming
acquainted with and adopting the successful practices of benchmarking partners. This concept is used in
the domain of facility management (ISO 41011).
Within this document, the concept of benchmarking is used to focus on the activity of comparing objects of
interest against a benchmark to evaluate characteristics, and the concept of benchmark to focus on a
reference point against which comparisons can be made. Such concepts are used widely in the information
technology project performance benchmarking framework of systems and software engineering, as defined
in ISO/IEC 25040 and ISO/IEC TS 25058.
Therefore, the definitions of benchmark and benchmarking given in this document are adapted from
ISO/IEC 29155-1, 3.2 and ISO/IEC 29155-1, 3.3 respectively, reflecting the emphasis on product
benchmarking most clearly.
4.3 Metrics, measures and criteria
AI system functional correctness is measured using a vast array of quantitative metrics. Additionally, a
number of measures, such as loss functions, are relevant for measuring functional reliability during model
training, but not during actual system usage. In addition, some criteria are used for model size
determination, model selection, model training time, and so forth, but are not directly reported as
functional correctness indices.
This document describes metrics, measures and criteria used for common tasks performed by AI systems.
The given list is not comprehensive, but is intended to provide a useful overview of tasks and their
corresponding measures.
Table 2 — Metrics and measures for common AI system tasks
Task: Classification (ISO/IEC TS 4213[7])
Measures: accuracy, cross validation, precision, recall, confusion matrix, ROC (receiver operating
characteristic) curve
Task: Regression[27]
Measures: root mean squared error (RMSE)
Task: Prediction ranking[27]
Measures: Spearman's rank correlation
Task: Localisation (bounding box around an object)[28]
Measures: intersection over union (IoU)
Task: Object detection[29]
Measures: mean average precision (mAP)
Task: Image semantic segmentation[29]
Measures: mean intersection over union (mIoU)
Task: Time-series forecasting[30]
Measures: MSE, MAE, MAPE (percentage error), mean absolute scaled error (MASE)
Task: POS (part-of-speech) tagging[31]
Measures: accuracy
Task: Named entity recognition[31]
Measures: precision, recall, F1 score
Task: Dependency parsing[31]
Measures: labelled attachment score (LAS), unlabelled attachment score (UAS), label accuracy score (LS)
Task: Information retrieval[31]
Measures: PR curve, interpolated precision, mean average precision
Task: Summarisation[32]
Measures: ROUGE (F1 score from the n-gram precision and recall)
Mathematical descriptions of measures used in Table 2 are given in Annex C.
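Two of the measures in Table 2 can be illustrated directly from their textbook definitions: precision, recall and F1 for classification-style tasks, and intersection over union (IoU) for bounding-box localisation. The sketch below is illustrative only, not a normative computation procedure.

```python
# Precision, recall and F1 from counts of true positives (tp), false
# positives (fp) and false negatives (fn), plus IoU for axis-aligned
# bounding boxes given as (x1, y1, x2, y2).

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions; zero denominators yield 0.0 by convention."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

For example, `precision_recall_f1(8, 2, 2)` yields precision, recall and F1 of 0.8 each, and two unit-overlap boxes such as `(0, 0, 2, 2)` and `(1, 1, 3, 3)` have an IoU of 1/7.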
Additionally, several criteria are used for determining the efficacy of model training procedures, without a
direct correlation with the functional correctness of the model. Some such criteria include:
— model size,
— training time,
— convergence rate.
It is notable that, whereas AI system quality encompasses multiple characteristics and sub-characteristics,
as described in ISO/IEC 25059, existing metrics are mainly relevant to the measurement of functional
suitability, and measures of several other important characteristics, such as reliability, maintainability,
usability and security, appear not to be addressed. The consequences of this imbalance are reviewed
further in 5.4.
5 Benchmarking AI systems
5.1 Benchmarking AI system quality
Benchmarking the quality characteristics of AI systems is crucial for several reasons. Firstly, it allows
objective measurement and comparison of the quality characteristics of different AI models, providing
valuable insights into their strengths and weaknesses. By benchmarking quality characteristics such as
accuracy, efficiency, reliability, and robustness, stakeholders can identify areas for improvement and
innovation, driving advancements in AI technology. Additionally, benchmarking facilitates standardization
and transparency within the AI ecosystem, enabling stakeholders to make informed decisions about which
models are most suitable for their specific needs. Furthermore, benchmarking helps to establish reference
points against which future AI systems can be evaluated, fostering a continuous cycle of improvement and
innovation. Benchmarks can also be used in AI management system controls.
Several methods exist for benchmarking AI systems, each tailored to measure specific quality
characteristics. IEEE 2937[14] provides formalized methods for benchmarking hardware-related metrics of
AI server systems, emphasizing the measurement of training time, power consumption and inference
latency. To measure the functional correctness and suitability of AI systems, the general approach is the
use of reference datasets and evaluation metrics, where AI models are tested on established reference
datasets, such as ImageNet[34] for image classification or MNIST[35] for handwritten digit recognition.
These datasets come with predefined training and testing splits, enabling consistent evaluation across
different methods. Another method involves organizing competitions and challenges, such as the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC)[36] or the Common Objects in Context (COCO)
challenge[37], where researchers and developers submit their AI models to compete against each other on
specific tasks. These competitions provide a platform for rigorous evaluation and comparison of AI systems
in diverse scenarios. Most relevant for this document, organizations like the National Institute of Standards
and Technology (NIST), with the FRVT framework[38], and the AI Benchmarking Initiative (AIBench)[39]
have developed reference methodologies and benchmarks for evaluating AI systems in specific domains,
promoting transparency and reproducibility in AI research.
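The reference-dataset approach described above amounts to evaluating every candidate model on the same held-out test set with the same metric, so that scores are directly comparable. A minimal sketch with a toy dataset and two hypothetical "models" (plain prediction functions), all invented for illustration:

```python
# Benchmarking sketch: a fixed held-out test set and a fixed metric
# (accuracy) applied identically to each candidate model. The dataset
# and models are toys, not real benchmark content.

TEST_SET = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]  # (feature, label) pairs

def model_a(x: float) -> int:
    """Hypothetical threshold model."""
    return 1 if x > 0.5 else 0

def model_b(x: float) -> int:
    """Hypothetical model that always predicts the positive class."""
    return 1

def accuracy(model, test_set) -> float:
    """Fraction of test examples the model labels correctly."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)

scores = {"model_a": accuracy(model_a, TEST_SET),
          "model_b": accuracy(model_b, TEST_SET)}
# model_a scores 1.0 and model_b scores 0.5; the scores are comparable
# only because the test set and metric are held fixed across models.
```

Holding the test data and metric fixed is precisely what makes the benchmark a shared reference point; it is also the source of the dataset-decay and overfitting concerns discussed below.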
While current approaches for benchmarking AI systems are valuable, they also have several limitations. An
important limitation is dataset bias, where the measured quality of AI models can be skewed due to biases
present in the training data. This can lead to overfitting to specific datasets and poor generalization to
real-world scenarios[40]. Leader-board competitions, the most common form of accuracy benchmarking for
AI systems, are particularly sensitive to dataset decay, and require careful handling to prevent overfitting
to held-out data[41]. Another challenge is the proliferation of evaluation metrics across domains, making it
difficult to compare the quality of AI models across different tasks. Reference datasets and competitions
typically focus on narrow tasks or domains, limiting the scope of evaluation and potentially overlooking
important aspects of AI systems, such as ethical considerations and societal impact. Moreover, the
reproducibility of benchmarking results can be challenging, particularly when details about model
architectures, hyperparameters, and training procedures are not adequately documented.
5.2 Context of use
Some software quality standards have historically co-evolved with reliability engineering, which in turn
historically focused on the maintenance and upkeep of mechanical systems. For such mechanical systems,
component reliability tends to correlate well with nearly all desirable quality metrics, such as functional
correctness, safety and resilience. For the most part, since software systems also have conceptually
enumerable input-output characteristics, approaches rooted in reliability engineering have translated well to
them.
However, this historical provenance of software quality standards systematically under-emphasises the role
of the context of use in the quality characteristics of software-based systems. This is a significant limitation,
as the context of use offers considerable information about the possible hazards of a system's use, which is
necessary to design appropriate requirements for the system. As Nancy Leveson observes, "system and
software requirements development are necessarily a system engineering problem, not a software
engineering problem"[42].
It is therefore helpful to consider AI systems from a sociotechnical perspective, ensuring that the degree of
quality assurance is aligned with the degree of quality expected of the system based on the context of use.
5.3 Complex adaptive systems
[24]
NIST AI 100-1:2023 NIST AI 100-1, Annex B summarizes key aspects in which risks from AI systems are
different from risks from traditional software systems. These differences include the following:
a) Data used in model training is not always representative of the context of use of the system.
b) It is possible that real ground truth data does not exist, or is not available.
c) Data distributions can drift over time, and become detached from the original context in which the system
was trained.
d) Use of pre-trained models limits controllability of data quality and bias mitigation strategies.
In addition to these risks to system correctness, multiple additional sociotechnical considerations apply for
other quality characteristics of AI systems, such as:
e) Humans interacting with AI systems can change their behaviours to work around the narrow intelligence of systems that replace human operators.
f) AI systems can be subjected to data poisoning and spoofing attacks when deployed, reducing their effectiveness.
© ISO/IEC 2025 – All rights reserved
ISO/IEC CD TR 42106:2025(en)
g) Human operators working alongside AI decision support systems can become overconfident and accept system suggestions by default.
h) Human operators working alongside AI decision support systems can mistrust and ignore AI system suggestions.
i) AI system integration into legacy IT systems can expand the cybersecurity threat envelope of the existing
system in ways that are difficult to detect with an audit of the two systems in isolation.
This list of considerations is not comprehensive, and is presented primarily to emphasize the thematic point
that AI systems are best validated with a sociotechnical systems approach, accounting for the fact that they
interact with users and third parties in complex ways, and that other entities adapt to being interacted with
by AI systems in ways that are not always foreseeable. Thus, new methods and approaches for benchmarking
complex and adaptive systems can be proposed.
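To illustrate item c) above, distribution drift between training-time and in-service data can be monitored with simple statistics. The following is a minimal, non-normative sketch using the population stability index (PSI); the bin fractions and the 0.2 alert threshold are illustrative assumptions, not requirements of this document.

```python
# Non-normative sketch of item c): comparing a feature's training-time
# histogram with its in-service histogram using the population
# stability index (PSI). Bin fractions and the 0.2 alert threshold
# are illustrative assumptions.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population stability index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)  # guard against log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

train_bins = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_bins = [0.10, 0.20, 0.30, 0.40]   # same feature, observed in service

drift = psi(train_bins, live_bins)
print(f"PSI = {drift:.3f}; drift suspected: {drift > 0.2}")
```

A PSI of zero indicates identical binned distributions; in practice, organizations choose their own bin schemes and alert thresholds per feature and context of use.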
5.4 Limitations in benchmarking AI systems
While benchmarking is already a challenging activity for simpler systems, involving multiple facets of data,
processes and measurements, benchmarking AI systems poses novel challenges that need to be addressed
with care. In particular, it is challenging to benchmark AI systems due to the following factors:
— AI systems are applied in a variety of sectors and contexts of use, each with different sources of risk and
uncertainty. Benchmarking such systems can either be adaptive to these differences, or be sufficiently
comprehensive to address all of them. For example, object detection models are frequently benchmarked
using mean average precision (mAP) across object classes, described for instance in IEEE 2937[14]. However, there are several contexts of use (e.g. object recognition for driverless vehicles) wherein misidentification of some classes of objects (e.g. pedestrians at risk) is of greater importance than misidentification of others (e.g. street signs). In such contexts, mAP can exaggerate the functional suitability of the system, since low-importance classes are more likely to be encountered natively in the data environment than high-importance classes.
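This concern can be made concrete with an importance-weighted variant of mAP. The following is a minimal, non-normative sketch; the per-class average precision (AP) values and importance weights are invented for illustration.

```python
# Non-normative sketch: plain mAP weights every object class equally,
# whereas a context-weighted variant lets high-importance classes
# (e.g. pedestrians) dominate the aggregate score. All AP values and
# importance weights below are invented for illustration.

def mean_average_precision(ap_by_class):
    """Unweighted mAP: simple mean of per-class average precision."""
    return sum(ap_by_class.values()) / len(ap_by_class)

def weighted_map(ap_by_class, importance):
    """mAP with per-class weights reflecting the context of use."""
    total = sum(importance[c] for c in ap_by_class)
    return sum(ap_by_class[c] * importance[c] for c in ap_by_class) / total

ap = {"pedestrian": 0.60, "cyclist": 0.70, "street_sign": 0.95, "car": 0.90}
weights = {"pedestrian": 10.0, "cyclist": 8.0, "street_sign": 1.0, "car": 3.0}

# The unweighted score flatters the system relative to the weighted one,
# because the easy, low-importance classes pull the mean upwards.
print(f"mAP = {mean_average_precision(ap):.3f}")
print(f"weighted mAP = {weighted_map(ap, weights):.3f}")
```

In this invented example the unweighted mAP exceeds the importance-weighted score, mirroring the exaggeration effect described above.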
— Designers report AI system quality characteristics using a variety of metrics, which makes comparisons across metrics infeasible. For example, classification models for healthcare frequently report functional
correctness using an F1 score or the area under an ROC curve. However, such measures assume the
availability of plentiful clinical resources to act upon model predictions. In reality, clinicians are only able
to act upon a limited number of inputs from such classification models, thus favouring evaluation metrics
drawing upon the recommender systems literature, such as mean reciprocal rank, top-k precision, etc.
These metrics are mutually incommensurable, making it difficult to assess the true value of such systems
in use[25].
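As a non-normative illustration of the rank-aware metrics mentioned above, top-k precision and mean reciprocal rank can be computed as follows; the relevance labels are invented, with 1 marking a case that genuinely warrants clinical follow-up.

```python
# Non-normative sketch: rank-aware metrics for a clinical worklist
# setting where only the top-k model outputs can be acted upon.
# Relevance labels are invented (1 = case warranting follow-up).

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are truly relevant."""
    return sum(ranked_relevance[:k]) / k

def mean_reciprocal_rank(rankings):
    """Mean of 1/rank of the first relevant item in each ranked list."""
    total = 0.0
    for ranked in rankings:
        for i, rel in enumerate(ranked, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

# One ranked list of model outputs per query.
worklists = [
    [0, 1, 1, 0, 0],  # first relevant case at rank 2
    [1, 0, 0, 0, 1],  # first relevant case at rank 1
    [0, 0, 0, 1, 0],  # first relevant case at rank 4
]

print(f"P@3 for first list = {precision_at_k(worklists[0], 3):.3f}")
print(f"MRR = {mean_reciprocal_rank(worklists):.3f}")
```

Such rank-aware scores answer a different question than F1 or the area under an ROC curve, which is precisely why results reported under the two families of metrics cannot be directly compared.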
— Reference datasets used for benchmarking can contain noise, imbalance, and bias in unknown quantities.
Quality characteristic measurements inherit these problems in the form of fragility, inaccuracy and
algorithmic bias respectively. Examples of AI algorithms perpetuating societal and demographic biases
abound in the academic literature, and a vast literature on fairness in machine learning has emerged in
response to this problem[26].
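One simple quantity from the fairness-in-machine-learning literature referred to above is the demographic parity difference. The sketch below is non-normative; the group labels and predictions are invented for illustration.

```python
# Non-normative sketch of one common fairness measurement: the
# demographic parity difference, i.e. the gap in positive-prediction
# rates across demographic groups. Groups and predictions are invented.

def positive_rate(predictions):
    """Fraction of instances receiving a positive prediction."""
    return sum(predictions) / len(predictions)

def demographic_parity_difference(preds_by_group):
    """Largest gap in positive-prediction rate across groups (0 = parity)."""
    rates = [positive_rate(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)

preds = {
    "group_a": [1, 1, 0, 1, 0, 1, 1, 0],  # 5/8 predicted positive
    "group_b": [0, 1, 0, 0, 1, 0, 0, 0],  # 2/8 predicted positive
}

gap = demographic_parity_difference(preds)
print(f"demographic parity difference = {gap:.3f}")
```

Because reference datasets can themselves be imbalanced or biased, such measurements characterize the benchmark as much as the system, and are therefore interpreted alongside data quality assessments.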
— Evaluating very large models requires specialized techniques and infrastructure, which are not equally
accessible under resource constraints. Particularly for large language models, the computational and
energy requirements necessary to train models are very large, and inaccessible to most public
institutions[27], with the evaluations of such models also not homogeneous[28][29].
— Many AI applications involve interaction with humans, and the nature of this interaction changes
reflexively as humans adapt to the use of the system. For example, automation of actions in cockpits has
been shown to be associated with atrophy of flying skills in human pilots[30], and similar deficits are anticipated in the use case of driverless cars[31]. Benchmarking AI systems in such contexts is predicated
on careful consideration of human factors and user experience, which adds considerable complexity to
any possible evaluation.
These inter-related problems are caused by the fact that modern AI systems are developed using very large
datasets and very large models, with downstream sociotechnical considerations not clearly known at the time
of system benchmarking. While it is possible to develop comprehensive benchmarking references that
accommodate the large scale and complexity of AI systems in use, the application of such standards is
predicated on high levels of expertise and resource allocations.
Alternatively, it is possible to conceive of approaches for differentiated benchmarking of AI systems, such that
quality characteristics of systems are benchmarked at different levels, and with different degrees of
standardization, adaptive to sociotechnical consideration of where such a system lies on a spectrum of
potential for harm. In this way, the conformity burden can rationally scale with the harm potential of AI
systems, thus simultaneously enabling innovation while maintaining safety.
6 Frameworks for differentiated benchmarking
6.1 AI management frameworks
AI management frameworks, such as management systems standards (MSS), can potentially be
used for differentiated benchmarking. An MSS is a set of requirements to help organizations implement
effective management practices. These standards provide a framework for organizations to structure their
processes, improve efficiency, and achieve specific objectives.
For example, ISO/IEC 42001[32] is the ISO/IEC MSS specifying requirements for establishing, maintaining and improving AI management systems within organizations. The target audience for this standard is organizations producing AI-based products and services.
MSS do not overlap with any other management system (quality, security, etc.); they complement them.
Moreover, an organization following the MSS can generate evidence of its responsibility and accountability
regarding its role with respect to AI systems and can, if desired, provide self-attestation, interested party
assessment or independent verification (e.g. certification) of this evidence.
ISO management system standards are based on the Deming (Plan-Do-Check-Act) cycle, which includes checks that in turn rely on measures and produce outcomes for evaluation. In this context, given that an ISO MSS does not by itself include the target values to be satisfied, benchmarks can be used as thresholds or references for AI
MS controls. Thus, to some extent MSS can provide guidance with respect to differentiated benchmarking by
pointing organizations to specific processes that require control and that can be audited in certain contexts of
use. Therefore, this clause reviews MSS and other management frameworks relevant for AI systems.
The AI Risk Management Framework (AI RMF) developed by the US National Institute of Standards and
Technology (NIST) is another representative example[33]. This framework utilizes a descriptive
methodology, offering flexibility in implementation. It focuses on assessing hazards, exposures, and
vulnerabilities associated with AI systems, allowing organizations to manage risks effectively across various
use cases and sectors.
ALTAI, developed by the European Commission's High-Level Expert Group on AI, is a procedural framework released in June 2019 and updated until July 2020 that can be used as a tool for management frameworks[34]. It attempts to cover all principles and stages of AI implementation, and offers
region-agnostic and sector-agnostic perspectives. ALTAI follows a procedural approach, emphasizing
trustworthiness in AI systems. It assesses hazards, exposures, and vulnerabilities to ensure the ethical and
trustworthy deployment of AI technologies.
The Algorithm Impact Assessment Tool (AIA) is a Canadian government initiative established in 2019 and
updated until November 2022[35]. While specific focus areas are not explicitly mentioned, AIA addresses
planning, requirements analysis, design, and testing stages. The framework takes a procedural approach,
ensuring that AI implementations are region-agnostic and sector-agnostic. AIA assesses hazards, exposures,
and vulnerabilities without specifying particular domains.
IEEE 7010[36], Recommended Practice for Assessing the Impact of Autonomous and Intelligent
Systems on Human Well-being, a standard introduced by the US Institute of Electrical and Electronics
Engineers (IEEE) in May 2020, follows a descriptive approach. While specific focus areas are not explicitly
mentioned, the practices are designed to be region-agnostic and sector-agnostic. The framework provides
guidance on assessing hazards, exposures, and vulnerabilities associated with autonomous and intelligent
systems, emphasizing their impact on human well-being.
MSS are workhorses of standardization activities. They enable organizations to address aspects in their
management processes without having to rework their internal organizational vocabularies, foregrounding
the voluntary nature of standardization. MSS also permit organizations to address all aspects of their
workflows, including qualitative elements that are difficult to address with measurements.
However, MSS, for all their salutary properties, cannot by themselves produce trustworthy products. They can
be supplemented by standards that document technical aspects of system development, testing and
benchmarking. Additionally, it can be helpful to provide guidance for which technical benchmarks apply to
which system in which context of use.
6.2 Classification-based frameworks
Classification is a very common form of standardization activity[37]. The standardization of IT systems, in
particular, seems to lend itself well to classification-based frameworks, as is evidenced by NIST's cybersecurity
framework subcategories, which enable an organization to standardize processes relevant for its specific needs[38]. The judgment of relevance provides the source of differentiation in the standardization process
in such frameworks, with the most common frame of relevance judgment being risk or impact assessment.
A number of frameworks for risk or impact assessment of AI systems already exist. Some of these frameworks use
risk-based classification to differentiate benchmarking treatment for various AI products. Some such
frameworks are reviewed below.
The German Data Ethics Commission has created a guidance document[39] describing five criticality classes for AI systems, depicting harm to, for example, physical and psychological well-being, finances and data, as well as harm through manipulation of information and negative forms of nudging. Based on this guidance, regulation classes for AI systems can vary depending on the jurisdiction and specific regulations in place. The document describes these five regulation classes with corresponding duties for responsible parties, such as providers and manufacturers, as well as concerns which justify the placement of an AI system into a specific class.
a) Class 1: No or minimal potential for harm:
Duties: correctness checks, transparency, system analyses in cases of suspicion.
Concerns: potential for unexpected or unintended consequences.
b) Class 2: Low risk:
Duties: risk assessment, transparency obligations, and basic safety standards.
Concerns: undue risks to individuals or society.
c) Class 3: Moderate risk:
Duties: oversight, risk assessments, third-party audits, and adherence to specific industry
standards.
Concerns: harm to individuals, privacy violations, diffusion in accountability, fairness in AI
decision-making.
d) Class 4: High risk:
Duties: thorough risk assessments, continuous monitoring, a
...