ISO/IEC TR 24029-1:2021
Artificial Intelligence (AI) - Assessment of the robustness of neural networks - Part 1: Overview
This document provides background about existing methods to assess the robustness of neural networks.
Intelligence artificielle (IA) — Évaluation de la robustesse des réseaux de neurones — Partie 1: Vue d'ensemble
Overview - ISO/IEC TR 24029-1:2021 (AI - Robustness of Neural Networks - Part 1: Overview)
ISO/IEC TR 24029-1:2021 is a technical report that provides a comprehensive overview of existing methods to assess the robustness of neural networks. It summarizes concepts, workflows and background knowledge rather than prescribing normative requirements. The report focuses on robustness as the ability of an AI system to maintain performance under varying input or operating circumstances and highlights the particular challenges neural networks present (non-linearity, limited explainability, domain shifts).
Key topics and technical coverage
- Robustness concept and definitions
- Clear definitions for terms used in robustness assessment (e.g., neural network, input data, testing, validation, robustness).
- Typical workflow to assess robustness
- Steps such as stating robustness goals, planning tests, defining metrics and test data, and documenting a testing protocol for stakeholders.
- Classification of assessment methods
- Three main classes are described:
- Statistical methods - performance metrics, sampling strategies and statistical measures for interpolation and classification.
- Formal methods - approaches that use proofs, solvers, optimization and abstract interpretation to guarantee properties (e.g., interpolation stability, maximum stable perturbation regions).
- Empirical methods - field trials, a posteriori testing and benchmarking under realistic conditions.
- Robustness metrics and measurement
- Overview of available statistical metrics and contrastive measures used to quantify robustness for different AI tasks.
- Supporting material
- Informative annexes on data perturbation and the principle of abstract interpretation; bibliography for further reading.
- Limitations and research context
- Notes that characterizing robustness for neural networks is an open research area and that combined approaches are commonly used.
Practical applications and target users
This report is useful for:
- AI system architects and machine learning engineers planning robustness validation and testing
- Safety and quality assurance teams integrating AI into regulated systems (automotive, aerospace, healthcare)
- Test engineers designing testing protocols, metrics and field trials for neural-network-based components
- Researchers and tool developers working on robustness metrics, formal verification tools and benchmarking suites
Practical uses include selecting appropriate methods (statistical/formal/empirical), defining robustness objectives, preparing test data/benchmarks, and documenting validation evidence.
Related standards and references
- References in the report include industry standards such as ISO 26262 (automotive functional safety) and other ISO/IEC/IEEE guidance on testing and validation (e.g., ISO/IEC/IEEE 16085, ISO/IEC 25000). These contextualize how robustness assessment integrates into system-level assurance.
Keywords: ISO/IEC TR 24029-1:2021, AI robustness assessment, neural network robustness, robustness testing, statistical methods, formal verification, empirical testing, abstract interpretation, data perturbation.
Frequently Asked Questions
ISO/IEC TR 24029-1:2021 is a technical report published jointly by ISO and IEC. Its full title is "Artificial Intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview". It provides background about existing methods to assess the robustness of neural networks.
ISO/IEC TR 24029-1:2021 is classified under the following ICS (International Classification for Standards) categories: 35.020 - Information technology (IT) in general. The ICS classification helps identify the subject area and facilitates finding related standards.
Standards Content (Sample)
TECHNICAL REPORT
ISO/IEC TR 24029-1
First edition
2021-03
Artificial Intelligence (AI) —
Assessment of the robustness of
neural networks —
Part 1:
Overview
© ISO/IEC 2021
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Overview of the existing methods to assess the robustness of neural networks
4.1 General
4.1.1 Robustness concept
4.1.2 Typical workflow to assess robustness
4.2 Classification of methods
5 Statistical methods
5.1 General
5.2 Robustness metrics available using statistical methods
5.2.1 General
5.2.2 Examples of performance measures for interpolation
5.2.3 Examples of performance measures for classification
5.2.4 Other measures
5.3 Statistical methods to measure robustness of a neural network
5.3.1 General
5.3.2 Contrastive measures
6 Formal methods
6.1 General
6.2 Robustness goal achievable using formal methods
6.2.1 General
6.2.2 Interpolation stability
6.2.3 Maximum stable space for perturbation resistance
6.3 Conduct the testing using formal methods
6.3.1 Using uncertainty analysis to prove interpolation stability
6.3.2 Using solver to prove a maximum stable space property
6.3.3 Using optimization techniques to prove a maximum stable space property
6.3.4 Using abstract interpretation to prove a maximum stable space property
7 Empirical methods
7.1 General
7.2 Field trials
7.3 A posteriori testing
7.4 Benchmarking of neural networks
Annex A (informative) Data perturbation
Annex B (informative) Principle of abstract interpretation
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that
are members of ISO or IEC participate in the development of International Standards through
technical committees established by the respective organization to deal with particular fields of
technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other
international organizations, governmental and non-governmental, in liaison with ISO and IEC, also
take part in the work.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents) or the IEC
list of patent declarations received (see patents.iec.ch).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html. In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 42, Artificial intelligence.
A list of all parts in the ISO/IEC 24029 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-committees.
Introduction
When designing an AI system, several properties are often considered desirable, such as robustness,
resiliency, reliability, accuracy, safety, security, privacy. A definition of robustness is provided in 3.6.
Robustness is a crucial property that poses new challenges in the context of AI systems, where some risks are specifically tied to robustness. Understanding these risks is essential for the adoption of AI in many contexts. This document aims at providing an overview
of the approaches available to assess these risks, with a particular focus on neural networks, which are
heavily used in industry, government and academia.
In many organizations, software validation is an essential part of putting software into production.
The objective is to ensure various properties including safety and performance of the software used
in all parts of the system. In some domains, the software validation and verification process is also
an important part of system certification. For example, in the automotive or aeronautic fields, existing
standards, such as ISO 26262 or Reference [2], require some specific actions to justify the design, the
implementation and the testing of any piece of embedded software.
The techniques used in AI systems are also subject to validation. However, common techniques used in
AI systems pose new challenges that require specific approaches in order to ensure adequate testing
and validation.
AI technologies are designed to fulfil various tasks, including interpolation/regression and classification.
While many methods exist for validating non-AI systems, they are not always directly applicable to
AI systems, and neural networks in particular. Neural network systems represent a specific challenge
as they are both hard to explain and sometimes have unexpected behaviour due to their non-linear
nature. As a result, alternative approaches are needed.
Methods are categorized into three groups: statistical methods, formal methods and empirical methods.
This document provides background on these methods to assess the robustness of neural networks.
It is noted that characterizing the robustness of neural networks is an open area of research, and there
are limitations to both testing and validation approaches.
TECHNICAL REPORT ISO/IEC TR 24029-1:2021(E)
Artificial Intelligence (AI) — Assessment of the robustness
of neural networks —
Part 1:
Overview
1 Scope
This document provides background about existing methods to assess the robustness of neural
networks.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
artificial intelligence
AI
capability of an engineered system to acquire, process and apply knowledge and skills
3.2
field trial
trial of a new system in actual situations for which it is intended (potentially with a restricted user group)
Note 1 to entry: Situation encompasses environment and process of usage.
3.3
input data
data for which a deployed machine learning model calculates a predicted output or inference
Note 1 to entry: Input data is also referred to by machine learning practitioners as out-of-sample data, new data
and production data.
3.4
neural network
neural net
NN
artificial neural network
ANN
network of primitive processing elements connected by weighted links with adjustable weights, in
which each element produces a value by applying a non-linear function to its input values, and transmits
it to other elements or presents it as an output value
Note 1 to entry: Whereas some neural networks are intended to simulate the functioning of neurons in the nervous
system, most neural networks are used in artificial intelligence as realizations of the connectionist model.
Note 2 to entry: Examples of non-linear functions are a threshold function, a sigmoid function and a polynomial
function.
[SOURCE: ISO/IEC 2382:2015, 2120625, modified — Abbreviated terms have been added under the
terms and Notes 3 to 5 to entry have been removed.]
3.5
requirement
statement which translates or expresses a need and its associated constraints and conditions
[SOURCE: ISO/IEC/IEEE 15288:2015, 4.1.37]
3.6
robustness
ability of an AI system to maintain its level of performance under any circumstances
Note 1 to entry: This document mainly describes data input circumstances such as domain change but the
definition is broader not to exclude hardware failure and other types of circumstances.
3.7
testing
activity in which a system or component is executed under specified conditions, the results are observed
or recorded, and an evaluation is made of some aspect of the system or component
[SOURCE: ISO/IEC/IEEE 26513:2017, 3.42]
3.8
test data
subset of input data (3.3) samples used to assess the generalization error of a final machine learning
(ML) model selected from a set of candidate ML models
[SOURCE: Reference [2]]
3.9
training dataset
set of samples used to fit a machine learning model
3.10
validation
confirmation, through the provision of objective evidence, that the requirements (3.5) for a specific
intended use or application have been fulfilled
[SOURCE: ISO/IEC 25000:2014, 4.41, modified — Note 1 to entry has been removed.]
3.11
validation data
subset of input data (3.3) samples used to assess the prediction error of a candidate machine
learning model
Note 1 to entry: Machine learning (ML) model validation (3.10) can be used for ML model selection.
[SOURCE: Reference [2]]
3.12
verification
confirmation, through the provision of objective evidence, that specified requirements have been
fulfilled
[SOURCE: ISO/IEC 25000:2014, 4.43, modified — Note 1 to entry has been removed.]
4 Overview of the existing methods to assess the robustness of neural networks
4.1 General
4.1.1 Robustness concept
Robustness goals aim at answering the question “To what degree is the system required to be robust?”
or “What are the robustness properties of interest?”. Robustness properties demonstrate the degree to
which the system performs with atypical data as opposed to the data expected in typical operations.
4.1.2 Typical workflow to assess robustness
This subclause explains how the robustness of neural networks is assessed for different classes of AI
applications such as classification, interpolation and other complex tasks.
There are different ways to assess the robustness of neural networks using objective information.
A typical workflow for determining neural network (or other technique) robustness is as shown in
Figure 1.
Figure 1 — Typical workflow to determine neural network robustness
[Flowchart omitted. Key: I.I.I. = incomplete, incorrect or insufficient; node shapes denote start/end, step, input/output and decision.]
Step 1: State robustness goals
The process begins with a statement of the robustness goals. During this initial step, the targets to
be tested for robustness are identified. The metrics to quantify the objects that demonstrate the
achievement of robustness are subsequently identified. This constitutes the set of decision criteria
on robustness properties that can be subject to further approval by relevant stakeholders (see
ISO/IEC/IEEE 16085:2021, 7.4.2).
Step 2: Plan testing
This step plans the tests that demonstrate robustness. The tests rely on different methods, for example:
statistical, formal or empirical methods. In practice, a combination of methods is used. Statistical
approaches usually rely on a mathematical testing process and are able to illustrate a certain level
of confidence in the results. Formal methods rely on formal proofs to demonstrate a mathematical
property over a domain. Empirical methods rely on experimentation, observation and expert judgement.
In planning the testing, the environment setup needs to be identified, data collection planned, and data
characteristics defined (that is, which data element ranges and data types will be used, which edge
cases will be specified to test robustness, etc.). The output of Step 2 is a testing protocol that comprises
a document stating the rationale, objectives, design and proposed analysis, methodology, monitoring,
conduct and record-keeping of the tests (more details of the content of a testing protocol are available
through the definition of the clinical investigation plan found in ISO 14155:2020, 3.9).
Step 3: Conduct testing
The testing is then conducted according to the defined testing protocol, and outcomes are collected.
It is possible to perform the tests using a real-world experiment or a simulation, and potentially a
combination of these approaches.
Step 4: Analyze outcome
After completion, test outcomes are analysed using the metrics chosen in Step 1.
Step 5: Interpret results
The analysis results are then interpreted to inform the decision.
Step 6: Test objective achieved?
A decision on system robustness is then formulated given the criteria identified earlier and the resulting
interpretation of the analysis results.
If the test objectives are not met, an analysis of the process is conducted and the process returns to the
appropriate preceding step, in order to alleviate deficiencies, e.g. add robustness goals, modify or add
metrics, add consideration of different aspects to measure, re-plan tests, etc.
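The six-step loop above can be sketched in code. This is a minimal illustration only: the function names, the `(metric, threshold)` goal representation and the revision strategy are assumptions of this sketch, not defined by the report.

```python
def assess_robustness(goals, plan_tests, run_test, analyse, revise, max_rounds=3):
    """Sketch of the Figure 1 workflow. All callables are placeholders:
    goals      -- list of (metric_fn, threshold) decision criteria (Step 1)
    plan_tests -- turns goals into concrete test cases (Step 2)
    run_test   -- executes one test case (Step 3)
    analyse    -- aggregates raw outcomes into metric inputs (Step 4)
    revise     -- adjusts goals/tests when objectives are not met
    """
    results = None
    for _ in range(max_rounds):
        tests = plan_tests(goals)                 # Step 2: plan testing
        outcomes = [run_test(t) for t in tests]   # Step 3: conduct testing
        results = analyse(outcomes)               # Step 4: analyse outcome
        # Steps 5-6: interpret results and decide against the stated criteria
        if all(metric(results) >= threshold for metric, threshold in goals):
            return True, results                  # test objectives achieved
        # Objectives not met: revisit goals/plan and iterate
        goals, plan_tests = revise(goals, plan_tests, results)
    return False, results
```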
AI systems that significantly rely on neural networks, particularly deep neural networks (DNN), bear built-in malfunctions. These malfunctions show up as system behaviour that resembles the occurrence of a conventional software error. Typical situations have been demonstrated by feeding "adversarial examples" to object recognition systems, e.g. in Reference [5]. These built-in errors of DNNs are not simple to "fix". Research on this problem shows that there are measures to improve the robustness of DNNs with respect to adversarial examples, but this works to a certain degree only[6],[7]. However, if such errors are detected during a test procedure, the AI system can be made to signal a problem when an associated input pattern is encountered.
Data sourcing:
Data sourcing is the process of selecting, producing and/or generating the testing data and objects that
are needed for conducting the testing.
This sometimes includes consideration of legal or other regulatory requirements, as well as practical or
technical issues.
The testing protocol contains the requirements and the criteria necessary for data sourcing. Data
sourcing issues and methods are not covered in detail in this document.
In particular, the following issues can have an impact on robustness:
— scale;
— diversity, representativeness, and range of outliers;
— choice of real or synthetic data;
— datasets used specifically for robustness testing;
— adversarial and other examples that explore hypothetical domain extremes;
— composition of training, testing, and validation datasets.
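As a minimal illustration of sourcing perturbed test data (the subject of Annex A), existing samples can be copied with added noise. The Gaussian-noise model, the noise level and the toy dataset below are assumptions of this sketch, not prescriptions of the report:

```python
import random

def perturb(samples, noise_std=0.05, seed=0):
    """Return a noisy copy of `samples` for robustness testing.

    Adds zero-mean Gaussian noise to every feature; `noise_std`
    controls the perturbation magnitude and is chosen per use case.
    """
    rng = random.Random(seed)  # fixed seed for reproducible test data
    return [[x + rng.gauss(0.0, noise_std) for x in row] for row in samples]

clean = [[0.1, 0.2], [0.3, 0.4]]
noisy = perturb(clean)  # same shape, slightly perturbed values
```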
4.2 Classification of methods
Following the workflow defined above for determining robustness, the remainder of this document
describes the methods and metrics applicable to the various testing types, i.e. statistical, formal and
empirical methods.
Statistical approaches usually rely on a mathematical testing process on some datasets, and help
ensure a certain level of confidence in the results. Formal methods rely on a sound formal proof in
order to demonstrate a mathematical property over a domain. Formal methods in this document are
not constrained to the traditional notion of syntactic proof methods and include correctness checking
methods, such as model checking. Empirical methods rely on experimentation, observation and expert
judgement.
While it is possible to characterize a system through either observation or proof, this document chooses
to separate observation techniques into statistical and empirical methods. Statistical methods generate
reproducible measures of robustness based on specified datasets. Empirical methods produce data
that can be analysed with statistical methods but is not necessarily reproducible due to the inclusion
of subjective assessment. Therefore, it is usually necessary that methods from both categories be
performed jointly.
Thus, this document first considers statistical approaches which are the most common approaches
used to assess robustness. They are characterized by a testing approach defined by a methodology
using mathematical metrics. This document then examines approaches to attain a formal proof that
are increasingly being used to assess robustness. Finally, this document presents empirical approaches
that rely on subjective observations that complement the assessment of robustness when statistical
and formal approaches are not sufficient or viable.
In practice, in the current state of the art, these methods are not used to directly assess robustness
as a whole. Instead, they each target complementary aspects of robustness, providing several partial
indicators whose conjunction enables robustness assessment.
For an evaluator, it is indeed possible to use these methods to answer different kinds of questions on
the system they intend to validate. For example:
— statistical methods allow the evaluator to check if the system's properties reach a desired target threshold (e.g. how many defective units are produced?);
— formal methods allow the evaluator to check if the properties are provable on the domain of use (e.g.
does the system always operate within the specified safety bounds?);
— empirical methods allow the evaluator to assess the degree to which the system’s properties hold
true in the scenario tested (e.g. is the observed behaviour satisfactory?).
The principle of applying such methods to robustness assessment is to evaluate to which extent these
properties hold when circumstances change:
— when using statistical methods: how is the measured value of performance affected when changing
the conditions?
— when using formal methods: do the new conditions still belong to the domain where the properties
are provable?
— when using empirical methods: do the properties still hold true in other scenarios?
It is noted that characterizing the robustness of neural networks is an active area of research, and there
are limitations to both testing and validation approaches. With testing approaches, the variation of
possible inputs is unlikely to be large enough to provide any guarantees on system performance. With
validation approaches, approximations are usually required to handle the high dimensionality of inputs
and parameterizations of a neural network.
5 Statistical methods
5.1 General
One aspect of robustness is the effect of changing circumstances on quantitative performance, which
statistical methods are particularly suited to measure. They enable this assessment by direct evaluation
of the performance in various scenarios using comparative measures.
When using statistical methods, the four following main criteria are used in the computation of
robustness:
1) Appropriate testing data. To evaluate the robustness of a model, a dataset that spans the
distribution and the input conditions of interest for the target application is first established, either
through acquisition of real measurement data or simulated data. Several sources for the data are
possible, such as noisy data that was not accounted for during the initial training of the model, data
from similar domain applications, data from a different but equivalent data source. While there is
no general method to assess the relevance of a dataset and it often relies on human judgment, some
techniques exist (e.g. based on intermediate representations of the data) to support this analysis
with various indicators. The evaluation of the robustness of neural network models can vary with
different testing datasets.
2) Choice of model settings. The evaluation can also assess the robustness using different settings of the trained model (for example, the model precision, quantized weights, etc.).
3) A choice of metric or metrics of performance. Based on the context, the task at hand and the
nature of the data, some metrics are not always appropriate, as they can lead to irrelevant or
misleading results. An appropriate set of metrics (see 5.2) helps to avoid these situations.
4) A method for a decision on robustness. Given a selected metric, an appropriate statistical test
is performed to reach a decision regarding whether the model is sufficiently robust for the chosen
robustness goal(s) or not.
A robustness property assessed through statistical methods is defined by one or more thresholds over a
set of metrics that need to hold on some testing data. The evaluation of robustness is case-specific, given
that certain organizations or situations require different robustness goals and metrics to determine if
a goal is met.
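A minimal sketch of such a threshold-based decision, assuming a single score-valued metric and an illustrative degradation tolerance (both the function names and the tolerance are assumptions, not values from the report):

```python
def robust_by_threshold(metric_fn, model, clean_data, shifted_data, max_drop=0.05):
    """Decide robustness as: performance on shifted data may not drop
    more than `max_drop` below performance on the reference data.

    `metric_fn(model, data)` returns a score in [0, 1]; all names here
    are illustrative placeholders for the evaluator's own choices.
    """
    baseline = metric_fn(model, clean_data)   # performance in nominal conditions
    shifted = metric_fn(model, shifted_data)  # performance under changed conditions
    return (baseline - shifted) <= max_drop, {"baseline": baseline, "shifted": shifted}
```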
This clause follows the general workflow described in Figure 1 to assess the robustness of a neural
network. In particular, it focuses on Steps 1, 2 and 3 of the workflow defined in 4.1.2, i.e. state robustness
goal, plan testing and conduct testing.
Subclauses 5.2 and 5.3 present metrics and methods to assess the robustness of neural networks statistically; more detailed information on each is available in References [8], [9], [10] and [11].
5.2 Robustness metrics available using statistical methods
5.2.1 General
This subclause presents background information about statistical metrics that are available and
typically used on the output of neural networks. It describes the robustness goals, using Step 1 of
Figure 1. Robustness goals need to be well defined. For example, to say simply that “the trained neural
network has to be robust to inputs dissimilar to those on which it has been trained” is not sufficiently
well defined. It is possible for a neural network to have full compliance or no compliance with this goal
depending on the input. For example, a neural network can be fully robust to inputs that follow a different distribution from the initial training and testing sets but remain within the scope of the domain. On the other hand, a neural network is likely to be not compliant at all if the inputs are in a completely different domain than those it was trained for. Therefore, the robustness goal needs to be stated precisely enough to enable meaningful determination of a neural network's robustness.
An example of a well-defined goal (structured in three parts) is as follows.
1) The trained neural network needs to be robust to inputs dissimilar to those on which it has been
trained.
2) Inputs are assumed to be from the same domain and can include both physically realizable and hypothetical inputs.
3) Metrics that can be used are included in 5.2.2.
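Such a three-part goal can be recorded in a structured, machine-readable form. The field names below are an illustrative convention of this sketch, not a schema defined by the report:

```python
# Illustrative structured encoding of the three-part robustness goal above.
robustness_goal = {
    "statement": "Robust to inputs dissimilar to those seen in training",  # part 1
    "input_assumptions": {                                                 # part 2
        "same_domain": True,
        "physically_realizable": True,
        "hypothetical": True,
    },
    "metrics": ["RMSE", "max_error", "actual_predicted_correlation"],      # part 3, see 5.2.2
}
```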
Depending on the task addressed by the AI system (e.g. classification, interpolation/regression)
different statistical metrics are possible. This subclause describes common statistical metrics available
and the way to calculate them. The list is not exhaustive and some of these metrics are compatible with
other tasks. It is possible to use them independently or in combination. Depending on the application,
there are also numerous task-specific metrics (e.g. BLEU, TER or METEOR for machine translation,
intersection over union for object detection in images, or mean average precision for ranked retrieval),
but their description is out of the scope of this document.
5.2.2 Examples of performance measures for interpolation
5.2.2.1 Root mean square error or root mean square deviation
The root mean square error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are and RMSE is a measure of
the spread of the residuals.
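The RMSE definition above translates directly into code; a pure-Python sketch for illustration:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: the standard deviation of the residuals
    (differences between actual and predicted values)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))
```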
5.2.2.2 Max error
Max error is either an absolute or a relative metric calculated between a value in the input data and the corresponding value in the prediction from the AI system. The absolute max error is the maximal signed difference between a value in the input data and the corresponding value in the prediction from the AI system. The relative max error expresses this difference as a percentage of the width of the variation domain on which the AI system operates.
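Both variants can be computed as follows; `domain_width` (the width of the variation domain on which the AI system operates) is supplied by the evaluator, and the function names are illustrative:

```python
def max_error_absolute(actual, predicted):
    """Maximal signed difference (largest-magnitude residual) between
    a value in the input data and the corresponding prediction."""
    return max((a - p for a, p in zip(actual, predicted)), key=abs)

def max_error_relative(actual, predicted, domain_width):
    """Max error expressed as a percentage of the width of the
    variation domain on which the AI system operates."""
    return 100.0 * abs(max_error_absolute(actual, predicted)) / domain_width
```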
5.2.2.3 Actual/predicted correlation
The actual/predicted correlation is the linear correlation (in the statistical sense) between the actual
values and the predicted values for every value considered in a set.
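The linear (Pearson) correlation between actual and predicted values can be computed in pure Python for illustration:

```python
import math

def actual_predicted_correlation(actual, predicted):
    """Pearson linear correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)
```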
5.2.3 Examples of performance measures for classification
5.2.3.1 General notions and associated basic metrics
A set of samples can have the following characteristics:
— total population: the total number of samples in the data;
— condition positive: the number of real positive cases in the data;
— condition negative: the number of real negative cases in the data;
— prediction positive: the number of samples classified as positive;
— prediction negative: the number of samples classified as negative;
— prevalence: the proportion of a particular class in the total number of samples.
Each instance in the set of samples is classified by the classification system in one of the following ways:
— true positive (hit): instance belongs to the class and is predicted as belonging to the class;
— true negative (correct rejection): instance does not belong to the class and is predicted as not
belonging to the class;
— false positive (false alarm, Type I error): instance does not belong to the class and is predicted as
belonging to the class;
— false negative (miss, Type II error): instance belongs to the class and is predicted as not belonging
to the class.
Several metrics are built on top of these sample characteristics, as described in Table 1.
— True positive rate, sensitivity: the true positive rate (also known as sensitivity, recall or probability
of detection) indicates the proportion of objects correctly classified as positive in the total number
of positive objects.
— True negative rate, specificity: the true negative rate (also known as specificity or selectivity)
indicates the proportion of objects correctly classified as negative in the total number of negative
objects.
— False negative rate: the false negative rate (also known as miss rate) indicates the proportion of
objects falsely classified as negative in the total number of positive objects.
— False positive rate: the false positive rate (also known as fall-out or probability of false alarm)
indicates the proportion of negative objects falsely classified as positive in the total number of
negative objects.
— Accuracy: the accuracy indicates the proportion of all objects that are correctly classified.
© ISO/IEC 2021 – All rights reserved 9
— Positive predictive value: the positive predictive value (also known as precision or relevance)
indicates the proportion of results correctly classified as positive in the total of results classified as
positive.
— Negative predictive value: the negative predictive value (also known as separation ability) indicates
the proportion of results correctly classified as negative in the total number of results classified as
negative.
— False discovery rate: the false discovery rate indicates the ratio of mistakenly rejected null
hypotheses (false positives, false alarm, type I errors) to the total rejected null hypotheses
(prediction positives).
— False omission rate: the false omission rate indicates the ratio of false negatives to the total
number of predicted negatives.
— Positive likelihood ratio: the positive likelihood ratio indicates the ratio of the true positive
rate to the false positive rate.
— Negative likelihood ratio: the negative likelihood ratio indicates the ratio of the false negative
rate to the true negative rate.
— Diagnostic odds rate: the diagnostic odds rate indicates the ratio of the positive likelihood ratio
to the negative likelihood ratio and is independent of prevalence.
— F1 score: the F1 score combines the true positive rate and the positive predictive value using the
harmonic mean.
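The metrics above can be sketched directly from the four outcome counts. The counts below are invented example data, and the variable names are illustrative, not taken from the TR.

```python
# Illustrative sketch of the basic metrics built on outcome counts.
n_tp, n_tn, n_fp, n_fn = 40, 45, 5, 10   # made-up example counts

condition_pos = n_tp + n_fn
condition_neg = n_tn + n_fp
predicted_pos = n_tp + n_fp
predicted_neg = n_tn + n_fn
total = condition_pos + condition_neg

tpr = n_tp / condition_pos            # true positive rate (sensitivity, recall)
tnr = n_tn / condition_neg            # true negative rate (specificity)
fpr = n_fp / condition_neg            # false positive rate (fall-out)
fnr = n_fn / condition_pos            # false negative rate (miss rate)
accuracy  = (n_tp + n_tn) / total
precision = n_tp / predicted_pos      # positive predictive value
npv       = n_tn / predicted_neg      # negative predictive value
fdr       = n_fp / predicted_pos      # false discovery rate
fomr      = n_fn / predicted_neg      # false omission rate
lr_pos    = tpr / fpr                 # positive likelihood ratio
lr_neg    = fnr / tnr                 # negative likelihood ratio
dor       = lr_pos / lr_neg           # diagnostic odds rate
f1        = 2 / (1 / tpr + 1 / precision)   # harmonic mean of recall and precision
```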
Table 1 — Sample characteristics and relevant basic metrics built upon them

| Predicted \ true condition | Condition positive (CP), N_C+ | Condition negative (CN), N_C− | | |
|---|---|---|---|---|
| Total population P_tot | | | Prevalence = N_C+ / P_tot | Accuracy = (N_T+ + N_T−) / P_tot |
| Prediction positive, N_P+ | True positive (TP), N_T+ | False positive (FP), Type I error, N_F+ | Positive predictive value (precision, relevance): V_P+ = N_T+ / N_P+ | False discovery rate = N_F+ / N_P+ |
| Prediction negative, N_P− | False negative (FN), Type II error, N_F− | True negative (TN), N_T− | False omission rate = N_F− / N_P− | Negative predictive value (separation ability) = N_T− / N_P− |
| | True positive rate (sensitivity, recall, probability of detection): R_T+ = N_T+ / N_C+ | False positive rate (fall-out, probability of false alarm): R_F+ = N_F+ / N_C− | Positive likelihood ratio: R_L+ = R_T+ / R_F+ | Diagnostic odds rate = R_L+ / R_L−; F1 score = 2 / (R_T+⁻¹ + V_P+⁻¹) |
| | False negative rate (miss rate): R_F− = N_F− / N_C+ | True negative rate (specificity, selectivity): R_T− = N_T− / N_C− | Negative likelihood ratio: R_L− = R_F− / R_T− | |

where
N_T+ is the number of true positives;
N_T− is the number of true negatives;
N_F+ is the number of false positives;
N_F− is the number of false negatives;
N_C+ is the number of conditions positive;
N_C− is the number of conditions negative;
P_tot is the total population;
N_P+ is the number of predictions positive;
N_P− is the number of predictions negative;
R_T+ is the true positive rate;
R_T− is the true negative rate;
R_F+ is the false positive rate;
R_F− is the false negative rate;
R_L+ is the positive likelihood ratio;
R_L− is the negative likelihood ratio;
V_P+ is the positive predictive value.
Table 1 provides a synthetic view of the sample characteristics and metrics described in this subclause.
All these sample characteristics and metrics apply primarily to binary classification but have also
generalized definitions in the multiclass and multilabel cases.
5.2.3.2 Advanced metrics
5.2.3.2.1 Precision recall curve
Precision/recall pairs are computed at different output thresholds. The resulting curve expresses the
trade-off between precision and recall when these metrics are used to evaluate robustness.
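A minimal sketch of such a sweep over thresholds; the scores and labels are invented example data, not part of the TR.

```python
# Hedged sketch: precision/recall pairs at different decision thresholds
# for a binary scorer. Scores and labels are made-up example data.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]   # 1 = positive class

pr_curve = []
for threshold in sorted(set(scores), reverse=True):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    pr_curve.append((threshold, precision, recall))
```

Lowering the threshold typically raises recall at the cost of precision, which is the trade-off the curve makes visible.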
5.2.3.2.2 Receiver operating characteristic (ROC)
The ROC curve is a plot of the true positive rate against the false positive rate at different settings of the
hyperparameters (e.g. decision threshold).
ROC curves express trade-offs between true positive rates and false positive rates when these metrics
are used to evaluate robustness. They are used when one of these rates is associated with a significant
cost or benefit in robustness evaluation, such as in the medical domain, where false diagnoses can be
especially problematic.
5.2.3.3 Lift
The lift metric is a measure comparing the relative performance of a prediction system against another
control group (usually randomly selected).
5.2.3.4 Area under curve
Area under curve measures the integral of the ROC curve which represents the performance of a model
for every threshold of classification. The ROC curve shows the true positive rate relative to the false
positive rate.
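The ROC curve and its area can be sketched as follows; the scores and labels are invented example data, and the trapezoidal rule is one common way (an assumption here, not prescribed by the TR) to approximate the integral.

```python
# Hedged sketch: ROC points (FPR, TPR) swept over decision thresholds,
# with the area under the curve via the trapezoidal rule. Example data only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

pos = sum(labels)
neg = len(labels) - pos

# Sweep thresholds from above the max score down to below the min score,
# so the curve runs from (0, 0) to (1, 1).
thresholds = [2.0] + sorted(set(scores), reverse=True) + [-1.0]
roc = []
for t in thresholds:
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    roc.append((fp / neg, tp / pos))   # (FPR, TPR)

# Trapezoidal area under the ROC curve
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
```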
5.2.3.5 Balanced accuracy
Balanced accuracy is the average recall obtained on each class as described in Reference [12].
5.2.3.6 Micro average and macro average
Measures like precision or recall computed over the whole dataset are sometimes misleading in cases
of unbalanced datasets. A possible strategy to alleviate this is to compute a macro-averaged measure,
which is the average of the measure computed for each class separately, instead of the micro-average,
which is the standard computation without class separation[13].
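The difference between the two averaging strategies can be sketched on a small unbalanced example; the labels below are invented.

```python
# Hedged sketch: micro- vs macro-averaged recall on an unbalanced
# three-class problem. Labels are made-up example data.
true = ['a', 'a', 'a', 'a', 'b', 'c']
pred = ['a', 'a', 'a', 'a', 'c', 'b']

classes = sorted(set(true))

# Macro average: compute recall per class, then average the per-class values.
per_class = []
for c in classes:
    support = sum(1 for t in true if t == c)
    hits = sum(1 for t, p in zip(true, pred) if t == c and p == c)
    per_class.append(hits / support)
macro_recall = sum(per_class) / len(classes)

# Micro average: pool all samples, ignoring class boundaries
# (for single-label classification this equals overall accuracy).
micro_recall = sum(1 for t, p in zip(true, pred) if t == p) / len(true)
```

Here the dominant class 'a' is always correct, so the micro average looks good while the macro average exposes the failures on the rare classes. The macro-averaged recall is also what 5.2.3.5 calls balanced accuracy.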
5.2.3.7 Matthews correlation coefficient (MCC)
Matthews correlation coefficient is a measure on a set of classifications. Its range lies within [-1,1],
in which +1 represents perfect prediction, -1 represents total inverse prediction and 0 represents
average prediction. Crucially, this metric generalizes to instances where the classes are unbalanced
in the data themselves (i.e. an MCC of 0 does not necessitate 1/N prediction accuracy, given N
classes)[14],[15].
It is computed in Formula (1):

MCC = (N_T+ × N_T− − N_F+ × N_F−) / √((N_T+ + N_F+)(N_T+ + N_F−)(N_T− + N_F+)(N_T− + N_F−))   (1)

where
N_T+ is the number of true positives;
N_T− is the number of true negatives;
N_F+ is the number of false positives;
N_F− is the number of false negatives.
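Formula (1) can be sketched directly; the outcome counts below are invented example data.

```python
# Hedged sketch of Formula (1): Matthews correlation coefficient from the
# four outcome counts. The counts are made-up example data.
n_tp, n_tn, n_fp, n_fn = 40, 45, 5, 10

numerator = n_tp * n_tn - n_fp * n_fn
denominator = ((n_tp + n_fp) * (n_tp + n_fn)
               * (n_tn + n_fp) * (n_tn + n_fn)) ** 0.5
mcc = numerator / denominator   # in [-1, 1]; +1 is perfect prediction
```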
5.2.3.8 Confusion matrix and associated metrics
A confusion matrix allows a detailed analysis of the performance of a classifier and is helpful to
circumvent or uncover the weaknesses of individual metrics as it achieves a more rigorous and well-
rounded analysis of classifier performance. By contrast, using a single metric to express classifier
performance is not informative enough to conduct this analysis, as it does not indicate which classes
are best recognized or the type of errors committed by the classifier.
The confusion matrix C is a square matrix where entry C_r,c at row r and column c is the number of
instances belonging to the r-th class or category that are labelled by the classifier as having the
c-th class.
Confusion matrices include counts of true positives, true negatives, false positives and false negatives:
metrics such as accuracy, per-class recall, and per-class precision can be calculated from these. Further
metrics can be derived from confusion matrix elements, such as entropy of the histogram represented
by the matrix.
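A minimal sketch of a confusion matrix and some metrics derived from it; the class names and labels are invented example data.

```python
# Hedged sketch: confusion matrix C, where C[r][c] counts instances of
# true class r labelled by the classifier as class c. Example data only.
classes = ['cat', 'dog', 'bird']
true = ['cat', 'cat', 'dog', 'dog', 'bird', 'bird', 'bird']
pred = ['cat', 'dog', 'dog', 'dog', 'bird', 'bird', 'cat']

index = {c: i for i, c in enumerate(classes)}
C = [[0] * len(classes) for _ in classes]
for t, p in zip(true, pred):
    C[index[t]][index[p]] += 1

# Per-class recall: diagonal entry over its row sum.
recall = [C[i][i] / sum(C[i]) for i in range(len(classes))]
# Per-class precision: diagonal entry over its column sum.
precision = [C[i][i] / sum(row[i] for row in C) for i in range(len(classes))]
# Overall accuracy: sum of the diagonal over the total count.
accuracy = sum(C[i][i] for i in range(len(classes))) / len(true)
```

The off-diagonal entries show which classes are confused with which, information a single scalar metric cannot convey.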
5.2.4 Other measures
5.2.4.1 Hinge loss
Hinge loss is an upper bound on the number of mistakes made by a classifier. In the general multiclass
case, the margin is computed by the Crammer-Singer method[16].
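A minimal sketch of the Crammer-Singer multiclass hinge loss for one sample, loss = max(0, 1 + max over j ≠ y of s_j − s_y); the decision scores below are invented.

```python
# Hedged sketch: multiclass (Crammer-Singer) hinge loss for one sample.
# Scores are made-up example data.
scores = [2.0, 0.5, 1.8]   # decision scores for three classes
y = 0                      # index of the true class

# Margin: true-class score minus the best competing score.
margin = scores[y] - max(s for j, s in enumerate(scores) if j != y)
hinge = max(0.0, 1.0 - margin)   # zero only when the margin exceeds 1
```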
5.2.4.2 Cohen’s kappa
Cohen’s kappa is a measure of inter-annotator agreement [see Formula (2)]:

κ = (p_o − p_e) / (1 − p_e)   (2)

where
p_o is the observed probability of agreement of the label on any sample in the data;
p_e is the expected agreement when each of two annotators assigns labels independently and
according to their own measured prior distributions, given empirical data.
This measure is primarily used for evaluating data quality after (error-prone) human annotation,
but it also has applications as a proxy evaluation method when labels are missing, by comparing two
classifiers with each other.
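Formula (2) can be sketched for two annotators; the annotation lists below are invented example data.

```python
# Hedged sketch of Formula (2): Cohen's kappa for two annotators over a
# shared set of samples. Labels are made-up example data.
ann1 = ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no']
ann2 = ['yes', 'no',  'no', 'yes', 'no', 'yes', 'yes', 'no']

n = len(ann1)
# p_o: observed probability of agreement
p_o = sum(1 for a, b in zip(ann1, ann2) if a == b) / n

# p_e: expected agreement when each annotator labels independently
# according to their own empirical prior distribution
labels = set(ann1) | set(ann2)
p_e = sum((ann1.count(l) / n) * (ann2.count(l) / n) for l in labels)

kappa = (p_o - p_e) / (1 - p_e)
```

Replacing one annotator with a classifier's outputs gives the proxy evaluation mentioned above.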
5.3 Statistical methods to measure robustness of a neural network
5.3.1 General
When applying the metrics from 5.2 on testing data in order to assess robustness, several statistical
techniques are available. This subclause describes some of the statistical methodologies available to
perform steps 2 and 3 described in 4.1 to plan and conduct the testing. Performing a testing protocol
is not unique to neural networks and considerations include the testing environment set-up, what and
how to measure, and data sourcing and characteristics. The difference in neural network robustness
test planning is a more “intense” consideration of the data sourcing (e.g. quality, granularity, train/
test/validation datasets, etc.). While conducting the testing, planned data sourcing and availability of
computational resources are important considerations due to the sometimes massive amounts of data
and computational resources required by neural networks.
5.3.2 Contrastive measures
The statistical measures of performance are applied first on a reference dataset and then on one or
several datasets representative of the targeted changes of circumstances. For each of those, if the
performance drop from the reference test set is sufficiently low, then the system is deemed robust.
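A contrastive check can be sketched as follows; the dataset names, accuracy values and the drop threshold are all invented for illustration, and the threshold itself would come from the robustness goals set during test planning.

```python
# Hedged sketch: contrastive robustness check comparing a performance
# metric on a reference dataset against datasets representing targeted
# changes of circumstances. All numbers are made-up example data.
reference_accuracy = 0.92
changed_accuracy = {'added noise': 0.90, 'low light': 0.88, 'motion blur': 0.79}

max_allowed_drop = 0.05   # hypothetical robustness goal from test planning

# The system is deemed robust to a change if the performance drop from
# the reference test set stays within the allowed budget.
robust = {name: (reference_accuracy - acc) <= max_allowed_drop
          for name, acc in changed_accuracy.items()}
```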
6 Formal methods
6.1 General
Another aspect of robustness is the degree to which changing circumstances affect the behaviour of
the system, independently of its performance. Formal methods are especially appropriate to assess
the stability of the system, i.e. the extent to which its outcome changes when its input varies.
Although a robust system can be unstable and a stable system can lack robustness, stability is a
strong indicator of robustness, as it makes the outcome more predictable.
Formal methods have been used to improve software reliability and allow stronger quali
...