ISO/IEC TR 24029-1:2021
Artificial Intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview
This document provides background about existing methods to assess the robustness of neural networks.
TECHNICAL REPORT ISO/IEC TR 24029-1
First edition
2021-03
Artificial Intelligence (AI) — Assessment of the robustness of neural networks —
Part 1: Overview
Reference number: ISO/IEC TR 24029-1:2021(E)
© ISO/IEC 2021
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2021
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Overview of the existing methods to assess the robustness of neural networks
4.1 General
4.1.1 Robustness concept
4.1.2 Typical workflow to assess robustness
4.2 Classification of methods
5 Statistical methods
5.1 General
5.2 Robustness metrics available using statistical methods
5.2.1 General
5.2.2 Examples of performance measures for interpolation
5.2.3 Examples of performance measures for classification
5.2.4 Other measures
5.3 Statistical methods to measure robustness of a neural network
5.3.1 General
5.3.2 Contrastive measures
6 Formal methods
6.1 General
6.2 Robustness goal achievable using formal methods
6.2.1 General
6.2.2 Interpolation stability
6.2.3 Maximum stable space for perturbation resistance
6.3 Conduct the testing using formal methods
6.3.1 Using uncertainty analysis to prove interpolation stability
6.3.2 Using solver to prove a maximum stable space property
6.3.3 Using optimization techniques to prove a maximum stable space property
6.3.4 Using abstract interpretation to prove a maximum stable space property
7 Empirical methods
7.1 General
7.2 Field trials
7.3 A posteriori testing
7.4 Benchmarking of neural networks
Annex A (informative) Data perturbation
Annex B (informative) Principle of abstract interpretation
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that
are members of ISO or IEC participate in the development of International Standards through
technical committees established by the respective organization to deal with particular fields of
technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other
international organizations, governmental and non-governmental, in liaison with ISO and IEC, also
take part in the work.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents) or the IEC
list of patent declarations received (see patents.iec.ch).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html. In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 42, Artificial intelligence.
A list of all parts in the ISO/IEC 24029 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-committees.
Introduction
When designing an AI system, several properties are often considered desirable, such as robustness,
resiliency, reliability, accuracy, safety, security and privacy. A definition of robustness is provided in 3.6.
Robustness is a crucial property that poses new challenges in the context of AI systems, where some
risks are specifically tied to it. Understanding these risks is essential for the adoption of AI in many
contexts. This document aims at providing an overview of the approaches available to assess these
risks, with a particular focus on neural networks, which are heavily used in industry, government
and academia.
In many organizations, software validation is an essential part of putting software into production.
The objective is to ensure various properties including safety and performance of the software used
in all parts of the system. In some domains, the software validation and verification process is also
an important part of system certification. For example, in the automotive or aeronautic fields, existing
standards, such as ISO 26262 or Reference [2], require some specific actions to justify the design, the
implementation and the testing of any piece of embedded software.
The techniques used in AI systems are also subject to validation. However, common techniques used in
AI systems pose new challenges that require specific approaches in order to ensure adequate testing
and validation.
AI technologies are designed to fulfil various tasks, such as interpolation/regression and
classification, among others.
While many methods exist for validating non-AI systems, they are not always directly applicable to
AI systems, and neural networks in particular. Neural network systems represent a specific challenge
as they are both hard to explain and sometimes have unexpected behaviour due to their non-linear
nature. As a result, alternative approaches are needed.
Methods are categorized into three groups: statistical methods, formal methods and empirical methods.
This document provides background on these methods to assess the robustness of neural networks.
It is noted that characterizing the robustness of neural networks is an open area of research, and there
are limitations to both testing and validation approaches.
Artificial Intelligence (AI) — Assessment of the robustness of neural networks —
Part 1: Overview
1 Scope
This document provides background about existing methods to assess the robustness of neural
networks.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
artificial intelligence
AI
capability of an engineered system to acquire, process and apply knowledge and skills
3.2
field trial
trial of a new system in actual situations for which it is intended (potentially with a restricted user group)
Note 1 to entry: Situation encompasses environment and process of usage.
3.3
input data
data for which a deployed machine learning model calculates a predicted output or inference
Note 1 to entry: Input data is also referred to by machine learning practitioners as out-of-sample data, new data
and production data.
3.4
neural network
neural net
NN
artificial neural network
ANN
network of primitive processing elements connected by weighted links with adjustable weights, in
which each element produces a value by applying a non-linear function to its input values, and transmits
it to other elements or presents it as an output value
Note 1 to entry: Whereas some neural networks are intended to simulate the functioning of neurons in the nervous
system, most neural networks are used in artificial intelligence as realizations of the connectionist model.
Note 2 to entry: Examples of non-linear functions are a threshold function, a sigmoid function and a polynomial
function.
[SOURCE: ISO/IEC 2382:2015, 2120625, modified — Abbreviated terms have been added under the
terms and Notes 3 to 5 to entry have been removed.]
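As a purely illustrative sketch outside the definition itself, the following Python fragment (with invented input values, weights and bias) shows one such primitive processing element: a weighted sum of the input values over adjustable weights, passed through a non-linear sigmoid function to produce the output value.

import math

def neuron(inputs, weights, bias):
    # One primitive processing element: weighted sum of the input values
    # over adjustable weights, followed by a non-linear (sigmoid) activation.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid non-linearity

# Example with two input values; all numbers are illustrative.
print(neuron([0.5, -1.2], weights=[0.8, 0.3], bias=0.1))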
3.5
requirement
statement which translates or expresses a need and its associated constraints and conditions
[SOURCE: ISO/IEC/IEEE 15288:2015, 4.1.37]
3.6
robustness
ability of an AI system to maintain its level of performance under any circumstances
Note 1 to entry: This document mainly describes data input circumstances such as domain change, but the
definition is broader so as not to exclude hardware failure and other types of circumstances.
3.7
testing
activity in which a system or component is executed under specified conditions, the results are observed
or recorded, and an evaluation is made of some aspect of the system or component
[SOURCE: ISO/IEC/IEEE 26513:2017, 3.42]
3.8
test data
subset of input data (3.3) samples used to assess the generalization error of a final machine learning
(ML) model selected from a set of candidate ML models
[SOURCE: Reference [2]]
3.9
training dataset
set of samples used to fit a machine learning model
3.10
validation
confirmation, through the provision of objective evidence, that the requirements (3.5) for a specific
intended use or application have been fulfilled
[SOURCE: ISO/IEC 25000:2014, 4.41, modified — Note 1 to entry has been removed.]
3.11
validation data
subset of input data (3.3) samples used to assess the prediction error of a candidate machine
learning model
Note 1 to entry: Machine learning (ML) model validation (3.10) can be used for ML model selection.
[SOURCE: Reference [2]]
3.12
verification
confirmation, through the provision of objective evidence, that specified requirements have been
fulfilled
[SOURCE: ISO/IEC 25000:2014, 4.43, modified — Note 1 to entry has been removed.]
4 Overview of the existing methods to assess the robustness of neural networks
4.1 General
4.1.1 Robustness concept
Robustness goals aim at answering the question “To what degree is the system required to be robust?”
or “What are the robustness properties of interest?”. Robustness properties demonstrate the degree to
which the system performs with atypical data as opposed to the data expected in typical operations.
4.1.2 Typical workflow to assess robustness
This subclause explains how the robustness of neural networks is assessed for different classes of AI
applications such as classification, interpolation and other complex tasks.
There are different ways to assess the robustness of neural networks using objective information.
A typical workflow for determining neural network (or other technique) robustness is as shown in
Figure 1.
Figure 1 — Typical workflow to determine neural network robustness
Key: I.I.I. = incomplete, incorrect or insufficient; the flowchart symbols denote start/end, step, input/output and decision elements.
Step 1: State robustness goals
The process begins with a statement of the robustness goals. During this initial step, the targets to
be tested for robustness are identified. The metrics that quantify the achievement of robustness are
subsequently identified. This constitutes the set of decision criteria
on robustness properties that can be subject to further approval by relevant stakeholders (see
ISO/IEC/IEEE 16085:2021, 7.4.2).
Step 2: Plan testing
This step plans the tests that demonstrate robustness. The tests rely on different methods, for example:
statistical, formal or empirical methods. In practice, a combination of methods is used. Statistical
approaches usually rely on a mathematical testing process and are able to illustrate a certain level
of confidence in the results. Formal methods rely on formal proofs to demonstrate a mathematical
property over a domain. Empirical methods rely on experimentation, observation and expert judgement.
In planning the testing, the environment setup needs to be identified, data collection planned, and data
characteristics defined (that is, which data element ranges and data types will be used, which edge
cases will be specified to test robustness, etc.). The output of Step 2 is a testing protocol that comprises
a document stating the rationale, objectives, design and proposed analysis, methodology, monitoring,
conduct and record-keeping of the tests (more details of the content of a testing protocol are available
through the definition of the clinical investigation plan found in ISO 14155:2020, 3.9).
Step 3: Conduct testing
The testing is then conducted according to the defined testing protocol, and outcomes are collected.
It is possible to perform the tests using a real-world experiment or a simulation, and potentially a
combination of these approaches.
Step 4: Analyze outcome
After completion, test outcomes are analysed using the metrics chosen in Step 1.
Step 5: Interpret results
The analysis results are then interpreted to inform the decision.
Step 6: Test objective achieved?
A decision on system robustness is then formulated given the criteria identified earlier and the resulting
interpretation of the analysis results.
If the test objectives are not met, an analysis of the process is conducted and the process returns to the
appropriate preceding step, in order to alleviate deficiencies, e.g. add robustness goals, modify or add
metrics, add consideration of different aspects to measure, re-plan tests, etc.
AI systems that rely significantly on neural networks, particularly deep neural networks (DNN), bear
built-in malfunctions. These malfunctions show up as system behaviour that resembles a fault
occurrence in conventional software. Typical situations have been demonstrated by feeding
"adversarial examples" to object recognition systems, e.g. in Reference [5]. These built-in errors of
DNNs are not simple to "fix". Research on this problem shows that there are measures to improve the
robustness of DNNs with respect to adversarial examples, but this works to a certain degree only[6],[7].
However, if detected during a test procedure, the AI system is able to signal a problem when an
associated input pattern is encountered.
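As an illustration of the kind of input perturbation referred to above, the following Python sketch applies the widely used fast gradient sign method (FGSM) to a toy logistic-regression model; the model, data and step size epsilon are invented for illustration only, and an actual assessment would target the neural network under test.

import numpy as np

# Toy logistic-regression "network": p(y=1|x) = sigmoid(w.x + b).
rng = np.random.default_rng(0)
w = rng.normal(size=4)
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(w @ x + b)

def fgsm(x, y_true, epsilon=0.1):
    # Fast gradient sign method: move the input in the direction that
    # increases the loss, within an epsilon ball in the infinity norm.
    p = predict(x)
    grad_x = (p - y_true) * w   # gradient of the cross-entropy loss w.r.t. x
    return x + epsilon * np.sign(grad_x)

x = rng.normal(size=4)
x_adv = fgsm(x, y_true=1.0)
print("clean prediction:      ", predict(x))
print("adversarial prediction:", predict(x_adv))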
Data sourcing:
Data sourcing is the process of selecting, producing and/or generating the testing data and objects that
are needed for conducting the testing.
This sometimes includes consideration of legal or other regulatory requirements, as well as practical or
technical issues.
The testing protocol contains the requirements and the criteria necessary for data sourcing. Data
sourcing issues and methods are not covered in detail in this document.
In particular, the following issues can have an impact on robustness (a brief illustrative data perturbation sketch follows the list):
— scale;
— diversity, representativeness, and range of outliers;
— choice of real or synthetic data;
— datasets used specifically for robustness testing;
— adversarial and other examples that explore hypothetical domain extremes;
— composition of training, testing, and validation datasets.
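As referenced before the list, the following sketch shows one simple way of producing synthetic perturbed variants of a testing dataset (additive Gaussian noise and missing-value masking); the perturbation types, magnitudes and dataset are illustrative assumptions only, and Annex A discusses data perturbation in more detail.

import numpy as np

def perturbed_variants(X, noise_std=0.05, missing_rate=0.02, seed=0):
    # Return simple synthetic perturbations of a testing dataset X
    # (rows = samples, columns = features) for robustness testing.
    rng = np.random.default_rng(seed)
    noisy = X + rng.normal(scale=noise_std, size=X.shape)   # additive Gaussian noise
    masked = X.copy()
    masked[rng.random(X.shape) < missing_rate] = np.nan     # randomly dropped values
    return {"gaussian_noise": noisy, "missing_values": masked}

# Example with a small synthetic dataset.
X_test = np.random.default_rng(1).normal(size=(100, 4))
variants = perturbed_variants(X_test)
print({name: Xp.shape for name, Xp in variants.items()})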
4.2 Classification of methods
Following the workflow defined above for determining robustness, the remainder of this document
describes the methods and metrics applicable to the various testing types, i.e. statistical, formal and
empirical methods.
Statistical approaches usually rely on a mathematical testing process on some datasets, and help
ensure a certain level of confidence in the results. Formal methods rely on a sound formal proof in
order to demonstrate a mathematical property over a domain. Formal methods in this document are
not constrained to the traditional notion of syntactic proof methods and include correctness checking
methods, such as model checking. Empirical methods rely on experimentation, observation and expert
judgement.
While it is possible to characterize a system through either observation or proof, this document chooses
to separate observation techniques into statistical and empirical methods. Statistical methods generate
reproducible measures of robustness based on specified datasets. Empirical methods produce data
that can be analysed with statistical methods but is not necessarily reproducible due to the inclusion
of subjective assessment. Therefore, it is usually necessary that methods from both categories be
performed jointly.
Thus, this document first considers statistical approaches which are the most common approaches
used to assess robustness. They are characterized by a testing approach defined by a methodology
using mathematical metrics. This document then examines approaches to attain a formal proof that
are increasingly being used to assess robustness. Finally, this document presents empirical approaches
that rely on subjective observations that complement the assessment of robustness when statistical
and formal approaches are not sufficient or viable.
In practice, in the current state of the art, these methods are not used to directly assess robustness
as a whole. Instead, they each target complementary aspects of robustness, providing several partial
indicators whose conjunction enables robustness assessment.
For an evaluator, it is indeed possible to use these methods to answer different kinds of questions on
the system they intend to validate. For example:
— statistical methods allow the evaluator to check if the system's properties reach a desired target
threshold (e.g. how many defective units are produced?);
— formal methods allow the evaluator to check if the properties are provable on the domain of use (e.g.
does the system always operate within the specified safety bounds?);
— empirical methods allow the evaluator to assess the degree to which the system’s properties hold
true in the scenario tested (e.g. is the observed behaviour satisfactory?).
The principle of applying such methods to robustness assessment is to evaluate to which extent these
properties hold when circumstances change:
— when using statistical methods: how is the measured value of performance affected when changing
the conditions?
— when using formal methods: do the new conditions still belong to the domain where the properties
are provable?
— when using empirical methods: do the properties still hold true in other scenarios?
It is noted that characterizing the robustness of neural networks is an active area of research, and there
are limitations to both testing and validation approaches. With testing approaches, the variation of
possible inputs is unlikely to be large enough to provide any guarantees on system performance. With
validation approaches, approximations are usually required to handle the high dimensionality of inputs
and parameterizations of a neural network.
5 Statistical methods
5.1 General
One aspect of robustness is the effect of changing circumstances on quantitative performance, which
statistical methods are particularly suited to measure. They enable this assessment by direct evaluation
of the performance in various scenarios using comparative measures.
When using statistical methods, the following four main criteria are used in the computation of
robustness:
1) Appropriate testing data. To evaluate the robustness of a model, a dataset that spans the
distribution and the input conditions of interest for the target application is first established, either
through acquisition of real measurement data or simulated data. Several sources for the data are
possible, such as noisy data that was not accounted for during the initial training of the model, data
from similar domain applications, data from a different but equivalent data source. While there is
no general method to assess the relevance of a dataset and it often relies on human judgment, some
techniques exist (e.g. based on intermediate representations of the data) to support this analysis
with various indicators. The evaluation of the robustness of neural network models can vary with
different testing datasets.
2) A setting of the model. The evaluation can also assess the robustness using different settings of
the trained model (for example, the model precision, quantized weights, etc.).
3) A choice of performance metrics. Depending on the context, the task at hand and the nature of
the data, some metrics are not appropriate, as they can lead to irrelevant or misleading results.
An appropriate set of metrics (see 5.2) helps to avoid these situations.
4) A method for a decision on robustness. Given a selected metric, an appropriate statistical test
is performed to reach a decision regarding whether the model is sufficiently robust for the chosen
robustness goal(s) or not.
A robustness property assessed through statistical methods is defined by one or more thresholds over a
set of metrics that need to hold on some testing data. The evaluation of robustness is case-specific, given
that certain organizations or situations require different robustness goals and metrics to determine if
a goal is met.
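As a minimal sketch of how the four criteria above can be combined into a threshold-based robustness property, the following Python example compares a model's accuracy on nominal and perturbed testing data and applies a simple decision rule; the datasets, the model, the chosen metric and the threshold values are illustrative assumptions, not recommended settings.

import numpy as np

def accuracy(model, X, y):
    # Performance metric chosen for the robustness goal (criterion 3).
    return float(np.mean(model(X) == y))

def robustness_decision(model, nominal, perturbed, min_accuracy=0.90, max_drop=0.05):
    # Threshold-based decision rule (criterion 4): the metric must hold on
    # the perturbed testing data and the drop from nominal data must stay
    # within an accepted margin.
    acc_nominal = accuracy(model, *nominal)
    acc_perturbed = accuracy(model, *perturbed)
    robust = acc_perturbed >= min_accuracy and (acc_nominal - acc_perturbed) <= max_drop
    return robust, acc_nominal, acc_perturbed

# Toy model and datasets (criteria 1 and 2), purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)   # stands in for a trained model
X_noisy = X + rng.normal(scale=0.3, size=X.shape)        # perturbed testing data

print(robustness_decision(model, (X, y), (X_noisy, y)))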
This clause follows the general workflow described in Figure 1 to assess the robustness of a neural
network. In particular, it focuses on Steps 1, 2 and 3 of the workflow defined in 4.1.2, i.e. state robustness
goal, plan testing and conduct testing.
Subclauses 5.2 and 5.3 present metrics and methods to assess the robustness of neural networks
statistically; more detailed information on each is available in References [8], [9], [10] and [11].
5.2 Robustness metrics available using statistical methods
5.2.1 General
This subclause presents background information about statistical metrics that are available and
typically used on the output of neural networks. It describes the robustness goals, using Step 1 of
Figure 1. Robustness goals need to be well defined. For example, to say simply that “the trained neural
network has to be robust to inputs dissimilar to those on which it has been trained” is not sufficiently
well defined. It is possible for a neural network to have full compliance or no compliance with this goal
depending on the input. For example, it is possible for
...