ISO/TR 9839:2023
Road vehicles — Application of predictive maintenance to hardware with ISO 26262-5
This document is intended to be applied to the usage of predictive maintenance methods for the detection of degrading faults in safety related E/E hardware elements. It applies to hardware elements developed for compliance with the ISO 26262[1] series in which degrading faults are shown to be relevant due to, for instance, the technology used. Specific technical implementations of predictive maintenance solutions are not in scope of this document.
Véhicules routiers — Application de la maintenance prédictive au matériel à l'aide de l'ISO 26262-5
TECHNICAL REPORT
ISO/TR 9839
First edition
2023-08
Road vehicles — Application of
predictive maintenance to hardware
with ISO 26262-5
Véhicules routiers — Application de la maintenance prédictive au
matériel à l'aide de l'ISO 26262-5
Reference number
ISO/TR 9839:2023(E)
© ISO 2023
COPYRIGHT PROTECTED DOCUMENT
© ISO 2023
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Literature survey of degrading faults
5.1 General
5.2 Degrading faults in industry standards
5.2.1 JEDEC JEP122H [4]
5.3 Degrading faults in technical publications
5.3.1 Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment [5]
5.3.2 Circuit-Based Reliability Consideration in FinFET Technology [6]
5.3.3 Intermittent Faults and Effects on Reliability of Integrated Circuits [7]
6 Literature survey on predictive maintenance
6.1 General
6.2 Predictive maintenance in industry standards
6.2.1 IEC 61508 [9]
6.2.2 IEEE Std 1856 [3]
6.3 Predictive maintenance in technical publications
6.3.1 A Survey of Online Failure Prediction Methods [10]
6.3.2 An Odometer for CPUs [11]
6.3.3 Circuit Failure Prediction for Robust System Design in Scaled CMOS [12]
6.3.4 A Circuit Failure Prediction Mechanism (DART) for High Field Reliability [13]
6.3.5 Predicting Remediations for Hardware Failures in Large-Scale Datacenters [14]
6.3.6 Improving Analog Functional Safety Using Data-Driven Anomaly Detection [15]
7 Degrading faults and the ISO 26262 series
7.1 Understanding the lifecycle of degrading faults
7.2 Classification of degrading faults
7.3 Quantifying degrading fault base failure rate
7.3.1 Industry standards and models
7.3.2 Field data
7.3.3 Expert judgement
8 Applying predictive maintenance
8.1 Diagnostic coverage (DC) evaluation for predictive mechanisms
8.2 Considering random hardware metrics
8.2.1 Impacting the SPFM and LFM
8.2.2 Application as a dedicated measure
8.3 Considering RUL prediction
Annex A (informative) An approach to handling degrading faults
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use
of (a) patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed
patent rights in respect thereof. As of the date of publication of this document, ISO had not received
notice of (a) patent(s) which may be required to implement this document. However, implementers are
cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents. ISO shall not be held responsible for identifying any or all
such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to
the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 22, Road vehicles, Subcommittee SC 32,
Electrical and electronic components and general system aspects.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Hardware elements wear out or degrade with time and usage. The presence of certain faults can
cause the rate of degradation to increase. If the rate of degradation exceeds critical thresholds, then
a hardware element can fail during its normal expected lifespan. Addressing fault behaviours which
change over time is difficult. Functional safety standards such as the ISO 26262 series have traditionally
addressed degrading faults with avoidance measures and simplified assumptions of static behaviours.
Understanding of degrading faults is improving over time. Many industries are taking proactive steps
to control degrading faults using predictive maintenance. Predictive maintenance can detect degrading
faults and predict remaining useful life. Safety mechanisms based on predictive maintenance are not
explicitly discussed in the ISO 26262 series.
This document provides a survey of current state of the art for degrading faults and predictive
maintenance techniques. Approaches are presented to consider degrading faults and predictive
maintenance techniques in an ISO 26262 safety argument. Much of the content is focused on
semiconductors, but the concepts can be applied to other hardware elements.
TECHNICAL REPORT ISO/TR 9839:2023(E)
Road vehicles — Application of predictive maintenance to
hardware with ISO 26262-5
1 Scope
This document is intended to be applied to the usage of predictive maintenance methods for the
detection of degrading faults in safety related E/E hardware elements. It applies to hardware elements
developed for compliance with the ISO 26262 [1] series in which degrading faults are shown to be
relevant due to, for instance, the technology used.
Specific technical implementations of predictive maintenance solutions are not in scope of this
document.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 26262-1, Road vehicles — Functional safety — Part 1: Vocabulary
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 26262-1 and the following
apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
degrading fault
fault whose characteristics are not constant and degrade over time, that can result in an error or failure
when stimulated after degradation exceeds a critical threshold
Note 1 to entry: Permanent and intermittent faults can first manifest as degrading faults. Transient faults do not
manifest as degrading faults.
Note 2 to entry: Degrading faults do not create errors or failures until degradation exceeds critical thresholds.
The capability to generate an error or failure is related to the current state of degradation.
Note 3 to entry: Degrading faults exhibit abnormal conditions which can cause an error or failure over time.
Normal degradation does not exhibit abnormal conditions which are necessary to be classified as a fault. Normal
degradation can result in a loss of functionality after expected lifespan has elapsed but cannot be considered a
fault as it is not abnormal.
3.2
degrading fault detection time interval
DFDTI
timespan from the occurrence of a degrading fault (3.1) to its detection
3.3
degrading fault handling time interval
DFHTI
sum of the degrading fault detection time interval (3.2) and the degrading fault reaction time interval
(3.4).
Note 1 to entry: The degrading fault handling time interval is a property of a predictive maintenance (3.5) related
safety mechanism.
Note 2 to entry: The degrading fault handling time interval is considered in addition to the fault handling time
interval. See Figure 4.
Note 3 to entry: The timespan from occurrence of a degrading fault (3.1) until it has the capability to generate
an error or failure is the maximum degrading fault handling time interval that can be specified for a predictive
maintenance related safety mechanism to support the functional safety concept.
Note 4 to entry: A degrading fault (3.1) is covered in a timely manner by the corresponding safety mechanism if
there is detection and reaction within the degrading fault handling time interval.
3.4
degrading fault reaction time interval
DFRTI
timespan from the detection of a degrading fault (3.1) to reaching a safe state or reaching emergency
operation
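The relationship between the three time intervals defined in 3.2 to 3.4 can be sketched in a few lines of Python. This is a minimal illustration only: the class name, the field names, the choice of hours as the unit and the example values are all assumptions made for this sketch and do not come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictiveMechanismTiming:
    """Timing of a predictive maintenance related safety mechanism.

    All values share one time unit (hours here, which is an assumption).
    """

    dfdti_h: float  # degrading fault detection time interval (3.2)
    dfrti_h: float  # degrading fault reaction time interval (3.4)

    @property
    def dfhti_h(self) -> float:
        # 3.3: the handling interval is the sum of detection and reaction.
        return self.dfdti_h + self.dfrti_h

    def is_timely(self, time_to_critical_h: float) -> bool:
        # The fault is covered in a timely manner if detection and reaction
        # complete before the fault can generate an error or failure.
        return self.dfhti_h <= time_to_critical_h


# Hypothetical mechanism: 200 h to detect, 48 h to reach a safe state.
mechanism = PredictiveMechanismTiming(dfdti_h=200.0, dfrti_h=48.0)
print(mechanism.dfhti_h)            # 248.0
print(mechanism.is_timely(1000.0))  # True
print(mechanism.is_timely(100.0))   # False
```

The `time_to_critical_h` argument stands for the timespan from occurrence of the degrading fault until it is able to generate an error or failure, which bounds the handling interval that can be specified.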
3.5
predictive maintenance
techniques that are used to detect degrading faults (3.1), predict remaining useful life (3.6), and react
appropriately
Note 1 to entry: Approaches include the use of data driven methods such as machine learning applied locally or
on a remote system. Guidance for developing safety related ML systems can be found in ISO/IEC TR 5469 [2].
Note 2 to entry: Prediction of remaining useful life (3.6) can be used to replace a faulty element before it can cause
an error or failure.
3.6
remaining useful life
RUL
length of time from the present time to the estimated time that the item or element is expected to no
longer perform its intended function within desired specifications
Note 1 to entry: RUL can be estimated using predictive maintenance (3.5) or with other approaches.
Note 2 to entry: RUL can be estimated for expected degradation or degradation in the presence of a fault.
[SOURCE: IEEE Std 1856-2017 [3], modified — The phrase "system (or product)" was replaced with "item
or element".]
4 Abbreviated terms
ADAS Advanced Driver Assistance System
ADS Automated Driving System
AI Artificial Intelligence
BEoL Back End of Line (sometimes BEOL)
BFR Base Failure Rate
BLM Barrier Layer Material
CHC Channel Hot Carrier
COTS Commercial Off The Shelf
DC Diagnostic Coverage
DFDTI Degrading Fault Detection Time Interval
DFHTI Degrading Fault Handling Time Interval
DFRTI Degrading Fault Reaction Time Interval
DRAM Dynamic Random Access Memory
EM Electromigration
ESD Electrostatic Discharge
FEoL Front End of Line (sometimes FEOL)
FET Field Effect Transistor
FDTI Fault Detection Time Interval
FHTI Fault Handling Time Interval
FTTI Fault Tolerant Time Interval
HCI Hot Carrier Injection
ILD Inter-Layer Dielectric
LFM Latent Fault Metric
ML Machine Learning
MoL Middle of Line (sometimes MOL)
MEoL Middle End of Line (sometimes MEOL)
MPFDTI Multiple Point Fault Detection Time Interval
NBTI Negative Bias Temperature Instability
NVM Non-Volatile Memory
PCM Phase Change Memory
PHM Prognostics and Health Management
RUL Remaining Useful Life
SBD Soft Breakdown
SHE Self-Heating Effect
SILC Stress-Induced Leakage Current
SM Stress Migration
SoC System on Chip
SPFM Single Point Fault Metric
TDDB Time Dependent Dielectric Breakdown
TDJD Time Dependent Junction Degradation
TID Total Ionizing Dose
5 Literature survey of degrading faults
5.1 General
This document reviews many technical documents to summarize the current state of the art
understanding of degrading faults in industry standards and technical publications.
NOTE Terminology in the referenced publications and standards is not always aligned to terms and
definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
5.2 Degrading faults in industry standards
5.2.1 JEDEC JEP122H [4]
The JEDEC Solid State Technology Association is a semiconductor industry trade association and
standardization body. JEDEC has over 300 companies as members and publishes electronics standards
on a wide variety of topics.
JEDEC JEP122H is the latest revision of JEDEC’s standard for “Failure Mechanisms and Models for
Semiconductor Devices,” last updated in 2016. The standard describes eighteen different failure
mechanisms, classifying them as being related to the die front end of line (FEoL), die back end of line
(BEoL), or packaging. Models are provided for estimating the rates of degradation per failure mode. The
information provided in JEP122H is validated by a team of reliability experts from the SEMATECH/ISMI
Reliability Council and supported by extensive references to technical publications.
The die FEoL failure mechanisms described by JEP122H include:
— time dependent dielectric breakdown (TDDB) due to gate oxide breakdown;
— hot carrier injection (HCI);
— negative bias temperature instability (NBTI);
— surface inversion due to mobile ions;
— floating gate non-volatile memory (NVM) data retention;
— localized charge trapping NVM data retention;
— phase change memory (PCM) NVM data retention.
The die BEoL failure mechanisms described by JEP122H include:
— TDDB due to ILD/low-k/mobile Cu ions;
— aluminium electromigration (EM);
— copper EM;
— aluminium and copper corrosion;
— aluminium stress migration (SM);
— copper SM.
The packaging failure mechanisms described in JEP122H include:
— fatigue failures due to temperature cycling and thermal shock;
— interfacial failures due to temperature cycling and thermal shock;
— intermetallic and oxidation failure due to high temperature;
— tin whiskers;
— ion mobility kinetics due to component cleanliness.
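As an illustration of the kind of degradation model that applies to one of the mechanisms listed above, electromigration lifetime is widely modelled with Black's equation, in which the median time to failure scales as J^(-n) · exp(Ea / kT). The sketch below computes the resulting acceleration factor between a stress condition and a use condition. The function name and every numeric value (current densities, exponent, activation energy, temperatures) are placeholders chosen for this example and are not taken from JEP122H.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_acceleration_factor(j_stress, j_use, t_stress_k, t_use_k, n, ea_ev):
    """Ratio of time to failure at use conditions to that at stress conditions.

    Derived from Black's equation, MTTF = A * J**(-n) * exp(Ea / (k * T)).
    The prefactor A cancels in the ratio.
    """
    current_term = (j_stress / j_use) ** n
    thermal_term = math.exp(
        (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    )
    return current_term * thermal_term


# Placeholder values: doubled current density, 398 K stress versus 358 K use.
af = em_acceleration_factor(
    j_stress=2.0, j_use=1.0, t_stress_k=398.0, t_use_k=358.0, n=2.0, ea_ev=0.7
)
print(f"acceleration factor: {af:.1f}")
```

With equal temperatures the thermal term is 1, so the factor reduces to the current density ratio raised to the power n.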
5.3 Degrading faults in technical publications
5.3.1 Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment [5]
Reference [5] was published by Sandia National Laboratories, a research organization of the United
States Department of Energy, in 2020. The purpose of the report is to document the most critical
failure modes impacting advanced semiconductor technologies using FinFET technology. FinFET
based semiconductors are used for most current generation SoCs (system on chip devices), dGPUs
(discrete graphics processing units), and DRAMs (dynamic random-access memories) which are used
in infotainment, ADAS (advanced driver assistance systems), and ADS (automated driving system)
applications. While the use of FinFET transistors enables smaller process geometries (e.g. <20 nm
feature size) and faster processing, it also changes the failure mode susceptibility characteristics
compared to more traditional planar transistor technologies found in 28 nm and larger process
technologies.
The report provides details for the following failure modes:
— die related failure modes:
— bias temperature instability (BTI);
— dielectric integrity;
— HCI;
— BEoL, EM and stress voiding;
— middle end of line (MEoL) concerns (also known as middle of line, or MoL);
— packaging and package-die interaction;
— integrated die design and process reliability – electrostatic discharge (ESD);
— radiation effects:
— total ionizing dose (TID);
— displacement damage;
— COTS electronics and radiation effects.
The die and packaging related failure modes discussed can generally be argued to manifest as random
degrading faults before becoming intermittent or permanent faults. The ESD and radiation effects can
generally be argued to be systematic or transient in nature.
Also of interest is the section on reliability degradation and its impact on circuit/system performance.
This section focuses on “soft” logic failures which manifest before “hard” physical failures of the
semiconductor devices. As most of the degradation mechanisms discussed result in parameter
degradation, it is suggested that statistical methods can be used to predict circuit failures.
5.3.2 Circuit-Based Reliability Consideration in FinFET Technology [6]
Reference [6] was authored in 2017 by four experts from Taiwan Semiconductor Manufacturing Company
(TSMC). It describes the primary reliability failure modes of concern for FinFET based process
technologies and presents a model for estimating reliability. Comparisons are made between the
performance of 28 nm planar technologies and that of 16 nm and 7 nm FinFET technologies.
The authors highlight several new reliability concerns including bias temperature instability (BTI),
stress-induced leakage currents (SILCs), self-heating effects (SHEs), and time dependent junction
degradation (TDJD). Models are proposed to estimate the reliability impacts of these mechanisms,
based on a combination of simulation and reliability testing.
5.3.3 Intermittent Faults and Effects on Reliability of Integrated Circuits [7]
Reference [7] was authored in 2008 by a reliability expert from AMD and studies intermittent faults. An
experiment was conducted using more than 250 servers (from pre-2008) to provide over 300 server
years of operational data. Identified memory single bit errors were analysed for root cause, and the
findings documented. The rates of occurrence of the errors introduced by these faults can vary from
one design to another and one technology to another.
Failure modes discussed in this paper include ultra-thin oxide breakdown, soft breakdown (SBD), EM
voids, barrier layer material (BLM) cracks, and crosstalk as sources of intermittent faults. It is noted
that these intermittent faults can be detected by monitoring the Vmin (voltage minimum) thresholds
necessary for correct operation. Mitigations are discussed in terms of systematic avoidance, screening
at manufacturing test, and online fault detection in application including failure prediction.
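The idea of monitoring Vmin can be sketched as a simple drift check. The sketch assumes that degradation shows up as a rise of the measured Vmin above a baseline recorded earlier in life; that assumption, the function name, the millivolt unit and the margin value are all illustrative and are not taken from reference [7].

```python
def vmin_drift_alerts(vmin_samples_mv, baseline_mv, margin_mv):
    """Return the indices of samples whose Vmin exceeds baseline by more than margin."""
    return [
        index
        for index, vmin_mv in enumerate(vmin_samples_mv)
        if vmin_mv - baseline_mv > margin_mv
    ]


# Hypothetical periodic Vmin measurements in millivolts.
samples = [700, 702, 701, 715, 730]
alerts = vmin_drift_alerts(samples, baseline_mv=700, margin_mv=20)
print(alerts)  # [4]
```

A real monitor would also need to account for temperature and measurement noise, which this sketch ignores.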
6 Literature survey on predictive maintenance
6.1 General
This document reviews many technical documents to summarize the current state of the art
understanding of predictive maintenance in industry standards and technical publications. Additional
application domain specific standards are in development (e.g. IEC 63270 [8]).
NOTE Terminology in the referenced publications and standards is not always aligned to the terms and
definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
6.2 Predictive maintenance in industry standards
6.2.1 IEC 61508 [9]
IEC 61508 is a basic safety publication for functional safety which was the original basis for the
ISO 26262 series. The 2010 edition of the standard includes guidance on the use of fault forecasting,
maintenance, and supervisory actions supported by artificial intelligence (AI) systems.
6.2.2 IEEE Std 1856 [3]
IEEE Std 1856 provides an industry-independent approach to the use of predictive maintenance and
similar techniques. This standard is intended to be applied at many different levels of design abstraction
and is not specific to semiconductor technologies. This standard applies the terms “prognostics” and
“Prognostics and Health Management (PHM)” interchangeably with predictive methods. The standard
is primarily focused on estimating the remaining useful life (RUL) after a fault is detected, rather than
the method by which the fault is detected.
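One simple way to picture RUL estimation after a fault has been detected is to fit a trend to a monitored degradation parameter and extrapolate it to a failure threshold. The sketch below uses an ordinary least-squares line. The linear trend, the function name, the threshold and the sample data are assumptions made for illustration; IEEE Std 1856 does not prescribe this method.

```python
def estimate_rul_linear(times_h, values, failure_threshold):
    """Estimate RUL by extrapolating a least-squares line to a threshold.

    Returns the time remaining after the last sample, or None when the
    fitted trend is flat or decreasing (no crossing can be predicted).
    """
    count = len(times_h)
    mean_t = sum(times_h) / count
    mean_v = sum(values) / count
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_h, values))
    if sxx == 0:
        return None  # all samples taken at the same time
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_at_threshold = (failure_threshold - intercept) / slope
    return max(0.0, time_at_threshold - times_h[-1])


# Hypothetical parameter drift sampled every 100 h, failure assumed at 10 units.
times = [0.0, 100.0, 200.0, 300.0]
drift = [0.0, 1.0, 2.0, 3.0]
rul_h = estimate_rul_linear(times, drift, failure_threshold=10.0)
print(rul_h)  # approximately 700 h
```

Real degradation is often non-linear, so a practical estimator would likely use a physics-based or data-driven model and report an uncertainty alongside the estimate.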
IEEE Std 1856 provides a lifecycle model for PHM as illustrated in Figure 1:
— the product is initially deployed without faults;
— an off-nominal behaviour (fault) is detected;
— a failure occurs.
Figure 1 — IEEE Std 1856-2017 lifecycle model for prognostics
The IEEE Std 1856 model introduces three metrics which, used together, allow the effectiveness of
different PHM approaches to be compared:
— the response time for the predictive algorithm, defined as the time between first fault detection and
first correct prediction of RUL;
— the prognostic distance, defined as the time between the correct prediction and the occurrence of a
failure;
— the prognostic system accuracy, defined as the difference between the predicted failure time and
the actual failure time.
NOTE Prognostic system accuracy can be positive (failure occurs before prediction) or negative (failure
occurs after prediction).
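The three metrics above follow directly from four timestamps, as the sketch below shows. The class and field names and the example times are hypothetical. The sign convention for accuracy follows the note above: predicted failure time minus actual failure time, so a positive value means the failure occurred before the predicted time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrognosticEvent:
    """Timestamps of one degradation episode, all in the same time unit."""

    t_fault_detected: float
    t_first_correct_prediction: float
    t_predicted_failure: float
    t_actual_failure: float

    @property
    def response_time(self) -> float:
        # Time between first fault detection and first correct RUL prediction.
        return self.t_first_correct_prediction - self.t_fault_detected

    @property
    def prognostic_distance(self) -> float:
        # Time between the correct prediction and the occurrence of the failure.
        return self.t_actual_failure - self.t_first_correct_prediction

    @property
    def accuracy(self) -> float:
        # Positive when the failure occurs before the predicted time.
        return self.t_predicted_failure - self.t_actual_failure


event = PrognosticEvent(
    t_fault_detected=100.0,
    t_first_correct_prediction=150.0,
    t_predicted_failure=900.0,
    t_actual_failure=880.0,
)
print(event.response_time)        # 50.0
print(event.prognostic_distance)  # 730.0
print(event.accuracy)             # 20.0
```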
Annex A of IEEE Std 1856-2017 provides additional guidance. Its content on levels of
PHM implementation closely matches the ISO 26262 series approach of performing analysis on multiple
levels of design hierarchy: device, component, assembly, sub-system, system, and system of systems.
6.3 Predictive maintenance in technical publications
6.3.1 A Survey of Online Failure Prediction Methods [10]
Reference [10] is a literature survey compiled in 2010 by three researchers from Humboldt University in
Berlin. It is intended to provide a picture of the state of the art in online failure prediction methods as
of 2010. Key content of the survey includes:
— a lifecycle approach based on the progression of faults to errors to failures which is largely compatible
with the ISO 26262 series;
— a definition of nine metrics to evaluate predictive methods, with focus on precision and recall;
— a taxonomy of failure prediction methods which introduces twenty-two categories;
— a review and classification of forty-seven different implementations defined in technical publications
into the twenty-two categories.
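Precision and recall, the two metrics the survey focuses on, can be computed from paired prediction and outcome records as sketched below. The function name and the sample data are hypothetical. Each list entry represents one prediction window, with `True` meaning "failure predicted" or "failure occurred" respectively.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for boolean failure predictions."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Precision: share of failure warnings that were followed by a failure.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: share of actual failures that were preceded by a warning.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, True]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

For a safety argument, low recall means missed degrading faults, while low precision means unnecessary maintenance actions.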
6.3.2 An Odometer for CPUs [11]
Reference [11] is an article published in the IEEE Spectrum magazine in 2011. It provides a simplified
introduction to the topic of degrading faults and introduces one possible detection mechanism. The
description of silicon aging mechanisms includes HCI, BTI, and oxide breakdown. A degradation
detection mechanism is described which uses the comparison of two ring oscillators (one run in normal
conditions, the other under stress conditions).
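The comparison of the two ring oscillators can be pictured as a relative frequency shift checked against a threshold. The sketch assumes that aging slows the stressed oscillator relative to the reference one; the function names, the 2 % threshold and the frequencies are illustrative assumptions, not values from reference [11].

```python
def relative_frequency_shift(f_reference_hz, f_stressed_hz):
    """Fractional slowdown of the stressed oscillator versus the reference."""
    return (f_reference_hz - f_stressed_hz) / f_reference_hz


def degradation_detected(f_reference_hz, f_stressed_hz, threshold=0.02):
    """Flag degradation when the slowdown exceeds the (assumed) threshold."""
    return relative_frequency_shift(f_reference_hz, f_stressed_hz) > threshold


# Hypothetical readings: reference at 1 GHz, stressed oscillator at 970 MHz.
shift = relative_frequency_shift(1_000_000_000, 970_000_000)
print(f"{shift:.3f}")                                     # 0.030
print(degradation_detected(1_000_000_000, 970_000_000))   # True
print(degradation_detected(1_000_000_000, 995_000_000))   # False
```

Using a ratio of two oscillators on the same die means that common influences such as temperature or supply voltage largely cancel out.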
6.3.3 Circuit Failure Prediction for Robust System Design in Scaled CMOS [12]
Reference [12] is one of many papers on the subject written by Professor Mitra, director of the Stanford
University Robust Systems Group. The work focuses on the detection of faults before they can generate
errors or failures of a system. Mitra separates the product lifecycle into early life, useful life, and end of
life according to a bathtub curve model and emphasizes the need to address the semiconductor physics
dominant in each lifecycle phase to maximiz
...
ISO/TR DTR 9839
2023-08
ISO/TC 22/SC 32
Secretariat: AFNOR JISC
Date: 2023-05-10
Road vehicles — Application of predictive maintenance to
hardware with ISO 26262-5
Véhicules routiers — Application de la maintenance prédictive au matériel à l'aide de
l'ISO 26262-5
FDIS stage
Introduction
Hardware elements wear out or degrade with time and usage. The presence of certain faults can cause
the rate of degradation to increase. If the rate of degradation exceeds critical thresholds, then a hardware
element can fail during its normal expected lifespan. Addressing fault behaviours which change over time
is difficult. Functional safety standards such as the ISO 26262 series have traditionally addressed
degrading faults with avoidance measures and simplified assumptions of static behaviours.
Understanding of degrading faults is improving over time. Many industries are taking proactive steps to
control degrading faults using predictive maintenance. Predictive maintenance can detect degrading
faults and predict remaining useful life. Safety mechanisms based on predictive maintenance are not
explicitly discussed in the ISO 26262:2018 series.
This document provides a survey of current state of the art for degrading faults and predictive
maintenance techniques. Approaches are presented to consider degrading faults and predictive
maintenance techniques in an ISO 26262 safety argument. Much of the content is focused on
semiconductors, but the concepts can be applied to other hardware elements.
Road vehicles — Application of predictive maintenance to
hardware with ISO 26262-5
1 Scope
This document is intended to be applied to the usage of predictive maintenance methods for the detection
of degrading faults in safety related E/E hardware elements. It applies to hardware elements developed
for compliance with the ISO 26262 [1] series in which degrading faults are shown to be relevant due to, for
instance, the technology used.
Specific technical implementations of predictive maintenance solutions are not in scope of this document.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 26262-1:2018, Road vehicles — Functional safety — Part 1: Vocabulary
ISO 26262-5:2018, Road vehicles — Functional safety — Part 5: Product development at the hardware level
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 26262-1:2018 and the following
apply.
ISO and IEC maintain terminology databases for use in standardization at the following
addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
degrading fault
fault whose characteristics are not constant and degrade over time, that can result in an error or failure
when stimulated after degradation exceeds a critical threshold
Note 1 to entry: Permanent and intermittent faults can first manifest as degrading faults. Transient faults do not
manifest as degrading faults.
Note 2 to entry: Degrading faults do not create errors or failures until degradation exceeds critical thresholds. The
capability to generate an error or failure is related to the current state of degradation.
Note 3 to entry: Degrading faults exhibit abnormal conditions which can cause an error or failure over time. Normal
degradation does not exhibit abnormal conditions which are necessary to be classified as a fault. Normal
degradation can result in a loss of functionality after expected lifespan has elapsed but cannot be considered a fault
as it is not abnormal.
3.2
degrading fault detection time interval
(DFDTI)
timespan from the occurrence of a degrading fault (3.1) to its detection
3.3
degrading fault handling time interval
(DFHTI)
sum of the degrading fault detection time interval (3.2) and the degrading fault reaction time interval
(3.4).
Note 1 to entry: The degrading fault handling time interval is a property of a predictive maintenance (3.5) related
safety mechanism.
Note 2 to entry: The degrading fault handling time interval is considered in addition to the fault handling time
interval. See Figure 4.
Note 3 to entry: The timespan from occurrence of a degrading fault (3.1) until it has the capability to generate an
error or failure is the maximum degrading fault handling time interval that can be specified for a predictive
maintenance related safety mechanism to support the functional safety concept.
Note 4 to entry: A degrading fault (3.1) is covered in a timely manner by the corresponding safety mechanism if
there is detection and reaction within the degrading fault handling time interval.
3.4
degrading fault reaction time interval
(DFRTI)
timespan from the detection of a degrading fault (3.1) to reaching a safe state or reaching emergency
operation
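The relationship between the time intervals defined in 3.2 to 3.4 can be illustrated with a minimal Python sketch. The class name, the field names, and the choice of hours as the unit are assumptions made for illustration only and are not defined in this document. The sketch encodes only two statements from 3.3: the DFHTI is the sum of the DFDTI and the DFRTI, and a degrading fault is covered in a timely manner if detection and reaction complete before the fault is able to generate an error or failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradingFaultTiming:
    """Hypothetical timing budget of a predictive maintenance safety mechanism.

    All values are in hours (an assumption; the document does not fix a unit).
    """

    dfdti_h: float  # degrading fault detection time interval (3.2)
    dfrti_h: float  # degrading fault reaction time interval (3.4)

    @property
    def dfhti_h(self) -> float:
        # 3.3: DFHTI is the sum of DFDTI and DFRTI.
        return self.dfdti_h + self.dfrti_h

    def is_timely(self, time_to_critical_h: float) -> bool:
        # 3.3, Notes 3 and 4: the timespan from fault occurrence until the
        # fault can generate an error or failure bounds the allowed DFHTI.
        return self.dfhti_h <= time_to_critical_h


timing = DegradingFaultTiming(dfdti_h=100.0, dfrti_h=20.0)
print(timing.dfhti_h)           # 120.0
print(timing.is_timely(500.0))  # True
print(timing.is_timely(100.0))  # False
```

In this sketch the value passed to `is_timely` would come from a degradation model or field data, which is outside what the definitions themselves specify.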
3.5
predictive maintenance
techniques that are used to detect degrading faults (3.1), predict remaining useful life (3.6), and react
appropriately
Note 1 to entry: Approaches include the use of data driven methods such as machine learning applied locally or on
a remote system. Guidance for developing safety related ML systems can be found in ISO/IEC TR 5469 [2].
Note 2 to entry: Prediction of remaining useful life (3.6) can be used to replace a faulty element before it can cause
an error or failure.
3.6
remaining useful life
(RUL)
length of time from the present time to the estimated time that the item or element is expected to no
longer perform its intended function within desired specifications
Note 1 to entry: RUL can be estimated using predictive maintenance (3.5) or with other approaches.
Note 2 to entry: RUL can be estimated for expected degradation or degradation in the presence of a fault.
[SOURCE: IEEE 1856-2017 [3], modified for compliance to ISO directives]
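As an illustration of the RUL definition in 3.6, the following Python sketch estimates RUL by fitting a straight line to a monitored parameter and extrapolating to a critical threshold. This is one simple approach chosen for illustration, not a method prescribed by this document or by IEEE 1856. The function name, the linear drift assumption, and the example values are all hypothetical; real degradation mechanisms are often non-linear.

```python
from typing import Optional, Sequence


def estimate_rul(times_h: Sequence[float],
                 values: Sequence[float],
                 threshold: float) -> Optional[float]:
    """Estimate remaining useful life in hours by linear extrapolation.

    Assumes (hypothetically) that the monitored parameter drifts upward
    linearly and that the element stops meeting its specification once the
    parameter reaches `threshold`. Returns None if no upward drift is seen.
    """
    n = len(times_h)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")

    mean_t = sum(times_h) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")

    # Ordinary least-squares fit: value = intercept + slope * time.
    slope = sum((t - mean_t) * (v - mean_v)
                for t, v in zip(times_h, values)) / sxx
    intercept = mean_v - slope * mean_t

    if slope <= 0:
        return None  # parameter is not trending toward the threshold

    t_cross = (threshold - intercept) / slope
    # RUL is measured from the present time, taken as the latest sample.
    return max(0.0, t_cross - times_h[-1])


rul = estimate_rul([0, 100, 200, 300], [1.0, 2.0, 3.0, 4.0], threshold=10.0)
print(rul)  # approximately 600 hours
```

Per Note 2 to entry, the same calculation could be applied to expected degradation or to degradation in the presence of a fault; only the input data and the threshold would differ.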
4 Abbreviated terms
ADAS Advanced Driver Assistance System
ADS Automated Driving System
AI Artificial Intelligence
BEoL Back End of Line (sometimes BEOL)
BFR Base Failure Rate
BIST Built-In Self-Test
BLM Barrier Layer Material
CHC Channel Hot Carrier
COTS Commercial Off The Shelf
DC Diagnostic Coverage
DFDTI Degrading Fault Detection Time Interval
DFHTI Degrading Fault Handling Time Interval
DFRTI Degrading Fault Reaction Time Interval
DRAM Dynamic Random Access Memory
EM Electromigration
ESD Electrostatic Discharge
FEoL Front End of Line (sometimes FEOL)
FET Field Effect Transistor
FDTI Fault Detection Time Interval
FHTI Fault Handling Time Interval
FTTI Fault Tolerant Time Interval
HCI Hot Carrier Injection
ILD Inter-Layer Dielectric
LFM Latent Fault Metric
ML Machine Learning
MoL Middle of Line (sometimes MOL)
MEoL Middle End of Line (sometimes MEOL)
MPFDTI Multiple Point Fault Detection Time Interval
NBTI Negative Bias Temperature Instability
NVM Non-Volatile Memory
PCM Phase Change Memory
PHM Prognostics and Health Management
QMS Quality Management System
RUL Remaining Useful Life
SBD Soft Breakdown
SHE Self-Heating Effect
SILC Stress-Induced Leakage Current
SM Stress Migration
SoC System on Chip
SPFM Single Point Fault Metric
TDDB Time Dependent Dielectric Breakdown
TDJD Time Dependent Junction Degradation
TID Total Ionizing Dose
5 Literature survey of degrading faults
5.1 General
This document reviews many technical documents to summarize the current state of the
art understanding of degrading faults in industry standards and technical publications.
NOTE Terminology in the referenced publications and standards is not always aligned to the terms
and definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
5.2 Degrading faults in industry standards
5.2.1 JEDEC JEP122H [4]
The JEDEC Solid State Technology Association is a semiconductor industry trade association and
standardization body. JEDEC has over 300 companies as members and publishes electronics standards
on a wide variety of topics.
JEDEC JEP122H is the latest revision of JEDEC’s standard for “Failure Mechanisms and
Models for Semiconductor Devices,” last updated in 2016. The standard describes eighteen different
failure mechanisms, classifying them as being related to the die front end of line (FEoL),
die back end of line (BEoL), or packaging. Models are provided for estimating the rates of
degradation per failure mode. The information provided in JEP122H is validated by a team of reliability
experts from the SEMATECH/ISMI Reliability Council and supported by extensive references to technical
publications.
The die FEoL failure mechanisms described by JEP122H include:
— time dependent dielectric breakdown (TDDB) due to gate oxide breakdown;
— hot carrier injection (HCI);
— negative bias temperature instability (NBTI);
— surface inversion due to mobile ions;
— floating gate non-volatile memory (NVM) data retention;
— localized charge trapping NVM data retention;
— phase change memory (PCM) NVM data retention.
The die BEoL failure mechanisms described by JEP122H include:
— TDDB due to ILD/low-k/mobile Cu ions;
— aluminium electromigration (EM);
— copper EM;
— aluminium and copper corrosion;
— aluminium stress migration (SM);
— copper SM.
The packaging failure mechanisms described in JEP122H include:
— fatigue failures due to temperature cycling and thermal shock;
— interfacial failures due to temperature cycling and thermal shock;
— intermetallic and oxidation failure due to high temperature;
— tin whiskers;
— ion mobility kinetics due to component cleanliness.
5.3 Degrading faults in technical publications
5.3.1 Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment [5]
Reference [5] was published by Sandia National Laboratories, a research organization of the United
States Department of Energy, in 2020. The purpose of the report is to document the most critical failure
modes impacting advanced semiconductor technologies using FinFET technology. FinFET based
semiconductors are used for most current generation SoCs (system on chip devices), dGPUs
(discrete graphics processing units), and DRAMs (dynamic random-access
memories) which are used in infotainment, ADAS (advanced driver
assistance systems), and ADS (automated
driving system) applications. While the use of FinFET transistors enables smaller process geometries
(e.g. <20 nm feature size) and faster processing, it also changes the failure mode susceptibility
characteristics compared to more traditional planar transistor technologies found in 28 nm and
larger process technologies.
The report provides details for the following failure modes:
— die related failure modes:
— bias temperature instability (BTI);
— dielectric integrity;
— HCI;
— BEoL, EM, and stress voiding;
— middle end of line (MEoL) concerns (also known as middle of line,
or MoL);
— packaging and package-die interaction;
— integrated die design and process reliability – electrostatic discharge (ESD);
— radiation effects:
— total ionizing dose (TID);
— displacement damage;
— COTS electronics and radiation effects.
The die and packaging related failure modes discussed can generally be argued to manifest as random
degrading faults before becoming intermittent or permanent faults. The ESD and radiation effects can
generally be argued to be systematic or transient in nature.
Also of interest is the section on reliability degradation and its impact on circuit/system performance.
This section focuses on “soft” logic failures which manifest before “hard” physical failures of the
semiconductor devices. As most of the degradation mechanisms discussed result in parameter
degradation, it is suggested that statistical methods can be used to predict circuit failures.
5.3.2 Circuit-Based Reliability Consideration in FinFET Technology [6]
Reference [6] was authored in 2017 by four experts from Taiwan Semiconductor Manufacturing Company
(TSMC). It describes the primary reliability failure modes of concern for FinFET based process
technologies and presents a model for estimating reliability. Comparisons are made between the
performance of 28 nm planar technologies and that of 16 nm and 7 nm FinFET
technologies.
The authors highlight several new reliability concerns including bias
temperature instability (BTI), stress-induced leakage currents (SILCs),
self-heating effects (SHEs), and time
dependent junction degradation (TDJD). Models are proposed to estimate the reliability impacts of these
mechanisms, based on a combination of simulation and reliability testing.
5.3.3 Intermittent Faults and Effects on Reliability of Integrated Circuits [7]
Reference [7] was authored in 2008 by a reliability expert from AMD and studies intermittent faults. An
experiment was conducted using more than 250 servers (from pre-2008) to provide over 300 server
years of operational data. Identified memory single bit errors were analysed for root cause, and the
findings documented. The rates of occurrence of the errors introduced by these faults can vary from one
design to another and one technology to another.
Failure modes discussed in this paper include ultra-thin oxide breakdown, soft breakdown (SBD),
EM voids, barrier layer material (BLM) cracks, and crosstalk as sources of
intermittent faults. It is noted that these intermittent faults can be detected by monitoring the Vmin
(voltage minimum) thresholds necessary for correct operation. Mitigations are discussed in terms of
systematic avoidance, screening at manufacturing test, and online fault detection in application including
failure prediction.
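The idea of detecting a developing fault by monitoring Vmin can be sketched in Python as follows. This is a simplified illustration and not the method used in Reference [7]: the function name, the baseline value, the drift limit, and the sample readings are hypothetical, and a practical monitor would also need to account for temperature, measurement noise, and workload.

```python
from typing import Optional, Sequence


def first_vmin_alarm(baseline_v: float,
                     measured_v: Sequence[float],
                     max_drift_v: float) -> Optional[int]:
    """Return the index of the first Vmin reading that exceeds the allowed drift.

    baseline_v:  Vmin recorded when the part was known to be healthy (assumed).
    measured_v:  periodic Vmin measurements taken in the field (assumed).
    max_drift_v: allowed increase over the baseline before an alarm is raised.
    Returns None if no reading exceeds the limit.
    """
    for index, vmin in enumerate(measured_v):
        # A rising Vmin means the circuit needs more voltage to operate
        # correctly, which is treated here as a sign of degradation.
        if vmin - baseline_v > max_drift_v:
            return index
    return None


readings = [0.70, 0.71, 0.74, 0.80]
print(first_vmin_alarm(0.70, readings, max_drift_v=0.03))  # 2
```

The index returned identifies when the drift limit was first exceeded; how the system reacts after that point is a separate design decision.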
6 Literature survey on predictive maintenance
6.1 General
This document reviews many technical documents to summarize the current state of the
art understanding of predictive maintenance in industry standards and technical publications. Additional
application domain specific standards are in development (e.g. IEC 63270 [8]).
NOTE Terminology in the referenced publications and standards is not always aligned to the terms
and definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
6.2 Predictive maintenance in industry standards
6.2.1 IEC 61508 [9]
The IEC 61508 standard is a basic safety publication for functional safety which was the original basis for
the ISO 26262 series. The 2010 edition of the standard includes guidance on the use of fault
forecasting, maintenance, and supervisory actions supported by artificial
intelligence (AI) systems.
6.2.2 IEEE Std 1856 [3]
The IEEE 1856 standard, “IEEE Standard Framework for Prognostics and Health Management of
Electronic Systems,” provides an industry-independent approach to the use of predictive maintenance
and similar techniques. This standard is intended to be applied at many different levels of design
abstraction and is not specific to semiconductor technologies. This standard applies the terms
“prognostics” and “Prognostics and Health Management (PHM)” interchangeably with predictive
methods. The standard is primarily focused on estimating the remaining useful life (RUL) after a fault is
detected, rather than the method by which the fault is detected.
IEEE 1856 provides a lifecycle model for PHM as illustrated in Figure 1:
— the product is initially deployed without faults;
— an off-nominal behaviour (fault) is detected;
— a failure occurs.
...
FINAL DRAFT
TECHNICAL REPORT ISO/DTR 9839
ISO/TC 22/SC 32
Secretariat: JISC
Voting begins on: 2023-05-24
Voting terminates on: 2023-07-19
Road vehicles — Application of predictive maintenance to hardware with ISO 26262-5
Véhicules routiers — Application de la maintenance prédictive au matériel à l'aide de l'ISO 26262-5
Reference number ISO/DTR 9839:2023(E)
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
© ISO 2023
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Literature survey of degrading faults
5.1 General
5.2 Degrading faults in industry standards
5.2.1 JEDEC JEP122H [4]
5.3 Degrading faults in technical publications
5.3.1 Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment [5]
5.3.2 Circuit-Based Reliability Consideration in FinFET Technology [6]
5.3.3 Intermittent Faults and Effects on Reliability of Integrated Circuits [7]
6 Literature survey on predictive maintenance
6.1 General
6.2 Predictive maintenance in industry standards
6.2.1 IEC 61508 [9]
6.2.2 IEEE 1856 [3]
6.3 Predictive maintenance in technical publications
6.3.1 A Survey of Online Failure Prediction Methods [10]
6.3.2 An Odometer for CPUs [11]
6.3.3 Circuit Failure Prediction for Robust System Design in Scaled CMOS [12]
6.3.4 A Circuit Failure Prediction Mechanism (DART) for High Field Reliability [13]
6.3.5 Predicting Remediations for Hardware Failures in Large-Scale Datacenters [14]
6.3.6 Improving Analog Functional Safety Using Data-Driven Anomaly Detection [15]
7 Degrading faults and the ISO 26262 series
7.1 Understanding the lifecycle of degrading faults
7.2 Classification of degrading faults
7.3 Quantifying degrading fault base failure rate
7.3.1 Industry standards and models
7.3.2 Field data
7.3.3 Expert judgement
8 Applying predictive maintenance
8.1 Diagnostic coverage (DC) evaluation for predictive mechanisms
8.2 Considering random hardware metrics
8.2.1 Impacting the SPFM and LFM
8.2.2 Application as a dedicated measure
8.3 Considering RUL prediction
Annex A (informative) An approach to handling degrading faults
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use
of (a) patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed
patent rights in respect thereof. As of the date of publication of this document, ISO had not received
notice of (a) patent(s) which may be required to implement this document. However, implementers are
cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents. ISO shall not be held responsible for identifying any or all
such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to
the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 22, Road vehicles, Subcommittee SC 32,
Electrical and electronic components and general system aspects.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Hardware elements wear out or degrade with time and usage. The presence of certain faults can
cause the rate of degradation to increase. If the rate of degradation exceeds critical thresholds, then
a hardware element can fail during its normal expected lifespan. Addressing fault behaviours which
change over time is difficult. Functional safety standards such as the ISO 26262 series have traditionally
addressed degrading faults with avoidance measures and simplified assumptions of static behaviours.
Understanding of degrading faults is improving over time. Many industries are taking proactive steps
to control degrading faults using predictive maintenance. Predictive maintenance can detect degrading
faults and predict remaining useful life. Safety mechanisms based on predictive maintenance are not
explicitly discussed in the ISO 26262 series.
This document provides a survey of current state of the art for degrading faults and predictive
maintenance techniques. Approaches are presented to consider degrading faults and predictive
maintenance techniques in an ISO 26262 safety argument. Much of the content is focused on
semiconductors, but the concepts can be applied to other hardware elements.
Road vehicles — Application of predictive maintenance to
hardware with ISO 26262-5
1 Scope
This document is intended to be applied to the usage of predictive maintenance methods for the
detection of degrading faults in safety related E/E hardware elements. It applies to hardware elements
developed for compliance with the ISO 26262 [1] series in which degrading faults are shown to be
relevant due to, for instance, the technology used.
Specific technical implementations of predictive maintenance solutions are not in scope of this
document.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 26262-1, Road vehicles — Functional safety — Part 1: Vocabulary
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 26262-1 and the following
apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
degrading fault
fault whose characteristics are not constant and degrade over time, that can result in an error or failure
when stimulated after degradation exceeds a critical threshold
Note 1 to entry: Permanent and intermittent faults can first manifest as degrading faults. Transient faults do not
manifest as degrading faults.
Note 2 to entry: Degrading faults do not create errors or failures until degradation exceeds critical thresholds.
The capability to generate an error or failure is related to the current state of degradation.
Note 3 to entry: Degrading faults exhibit abnormal conditions which can cause an error or failure over time.
Normal degradation does not exhibit abnormal conditions which are necessary to be classified as a fault. Normal
degradation can result in a loss of functionality after expected lifespan has elapsed but cannot be considered a
fault as it is not abnormal.
3.2
degrading fault detection time interval
DFDTI
timespan from the occurrence of a degrading fault (3.1) to its detection
3.3
degrading fault handling time interval
DFHTI
sum of the degrading fault detection time interval (3.2) and the degrading fault reaction time interval
(3.4).
Note 1 to entry: The degrading fault handling time interval is a property of a predictive maintenance (3.5) related
safety mechanism.
Note 2 to entry: The degrading fault handling time interval is considered in addition to the fault handling time
interval. See Figure 4.
Note 3 to entry: The timespan from occurrence of a degrading fault (3.1) until it has the capability to generate
an error or failure is the maximum degrading fault handling time interval that can be specified for a predictive
maintenance related safety mechanism to support the functional safety concept.
Note 4 to entry: A degrading fault (3.1) is covered in a timely manner by the corresponding safety mechanism if
there is detection and reaction within the degrading fault handling time interval.
3.4
degrading fault reaction time interval
DFRTI
timespan from the detection of a degrading fault (3.1) to reaching a safe state or reaching emergency
operation
3.5
predictive maintenance
techniques that are used to detect degrading faults (3.1), predict remaining useful life (3.6), and react
appropriately
Note 1 to entry: Approaches include the use of data driven methods such as machine learning applied locally or
on a remote system. Guidance for developing safety related ML systems can be found in ISO/IEC TR 5469 [2].
Note 2 to entry: Prediction of remaining useful life (3.6) can be used to replace a faulty element before it can cause
an error or failure.
3.6
remaining useful life
RUL
length of time from the present time to the estimated time that the item or element is expected to no
longer perform its intended function within desired specifications
Note 1 to entry: RUL can be estimated using predictive maintenance (3.5) or with other approaches.
Note 2 to entry: RUL can be estimated for expected degradation or degradation in the presence of a fault.
[SOURCE: IEEE 1856-2017 [3], modified for compliance to ISO directives]
4 Abbreviated terms
ADAS Advanced Driver Assistance System
ADS Automated Driving System
AI Artificial Intelligence
BEoL Back End of Line (sometimes BEOL)
BFR Base Failure Rate
BLM Barrier Layer Material
CHC Channel Hot Carrier
COTS Commercial Off The Shelf
DC Diagnostic Coverage
DFDTI Degrading Fault Detection Time Interval
DFHTI Degrading Fault Handling Time Interval
DFRTI Degrading Fault Reaction Time Interval
DRAM Dynamic Random Access Memory
EM Electromigration
ESD Electrostatic Discharge
FEoL Front End of Line (sometimes FEOL)
FET Field Effect Transistor
FDTI Fault Detection Time Interval
FHTI Fault Handling Time Interval
FTTI Fault Tolerant Time Interval
HCI Hot Carrier Injection
ILD Inter-Layer Dielectric
LFM Latent Fault Metric
ML Machine Learning
MoL Middle of Line (sometimes MOL)
MEoL Middle End of Line (sometimes MEOL)
MPFDTI Multiple Point Fault Detection Time Interval
NBTI Negative Bias Temperature Instability
NVM Non-Volatile Memory
PCM Phase Change Memory
PHM Prognostics and Health Management
RUL Remaining Useful Life
SBD Soft Breakdown
SHE Self-Heating Effect
SILC Stress-Induced Leakage Current
SM Stress Migration
SoC System on Chip
SPFM Single Point Fault Metric
TDDB Time Dependent Dielectric Breakdown
TDJD Time Dependent Junction Degradation
TID Total Ionizing Dose
5 Literature survey of degrading faults
5.1 General
This document reviews many technical documents to summarize the current state-of-the-art
understanding of degrading faults in industry standards and technical publications.
NOTE Terminology in the referenced publications and standards is not always aligned to terms and
definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
5.2 Degrading faults in industry standards
5.2.1 JEDEC JEP122H [4]
The JEDEC Solid State Technology Association is a semiconductor industry trade association and
standardization body. JEDEC has over 300 companies as members and publishes electronics standards
on a wide variety of topics.
JEDEC JEP122H is the latest revision of JEDEC’s standard for “Failure Mechanisms and Models for
Semiconductor Devices,” last updated in 2016. The standard describes eighteen different failure
mechanisms, classifying them as being related to the die front end of line (FEoL), die back end of line
(BEoL), or packaging. Models are provided for estimating the rates of degradation per failure mode. The
information provided in JEP122H is validated by a team of reliability experts from the SEMATECH/ISMI
Reliability Council and supported by extensive references to technical publications.
The die FEoL failure mechanisms described by JEP122H include:
— time dependent dielectric breakdown (TDDB) due to gate oxide breakdown;
— hot carrier injection (HCI);
— negative bias temperature instability (NBTI);
— surface inversion due to mobile ions;
— floating gate non-volatile memory (NVM) data retention;
— localized charge trapping NVM data retention;
— phase change memory (PCM) NVM data retention.
The die BEoL failure mechanisms described by JEP122H include:
— TDDB due to ILD/low-k/mobile Cu ions;
— aluminium electromigration (EM);
— copper EM;
— aluminium and copper corrosion;
— aluminium stress migration (SM);
— copper SM.
The packaging failure mechanisms described in JEP122H include:
— fatigue failures due to temperature cycling and thermal shock;
— interfacial failures due to temperature cycling and thermal shock;
— intermetallic and oxidation failure due to high temperature;
— tin whiskers;
— ion mobility kinetics due to component cleanliness.
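As an illustration of the kind of degradation-rate model referred to above, the following minimal sketch evaluates Black's equation, a commonly cited electromigration model in which median time to failure is proportional to J^(-n) · exp(Ea/(kT)). The function name, the current density exponent n = 2, the activation energy of 0.9 eV and the stress and use conditions are illustrative assumptions, not values taken from JEP122H.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_acceleration_factor(j_stress, j_use, t_stress_c, t_use_c,
                           n=2.0, ea_ev=0.9):
    """Acceleration factor between stress and use conditions.

    Black's equation: MTTF ~ J**(-n) * exp(Ea / (k * T)), therefore
    AF = MTTF_use / MTTF_stress
       = (J_stress / J_use)**n * exp((Ea / k) * (1/T_use - 1/T_stress)).
    n and ea_ev are assumed, illustrative values.
    """
    t_stress_k = t_stress_c + 273.15  # convert degrees Celsius to kelvin
    t_use_k = t_use_c + 273.15
    current_term = (j_stress / j_use) ** n
    thermal_term = math.exp(
        (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    )
    return current_term * thermal_term


# Hypothetical stress test at 300 degC and twice the use current density,
# compared with use conditions at 105 degC.
af = em_acceleration_factor(j_stress=2.0e6, j_use=1.0e6,
                            t_stress_c=300.0, t_use_c=105.0)
print(f"acceleration factor: {af:.3g}")
```

An acceleration factor of this kind relates the lifetime observed under accelerated stress to the lifetime expected under use conditions. In practice, the model parameters are technology specific and come from reliability characterization.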
5.3 Degrading faults in technical publications
5.3.1 Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment [5]
Reference [5] was published by Sandia National Laboratories, a research organization of the United
States Department of Energy, in 2020. The purpose of the report is to document the most critical
failure modes affecting advanced semiconductor technologies that use FinFET transistors. FinFET-based
semiconductors are used for most current-generation SoCs (system on chip devices), dGPUs
(discrete graphics processing units), and DRAMs (dynamic random-access memories), which are used
in infotainment, ADAS (advanced driver assistance systems), and ADS (automated driving systems)
applications. While the use of FinFET transistors enables smaller process geometries (e.g. <20 nm
feature size) and faster processing, it also changes the failure mode susceptibility characteristics
compared to more traditional planar transistor technologies found in 28 nm and larger process
technologies.
The report provides details for the following failure modes:
— die related failure modes:
— bias temperature instability (BTI);
— dielectric integrity;
— HCI;
— BEoL, EM and stress voiding;
— middle end of line (MEoL) concerns (also known as middle of line, or MoL);
— packaging and package-die interaction;
— integrated die design and process reliability – electrostatic discharge (ESD);
— radiation effects:
— total ionizing dose (TID);
— displacement damage;
— COTS electronics and radiation effects.
The die and packaging related failure modes discussed can generally be argued to manifest as random
degrading faults before becoming intermittent or permanent faults. The ESD and radiation effects can
generally be argued to be systematic or transient in nature.
Also of interest is the section on reliability degradation and its impact on circuit/system performance.
This section focuses on “soft” logic failures which manifest before “hard” physical failures of the
semiconductor devices. As most of the degradation mechanisms discussed result in parameter
degradation, it is suggested that statistical methods can be used to predict circuit failures.
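One simple statistical approach of this kind can be sketched as follows: a least-squares line is fitted to periodic measurements of a degrading parameter and extrapolated to a specification limit to estimate the remaining useful life. This is a minimal sketch, not a method taken from Reference [5]. The function name, the assumption of a linear trend towards an upper limit, and the sample data are all hypothetical.

```python
def estimate_rul(times, values, limit):
    """Estimate remaining useful life by linear trend extrapolation.

    Fits value = slope * time + intercept by ordinary least squares and
    returns the time from the last sample until the fitted line reaches
    `limit` (assumed to be an upper limit).
    Returns 0.0 if the fitted value is already at or above the limit,
    and None if the trend is flat or moving away from the limit.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    if s_tt == 0:
        raise ValueError("sample times must not all be equal")
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t

    current = slope * times[-1] + intercept  # fitted value at the last sample
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


# Hypothetical data: threshold voltage shift (mV) sampled every 1000 h,
# with an assumed specification limit of 50 mV.
hours = [0.0, 1000.0, 2000.0, 3000.0]
shift_mv = [0.0, 10.0, 20.0, 30.0]
rul_hours = estimate_rul(hours, shift_mv, limit=50.0)
print(f"estimated RUL: {rul_hours:.0f} h")
```

Real degradation mechanisms are often non-linear (for example, following a power law in time), so a production model would typically fit a mechanism-specific function and report a confidence interval rather than a single value.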
5.3.2 Circuit-Based Reliability Consideration in FinFET Technology [6]
Reference [6] was authored in 2017 by four experts from Taiwan Semiconductor Manufacturing Company
(TSMC). It describes the primary reliability failure modes of concern for FinFET-based process
technologies and presents a model for estimating reliability. The performance of 28 nm planar
technologies is compared with that of 16 nm and 7 nm FinFET technologies.
The authors highlight several new reliability concerns including bias temperature instability (BTI),
stress-induced leakage currents (SILCs), self-heating effects (SHEs), and time dependent junction
degradation (TDJD). Models are proposed to estimate the reliability impacts of these mechanisms,
based on a combination of simulation and reliability testing.
5.3.3 Intermittent Faults and Effects on Reliability of Integrated Circuits [7]
Reference [7] was authored in 2008 by a reliability expert from AMD and studies intermittent faults. An
experiment was conducted using more than 250 servers (from pre-2008) to provide over 300 server-years
of operational data. Identified memory single-bit errors were analysed for root cause, and the
findings documented. The rates of occurrence of the errors introduced by these faults can vary from
one design to another and one technology to another.
Failure modes discussed in this paper include ultra-thin oxide breakdown, soft breakdown (SBD), EM
voids, barrier layer material (BLM) cracks, and crosstalk as sources of intermittent faults. It is noted
that these intermittent faults can be detected by monitoring the Vmin (voltage minimum) thresholds
necessary for correct operation. Mitigations are discussed in terms of systematic avoidance, screening
at manufacturing test, and online fault detection in application including failure prediction.
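The idea of monitoring Vmin can be sketched as a simple drift check. This is a minimal sketch under stated assumptions and is not the method of Reference [7]: it assumes that periodic Vmin measurements are available, that a baseline was recorded at manufacturing test, and that a fixed guard band and a persistence count are adequate. The function name and all numeric values are hypothetical.

```python
def vmin_drift_alerts(baseline_v, samples_v, guard_band_v=0.02, consecutive=3):
    """Return the sample indices at which a Vmin drift alert is raised.

    An alert is raised at each sample where Vmin has exceeded
    baseline_v + guard_band_v for at least `consecutive` samples in a row.
    guard_band_v and consecutive are assumed, illustrative settings.
    """
    threshold_v = baseline_v + guard_band_v
    alerts = []
    run_length = 0  # number of consecutive samples above the threshold
    for index, vmin in enumerate(samples_v):
        if vmin > threshold_v:
            run_length += 1
        else:
            run_length = 0
        if run_length >= consecutive:
            alerts.append(index)
    return alerts


# Hypothetical Vmin measurements (volts) against a 0.70 V baseline.
measurements = [0.70, 0.71, 0.73, 0.70, 0.73, 0.74, 0.75, 0.76]
alerts = vmin_drift_alerts(0.70, measurements)
print(alerts)
```

Requiring several consecutive exceedances is one possible way to reduce sensitivity to measurement noise. The appropriate guard band and persistence count depend on the design and the technology.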
6 Literature survey on predictive maintenance
6.1 General
This document reviews many technical documents to summarize the current state-of-the-art
understanding of predictive maintenance in industry standards and technical publications. Additional
application-domain-specific standards are in development (e.g. IEC 63270 [8]).
NOTE Terminology in the referenced publications and standards is not always aligned to the terms and
definitions of the ISO 26262 series. When referencing publications and standards, the terminology of the
referenced work is used.
6.2 Predictive maintenance in industry standards
6.2.1 IEC 61508 [9]
IEC 61508 is a basic safety publication for functional safety which was the original basis for the
ISO 26262 series. The 2010 edition of the standard includes guidance on the use of fault forecasting,
maintenance, and supervisory actions supported by artificial intelligence (AI) systems.
6.2.2 IEEE 1856 [3]
IEEE 1856 provides an industry-independent approach to the use of predictive maintenance and similar
techniques. This standard is intended to be applied at many different levels of design abstraction
and is not specific to semiconductor technologies. This standard applies the terms “prognostics” and
“Prognostics and Health Management (PHM)” interchangeably with predictive methods. The standard
is primarily focused on estimating the remaining useful life (RUL) after a fault is detected, rather than
the method by which the fault is detected.
IEEE 1856 provides a lifecycle model for PHM as illustrated in Figure 1:
— the product is initially deployed without faults;
— an off-nominal behaviour (fault) is detected;
— a failure occurs.
Figure 1 — IEEE 1856-2017 lifecycle model for prognostics
The IEEE 1856 model introduces three metrics that, used together, allow the effectiveness of different
PHM approaches to be compared:
— the response time for the predictive algorithm, defined as the time between first fault detection and
first correct prediction of RUL;
— the prognostic distance, defined as the time between the correct prediction and the occurrence of a
failure;
— the prognostic system accuracy, defined as the difference between the predicted failure time and
the actual failure time.
NOTE Prognostic system accuracy can be positive (failure occurs before prediction) or negative (failure
occurs after prediction).
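The three metrics can be computed from four timestamps, as in the following minimal sketch. The class and field names are hypothetical and are not defined by IEEE 1856. The sketch interprets "the correct prediction" as the first correct prediction of RUL, and uses the sign convention of the NOTE above (predicted failure time minus actual failure time).

```python
from dataclasses import dataclass


@dataclass
class PrognosticEvent:
    """Timestamps for one fault, all in the same time unit (e.g. hours)."""
    fault_detected: float     # time of first fault detection
    first_correct_rul: float  # time of first correct RUL prediction (assumed)
    predicted_failure: float  # failure time predicted by the PHM system
    actual_failure: float     # time at which the failure actually occurred

    @property
    def response_time(self) -> float:
        # Time between first fault detection and first correct prediction.
        return self.first_correct_rul - self.fault_detected

    @property
    def prognostic_distance(self) -> float:
        # Time between the correct prediction and the failure.
        return self.actual_failure - self.first_correct_rul

    @property
    def accuracy(self) -> float:
        # Positive when the failure occurs before the predicted time,
        # negative when it occurs after the predicted time.
        return self.predicted_failure - self.actual_failure


# Hypothetical example, times in operating hours.
event = PrognosticEvent(fault_detected=100.0, first_correct_rul=130.0,
                        predicted_failure=500.0, actual_failure=480.0)
print(event.response_time, event.prognostic_distance, event.accuracy)
```

In this example the failure occurs 20 h before the predicted time, so the accuracy value is positive. A long prognostic distance combined with a short response time leaves more time to react before the failure.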
IEEE 1856-2017, Annex A provides additional guidance. The content on levels of PHM
implementation closely matches the ISO 26262 series approach of performing analysis on multiple
levels of design hierarchy: device, component, assembly, sub-system, system, and system of systems.
6.3 Predictive maintenance in technical publications
6.3.1 A Survey of Online Failure Prediction Methods [10]
Reference [10] is a literature survey compiled by three researchers from Humboldt University in Berlin
in 2010. It is intended to provide a picture of the state of the art in online failure prediction methods as
of 2010. Key information in the survey includes:
— a lifecycle approach based on the progression of faults to errors to failures which is largely compatible
with the ISO 26262 series;
— a definition of nine metrics to evaluate predictive methods, with focus on precision and recall;
—
...