Artificial Intelligence enabled Medical Devices - Computer assisted analysis software for pulmonary images - Algorithm performance test methods


General Information

Status: Not Published
Publication Date: 25-Apr-2027
Current Stage: 4020 - Enquiry circulated
Start Date: 03-Oct-2025
Due Date: 07-Nov-2025
Completion Date: 03-Oct-2025
Draft: prEN IEC 63524:2025
English language, 41 pages

Standards Content (Sample)


SLOVENIAN STANDARD
01 December 2025
Artificial Intelligence enabled Medical Devices - Computer assisted analysis software for
pulmonary images - Algorithm performance test methods
This Slovenian standard is identical to: prEN IEC 63524:2025
ICS:
11.040.55 Diagnostic equipment
Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

62B/1391/CDV
COMMITTEE DRAFT FOR VOTE (CDV)
PROJECT NUMBER:
IEC 63524 ED1
DATE OF CIRCULATION: 2025-10-03
CLOSING DATE FOR VOTING: 2025-12-26
SUPERSEDES DOCUMENTS:
62B/1370/CD, 62B/1378A/CC
IEC SC 62B: MEDICAL IMAGING EQUIPMENT, SOFTWARE, AND SYSTEMS
SECRETARIAT: Germany
SECRETARY: Ms Regina Geierhofer
OF INTEREST TO THE FOLLOWING COMMITTEES: HORIZONTAL FUNCTION(S):

TC 62,SC 62A,SC 62C,SC 62D
ASPECTS CONCERNED:
Safety
SUBMITTED FOR CENELEC PARALLEL VOTING
Attention IEC-CENELEC parallel voting
The attention of IEC National Committees, members of
CENELEC, is drawn to the fact that this Committee Draft
for Vote (CDV) is submitted for parallel voting.
The CENELEC members are invited to vote through the
CENELEC online voting system.
This document is still under study and subject to change. It should not be used for reference purposes.
Recipients of this document are invited to submit, with their comments, notification of any relevant patent rights of
which they are aware and to provide supporting documentation.
Recipients of this document are invited to submit, with their comments, notification of any relevant “In Some
Countries” clauses to be included should this proposal proceed. Recipients are reminded that the CDV stage is
the final stage for submitting ISC clauses. (SEE AC/22/2007 OR NEW GUIDANCE DOC).

TITLE:
Artificial Intelligence enabled Medical Devices - Computer assisted analysis software for
pulmonary images - Algorithm performance test methods

PROPOSED STABILITY DATE: 2030
You may download this electronic file, make a copy and print out the content for the sole purpose of preparing National
Committee positions. You may not copy or "mirror" the file or printed version of the document, or any part of it,
for any other purpose without permission in writing from IEC.

IEC CDV 63524 © IEC 2025
NOTE FROM TC/SC OFFICERS:
Link to Committee Draft for Vote (CDV) online document:
Click here
How to access
This link leads you to the Online Standards Development (OSD) platform for National Mirror
Committee’s (NMC) comments. The project draft may be found further down this document.

Resource materials
We recommend that NCs review the available materials to better understand member commenting
on the OSD platform. These include:
• OSD NC roles overview
• How to add and submit comments to the IEC

Contact
Should you require any assistance, please contact the IEC IT Helpdesk.

CONTENTS
FOREWORD
INTRODUCTION
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Overview
6 Algorithm performance test methods
6.1 Test methods for different application scenarios
6.1.1 Computer-aided detection
6.1.2 Segmentation and measurement
6.1.3 Classification
6.1.4 Multifunctional combination scenario
6.1.5 Follow-up assessment scenario
6.1.6 Patient triage scenario
6.2 Test methods for quality characteristics
6.2.1 General requirement for generalizability test
6.2.2 Robustness test
6.2.3 Repeatability
6.2.4 Consistency
6.2.5 Efficiency
6.2.6 Analysis of algorithm errors
Annex A (normative) General considerations on preparation of algorithm test
A.1 Test environment
A.2 Test resources
A.2.1 General requirements on test set
A.2.2 Sample size
A.2.3 Test set configuration
A.2.4 Data synthesis
A.2.5 Phantoms and devices
A.3 Test tools
A.4 Performance metrics and passing criteria
A.5 Test procedure
A.6 Reporting of test results
Annex B (informative) Example of testing set description of chest CT pulmonary nodule
B.1 Overview
B.2 Application scenario of datasets
B.3 Data collection
B.4 Distribution and composition of dataset
B.5 Annotation rules of datasets
B.6 Estimation of sample size
B.7 Testing set bias analysis
Annex C (informative) General considerations of performance indicators and statistical analysis
C.1 Overview
C.2 Scenario I: Test results are variables of binary classification
C.3 Scenario II: Test results are ordered or continuous variables
C.3.1 Ordered variables
C.3.2 Continuous variables
C.3.3 Scenario III: Test results involve image positions
C.3.4 Hypothesis test of main indicators
C.4 Sample size determination for test set
Bibliography

Figure B.1 – Flow chart for annotation of pulmonary nodule

Table 1 – n-classification confusion matrix
Table 2 – Binary-classification confusion matrix
Table 3 – Binary confusion matrix deduced from multi-classification situation
Table B.1 – Diversity statistics of data sources
Table B.2 – Statistics of distribution of pulmonary nodules
Table C.1 – Confusion matrix of binary classification
Table C.2 – Ordered data structure of diagnostic test

INTERNATIONAL ELECTROTECHNICAL COMMISSION
____________
Artificial Intelligence enabled Medical Devices - Computer assisted
analysis software for pulmonary images - Algorithm performance test
methods
FOREWORD
a) The International Electrotechnical Commission (IEC) is a worldwide organization for standardization comprising
all national electrotechnical committees (IEC National Committees). The object of IEC is to promote international
co-operation on all questions concerning standardization in the electrical and electronic fields. To this end and
in addition to other activities, IEC publishes International Standards, Technical Specifications, Technical
Reports, Publicly Available Specifications (PAS) and Guides (hereafter referred to as “IEC Publication(s)”). Their
preparation is entrusted to technical committees; any IEC National Committee interested in the subject dealt
with may participate in this preparatory work. International, governmental and non-governmental organizations
liaising with the IEC also participate in this preparation. IEC collaborates closely with the International
Organization for Standardization (ISO) in accordance with conditions determined by agreement between the two
organizations.
b) The formal decisions or agreements of IEC on technical matters express, as nearly as possible, an international
consensus of opinion on the relevant subjects since each technical committee has representation from all
interested IEC National Committees.
c) IEC Publications have the form of recommendations for international use and are accepted by IEC National
Committees in that sense. While all reasonable efforts are made to ensure that the technical content of IEC
Publications is accurate, IEC cannot be held responsible for the way in which they are used or for any
misinterpretation by any end user.
d) In order to promote international uniformity, IEC National Committees undertake to apply IEC Publications
transparently to the maximum extent possible in their national and regional publications. Any divergence between
any IEC Publication and the corresponding national or regional publication shall be clearly indicated in the latter.
e) IEC itself does not provide any attestation of conformity. Independent certification bodies provide conformity
assessment services and, in some areas, access to IEC marks of conformity. IEC is not responsible for any
services carried out by independent certification bodies.
f) All users should ensure that they have the latest edition of this publication.
g) No liability shall attach to IEC or its directors, employees, servants or agents including individual experts and
members of its technical committees and IEC National Committees for any personal injury, property damage or
other damage of any nature whatsoever, whether direct or indirect, or for costs (including legal fees) and
expenses arising out of the publication, use of, or reliance upon, this IEC Publication or any other IEC
Publications.
h) Attention is drawn to the Normative references cited in this publication. Use of the referenced publications is
indispensable for the correct application of this publication.
i) IEC draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). IEC takes no position concerning the evidence, validity or applicability of any claimed patent rights in
respect thereof. As of the date of publication of this document, IEC had not received notice of (a) patent(s),
which may be required to implement this document. However, implementers are cautioned that this may not
represent the latest information, which may be obtained from the patent database available at
https://patents.iec.ch. IEC shall not be held responsible for identifying any or all such patent rights.
IEC 6XXXX has been prepared by subcommittee 62B: Medical imaging equipment, software,
and systems, of IEC technical committee TC62: Medical equipment, software, and systems. It
is an International Standard.
The text of this International Standard is based on the following documents:
Draft: XX/XX/FDIS
Report on voting: XX/XX/RVD
Full information on the voting for its approval can be found in the report on voting indicated in
the above table.
The language used for the development of this International Standard is English.
This document was drafted in accordance with ISO/IEC Directives, Part 2, and developed in
accordance with ISO/IEC Directives, Part 1 and ISO/IEC Directives, IEC Supplement, available
at www.iec.ch/members_experts/refdocs. The main document types developed by IEC are
described in greater detail at www.iec.ch/publications.
The committee has decided that the contents of this document will remain unchanged until the
stability date indicated on the IEC website under webstore.iec.ch in the data related to the
specific document. At this date, the document will be
• reconfirmed,
• withdrawn, or
• revised.
INTRODUCTION
Algorithm performance testing methods are important for the verification and validation of AI-enabled
medical devices, and they provide objective evidence for stakeholders such as manufacturers,
regulators and clinical users. While horizontal standards for the testing and performance evaluation
of AI/ML-MD are being developed by IEC TC 62, this standard focuses on the algorithm performance
testing methods of a specific subbranch of AI-enabled medical devices: computer-aided analysis
software for pulmonary images. Such products are intended for AI-based post-processing of pulmonary
images, including computer-aided detection, diagnosis, triage, segmentation, measurement and other
predictions. Stakeholders of this standard include manufacturers, regulators, clinical users and
third-party testing laboratories. The implementation of this standard will promote the reliability
and comparability of algorithm performance test results among different products, and support
regulatory activities. This document describes the method of standalone performance testing, which
directly compares AI output with the ground truth.
In the main text, Clause 5 further explains the purpose of this standard. Clause 6 describes the
algorithm performance metrics, test procedures and quality characteristics covered in the testing
activities, such as robustness and repeatability.
Normative Annex A sets out requirements for the preparation of algorithm performance testing and
the reporting of test results. Informative Annex B provides an example of a test set description.
Informative Annex C provides additional information on statistical considerations during the test.
For the purposes of this document: “shall” means that conformance with a requirement is
mandatory for conformance with this document; “should” means that conformance with a
requirement is recommended but is not mandatory for conformance with this document; “may”
is used to describe a permissible way to achieve conformance with a requirement; “establish”
means to define, document, and implement; and where this document uses the term “as
appropriate” in conjunction with a required process, activity, task or output, the intention is that
the manufacturer shall use the process, activity, task or output unless the manufacturer can
document a justification for not so doing.
1 Scope
This document describes algorithm performance test methods of computer assisted analysis
software for pulmonary radiological images based on artificial intelligence technology.
This document is applicable to medical devices that use artificial intelligence to support the
post-processing of pulmonary radiological images. The data modalities include but are not limited
to X-ray, CT (computed tomography) and MRI (magnetic resonance imaging).
This document is not applicable to medical devices that use artificial intelligence to support
pre-processing or procedure optimization of the clinical workflow.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies.
For undated references, the latest edition of the referenced document (including any
amendments) applies.
ISO/IEC TS 4213, Information technology — Artificial intelligence — Assessment of machine learning
classification performance
ISO 14971, Medical devices — Application of risk management to medical devices
IEC 62304:2006/AMD1:2015, Amendment 1 - Medical device software - Software life cycle
processes
IEC 63521, Machine Learning-enabled Medical Device – Performance Evaluation Process
IEC CD 63450, Testing of Artificial Intelligence / Machine Learning-enabled Medical Devices
ISO CD 24971-2, Medical devices — Guidance on the application of ISO 14971 — Part 2:
Machine learning in artificial intelligence
3 Terms and definitions
3.1
baseline image
image used as the reference among a series of examinations of the same patient
3.2
follow-up image
image acquired during a follow-up visit
3.3
repeat screening
screening repeated at regular intervals
3.4
sign
clues and indications, usually obtained through objective measurement during physical or
pathological examination, that allow doctors to learn about medical progress and disease status
3.5
signs in radiology
signs obtained through radiology means
3.6
stress test
a method to test algorithm performance and robustness in case of extreme or abnormal input,
aiming at identifying potential error, malfunction or vulnerability of the algorithm under test
Note 1 to entry: Stress test is different from load test, which has normal input but different conditions like heavy
computation load.
Note 2 to entry: Inputs in stress test are often called stress samples, which may exist in the real world.
3.7
reference standard
ground truth
benchmarks to compare with the output of AI algorithm
Note 1 to entry: There are different approaches to acquire reference standard, such as diagnosis, treatment and
annotation.
[SOURCE: Modified from IEEE 2802:2022 [1]]
3.8
repeatability
the degree of consistency between multiple independent test/measurement results obtained by
the same operator in the same way, using the same test or measurement facilities, from the
same subject
[SOURCE: Modified from IEEE 2802:2022 [1]]
3.9
synthetic data
data that is artificially generated rather than produced by real-world events
[SOURCE: IEC CD 63450 Testing of Artificial Intelligence / Machine Learning-enabled Medical
Devices]
3.10
augmented data
data created from an original data set by introducing minor transformations to the original data
set to increase the size and diversity of the data set
Note 1 to entry: Augmented data is a subset of synthetic data.
3.11
data augmentation
a method applied to a limited set of data to increase the data set by creating modified copies
using digital transformations that introduce minor changes to the original data set
Note 1 to entry: Data augmentation is usually conducted in an understandable manner, such as rotation,
segmentation, superposition of noise/artifacts, superposition of filters, and reconstruction of radiological images.
4 Abbreviated terms
AI artificial intelligence
AIMD artificial intelligence medical device
AP average precision
AUROC area under the receiver operating characteristics curve
AUT algorithm under test
FN false negative
FP false positive
FROC free-response receiver operating characteristics curve
GT ground truth
IMDRF International Medical Device Regulators Forum
MAP mean average precision
ML machine learning
ROC receiver operating characteristics curve
ROI region of interest
SAMD software as medical device
SIMD software in medical device
SUT software under test
TN true negative
TP true positive
5 Overview
General testing principles for SAMD (IEC 62304:2006/AMD1:2015), SIMD and AIMD (IEC CD 63450,
Testing of Artificial Intelligence / Machine Learning-enabled Medical Devices, and IEC 63521,
Machine Learning-enabled Medical Device – Performance Evaluation Process) are applicable to
artificial intelligence medical devices. This standard covers particular requirements for the
testing of artificial intelligence algorithms intended to analyze pulmonary radiologic images.
As an important step in verifying and validating computer assisted analysis software for pulmonary
images, the algorithm performance test is generally based on test sets and quantitatively compares
the output of the algorithm with the ground truth, so as to obtain specific indicators such as
false positives and false negatives, repeatability and reproducibility, robustness, efficiency, etc.
This document describes the method of standalone performance testing. To fulfil the IMDRF [2]
principles on safety and essential performance of medical devices and in vitro diagnostic medical
devices, quality characteristics specific to AIMD are also considered in this document, including
generalization, robustness, repeatability, consistency, and efficiency. Analysis of algorithm
errors during the test is also recommended in this document, as it may provide further information
to evaluate other quality attributes.
To avoid intentional optimization or overfitting on a test set, reuse of test sets is limited to
keep the software under test (SUT) from learning through the testing process.
General considerations for preparing the algorithm test are described in Annex A.
6 Algorithm performance test methods
6.1 Test methods for different application scenario
6.1.1 Computer-aided detection
6.1.1.1 Matching criterion
When the AUT is intended for computer aided detection, the manufacturer shall clearly define
and document the matching criterion to measure the consistency between the ROIs detected by
the AUT and those annotated in the GT. If applicable, the manufacturer shall clearly define and
document criteria for annotating the ROIs on the pulmonary images that the AI algorithm
analyzes. This ensures consistent evaluation and clinical relevance. The testing should be
performed on images where the ROI is consistently and accurately annotated by experts.
Testing personnel should record the matching method and threshold between the ROIs
detected by the AUT and the GT in the test plan.
NOTE 1 If the SUT doesn't visualize the ROI explicitly, the internal information of the ROI will be used in the test.
Such information includes but is not limited to ROI center position and border vertex position.
Examples of common matching criteria include:
a) Region overlapping: Determine the matching result by calculating the overlapping degree
(such as Dice coefficient and Jaccard coefficient) between the ROI detected by the AUT
and GT and comparing with the matching threshold.
b) Distance from the central point: Determine the matching result by calculating the distance
between the ROI detected by the AUT and the center of the GT region and setting the
matching threshold.
c) Central point hitting: Determine the matching result by judging whether the center of the
ROI detected by the AUT falls into the range of the GT.
NOTE 2 The selection of the central point is associated with the ROI. The shape of the ROI is usually determined
by signs, especially the signs in radiology. For example, the central point of a pulmonary nodule (usually convex) is
the intersection between the long diameter and short diameter within the range of the ROI. The long diameter is
defined as the distance between the two farthest points of the maximum cross-sectional space within the range of
the ROI. The short diameter is defined as the longest distance perpendicular to the long diameter in the nodule.
The matching results are divided into three cases:
a) True positive: the number of positive instances correctly classified as positive, recorded as TP;
b) False positive: the number of negative instances incorrectly classified as positive, recorded as FP;
c) False negative: the number of positive instances incorrectly classified as negative, recorded as FN.
If multiple ROIs detected by the AUT may match with the GT, the matching priority should be
considered as follows:
a) If the region overlapping method is adopted, the ROI detected by the AUT with the higher region
overlapping degree should be selected as TP;
b) If the central point distance method is adopted, the ROI detected by the AUT with the smaller
central point distance should be selected as TP;
c) If the central point hitting method is adopted, the ROI detected by the AUT with the smaller
central point distance should be selected as TP.
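The region-overlapping criterion and the matching priority above can be sketched as follows. The pixel-set representation of ROIs, the 0.5 Dice threshold and all function names are illustrative assumptions, not requirements of this document.

```python
def dice(roi_a, roi_b):
    """Dice overlap between two ROIs given as sets of pixel coordinates."""
    return 2.0 * len(roi_a & roi_b) / (len(roi_a) + len(roi_b))

def match_detections(detected_rois, gt_rois, threshold=0.5):
    """Count (TP, FP, FN) with the region-overlapping criterion.

    When several detected ROIs exceed the threshold for the same GT ROI,
    the one with the highest overlap is taken as the TP, per the matching
    priority in 6.1.1.1.
    """
    # Score every detected/GT pair, best overlaps first.
    pairs = sorted(
        ((dice(d, g), i, j)
         for i, d in enumerate(detected_rois)
         for j, g in enumerate(gt_rois)),
        reverse=True,
    )
    matched_det, matched_gt = set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i not in matched_det and j not in matched_gt:
            matched_det.add(i)
            matched_gt.add(j)
    tp = len(matched_gt)
    fp = len(detected_rois) - len(matched_det)
    fn = len(gt_rois) - len(matched_gt)
    return tp, fp, fn
```

A distance-based criterion (b or c) would follow the same pattern, with the pair score replaced by the central-point distance and the candidates sorted in ascending order.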
6.1.1.2 Test steps
Test steps include:
a) Import the test set to the AUT and export the results predicted by the AI in a format
compatible with the GT;
b) Obtain TP, FP and FN according to the rules in 6.1.1.1;
c) Calculate applicable performance metrics following the descriptions in 6.1.1.3 to 6.1.1.8.
Statistical calculation can be based on different units, such as lesions and cases. If a case
contains multiple lesions, the performance metric within a case can be calculated.
The manufacturer of the SUT shall clarify which performance metrics are applicable.
NOTE For example, a manufacturer may claim recall as the main performance metric while maintaining a low level
of false positive lesion per image.
6.1.1.3 Recall
See ISO/IEC 4213 6.2.4
6.1.1.4 Precision
See ISO/IEC 4213, 6.2.4
6.1.1.5 F-score
See ISO/IEC 4213, 6.2.5
6.1.1.6 Average precision
Change the setting of algorithm thresholds and calculate the precision and recall rate
corresponding to each threshold. Take the recall rate as x-coordinate and the precision as y-
coordinate to generate a precision-recall rate curve, and calculate the integral area under the
curve, namely the average precision. Whether the curve is smoothed and the method adopted
should be described in the test records.
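The threshold sweep described above can be sketched as a trapezoidal integration of the precision-recall points, with no curve smoothing applied. The function name and input format are illustrative assumptions.

```python
def average_precision(points):
    """Integrate the precision-recall curve given (recall, precision)
    pairs, one pair per threshold setting, using the trapezoidal rule."""
    pts = sorted(points)  # ascending recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0  # trapezoid between neighbours
    return area
```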
6.1.1.7 Mean average precision
If multi-target detection is applicable, the average precision of various targets should be
calculated, the mean value of which should be recorded as the mean average precision.
6.1.1.8 FROC curve
To construct the FROC curve, the prediction threshold of the AUT is tuned step by step. The
corresponding recall and non-lesion location rate are calculated under each prediction
threshold. The FROC curve is plotted by setting the recall as the y-coordinate and the non-lesion
location rate as the x-coordinate. The values of the x-coordinate are generally set as a geometric
progression, i.e. 0.5, 1, 2, ..., N, where the upper limit N should be greater than the average
number of lesions per case. Taking the computer aided detection of pulmonary nodules as an
example, assuming that a single case has an average of 7.5 pulmonary nodules, the value of N
should not be less than 8.
The non-lesion location rate is expressed as NLR. See Formula (1) for its expression:

NLR = FP / N × 100%    (1)

where
NLR  non-lesion location rate;
FP   the number of lesion sites detected by the AUT but not identified in the GT;
N    the number of all cases.
NOTE NLR is also called the mean number of false positives in a single patient case.
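The FROC construction above reduces to computing one (NLR, recall) point per prediction threshold. In this sketch NLR is kept as false positives per case for plotting, rather than scaled by 100 % as in Formula (1); names and input format are illustrative assumptions.

```python
def froc_points(counts_per_threshold, n_cases, n_lesions):
    """counts_per_threshold: list of (TP, FP) totals at each threshold.

    Returns (NLR, recall) pairs, NLR being mean false positives per case."""
    points = []
    for tp, fp in counts_per_threshold:
        points.append((fp / n_cases, tp / n_lesions))
    return points
```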
6.1.2 Segmentation and measurement
6.1.2.1 Test steps
Test steps include:
a) Import test sets to the AUT and export segmentation results by AI, at least including
boundary vertex coordinates of the segmentation region;
b) Calculate performance indicators using the ROIs segmented by the AUT and GT according
to the formulas described in 6.1.2.2 to 6.1.2.9;
c) Record the mean value, median, confidence interval and standard deviation of the
calculation results;
d) If the AUT detects and segments lesions simultaneously, only TP regions are taken into
account in the testing of segmentation performance.
6.1.2.2 Recall
See ISO/IEC 4213, 6.2.4
6.1.2.3 Precision
Divide the intersection between the target region segmented by the AUT and the target region
of the GT by the target region segmented by the AUT, as expressed in Formula (2):

P_re = |S_pr ∩ S_gt| / |S_pr|    (2)

where
P_re  precision;
S_pr  target region segmented by the AUT;
S_gt  target region segmented by the GT.
6.1.2.4 Target area overlapping ratio
When the target region is a general entity, the Dice coefficient or Jaccard coefficient should be
used to evaluate the overlap of the target areas segmented by the AUT and the GT.
The Dice coefficient is twice the intersection of the target region segmented by the AUT and the
target region of the GT divided by the sum of the two (the harmonic mean of recall and precision),
as expressed in Formula (3):

Dice = 2 × |S_pr ∩ S_gt| / (|S_pr| + |S_gt|)    (3)

where
Dice  Dice coefficient;
S_pr  target region segmented by the AUT;
S_gt  target region segmented by the GT.
The Jaccard coefficient is the intersection of the target region segmented by the AUT and the
target region of the GT divided by the union of the two, as expressed in Formula (4):

Jaccard = |S_pr ∩ S_gt| / |S_pr ∪ S_gt|    (4)

where
Jaccard  Jaccard coefficient;
S_pr     target region segmented by the AUT;
S_gt     target region segmented by the GT.
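On pixel or voxel sets, Formulas (3) and (4) can be sketched directly; the set representation of the segmented regions is an illustrative assumption.

```python
def dice_coefficient(s_pr, s_gt):
    """Formula (3): 2|S_pr ∩ S_gt| / (|S_pr| + |S_gt|)."""
    return 2.0 * len(s_pr & s_gt) / (len(s_pr) + len(s_gt))

def jaccard_coefficient(s_pr, s_gt):
    """Formula (4): |S_pr ∩ S_gt| / |S_pr ∪ S_gt|."""
    return len(s_pr & s_gt) / len(s_pr | s_gt)
```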
6.1.2.5 Tree length detection, TLD
When the ROI is tubular or another tree structure, the TLD should be adopted to evaluate the
proportion between the correctly segmented ROI length and the ROI length in the GT, as
expressed by Formula (5):

TLD = L_TP / L_gt    (5)

where
TLD   tree length detection;
L_TP  correctly segmented ROI length (unit: pixels or SI units such as mm);
L_gt  ROI length in the GT (unit: pixels or SI units such as mm).
6.1.2.6 Surface distance
Surface distance refers to the distance between the ROIs given by the AUT and the GT, which
can be used for evaluating the effect of contour segmentation.
Taking X as the ROI segmented by the AUT and Y as the ROI segmented by the GT, the two-way
Hausdorff distance d_H(X, Y) is calculated according to Formula (6):

d_H(X, Y) = max{ max_{x∈X} min_{y∈Y} d(x, y), max_{y∈Y} min_{x∈X} d(x, y) }    (6)

where
d_H(X, Y)  two-way Hausdorff distance;
d(x, y)    the distance between any two points x in X and y in Y;
X          ROI segmented by the AUT;
Y          ROI segmented by the GT.
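Formula (6) can be sketched for contours given as point sets, with the Euclidean distance as d(x, y); a production implementation would typically operate on image masks, so the point-set form below is an illustrative assumption.

```python
import math

def hausdorff(X, Y):
    """Two-way Hausdorff distance of Formula (6) between point sets X and Y."""
    d_xy = max(min(math.dist(x, y) for y in Y) for x in X)  # X towards Y
    d_yx = max(min(math.dist(x, y) for x in X) for y in Y)  # Y towards X
    return max(d_xy, d_yx)
```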
6.1.2.7 Density measurement
Compare the density value or gray value of pixels in the ROI identified by the AUT with the
result for the ROI identified by the GT to calculate the mean value of the absolute relative error
according to Formula (7):

S = ( Σ_{i=1}^{n} |(L_i − L_a,i) / L_a,i| ) / n × 100%    (7)

where
S      deviation value;
L_i    measured value of the i-th case;
L_a,i  GT value of the i-th case;
n      number of cases.
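Formula (7) can be sketched as a mean absolute relative error over the n cases; the function name and inputs are illustrative assumptions.

```python
def density_deviation(measured, gt):
    """Formula (7): mean of |(L_i - L_a,i) / L_a,i| over all cases, in %."""
    n = len(measured)
    return sum(abs((m - g) / g) for m, g in zip(measured, gt)) / n * 100.0
```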
6.1.2.8 Size measurement
The object of dimension measurement includes the dimensions of the target region, such as
long and short diameter, length and width of tightly bounded rectangular box. The target region
can be located in a two-dimensional plane or a three-dimensional space.
When the target region identified by the AUT can be approximated as a convex shape, the
rotating caliper method or other methods can be used on the contour (including boundary) of
the target region identified by the algorithm to locate the key points with medical significance.
The long diameter, short diameter and average value are then calculated and compared with the
results for the target region according to the GT. The average of the absolute values of the
relative errors is finally calculated.
6.1.2.9 Volume measurement
Calculate the difference of voxel numbers in the ROIs identified by the AUT and the GT
respectively. Multiply the difference by the volume of each voxel to obtain the absolute error of
the volume measurement. The difference divided by the voxel number of the GT yields the
relative volume error of an ROI. The sum of the absolute values of the relative volume errors
divided by the number of ROIs yields the overall relative volume error on a test set. The mean
values, median, confidence interval and standard deviation are included in the reports.
NOTE Counting voxel numbers may be replaced by multiplying ROI area with pixel heights or other equivalent
methods.
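The voxel-counting procedure above can be sketched as follows; per-ROI voxel counts and a uniform voxel volume are illustrative assumptions.

```python
def volume_errors(pred_voxels, gt_voxels, voxel_volume):
    """pred_voxels, gt_voxels: per-ROI voxel counts (same length).

    Returns per-ROI absolute errors, per-ROI relative errors, and the
    overall relative volume error (mean of absolute relative errors)."""
    abs_errors, rel_errors = [], []
    for p, g in zip(pred_voxels, gt_voxels):
        abs_errors.append((p - g) * voxel_volume)  # signed absolute error
        rel_errors.append((p - g) / g)             # signed relative error
    overall = sum(abs(r) for r in rel_errors) / len(rel_errors)
    return abs_errors, rel_errors, overall
```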
6.1.3 Classification
6.1.3.1 Test steps
In the image classification scenario, the algorithm test is performed as follows:
a) Import the testing set to the AUT and export the results classified by the AUT; the format of
the results should be compatible with the GT, and the content should include the classification
result and the classification probability (if applicable);
b) Compare the classification results of the AUT with the GT and calculate the numbers of true
positive, false positive, true negative and false negative samples to build a confusion matrix.
For n-class classification, the confusion matrix is generally shown in Table 1.
Table 1 – n-classification Confusion Matrix

Classification   Pred_1   Pred_2   …   Pred_n
True_1           N_1,1    N_1,2    …   N_1,n
True_2           N_2,1    N_2,2    …   N_2,n
…                …        …        …   …
True_n           N_n,1    N_n,2    …   N_n,n

NOTE Pred_x (x = 1~n) refers to the category classified as x by the AUT; True_x (x = 1~n) refers to the category
classified as x by the GT; N_i,j (i = 1~n, j = 1~n) refers to the number of samples classified as i by the GT but
as j by the AUT; n refers to the number of categories.
The confusion matrix of binary classification can be simplified as shown in Table 2.
Table 2 – Binary-classification Confusion Matrix

                         Classified by AUT
                         Positive   Negative
Classified by GT
  Positive               TP         FN
  Negative               FP         TN
The problem of multi-classification can be converted into multiple binary classifications. The
confusion matrix of category i classified by the GT versus all other categories can be simplified
as shown in Table 3.

Table 3 – Binary confusion matrix derived from the multi-classification situation

                      Classified by AUT
Classified by GT      Positive                      Negative
Positive              TP = N_i,i                    FN = ∑_{j=1,j≠i} N_i,j
Negative              FP = ∑_{j=1,j≠i} N_j,i        TN = ∑_{j=1,j≠i} ∑_{l=1,l≠i} N_j,l
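The Table 3 formulas can be sketched as follows, assuming the confusion matrix is stored as a nested list with 0-based class indices. This is illustrative only and not part of this standard.

```python
# Illustrative sketch (not part of the standard): collapsing an n-class
# confusion matrix N to (TP, FN, FP, TN) for class i, per Table 3.

def binarize(N, i):
    """Class i is treated as 'positive'; all other classes are 'negative'."""
    n = len(N)
    tp = N[i][i]
    fn = sum(N[i][j] for j in range(n) if j != i)
    fp = sum(N[j][i] for j in range(n) if j != i)
    tn = sum(N[j][l] for j in range(n) if j != i
                     for l in range(n) if l != i)
    return tp, fn, fp, tn

N = [[1, 1, 0],
     [0, 1, 0],
     [1, 0, 2]]
print(binarize(N, 0))  # → (1, 1, 1, 3)
```

Note that TP + FN + FP + TN always equals the grand total of the matrix, which is a useful sanity check.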
6.1.3.2 Sensitivity
See ISO/IEC 4213 3.2.10.
6.1.3.3 Specificity
See ISO/IEC 4213 3.2.11.
6.1.3.4 Missed rate, MR
Missed rate is expressed by MR, and the expression is shown in Formula (8):

𝑀𝑅 = 1 − 𝑆𝑒𝑛 (8)

where
Sen  Sensitivity;
MR   Missed rate.
6.1.3.5 Positive prediction value, PPV
See ISO/IEC 4213 3.2.9.
NOTE The bias of the test set may impact the PPV.
6.1.3.6 Negative prediction value, NPV
Negative prediction value is expressed as NPV, and the expression is shown in Formula (9):

𝑁𝑃𝑉 = 𝑇𝑁 / (𝐹𝑁 + 𝑇𝑁) (9)

where
NPV  Negative prediction value;
TN   Number of true negative samples;
FN   Number of false negative samples.
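A minimal sketch (not part of this standard) gathering the indicators of 6.1.3.2 to 6.1.3.6 from the Table 2 counts. The function name, dictionary keys and the sample counts are illustrative assumptions.

```python
# Illustrative sketch (not part of the standard): binary indicators of
# 6.1.3.2 to 6.1.3.6 computed from Table 2 counts.

def binary_indicators(tp, fn, fp, tn):
    sen = tp / (tp + fn)           # sensitivity (ISO/IEC 4213, 3.2.10)
    spe = tn / (tn + fp)           # specificity (ISO/IEC 4213, 3.2.11)
    return {
        "Sen": sen,
        "Spe": spe,
        "MR":  1 - sen,            # missed rate, Formula (8)
        "PPV": tp / (tp + fp),     # positive prediction value
        "NPV": tn / (fn + tn),     # negative prediction value, Formula (9)
    }

# Hypothetical counts: 80 TP, 20 FN, 10 FP, 90 TN
print(binary_indicators(tp=80, fn=20, fp=10, tn=90))
```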
6.1.3.7 Accuracy
Accuracy is expressed as Acc, and the expression is shown in Formula (10):

Acc = ∑_{i=1}^{n} N_i,i / (∑_{j=1}^{n} ∑_{i=1}^{n} N_j,i) (10)

where
N_j,i  the element in the j-th row and the i-th column of the confusion matrix;
N_i,i  the element in the i-th row and the i-th column of the confusion matrix.
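Formula (10) reduces to the trace of the confusion matrix divided by the grand total; a hypothetical sketch, not part of this standard:

```python
# Illustrative sketch (not part of the standard): Formula (10) computed
# directly on a confusion matrix stored as a nested list.

def accuracy(N):
    """Sum of diagonal elements N_i,i over the total sample count."""
    n = len(N)
    correct = sum(N[i][i] for i in range(n))   # ∑ N_i,i
    total = sum(sum(row) for row in N)         # ∑∑ N_j,i
    return correct / total

# Hypothetical 3-class matrix: 135 correct samples out of 150
N = [[50, 3, 2],
     [4, 40, 1],
     [2, 3, 45]]
print(accuracy(N))  # → 0.9
```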
6.1.3.8 Youden Index
Youden index is expressed as Y, and the expression is shown in Formula (11):
𝑌 = Sen + Spe − 1 (11)
where
Sen  Sensitivity;
Spe  Specificity.
6.1.3.9 Kappa Coefficient
Kappa coefficient is expressed as K. See Formula (12) for its expression:

𝐾 = (𝐴𝑐𝑐 − 𝑝_𝑒) / (1 − 𝑝_𝑒) (12)

where

𝑝_𝑒 = ∑_{i=1}^{n} (∑_{j=1}^{n} N_i,j × ∑_{j=1}^{n} N_j,i) / (∑_{a=1}^{n} ∑_{b=1}^{n} N_a,b)² (13)

where
N_j,i  the element in the j-th row and the i-th column of the confusion matrix;
N_i,j  the element in the i-th row and the j-th column of the confusion matrix;
Acc    Accuracy.
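Formulas (12) and (13) can be sketched as follows; this is illustrative and not part of this standard. The chance-agreement term p_e is the sum over classes of the product of the row and column marginals, divided by the squared grand total.

```python
# Illustrative sketch (not part of the standard): kappa coefficient of
# Formulas (12) and (13) on a nested-list confusion matrix.

def kappa(N):
    n = len(N)
    total = sum(sum(row) for row in N)
    acc = sum(N[i][i] for i in range(n)) / total   # Formula (10)
    # p_e: ∑_i (row-i total × column-i total) / total², Formula (13)
    p_e = sum(sum(N[i]) * sum(N[j][i] for j in range(n))
              for i in range(n)) / total ** 2
    return (acc - p_e) / (1 - p_e)                 # Formula (12)

# Perfect agreement on a balanced two-class matrix gives K = 1
print(kappa([[5, 0], [0, 5]]))  # → 1.0
```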
6.1.3.10 Receiver Operating Characteristics Curve (ROC curve)
Manufacturers should claim the area under the curve (AUC) of the receiver operating
characteristic (ROC) curve. The testing personnel should adjust the classification threshold
over different values (no less than 1,000 steps; the step size can be set uniformly), compare
the algorithm classification results with the GT classification results, calculate the sensitivity
and specificity at each threshold, plot the ROC curve with 1 − specificity as the x-coordinate
and sensitivity as the y-coordinate, and calculate the integral area under the ROC curve.
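The threshold sweep above can be sketched as follows, assuming the AUT outputs one probability per sample. The function names, the uniform grid and the trapezoidal integration are illustrative assumptions, not mandated by this standard.

```python
# Illustrative sketch (not part of the standard): sweep classification
# thresholds, collect (1 - specificity, sensitivity) points, integrate
# the ROC curve with the trapezoidal rule.

def roc_auc(gt, prob, steps=1000):
    pts = []
    for k in range(steps + 1):
        thr = k / steps
        tp = sum(1 for y, p in zip(gt, prob) if y == 1 and p >= thr)
        fn = sum(1 for y, p in zip(gt, prob) if y == 1 and p < thr)
        fp = sum(1 for y, p in zip(gt, prob) if y == 0 and p >= thr)
        tn = sum(1 for y, p in zip(gt, prob) if y == 0 and p < thr)
        sen = tp / (tp + fn)
        spe = tn / (tn + fp)
        pts.append((1 - spe, sen))   # (x, y) point of the ROC curve
    pts.sort()
    # trapezoidal integration over the sorted ROC points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical, perfectly separable test set: AUC = 1.0
print(roc_auc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1]))
```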
6.1.4 Multifunctional Combination Scenario
For SUTs that combine the functions of detection, classification, segmentation and
measurement, testing personnel should evaluate the algorithm performance corresponding to
each function step by step, for example:
• First, match the ROIs detected by the AUT and the GT, and calculate the performance
indicators of computer-aided detection;
• Second, calculate the classification and segmentation indicators for the TP regions;
• Finally, calculate the measurement indicators based on the segmentation.
When the AUT has restrictions on use or technical constraints (e.g. the algorithm only identifies
lesions greater than a certain size, or is only applicable to images with a CT slice thickness of
2 mm or less), testing personnel should apply corresponding constraints to the testing set and
the GT, and describe them in the test plan and test record.
6.1.5 Follow-up Assessment Scenario
For SUTs with a follow-up assessment function, import data from the same subject at different
time points, such as baseline images, follow-up images and repeated screening, to the AUT.
Then compare the difference between the analysis results of the AUT and the GT for the same
target region; meanwhile, a dynamic curve can be plotted from the AUT outputs at each time
point to calculate its consistency with the GT curve.
6.1.6 Patient Triage Scenario
For SUTs with a patient triage function, classification labels for the test data should be created
in the testing set according to clinical guidelines or expert consensus (for example,
negative-positive triage or critical triage) and compared with the labels output by the AUT to
establish the confusion matrix, so as to calculate indicators such as sensitivity, specificity and
the kappa coefficient using the methods in 6.1.3.
For SUTs with a patient prioritization function, the method in this clause should apply.
If the patient triage function is based on another AI task, such as detection or classification,
the performance test input for the patient triage function should have its own ground truth, in
addition to using the AI task result.
6.2 Test methods for quality characteristics
6.2.1 General requirement for generalizability test
Manufacturers should analyze the differences between the training sets used in the design
phase and unfamiliar samples in the real world, according to the intended use and deployment
environment of the SUT, and document the configuration of the testing sets. In actual tests, the
generalizability of the algorithm should be verified according to the diversity and variability of
the testing sets. A key measure of generalizability is to report a stratified analysis per relevant
confounder or effect modifier (e.g. lesion size, lesion type, lesion location, disease stage,
imaging or scanning protocol).
6.2.2 Robustness test
6.2.2.1 General requirement
Manufacturers should evaluate various factors that may interfere with the algorithm
performance at the stage of clinical use according to the product risk analysis (ISO 14971 and
ISO/CD 24971-2, Medical devices — Guidance on the application of ISO 14971 — Part 2:
Machine learning in artificial intelligence) and the characteristics of the clinical deployment
environment,
acquire or synthesize relevant data to form special test sets; carry out an exploratory test on
the algorithm performan
...