SIST ISO 16269-4:2014
Statistical interpretation of data - Part 4: Detection and treatment of outliers
ISO 16269-4:2010 provides detailed descriptions of sound statistical testing procedures and graphical data analysis methods for detecting outliers in data obtained from measurement processes. It recommends sound robust estimation and testing procedures to accommodate the presence of outliers.
ISO 16269-4:2010 is primarily designed for the detection and accommodation of outlier(s) from univariate data. Some guidance is provided for multivariate and regression data.
Interprétation statistique des données - Partie 4: Détection et traitement des valeurs aberrantes
Statistično tolmačenje podatkov - 4. del: Zaznavanje in obravnava osamelcev
General Information
- Status: Published
- Publication Date: 24-Nov-2013
- Technical Committee: ISTM - Statistical methods
- Current Stage: 6060 - National Implementation/Publication (Adopted Project)
- Start Date: 20-Nov-2013
- Due Date: 25-Jan-2014
- Completion Date: 25-Nov-2013
Overview
SIST ISO 16269-4:2014 - "Statistical interpretation of data - Part 4: Detection and treatment of outliers" is an international standard that defines sound statistical procedures and graphical methods for identifying and accommodating outliers in measurement data. Focused primarily on univariate data, it also offers guidance for multivariate and regression contexts. The standard supports robust practice in measurement processes, quality control and data analysis by recommending validated tests, robust estimators and graphical diagnostics.
Key topics and technical requirements
- Definitions and terminology: formalizes terms such as outlier, masking, resistant and robust estimation, order statistics, quartiles and box plot.
- Data screening and graphical methods: recommended procedures for initial screening, including modified box plots, interquartile range (IQR) techniques and five-number summaries.
- Statistical tests for outliers: procedures for samples from normal, exponential and other known (and unknown) distributions; includes guidance on Cochran’s test for outlying variances and procedures for detecting single or multiple outliers.
- Robust estimation and accommodation: recommended robust estimators for location and scale (e.g., trimmed means, biweight estimators) and correction factors for scale estimation when outliers may be present.
- Multivariate and regression data: guidance on detecting outlying Y and X observations, influential points and suggestions for robust regression techniques.
- Annexes and implementation aids: Annex A presents an algorithm for the GESD outlier detection procedure; Annexes B–E provide critical-value and factor tables (exponential samples, modified box plot factors, robust estimator correction factors, Cochran’s test); Annex F provides a structured flow chart for univariate outlier detection.
Practical applications
ISO 16269-4 is applicable wherever measurement data integrity is critical:
- Manufacturing process control and product quality assurance
- Metrology and laboratory measurement analysis
- Calibration, testing and inspection
- Data cleaning in research, product development and regulatory reporting
- Statistical analysis pipelines where robust inference is required
Using standardized outlier detection improves comparability of analyses across laboratories, suppliers and regulatory bodies.
Who should use this standard
- Statisticians and data scientists implementing robust detection routines
- Quality engineers and process control specialists
- Metrologists and laboratory analysts
- Auditors and compliance professionals assessing measurement data validity
Related standards
- Other parts of ISO 16269 (Part 6: tolerance intervals; Part 7: median estimation; Part 8: prediction intervals) provide complementary statistical methods for measurement interpretation.
Keywords: SIST ISO 16269-4:2014, outliers detection, outlier treatment, robust estimation, modified box plot, Cochran test, GESD, univariate data, multivariate outliers, regression diagnostics.
Frequently Asked Questions
SIST ISO 16269-4:2014 is a standard published by the Slovenian Institute for Standardization (SIST). Its full title is "Statistical interpretation of data - Part 4: Detection and treatment of outliers". This standard covers: ISO 16269-4:2010 provides detailed descriptions of sound statistical testing procedures and graphical data analysis methods for detecting outliers in data obtained from measurement processes. It recommends sound robust estimation and testing procedures to accommodate the presence of outliers. ISO 16269-4:2010 is primarily designed for the detection and accommodation of outlier(s) from univariate data. Some guidance is provided for multivariate and regression data.
SIST ISO 16269-4:2014 is classified under the following ICS (International Classification for Standards) categories: 03.120.30 - Application of statistical methods. The ICS classification helps identify the subject area and facilitates finding related standards.
Standards Content (Sample)
SLOVENIAN STANDARD
01-January-2014
Statistično tolmačenje podatkov - 4. del: Zaznavanje in obravnava osamelcev
Statistical interpretation of data - Part 4: Detection and treatment of outliers
Interprétation statistique des données - Partie 4: Détection et traitement des valeurs aberrantes
This Slovenian standard is identical to: ISO 16269-4:2010
ICS:
03.120.30 Uporaba statističnih metod / Application of statistical methods
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD ISO 16269-4
First edition 2010-10-15
Statistical interpretation of data —
Part 4:
Detection and treatment of outliers
Interprétation statistique des données —
Partie 4: Détection et traitement des valeurs aberrantes
© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Symbols
4 Outliers in univariate data
4.1 General
4.1.1 What is an outlier?
4.1.2 What are the causes of outliers?
4.1.3 Why should outliers be detected?
4.2 Data screening
4.3 Tests for outliers
4.3.1 General
4.3.2 Sample from a normal distribution
4.3.3 Sample from an exponential distribution
4.3.4 Samples taken from some known non-normal distributions
4.3.5 Sample taken from unknown distributions
4.3.6 Cochran's test for outlying variance
4.4 Graphical test of outliers
5 Accommodating outliers in univariate data
5.1 Robust data analysis
5.2 Robust estimation of location
5.2.1 General
5.2.2 Trimmed mean
5.2.3 Biweight location estimate
5.3 Robust estimation of dispersion
5.3.1 General
5.3.2 Median-median absolute pair-wise deviation
5.3.3 Biweight scale estimate
6 Outliers in multivariate and regression data
6.1 General
6.2 Outliers in multivariate data
6.3 Outliers in linear regression
6.3.1 General
6.3.2 Linear regression models
6.3.3 Detecting outlying Y observations
6.3.4 Identifying outlying X observations
6.3.5 Detecting influential observations
6.3.6 A robust regression procedure
Annex A (informative) Algorithm for the GESD outliers detection procedure
Annex B (normative) Critical values of outliers test statistics for exponential samples
Annex C (normative) Factor values of the modified box plot
Annex D (normative) Values of the correction factors for the robust estimators of the scale parameter
Annex E (normative) Critical values of Cochran's test statistic
Annex F (informative) A structured guide to detection of outliers in univariate data
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 16269-4 was prepared by Technical Committee ISO/TC 69, Applications of statistical methods.
ISO 16269 consists of the following parts, under the general title Statistical interpretation of data:
⎯ Part 4: Detection and treatment of outliers
⎯ Part 6: Determination of statistical tolerance intervals
⎯ Part 7: Median — Estimation and confidence intervals
⎯ Part 8: Determination of prediction intervals
Introduction
Identification of outliers is one of the oldest problems in interpreting data. Causes of outliers include
measurement error, sampling error, intentional under- or over-reporting of sampling results, incorrect
recording, incorrect distributional or model assumptions of the data set, and rare observations, etc.
Outliers can distort and reduce the information contained in the data source or generating mechanism. In the
manufacturing industry, the existence of outliers will undermine the effectiveness of any process/product
design and quality control procedures. Possible outliers are not necessarily bad or erroneous. In some
situations, an outlier may carry essential information and thus it should be identified for further study.
The study and detection of outliers from measurement processes leads to better understanding of the
processes and proper data analysis that subsequently results in improved inferences.
In view of the enormous volume of literature on the topic of outliers, it is of great importance for the
international community to identify and standardize a sound subset of methods used in the identification and
treatment of outliers. The implementation of this part of ISO 16269 enables business and industry to recognize
the data analyses conducted across member countries or organizations.
Six annexes are provided. Annex A provides an algorithm for computing the test statistic and critical values of
a procedure in detecting outliers in a data set taken from a normal distribution. Annexes B, D and E provide
the tables needed to implement the recommended procedures. Annex C provides the tables and statistical
theory that underlie the construction of modified box plots in outlier detection. Annex F provides a structured
guide and flow chart to the procedures recommended in this part of ISO 16269.
INTERNATIONAL STANDARD ISO 16269-4:2010(E)
Statistical interpretation of data —
Part 4:
Detection and treatment of outliers
1 Scope
This part of ISO 16269 provides detailed descriptions of sound statistical testing procedures and graphical
data analysis methods for detecting outliers in data obtained from measurement processes. It recommends
sound robust estimation and testing procedures to accommodate the presence of outliers.
This part of ISO 16269 is primarily designed for the detection and accommodation of outlier(s) from univariate
data. Some guidance is provided for multivariate and regression data.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
sample
data set
subset of a population made up of one or more sampling units
NOTE 1 The sampling units could be items, numerical values or even abstract entities depending on the population of
interest.
NOTE 2 A sample from a normal (2.22), a gamma (2.23), an exponential (2.24), a Weibull (2.25), a
lognormal (2.26) or a type I extreme value (2.27) population will often be referred to as a normal, a gamma, an
exponential, a Weibull, a lognormal or a type I extreme value sample, respectively.
2.2
outlier
member of a small subset of observations that appears to be inconsistent with the remainder of a given
sample (2.1)
NOTE 1 The classification of an observation or a subset of observations as outlier(s) is relative to the chosen model for
the population from which the data set originates. This or these observations are not to be considered as genuine
members of the main population.
NOTE 2 An outlier may originate from a different underlying population, or be the result of incorrect recording or gross
measurement error.
NOTE 3 The subset may contain one or more observations.
2.3
masking
presence of more than one outlier (2.2), making each outlier difficult to detect
2.4
some-outside rate
probability that one or more observations in an uncontaminated sample will be wrongly classified as
outliers (2.2)
2.5
outlier accommodation method
method that is insensitive to the presence of outliers (2.2) when providing inferences about the population
2.6
resistant estimation
estimation method that provides results that change only slightly when a small portion of the data values in a
data set (2.1) is replaced, possibly with very different data values from the original ones
2.7
robust estimation
estimation method that is insensitive to small departures from assumptions about the underlying probability
model of the data
NOTE An example is an estimation method that works well for, say, a normal distribution (2.22), and remains
reasonably good if the actual distribution is skew or heavy-tailed. Classes of such methods include the L-estimation
[weighted average of order statistics (2.10)] and M-estimation methods (see Reference [9]).
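The idea behind resistant estimation (2.6) can be illustrated with a small sketch. It compares how far the mean and the median move when a single value of a data set is replaced by a very different one. The helper name `shift_after_replacement` and the example data are assumptions made for illustration; they do not come from this part of ISO 16269.

```python
from statistics import mean, median


def shift_after_replacement(values, index, new_value, estimator):
    """Absolute change in an estimate when one data value is replaced.

    Hypothetical helper: `estimator` is any function mapping a list of
    numbers to a single number (for example mean or median).
    """
    original = list(values)
    contaminated = list(values)
    contaminated[index] = new_value  # replace a small portion of the data
    return abs(estimator(contaminated) - estimator(original))


# Illustrative data: five readings, the last one replaced by a gross error.
readings = [9.8, 10.0, 10.1, 10.2, 9.9]
mean_shift = shift_after_replacement(readings, 4, 1000.0, mean)
median_shift = shift_after_replacement(readings, 4, 1000.0, median)
print(f"shift of mean:   {mean_shift:.2f}")
print(f"shift of median: {median_shift:.2f}")
```

In this example the median changes only slightly while the mean moves by a large amount, which is the behaviour the definition of a resistant estimate describes.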
2.8
rank
position of an observed value in an ordered set of observed values
NOTE 1 The observed values are arranged in ascending order (counting from below) or descending order (counting
from above).
NOTE 2 For the purposes of this part of ISO 16269, identical observed values are ranked as if they were slightly
different from one another.
2.9
depth
〈box plot〉 smaller of the two ranks (2.8) determined by counting up from the smallest value of the
sample (2.1), or counting down from the largest value
NOTE 1 The depth may not be an integer value (see Annex C).
NOTE 2 For all summary values other than the median (2.11), a given depth identifies two (data) values, one below
the median and the other above the median. For example, the two data values with depth 1 are the smallest value
(minimum) and largest value (maximum) in the given sample (2.1).
2.10
order statistic
statistic determined by its ranking in a non-decreasing arrangement of random variables
[ISO 3534-1:2006, definition 1.9]
NOTE 1 Let the observed values of a random sample be $\{x_1, x_2, \ldots, x_n\}$. Reorder the observed values in non-decreasing order designated as $x_{(1)} \le x_{(2)} \le \ldots \le x_{(k)} \le \ldots \le x_{(n)}$; then $x_{(k)}$ is the observed value of the kth order statistic in a sample of size n.
NOTE 2 In practical terms, obtaining the order statistics for a sample (2.1) amounts to sorting the data as formally
described in Note 1.
2.11
median
sample median
median of a set of numbers
$Q_2$
[(n + 1)/2]th order statistic (2.10), if the sample size n is odd; sum of the [n/2]th and the [(n/2) + 1]th order
statistics divided by 2, if the sample size n is even
[ISO 3534-1:2006, definition 1.13]
NOTE The sample median is the second quartile ($Q_2$).
2.12
first quartile
sample lower quartile
$Q_1$
for an odd number of observations, median (2.11) of the smallest (n − 1)/2 observed values; for an even
number of observations, median of the smallest n/2 observed values
NOTE 1 There are many definitions in the literature of a sample quartile, which produce slightly different results. This
definition has been chosen both for its ease of application and because it is widely used.
NOTE 2 Concepts such as hinges or fourths (2.19 and 2.20) are popular variants of quartiles. In some cases
(see Note 3 to 2.19), the first quartile and the lower fourth (2.19) are identical.
2.13
third quartile
sample upper quartile
$Q_3$
for an odd number of observations, median of the largest (n − 1)/2 observed values; for an even number of
observations, median of the largest n/2 observed values
NOTE 1 There are many definitions in the literature of a sample quartile, which produce slightly different results. This
definition has been chosen both for its ease of application and because it is widely used.
NOTE 2 Concepts such as hinges or fourths (2.19 and 2.20) are popular variants of quartiles. In some cases
(see Note 3 to 2.20), the third quartile and the upper fourth (2.20) are identical.
2.14
interquartile range
IQR
difference between the third quartile (2.13) and the first quartile (2.12)
NOTE 1 This is one of the widely used statistics to describe the spread of a data set.
NOTE 2 The difference between the upper fourth (2.20) and the lower fourth (2.19) is called the fourth-spread and is
sometimes used instead of the interquartile range.
2.15
five-number summary
the minimum, first quartile (2.12), median (2.11), third quartile (2.13), and maximum
NOTE The five-number summary provides numerical information about the location, spread and range.
2.16
box plot
horizontal or vertical graphical representation of the five-number summary (2.15).
NOTE 1 For the horizontal version, the first quartile (2.12) and the third quartile (2.13) are plotted as the left and
right sides, respectively, of a box, the median (2.11) is plotted as a vertical line across the box, the whiskers stretching
downwards from the first quartile to the smallest value at or above the lower fence (2.17) and upwards from the third
quartile to the largest value at or below the upper fence (2.18), and value(s) beyond the lower and upper fences are
marked separately as outlier(s) (2.2). For the vertical version, the first and third quartiles are plotted as the bottom and the
top, respectively, of a box, the median is plotted as a horizontal line across the box, the whiskers stretching downwards
from the first quartile to the smallest value at or above the lower fence and upwards from the third quartile to the largest
value at or below the upper fence and value(s) beyond the lower and upper fences are marked separately as outlier(s).
NOTE 2 The box width and whisker length of a box plot provide graphical information about the location, spread,
skewness, tail lengths, and outlier(s) of a sample. Comparisons between box plots and the density function of a) uniform,
b) bell-shaped, c) right-skewed, and d) left-skewed distributions are given in the diagrams in Figure 1. In each distribution,
a histogram is shown above the boxplot.
NOTE 3 A box plot constructed with its lower fence (2.17) and upper fence (2.18) evaluated by taking k to be a value
based on the sample size n and the knowledge of the underlying distribution of the sample data is called a modified box
plot (see example, Figure 2). The construction of a modified box plot is given in 4.4.
Figure 1 - Box plots and histograms for a) uniform, b) bell-shaped, c) right-skewed, and d) left-skewed distributions
Key: X data values; Y frequency. In each distribution, a histogram is shown above the box plot.
Figure 2 - Modified box plot with lower and upper fences
2.17
lower fence
lower outlier cut-off
lower adjacent value
value in a box plot (2.16) situated k times the interquartile range (2.14) below the first quartile (2.12), with
a predetermined value of k
NOTE In proprietary statistical packages, the lower fence is usually taken to be $Q_1 - k(Q_3 - Q_1)$, with k taken to be either 1,5 or 3,0. Classically, this fence is called the “inner lower fence” when k is 1,5, and the “outer lower fence” when k is 3,0.
2.18
upper fence
upper outlier cut-off
upper adjacent value
value in a box plot situated k times the interquartile range (2.14) above the third quartile (2.13), with a
predetermined value of k
NOTE In proprietary statistical packages, the upper fence is usually taken to be $Q_3 + k(Q_3 - Q_1)$, with k taken to be either 1,5 or 3,0. Classically, this fence is called the “inner upper fence” when k is 1,5, and the “outer upper fence” when k is 3,0.
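The quartile and fence definitions in 2.11 to 2.18 can be sketched in a few lines of Python. This is a minimal illustration, assuming the fixed classical factor k = 1,5 (written `1.5` in code) rather than the sample-size-dependent factors of the modified box plot, which are tabulated in Annex C and not reproduced here. The function names and the example data are assumptions.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) following the definitions in 2.11 to 2.13.

    Q1 is the median of the smallest (n - 1)/2 values for odd n, or of the
    smallest n/2 values for even n; Q3 uses the largest values likewise.
    """
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2  # equals (n - 1)/2 for odd n and n/2 for even n
    return median(x[:half]), median(x), median(x[n - half:])


def fences(values, k=1.5):
    """Return (lower fence, upper fence) as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1  # interquartile range, 2.14
    return q1 - k * iqr, q3 + k * iqr


def beyond_fences(values, k=1.5):
    """Return the values lying strictly outside the two fences."""
    lower, upper = fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Illustrative data only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print(quartiles(data))      # first quartile, median, third quartile
print(fences(data))         # inner fences with k = 1.5
print(beyond_fences(data))  # candidate outliers for further testing
```

Values returned by `beyond_fences` are only candidates: whether they are treated as outliers still depends on the distribution assumed for the sample.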
2.19
lower fourth
$x_{L:n}$
for a set $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ of observed values, the quantity $0{,}5\,[x_{(i)} + x_{(i+1)}]$ when f = 0 or $x_{(i+1)}$ when f > 0, where i is the integral part of n/4 and f is the fractional part of n/4
NOTE 1 This definition of a lower fourth is used to determine the recommended values of $k_L$ and $k_U$ given in Annex C and is the default or optional setting in some widely used statistical packages.
NOTE 2 The lower fourth and the upper fourth (2.20) as a pair are sometimes called hinges.
NOTE 3 The lower fourth is sometimes referred to as the first quartile (2.12).
NOTE 4 When f = 0, 0,5 or 0,75, the lower fourth is identical to the first quartile. For example:
| Sample size n | i = integral part of n/4 | f = fractional part of n/4 | First quartile | Lower fourth |
| 9 | 2 | 0,25 | $[x_{(2)} + x_{(3)}]/2$ | $x_{(3)}$ |
| 10 | 2 | 0,50 | $x_{(3)}$ | $x_{(3)}$ |
| 11 | 2 | 0,75 | $x_{(3)}$ | $x_{(3)}$ |
| 12 | 3 | 0 | $[x_{(3)} + x_{(4)}]/2$ | $[x_{(3)} + x_{(4)}]/2$ |
2.20
upper fourth
$x_{U:n}$
for a set $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ of observed values, the quantity $0{,}5\,[x_{(n-i)} + x_{(n-i+1)}]$ when f = 0 or $x_{(n-i)}$ when f > 0, where i is the integral part of n/4 and f is the fractional part of n/4
NOTE 1 This definition of an upper fourth is used to determine the recommended values of $k_L$ and $k_U$ given in Annex C and is the default or optional setting in some widely used statistical packages.
NOTE 2 The lower fourth (2.19) and the upper fourth as a pair are sometimes called hinges.
NOTE 3 The upper fourth is sometimes referred to as the third quartile (2.13).
NOTE 4 When f = 0, 0,5 or 0,75, the upper fourth is identical to the third quartile. For example:
| Sample size n | i = integral part of n/4 | f = fractional part of n/4 | Third quartile | Upper fourth |
| 9 | 2 | 0,25 | $[x_{(7)} + x_{(8)}]/2$ | $x_{(7)}$ |
| 10 | 2 | 0,50 | $x_{(8)}$ | $x_{(8)}$ |
| 11 | 2 | 0,75 | $x_{(9)}$ | $x_{(9)}$ |
| 12 | 3 | 0 | $[x_{(9)} + x_{(10)}]/2$ | $[x_{(9)} + x_{(10)}]/2$ |
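The definitions of the lower and upper fourths in 2.19 and 2.20 translate directly into index arithmetic. The sketch below is one possible reading, with the 1-based order statistics of the definitions mapped onto Python's 0-based list indices; the function name `fourths` is an assumption.

```python
def fourths(values):
    """Return (lower fourth, upper fourth) as defined in 2.19 and 2.20.

    i is the integral part of n/4; the remainder r = n mod 4 is zero
    exactly when the fractional part f of n/4 is zero.
    """
    x = sorted(values)
    n = len(x)
    if n < 1:
        raise ValueError("need at least one observation")
    i, r = divmod(n, 4)
    if r == 0:
        # f = 0: average x_(i), x_(i+1) and x_(n-i), x_(n-i+1)
        lower = 0.5 * (x[i - 1] + x[i])
        upper = 0.5 * (x[n - i - 1] + x[n - i])
    else:
        # f > 0: take x_(i+1) and x_(n-i)
        lower = x[i]
        upper = x[n - i - 1]
    return lower, upper


# Using the values 1..n makes each result equal to the order-statistic
# index, which allows a direct comparison with the tables in Note 4.
for n in (9, 10, 11, 12):
    print(n, fourths(range(1, n + 1)))
```

For n = 9 the lower fourth is the third order statistic while the first quartile of 2.12 is the average of the second and third, which is the one case in the tables where the two definitions differ.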
2.21
Type I error
rejection of the null hypothesis when in fact it is true
[ISO 3534-1:2006, definition 1.46]
NOTE 1 A Type I error is an incorrect decision. Hence, it is desired to keep the probability of making such an incorrect
decision as small as possible.
NOTE 2 It is possible in some situations (for example, testing the binomial parameter p) that a pre-specified
significance level such as 0,05 is not attainable due to discreteness in outcomes.
2.22
normal distribution
Gaussian distribution
continuous distribution having the probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$$
where −∞ < x < ∞ and with parameters −∞ < µ < ∞ and σ > 0
[ISO 3534-1:2006, definition 2.50]
NOTE 1 The location parameter µ is the mean and the scale parameter σ is the standard deviation of the normal
distribution.
NOTE 2 A normal sample is a random sample (2.1) taken from a population that follows a normal distribution.
2.23
gamma distribution
continuous distribution having the probability density function
$$f(x) = \frac{x^{\alpha-1} \exp(-x/\beta)}{\beta^{\alpha}\,\Gamma(\alpha)}$$
where x > 0 and parameters α > 0, β > 0
[ISO 3534-1:2006, definition 2.56]
NOTE 1 The gamma distribution is used in reliability applications for modelling time to failure. It includes the
exponential distribution (2.24) as a special case as well as other cases with failure rates that increase with age.
NOTE 2 The mean of the gamma distribution is $\alpha\beta$. The variance of the gamma distribution is $\alpha\beta^2$.
NOTE 3 A gamma sample is a random sample (2.1) taken from a population that follows a gamma distribution.
2.24
exponential distribution
continuous distribution having the probability density function
$$f(x) = \beta^{-1} \exp(-x/\beta)$$
where x > 0 and with parameter β > 0
[ISO 3534-1:2006, definition 2.58]
NOTE 1 The exponential distribution provides a baseline in reliability applications, corresponding to the case of “lack of
ageing” or memory-less property.
NOTE 2 The mean of the exponential distribution is $\beta$. The variance of the exponential distribution is $\beta^2$.
NOTE 3 An exponential sample is a random sample (2.1) taken from a population that follows an exponential
distribution.
2.25
Weibull distribution
type III extreme-value distribution
continuous distribution having the distribution function
$$F(x) = 1 - \exp\left\{ -\left( \frac{x-\theta}{\beta} \right)^{\kappa} \right\}$$
where x > θ with parameters −∞ < θ < ∞, β > 0, κ > 0
[ISO 3534-1:2006, definition 2.63]
NOTE 1 In addition to serving as one of the three possible limiting distributions of extreme order statistics, the Weibull
distribution occupies a prominent place in diverse applications, particularly reliability and engineering. The Weibull
distribution has been demonstrated to provide usable fits to a variety of data sets.
NOTE 2 The parameter θ is a location or threshold parameter in the sense that it is the minimum value that a Weibull
variate can achieve. The parameter β is a scale parameter (related to the standard deviation of a Weibull variate). The
parameter κ is a shape parameter.
NOTE 3 A Weibull sample is a random sample (2.1) taken from a population that follows a Weibull distribution.
2.26
lognormal distribution
continuous distribution having the probability density function
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(\ln x-\mu)^2}{2\sigma^2} \right\}$$
where x > 0 and with parameters −∞ < µ < ∞ and σ > 0
[ISO 3534-1:2006, definition 2.52]
2.27
type I extreme-value distribution
Gumbel distribution
continuous distribution having the distribution function
$$F(x) = \exp\left\{ -e^{-(x-\mu)/\sigma} \right\}$$
where −∞ < x < ∞ and with parameters −∞ < µ < ∞ and σ > 0
NOTE Extreme-value distributions provide appropriate reference distributions for the extreme order statistics (2.10) $x_{(1)}$ and $x_{(n)}$.
[ISO 3534-1:2006, definition 2.61]
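The density and distribution functions in 2.22 to 2.27 can be evaluated with the Python standard library alone. The sketch below is offered only as a cross-check of the formulas as written above; the function names are assumptions, and parameter validation is kept to the stated ranges.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution (2.22), sigma > 0."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, alpha, beta):
    """Density of the gamma distribution (2.23), x > 0, alpha > 0, beta > 0."""
    return x ** (alpha - 1.0) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))


def exponential_pdf(x, beta):
    """Density of the exponential distribution (2.24), x > 0, beta > 0."""
    return math.exp(-x / beta) / beta


def weibull_cdf(x, theta, beta, kappa):
    """Distribution function of the Weibull distribution (2.25), x > theta."""
    return 1.0 - math.exp(-(((x - theta) / beta) ** kappa))


def lognormal_pdf(x, mu, sigma):
    """Density of the lognormal distribution (2.26), x > 0, sigma > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def gumbel_cdf(x, mu, sigma):
    """Distribution function of the type I extreme-value distribution (2.27)."""
    return math.exp(-math.exp(-(x - mu) / sigma))


# The gamma density with alpha = 1 reduces to the exponential density,
# as stated in Note 1 to 2.23.
print(gamma_pdf(2.0, 1.0, 3.0), exponential_pdf(2.0, 3.0))
```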
3 Symbols
The symbols and abbreviated terms used in this part of ISO 16269 are as follows:
GESD generalized extreme studentized deviate
$G_E$ Greenwood's statistic
$g_{E;n}$ critical value of the Greenwood's test statistic for sample size n
$I_l$ reduced sample of size n − l after removing the most extreme observation $x^{(0)}$ in the original sample $I_0$ of size n, removing the most extreme observation $x^{(1)}$ in the reduced sample $I_1$ of size n − 1, …, and removing the most extreme observation $x^{(l-1)}$ in the reduced sample $I_{l-1}$ of size n − l + 1
$F_{p;\nu_1,\nu_2}$ pth percentile of an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom
$\lambda_l$ critical value of the GESD test in testing whether the value $x^{(l)}$ is an outlier
$L_F$ lower fence of a modified box plot
$U_F$ upper fence of a modified box plot
$M$ or $Q_2$ sample median
$M_{ad}$ median absolute deviation about the median
$Q_1$ first quartile
$Q_3$ third quartile
$R_l$ GESD test statistic for testing whether the value $x^{(l)}$ is an outlier
$s(I_l)$ standard deviation of the reduced sample $I_l$
$T_M$ total median
$T_n$ biweight location estimate from a sample of size n
$T_n^{(i)}$ estimate of $T_n$ at the ith iteration based on a sample of size n
$t_{p;\nu}$ pth percentile of a t-distribution with $\nu$ degrees of freedom
$\chi^2_{p;\nu}$ pth percentile of a chi-square distribution with $\nu$ degrees of freedom
$x_{(i)}$ ith observation in the ordered data set
$x^{(l)}$ most extreme value in the reduced sample $I_l$
$\bar{x}(I_l)$ mean of the reduced sample $I_l$
$\bar{x}_T(\alpha)$ α-trimmed mean
$x_{L:n}$ lower fourth of a box plot for a sample of size n
$x_{U:n}$ upper fourth of a box plot for a sample of size n
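The symbols $I_l$, $x^{(l)}$, $\bar{x}(I_l)$, $s(I_l)$ and $R_l$ describe a sequence of reduced samples. The sketch below shows how such a sequence could be computed. It assumes the usual form of the GESD statistic, $R_l = |x^{(l)} - \bar{x}(I_l)| / s(I_l)$, which is not stated in this excerpt, and it deliberately omits the critical values $\lambda_l$, for which Annex A should be consulted. The function name and example data are assumptions.

```python
from statistics import mean, stdev


def gesd_statistics(values, max_outliers):
    """Return a list of (x_l, R_l) pairs for l = 0, ..., max_outliers - 1.

    At step l the most extreme value x_l of the reduced sample I_l is the
    one furthest from the mean of I_l; it is removed to form I_(l+1).
    Assumed form of the statistic: R_l = |x_l - mean(I_l)| / s(I_l).
    """
    sample = list(values)
    if max_outliers < 1 or len(sample) - max_outliers < 2:
        raise ValueError("max_outliers is too large for this sample size")
    results = []
    for _ in range(max_outliers):
        centre = mean(sample)
        spread = stdev(sample)
        if spread == 0:
            break  # all remaining values are identical; nothing to test
        extreme = max(sample, key=lambda v: abs(v - centre))
        results.append((extreme, abs(extreme - centre) / spread))
        sample.remove(extreme)  # form the next reduced sample
    return results


# Illustrative data only; comparing each R_l with its critical value is
# a separate step not shown here.
data = [10, 11, 9, 10, 12, 8, 10, 50, 11, 9]
gesd_result = gesd_statistics(data, 2)
for x_l, r_l in gesd_result:
    print(f"x = {x_l}, R = {r_l:.3f}")
```

Computing the statistics for a pre-specified maximum number of candidates, rather than stopping at the first non-significant one, is what allows the procedure to cope with masking (2.3).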
4 Outliers in univariate data
4.1 General
4.1.1 What is an outlier?
In the simplest case, an outlier is an observation that appears to be inconsistent with the rest of a given data
set. In general, there may be more than one outlier at one or both ends of the data set. The problem is to
determine whether or not apparently inconsistent observations are in fact outliers. This determination is
performed by means of a pre-specified significance test with respect to a presumed underlying distribution.
Observations that lead to a significant result are deemed to be outliers with respect to that distribution.
The importance of using the correct underlying distribution in an outlier test cannot be over-stressed. Often in
practice, an underlying normal distribution is assumed when the data arise from a different distribution. Such
an erroneous assumption can lead to observations being incorrectly classified as outliers.
4.1.2 What are the causes of outliers?
Outlying observations or outliers typically are attributable to one or more of the following causes (see
Reference [1] for more detail and perspective):
a) Measurement or recording error. The measurements are imprecisely generated, incorrectly observed,
incorrectly recorded, or incorrectly entered into the database.
b) Contamination. The data arise from two or more distributions, i.e. the basic one and one or more
contaminating distributions. If the contaminating distributions have significantly different means, larger
standard deviations and/or heavier tails than the basic distribution, then there is a possibility that extreme
observations coming from the contaminating distributions may appear as outliers in the basic distribution.
NOTE 1 The cause of contamination can be due to sampling error where a small portion of sample data is
inadvertently regarded as having been drawn from a different population than the rest of sample data; or intentional
under- or over-reporting of experiments or sampling surveys.
c) Incorrect distributional assumption. The data set is regarded as drawn from a particular distribution, but it
should have been regarded as drawn from another distribution.
EXAMPLE The data set is regarded as drawn from a normal distribution, but it should have been regarded as
drawn from a highly skewed distribution (e.g. exponential or lognormal) or a symmetric but heavier-tailed distribution
(e.g. a t-distribution). Therefore, observations that deviate far from the central location can be incorrectly labelled as
outliers even though they are valid observations with respect to a highly skewed or heavy-tailed distribution.
d) Rare observations. Highly improbable observations might occur on rare occasions, in samples regarded
as drawn from an assumed probability distribution. These extreme observations are usually incorrectly
labelled as outliers due to their rare occurrence, but they are not truly outliers.
NOTE 2 The occurrence of rare observations when the underlying distribution is symmetric but heavy-tailed may
lead to incorrect distributional assumptions.
4.1.3 Why should outliers be detected?
Outliers are not necessarily bad or erroneous. They can be taken as an indication of the existence of rare
phenomena that could be a reason for further investigation. For example, if an outlier is caused exclusively by
a particular industrial treatment, important discoveries may be made by investigating the cause.
Many statistical techniques and summary statistics are sensitive to the presence of outliers. For example, the
sample mean and sample standard deviation are easily influenced by the presence of even a single outlier
that could subsequently lead to invalid inferences.
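The sensitivity described above can be illustrated with a minimal sketch in Python (standard library only). The measurement values are hypothetical and are not taken from this part of ISO 16269; the last value of `contaminated` imitates a single recording error.

```python
import statistics

# Hypothetical measurements (assumption, for illustration only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
# Same sample, but the last value 10.3 is mistyped as 103.0.
contaminated = clean[:-1] + [103.0]


def summarize(data):
    """Return (sample mean, sample standard deviation, sample median)."""
    return statistics.mean(data), statistics.stdev(data), statistics.median(data)


clean_mean, clean_sd, clean_median = summarize(clean)
bad_mean, bad_sd, bad_median = summarize(contaminated)

print("clean:       ", clean_mean, clean_sd, clean_median)
print("contaminated:", bad_mean, bad_sd, bad_median)
```

In this sketch the single erroneous value roughly doubles the sample mean and inflates the sample standard deviation by more than a factor of ten, while the sample median is unchanged.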
The study of the nature and frequency of outliers in a particular problem can lead to appropriate modifications
of the distributional or model assumptions regarding the data set, and also lead to appropriate choices of
robust methods that can accommodate the presence of possible outliers in subsequent data analyses and
thus result in improved inferences (see Clause 6).
4.2 Data screening
Data screening can begin with a simple visual inspection of the given data set. Simple data plots (dot plot,
scatter diagram, histogram, stem-and-leaf plot, probability plot, box plot or time series plot), or a listing of the
data in non-decreasing order of magnitude, can reveal unanticipated sources of variability and
extreme/outlying data points. For example, a bimodal distribution of a data set revealed by the histogram or
stem-and-leaf plot might be evidence of a contaminated sample or mixture of data regarded as drawn from
two different populations. Probability plots and box plots are recommended for identifying extreme/outlying
data points. These possible outliers can then be further investigated using the methods given in 4.3 or 4.4.
A probability plot not only provides a graphical test of whether the observations, or the majority of the
observations, can be regarded as following an assumed distribution; it also reveals outlying observations in
the data set. Data points that deviate markedly from a straight line fitted by eye to the points on a probability
plot can be considered as possible outliers. Probability plot facilities for a wide range of distributions are
available in proprietary software.
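Where no plotting software is at hand, the coordinates of a normal probability plot can be sketched as follows. This is an illustration only: the plotting position (i − 0,5)/n is one common convention and is an assumption, not a requirement of this part of ISO 16269, and the function name `normal_plot_points` is hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(data):
    """Pair each ordered observation with a standard normal quantile.

    Points that fall far from a straight line through the bulk of the
    returned (quantile, observation) pairs are candidates for outliers.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting position for the ith ordered value: (i - 0.5) / n.
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


for quantile, value in normal_plot_points([3.0, 1.0, 2.0, 2.5, 1.5]):
    print(round(quantile, 3), value)
```

For a different assumed distribution, the standard normal quantile function would be replaced by the quantile function of that distribution.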
The box plot is one of the most popular graphical tools for exploring data. It is useful for displaying the central
location, spread and shape of the distribution of a data set. The lower and upper fences of the box plot are
defined as
$$
\begin{aligned}
\text{lower fence} &= Q_1 - k\,(Q_3 - Q_1) \\
\text{upper fence} &= Q_3 + k\,(Q_3 - Q_1)
\end{aligned}
\qquad (1)
$$
where $Q_1$ and $Q_3$ are the first and third quartiles of the data set and k is a constant value.
Tukey [2] labelled data values that lie outside the lower and upper fences with k = 1,5 as suspected (possible)
outliers, and those that lie outside the fences with k = 3,0 as extreme outliers.
NOTE 1 Probability plotting paper for the normal, exponential, lognormal and Weibull distributions may be obtained at
the time of publication from http://www.weibull.com/GPaper/index.htm.
NOTE 2 The type of probability plot should depend on the distributional assumption of the population. For example, the
exponential probability plot should be used if it is assumed, or there is a priori knowledge, that the data set can be
regarded as drawn from an exponential population.
NOTE 3 A large number of observations may incorrectly be identified as potential outliers by the box plot with its lower
and upper fences defined in Equation (1) when the data set can be regarded as sampled from skewed distributions. The
recommended modified box plot that is able to handle this problem is given in 4.4.
EXAMPLE The dot plot, histogram, box plot and stem-and-leaf plot of the following data values are plotted in
Figures 3 a), 3 b), 3 c) and 3 d), respectively.
0,745 0,883 0,351 0,806 2,908 1,096 1,310 1,261 0,637 1,226
1,418 0,430 1,870 0,543 0,718 1,229 1,312 1,544 0,965 1,034
1,818 1,409 2,773 1,293 0,842 1,469 0,804 2,219 0,892 1,864
1,214 1,093 0,727 1,527 3,463 2,158 1,448 0,725 0,699 2,435
0,724 0,551 0,733 0,793 0,701 1,323 1,067 0,763 1,375 0,763
[Figure 3: a) dot plot of data set; b) histogram of data set (legend: Lognormal, Loc 0,092 59, Scale 0,492 4, N 50); c) box plot of data set; d) stem-and-leaf display of data set, reproduced below]
Stem-and-leaf of data set N = 50
Leaf unit = 0,10
  1   0  3
  4   0  455
 16   0  667777777777
 22   0  888889
 (4)  1  0000
 24   1  222223333
 15   1  444455
  9   1
  9   1  888
  6   2  1
  5   2  2
  4   2  4
  3   2  7
  2   2  9
  1   3
  1   3
  1   3  4
Key
X data set
Y frequency
Figure 3 — Plots of the data set
These plots reveal that the given data set has a longer right tail than left tail. Figures 3 a), 3 b) and 3 d) indicate that its
largest value (3,463) appears to be a potential outlier, whereas the box plot in Figure 3 c) classifies the three largest values
that fall above the upper fence as outliers. The first column of the stem-and-leaf display in Figure 3 d) is called the depth,
the second column contains the stems, and the third column contains the leaves. The rows of the depth column give the
cumulative count of leaves from the top and from the bottom except for the row that contains the median in parentheses.
The leaf unit indicates the position of decimal points. Leaf unit = 0,1 means that the decimal point goes before the leaf, thus
the first number in the display is 0,3, the second and third numbers are 0,4 and 0,5, respectively. (This example is
considered further in 4.3.5.)
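The box plot classification in this example can be checked with a short sketch of Equation (1). The 50 data values are those of the example, written with decimal points; the quartiles follow definitions 2.12 and 2.13 (medians of the lower and upper halves). The function names are hypothetical, and statistical packages that use other quartile definitions may give slightly different fences.

```python
DATA = [
    0.745, 0.883, 0.351, 0.806, 2.908, 1.096, 1.310, 1.261, 0.637, 1.226,
    1.418, 0.430, 1.870, 0.543, 0.718, 1.229, 1.312, 1.544, 0.965, 1.034,
    1.818, 1.409, 2.773, 1.293, 0.842, 1.469, 0.804, 2.219, 0.892, 1.864,
    1.214, 1.093, 0.727, 1.527, 3.463, 2.158, 1.448, 0.725, 0.699, 2.435,
    0.724, 0.551, 0.733, 0.793, 0.701, 1.323, 1.067, 0.763, 1.375, 0.763,
]


def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Q1 and Q3 as the medians of the smallest and largest halves (n >= 2)."""
    xs = sorted(data)
    half = len(xs) // 2  # n/2 for even n, (n - 1)/2 for odd n
    return median(xs[:half]), median(xs[len(xs) - half:])


def fences(data, k):
    """Lower and upper fences of Equation (1) for a given constant k."""
    q1, q3 = quartiles(data)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)


def outside(data, k):
    """Sorted list of values lying outside the fences."""
    lower, upper = fences(data, k)
    return sorted(x for x in data if x < lower or x > upper)


print("Q1, Q3:", quartiles(DATA))
print("outside fences, k = 1.5:", outside(DATA, 1.5))
print("outside fences, k = 3.0:", outside(DATA, 3.0))
```

With k = 1,5 the sketch flags the three largest values (2,773, 2,908 and 3,463), in agreement with the box plot in Figure 3 c); with k = 3,0 no value lies outside the fences.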
4.3 Tests for outliers
4.3.1 General
There are a large number of outlier tests (see Reference [1]). ISO 5725-2 [3] provides the Grubbs and Cochran
tests to identify outlying laboratories that yield unexplained abnormal test results. The Grubbs test is
applicable to individual observations or to the means of sets of data taken from normal distributions, and it can
only be used to detect up to the two largest and/or smallest observations as outliers in the data set. The
testing procedure given in 4.3.2 is more general, being capable of detecting multiple outliers from individual
observations or from the means of sets of data taken from a normal distribution. The procedures given in 4.3.3
and 4.3.4 are capable of detecting multiple outliers for data taken from an exponential, type I extreme-value,
Weibull or gamma distribution. The procedure given in 4.3.5 should be used to detect outliers in samples
regarded as taken from populations with unknown distribution. A test procedure that detects outliers from a
given set of variances evaluated from sets of samples is given in 4.3.6.
4.3.2 Sample from a normal distribution
One or more outliers on either side of a normal data set can be detected by using a procedure known as the
generalized extreme studentized deviate (GESD) many-outlier procedure (see Reference [4]). The GESD
procedure is able to control the Type I error of detecting more than l outliers at a significance level α when
there are l outliers present in the data set (1 ≤ l < m), where m is a prescribed maximum number of outliers.
Before adopting this outlier detection method, it should be verified that the majority of the sample data
approximately follow the normal distribution. The graphical normal probability plot of ISO 5479 [18] can be used
to test the validity of the normality assumption.
Steps to follow when using the GESD many-outlier procedure
Step 1. Plot the given sample data $x_1, x_2, \ldots, x_n$ on normal probability paper. Count the number of points that
appear to deviate significantly from a straight line that fits the remaining data points. This is the
suspected number of outliers.
Step 2. Select a significance level α and prescribe the number of outliers m to be larger than or equal to the
suspected number of outliers from step 1. Start the following steps with l = 0.
Step 3. Compute the test statistic
$$R_l = \frac{\max_{x_i \in I_l} \left| x_i - \bar{x}(I_l) \right|}{s(I_l)} \qquad (2)$$
where
$I_0$ denotes the original sample data set;
$I_l$ denotes the reduced sample of size n − l obtained by deleting the point $x^{(l-1)}$ in $I_{l-1}$ that
yields the value $R_{l-1}$;
$\bar{x}(I_l)$ is the sample mean of the sample $I_l$;
$s(I_l)$ is the standard deviation of the sample $I_l$.
NOTE 1 For the case when l = 0: $\bar{x}(I_0)$ and $s(I_0)$ are the sample mean and sample standard deviation
obtained from the original sample $I_0 = \{x_1, x_2, \ldots, x_n\}$ of size n. When the largest value among the values
$|x_1 - \bar{x}(I_0)|, |x_2 - \bar{x}(I_0)|, \ldots, |x_n - \bar{x}(I_0)|$ is $|x_2 - \bar{x}(I_0)|$ (say), we then have $R_0 = |x_2 - \bar{x}(I_0)|/s(I_0)$ and $x^{(0)} = x_2$.
Subsequently, $I_1 = I_0 \setminus \{x^{(0)}\} = \{x_1, x_3, \ldots, x_n\}$ is the reduced sample of size n − 1 obtained by deleting the data
value $x^{(0)}$, i.e. $x_2$, in $I_0$.
Step 4. Compute the critical value
$$\lambda_l = \frac{(n-l-1)\,t_{p;\,n-l-2}}{\sqrt{\left(n-l-2+t_{p;\,n-l-2}^{2}\right)(n-l)}} \qquad (3)$$
where $p = (1 - \alpha/2)^{1/(n-l)}$ and $t_{p;\nu}$ represents the 100pth percentile of a t-distribution with ν degrees of
freedom. Note that if one has the additional information that the outliers occur only on either the
upper or the lower extreme, substitute α for α/2 in the equation.
Step 5. Set l = l + 1.
Step 6. Repeat step 2 to step 4 until l = m.
Step 7. If $R_l \le \lambda_l$ for all l = 0, 1, 2, …, m, then no outliers are declared. Otherwise, the $n_{\text{out}}$ most extreme
observations $x^{(0)}, x^{(1)}, \ldots, x^{(n_{\text{out}} - 1)}$ in the successively reduced samples are declared as outliers
...
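The computations in steps 3 and 4 of the GESD procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Annex A: the function names are hypothetical, the t-distribution percentile is supplied by the caller through the hypothetical parameter `t_ppf(p, df)` (for example from tables, or from `scipy.stats.t.ppf` if SciPy is available), and the final decision rule of step 7 is left out because its statement is incomplete in the text above.

```python
import math
import statistics


def gesd_statistics(data, m):
    """Return [(R_l, x_l)] for l = 0, 1, ..., m.

    R_l is the test statistic of Equation (2) computed on the reduced
    sample I_l, and x_l is the most extreme value of I_l, which is then
    deleted to form I_(l+1). Requires len(data) - m >= 2.
    """
    sample = list(data)
    results = []
    for _ in range(m + 1):
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        extreme = max(sample, key=lambda x: abs(x - mean))
        results.append((abs(extreme - mean) / sd, extreme))
        sample.remove(extreme)  # delete one occurrence of the extreme value
    return results


def gesd_critical_value(n, l, alpha, t_ppf, two_sided=True):
    """Critical value lambda_l of Equation (3).

    t_ppf(p, df) is an assumed callable returning the 100p-th percentile
    of a t-distribution with df degrees of freedom.
    """
    tail = alpha / 2 if two_sided else alpha
    p = (1 - tail) ** (1 / (n - l))
    df = n - l - 2
    t = t_ppf(p, df)
    return (n - l - 1) * t / math.sqrt((df + t * t) * (n - l))


# Hypothetical sample with one gross value, for illustration only.
for l, (r, x) in enumerate(gesd_statistics([1.0, 2.0, 3.0, 4.0, 100.0], m=1)):
    print(l, round(r, 4), x)
```

Each $R_l$ would then be compared with the corresponding $\lambda_l$ as described in step 7.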
INTERNATIONAL STANDARD ISO 16269-4
First edition 2010-10-15
Statistical interpretation of data —
Part 4:
Detection and treatment of outliers
Interprétation statistique des données —
Partie 4: Détection et traitement des valeurs aberrantes
© ISO 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Symbols
4 Outliers in univariate data
4.1 General
4.1.1 What is an outlier?
4.1.2 What are the causes of outliers?
4.1.3 Why should outliers be detected?
4.2 Data screening
4.3 Tests for outliers
4.3.1 General
4.3.2 Sample from a normal distribution
4.3.3 Sample from an exponential distribution
4.3.4 Samples taken from some known non-normal distributions
4.3.5 Sample taken from unknown distributions
4.3.6 Cochran's test for outlying variance
4.4 Graphical test of outliers
5 Accommodating outliers in univariate data
5.1 Robust data analysis
5.2 Robust estimation of location
5.2.1 General
5.2.2 Trimmed mean
5.2.3 Biweight location estimate
5.3 Robust estimation of dispersion
5.3.1 General
5.3.2 Median-median absolute pair-wise deviation
5.3.3 Biweight scale estimate
6 Outliers in multivariate and regression data
6.1 General
6.2 Outliers in multivariate data
6.3 Outliers in linear regression
6.3.1 General
6.3.2 Linear regression models
6.3.3 Detecting outlying Y observations
6.3.4 Identifying outlying X observations
6.3.5 Detecting influential observations
6.3.6 A robust regression procedure
Annex A (informative) Algorithm for the GESD outliers detection procedure
Annex B (normative) Critical values of outliers test statistics for exponential samples
Annex C (normative) Factor values of the modified box plot
Annex D (normative) Values of the correction factors for the robust estimators of the scale parameter
Annex E (normative) Critical values of Cochran's test statistic
Annex F (informative) A structured guide to detection of outliers in univariate data
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 16269-4 was prepared by Technical Committee ISO/TC 69, Applications of statistical methods.
ISO 16269 consists of the following parts, under the general title Statistical interpretation of data:
⎯ Part 4: Detection and treatment of outliers
⎯ Part 6: Determination of statistical tolerance intervals
⎯ Part 7: Median — Estimation and confidence intervals
⎯ Part 8: Determination of prediction intervals
Introduction
Identification of outliers is one of the oldest problems in interpreting data. Causes of outliers include
measurement error, sampling error, intentional under- or over-reporting of sampling results, incorrect
recording, incorrect distributional or model assumptions of the data set, and rare observations.
Outliers can distort and reduce the information contained in the data source or generating mechanism. In the
manufacturing industry, the existence of outliers will undermine the effectiveness of any process/product
design and quality control procedures. Possible outliers are not necessarily bad or erroneous. In some
situations, an outlier may carry essential information and thus it should be identified for further study.
The study and detection of outliers from measurement processes leads to better understanding of the
processes and proper data analysis that subsequently results in improved inferences.
In view of the enormous volume of literature on the topic of outliers, it is of great importance for the
international community to identify and standardize a sound subset of methods used in the identification and
treatment of outliers. The implementation of this part of ISO 16269 enables business and industry to recognize
the data analyses conducted across member countries or organizations.
Six annexes are provided. Annex A provides an algorithm for computing the test statistic and critical values of
a procedure in detecting outliers in a data set taken from a normal distribution. Annexes B, D and E provide
the tables needed to implement the recommended procedures. Annex C provides the tables and statistical
theory that underlie the construction of modified box plots in outlier detection. Annex F provides a structured
guide and flow chart to the procedures recommended in this part of ISO 16269.
INTERNATIONAL STANDARD ISO 16269-4:2010(E)
Statistical interpretation of data —
Part 4:
Detection and treatment of outliers
1 Scope
This part of ISO 16269 provides detailed descriptions of sound statistical testing procedures and graphical
data analysis methods for detecting outliers in data obtained from measurement processes. It recommends
sound robust estimation and testing procedures to accommodate the presence of outliers.
This part of ISO 16269 is primarily designed for the detection and accommodation of outlier(s) from univariate
data. Some guidance is provided for multivariate and regression data.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
sample
data set
subset of a population made up of one or more sampling units
NOTE 1 The sampling units could be items, numerical values or even abstract entities depending on the population of
interest.
NOTE 2 A sample from a normal (2.22), a gamma (2.23), an exponential (2.24), a Weibull (2.25), a
lognormal (2.26) or a type I extreme value (2.27) population will often be referred to as a normal, a gamma, an
exponential, a Weibull, a lognormal or a type I extreme value sample, respectively.
2.2
outlier
member of a small subset of observations that appears to be inconsistent with the remainder of a given
sample (2.1)
NOTE 1 The classification of an observation or a subset of observations as outlier(s) is relative to the chosen model for
the population from which the data set originates. This or these observations are not to be considered as genuine
members of the main population.
NOTE 2 An outlier may originate from a different underlying population, or be the result of incorrect recording or gross
measurement error.
NOTE 3 The subset may contain one or more observations.
2.3
masking
presence of more than one outlier (2.2), making each outlier difficult to detect
2.4
some-outside rate
probability that one or more observations in an uncontaminated sample will be wrongly classified as
outliers (2.2)
2.5
outlier accommodation method
method that is insensitive to the presence of outliers (2.2) when providing inferences about the population
2.6
resistant estimation
estimation method that provides results that change only slightly when a small portion of the data values in a
data set (2.1) is replaced, possibly with very different data values from the original ones
2.7
robust estimation
estimation method that is insensitive to small departures from assumptions about the underlying probability
model of the data
NOTE An example is an estimation method that works well for, say, a normal distribution (2.22), and remains
reasonably good if the actual distribution is skew or heavy-tailed. Classes of such methods include the L-estimation
[weighted average of order statistics (2.10)] and M-estimation methods (see Reference [9]).
2.8
rank
position of an observed value in an ordered set of observed values
NOTE 1 The observed values are arranged in ascending order (counting from below) or descending order (counting
from above).
NOTE 2 For the purposes of this part of ISO 16269, identical observed values are ranked as if they were slightly
different from one another.
2.9
depth
〈box plot〉 smaller of the two ranks (2.8) determined by counting up from the smallest value of the
sample (2.1), or counting down from the largest value
NOTE 1 The depth may not be an integer value (see Annex C).
NOTE 2 For all summary values other than the median (2.11), a given depth identifies two (data) values, one below
the median and the other above the median. For example, the two data values with depth 1 are the smallest value
(minimum) and largest value (maximum) in the given sample (2.1).
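As an illustration of this definition, the depth of each observation can be computed from its two ranks. The sketch below is a minimal example with a hypothetical function name; it covers only the integer depths of individual observations, not the non-integer depths mentioned in NOTE 1.

```python
def depths(data):
    """Return (value, depth) pairs for the ordered sample.

    The depth is the smaller of the rank counted up from the smallest
    value and the rank counted down from the largest value. Identical
    values are ranked as if they were slightly different (see 2.8).
    """
    xs = sorted(data)
    n = len(xs)
    return [(x, min(rank, n + 1 - rank)) for rank, x in enumerate(xs, start=1)]


print(depths([5, 1, 9, 3, 7]))
```

In this sketch the minimum and maximum both have depth 1, and for an odd sample size the median has the largest depth, (n + 1)/2.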
2.10
order statistic
statistic determined by its ranking in a non-decreasing arrangement of random variables
[ISO 3534-1:2006, definition 1.9]
NOTE 1 Let the observed values of a random sample be $\{x_1, x_2, \ldots, x_n\}$. Reorder the observed values in
non-decreasing order designated as $x_{(1)} \le x_{(2)} \le \ldots \le x_{(k)} \le \ldots \le x_{(n)}$; then $x_{(k)}$ is the observed value of the
kth order statistic in a sample of size n.
NOTE 2 In practical terms, obtaining the order statistics for a sample (2.1) amounts to sorting the data as formally
described in Note 1.
2.11
median
sample median
median of a set of numbers
$Q_2$
[(n + 1)/2]th order statistic (2.10), if the sample size n is odd; sum of the [n/2]th and the [(n/2) + 1]th order
statistics divided by 2, if the sample size n is even
[ISO 3534-1:2006, definition 1.13]
NOTE The sample median is the second quartile ($Q_2$).
2.12
first quartile
sample lower quartile
$Q_1$
for an odd number of observations, median (2.11) of the smallest (n − 1)/2 observed values; for an even
number of observations, median of the smallest n/2 observed values
NOTE 1 There are many definitions in the literature of a sample quartile, which produce slightly different results. This
definition has been chosen both for its ease of application and because it is widely used.
NOTE 2 Concepts such as hinges or fourths (2.19 and 2.20) are popular variants of quartiles. In some cases
(see Note 3 to 2.19), the first quartile and the lower fourth (2.19) are identical.
2.13
third quartile
sample upper quartile
$Q_3$
for an odd number of observations, median of the largest (n − 1)/2 observed values; for an even number of
observations, median of the largest n/2 observed values
NOTE 1 There are many definitions in the literature of a sample quartile, which produce slightly different results. This
definition has been chosen both for its ease of application and because it is widely used.
NOTE 2 Concepts such as hinges or fourths (2.19 and 2.20) are popular variants of quartiles. In some cases
(see Note 3 to 2.20), the third quartile and the upper fourth (2.20) are identical.
2.14
interquartile range
IQR
difference between the third quartile (2.13) and the first quartile (2.12)
NOTE 1 This is one of the widely used statistics to describe the spread of a data set.
NOTE 2 The difference between the upper fourth (2.20) and the lower fourth (2.19) is called the fourth-spread and is
sometimes used instead of the interquartile range.
2.15
five-number summary
the minimum, first quartile (2.12), median (2.11), third quartile (2.13), and maximum
NOTE The five-number summary provides numerical information about the location, spread and range.
2.16
box plot
horizontal or vertical graphical representation of the five-number summary (2.15).
NOTE 1 For the horizontal version, the first quartile (2.12) and the third quartile (2.13) are plotted as the left and
right sides, respectively, of a box, the median (2.11) is plotted as a vertical line across the box, the whiskers stretching
downwards from the first quartile to the smallest value at or above the lower fence (2.17) and upwards from the third
quartile to the largest value at or below the upper fence (2.18), and value(s) beyond the lower and upper fences are
marked separately as outlier(s) (2.2). For the vertical version, the first and third quartiles are plotted as the bottom and the
top, respectively, of a box, the median is plotted as a horizontal line across the box, the whiskers stretching downwards
from the first quartile to the smallest value at or above the lower fence and upwards from the third quartile to the largest
value at or below the upper fence and value(s) beyond the lower and upper fences are marked separately as outlier(s).
NOTE 2 The box width and whisker length of a box plot provide graphical information about the location, spread,
skewness, tail lengths, and outlier(s) of a sample. Comparisons between box plots and the density function of a) uniform,
b) bell-shaped, c) right-skewed, and d) left-skewed distributions are given in the diagrams in Figure 1. In each distribution,
a histogram is shown above the boxplot.
NOTE 3 A box plot constructed with its lower fence (2.17) and upper fence (2.18) evaluated by taking k to be a value
based on the sample size n and the knowledge of the underlying distribution of the sample data is called a modified box
plot (see example, Figure 2). The construction of a modified box plot is given in 4.4.
[Figure 1 panels: a) Uniform distribution; b) Bell-shaped distribution; c) Right-skewed distribution; d) Left-skewed distribution]
Key
X data values
Y frequency
In each distribution, a histogram is shown above the box plot.
Figure 1 — Box plots and histograms for a) uniform, b) bell-shaped, c) right-skewed,
and d) left-skewed distributions
Figure 2 — Modified box plot with lower and upper fences
2.17
lower fence
lower outlier cut-off
lower adjacent value
value in a box plot (2.16) situated k times the interquartile range (2.14) below the first quartile (2.12), with
a predetermined value of k
NOTE In proprietary statistical packages, the lower fence is usually taken to be $Q_1 - k\,(Q_3 - Q_1)$, with k taken to be
either 1,5 or 3,0. Classically, this fence is called the “inner lower fence” when k is 1,5, and the “outer lower fence” when k is
3,0.
2.18
upper fence
upper outlier cut-off
upper adjacent value
value in a box plot situated k times the interquartile range (2.14) above the third quartile (2.13), with a
predetermined value of k
NOTE In proprietary statistical packages, the upper fence is usually taken to be $Q_3 + k\,(Q_3 - Q_1)$, with k taken to be
either 1,5 or 3,0. Classically, this fence is called the “inner upper fence” when k is 1,5, and the “outer upper fence” when k
is 3,0.
2.19
lower fourth
$x_{L:n}$
for a set $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ of observed values, the quantity $0{,}5\,[x_{(i)} + x_{(i+1)}]$ when f = 0 or $x_{(i+1)}$ when f > 0,
where i is the integral part of n/4 and f is the fractional part of n/4
NOTE 1 This definition of a lower fourth is used to determine the recommended values of $k_L$ and $k_U$ given in Annex C
and is the default or optional setting in some widely used statistical packages.
NOTE 2 The lower fourth and the upper fourth (2.20) as a pair are sometimes called hinges.
NOTE 3 The lower fourth is sometimes referred to as the first quartile (2.12).
NOTE 4 When f = 0, 0,5 or 0,75, the lower fourth is identical to the first quartile. For example:
Sample size n   i = integral part of n/4   f = fractional part of n/4   First quartile             Lower fourth
9               2                          0,25                         $[x_{(2)} + x_{(3)}]/2$    $x_{(3)}$
10              2                          0,50                         $x_{(3)}$                  $x_{(3)}$
11              2                          0,75                         $x_{(3)}$                  $x_{(3)}$
12              3                          0                            $[x_{(3)} + x_{(4)}]/2$    $[x_{(3)} + x_{(4)}]/2$
2.20
upper fourth
$x_{U:n}$
for a set $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ of observed values, the quantity $0{,}5\,[x_{(n-i)} + x_{(n-i+1)}]$ when f = 0 or $x_{(n-i)}$
when f > 0, where i is the integral part of n/4 and f is the fractional part of n/4
NOTE 1 This definition of an upper fourth is used to determine the recommended values of $k_L$ and $k_U$ given in Annex C
and is the default or optional setting in some widely used statistical packages.
NOTE 2 The lower fourth (2.19) and the upper fourth as a pair are sometimes called hinges.
NOTE 3 The upper fourth is sometimes referred to as the third quartile (2.13).
NOTE 4 When f = 0, 0,5 or 0,75, the upper fourth is identical to the third quartile. For example:
Sample size n   i = integral part of n/4   f = fractional part of n/4   Third quartile             Upper fourth
9               2                          0,25                         $[x_{(7)} + x_{(8)}]/2$    $x_{(7)}$
10              2                          0,50                         $x_{(8)}$                  $x_{(8)}$
11              2                          0,75                         $x_{(9)}$                  $x_{(9)}$
12              3                          0                            $[x_{(9)} + x_{(10)}]/2$   $[x_{(9)} + x_{(10)}]/2$
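The relationship between quartiles (2.12, 2.13) and fourths (2.19, 2.20) shown in the NOTE 4 tables can be reproduced with a short sketch. The function names are hypothetical; the sample 1, 2, …, n is used so that the kth order statistic equals k and the printed values can be read directly as positions in the ordered sample.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(xs):
    """First and third quartiles of a sorted sample (definitions 2.12 and 2.13)."""
    half = len(xs) // 2  # n/2 for even n, (n - 1)/2 for odd n
    return median(xs[:half]), median(xs[len(xs) - half:])


def fourths(xs):
    """Lower and upper fourths of a sorted sample (definitions 2.19 and 2.20)."""
    n = len(xs)
    i, remainder = divmod(n, 4)  # remainder == 0 means f = 0
    if remainder == 0:
        lower = 0.5 * (xs[i - 1] + xs[i])          # 0,5 [x_(i) + x_(i+1)]
        upper = 0.5 * (xs[n - i - 1] + xs[n - i])  # 0,5 [x_(n-i) + x_(n-i+1)]
    else:
        lower = xs[i]          # x_(i+1)
        upper = xs[n - i - 1]  # x_(n-i)
    return lower, upper


for n in (9, 10, 11, 12):
    sample = list(range(1, n + 1))  # x_(k) = k
    print(n, quartiles(sample), fourths(sample))
```

For n = 9 the quartiles (2,5 and 7,5) differ from the fourths (3 and 7), while for n = 10, 11 and 12 the two pairs coincide, as stated in NOTE 4.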
2.21
Type I error
rejection of the null hypothesis when in fact it is true
[ISO 3534-1:2006, definition 1.46]
NOTE 1 A Type I error is an incorrect decision. Hence, it is desired to keep the probability of making such an incorrect
decision as small as possible.
NOTE 2 It is possible in some situations (for example, testing the binomial parameter p) that a pre-specified
significance level such as 0,05 is not attainable due to discreteness in outcomes.
2.22
normal distribution
Gaussian distribution
continuous distribution having the probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$
where −∞ < x < ∞ and with parameters −∞ < µ < ∞ and σ > 0
[ISO 3534-1:2006, definition 2.50]
NOTE 1 The location parameter µ is the mean and the scale parameter σ is the standard deviation of the normal
distribution.
NOTE 2 A normal sample is a random sample (2.1) taken from a population that follows a normal distribution.
2.23
gamma distribution
continuous distribution having the probability density function
$$f(x) = \frac{x^{\alpha-1}\exp(-x/\beta)}{\beta^{\alpha}\,\Gamma(\alpha)}$$
where x > 0 and parameters α > 0, β > 0
[ISO 3534-1:2006, definition 2.56]
NOTE 1 The gamma distribution is used in reliability applications for modelling time to failure. It includes the
exponential distribution (2.24) as a special case as well as other cases with failure rates that increase with age.
NOTE 2 The mean of the gamma distribution is αβ. The variance of the gamma distribution is $\alpha\beta^2$.
NOTE 3 A gamma sample is a random sample (2.1) taken from a population that follows a gamma distribution.
2.24
exponential distribution
continuous distribution having the probability density function
$$f(x) = \beta^{-1}\exp(-x/\beta)$$
where x > 0 and with parameter β > 0
[ISO 3534-1:2006, definition 2.58]
NOTE 1 The exponential distribution provides a baseline in reliability applications, corresponding to the case of “lack of
ageing” or memory-less property.
NOTE 2 The mean of the exponential distribution is β. The variance of the exponential distribution is $\beta^2$.
NOTE 3 An exponential sample is a random sample (2.1) taken from a population that follows an exponential
distribution.
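As a check on the statement in 2.23 that the gamma distribution includes the exponential distribution as a special case, the following worked step sets α = 1 in the gamma density and uses Γ(1) = 1. It is an illustrative derivation and not part of the cited definitions.

```latex
f(x) = \frac{x^{1-1}\exp(-x/\beta)}{\beta^{1}\,\Gamma(1)}
     = \beta^{-1}\exp(-x/\beta), \qquad x > 0
```

With α = 1 the gamma mean αβ and variance αβ² reduce to β and β², which agrees with the mean and variance of the exponential distribution.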
2.25
Weibull distribution
type III extreme-value distribution
continuous distribution having the distribution function
$$F(x) = 1 - \exp\left\{-\left(\frac{x-\theta}{\beta}\right)^{\kappa}\right\}$$
where x > θ with parameters −∞ < θ < ∞, β > 0, κ > 0
[ISO 3534-1:2006, definition 2.63]
NOTE 1 In addition to serving as one of the three possible limiting distributions of extreme order statistics, the Weibull
distribution occupies a prominent place in diverse applications, particularly reliability and engineering. The Weibull
distribution has been demonstrated to provide usable fits to a variety of data sets.
NOTE 2 The parameter θ is a location or threshold parameter in the sense that it is the minimum value that a Weibull
variate can achieve. The parameter β is a scale parameter (related to the standard deviation of a Weibull variate). The
parameter κ is a shape parameter.
NOTE 3 A Weibull sample is a random sample (2.1) taken from a population that follows a Weibull distribution.
2.26
lognormal distribution
continuous distribution having the probability density function
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left\{-\frac{(\ln x-\mu)^2}{2\sigma^2}\right\}$$
where x > 0 and with parameters −∞ < µ < ∞ and σ > 0
[ISO 3534-1:2006, definition 2.52]
2.27
type I extreme-value distribution
Gumbel distribution
continuous distribution having the distribution function
$$F(x) = \exp\left\{-e^{-(x-\mu)/\sigma}\right\}$$
where −∞ < x < ∞ and with parameters −∞ < µ < ∞ and σ > 0
NOTE Extreme-value distributions provide appropriate reference distributions for the extreme order statistics (2.10)
$x_{(1)}$ and $x_{(n)}$.
[ISO 3534-1:2006, definition 2.61]
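The density and distribution functions defined in 2.22 to 2.27 can be evaluated directly. The following is a minimal Python sketch using only the standard library; the function names are illustrative and not part of this document, and the parameter restrictions stated in each definition (for example x > 0 or x > θ) are assumed to be respected by the caller.

```python
import math

# Illustrative helper names (not from the standard). Arguments follow the
# symbols of definitions 2.22 to 2.27; domain restrictions are not checked.

def normal_pdf(x, mu, sigma):
    """Normal density (2.22)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gamma_pdf(x, alpha, beta):
    """Gamma density (2.23), shape alpha and scale beta, for x > 0."""
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))

def exponential_pdf(x, beta):
    """Exponential density (2.24) with mean beta, for x > 0."""
    return math.exp(-x / beta) / beta

def weibull_cdf(x, theta, beta, kappa):
    """Weibull distribution function (2.25), for x > theta."""
    return 1 - math.exp(-(((x - theta) / beta) ** kappa))

def lognormal_pdf(x, mu, sigma):
    """Lognormal density (2.26), for x > 0."""
    return math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2)) / (x * sigma * math.sqrt(2 * math.pi))

def gumbel_cdf(x, mu, sigma):
    """Type I extreme-value (Gumbel) distribution function (2.27)."""
    return math.exp(-math.exp(-(x - mu) / sigma))
```

With α = 1 the gamma density reduces to the exponential density, which is the special case mentioned in NOTE 1 of 2.23.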
3 Symbols
The symbols and abbreviated terms used in this part of ISO 16269 are as follows:
GESD    generalized extreme studentized deviate
$G_E$    Greenwood's statistic
$g_{E;n}$    critical value of the Greenwood's test statistic for sample size n
$I_l$    reduced sample of size n − l after removing the most extreme observation $x^{(0)}$ in the original sample $I_0$ of size n, removing the most extreme observation $x^{(1)}$ in the reduced sample $I_1$ of size n − 1, …, and removing the most extreme observation $x^{(l-1)}$ in the reduced sample $I_{l-1}$ of size n − l + 1
$F_{p;\nu_1,\nu_2}$    pth percentile of an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom
$\lambda_l$    critical value of the GESD test in testing whether the value $x^{(l)}$ is an outlier
$L_F$    lower fence of a modified box plot
$U_F$    upper fence of a modified box plot
M or $Q_2$    sample median
$M_{ad}$    median absolute deviation about the median
$Q_1$    first quartile
$Q_3$    third quartile
$R_l$    GESD test statistic for testing whether the value $x^{(l)}$ is an outlier
$s(I_l)$    standard deviation of the reduced sample $I_l$
$T_M$    total median
$T_n$    biweight location estimate from a sample of size n
$T_n^{(i)}$    estimate of $T_n$ at the ith iteration based on a sample of size n
$t_{p;\nu}$    pth percentile of a t-distribution with ν degrees of freedom
$\chi^2_{p;\nu}$    pth percentile of a chi-square distribution with ν degrees of freedom
$x_{(i)}$    ith observation in the ordered data set
$x^{(l)}$    most extreme value in the reduced sample $I_l$
$\bar{x}(I_l)$    mean of the reduced sample $I_l$
$\bar{x}_T(\alpha)$    α-trimmed mean
$x_{L:n}$    lower fourth of a box plot for a sample of size n
$x_{U:n}$    upper fourth of a box plot for a sample of size n
4 Outliers in univariate data
4.1 General
4.1.1 What is an outlier?
In the simplest case, an outlier is an observation that appears to be inconsistent with the rest of a given data
set. In general, there may be more than one outlier at one or both ends of the data set. The problem is to
determine whether or not apparently inconsistent observations are in fact outliers. This determination is
performed by means of a pre-specified significance test with respect to a presumed underlying distribution.
Observations that lead to a significant result are deemed to be outliers with respect to that distribution.
The importance of using the correct underlying distribution in an outlier test cannot be over-stressed. Often in
practice, an underlying normal distribution is assumed when the data arise from a different distribution. Such
an erroneous assumption can lead to observations being incorrectly classified as outliers.
4.1.2 What are the causes of outliers?
Outlying observations or outliers typically are attributable to one or more of the following causes (see
Reference [1] for more detail and perspective):
a) Measurement or recording error. The measurements are imprecisely generated, incorrectly observed,
incorrectly recorded, or incorrectly entered into the database.
b) Contamination. The data arise from two or more distributions, i.e. the basic one and one or more
contaminating distributions. If the contaminating distributions have significantly different means, larger
standard deviations and/or heavier tails than the basic distribution, then there is a possibility that extreme
observations coming from the contaminating distributions may appear as outliers in the basic distribution.
NOTE 1 The cause of contamination can be due to sampling error where a small portion of sample data is
inadvertently regarded as having been drawn from a different population than the rest of sample data; or intentional
under- or over-reporting of experiments or sampling surveys.
c) Incorrect distributional assumption. The data set is regarded as drawn from a particular distribution, but it
should have been regarded as drawn from another distribution.
EXAMPLE The data set is regarded as drawn from a normal distribution, but it should have been regarded as
drawn from a highly skewed distribution (e.g. exponential or lognormal) or a symmetric but heavier-tailed distribution
(e.g. a t-distribution). Therefore, observations that deviate far from the central location can be incorrectly labelled as
outliers even though they are valid observations with respect to a highly skewed or heavy-tailed distribution.
d) Rare observations. Highly improbable observations might occur on rare occasions, in samples regarded
as drawn from an assumed probability distribution. These extreme observations are usually incorrectly
labelled as outliers due to their rare occurrence, but they are not truly outliers.
NOTE 2 The occurrence of rare observations when the underlying distribution is symmetric but heavy-tailed may
lead to incorrect distributional assumptions.
4.1.3 Why should outliers be detected?
Outliers are not necessarily bad or erroneous. They can be taken as an indication of the existence of rare
phenomena that could be a reason for further investigation. For example, if an outlier is caused exclusively by
a particular industrial treatment, important discoveries may be made by investigating the cause.
Many statistical techniques and summary statistics are sensitive to the presence of outliers. For example, the
sample mean and sample standard deviation are easily influenced by the presence of even a single outlier
that could subsequently lead to invalid inferences.
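The sensitivity of the sample mean and sample standard deviation to a single outlier can be illustrated with a small sketch. The five measurements below are hypothetical values chosen for illustration (they are not taken from this document); the last one is then mis-keyed by shifting the decimal point.

```python
from statistics import mean, median, stdev

# Hypothetical measurements, for illustration only.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
# Same sample with the last value mis-keyed (10.2 entered as 102).
spoiled = clean[:-1] + [102.0]

for label, sample in (("clean", clean), ("with outlier", spoiled)):
    print(f"{label:>12}: mean={mean(sample):.2f}  "
          f"sd={stdev(sample):.2f}  median={median(sample):.2f}")
```

The single erroneous value moves the mean from about 10 to about 28 and inflates the standard deviation, while the median stays at 10,0. This is the kind of behaviour that motivates the robust methods referred to in Clause 6.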
The study of the nature and frequency of outliers in a particular problem can lead to appropriate modifications
of the distributional or model assumptions regarding the data set, and also lead to appropriate choices of
robust methods that can accommodate the presence of possible outliers in subsequent data analyses and
thus result in improved inferences (see Clause 6).
4.2 Data screening
Data screening can begin with a simple visual inspection of the given data set. Simple data plots, such as dot
plot, scatter diagram, histogram, stem-and-leaf plot, probability plot, box plot, time series plot or arranging
data in non-decreasing order of magnitude, can reveal unanticipated sources of variability and
extreme/outlying data points. For example, a bimodal distribution of a data set revealed by the histogram or
stem-and-leaf plot might be evidence of a contaminated sample or mixture of data regarded as drawn from
two different populations. Probability plots and box plots are recommended for identifying extreme/outlying
data points. These possible outliers can then be further investigated using the methods given in 4.3 or 4.4.
A probability plot not only provides a graphical test of whether the observations, or the majority of the
observations, can be regarded as following an assumed distribution; it also reveals outlying observations in
the data set. Data points that deviate markedly from a straight line fitted by eye to the points on a probability
plot can be considered as possible outliers. Probability plot facilities for a wide range of distributions are
available in proprietary software.
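Where no plotting software is at hand, the coordinates of a normal probability plot can be computed as sketched below. The plotting positions (i − 0,5)/n are an assumed convention (several others are in common use), and the function name is illustrative.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (z_i, x_(i)) pairs for a normal probability plot.

    x_(i) is the ith ordered observation and z_i the standard normal
    quantile at the plotting position (i - 0.5) / n. This plotting
    position is an assumption of the sketch, not a requirement.
    """
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]
```

If the data can be regarded as normal, the points lie close to a straight line; points that depart markedly from the line at either end are the possible outliers. For a lognormal assumption the same function can be applied to the logarithms of the observations.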
The box plot is one of the most popular graphical tools for exploring data. It is useful for displaying the central
location, spread and shape of the distribution of a data set. The lower and upper fences of the box plot are
defined as
$$\text{lower fence} = Q_1 - k\,(Q_3 - Q_1)$$
$$\text{upper fence} = Q_3 + k\,(Q_3 - Q_1) \qquad (1)$$
where $Q_1$ and $Q_3$ are the first and third quartiles of the data set and k is a constant value.
Tukey [2] labelled data values that lie outside the lower and upper fences with k = 1,5 as suspected (possible)
outliers, and those that lie outside the fences with k = 3,0 as extreme outliers.
NOTE 1 Probability plotting paper for the normal, exponential, lognormal and Weibull distributions may be obtained at
the time of publication from http://www.weibull.com/GPaper/index.htm.
NOTE 2 The type of probability plot should depend on the distributional assumption of the population. For example, the
exponential probability plot should be used if it is assumed, or there is a priori knowledge, that the data set can be
regarded as drawn from an exponential population.
NOTE 3 A large number of observations may incorrectly be identified as potential outliers by the box plot with its lower
and upper fences defined in Equation (1) when the data set can be regarded as sampled from skewed distributions. The
recommended modified box plot that is able to handle this problem is given in 4.4.
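The effect described in NOTE 3 can be quantified for the population (large-sample) case. The sketch below computes the probability that a single observation falls outside the fences of Equation (1) when the population quartiles are used, once for a normal and once for an exponential population. It is an illustration under these idealized assumptions, not a finite-sample result, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def outside_fences_normal(k=1.5):
    """P(observation outside the fences) for a normal population."""
    z = NormalDist()
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    upper = q3 + k * (q3 - q1)
    # By symmetry the lower tail contributes the same probability.
    return 2 * (1 - z.cdf(upper))

def outside_fences_exponential(k=1.5):
    """P(observation outside the fences) for an exponential population.

    The result does not depend on beta, so beta = 1 is used: the
    p-quantile is -ln(1 - p) and P(X > x) = exp(-x).
    """
    q1, q3 = math.log(4 / 3), math.log(4)
    lower = q1 - k * (q3 - q1)
    upper = q3 + k * (q3 - q1)
    below = 1 - math.exp(-lower) if lower > 0 else 0.0
    return below + math.exp(-upper)

print(f"normal:      {outside_fences_normal():.4f}")
print(f"exponential: {outside_fences_exponential():.4f}")
```

With k = 1,5 the probability is about 0,7 % for the normal population but about 4,8 % for the exponential one, all of it in the upper tail, so a skewed sample of moderate size is expected to show several flagged points even when no outlier is present.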
EXAMPLE The dot plot, histogram, box plot and stem-and-leaf plot of the following data values are plotted in
Figures 3 a), 3 b), 3 c) and 3 d), respectively.
0,745 0,883 0,351 0,806 2,908 1,096 1,310 1,261 0,637 1,226
1,418 0,430 1,870 0,543 0,718 1,229 1,312 1,544 0,965 1,034
1,818 1,409 2,773 1,293 0,842 1,469 0,804 2,219 0,892 1,864
1,214 1,093 0,727 1,527 3,463 2,158 1,448 0,725 0,699 2,435
0,724 0,551 0,733 0,793 0,701 1,323 1,067 0,763 1,375 0,763
[Figure 3, panels a) to c): images not reproduced]
a) Dot plot of data set
b) Histogram of data set (legend: Lognormal; Loc 0,092 59; Scale 0,492 4; N 50)
c) Box plot of data set
d) Stem-and-leaf display of data set
Stem-and-leaf of data set N = 50
Leaf unit = 0,10
  1   0   3
  4   0   455
 16   0   667777777777
 22   0   888889
 (4)  1   0000
 24   1   222223333
 15   1   444455
  9   1
  9   1   888
  6   2   1
  5   2   2
  4   2   4
  3   2   7
  2   2   9
  1   3
  1   3
  1   3   4
Key
X data set
Y frequency
Figure 3 — Plots of the data set
These plots reveal that the given data set has a longer right tail than left tail. Figures 3 a), 3 b) and 3 d) indicate that its
largest value (3,463) appears to be a potential outlier, whereas the box plot in Figure 3 c) classifies the three largest values
that fall above the upper fence as outliers. The first column of the stem-and-leaf display in Figure 3 d) is called the depth,
the second column contains the stems, and the third column contains the leaves. The rows of the depth column give the
cumulative count of leaves from the top and from the bottom except for the row that contains the median in parentheses.
The leaf unit indicates the position of decimal points. Leaf unit = 0,1 means that the decimal point goes before the leaf, thus
the first number in the display is 0,3, the second and third numbers are 0,4 and 0,5, respectively. (This example is
considered further in 4.3.5.)
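The fences of Equation (1) can be computed for the 50 data values of this example as sketched below. The quartile convention is an assumption: statistical packages define sample quartiles in slightly different ways, and the sketch uses the default ("exclusive") method of Python's `statistics.quantiles`. Decimal commas are written as decimal points in the code.

```python
from statistics import quantiles

# The 50 data values of the EXAMPLE in 4.2 (decimal points instead of commas).
data = [
    0.745, 0.883, 0.351, 0.806, 2.908, 1.096, 1.310, 1.261, 0.637, 1.226,
    1.418, 0.430, 1.870, 0.543, 0.718, 1.229, 1.312, 1.544, 0.965, 1.034,
    1.818, 1.409, 2.773, 1.293, 0.842, 1.469, 0.804, 2.219, 0.892, 1.864,
    1.214, 1.093, 0.727, 1.527, 3.463, 2.158, 1.448, 0.725, 0.699, 2.435,
    0.724, 0.551, 0.733, 0.793, 0.701, 1.323, 1.067, 0.763, 1.375, 0.763,
]

def fences(sample, k=1.5):
    """Lower and upper fences of Equation (1) for the constant k."""
    q1, _, q3 = quantiles(sample, n=4)  # assumed quartile convention
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outside(sample, k):
    """Sorted values lying outside the fences for the constant k."""
    lower, upper = fences(sample, k)
    return sorted(x for x in sample if x < lower or x > upper)

suspected = outside(data, k=1.5)   # possible outliers
extreme = outside(data, k=3.0)     # extreme outliers

print("suspected:", suspected)
print("extreme:  ", extreme)
```

Under this quartile convention the three largest values (2,773, 2,908 and 3,463) lie above the upper fence for k = 1,5, in agreement with the box plot described above, and no value lies outside the fences for k = 3,0.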
4.3 Tests for outliers
4.3.1 General
There are a large number of outlier tests (see Reference [1]). ISO 5725-2 [3] provides the Grubbs and Cochran
tests to identify outlying laboratories that yield unexplained abnormal test results. The Grubbs test is
applicable to individual observations or to the means of sets of data taken from normal distributions, and it can
only be used to detect up to the two largest and/or smallest observations as outliers in the data set. The
testing procedure given in 4.3.2 is more general, being capable of detecting multiple outliers from individual
observations or from the means of sets of data taken from a normal distribution. The procedures given in 4.3.3
and 4.3.4 are capable of detecting multiple outliers for data taken from an exponential, type I extreme-value,
Weibull or gamma distribution. The procedure given in 4.3.5 should be used to detect outliers in samples
regarded as taken from populations with unknown distribution. A test procedure that detects outliers from a
given set of variances evaluated from sets of samples is given in 4.3.6.
4.3.2 Sample from a normal distribution
One or more outliers on either side of a normal data set can be detected by using a procedure known as the
generalized extreme studentized deviate (GESD) many-outlier procedure (see Reference [4]). The GESD
procedure is able to control the Type I error of detecting more than l outliers at a significance level α when
there are l outliers present in the data set (1 ≤ l < m), where m is a prescribed maximum number of outliers.
Before adopting this outlier detection method, it should be verified that the majority of the sample data
approximately follow the normal distribution. The graphical normal probability plot of ISO 5479 [18] can be used
to test the validity of the normality assumption.
Steps to follow when using the GESD many-outlier procedure
Step 1. Plot the given sample data $x_1, x_2, \ldots, x_n$ on normal probability paper. Count the number of points that
appear to deviate significantly from a straight line that fits the remaining data points. This is the
suspected number of outliers.
Step 2. Select a significance level α and prescribe the number of outliers m to be larger than or equal to the
suspected number of outliers from step 1. Start the following steps with l = 0.
Step 3. Compute the test statistic
$$R_l = \frac{\max_{x_i \in I_l} \left| x_i - \bar{x}(I_l) \right|}{s(I_l)} \qquad (2)$$
where
$I_0$ denotes the original sample data set;
$I_l$ denotes the reduced sample of size n − l obtained by deleting the point $x^{(l-1)}$ in $I_{l-1}$ that
yields the value $R_{l-1}$;
$\bar{x}(I_l)$ is the sample mean of the sample $I_l$;
$s(I_l)$ is the standard deviation of the sample $I_l$.
NOTE 1 For the case when l = 0: $\bar{x}(I_0)$ and $s(I_0)$ are the sample mean and sample standard deviation
obtained from the original sample $I_0 = \{x_1, x_2, \ldots, x_n\}$ of size n. When the largest value among the values
$|x_1 - \bar{x}(I_0)|, |x_2 - \bar{x}(I_0)|, \ldots, |x_n - \bar{x}(I_0)|$ is $|x_2 - \bar{x}(I_0)|$ (say), we then have
$R_0 = |x_2 - \bar{x}(I_0)| / s(I_0)$ and $x^{(0)} = x_2$.
Subsequently, $I_1 = I_0 \setminus \{x^{(0)}\} = \{x_1, x_3, \ldots, x_n\}$ is the reduced sample of size n − 1 obtained by deleting the data
value $x^{(0)}$, i.e. $x_2$, in $I_0$.
Step 4. Compute the critical value
$$\lambda_l = \frac{(n-l-1)\, t_{p;n-l-2}}{\sqrt{\left(n-l-2+t_{p;n-l-2}^{2}\right)(n-l)}} \qquad (3)$$
where $p = (1 - \alpha/2)^{1/(n-l)}$ and $t_{p;\nu}$ represents the 100pth percentile of a t-distribution with ν degrees of
freedom. Note that if one has the additional information that the outliers occur only on either the
upper or the lower extreme, substitute α for α/2 in the equation.
Step 5. Set l = l + 1.
Step 6. Repeat step 3 to step 5 until l = m.
Step 7. If $R_l \le \lambda_l$ for all l = 0, 1, 2, …, m, then no outliers are declared. Otherwise, the $n_{out}$ most extreme
observations $x^{(0)}, x^{(1)}, \ldots, x^{(n_{out}-1)}$ in the successively reduced samples are declared as outliers,
where $n_{out} = 1 + \max_{0 \le l \le m} \{ l : R_l > \lambda_l \}$.
A computer algorithm that describes the necessary steps in implementing the GESD many-outlier procedure
is given in Annex A.
NOTE 2 The GESD test is equivalent to the Grubbs test when it is used to test whether the largest or the smallest
outlying observation is an outlier. The critical values of the Grubbs test are given in Table 5 of ISO 5725-2:1994 [3], and can
also be approximated from $\lambda_l$ of step 4 by taking l = 0.
NOTE 3 In practice, the number of outliers m envisaged in the sample should be small. If many outlying observations
are expected in the sample, then it ceases to be an outlier detection problem and different approaches are needed.
However, m should not be too small, otherwise there is a possibility of a masking effect.
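Steps 3 to 7 can be sketched in Python as follows. This is an illustration only and is not the algorithm of Annex A. Three assumptions are made: the test is two-sided with α = 0,05; the loop tests l = 0, 1, …, m − 1, so that at most m observations can be declared outliers (one reading of steps 6 and 7); and the t-distribution percentile is obtained by a simple numerical routine (Simpson's rule plus bisection) that stands in for a library function. The data are the 20 observations of the EXAMPLE in this subclause, and m = 3 is chosen here for illustration.

```python
import math
from statistics import mean, stdev

def t_cdf(t, nu, steps=2000):
    """Student t distribution function for t >= 0 (Simpson's rule)."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    pdf = lambda u: c * (1 + u * u / nu) ** (-(nu + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3

def t_ppf(p, nu):
    """Percentile t_{p;nu} for p > 0.5, found by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def gesd(data, m, alpha=0.05):
    """Return the observations declared outliers, most extreme first."""
    sample = list(data)
    n = len(sample)
    removed, n_out = [], 0
    for l in range(m):  # assumed range: l = 0, ..., m - 1
        centre, s = mean(sample), stdev(sample)
        extreme = max(sample, key=lambda x: abs(x - centre))
        r_l = abs(extreme - centre) / s                      # Equation (2)
        p = (1 - alpha / 2) ** (1 / (n - l))
        t = t_ppf(p, n - l - 2)
        lam_l = (n - l - 1) * t / math.sqrt((n - l - 2 + t * t) * (n - l))  # Equation (3)
        removed.append(extreme)
        sample.remove(extreme)
        if r_l > lam_l:
            n_out = l + 1  # keeps the largest l with R_l > lambda_l
    return removed[:n_out]

observations = [
    -2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30,
    0.43, 0.51, 0.64, 0.67, 0.93, 1.22, 1.35, 1.73, 5.80, 12.6,
]
print(gesd(observations, m=3))
```

Under these assumptions the sketch declares the two mis-entered values, 12,6 and 5,80, as outliers. Because every l up to the chosen maximum is tested before the count is fixed, an early non-significant result does not stop the procedure, which is how the masking effect mentioned in NOTE 3 is countered.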
EXAMPLE Consider a data set of 20 observations:
−2,21 −1,84 −0,95 −0,91 −0,36 −0,19 −0,11 −0,10 0,18 0,30
0,43 0,51 0,64 0,67 0,93 1,22 1,35 1,73 5,80 12,6
where the latter two observations were originally 0,58 and 1,26, but the decimal commas were entered at the wrong place.
In detecting outliers using the GESD procedure, we shall first verify that the given observations are taken from a normal
distribution. The data points of the normal probability plot given in Figure 4 a) app
...













