Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance

SIGNIFICANCE AND USE
5.1 Guidance is provided on designing model evaluation performance procedures and on the difficulties that arise in statistical evaluation of model performance caused by the stochastic nature of dispersion in the atmosphere. It is recognized there are examples in the literature where, knowingly or unknowingly, models were evaluated on their ability to describe something which they were never intended to characterize. This guide is attempting to heighten awareness, and thereby, to reduce the number of “unknowing” comparisons. A goal of this guide is to stimulate development and testing of evaluation procedures that accommodate the effects of natural variability. A technique is illustrated to provide information from which subsequent evaluation and standardization can be derived.
SCOPE
1.1 This guide provides techniques that are useful for the comparison of modeled air concentrations with observed field data. Such comparisons provide a means for assessing a model's performance, for example, bias and precision or uncertainty, relative to other candidate models. Methodologies for such comparisons are yet evolving; hence, modifications will occur in the statistical tests and procedures and data analysis as work progresses in this area. Until the interested parties agree upon standard testing protocols, differences in approach will occur. This guide describes a framework, or philosophical context, within which one determines whether a model's performance is significantly different from other candidate models. It is suggested that the first step should be to determine which model's estimates are closest on average to the observations, and the second step would then test whether the differences seen in the performance of the other models are significantly different from the model chosen in the first step. An example procedure is provided in Appendix X1 to illustrate an existing approach for a particular evaluation goal. This example is not intended to inhibit alternative approaches or techniques that will produce equivalent or superior results. As discussed in Section 6, statistical evaluation of model performance is viewed as part of a larger process that collectively is referred to as model evaluation.  
1.2 This guide has been designed with flexibility to allow expansion to address various characterizations of atmospheric dispersion, which might involve dose or concentration fluctuations, to allow development of application-specific evaluation schemes, and to allow use of various statistical comparison metrics. No assumptions are made regarding the manner in which the models characterize the dispersion.  
1.3 The focus of this guide is on end results, that is, the accuracy of model predictions and the discernment of whether differences seen between models are significant, rather than operational details such as the ease of model implementation or the time required for model calculations to be performed.  
1.4 This guide offers an organized collection of information or a series of options and does not recommend a specific course of action. This guide cannot replace education or experience and should be used in conjunction with professional judgment. Not all aspects of this guide may be applicable in all circumstances. This guide is not intended to represent or replace the standard of care by which the adequacy of a given professional service must be judged, nor should it be applied without consideration of a project's many unique aspects. The word “Standard” in the title of this guide means only that the document has been approved through the ASTM consensus process.  
1.5 This standard applies to gaussian plume models; it may not be applicable to non-point sources, heavy gas models from evaporation from pool (for example, liquid spills), as well as near-field receptors.  
1.6 The values stated in SI units are to be regarded as standard. No other units of measurement are included in this guide...

General Information

Status
Published
Publication Date
31-Aug-2023
Technical Committee
D22 - Air Quality
Drafting Committee
D22.11 - Meteorology

Relations

Effective dates of related editions: 01-Sep-2020, 15-Mar-2020, 15-Oct-2015, 01-Jul-2015, 01-Dec-2014, 01-May-2014, 15-Jan-2014, 01-Apr-2010, 01-May-2005, 10-Nov-2000

Overview

ASTM D6589-23: Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance is a guide developed by ASTM International to assist professionals in the objective and statistically rigorous evaluation of atmospheric dispersion model performance. The guide provides a framework for comparing modeled air concentrations with observed field data and offers methods to assess critical aspects such as prediction bias, precision, and uncertainty. Emphasizing the importance of understanding the stochastic nature of atmospheric dispersion, the standard helps reduce misapplication of models and fosters the development of effective evaluation methods that accommodate natural variability.

This guide is designed to be flexible, supporting a wide range of atmospheric dispersion characterizations and enabling users to develop application-specific evaluation schemes using various statistical metrics. It is particularly applicable to Gaussian plume models and provides practical value by standardizing the process of statistical evaluation in air quality modeling.

Key Topics

  • Model Evaluation Framework: The standard outlines a systematic approach for assessing whether an atmospheric dispersion model’s performance is significantly different from other candidate models. The process begins with identifying which model best matches observational data on average, then statistically testing the significance of differences among models.
  • Statistical Comparison Methods: Guidance is provided on paired and unpaired statistical comparison metrics, such as mean bias, root mean square error (RMSE), fractional bias, regression analysis, and robust highest concentration (RHC).
  • Bootstrap Resampling: The guide introduces bootstrap methods to quantify the uncertainty in statistical performance metrics, ensuring the significance of findings despite data variability.
  • Handling Variability and Uncertainty: It addresses the challenges presented by natural variability and measurement uncertainties, encouraging evaluation methods that account for inherent randomness in atmospheric dispersion.
  • Evaluation Objectives: Users are advised to define clear evaluation objectives (e.g., maximum concentration, plume spread) that align with the models’ intended capabilities and available data.
  • Model Selection Awareness: The guide cautions against evaluating models on phenomena they were not designed to represent, helping prevent misinformed conclusions.
  • Documentation and Peer Review: Emphasis is placed on the necessity of peer review and thorough documentation throughout the evaluation process.
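The comparison metrics named above (mean bias, RMSE, fractional bias) can be sketched in a few lines. This is an illustrative example only, not part of the standard: the concentration data are invented, and the helper names `mean_bias`, `rmse`, and `fractional_bias` follow the usual textbook definitions rather than any formula prescribed by D6589.

```python
# Illustrative sketch of common operational comparison metrics for
# modeled (pred) versus observed (obs) concentrations. Data are invented.
import math

def mean_bias(obs, pred):
    """Mean of (predicted - observed); zero indicates no average bias."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error: a combined measure of bias and scatter."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def fractional_bias(obs, pred):
    """FB = 2*(mean_obs - mean_pred)/(mean_obs + mean_pred), bounded in [-2, 2]."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    return 2.0 * (mo - mp) / (mo + mp)

obs     = [12.0, 30.5, 8.2, 55.1, 20.3]   # observed concentrations (e.g., in µg/m³)
model_a = [10.5, 28.0, 9.1, 60.2, 18.9]   # hypothetical candidate model A
model_b = [15.2, 41.0, 6.0, 70.3, 27.5]   # hypothetical candidate model B

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, round(mean_bias(obs, pred), 2),
          round(rmse(obs, pred), 2), round(fractional_bias(obs, pred), 3))
```

In the two-step framework described above, metrics like these serve the first step: identifying which candidate model is, on average, closest to the observations.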
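The bootstrap idea can likewise be sketched: resample (observation, model A, model B) triples with replacement, recompute a performance measure for each model, and examine the spread of the difference. This is a minimal illustration under assumed data and an assumed choice of RMSE as the measure; the standard's own procedure (Appendix X1) is more elaborate and should be consulted for actual evaluations.

```python
# Hedged sketch of bootstrap resampling for judging whether a difference in
# model performance is significant. Data and the RMSE measure are assumptions
# for illustration, not the procedure prescribed by D6589.
import math
import random

def rmse(obs, pred):
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def bootstrap_diff(obs, pred_a, pred_b, n_boot=2000, seed=42):
    """Bootstrap samples of RMSE(A) - RMSE(B), resampling events jointly
    so that each resampled observation keeps both of its model estimates."""
    rng = random.Random(seed)
    triples = list(zip(obs, pred_a, pred_b))
    diffs = []
    for _ in range(n_boot):
        sample = [rng.choice(triples) for _ in triples]
        o, a, b = zip(*sample)
        diffs.append(rmse(o, a) - rmse(o, b))
    return diffs

obs    = [12.0, 30.5, 8.2, 55.1, 20.3, 14.8, 9.9, 41.2]
pred_a = [10.5, 28.0, 9.1, 60.2, 18.9, 13.0, 11.2, 44.0]
pred_b = [15.2, 41.0, 6.0, 70.3, 27.5, 20.1, 5.5, 55.9]

diffs = sorted(bootstrap_diff(obs, pred_a, pred_b))
lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
# If this interval excludes zero, the observed performance difference is
# unlikely to be explained by sampling variability alone.
print(f"95% interval for RMSE(A) - RMSE(B): [{lo:.2f}, {hi:.2f}]")
```

Joint resampling of the triples is the key design choice: it preserves the pairing between each observation and both models' estimates, which is what makes the difference in metrics comparable across resamples.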

Applications

ASTM D6589-23 is essential for professionals involved in air quality modeling and atmospheric simulations, including:

  • Model Developers and Analysts: Use the guide to compare air dispersion model outputs with field measurements, assess reliability, and refine model algorithms.
  • Regulatory Agencies: Apply standardized statistical evaluation procedures to judge model adequacy for policy-making, permitting, or compliance demonstration.
  • Environmental Consultants: Enhance the credibility of dispersion modeling in site assessments, risk analyses, and environmental impact statements by following consistent evaluation protocols.
  • Researchers: Advance methodological development in model evaluation, contributing to the evolution of air quality model assessment best practices.

The standard enables sound, transparent, and repeatable model performance assessments, ensuring that important decisions regarding air quality and public health are based on robust scientific evidence.

Related Standards

  • ASTM D1356: Terminology Relating to Sampling and Analysis of Atmospheres - Provides essential definitions for terms used in atmospheric measurements.
  • Other Atmospheric Dispersion Modeling Standards: Such as those covering model input definition, field measurement procedures, and model validation protocols.
  • International Standards for Air Quality Modeling: Guidance may be referenced alongside regional or international frameworks for modeling and evaluation, supporting harmonized best practices.

ASTM D6589-23 represents a key resource for achieving statistically valid, reliable, and comparable results in atmospheric dispersion model performance assessments. Its focus on methodology rather than prescribing a single solution makes it adaptable across many scientific and regulatory settings.

Buy Documents

Guide

ASTM D6589-23 - Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance

English language (18 pages)
Guide

REDLINE ASTM D6589-23 - Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance

English language (18 pages)


Frequently Asked Questions

ASTM D6589-23 is a guide published by ASTM International. Its full title is "Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance". The standard covers statistical techniques for comparing modeled air concentrations with observed field data; its Significance and Use and Scope statements are reproduced in full at the top of this page.

ASTM D6589-23 is classified under the following ICS (International Classification for Standards) categories: 13.040.20 - Ambient atmospheres. The ICS classification helps identify the subject area and facilitates finding related standards.

ASTM D6589-23 has the following relationships with other standards: it has inter-standard links to ASTM D1356-20a, ASTM D1356-20, ASTM D1356-15a, ASTM D1356-15, ASTM D1356-14b, ASTM D1356-14a, ASTM D1356-14, ASTM D1356-05(2010), ASTM D1356-05, and ASTM D1356-00a. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

ASTM D6589-23 is available in PDF format for immediate download after purchase. The document can be added to your cart and obtained through the secure checkout process. Digital delivery ensures instant access to the complete standard document.

Standards Content (Sample)


This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the
Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: D6589 − 23
Standard Guide for
Statistical Evaluation of Atmospheric Dispersion Model
Performance
This standard is issued under the fixed designation D6589; the number immediately following the designation indicates the year of
original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A
superscript epsilon (´) indicates an editorial change since the last revision or reapproval.
1. Scope operational details such as the ease of model implementation or
the time required for model calculations to be performed.
1.1 This guide provides techniques that are useful for the
1.4 This guide offers an organized collection of information
comparison of modeled air concentrations with observed field
or a series of options and does not recommend a specific course
data. Such comparisons provide a means for assessing a
of action. This guide cannot replace education or experience
model’s performance, for example, bias and precision or
and should be used in conjunction with professional judgment.
uncertainty, relative to other candidate models. Methodologies
Not all aspects of this guide may be applicable in all circum-
for such comparisons are yet evolving; hence, modifications
stances. This guide is not intended to represent or replace the
will occur in the statistical tests and procedures and data
standard of care by which the adequacy of a given professional
analysis as work progresses in this area. Until the interested
service must be judged, nor should it be applied without
parties agree upon standard testing protocols, differences in
consideration of a project’s many unique aspects. The word
approach will occur. This guide describes a framework, or
“Standard” in the title of this guide means only that the
philosophical context, within which one determines whether a
document has been approved through the ASTM consensus
model’s performance is significantly different from other
process.
candidate models. It is suggested that the first step should be to
determine which model’s estimates are closest on average to
1.5 This standard applies to gaussian plume models; it may
the observations, and the second step would then test whether
not be applicable to non-point sources, heavy gas models from
the differences seen in the performance of the other models are
evaporation from pool (for example, liquid spills), as well as
significantly different from the model chosen in the first step.
near-field receptors.
An example procedure is provided in Appendix X1 to illustrate
1.6 The values stated in SI units are to be regarded as
an existing approach for a particular evaluation goal. This
standard. No other units of measurement are included in this
example is not intended to inhibit alternative approaches or
guide.
techniques that will produce equivalent or superior results. As
1.7 This standard does not purport to address all of the
discussed in Section 6, statistical evaluation of model perfor-
safety concerns, if any, associated with its use. It is the
mance is viewed as part of a larger process that collectively is
responsibility of the user of this standard to establish appro-
referred to as model evaluation.
priate safety, health, and environmental practices and deter-
1.2 This guide has been designed with flexibility to allow
mine the applicability of regulatory limitations prior to use.
expansion to address various characterizations of atmospheric
1.8 This international standard was developed in accor-
dispersion, which might involve dose or concentration
dance with internationally recognized principles on standard-
fluctuations, to allow development of application-specific
ization established in the Decision on Principles for the
evaluation schemes, and to allow use of various statistical
Development of International Standards, Guides and Recom-
comparison metrics. No assumptions are made regarding the
mendations issued by the World Trade Organization Technical
manner in which the models characterize the dispersion.
Barriers to Trade (TBT) Committee.
1.3 The focus of this guide is on end results, that is, the
2. Referenced Documents
accuracy of model predictions and the discernment of whether
2.1 ASTM Standards:
differences seen between models are significant, rather than
D1356 Terminology Relating to Sampling and Analysis of
Atmospheres
This guide is under the jurisdiction of ASTM Committee D22 on Air Quality
and is the direct responsibility of Subcommittee D22.11 on Meteorology. For referenced ASTM standards, visit the ASTM website, www.astm.org, or
Current edition approved Sept. 1, 2023. Published September 2023. Originally contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM
approved in 2000. Last previous edition approved in 2015 as D6589 – 05 (2015). Standards volume information, refer to the standard’s Document Summary page on
DOI: 10.1520/D6589-23. the ASTM website.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
D6589 − 23
3. Terminology 4. Summary of Guide
4.1 Statistical evaluation of dispersion model performance
3.1 Definitions—For definitions of terms used in this guide,
with field data is viewed as part of a larger process that
refer to Terminology D1356.
collectively is called model evaluation. Section 6 discusses the
3.2 Definitions of Terms Specific to This Standard:
components of model evaluation.
3.2.1 atmospheric dispersion model, n—an idealization of
4.2 To statistically assess model performance, one must
atmospheric physics and processes to calculate the magnitude
define an overall evaluation goal or purpose. This will suggest
and location of pollutant concentrations based on fate,
features (evaluation objectives) within the observed and mod-
transport, and dispersion in the atmosphere. This may take the
eled concentration patterns to be compared, for example,
form of an equation, algorithm, or series of equations/
maximum surface concentrations, lateral extent of a dispersing
algorithms used to calculate average or time-varying concen-
plume. The selection and definition of evaluation objectives
tration. The model may involve numerical methods for solu-
typically are tailored to the model’s capabilities and intended
tion.
uses. The very nature of the problem of characterizing air
3.2.2 dispersion, absolute, n—the characterization of the
quality and the way models are applied make one single or
spreading of material released into the atmosphere based on a
absolute evaluation objective impossible to define that is
coordinate system fixed in space.
suitable for all purposes. The definition of the evaluation
objectives will be restricted by the limited range conditions
3.2.3 dispersion, relative, n—the characterization of the
experienced in the available comparison data suitable for use.
spreading of material released into the atmosphere based on a
For each evaluation objective, a procedure will need to be
coordinate system that is relative to the local median position
defined that allows definition of the evaluation objective from
of the dispersing material.
the available observations of concentration values.
3.2.4 evaluation objective, n—a feature or characteristic,
4.3 In assessing the performance of air quality models to
which can be defined through an analysis of the observed
characterize a particular evaluation objective, one should
concentration pattern, for example, maximum centerline con-
consider what the models are capable of providing. As dis-
centration or lateral extent of the average concentration pattern
cussed in Section 7, most models attempt to characterize the
as a function of downwind distance, which one desires to
ensemble average concentration pattern. If such models should
assess the skill of the models to reproduce.
provide favorable comparisons with observed concentration
3.2.5 evaluation procedure, n—the analysis steps to be
maxima, this is resulting from happenstance, rather than skill in
taken to compute the value of the evaluation objective from the
the model; therefore, in this discussion, it is suggested a model
observed and modeled patterns of concentration values.
be assessed on its ability to reproduce what it was designed to
3.2.6 fate, n—the destiny of a chemical or biological pol-
produce, for at least in these comparisons, one can be assured
lutant after release into the environment. that zero bias with the least amount of scatter is by definition
good model performance.
3.2.7 model input value, n—characterizations that must be
estimated or provided by the model developer or user before 4.4 As an illustration of the principles espoused in this
model calculations can be performed. guide, a procedure is provided in Appendix X1 for comparison
of observed and modeled near-centerline concentration values,
3.2.8 regime, n—a repeatable narrow range of conditions,
which accommodates the fact that observed concentration
defined in terms of model input values, which may or may not
values include a large component of stochastic, and possibly
be explicitly employed by all models being tested, needed for
deterministic, variability unaccounted for by current models.
dispersion model calculations. It is envisioned that the disper-
The procedure provides an objective statistical test of whether
sion observed should be similar for all cases having similar
differences seen in model performance are significant.
model input values.
3.2.9 uncertainty, n—refers to a lack of knowledge about
5. Significance and Use
specific factors or parameters. This includes measurement
5.1 Guidance is provided on designing model evaluation
errors, sampling errors, systematic errors, and differences
performance procedures and on the difficulties that arise in
arising from simplification of real-world processes. In
statistical evaluation of model performance caused by the
principle, uncertainty can be reduced with further information
stochastic nature of dispersion in the atmosphere. It is recog-
or knowledge (1).
nized there are examples in the literature where, knowingly or
3.2.10 variability, n—refers to differences attributable to
unknowingly, models were evaluated on their ability to de-
true heterogeneity or diversity in atmospheric processes that
scribe something which they were never intended to charac-
result in part from natural random processes. Variability
terize. This guide is attempting to heighten awareness, and
usually is not reducible by further increases in knowledge, but
thereby, to reduce the number of “unknowing” comparisons. A
it can in principle be better characterized (1).
goal of this guide is to stimulate development and testing of
evaluation procedures that accommodate the effects of natural
variability. A technique is illustrated to provide information
from which subsequent evaluation and standardization can be
The boldface numbers in parentheses refer to the list of references at the end of
this standard. derived.
D6589 − 23
6. Model Evaluation

6.1 Background—Air quality simulation models have been used for many decades to characterize the transport and dispersion of material in the atmosphere (2-4). Early evaluations of model performance usually relied on linear least-squares analyses of observed versus modeled values, using traditional scatter plots of the values (5-7). During the 1980s, attempts were made to encourage the standardization of methods used to judge air quality model performance (8-11). Further development of these proposed statistical evaluation procedures was needed, as it was found that the rote application of statistical metrics, such as those listed in (8), was incapable of discerning differences in model performance (12), whereas if the evaluation results were sorted by stability and distance downwind, then differences in modeling skill could be discerned (13). It was becoming increasingly evident that the models were characterizing only a small portion of the observed variations in the concentration values (14). To better deduce the statistical significance of differences seen in model performance in the face of large unaccounted-for uncertainties and variations, investigators began to explore the use of bootstrap techniques (15). By the late 1980s, most model performance evaluations involved the use of bootstrap techniques in the comparison of maximum values of modeled and observed cumulative frequency distributions of the concentration values (16). Even though the procedures and metrics to be employed in describing the performance of air quality simulation models are still evolving (17-19), there has been general acceptance that defining the performance of air quality models needs to address the large uncertainties inherent in attempting to characterize atmospheric fate, transport, and dispersion processes. There also has been a consensus reached on the philosophical reasons that models of earth science processes can never be validated, in the sense of claiming that a model truthfully represents natural processes. No general empirical proposition about the natural world can be certain, since there will always remain the prospect that future observations may call the theory into question (20). Numerical models of air pollution are thus a form of highly complex scientific hypothesis concerning natural processes, one that can be confirmed through comparison with observations, but never validated.

6.2 Components of Model Evaluation—A model evaluation includes science peer reviews and statistical evaluations with field data. The completion of each of these components assumes specific model goals and evaluation objectives (see Section 10) have been defined.

6.3 Science Peer Reviews—Given the complexity of characterizing atmospheric processes, and the inevitable necessity of limiting model algorithms to a resolvable set, one component of a model evaluation is to review the model's science to confirm that the construct is reasonable and defensible for the defined evaluation objectives. A key part of the science peer review is the review of residual plots, where modeled and observed evaluation objectives are compared over a range of model inputs, for example, maximum concentrations as a function of estimated plume rise or as a function of distance downwind.

6.4 Statistical Evaluations with Field Data—The objective comparison of modeled concentrations with observed field data provides a means for assessing model performance. Due to the limited supply of evaluation data sets, there are severe practical limits in assessing model performance. For this reason, the conclusions reached in the science peer reviews (see 6.3) and the supportive analyses (see 6.5) have particular relevance in deciding whether a model can be applied for the defined model evaluation objectives. In order to conduct a statistical comparison, one will have to define one or more evaluation objectives for which objective comparisons are desired (Section 10). As discussed in 8.4.4, the process of summarizing the overall performance of a model over the range of conditions experienced within a field experiment typically involves determining two points for each of the model evaluation objectives: which of the models being assessed has on average the smallest combined bias and scatter in comparisons with observations, and whether the differences seen in the comparisons with the other models are statistically significant in light of the uncertainties in the observations.

6.5 Other Tasks Supportive to Model Evaluation—As atmospheric dispersion models become more sophisticated, it is not easy to detect coding errors in the implementation of the model algorithms. And as models become more complex, discerning the sensitivity of the modeling results to input parameter variations becomes less clear; hence, two important tasks that support model evaluation efforts are verification of software and sensitivity and Monte Carlo analyses.

6.5.1 Verification of Software—Often a set of modeling algorithms will require numerical solution. An important task supportive to a model evaluation is a review in which the mathematics described in the technical description of the model are compared with the numerical coding, to ensure that the code faithfully implements the physics and mathematics.

6.5.2 Sensitivity and Monte Carlo Analyses—Sensitivity and Monte Carlo analyses provide insight into the response of a model to input variation. An example of this technique is to systematically vary one or more of the model inputs to determine the effect on the modeling results (21). Each input should be varied over a reasonable range likely to be encountered. The traditional sensitivity studies (21) were developed to better understand the performance of plume dispersion models simulating the transport and dispersion of inert pollutants. For characterization of the effects of input uncertainties on modeling results, Monte Carlo studies with simple random sampling are recommended (22), especially for models simulating chemically reactive species where there are strong nonlinear couplings between the model input and output (23). Results from sensitivity and Monte Carlo analyses provide useful guidance on which inputs should be most carefully prescribed, because they account for the greatest sensitivity in the modeling output. These analyses also provide a view of what to expect for model output in conditions for which data are not available.
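As an illustration of 6.5.2, the sketch below first varies one input at a time and then applies Monte Carlo sampling with simple random sampling. The `plume_concentration` function is a toy stand-in for the model under test, and its inputs and ranges are invented for the example; they are not taken from this guide.

```python
import math
import random

def plume_concentration(q, u, sigma_y, sigma_z):
    # Illustrative ground-level centerline estimate for a surface
    # release: C = q / (pi * u * sigma_y * sigma_z). A stand-in model.
    return q / (math.pi * u * sigma_y * sigma_z)

# One-at-a-time sensitivity: vary wind speed over a plausible range
# while holding the other inputs fixed.
base = {"q": 10.0, "u": 5.0, "sigma_y": 50.0, "sigma_z": 20.0}
for u in [1.0, 2.0, 5.0, 10.0]:
    c = plume_concentration(base["q"], u, base["sigma_y"], base["sigma_z"])
    print(f"u = {u:5.1f} m/s -> C = {c:.3e}")

# Monte Carlo with simple random sampling: perturb several inputs at
# once to see the combined effect of input uncertainty on the output.
random.seed(1)
samples = []
for _ in range(1000):
    u = random.uniform(2.0, 8.0)                      # assumed wind-speed range
    sy = base["sigma_y"] * random.uniform(0.5, 1.5)   # assumed +/-50 % dispersion
    sz = base["sigma_z"] * random.uniform(0.5, 1.5)
    samples.append(plume_concentration(base["q"], u, sy, sz))

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"Monte Carlo mean = {mean:.3e}, std = {var ** 0.5:.3e}")
```

The spread of the Monte Carlo output gives a direct view of how input uncertainty propagates into the modeled concentrations.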
7. A Framework for Model Evaluations

7.1 This section introduces a philosophical model for explaining how and why observations of physical processes and model simulations of physical processes differ. It is argued that observations are individual realizations, which in principle can be envisioned as belonging to some ensemble. Most of the current models attempt to characterize the average concentration for each ensemble, but models are under development that attempt to characterize the distribution of concentration values within an ensemble. Having this framework for describing how and why observations differ from model simulations has important ramifications in how one assesses and describes a model's ability to reproduce what is seen by way of observations. This framework provides a rigorous basis for designing the statistical comparison of modeling results with observations.
7.2 The concept of "natural variability" acknowledges that the details of the stochastic concentration field resulting from dispersion are difficult to predict. In this context, the difference between the ensemble average and any one observed realization (experimental observation) is ascribed to natural variability, whose variation, σ_n, can be expressed as:

  \sigma_n^2 = \overline{(C_o - \bar{C}_o)^2}    (1)

where:
C_o = the observed concentration (or evaluation objective, see 10.3) seen within a realization; the overbars represent averages over all realizations within a given ensemble, so that C̄_o is the estimated ensemble average. The "o" subscript indicates an observed value.

7.2.1 The ensemble in Eq 1 refers to the ideal infinite population of all possible realizations meeting the (fixed) characteristics associated with an ensemble. In practice, one will have only a small sample from this ensemble.

7.2.2 Measurement uncertainty in concentration values in most tracer experiments may be a small fraction of the measurement threshold, and when this is true its contribution to σ_n can usually be deemed negligible; however, as discussed in 9.2 and 9.4, expert judgment is needed, as the reliability and usefulness of field data will vary depending on the intended uses being made of the data.

7.3 Defining the characteristics of the ensemble in Eq 1 using the model's input values, α, one can view the observed concentrations (or evaluation objective) as:

  C_o = C_o(\alpha, \beta) = \bar{C}_o(\alpha) + c(\Delta c) + c(\alpha, \beta)    (2)

where:
β are the variables needed to describe the unresolved transport and dispersion processes; the overbar represents an average over all possible values of β for the specified set of model input parameters α; c(Δc) represents the effects of measurement uncertainty, and c(α,β) represents ignorance in β (unresolved deterministic processes and stochastic fluctuations) (14, 24).

7.3.1 Since C̄_o(α) is an average over all β, it is only a function of α, and in this context, C̄_o(α) represents the ensemble average that the model ideally is attempting to characterize.

7.3.2 The modeled concentrations, C_m, can be envisioned as:

  C_m = \bar{C}_o(\alpha) + d(\Delta\alpha) + f(\alpha)    (3)

where:
d(Δα) represents the effects of uncertainty in specifying the model inputs, and f(α) represents the effects of errors in the model formulations. The "m" subscript indicates a modeled value.

7.3.3 A method for performing an evaluation of modeling skill is to separately average the observations and modeling results over a series of non-overlapping limited ranges of α, which are called "regimes." Averaging the observations provides an empirical estimate of what most of the current models are attempting to simulate, C̄_o(α). A comparison of the respective observed and modeled averages over a series of α-groups provides an empirical estimate of the combined deterministic error associated with input uncertainty and formulation errors.

7.3.4 This process is not without problems. The variance in observed concentration values due to natural variability is of the order of the magnitude of the regime averages (17, 25); hence, small sample sizes in the groups will lead to large uncertainties in the estimates of the ensemble averages. The variance in modeled concentration values due to input uncertainty can be quite large (22, 23); hence, small sample sizes in the groups will lead to large uncertainties in the estimates of the deterministic error in each group. Grouping data together for analysis requires large data sets, of which there are few.

7.3.5 The observations and the modeling results come from different statistical populations, whose means are, for an unbiased model, the same. The variance seen in the observations results from differences in realizations of the ensemble averages, which the model is attempting to characterize, plus an additional variance caused by stochastic variations between individual realizations, which is not accounted for in the modeling.

7.3.6 As the averaging time increases in the concentration values and corresponding evaluation objectives, one might expect the respective variances in the observations and the modeling results to increasingly reflect variations in ensemble averages. As averaging time increases, one might expect the variance in the concentration values and corresponding evaluation objectives to decrease; however, as averaging time increases, the magnitude of the concentration values also decreases. As averaging time increases, it is possible that the modeling uncertainties may yet be large when compared to the average modeled concentration values, and likewise, the unexplained variations in the observations may yet be large when compared to the average observed concentration values.
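The regime averaging described in 7.3.3 can be sketched as follows. The regime labels and paired values below are hypothetical, serving only to show the grouping of observed and modeled values by limited ranges of the model inputs α and the comparison of group means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical paired records: (regime_key, observed, modeled). The
# regime key stands for a limited range of model inputs alpha, for
# example a stability class combined with a downwind-distance bin.
records = [
    ("stable/near", 12.0, 10.5), ("stable/near", 9.5, 11.0),
    ("stable/near", 14.2, 12.1), ("unstable/far", 3.1, 4.0),
    ("unstable/far", 2.4, 3.2), ("unstable/far", 2.9, 2.5),
]

obs_by_regime = defaultdict(list)
mod_by_regime = defaultdict(list)
for regime, obs, mod in records:
    obs_by_regime[regime].append(obs)
    mod_by_regime[regime].append(mod)

# The observed regime mean is the empirical estimate of the ensemble
# average; the observed-minus-modeled difference in each regime
# estimates the combined deterministic error for that regime.
for regime in sorted(obs_by_regime):
    o_bar = mean(obs_by_regime[regime])
    m_bar = mean(mod_by_regime[regime])
    print(f"{regime:14s} obs mean = {o_bar:6.2f}  "
          f"mod mean = {m_bar:6.2f}  diff = {o_bar - m_bar:+.2f}")
```

With only three values per regime, the group means carry large sampling uncertainty, which is exactly the small-sample caution raised in 7.3.4.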
7.4 It is recommended that one goal of a model evaluation should be to assess the model's skill in predicting what it was intended to characterize, namely C̄_o(α), which can be viewed as the systematic (deterministic) variation of the observations from one regime to the next. In such comparisons, there is a basis for believing that a well-formulated model would have zero bias for all regimes. The model with the smallest deviations on average from the regime averages would be the best performing model. One always has the privilege to test the ability of a model to simulate something it was not intended to provide, such as the ability of a deterministic model to provide an accurate characterization of extreme maximum values, but then one must realize that a well-formulated model may appear to do poorly. If one selects as the best performing model the model having the least bias and scatter when compared with observed maxima, this may favor selection of models that systematically overestimate the ensemble average by a compensating bias to underestimate the lateral dispersion. Such a model may provide good comparisons with short-term observed maxima, but it likely will not perform well for estimating maximum impacts for longer averaging times. By assessing the performance of a model to simulate something it was not intended to provide, there is a risk of selecting poorly-formed models that may by happenstance perform well on the few experiments available for testing. These are judgment decisions that model users will make based on the anticipated uses and needs of the modeling results. This guide has served its purpose if users better realize the ramifications that arise in testing a model's performance to simulate something that it was not intended to characterize.

8. Statistical Comparison Metrics and Methods

8.1 The preceding section described a philosophical framework for understanding why observations differ from model simulation results. This section provides definitions of the comparison metrics and methods most often employed in current air quality model evaluations. This discussion is not meant to be exhaustive. The list of possible metrics is extensive (8), but it has been illustrated that a few well-chosen, simple-to-understand metrics can provide adequate characterization of a model's performance (14). The key is not in how many metrics are used, but in the statistical design used when the metrics are applied (13).

8.2 Paired Statistical Comparison Metrics—In the following equations, O_i is used to represent the observed evaluation objective, and P_i is used to represent the corresponding model's estimate of the evaluation objective, where the evaluation objective, as explained in 10.3, is some feature that can be defined through the analysis of the concentration field. In the equations, the subscript "i" refers to paired values and the overbar indicates an average.

8.2.1 Average bias, d, and standard deviation of the bias, σ_d, are:

  d = \bar{d}_i    (4)

  \sigma_d^2 = \overline{(d_i - \bar{d})^2}    (5)

where:
d_i = (P_i − O_i).

8.2.2 Fractional bias, FB, and standard deviation of the fractional bias, σ_FB, are:

  FB = \overline{FB_i}    (6)

  \sigma_{FB}^2 = \overline{(FB_i - \overline{FB})^2}    (7)

where FB_i = 2(P_i − O_i)/(P_i + O_i).

8.2.3 Absolute fractional bias, AFB, and standard deviation of the absolute fractional bias, σ_AFB, are:

  AFB = \overline{AFB_i}    (8)

  \sigma_{AFB}^2 = \overline{(AFB_i - \overline{AFB})^2}    (9)

where AFB_i = 2|P_i − O_i|/(P_i + O_i).

8.2.4 As a measure of gross error resulting from both bias and scatter, the root mean squared error, RMSE, is often used:

  RMSE = \sqrt{\overline{(P_i - O_i)^2}}    (10)

8.2.5 As another measure of gross error resulting from both bias and scatter, the normalized mean squared error, NMSE, is often used:

  NMSE = \overline{(P_i - O_i)^2} \,/\, (\bar{P}\,\bar{O})    (11)

The advantage of the NMSE over the RMSE is that the normalization allows comparisons between experiments with vastly different average values. The disadvantage of the NMSE versus the RMSE is that uncertainty in the observation of low concentration values will make the value of the NMSE so uncertain that meaningful conclusions may be precluded from being reached.

8.2.6 For a scatter plot, where the predictions are plotted along the horizontal x-axis and the observations are plotted along the vertical y-axis, the linear regression (method of least squares) slope, m, and intercept, b, between the predicted and observed values are:

  m = \frac{N \sum P_i O_i - (\sum P_i)(\sum O_i)}{N \sum P_i^2 - (\sum P_i)^2}    (12)

  b = \frac{(\sum O_i)(\sum P_i^2) - (\sum P_i O_i)(\sum P_i)}{N \sum P_i^2 - (\sum P_i)^2}    (13)

8.2.7 As a measure of the linear correlation between the predicted and observed values, the Pearson correlation coefficient is often used:

  r = \frac{\sum (P_i - \bar{P})(O_i - \bar{O})}{\left[\sum (P_i - \bar{P})^2 \cdot \sum (O_i - \bar{O})^2\right]^{1/2}}    (14)
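The paired metrics of 8.2.1-8.2.7 can be computed directly from Eq 4-14, as in the following sketch; the paired values are fabricated for illustration only.

```python
import math

P = [1.2, 2.5, 3.1, 4.8, 5.0]   # modeled values (illustrative)
O = [1.0, 2.0, 3.5, 4.0, 5.5]   # observed values (illustrative)
N = len(P)

def mean(xs):
    return sum(xs) / len(xs)

d_i = [p - o for p, o in zip(P, O)]
d = mean(d_i)                                            # Eq 4, average bias
sigma_d = math.sqrt(mean([(x - d) ** 2 for x in d_i]))   # Eq 5

fb_i = [2 * (p - o) / (p + o) for p, o in zip(P, O)]
FB = mean(fb_i)                                          # Eq 6, fractional bias

afb_i = [2 * abs(p - o) / (p + o) for p, o in zip(P, O)]
AFB = mean(afb_i)                                        # Eq 8

RMSE = math.sqrt(mean([(p - o) ** 2 for p, o in zip(P, O)]))             # Eq 10
NMSE = mean([(p - o) ** 2 for p, o in zip(P, O)]) / (mean(P) * mean(O))  # Eq 11

# Eq 12 and 13: least-squares slope and intercept, predictions on x.
sp, so = sum(P), sum(O)
spo = sum(p * o for p, o in zip(P, O))
spp = sum(p * p for p in P)
m = (N * spo - sp * so) / (N * spp - sp ** 2)
b = (so * spp - spo * sp) / (N * spp - sp ** 2)

# Eq 14: Pearson correlation coefficient.
pbar, obar = mean(P), mean(O)
num = sum((p - pbar) * (o - obar) for p, o in zip(P, O))
den = math.sqrt(sum((p - pbar) ** 2 for p in P) *
                sum((o - obar) ** 2 for o in O))
r = num / den

print(f"d={d:.3f} FB={FB:.3f} AFB={AFB:.3f} RMSE={RMSE:.3f} NMSE={NMSE:.3f}")
print(f"slope={m:.3f} intercept={b:.3f} r={r:.3f}")
```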
8.3 Unpaired Statistical Comparison Metrics—If the observed and modeled values are sorted from highest to lowest, there are several statistical comparisons that are commonly employed. The focus in such comparisons usually is on whether the maximum observed and modeled concentration values are similar, but one can substitute for the word "concentration" any evaluation objective that can be expressed numerically. As discussed in 7.3.5, the direct comparison of individual observed realizations with modeled ensemble averages is the comparison of two different statistical populations with different sources of variance; hence, there are fundamental philosophical problems with such comparisons. As mentioned in 7.4, such comparisons are going to be made, as this may be how the modeling results will be used. At best, one can hope that such comparisons are made by individuals who are cognizant of the philosophical problems involved.

8.3.1 The quantile-quantile plot is constructed by plotting the ranked concentration values against one another, for example, highest concentration observed versus the highest concentration modeled, etc. If the observed and modeled concentration frequency distributions are similar, then the plotted values will lie along the 1:1 line on the plot. By visual
inspection, one can easily see if the respective distributions are similar and whether the observed and modeled concentration maximum values are similar.

8.3.2 Cumulative frequency distribution plots are constructed by plotting the ranked concentration values (highest to lowest) against the plotting position frequency, f (typically in percent), where ρ is the rank (1 = highest), N is the number of values, and f is defined as (26):

  f = 100\,\% \,(\rho - 0.4)/N, \quad \text{for } \rho < N/2    (15)

  f = 100\,\% - 100\,\% \,(N - \rho + 0.6)/N, \quad \text{for } \rho > N/2    (16)

As with the quantile-quantile plot, a visual inspection of the respective cumulative frequency distribution plots (observed and modeled) usually is sufficient to suggest whether the two distributions are similar, and whether there is a bias in the model to over- or under-estimate the maximum concentration values observed.

8.3.3 The Robust Highest Concentration (RHC) often is used where comparisons are being made of the maximum concentration values, and is envisioned as a more robust test statistic than direct comparison of maximum values. The RHC is based on an exponential fit to the highest R−1 values of the cumulative frequency distribution, where R typically is set to be 26 for frequency distributions involving a year's worth of values (averaging times of 24 h or less) (16). The RHC is computed as:

  RHC = C(R) + \Theta \,\ln\!\left(\frac{3R - 1}{2}\right)    (17)

where:
Θ = average of the R−1 largest values minus C(R), and
C(R) = the Rth largest value.

NOTE 1—The value of R may be set to a lower value when there are fewer values in the distribution to work with, see (16). The RHC of the observed and modeled cumulative frequency distributions are often compared using a FB metric, and may or may not involve stratification of the values by meteorological condition prior to computation of the RHC values.
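A minimal sketch of the RHC of Eq 17 follows. The concentration series is fabricated, and R = 5 is used only to keep the example small; as noted above, R typically is 26 for a year of values, and Note 1 permits smaller R for smaller samples.

```python
import math

def robust_highest_concentration(values, R):
    """Eq 17: RHC = C(R) + Theta * ln((3R - 1)/2), where C(R) is the
    R-th largest value and Theta is the mean of the R-1 largest values
    minus C(R)."""
    ranked = sorted(values, reverse=True)
    c_R = ranked[R - 1]                                # R-th largest value
    theta = sum(ranked[: R - 1]) / (R - 1) - c_R       # mean of top R-1, minus C(R)
    return c_R + theta * math.log((3 * R - 1) / 2)

# Illustrative data: 20 made-up concentration values.
concentrations = [3.0, 7.5, 1.2, 9.8, 4.4, 6.1, 2.3, 8.7, 5.5, 0.9,
                  3.8, 7.1, 2.9, 6.6, 4.0, 1.7, 5.2, 8.1, 3.3, 2.0]
print(f"RHC = {robust_highest_concentration(concentrations, R=5):.2f}")
```

Because the RHC extrapolates from the upper tail rather than taking the single largest value, it is less sensitive to one unrepresentative extreme observation.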
8.4 Bootstrap Resampling—Bootstrap sampling can be used to generate estimates of the sampling error in the statistical metric computed (15, 16, 27). The distributions of some statistical metrics, for example, the RMSE and RHC, are not necessarily easily transformed to a normal distribution, which is desirable when performing statistical tests to see if there are statistically significant differences in the values computed, for example, in the comparison of RHC values computed from the 8760 values of 1-h observed and modeled concentration values for a year.

8.4.1 Following the description provided by (27), suppose one is analyzing a data set x_1, x_2, …, x_n, which for convenience is denoted by the vector x = (x_1, x_2, …, x_n). A bootstrap sample x* = (x_1*, x_2*, …, x_n*) is obtained by randomly sampling n times, with replacement, from the original data points x = (x_1, x_2, …, x_n). For instance, with n = 7 one might obtain x* = (x_5, x_7, x_5, x_4, x_7, x_3, x_1). From each bootstrap sample one can compute some statistic (say the median, average, RHC, etc.). By creating a number of bootstrap samples, B, one can compute the mean, s̄, and standard deviation, σ_s, of the statistic of interest. For estimation of standard errors, B typically is on the order of 50 to 500.

8.4.2 The bootstrap resampling procedure often can be improved by blocking the data into two or more blocks or sets, with each block containing data having similar characteristics. This prevents the possibility of creating an unrealistic bootstrap sample where all the members are the same value (15).

8.4.3 When performing model performance evaluations, for each hour there are not only the observed concentration values, but also the modeling results from all the models being tested. In such cases, the individual members, x_i, in the vector x = (x_1, x_2, …, x_n) are in themselves vectors, composed of the observed value and its associated modeling results (from all models, if there are more than one); thus the selection of the observed concentration x_i also includes each model's estimate for this case. This is called "concurrent sampling." The purpose of concurrent sampling is to preserve correlations inherent in the data (16). These temporal and spatial correlations affect the statistical properties of the data samples. One of the considerations in devising a bootstrap sampling procedure is to address how best to preserve inherent correlations that might exist within the data.

8.4.4 For assessing differences in model performance, one often wishes to test whether the difference seen in a performance metric computed between Model No. 1 and the observations (say RMSE_1) is significant when compared to that computed for another model (say Model No. 2, RMSE_2) using the same observations. For testing whether the difference between statistical metrics is significant, the following procedure is recommended. Let each bootstrap sample be denoted x*_b, where the asterisk indicates a bootstrap sample (8.4.1) and b indicates that this is sample "b" of a series of bootstrap samples (where the total number of bootstrap samples is B). From each bootstrap sample, x*_b, one computes the respective values RMSE_1,b and RMSE_2,b. The difference Δ*_b = RMSE_1,b − RMSE_2,b then can be computed. Once all B samples have been processed, compute from the set of B values Δ* = (Δ*_1, Δ*_2, …, Δ*_B) the average and standard deviation, Δ̄ and σ_Δ. The null hypothesis is that Δ̄ is equal to zero; given a stated level of confidence, η, the t-value for use in a Student's t-test is:

  t = \bar{\Delta} / \sigma_\Delta    (18)

For illustration purposes, assume the level of confidence is 90 % (η = 0.1). Then, for large values of B, if the t-value from Eq 18 is larger than the Student's-t value t_{η/2}, equal to 1.645, it can be concluded with 90 % confidence that Δ̄ is not equal to zero, and hence, there is a significant difference in the RMSE values for the two models being tested.
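The 8.4.4 procedure, with concurrent sampling of whole records as in 8.4.3, can be sketched as follows. The observed values and the two models' estimates are fabricated, with Model No. 2 deliberately biased and noisier so the test has something to detect.

```python
import math
import random

random.seed(42)

# Fabricated concurrent data: each record pairs an observation with the
# estimates from two models for the same case (concurrent sampling, 8.4.3).
obs  = [random.uniform(1.0, 10.0) for _ in range(200)]
mod1 = [o + random.gauss(0.0, 1.0) for o in obs]   # model 1: unbiased
mod2 = [o + random.gauss(1.0, 1.5) for o in obs]   # model 2: biased, noisier
data = list(zip(obs, mod1, mod2))

def rmse(pairs):
    return math.sqrt(sum((p - o) ** 2 for o, p in pairs) / len(pairs))

B = 500
deltas = []
for _ in range(B):
    # One bootstrap sample: n draws with replacement of whole records,
    # so both models are evaluated on the same resampled cases.
    sample = [random.choice(data) for _ in range(len(data))]
    rmse1 = rmse([(o, m1) for o, m1, m2 in sample])
    rmse2 = rmse([(o, m2) for o, m1, m2 in sample])
    deltas.append(rmse1 - rmse2)

d_bar = sum(deltas) / B
sigma = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (B - 1))
t = d_bar / sigma
print(f"mean delta = {d_bar:.3f}, t = {t:.2f}")
if abs(t) > 1.645:   # 90 % confidence, large B
    print("RMSE difference between the two models is significant")
```

Resampling whole records, rather than observations and model estimates independently, is what preserves the correlations that concurrent sampling is designed to protect.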
9. Considerations in Performing Statistical Evaluations

9.1 Evaluation of the performance of a model mostly is constrained by the amount and quality of observational data
available for comparison with modeling results. Simulation models are capable of providing estimates for a larger set of conditions than those for which observational data exist. Furthermore, most models do not provide estimates of directly measurable quantities. For instance, even if a model provides an estimate of the concentration at a specific location, it is most likely an estimate of an ensemble-average result which has an implied averaging time and, for grid models, represents an average over some volume of air, for example, a grid average; hence, in establishing what abilities of the model are to be tested, one must first consider whether sufficient observational data are available that can provide, either directly or through analysis, observations of what is being modeled.
9.2 Understanding Observed Concentrations:
9.2.1 It is not necessary for a user of concentration observations to know or understand all details of how the observations were made, but some fundamental understanding of the sampler limitations (operational range), background concentration value(s), and stochastic nature of the atmosphere is necessary for developing effective evaluation procedures.
FIG. 1 Illustration of Effects of Natural Variability on Crosswind Profiles of a Plume Dispersing Downwind (Grouped in a Relative Dispersion Context)

9.2.2 All samplers have a detection threshold below which observed values either are not provided, or are considered suspect. It is possible that there is a natural background of the tracer, which either has been subtracted from the observations, or needs to be considered in using the observations. Data collected under a quality assurance program following consensus standards are more credible in most settings than data whose quality cannot be objectively documented. Some samplers have a saturation point which limits the maximum value that can be observed. The user of concentration observations should address these, as needed, in designing the evaluation procedures.

9.2.3 Atmospheric transport and dispersion processes include stochastic components. The transport downwind follows a serpentine path, being influenced by both random and periodic wind oscillations, composed of both large and small scale eddies in the wind field. Fig. 1 illustrates the observed concentrations seen along a sampling arc at 50 m downwind and centered on a near-surface point-source release of sulfur dioxide during Project Prairie Grass (28). Fig. 1 is a summary over all 70 experiments. For each experiment the crosswind receptor positions, y, relative to the observed center of mass along the arc have been divided by σ_y, which is the second moment of the concentration values seen along each arc, that is, the lateral dispersion, which is a measure of the lateral extent of the plume. The observed concentration values have been divided by C_max = C^Y/(σ_y √(2π)), where C^Y is the crosswind integrated concentration along the arc. The crosswind integrated concentration is a measure of the vertical dilution the plume has experienced in traveling to this downwind position. To assume that the crosswind concentration distribution follows a Gaussian curve, which is implicit in the relationship used to compute C_max, is seen to be a reasonable approximation when all the experimental results are combined. As shown by the results for Experiment 31, a Gaussian profile may not apply that well for any one realization, where random effects occurred, even though every attempt was made to collect data under nearly ideal circumstances. Under less ideal conditions, as with emissions from a large industrial power plant stack of order 75 m in height and a buoyant plume rise of order 100 m above the stack, it is easy to understand that the observed lateral profile for individual experimental results might well vary from the ideal Gaussian shape. It must be recognized that features like double peaks, saw-tooth patterns, and other irregular behavior are often observed for individual realizations.

9.3 Understanding the Models to be Evaluated:

9.3.1 As in other branches of meteorology, a complete set of equations for the characterization of the transport and fate of material dispersing through the atmosphere is so complex that no unique analytical solution is known. Approximate analytical principles, such as mass balance, are frequently combined with other concepts to allow study of a particular situation (29). Before evaluating a model, the user must have a sufficient understanding of the basis for the model and its operation to know what it was intended to characterize. The user must know whether the model provides volume-average concentration estimates, or whether the model provides average concentration estimates for specific positions above the ground. The user must know whether the characterizations of transport, dispersion, formation, and removal processes are expressed using equations that provide ensemble average estimates of concentration values, or whether the equations and relationships used provide stochastic estimates of concentration values. Answers to these and like questions are necessary when attempting to define the evaluation objectives (10.3).

9.3.2 A mass balance model tracks material entering and leaving a particular air volume. Within this conceptual framework, concentrations are increased by emissions that occur within the defined volume and by transport from other adjacent volumes. Similarly, concentrations are decreased by transport exiting the volume, by removal by chemical/physical sinks within the volume, for example, wet and dry
deposition, or, for reactive species, by conversion to other forms. These relationships can be specified through a differential equation quantifying factors related to material gain or loss (29). Models of this type typically provide ensemble volume-average concentration values as a function of time. One will have to consult the model documentation in order to know whether the concentration values reported are averaged over some period of time, such as 1 h, or are the volume-average values at the end of time periods, such as at the end of each hour of simulation.

9.3.3 Some models are entirely empirical. A common example (30) involves analysis and characterization of the concentration distributions using measurements under different conditions across a variety of collection sites. Empirical models are, strictly speaking, only applicable to the range of measurement conditions upon which they were developed.

9.3.4 Most atmospheric transport and dispersion models involve the combination of theoretical and empirical parameterizations of the physical processes (31); therefore, even though theoretical models may be suitable to a wide range of applications in principle, they are limited to the physical processes characterized, and to the inherent limitations of empirically derived relationships embedded within them.

9.3.5 Generally speaking, as model complexity grows in terms of temporal and spatial detail, the task of supplying appropriate inputs becomes more demanding. It is not a given that increasing the complexity in the treatment of the transport and fate of dispersing material will provide less uncertain

estimation for the selected data sets; determine the required levels of temporal detail, for example, minute-by-minute or hour-by-hour, and spatial detail, for example, vertical or horizontal variation in the meteorological conditions, for the models to be evaluated, as well as the existence and variations of other sources of the same material within the modeling domain; ensure that the samplers are sufficiently close to one another and in sufficient numbers for definition of the evaluation objectives; and, find or collect appropriate data for estimation of the model inputs and for comparison with model outputs.

9.4.3 In principle, the information required for the evaluation process includes not only measured atmospheric concentrations but also measurements of all model inputs. Model inputs typically include: emission release characteristics (physical stack height, stack exit diameter, pollutant exit temperature and velocity, emission rate), mass and size distribution of particulate emissions, upwind and downwind fetch characteristics, for example, land-cover, surface roughness length, daytime and nighttime mixing heights, and surface-layer stability. In practice, since suitable data for all the required model inputs are rarely, if ever, available, one resorts to one or more of the following alternatives: compress the level of temporal and spatial detail for model application to that for which suitable data can be obtained; provide best estimates for model inputs, recognizing the limitations imposed by this particular approach; or, collect the additional data required to enable proper estimation of inputs. A n
...


Designation: D6589 − 23
Standard Guide for
Statistical Evaluation of Atmospheric Dispersion Model
Performance
This standard is issued under the fixed designation D6589; the number immediately following the designation indicates the year of
original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A
superscript epsilon (´) indicates an editorial change since the last revision or reapproval.
1. Scope
1.1 This guide provides techniques that are useful for the comparison of modeled air concentrations with observed field data. Such
comparisons provide a means for assessing a model’s performance, for example, bias and precision or uncertainty, relative to other
candidate models. Methodologies for such comparisons are yet evolving; hence, modifications will occur in the statistical tests and
procedures and data analysis as work progresses in this area. Until the interested parties agree upon standard testing protocols,
differences in approach will occur. This guide describes a framework, or philosophical context, within which one determines
whether a model’s performance is significantly different from other candidate models. It is suggested that the first step should be
to determine which model’s estimates are closest on average to the observations, and the second step would then test whether the
differences seen in the performance of the other models are significantly different from the model chosen in the first step. An
example procedure is provided in Appendix X1 to illustrate an existing approach for a particular evaluation goal. This example
is not intended to inhibit alternative approaches or techniques that will produce equivalent or superior results. As discussed in
Section 6, statistical evaluation of model performance is viewed as part of a larger process that collectively is referred to as model
evaluation.
1.2 This guide has been designed with flexibility to allow expansion to address various characterizations of atmospheric
dispersion, which might involve dose or concentration fluctuations, to allow development of application-specific evaluation
schemes, and to allow use of various statistical comparison metrics. No assumptions are made regarding the manner in which the
models characterize the dispersion.
1.3 The focus of this guide is on end results, that is, the accuracy of model predictions and the discernment of whether differences
seen between models are significant, rather than operational details such as the ease of model implementation or the time required
for model calculations to be performed.
1.4 This guide offers an organized collection of information or a series of options and does not recommend a specific course of
action. This guide cannot replace education or experience and should be used in conjunction with professional judgment. Not all
aspects of this guide may be applicable in all circumstances. This guide is not intended to represent or replace the standard of care
by which the adequacy of a given professional service must be judged, nor should it be applied without consideration of a project’s
many unique aspects. The word “Standard” in the title of this guide means only that the document has been approved through the
ASTM consensus process.
1.5 This standard applies to Gaussian plume models; it may not be applicable to non-point sources, to heavy-gas dispersion such as evaporation from a pool (for example, liquid spills), or to near-field receptors.
This guide is under the jurisdiction of ASTM Committee D22 on Air Quality and is the direct responsibility of Subcommittee D22.11 on Meteorology.
Current edition approved Sept. 1, 2023. Published September 2023. Originally approved in 2000. Last previous edition approved in 2015 as D6589 – 05 (2015)ε1. DOI: 10.1520/D6589-23.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
1.6 The values stated in SI units are to be regarded as standard. No other units of measurement are included in this guide.
1.7 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility
of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of
regulatory limitations prior to use.
1.8 This international standard was developed in accordance with internationally recognized principles on standardization
established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued
by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Referenced Documents
2.1 ASTM Standards:
D1356 Terminology Relating to Sampling and Analysis of Atmospheres
3. Terminology
3.1 Definitions—For definitions of terms used in this guide, refer to Terminology D1356.
3.2 Definitions of Terms Specific to This Standard:
3.2.1 atmospheric dispersion model, n—an idealization of atmospheric physics and processes to calculate the magnitude and
location of pollutant concentrations based on fate, transport, and dispersion in the atmosphere. This may take the form of an
equation, algorithm, or series of equations/algorithms used to calculate average or time-varying concentration. The model may
involve numerical methods for solution.
3.2.2 dispersion, absolute, n—the characterization of the spreading of material released into the atmosphere based on a coordinate
system fixed in space.
3.2.3 dispersion, relative, n—the characterization of the spreading of material released into the atmosphere based on a coordinate
system that is relative to the local median position of the dispersing material.
3.2.4 evaluation objective, n—a feature or characteristic, which can be defined through an analysis of the observed concentration
pattern, for example, maximum centerline concentration or lateral extent of the average concentration pattern as a function of
downwind distance, which one desires to assess the skill of the models to reproduce.
3.2.5 evaluation procedure, n—the analysis steps to be taken to compute the value of the evaluation objective from the observed
and modeled patterns of concentration values.
3.2.6 fate, n—the destiny of a chemical or biological pollutant after release into the environment.
3.2.7 model input value, n—characterizations that must be estimated or provided by the model developer or user before model
calculations can be performed.
3.2.8 regime, n—a repeatable narrow range of conditions, defined in terms of model input values, which may or may not be
explicitly employed by all models being tested, needed for dispersion model calculations. It is envisioned that the dispersion
observed should be similar for all cases having similar model input values.
3.2.9 uncertainty, n—refers to a lack of knowledge about specific factors or parameters. This includes measurement errors,
sampling errors, systematic errors, and differences arising from simplification of real-world processes. In principle, uncertainty can
be reduced with further information or knowledge (1).
For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards
volume information, refer to the standard’s Document Summary page on the ASTM website.
The boldface numbers in parentheses refer to the list of references at the end of this standard.
3.2.10 variability, n—refers to differences attributable to true heterogeneity or diversity in atmospheric processes that result in part
from natural random processes. Variability usually is not reducible by further increases in knowledge, but it can in principle be
better characterized (1).
4. Summary of Guide
4.1 Statistical evaluation of dispersion model performance with field data is viewed as part of a larger process that collectively
is called model evaluation. Section 6 discusses the components of model evaluation.
4.2 To statistically assess model performance, one must define an overall evaluation goal or purpose. This will suggest features
(evaluation objectives) within the observed and modeled concentration patterns to be compared, for example, maximum surface
concentrations, lateral extent of a dispersing plume. The selection and definition of evaluation objectives typically are tailored to
the model’s capabilities and intended uses. The very nature of the problem of characterizing air quality and the way models are applied make it impossible to define a single or absolute evaluation objective that is suitable for all purposes. The definition of the evaluation objectives will be restricted by the limited range of conditions experienced in the available comparison data suitable
for use. For each evaluation objective, a procedure will need to be defined that allows definition of the evaluation objective from
the available observations of concentration values.
4.3 In assessing the performance of air quality models to characterize a particular evaluation objective, one should consider what
the models are capable of providing. As discussed in Section 7, most models attempt to characterize the ensemble average
concentration pattern. If such models provide favorable comparisons with observed concentration maxima, this results from happenstance rather than skill in the model; therefore, in this discussion, it is suggested a model be assessed on its ability
to reproduce what it was designed to produce, for at least in these comparisons, one can be assured that zero bias with the least
amount of scatter is by definition good model performance.
4.4 As an illustration of the principles espoused in this guide, a procedure is provided in Appendix X1 for comparison of observed
and modeled near-centerline concentration values, which accommodates the fact that observed concentration values include a large
component of stochastic, and possibly deterministic, variability unaccounted for by current models. The procedure provides an
objective statistical test of whether differences seen in model performance are significant.
5. Significance and Use
5.1 Guidance is provided on designing model evaluation performance procedures and on the difficulties that arise in statistical
evaluation of model performance caused by the stochastic nature of dispersion in the atmosphere. It is recognized there are
examples in the literature where, knowingly or unknowingly, models were evaluated on their ability to describe something which
they were never intended to characterize. This guide is attempting to heighten awareness, and thereby, to reduce the number of
“unknowing” comparisons. A goal of this guide is to stimulate development and testing of evaluation procedures that accommodate
the effects of natural variability. A technique is illustrated to provide information from which subsequent evaluation and
standardization can be derived.
6. Model Evaluation
6.1 Background—Air quality simulation models have been used for many decades to characterize the transport and dispersion of
material in the atmosphere (2-4). Early evaluations of model performance usually relied on linear least-squares analyses of
observed versus modeled values, using traditional scatter plots of the values (5-7). During the 1980s, attempts were made
to encourage the standardization of methods used to judge air quality model performance (8-11). Further development of these
proposed statistical evaluation procedures was needed, as it was found that the rote application of statistical metrics, such as those
listed in (8), was incapable of discerning differences in model performance (12), whereas if the evaluation results were sorted by
stability and distance downwind, then differences in modeling skill could be discerned (13). It was becoming increasingly evident
that the models were characterizing only a small portion of the observed variations in the concentration values (14). To better
deduce the statistical significance of differences seen in model performance in the face of large unaccounted for uncertainties and
variations, investigators began to explore the use of bootstrap techniques (15). By the late 1980s, most of the model performance
evaluations involved the use of bootstrap techniques in the comparison of maximum values of modeled and observed cumulative
frequency distributions of the concentration values (16). Even though the procedures and metrics to be employed in describing
the performance of air quality simulation models are still evolving (17-19), there has been a general acceptance that defining
performance of air quality models needs to address the large uncertainties inherent in attempting to characterize atmospheric fate,
transport and dispersion processes. There also has been a consensus reached on the philosophical reasons that models of earth
science processes can never be validated, in the sense of claiming that a model is truthfully representing natural processes. No
general empirical proposition about the natural world can be certain, since there will always remain the prospect that future
observations may call the theory into question (20). It is seen that numerical models of air pollution are a form of highly complex scientific hypothesis concerning natural processes that can be confirmed through comparison with observations, but never
validated.
6.2 Components of Model Evaluation—A model evaluation includes science peer reviews and statistical evaluations with field
data. The completion of each of these components assumes specific model goals and evaluation objectives (see Section 10) have
been defined.
6.3 Science Peer Reviews—Given the complexity of characterizing atmospheric processes, and the inevitable necessity of limiting
model algorithms to a resolvable set, one component of a model evaluation is to review the model’s science to confirm that the
construct is reasonable and defensible for the defined evaluation objectives. A key part of the scientific peer review will include
the review of residual plots where modeled and observed evaluation objectives are compared over a range of model inputs, for
example, maximum concentrations as a function of estimated plume rise or as a function of distance downwind.
6.4 Statistical Evaluations with Field Data—The objective comparison of modeled concentrations with observed field data
provides a means for assessing model performance. Due to the limited supply of evaluation data sets, there are severe practical
limits in assessing model performance. For this reason, the conclusions reached in the science peer reviews (see 6.3) and the
supportive analyses (see 6.5) have particular relevance in deciding whether a model can be applied for the defined model evaluation
objectives. In order to conduct a statistical comparison, one will have to define one or more evaluation objectives for which
objective comparisons are desired (Section 10). As discussed in 8.4.4, the process of summarizing the overall performance of a
model over the range of conditions experienced within a field experiment typically involves determining two points for each of
the model evaluation objectives: which of the models being assessed has on average the smallest combined bias and scatter in
comparisons with observations, and whether the differences seen in the comparisons with the other models statistically are
significant in light of the uncertainties in the observations.
6.5 Other Tasks Supportive to Model Evaluation—As atmospheric dispersion models become more sophisticated, it is not easy to detect coding errors in the implementation of the model algorithms, and discerning the sensitivity of the modeling results to input parameter variations becomes less straightforward; hence, two important tasks that support model evaluation efforts are verification of software and sensitivity and Monte Carlo analyses.
6.5.1 Verification of Software—Often a set of modeling algorithms will require numerical solution. An important task supportive
to a model evaluation is a review in which the mathematics described in the technical description of the model are compared with
the numerical coding, to ensure that the code faithfully implements the physics and mathematics.
6.5.2 Sensitivity and Monte Carlo Analyses—Sensitivity and Monte Carlo analyses provide insight into the response of a model
to input variation. An example of this technique is to systematically vary one or more of the model inputs to determine the effect
on the modeling results (21). Each input should be varied over a reasonable range likely to be encountered. The traditional
sensitivity studies (21) were developed to better understand the performance of plume dispersion models simulating the transport
and dispersion of inert pollutants. For characterization of the effects of input uncertainties on modeling results, Monte Carlo studies
with simple random sampling are recommended (22), especially for models simulating chemically reactive species where there are
strong nonlinear couplings between the model input and output (23). Results from sensitivity and Monte Carlo analyses provide
useful guidance on which inputs should be most carefully prescribed because they account for the greatest sensitivity in the
modeling output. These analyses also provide a view of what to expect for model output in conditions for which data are not
available.
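As an informal sketch of a Monte Carlo analysis with simple random sampling, the following Python fragment perturbs the inputs of a toy centerline-concentration function over an assumed ±20 % range and examines the spread of the output; the function, the input names, and the ranges are hypothetical stand-ins for a real model and its inputs:

```python
import math
import random

def plume_concentration(q, u, sigma_y, sigma_z):
    # Toy Gaussian-plume centerline, ground-level concentration for a
    # ground-level source (hypothetical stand-in for a real model run).
    return q / (math.pi * u * sigma_y * sigma_z)

random.seed(1)
base = {"q": 1.0, "u": 5.0, "sigma_y": 50.0, "sigma_z": 20.0}

# Simple random sampling: perturb every input independently within an
# assumed +/-20 % range and record the spread of the model output.
results = []
for _ in range(2000):
    sample = {k: v * random.uniform(0.8, 1.2) for k, v in base.items()}
    results.append(plume_concentration(**sample))

mean_c = sum(results) / len(results)
rel_spread = (max(results) - min(results)) / mean_c
```

In an actual study, each input range would reflect its estimated uncertainty, and the full output distribution, rather than a single spread statistic, would be examined; inputs whose perturbation dominates the output spread are the ones deserving the most careful prescription.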
7. A Framework for Model Evaluations
7.1 This section introduces a philosophical model for explaining how and why observations of physical processes and model
simulations of physical processes differ. It is argued that observations are individual realizations, which in principle can be
envisioned as belonging to some ensemble. Most of the current models attempt to characterize the average concentration for each
ensemble, but there are under development models that attempt to characterize the distribution of concentration values within an
ensemble. Having this framework for describing how and why observations differ from model simulations has important
ramifications in how one assesses and describes a model’s ability to reproduce what is seen by way of observations. This
framework provides a rigorous basis for designing the statistical comparison of modeling results with observations.
7.2 The concept of “natural variability” acknowledges that the details of the stochastic concentration field resulting from
dispersion are difficult to predict. In this context, the difference between the ensemble average and any one observed realization
(experimental observation) is ascribed to natural variability, whose variation, σ_n, can be expressed as:

  σ_n² = \overline{(C_o − C̄_o)²}   (1)

where:
C_o = the observed concentration (or evaluation objective, see 10.3) seen within a realization; the overbars represent averages over all realizations within a given ensemble, so that C̄_o is the estimated ensemble average. The “o” subscript indicates an observed value.
7.2.1 The ensemble in Eq 1 refers to the ideal infinite population of all possible realizations meeting the (fixed) characteristics
associated with an ensemble. In practice, one will have only a small sample from this ensemble.
7.2.2 Measurement uncertainty in concentration values in most tracer experiments may be a small fraction of the measurement threshold, and when this is true its contribution to σ_n can usually be deemed negligible; however, as discussed in 9.2 and 9.4, expert judgment is needed as the reliability and usefulness of field data will vary depending on the intended uses being made of the data.
7.3 Defining the characteristics of the ensemble in Eq 1 using the model’s input values, α, one can view the observed concentrations (or evaluation objective) as:

  C_o = C_o(α, β) = C̄_o(α) + c(Δc) + c(α, β)   (2)

where:
β are the variables needed to describe the unresolved transport and dispersion processes; the overbar represents an average over all possible values of β for the specified set of model input parameters α; c(Δc) represents the effects of measurement uncertainty; and c(α,β) represents ignorance in β (unresolved deterministic processes and stochastic fluctuations) (14, 24).

7.3.1 Since C̄_o(α) is an average over all β, it is only a function of α, and in this context, C̄_o(α) represents the ensemble average that the model ideally is attempting to characterize.
7.3.2 The modeled concentrations, C_m, can be envisioned as:

  C_m = C̄_o(α) + d(Δα) + f(α)   (3)

where:
d(Δα) represents the effects of uncertainty in specifying the model inputs, and f(α) represents the effects of errors in the model formulations. The “m” subscript indicates a modeled value.
7.3.3 A method for performing an evaluation of modeling skill is to separately average the observations and modeling results over a series of non-overlapping limited ranges of α, which are called “regimes.” Averaging the observations provides an empirical estimate of what most of the current models are attempting to simulate, C̄_o(α). A comparison of the respective observed and modeled averages over a series of α-groups provides an empirical estimate of the combined deterministic error associated with input uncertainty and formulation errors.
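The regime-averaging comparison described above can be sketched in Python as follows; the regime labels and concentration values are hypothetical:

```python
from collections import defaultdict

# Hypothetical cases: (regime defined from model inputs alpha,
# observed concentration, modeled concentration).
cases = [
    ("stable_near", 12.0, 10.0), ("stable_near", 8.0, 10.5),
    ("stable_near", 10.0, 9.5), ("convective_far", 3.0, 4.0),
    ("convective_far", 5.0, 4.5), ("convective_far", 4.0, 4.2),
]

obs_by_regime = defaultdict(list)
mod_by_regime = defaultdict(list)
for regime, c_obs, c_mod in cases:
    obs_by_regime[regime].append(c_obs)
    mod_by_regime[regime].append(c_mod)

# The regime average of the observations estimates the ensemble average
# the model attempts to simulate; the difference of the regime averages
# estimates the combined deterministic (input plus formulation) error.
deterministic_error = {}
for regime, obs in obs_by_regime.items():
    obs_avg = sum(obs) / len(obs)
    mod_avg = sum(mod_by_regime[regime]) / len(mod_by_regime[regime])
    deterministic_error[regime] = mod_avg - obs_avg
```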
7.3.4 This process is not without problems. The variance in observed concentration values due to natural variability is of the order of magnitude of the regime averages (17, 25); hence, small sample sizes in the groups will lead to large uncertainties in the
estimates of the ensemble averages. The variance in modeled concentration values due to input uncertainty can be quite large (22,
23), hence small sample sizes in the groups will lead to large uncertainties in the estimates of the deterministic error in each group.
Grouping data together for analysis requires large data sets, of which there are few.
7.3.5 The observations and the modeling results come from different statistical populations, whose means are, for an unbiased
model, the same. The variance seen in the observations results from differences in realizations of averages, that which the model
is attempting to characterize, plus an additional variance caused by stochastic variations between individual realizations, which is
not accounted for in the modeling.
7.3.6 As the averaging time increases in the concentration values and corresponding evaluation objectives, one might expect the
respective variances in the observations and the modeling results would increasingly reflect variations in ensemble averages. As
averaging time increases, one might expect the variance in the concentration values and corresponding evaluation objectives to
decrease; however, as averaging time increases, the magnitude of the concentration values also decreases. As averaging time
increases, it is possible that the modeling uncertainties may yet be large when compared to the average modeled concentration
values, and likewise, the unexplained variations in the observations yet may be large when compared to the average observed
concentration values.
7.4 It is recommended that one goal of a model evaluation should be to assess the model’s skill in predicting what it was intended to characterize, namely C̄_o(α), which can be viewed as the systematic (deterministic) variation of the observations from one regime to the next. In such comparisons, there is a basis for believing that a well-formulated model would have zero bias for all regimes.
The model with the smallest deviations on average from the regime averages, would be the best performing model. One always
has the privilege to test the ability of a model to simulate something it was not intended to provide, such as the ability of a
deterministic model to provide an accurate characterization of extreme maximum values, but then one must realize that a
well-formulated model may appear to do poorly. If one selects as the best performing model the model having the least bias and scatter when compared with observed maxima, this may favor models that systematically overestimate the ensemble average through a compensating bias toward underestimating the lateral dispersion. Such a model may provide good comparisons with
short-term observed maxima, but it likely will not perform well for estimating maximum impacts for longer averaging times. By
assessing performance of a model to simulate something it was not intended to provide, there is a risk of selecting poorly-formed
models that may by happenstance perform well on the few experiments available for testing. These are judgment decisions that model users will make based on the anticipated uses of the modeling results and the needs of the moment. This guide has served its
purpose, if users better realize the ramifications that arise in testing a model’s performance to simulate something that it was not
intended to characterize.
8. Statistical Comparison Metrics and Methods
8.1 The preceding section described a philosophical framework for understanding why observations differ from model simulation results. This section provides definitions of the comparison metrics and methods most often employed in current air quality model evaluations. This discussion is not meant to be exhaustive. The list of possible metrics is extensive (8), but it has been illustrated that a few well-chosen, simple-to-understand metrics can provide adequate characterization of a model’s performance (14). The key is not in how many metrics are used, but in the statistical design used when the metrics are applied (13).
8.2 Paired Statistical Comparison Metrics—In the following equations, O_i is used to represent the observed evaluation objective, and P_i is used to represent the corresponding model’s estimate of the evaluation objective, where the evaluation objective, as explained in 10.3, is some feature that can be defined through the analysis of the concentration field. In the equations, the subscript “i” refers to paired values and the “overbar” indicates an average.
8.2.1 Average bias, d̄, and standard deviation of the bias, σ_d, are:

  d̄ = \overline{d_i}   (4)

  σ_d² = \overline{(d_i − d̄)²}   (5)

where:
d_i = (P_i − O_i).
8.2.2 Fractional bias, FB, and standard deviation of the fractional bias, σ_FB, are:

  FB = \overline{FB_i}   (6)

  σ_FB² = \overline{(FB_i − FB)²}   (7)

where FB_i = 2(P_i − O_i)/(P_i + O_i).
8.2.3 Absolute fractional bias, AFB, and standard deviation of the absolute fractional bias, σ_AFB, are:

  AFB = \overline{AFB_i}   (8)

  σ_AFB² = \overline{(AFB_i − AFB)²}   (9)

where AFB_i = 2|P_i − O_i|/(P_i + O_i).
8.2.4 As a measure of gross error resulting from both bias and scatter, the root mean squared error, RMSE, is often used:

  RMSE = \sqrt{\overline{(P_i − O_i)²}}   (10)
8.2.5 Another measure of gross error resulting from both bias and scatter, the normalized mean squared error, NMSE, often is used:

  NMSE = \overline{(P_i − O_i)²} / (P̄ Ō)   (11)
The advantage of the NMSE over the RMSE is that the normalization allows comparisons between experiments with vastly
different average values. The disadvantage of the NMSE versus RMSE is that uncertainty in the observation of low concentration
values will make the value of the NMSE so uncertain that meaningful conclusions may be precluded from being reached.
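As a worked illustration (not part of the formal procedure), the paired metrics of Eqs 4-11 can be computed directly; the observed (O) and predicted (P) values below are hypothetical:

```python
import math

# Hypothetical paired evaluation objectives: O_i observed, P_i predicted.
O = [10.0, 20.0, 30.0, 40.0]
P = [12.0, 18.0, 33.0, 41.0]
n = len(O)

d = [p - o for p, o in zip(P, O)]                    # d_i = P_i - O_i
bias = sum(d) / n                                    # Eq 4, average bias
sigma_d = math.sqrt(sum((di - bias) ** 2 for di in d) / n)  # Eq 5

fb = [2.0 * (p - o) / (p + o) for p, o in zip(P, O)]  # FB_i
FB = sum(fb) / n                                     # Eq 6, fractional bias
AFB = sum(abs(v) for v in fb) / n                    # Eq 8, absolute FB

mse = sum((p - o) ** 2 for p, o in zip(P, O)) / n
rmse = math.sqrt(mse)                                # Eq 10
nmse = mse / ((sum(P) / n) * (sum(O) / n))           # Eq 11
```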
8.2.6 For a scatter plot, where the predictions are plotted along the horizontal x-axis and the observations are plotted along the vertical y-axis, the linear regression (method of least squares) slope, m, and intercept, b, between the predicted and observed values are:

  m = [N Σ P_i O_i − (Σ P_i)(Σ O_i)] / [N Σ P_i² − (Σ P_i)²]   (12)

  b = [(Σ O_i)(Σ P_i²) − (Σ P_i)(Σ O_i P_i)] / [N Σ P_i² − (Σ P_i)²]   (13)
8.2.7 As a measure of the linear correlation between the predicted and observed values, the Pearson correlation coefficient often is used:

  r = Σ (P_i − P̄)(O_i − Ō) / [Σ (P_i − P̄)² · Σ (O_i − Ō)²]^(1/2)   (14)
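Similarly, the slope, intercept, and correlation coefficient of Eqs 12-14 can be sketched as follows; the O and P values below are hypothetical:

```python
import math

# Hypothetical paired values (x-axis: predictions, y-axis: observations).
O = [10.0, 20.0, 30.0, 40.0]
P = [12.0, 18.0, 33.0, 41.0]
n = len(O)

sum_p, sum_o = sum(P), sum(O)
sum_po = sum(p * o for p, o in zip(P, O))
sum_p2 = sum(p * p for p in P)

denom = n * sum_p2 - sum_p ** 2
m = (n * sum_po - sum_p * sum_o) / denom             # Eq 12, slope
b = (sum_o * sum_p2 - sum_p * sum_po) / denom        # Eq 13, intercept

p_bar, o_bar = sum_p / n, sum_o / n
r = sum((p - p_bar) * (o - o_bar) for p, o in zip(P, O)) / math.sqrt(
    sum((p - p_bar) ** 2 for p in P) * sum((o - o_bar) ** 2 for o in O)
)                                                    # Eq 14, Pearson r
```

Note that Eq 13 is algebraically identical to b = Ō − m P̄, which offers a quick consistency check on a computed fit.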
8.3 Unpaired Statistical Comparison Metrics—If the observed and modeled values are sorted from highest to lowest, there are
several statistical comparisons that are commonly employed. The focus in such comparisons usually is on whether the maximum
observed and modeled concentration values are similar, but one can substitute for the word “concentration,” any evaluation
objective that can be expressed numerically. As discussed in 7.3.5, the direct comparison of individual observed realizations with
modeled ensemble averages is the comparison of two different statistical populations with different sources of variance; hence,
there are fundamental philosophical problems with such comparisons. As mentioned in 7.4, such comparisons are going to be
made, as this may be how the modeling results will be used. At best, one can hope that such comparisons are made by individuals
that are cognizant of the philosophical problems involved.
8.3.1 The quantile-quantile plot is constructed by plotting the ranked concentration values against one another, for example,
highest concentration observed versus the highest concentration modeled, etc. If the observed and modeled concentration
frequency distributions are similar, then the plotted values will lie along the 1:1 line on the plot. By visual inspection, one can easily
see if the respective distributions are similar and whether the observed and modeled concentration maximum values are similar.
8.3.2 Cumulative frequency distribution plots are constructed by plotting the ranked concentration values (highest to lowest) against the plotting position frequency, f (typically in percent), where ρ is the rank (1 = highest), N is the number of values, and f is defined as (26):

  f = 100 % (ρ − 0.4)/N, for ρ ≤ N/2   (15)

  f = 100 % − 100 % (N − ρ + 0.6)/N, for ρ > N/2   (16)
As with the quantile-quantile plot, a visual inspection of the respective cumulative frequency distribution plots (observed and
modeled), usually is sufficient to suggest whether the two distributions are similar, and whether there is a bias in the model to over-
or under-estimate the maximum concentration values observed.
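Eqs 15 and 16 can be sketched as a small Python helper; the record length N = 10 is arbitrary:

```python
# Plotting-position frequency per Eqs 15 and 16; rank rho = 1 is the
# highest value and N is the number of values.
def plotting_position(rho, N):
    if rho <= N / 2:
        return 100.0 * (rho - 0.4) / N
    return 100.0 - 100.0 * (N - rho + 0.6) / N

N = 10
freqs = [plotting_position(rho, N) for rho in range(1, N + 1)]
```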
8.3.3 The Robust Highest Concentration (RHC) often is used where comparisons are being made of the maximum concentration
values and is envisioned as a more robust test statistic than direct comparison of maximum values. The RHC is based on an
exponential fit to the highest R-1 values of the cumulative frequency distribution, where R typically is set to be 26 for frequency
distributions involving a year’s worth of values (averaging times of 24 h or less) (16). The RHC is computed as:
  RHC = C(R) + Θ ln[(3R − 1)/2]   (17)

where:
Θ = average of the R−1 largest values minus C(R), and
C(R) = the R-th largest value.
NOTE 1—The value of R may be set to a lower value when there are fewer values in the distribution to work with, see (16). The RHCs of the observed and modeled cumulative frequency distributions often are compared using an FB metric, and may or may not involve stratification of the values by meteorological condition prior to computation of the RHC values.
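A sketch of the RHC computation in Eq 17, using a hypothetical record of 100 values and the conventional R = 26:

```python
import math

def robust_highest_concentration(values, R=26):
    # RHC per Eq 17; assumes len(values) >= R.  C(R) is the R-th largest
    # value and theta is the mean of the R-1 largest values minus C(R).
    ranked = sorted(values, reverse=True)
    c_r = ranked[R - 1]
    theta = sum(ranked[:R - 1]) / (R - 1) - c_r
    return c_r + theta * math.log((3 * R - 1) / 2.0)

# Hypothetical record (for example, a short season of 1 h averages);
# R would be lowered for records with fewer values (Note 1).
concs = [float(v) for v in range(1, 101)]
rhc = robust_highest_concentration(concs, R=26)
```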
8.4 Bootstrap Resampling—Bootstrap sampling can be used to generate estimates of the sampling error in the statistical metric
computed (15, 16, 27). The distribution of some statistical metrics, for example, RMSE and RHC, are not necessarily easily
transformed to a normal distribution, which is desirable when performing statistical tests to see if there are statistically significant
differences in values computed, for example, in the comparison of RHC values computed from the 8760 values of 1 h observed
and modeled concentration values for a year.
8.4.1 Following the description provided by (27), suppose one is analyzing a data set x1, x2, …, xn, which for convenience is denoted by the vector x = (x1, x2, …, xn). A bootstrap sample x* = (x1*, x2*, …, xn*) is obtained by randomly sampling n times, with replacement, from the original data points x = (x1, x2, …, xn). For instance, with n = 7 one might obtain x* = (x5, x7, x5, x4, x7, x3, x1). From each bootstrap sample one can compute some statistic (say the median, average, RHC, etc.). By creating a number of bootstrap samples, B, one can compute the mean, s̄, and standard deviation, σs, of the statistic of interest. For estimation of standard errors, B typically is on the order of 50 to 500.
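The resampling in 8.4.1 can be sketched as follows, assuming NumPy (the helper name `bootstrap_se` and its parameters are illustrative):

```python
import numpy as np

def bootstrap_se(x, statistic, B=500, seed=0):
    """Estimate the mean and standard error of `statistic` by computing
    it on B bootstrap resamples of x, each drawn with replacement."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    stats = np.array([statistic(x[rng.integers(0, n, size=n)])
                      for _ in range(B)])
    return stats.mean(), stats.std(ddof=1)
```

For the sample mean, the bootstrap standard error should approach the analytic value σ/√n as B grows.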
8.4.2 The bootstrap resampling procedure often can be improved by blocking the data into two or more blocks or sets, with each
block containing data having similar characteristics. This prevents the possibility of creating an unrealistic bootstrap sample where
all the members are the same value (15).
8.4.3 When performing model performance evaluations, for each hour there is not only the observed concentration value, but also the modeling results from all the models being tested. In such cases, the individual members, xi, in the vector x = (x1, x2, …, xn) are themselves vectors, composed of the observed value and its associated modeling results (from all models, if there is more than one); thus the selection of the observed concentration xi also includes each model's estimate for this case. This is called "concurrent sampling." The purpose of concurrent sampling is to preserve correlations inherent in the data (16). These temporal and spatial correlations affect the statistical properties of the data samples. One of the considerations in devising a bootstrap sampling procedure is to address how best to preserve inherent correlations that might exist within the data.
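Concurrent sampling amounts to resampling case indices rather than values, so the observed value and every model's estimate for a case stay together. A sketch, assuming NumPy (names are illustrative):

```python
import numpy as np

def concurrent_sample(obs, model_sets, rng):
    """Draw one concurrent bootstrap sample: the same randomly chosen
    cases (hours) are selected from the observations and from every
    model's estimates, preserving the obs/model pairing per case."""
    n = len(obs)
    idx = rng.integers(0, n, size=n)   # resample cases, with replacement
    return obs[idx], [m[idx] for m in model_sets]
```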
8.4.4 For assessing differences in model performance, one often wishes to test whether the difference seen in a performance metric computed between Model No. 1 and the observations (say RMSE1) is significant when compared to that computed for another model (say Model No. 2, RMSE2) using the same observations. For testing whether the difference between statistical metrics is significant, the following procedure is recommended. Let each bootstrap sample be denoted x*(b), where * indicates this is a bootstrap sample (8.4.1) and b indicates this is sample "b" of a series of bootstrap samples (where the total number of bootstrap samples is B). From each bootstrap sample, x*(b), one computes the respective values RMSE1(b) and RMSE2(b). The difference Δ*(b) = RMSE1(b) − RMSE2(b) then can be computed. Once all B samples have been processed, compute from the set of B values Δ* = (Δ*(1), Δ*(2), …, Δ*(B)) the average, Δ̄, and standard deviation, σΔ. The null hypothesis is that Δ̄ is equal to zero; it is tested at a stated level of confidence, η, using the t-value for a Student's-t test:

t = Δ̄ / σΔ   (18)

For illustration purposes, assume the level of confidence is 90 % (η = 0.1). Then, for large values of B, if the t-value from Eq 18 is larger than the Student's-t value tη/2 = 1.645, it can be concluded with 90 % confidence that Δ̄ is not equal to zero, and hence, there is a significant difference in the RMSE values for the two models being tested.
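The procedure of 8.4.4 can be sketched as follows, assuming NumPy and concurrent resampling of cases (the function names are illustrative, not from the standard):

```python
import numpy as np

def rmse(pred, obs):
    return np.sqrt(np.mean((pred - obs) ** 2))

def rmse_difference_t(obs, model1, model2, B=1000, seed=0):
    """Bootstrap t-value for the difference in RMSE between two models
    (Eq 18): t = mean(delta) / std(delta), with each delta computed on
    a concurrently resampled set of cases."""
    rng = np.random.default_rng(seed)
    n = len(obs)
    delta = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # concurrent sampling (8.4.3)
        delta[b] = rmse(model1[idx], obs[idx]) - rmse(model2[idx], obs[idx])
    return delta.mean() / delta.std(ddof=1)
```

For large B, |t| > 1.645 indicates a significant difference in RMSE at the 90 % confidence level (η = 0.1).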
9. Considerations in Performing Statistical Evaluations
9.1 Evaluation of the performance of a model mostly is constrained by the amount and quality of observational data available for
comparison with modeling results. The simulation models are capable of providing estimates for a larger set of conditions than those for which observational data exist. Furthermore, most models do not provide estimates of directly measurable quantities. For instance, even if a model provides an estimate of the concentration at a specific location, it is most likely an estimate of an ensemble-average result which has an implied averaging time, and for grid models represents an average over some volume of air, for example, a grid average; hence, in establishing what abilities of the model are to be tested, one must first consider whether sufficient observational data are available that can provide, either directly or through analysis, observations of what is being modeled.
9.2 Understanding Observed Concentrations:
9.2.1 It is not necessary for a user of concentration observations to know or understand all details of how the observations were
made, but some fundamental understanding of the sampler limitations (operational range), background concentration value(s), and
stochastic nature of the atmosphere is necessary for developing effective evaluation procedures.
9.2.2 All samplers have a detection threshold below which observed values either are not provided, or are considered suspect. It
is possible that there is a natural background of the tracer, which either has been subtracted from the observations, or needs to be
considered in using the observations. Data collected under a quality assurance program following consensus standards are more
credible in most settings than data whose quality cannot be objectively documented. Some samplers have a saturation point which
limits the maximum value that can be observed. The user of concentration observations should address these, as needed, in
designing the evaluation procedures.
9.2.3 Atmospheric transport and dispersion processes include stochastic components. The transport downwind follows a
serpentine path, being influenced by both random and periodic wind oscillations, composed of both large and small scale eddies
in the wind field. Fig. 1 illustrates the observed concentrations seen along a sampling arc at 50 m downwind, centered on a near-surface point-source release of sulfur dioxide during Project Prairie Grass (28). Fig. 1 is a summary over all 70 experiments. For each experiment the crosswind receptor positions, y, relative to the observed center of mass along the arc have been divided by σy, which is the second moment of the concentration values seen along each arc, that is, the lateral dispersion, which is a measure of the lateral extent of the plume. The observed concentration values have been divided by Cmax = CY/(σy √(2π)), where CY is the crosswind-integrated concentration along the arc. The crosswind-integrated concentration is a measure of the vertical dilution the plume has experienced in traveling to this downwind position. To assume that the crosswind concentration distribution follows a Gaussian curve, which is implicit in the relationship used to compute Cmax, is seen to be a
reasonable approximation when all the experimental results are combined. As shown by the results for Experiment 31, a Gaussian
profile may not apply that well for any one realization, where random effects occurred, even though every attempt was made to
collect data under nearly ideal circumstances. Under less ideal conditions, as with emissions from a large industrial power plant
stack of order 75 m in height and a buoyant plume rise of order 100 m above the stack, it is easy to understand that the observed
FIG. 1 Illustration of Effects of Natural Variability on Crosswind Profiles of a Plume Dispersing Downwind (Grouped in a Relative Dispersion Context)
lateral profile for individual experimental results might well vary from the ideal Gaussian shape. It must be recognized that features
like double peaks, saw-tooth patterns and other irregular behavior are often observed for individual realizations.
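The Gaussian normalization used for Fig. 1, Cmax = CY/(σy √(2π)), can be sketched as a short function (the name `gaussian_peak` is illustrative):

```python
import math

def gaussian_peak(cy, sigma_y):
    """Peak of a Gaussian crosswind profile: C_max = C^Y / (sigma_y * sqrt(2*pi)),
    where cy is the crosswind-integrated concentration along the arc and
    sigma_y the lateral dispersion (second moment of the arc values)."""
    return cy / (sigma_y * math.sqrt(2.0 * math.pi))
```

Dividing each observed arc value by this quantity lets profiles from experiments with very different dilution be overlaid on one dimensionless plot.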
9.3 Understanding the Models to be Evaluated:
9.3.1 As in other branches of meteorology, a complete set of equations for the characterization of the transport and fate of material
dispersing through the atmosphere is so complex that no unique analytical solution is known. Approximate analytical principles,
such as mass balance, are frequently combined with other concepts to allow study of a particular situation (29). Before evaluating
a model, the user must have a sufficient understanding of the basis for the model and its operation to know what it was intended
to characterize. The user must know whether the model provides volume average concentration estimates, or whether the model
provides average concentration estimates for specific positions above the ground. The user must know whether the characteriza-
tions of transport, dispersion, formation and removal processes are expressed using equations that provide ensemble average
estimates of concentration values, or whether the equations and relationships used provide stochastic estimates of concentration
values. Answers to these and like questions are necessary when attempting to define the evaluation objectives (10.3).
9.3.2 A mass balance model tracks material entering and leaving a particular air volume. Within this conceptual framework,
concentrations are increased by emissions that occur within the defined volume and by transport from other adjacent volumes.
Similarly, concentrations are decreased by transport exiting the volume, by removal through chemical/physical sinks within the volume, for example, wet and dry deposition, and, for reactive species, by conversion to other forms. These relationships can
be specified through a differential equation quantifying factors related to material gain or loss (29). Models of this type typically
provide ensemble volume-average concentration values as a function of time. One will have to consult the model documentation
in order to know whether the concentration values reported are averaged over some period of time, such as 1 h, or are the
volume-average values at the end of time periods, such as at the end of each hour of simulation.
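The mass balance described in 9.3.2 can be sketched as a single well-mixed box advanced with a forward-Euler step (the function and its parameters are illustrative assumptions, not taken from any particular model):

```python
def box_model_step(c, dt, E, V, Q, c_in, k):
    """One forward-Euler step of a single-box mass balance:
        dC/dt = E/V + (Q/V) * (c_in - c) - k * c
    E: emission rate into the box, V: box volume, Q: volumetric exchange
    rate with adjacent air (transport in and out), c_in: concentration in
    the inflowing air, k: first-order loss rate (e.g., deposition or
    chemical conversion)."""
    return c + dt * (E / V + (Q / V) * (c_in - c) - k * c)
```

Iterating this step gives the volume-average concentration as a function of time; real grid models apply the same accounting to many coupled volumes.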
9.3.3 Some models are entirely empirical. A common example (30) involves analysis and characterization of the concentration
distributions using measurements under different conditions across a variety of collection sites. Empirical models are
strictly speaking, only applicable to the range of measurement conditions upon which they were developed.
9.3.4 Most atmospheric transport and dispersion models involve the combination of theoretical and empirical parameterizations
of the physical processes (31), therefore, even though theoretical models may be suitable to a wide range of applications in
principle, they are limited to the physical processes characterized, and to the inherent limitations of empirically derived
relationships embedded within them.
9.3.5 Generally speaking, as model complexity grows in terms of temporal and spatial detail, the task of supplying appropriate
inputs becomes more demanding. It is not a given that increasing the complexity in the treatment of the transport and fate of
dispersing material will provide less uncertain predictions. As the number of model input parameters increases, more sources are
provided for development of model uncertainty, d(Δα) in Eq 2. Understanding the sensitivity of the modeling results to model input
uncertainty should affect the definition of evaluation objectives and associated procedures. For instance, specifying the transport
direction of a dispersing plume is highly uncertain. It has been estimated that the uncertainty in characterizing the plume transport
is on the order of 25 % of the plume width or more (17). If one attempts to define the relative skill of several models with the modeling results and observations paired in time and space, the uncertainties in positioning a plume relative to the receptor positions can cause there to be no correlation between the model results and observations, when in fact some of the models may be performing well once uncertainties resulting from plume transport are mitigated (13, 17).
9.4 Choosing Data Sets for Model Evaluation:
9.4.1 In principle, data used for the evaluation process should be independent of the data used to develop the model. If independent
data cannot be found, there are two choices. Either use all available data from a variety of experiments and sites to broadly
challenge the models to be evaluated, or collect new data to support the evaluation process. Realistically, the latter approach is only
feasible in rare circumstances, given the cost to conduct full-scale comprehensive fi
...
