ASTM E3080-23
Standard Practice for Regression Analysis with a Single Predictor Variable
ABSTRACT
This practice covers regression analysis of a set of data to define the statistical relationship between two numerical variables for use in predicting one variable from the other. This practice is restricted in scope to consider only a single numerical response variable and a single numerical predictor variable. The objective is to obtain a regression model for use in predicting the value of the response variable Y for given values of the predictor variable X.
SIGNIFICANCE AND USE
4.1 Regression analysis is a procedure that uses data to study the statistical relationships between two or more variables (1, 2). This practice is restricted in scope to consider only a single numerical response variable and a single numerical predictor variable. The objective is to obtain a regression model for use in predicting the value of the response variable Y for given values of the predictor variable X.
4.2 A regression model consists of: (1) a regression function that relates the mean values of the response variable distribution to fixed values of the predictor variable, and (2) a statistical distribution that describes the variability in the response variable values at a fixed value of the predictor variable.
4.2.1 The regression analysis utilizes either experimental or observational data to estimate the parameters defining a regression model and their precision. Diagnostic procedures are utilized to assess the resulting model fit and can suggest other models for improved prediction performance.
4.3 The information in this practice is arranged as follows.
4.3.1 Section 5 gives a general outline of the steps in the regression analysis procedure. The subsequent sections cover procedures for estimation of specific regression models.
4.3.2 Section 6 assumes a straight line relationship between the two variables. This is also known as the simple linear regression model or a first order model. This model should be used as a starting point for understanding the XY relationship and ultimately defining the best fitting model to the data.
4.3.3 Section 7 considers a proportional relationship between the variables, where the ratio of one variable to the other is constant. The intercept is constrained to be zero. This model is useful for single point calibration, where a reference material is run periodically as a standard during routine testing to correct for drift in instrument performance over a given range of test results.
4.3.4 Section 8 discusses a regression function that considers curvature in the XY relationship, the second order polynomial model.
SCOPE
1.1 This practice covers regression analysis of a set of data to define the statistical relationship between two numerical variables for use in predicting one variable from the other.
1.2 The regression analysis provides graphical and calculational procedures for selecting the best statistical model that describes the relationship and for evaluation of the fit of the data to the selected model.
1.3 The resulting regression model can be useful for developing process knowledge through description of the variable relationship, in making predictions of future values, in relating the precision of a test method to the value of the characteristic being measured, and in developing control methods for the process generating values of the variables.
1.4 The system of units for this practice is not specified. Dimensional quantities in the practice are presented only as illustrations of calculation methods. The examples are not binding on products or test methods treated.
1.5 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.6 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
General Information
- Status
- Published
- Publication Date
- 31-Oct-2023
- Technical Committee
- E11 - Quality and Statistics
- Drafting Committee
- E11.10 - Sampling / Statistics
Relations
- Effective Date
- 01-Nov-2023
- Referred By
- ASTM E2862-23 - Standard Practice for Probability of Detection Analysis for Hit/Miss Data
Overview
ASTM E3080-23, titled Standard Practice for Regression Analysis with a Single Predictor Variable, is an internationally recognized standard developed by ASTM for the application of regression analysis involving two numerical variables: one predictor (independent variable) and one response (dependent variable). This standard outlines methodologies to statistically define, evaluate, and predict the relationship between these variables, supporting both experimental and observational datasets. The primary goal is to enable users to construct and utilize regression models for quality improvement, process understanding, and predictive analytics across various industries.
Key Topics
- Single Predictor Regression: Focuses solely on situations with one predictor and one response variable, streamlining model selection and interpretation for simple relationships.
- Model Selection and Evaluation: Provides guidance for the graphical (e.g., scatter plots, residual analysis) and computational methods needed to select the best-fit regression model and evaluate its adequacy.
- Core Regression Models (illustrated in the sketch after this list):
- Simple Linear Regression: Assumes a straight-line relationship between the variables.
- Proportional Model: Imposes a through-the-origin constraint, ideal for calibration or reference comparisons.
- Curvature Models: Considers the inclusion of nonlinear (quadratic) terms when straight-line assumptions do not suffice.
- Statistical Diagnostics: Instructs users on procedures for residual analysis, outlier detection, and assessment of variance constancy and normality, ensuring robust and reliable model fits.
- Precision and Prediction: Outlines the calculation of standard errors, confidence intervals for model parameters, and prediction intervals for new observations.
- Data Quality and Experimental Design: Discusses practical considerations for data collection, such as range, spacing, and repetition of predictor variable values, to maximize regression validity.
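As a minimal, hypothetical sketch (not taken from the standard itself), the three core model forms listed above can each be fit by ordinary least squares in a few lines of NumPy; the data values here are invented purely for illustration.

```python
# Fitting the three core model forms to a hypothetical data set.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([10.4, 20.9, 31.2, 41.8, 52.1])

b1, b0 = np.polyfit(x, y, 1)         # simple linear: Y = b0 + b1*X
b11, b1q, b0q = np.polyfit(x, y, 2)  # curvature: Y = b0q + b1q*X + b11*X^2
b1_zero = (x @ y) / (x @ x)          # proportional (zero intercept): Y = b1*X
```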
Applications
ASTM E3080-23 is highly relevant for practitioners seeking to:
- Predict future values: Estimate the likely value of a response variable based on a known value of the predictor, useful in manufacturing quality control, laboratory calibration, and engineering testing.
- Process control and improvement: Understand and control relationships between process variables for enhanced quality or performance in fields like chemical processing, materials testing, and product development.
- Method validation and calibration: Calibrate measurement systems and test methods, especially where regular standard checks are required to adjust for drift or bias.
- Evaluate test method precision: Relate measurement or test method variability to the value of the variable being measured.
- Statistical analysis education: Provide a clear and standardized framework for teaching or learning fundamental regression analysis techniques.
This standard is applicable across various industries, including engineering, manufacturing, environmental science, and laboratory research, wherever a sound statistical relationship between two quantitative measures is needed.
Related Standards
Users of ASTM E3080-23 may also benefit from these related ASTM standards:
- ASTM E178: Practice for Dealing With Outlying Observations.
- ASTM E2586: Practice for Calculating and Using Basic Statistics.
- ASTM E456: Terminology Relating to Quality and Statistics.
These referenced standards offer complementary methods and terminology for statistics, further supporting high-quality data analysis and interpretation.
By adhering to ASTM E3080-23, organizations ensure that their use of regression analysis with a single predictor variable is robust, standardized, and internationally recognized, promoting best practices in predictive modeling and statistical quality assurance.
Frequently Asked Questions
ASTM E3080-23 is a standard published by ASTM International. Its full title is "Standard Practice for Regression Analysis with a Single Predictor Variable". Its coverage is summarized in the Abstract, Significance and Use, and Scope sections above.
ASTM E3080-23 is classified under the following ICS (International Classification for Standards) categories: 03.120.30 - Application of statistical methods. The ICS classification helps identify the subject area and facilitates finding related standards.
ASTM E3080-23 has the following relationships with other standards: it is linked to ASTM E3080-19 (the previous edition), ASTM E456-13a(2022)e1, ASTM E2862-23, ASTM E3297-21, ASTM D8406-22, ASTM D8405-21, ASTM E2586-19e1, ASTM E3023-21, ASTM D8272-19, ASTM E456-13a(2022), ASTM E2935-21, and ASTM E3323-22. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the
Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: E3080 − 23 An American National Standard
Standard Practice for Regression Analysis with a Single Predictor Variable

This standard is issued under the fixed designation E3080; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.
1. Scope

1.1 This practice covers regression analysis of a set of data to define the statistical relationship between two numerical variables for use in predicting one variable from the other.

1.2 The regression analysis provides graphical and calculational procedures for selecting the best statistical model that describes the relationship and for evaluation of the fit of the data to the selected model.

1.3 The resulting regression model can be useful for developing process knowledge through description of the variable relationship, in making predictions of future values, in relating the precision of a test method to the value of the characteristic being measured, and in developing control methods for the process generating values of the variables.

1.4 The system of units for this practice is not specified. Dimensional quantities in the practice are presented only as illustrations of calculation methods. The examples are not binding on products or test methods treated.

1.5 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.

1.6 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

2. Referenced Documents

2.1 ASTM Standards:
E178 Practice for Dealing With Outlying Observations
E456 Terminology Relating to Quality and Statistics
E2586 Practice for Calculating and Using Basic Statistics

3. Terminology

3.1 Definitions—Unless otherwise noted, terms relating to quality and statistics are as defined in Terminology E456.
3.1.1 degrees of freedom, n—the number of independent data points minus the number of parameters that have to be estimated before calculating the variance. (E2586)
3.1.2 predictor variable, X, n—a variable used to predict a response variable using a regression model.
3.1.2.1 Discussion—Also called an independent or explanatory variable.
3.1.3 regression analysis, n—a statistical procedure used to characterize the association between two or more numerical variables for prediction of the response variable from the predictor variable.
3.1.3.1 Discussion—In this practice, only a single predictor variable is considered.
3.1.4 residual, n—the observed value minus the fitted value, when a regression model is used.
3.1.5 response variable, Y, n—a variable predicted from a regression model.
3.1.5.1 Discussion—Also called a dependent variable.
3.1.6 sample coefficient of determination, r², n—square of the sample correlation coefficient.
3.1.7 sample correlation coefficient, r, n—a dimensionless measure of association between two variables estimated from the data.
3.1.8 sample covariance, s_XY, n—an estimate of the association of the response variable and predictor variable calculated from the data.

3.2 Definitions of Terms Specific to This Standard:
3.2.1 intercept, β₀, n—of a regression model, the value of the response variable when the value of the predictor variable is equal to zero.
3.2.2 regression model parameter, n—a descriptive constant defining a regression model that is to be estimated.
3.2.3 residual standard deviation, σ, n—of a regression model, the square root of the residual variance.

(This practice is under the jurisdiction of ASTM Committee E11 on Quality and Statistics and is the direct responsibility of Subcommittee E11.10 on Sampling / Statistics. Current edition approved Nov. 1, 2023. Published November 2023. Originally approved in 2016. Last previous edition approved in 2019 as E3080 – 19. DOI: 10.1520/E3080-23. For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard's Document Summary page on the ASTM website. Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959, United States.)
3.2.4 residual variance, σ², n—of a regression model, the variance of the residuals (see residual).
3.2.5 slope, β₁, n—of a regression model, the incremental change in the response variable due to a unit change in the predictor variable.

3.3 Symbols:
b₀ = intercept parameter estimate (5.5.1)
b₁ = slope parameter estimate (5.5)
b₁₁ = curvature parameter estimate (8.1.1.1)
β₀ = intercept parameter in model (5.3.1)
β₁ = slope parameter in model (5.3.1)
β₁₁ = curvature parameter in model (5.3.3)
E = general point estimate of a parameter (5.7)
eᵢ = residual for data point i (5.5.2)
ε = error term in model (5.4)
F = F statistic (6.5.2)
h = index for predicting any value in data range (6.4.3)
i = index for a data point (5.2)
L = lower confidence limit (5.7.2)
λ = Box-Cox parameter (A1.5.2)
n = number of data points (5.2)
p = number of parameters in regression model (5.7)
r = correlation coefficient (6.3.2.1)
r² = coefficient of determination (6.3.2.2)
S(b₀, b₁) = sum of squared deviations of Yᵢ to the regression line (A1.1.2)
s_b1 = standard error of slope estimate (6.4.1)
s_b0 = standard error of intercept estimate (6.4.2)
s_E = general standard error of a point estimate (5.7)
σ = residual standard deviation (5.4.1)
s = estimate of σ (6.2.6)
σ² = residual variance (5.4.1)
s² = estimate of σ² (6.2.6)
s_X² = variance of X data (A1.2.1)
s_Y² = variance of Y data (A1.2.1)
S_XX = sum of squares of deviations of X data from average (6.2.3)
S_XY = sum of cross products of X and Y from their averages (6.2.3)
s_XY = sample covariance of X and Y (A1.2.1)
s_Ŷh = standard error of Ŷ_h (6.4.3)
s_Ŷh(ind) = standard error of future individual Y value (6.4.4)
S_YY = sum of squares of deviations of Y data from average (6.2.3)
t = Student's t distribution (5.7)
U = upper confidence limit (5.7.2)
X = predictor variable (5.1)
X̄ = average of X data (6.2.3)
X_h = general value of X in its range (6.4.3)
Xᵢ = value of X for data point i (5.2)
Y = response variable (5.1)
Ȳ = average of Y data (6.2.3)
Ẏ = geometric mean of Y data (A1.5.4)
Y′ = transformed Y (A1.5.2)
Ŷ_h(ind) = predicted future individual Y for a value X_h (6.4.4)
Yᵢ = value of Y for data point i (5.2)
Ŷ_h = predicted value of Y for any value X_h (6.4.3)
Ŷᵢ = predicted value of Y for data point i (5.5.1)

3.4 Acronyms:
3.4.1 ANOVA, n—analysis of variance
3.4.2 df, n—degrees of freedom
3.4.3 LOF, n—lack of fit
3.4.4 MS, n—mean square
3.4.5 MSE, n—mean square error
3.4.6 MSR, n—mean square regression
3.4.7 MST, n—mean square total
3.4.8 PE, n—pure error
3.4.9 SS, n—sum of squares
3.4.10 SSE, n—sum of squares error
3.4.11 SSR, n—sum of squares regression
3.4.12 SST, n—sum of squares total

4. Significance and Use

4.1 Regression analysis is a procedure that uses data to study the statistical relationships between two or more variables (1, 2). This practice is restricted in scope to consider only a single numerical response variable and a single numerical predictor variable. The objective is to obtain a regression model for use in predicting the value of the response variable Y for given values of the predictor variable X.

4.2 A regression model consists of: (1) a regression function that relates the mean values of the response variable distribution to fixed values of the predictor variable, and (2) a statistical distribution that describes the variability in the response variable values at a fixed value of the predictor variable.
4.2.1 The regression analysis utilizes either experimental or observational data to estimate the parameters defining a regression model and their precision. Diagnostic procedures are utilized to assess the resulting model fit and can suggest other models for improved prediction performance.

4.3 The information in this practice is arranged as follows.
4.3.1 Section 5 gives a general outline of the steps in the regression analysis procedure. The subsequent sections cover procedures for estimation of specific regression models.
4.3.2 Section 6 assumes a straight line relationship between the two variables. This is also known as the simple linear regression model or a first order model. This model should be used as a starting point for understanding the XY relationship and ultimately defining the best fitting model to the data.
4.3.3 Section 7 considers a proportional relationship between the variables, where the ratio of one variable to the other is constant. The intercept is constrained to be zero. This model is useful for single point calibration, where a reference material is run periodically as a standard during routine testing to correct for drift in instrument performance over a given range of test results.
4.3.4 Section 8 discusses a regression function that considers curvature in the XY relationship, the second order polynomial model.

(The boldface numbers in parentheses refer to a list of references at the end of this standard.)
4.3.5 Annex A1 provides supplemental information of a more mathematical nature on regression.
4.3.6 Appendix X1 lists calculations for the curvature model estimates and exhibits a worksheet for these calculations.

5. Regression Analysis Procedure for a Single Predictor Variable

5.1 Choose the response variable Y and the predictor variable X. The predictor variable X is assumed to have known values with little or no measurement error. For given values of X, the response variable Y has a distribution of values representing the random effect of measurement errors, and these distributions are defined within a given range of the X values.

5.2 Obtain a data set consisting of n pairs of values designated as (Xᵢ, Yᵢ), with the sample index i ranging from 1 through n. The data can arise in two different ways. Observational data consists of X and Y values measured on a set of n random test units. Experimental data consists of Y values measured on n test units with X values set at controlled values in an experimental study.
5.2.1 When designing an experiment for defining the XY association some considerations are:
(1) Range of X values.
(2) Number of distinct X values.
(3) Spacing of X values.
(4) Number of Y observations for each X value.
The answers depend on the objectives of the investigation, whether determining the nature of the regression function, estimating the slope or intercept of the simple linear model, or estimating the measurement error of Y, as well as other objectives.
5.2.1.1 The X values should cover the entire range of interest. Extrapolation beyond the range of observed X values may fail due to expanding estimation error outside the range and the uncertainty of whether the model gives an adequate description of the XY relationship outside the range. When inference is required for the Y intercept (the value of Y when X is zero) the range of X should extend down to zero or near zero.
5.2.1.2 Two X levels are necessary when the objective is to determine if there is an effect of X on Y, and to give an estimate of the effect (slope). Three X levels are necessary to evaluate any curvature in the relationship. Four or more X levels give better definition of the model shape, particularly if there is a possible asymptote or a threshold in the relationship. The X levels should be equally spaced. If X is transformed, such as to logarithms, the equal spacing should be with respect to the transformed X.
5.2.1.3 Usually the number of Y observations should be equal at each X level. When the objective is to estimate Y variance or evaluate variance constancy, then at least four observations are recommended at each X level.

5.3 Choose a regression function that fits the data. A scatter plot of the data is recommended for a visual look at the XY relationship, and most computer packages have this as an option. This is a plot of points on the XY plane having a value of Y (on the vertical axis) and a value of X (on the horizontal axis) for each data pair, where it is useful for evaluating the quality of the data and suggesting an appropriate regression function to define the XY relationship. Fig. 1 gives examples of four scatter plots that illustrate different situations.
5.3.1 Fig. 1A shows a cluster of points that appear to be elongated in a particular direction along a straight line that does not pass through the origin (X = 0, Y = 0). This pattern suggests the straight line regression function Y = β₀ + β₁X. The two parameters for this function are the intercept β₀ and the slope β₁. The slope is the amount of incremental change in Y units for a unit change in X. The intercept is the value of Y when X = 0. Both parameters are necessary to define this regression function.

FIG. 1 Scatter Plots
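As a minimal illustration of the scatter plot recommended in 5.3, the sketch below plots the weld data used later in the example of 6.2.7 (Table 1); matplotlib is one of many packages offering this option.

```python
# Scatter plot of Y (response, vertical axis) against X (predictor,
# horizontal axis), as recommended in 5.3 for choosing a regression
# function. Data are the weld-strength pairs from Table 1.
import matplotlib.pyplot as plt

X = [190, 200, 209, 215, 215, 215, 230, 250, 250, 265]      # weld diameter, mils
Y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1300, 1175] # shear strength, in.-lb

plt.scatter(X, Y)
plt.xlabel("Weld diameter, X (mils)")
plt.ylabel("Shear strength, Y (inch-pounds)")
plt.title("Scatter plot for evaluating the XY relationship")
plt.show()
```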
5.3.2 Fig. 1B suggests a straight line that appears to go through the origin, thus Y is proportional to X, and the regression function is Y = β₁X. An intercept term is not required because the Y intercept is constrained to equal zero, that is, the line goes through the origin.
5.3.3 Fig. 1C indicates curvature in the relationship, and there are several regression functions that can be used. For slight curvature, a simple model is to add a second order (X²) term to the straight line function as Y = β₀ + β₁X + β₁₁X².
5.3.4 Fig. 1D shows data with increasing variability with larger mean values. This suggests the need for a weighted regression procedure discussed in A1.4.3.
5.3.5 Data points appearing outside the swarm of data (outliers) can have an adverse effect on estimation of regression function parameters. For the straight-line function, outliers at the extremes of the X range can greatly affect the estimate of the slope and intercept parameters, and outliers in the middle of the range tend to affect the intercept estimate more than the slope. Outliers can be formally identified by statistical procedures (see Practice E178).
5.3.6 A special situation occurs when there are two data swarms separated by a gap. This may indicate that there were two sources of data with different values of a second lurking predictor variable. Such a data set consists essentially of two data points in cases of a large gap.

5.4 Define the regression model by adding an error term to the regression function that describes the variation in Y through a statistical distribution. For example, the simple linear regression model using the regression function in 5.3.1 is then stated as Y = β₀ + β₁X + ε, where ε is a random error having a distribution with mean zero and standard deviation σ (variance σ²).
5.4.1 The distribution for ε can often be assumed to have a normal (Gaussian) distribution with a constant standard deviation over the range of X. Thus, the distribution of Y at a given X is a normal distribution with a mean of β₀ + β₁X and a standard deviation of σ. An example of such a linear regression model is shown in Fig. 2 over a range of X from 0 to 40 X units. Normal distributions of response Y with σ = 1.3 Y units are depicted at X = 10, 20, and 30 X units.

FIG. 2 Graphical Depiction of a Straight Line Regression Model
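A quick way to internalize the model in 5.4.1 is to simulate it. The sketch below generates data from Y = β₀ + β₁X + ε with σ = 1.3 over the X range of Fig. 2; the parameter values β₀ = 5 and β₁ = 2 are hypothetical, chosen only for illustration.

```python
# Simulating the straight line regression model of 5.4.1: at each X the
# response Y is normally distributed with mean beta0 + beta1*X and
# standard deviation sigma (sigma = 1.3 as in Fig. 2).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, 2.0, 1.3     # beta0 and beta1 are illustrative only

X = rng.uniform(0, 40, size=200)        # predictor values in the Fig. 2 range
eps = rng.normal(0.0, sigma, size=200)  # error term: mean 0, std dev sigma
Y = beta0 + beta1 * X + eps             # realized responses around the line
```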
5.4.2 Distributions other than the normal distribution may also be considered, depending on knowledge of the application. For example, low microbial counts may use a Poisson error distribution.

5.5 Parameter estimation uses the data set to provide the parameter estimates. For the simple regression functions described above, the procedures used are given in the following sections. In this practice, the parameters are lower-case Greek letters and the estimates are the corresponding lower-case Roman letters. For example, the estimate of the slope parameter β₁ is b₁.
5.5.1 The fitted values of Y, denoted Ŷᵢ (read Y-hat), for each data point (Xᵢ, Yᵢ) are calculated from the estimated regression function. For the straight-line model, the fitted values of Yᵢ are Ŷᵢ = b₀ + b₁Xᵢ. The right-hand function defines the regression line, which may be shown on the scatter plot of the data to evaluate model fit.
5.5.2 The estimates of the error term values ε are the residuals eᵢ, calculated as eᵢ = Yᵢ − Ŷᵢ, and these are used to estimate the standard deviation parameter σ. Note that the residual values are the vertical distances of the points from the regression line.

5.6 Evaluation of the regression model is performed to diagnose departure from model assumptions, such as model fit to the data, constancy of variance over the range of X, and conformance to the assumed error distribution. Residual plots are useful for these diagnostics.
5.6.1 A plot of the residuals against their X values (or equivalently, against their Ŷᵢ values) will detect certain departures from the assumptions. Residuals may also be plotted against time of testing (if available) or against another known variable. Fig. 3 shows some of these patterns and discusses remedies for these departures. (The horizontal line on the plots indicates a value of zero for the average of the residuals.)
(1) Plot A – the desired horizontal pattern – indicates no model deficiencies.
(2) Plot B – increasing variance with X; consider weighted regression (see A1.4.2) or data transformations (see A1.5).
(3) Plot C – curvature in the relationship; consider adding a quadratic term or using a nonlinear model (see Section 8).
(4) Plot D – possible effect of time order of testing or the effect of another variable denoted as T.
5.6.2 Plotting the residuals against a vertical scale of the cumulative percentage of the normal distribution checks the assumption of normality in the model. The fitted cumulative normal distribution from the data is shown as a straight line on the plot if the residuals fit a normal distribution. Computer packages provide these plots and can also perform a more rigorous statistical test for normality. If the plot indicates a curve, a data transformation may be required to achieve a normal distribution.
5.6.3 Outlier testing in regression analysis takes two forms. Outlier testing can be performed upon any sets of multiple Y's collected at each unique value of X studied. Additionally, outlier analysis can be performed on the entire set of residuals. In the latter case, finding an outlier could indicate an issue in either the X or Y value of the point in question or it may indicate other issues with the regression analysis.

5.7 Use of the model for interval estimates of regression parameters and predicted Y values.
5.7.1 The estimates of model parameters and fitted Ŷᵢ values are point estimates. For example, the estimate of the slope parameter β₁ is the estimate b₁ that has been calculated from the data. To give a sense of the precision for these estimates, interval estimates, or confidence intervals, can be provided. A general form for the confidence interval for a general point estimate E is:

E ± t·s_E   (1)

where s_E is the standard error of the estimate and t is a tabulated multiplier that is dependent upon the degrees of freedom of the standard error and the desired confidence level, stated as a percentage. Thus, we may state that the true value of the parameter being estimated lies within the confidence interval at a given confidence level. The degrees of freedom for the standard error are generally n – p, where p is the number of parameters in the regression model.
5.7.2 To calculate these interval estimates, the form of the statistical distribution for Y is required, and the normal distribution is often assumed. The widths of the interval estimates, given here as two-sided confidence intervals, are dependent on (1) the standard errors of the estimates, and (2) the level of confidence. The standard errors depend on the number of data pairs n and the values of the Xᵢ.
The confidence level is defined as 100(1 – α) %, where α is the probability that the confidence interval does not contain the parameter value. For example, α = 0.05 (or a risk of 5 % non-coverage) corresponds to a confidence level of 95 %, which shall be used for the examples in this practice. The value of t is the upper (1 – α/2)th quantile of the Student's t distribution with n – p degrees of freedom, for a confidence level of 100(1 – α) %. Values of t are found in statistical texts and in commercial statistical software packages.

FIG. 3 Residual Plots – Some Patterns
5.7.3 The confidence interval can also be stated as the interval (L, U) between lower (L) and upper (U) confidence limits for the parameter being estimated. Practice E2586 provides discussion of confidence intervals, standard error, and degrees of freedom.
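As a minimal sketch of Eq 1, the snippet below obtains the t multiplier from SciPy rather than a printed table; the point estimate E and standard error s_E shown are the slope estimate and its standard error from the example in 6.4.1.

```python
# General interval estimate E +/- t*s_E (Eq 1): t is the upper
# (1 - alpha/2) quantile of Student's t with n - p degrees of freedom.
from scipy.stats import t as t_dist

n, p, alpha = 10, 2, 0.05             # n data pairs, p model parameters
t = t_dist.ppf(1 - alpha / 2, n - p)  # 2.306 for 8 df at 95 % confidence

E, s_E = 6.898, 1.376                 # slope estimate and standard error (6.4.1)
L, U = E - t * s_E, E + t * s_E       # two-sided confidence limits (L, U)
print(round(t, 3), round(L, 3), round(U, 3))
```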
6. Simple Linear Regression Analysis

6.1 Simple Linear Regression Model:
6.1.1 This model defines the functional relationship between X and Y as a straight line in the XY plane.
6.1.2 The regression function for the straight line relationship is:

Y = β₀ + β₁X   (2)

where the two parameters for the function are the intercept β₀ and the slope β₁. The intercept is the value of Y when X = 0, but this parameter may not be of practical interest when the range of X is far removed from zero. The slope is the amount of incremental change in Y units for a unit change in X.
6.1.3 The statistical distribution for Y is usually assumed to be a normal (Gaussian) distribution having a mean of β₀ + β₁X with a standard deviation σ. The simple linear regression model is then stated as:

Y = β₀ + β₁X + ε   (3)

where ε is a random error that is normally distributed with mean zero and standard deviation σ (variance σ²).

6.2 Estimating Regression Model Parameters:
6.2.1 The model parameters β₀ and β₁ are estimated from a sample of data consisting of n pairs of values designated as (Xᵢ, Yᵢ), with the sample number i ranging from 1 through n. The data can arise in two different ways. Observational data consists of X and Y values measured on a set of n random samples. Experimental data consists of Y values measured on n experimental units with X values set at fixed values. In both cases the Y values may have measurement error, but the X values are assumed known with negligible measurement error.
6.2.2 The regression line parameters β₀ and β₁ are estimated by the method of least squares, which finds their corresponding estimates b₀ and b₁ that minimize the sum of the squares of the vertical distances between the Yᵢ values and their respective line values at Xᵢ. (For a further discussion of the least squares method, see A1.1.2.)
6.2.3 Calculate the following statistics from the X and Y values in the data set.
6.2.3.1 Calculate the averages of X and Y:

X̄ = ΣXᵢ / n   (4)
Ȳ = ΣYᵢ / n   (5)

6.2.3.2 Calculate the sums of squared deviations S_XX and S_YY of X and Y from their respective averages and the sum of cross products S_XY of the X and Y deviations from their averages:

S_XX = Σ(Xᵢ − X̄)²   (6)
S_YY = Σ(Yᵢ − Ȳ)²   (7)
S_XY = Σ(Xᵢ − X̄)(Yᵢ − Ȳ)   (8)

S_XX is a known fixed constant. S_YY and S_XY are random variables.
6.2.3.3 The least squares solution gives the parameter estimates:

b₁ = S_XY / S_XX   (9)
b₀ = Ȳ − b₁X̄   (10)

6.2.4 The fitted values Ŷᵢ for each data point Yᵢ are calculated from the estimated regression function as:

Ŷᵢ = b₀ + b₁Xᵢ   (11)

6.2.5 The residual eᵢ is the difference between the response data point Yᵢ and its fitted value Ŷᵢ:

eᵢ = Yᵢ − Ŷᵢ   (12)

Residuals are graphically the vertical distances on the scatter plot between the response data points Yᵢ and the estimated regression line.
6.2.6 The estimates s² of the variance σ² and s of the standard deviation σ of the Y distribution are calculated as the sum of the squared residuals divided by their degrees of freedom:

s² = Σeᵢ² / (n − 2) = Σ(Yᵢ − Ŷᵢ)² / (n − 2)   (13)
s = √s²   (14)

These estimates have n – 2 degrees of freedom because of prior estimation of two parameters, the slope and intercept of the line, which removed two degrees of freedom from the data set of n data points prior to calculation of the residuals.
6.2.7 Example—A data set from Duncan, Ref (3), lists measurements of shear strength (inch-pounds) and weld diameter (mils) measured on 10 random test specimens, so this is an observational data set with n = 10 pairs. Regression analysis will be used to investigate the relationship between weld diameter and shear strength, with the objective of predicting shear strength Y from weld diameter X. The weld diameters are considered to be measured with small error. The data are listed in Table 1.
6.2.7.1 The scatter plot for this example is shown in Fig. 4. The shear strength appears to be increasing in a linear fashion with weld diameter. There is some scatter but no apparent outlying data points.
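The calculations of Eq 4-14 are simple enough to verify directly. The sketch below applies them to the ten data pairs of Table 1 (shown next) and reproduces the tabled statistics.

```python
# Least squares estimation for the straight line model, Eq 4-14,
# applied to the weld data of Table 1.
import numpy as np

X = np.array([190, 200, 209, 215, 215, 215, 230, 250, 250, 265], float)
Y = np.array([680, 800, 780, 885, 975, 1025, 1100, 1030, 1300, 1175], float)
n = len(X)

Xbar, Ybar = X.mean(), Y.mean()         # Eq 4, Eq 5: 223.9, 975.0
S_XX = np.sum((X - Xbar) ** 2)          # Eq 6: 5268.9
S_YY = np.sum((Y - Ybar) ** 2)          # Eq 7: 330550.0
S_XY = np.sum((X - Xbar) * (Y - Ybar))  # Eq 8: 36345.0
b1 = S_XY / S_XX                        # Eq 9: 6.898
b0 = Ybar - b1 * Xbar                   # Eq 10: -569.47
Yhat = b0 + b1 * X                      # Eq 11: fitted values
e = Y - Yhat                            # Eq 12: residuals
s2 = np.sum(e ** 2) / (n - 2)           # Eq 13: 9980.16
s = np.sqrt(s2)                         # Eq 14: 99.90
```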
TABLE 1 Data and Calculations for Straight Line Regression Model Example

Sample, i    Xᵢ      Yᵢ      Xᵢ − X̄    Yᵢ − Ȳ     Ŷᵢ        eᵢ
1            190     680     –33.9     –295.0     741.2     –61.2
2            200     800     –23.9     –175.0     810.1     –10.1
3            209     780     –14.9     –195.0     872.2     –92.2
4            215     885      –8.9      –90.0     913.6     –28.6
5            215     975      –8.9        0.0     913.6      61.4
6            215    1025      –8.9       50.0     913.6     111.4
7            230    1100       6.1      125.0    1017.1      82.9
8            250    1030      26.1       55.0    1155.0    –125.0
9            250    1300      26.1      325.0    1155.0     145.0
10           265    1175      41.1      200.0    1258.5     –83.5
Average     223.9   975.0      0.0        0.0     975.0       0.0   (Eq 4, Eq 5)

Statistics: S_XX = 5268.90 (Eq 6); S_YY = 330550.00 (Eq 7); S_XY = 36345.00 (Eq 8); Slope, b₁ = 6.8980 (Eq 9); Intercept, b₀ = –569.47 (Eq 10); Variance, s² = 9980.16 (Eq 13); St. Dev., s = 99.90 (Eq 14)

6.2.7.2 The calculations, with equation numbers for each calculation, are shown in Table 1. The averages of X and Y are respectively 223.9 mils and 975.0 inch-pounds. The deviations of X and Y from their averages are listed for each observation, and these are used to calculate values of the statistics S_XX, S_YY, and S_XY. The least squares estimates of the slope and intercept are calculated, resulting in the estimated model equation giving fitted values Ŷᵢ = –569.47 + 6.898Xᵢ, and these values are listed for each observation. The residuals eᵢ = Yᵢ − Ŷᵢ are also listed for each observation. Estimates of the variance and standard deviation of the Y distribution are calculated from squares of the residuals. The estimated standard deviation is 99.90 inch-pounds.
6.2.7.3 The least squares straight line is depicted with the scatter plot in Fig. 4, and indicates that a straight line model appears to give a reasonable fit to this data set. Some additional comments from Table 1 are:
(1) The least squares estimated model equation is Y = –569.47 + 6.898X. Clearly the negative intercept is not a plausible value for shear strength. This is apparently due to the fact that the data are so far removed from the origin (0, 0) that the estimate is poorly defined. It is also possible that there is some nonlinear behavior in the relationship approaching the origin.
(2) The averages of the deviations of X and Y from their averages are zero, and the average of the residuals is zero. These results follow from the property that sums of deviations from averages are zero.
(3) The average of the fitted values Ŷᵢ is the same as the average of the Y data.

FIG. 4 Scatter Plot of Data with Fitted Linear Model

6.3 Evaluation of the Model:
6.3.1 This section discusses model evaluation through measures of association and plots of the residuals to check for departures from the model assumptions and the presence of data outliers.
6.3.2 Measures of Association Between X and Y:
6.3.2.1 The sample correlation coefficient is a dimensionless statistic intended to measure the strength of a linear relationship between two variables. The estimated correlation coefficient, r, from a set of paired data (Xᵢ, Yᵢ) is calculated from three statistics, S_XX, S_YY, and S_XY:

r = S_XY / √(S_XX·S_YY)   (15)

The value of the correlation coefficient ranges between –1 and +1. The sign of r is the same as the sign of the slope estimate b₁. Values of r near 0 indicate a weak or nonexistent straight line relationship. An r value closer to either +1 or –1 indicates that a straight line provides an ever stronger explanation of the relationship. Fig. 5 shows examples of scatter plots that appear for selected values of r.

FIG. 5 Typical Scatter Plots for Selected Values of the Correlation Coefficient, r
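Eq 15 can be checked against the Table 1 statistics with a short snippet; the values it prints match those given in 6.3.2.3 below.

```python
# Sample correlation coefficient (Eq 15) and coefficient of
# determination, from the Table 1 statistics.
import numpy as np

S_XX, S_YY, S_XY = 5268.9, 330550.0, 36345.0
r = S_XY / np.sqrt(S_XX * S_YY)   # Eq 15 -> 0.8709
r2 = r ** 2                       # -> 0.7585
print(round(r, 4), round(r2, 4))
```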
6.3.2.2 The coefficient of determination is the squared value of the correlation coefficient with symbol r². It measures the proportion of variation in the Y data explained by the predictor variable X.
6.3.2.3 For the example the sample correlation coefficient is:

r = 36345 / √((330550)(5268.9)) = 0.8709

The sample coefficient of determination for the example is r² = 0.8709² = 0.7585. This means that approximately 76 % of the variance in Y is explained by the straight line model (see 6.5.2). These measures are often used as acceptance criteria for linearity; but this usage should be discouraged, because these statistics are not absolute measures of linearity and should be used for comparative purposes only.

6.3.3 Residual Plots:
6.3.3.1 Plots of residuals eᵢ are used for evaluating outliers in the data and various model assumptions over the range of X: constancy of Y distribution variance, curvature of the regression function, lack of independence of errors, and normality of the Y distribution.
6.3.3.2 The residuals dot plot is a useful diagnostic for finding outliers, which may be harder to detect from the data set itself. Large outliers can distort the estimate of the regression line because the least squares procedure will tend to move the line towards the outlier, thus masking it. Formal outlier testing procedures can be found in Practice E178.
A residuals dot plot for the example is shown in Fig. 6. There are no apparent outliers at each end of the plot. Additional graphics for this purpose are histograms, "stem and leaf" plots, and "box and whiskers" plots. (See Practice E2586.)

FIG. 6 Dot Plot of Residuals

The plot of residuals against X in Fig. 7 indicates no discernable pattern, such as curvature or increasing scatter versus X, but this is a relatively small data set.

FIG. 7 Plot of Residuals versus X – Duncan Example

6.3.3.3 Plotting the residuals against a vertical scale of the cumulative percentage of the normal distribution checks the assumption of normality in the model. The fitted cumulative normal distribution from the data is shown as a straight line on the plot if the residuals fit a normal distribution. Computer packages provide these plots and can also perform a more rigorous statistical test for normality. For the example, the normal probability plot of the residuals in Fig. 8 indicates an approximate straight line pattern, supporting a normal distribution for the residuals.

FIG. 8 Normal Probability Plot of Residuals

6.4 Interval Estimates of Regression Parameters and Predicted Y Values—This section shows the calculations for the interval estimates for b₀ and b₁ of their respective model parameters β₀ and β₁ for the simple linear model (see 5.7 for an introduction to this concept). Also given are calculations for certain predicted values of Y at given values of X. For these calculations the estimate s of the standard deviation σ of the Y distribution is required with its degrees of freedom n – 2. Also required is the choice of the confidence level, and for these calculations a 95 % confidence interval will be used. In the example, the standard deviation estimate is s = 99.9 inch-pounds with n – 2 = 10 – 2 = 8 degrees of freedom. The value of t for a 95 % two-sided confidence interval with 8 degrees of freedom is 2.306.
6.4.1 Confidence Interval for the Slope—The standard error for the slope estimate is:

s_b1 = s / √S_XX   (16)

From the example:

s_b1 = 99.9 / √5268.9 = 1.376

The confidence interval for the slope β₁ is calculated as:

b₁ ± t·s_b1   (17)

From the example, the 95 % confidence interval is:

6.898 ± (2.306)(1.376) = 6.898 ± 3.173, or (3.725, 10.071)

If the slope confidence interval includes zero, this supports the assertion that there is no relationship between X and Y at the given level of confidence. In this example, the slope confidence interval does not include zero, thus supporting the existence of a statistical relationship between Y and X.
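A sketch of Eq 16 and Eq 17 in code, reproducing the slope interval above to rounding:

```python
# 95 % confidence interval for the slope, Eq 16 and Eq 17.
import numpy as np
from scipy.stats import t as t_dist

s, S_XX, b1, n = 99.9, 5268.9, 6.898, 10
s_b1 = s / np.sqrt(S_XX)             # Eq 16 -> 1.376
t = t_dist.ppf(0.975, n - 2)         # 2.306 for 8 df
L, U = b1 - t * s_b1, b1 + t * s_b1  # Eq 17 -> about (3.725, 10.071)
print(round(L, 3), round(U, 3))
```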
6.4.2 Confidence Interval for the Intercept—The standard error for the intercept estimate is:

s_b0 = s·√(1/n + X̄²/S_XX)   (18)

From the example:

s_b0 = 99.9·√(1/10 + 223.9²/5268.9) = 309.76

The confidence interval for the intercept β₀ is calculated as:

b₀ ± t·s_b0   (19)

In this example, the 95 % confidence interval is:

–569.5 ± (2.306)(309.76) = –569.5 ± 714.3, or (–1283.8, 144.8)

If the confidence interval includes zero, this technically supports the assertion that the line may go through the origin (0, 0) at the given level of confidence. However, this use of the confidence interval amounts to a rather large extrapolation outside the range of the data, which explains the implausible negative estimate mentioned in 6.2.7.3.
6.4.3 Confidence Interval for the Predicted Value of the Mean Y at a Given X—The predicted value Ŷ_h for a mean response of Y at X_h is:

Ŷ_h = b₀ + b₁X_h   (20)

The index h is used instead of the index i because the prediction is not necessarily from a value of X in the data set. Predictions outside the range of X (extrapolation) should be performed with caution, as the regression function may not be valid outside this range.
The standard error for a mean Y response at a value X = X_h is:

s_Ŷh = s·√(1/n + (X_h − X̄)²/S_XX)   (21)

From the example, at X_h = 215 mils, the standard error is:

s_Ŷh = 99.9·√(1/10 + (215 − 223.9)²/5268.9) = 33.88

The confidence interval for the mean Y response at a value X = X_h is calculated as:

Ŷ_h ± t·s_Ŷh   (22)

From the example, the 95 % confidence interval for the average predicted value of 913.6 inch-pounds is:

913.6 ± (2.306)(33.88) = 913.6 ± 78.13, or (835.47, 991.73)

Thus the expected mean response of Y at X = 215 falls between 835.47 and 991.73 with 95 % confidence.
6.4.4 Confidence Interval for the Predicted Value of a Future Value Y at a Given X—The standard error for an individual response Y_h(ind) at X = X_h is calculated as:

s_Ŷh(ind) = s·√(1 + 1/n + (X_h − X̄)²/S_XX)   (23)

From the example, at X_h = 215 mils, the standard error is:

s_Ŷh(ind) = 99.9·√(1 + 1/10 + (215 − 223.9)²/5268.9) = 105.49 inch-pounds

The confidence interval for a future new Y response at a value X = X_h is calculated as:

Ŷ_h(ind) ± t·s_Ŷh(ind)   (24)

This is known as a prediction interval, an interval estimate that would contain a future observation with a given probability based on the data set. Prediction intervals are wider than confidence intervals because a prediction interval applies to an individual value whereas the confidence interval applies to a mean response. In the example, the prediction interval at 95 % confidence for the predicted value of the response at a weld diameter of 215 mils is:

913.6 ± (2.306)(105.49) = 913.6 ± 243.26, or (670.34, 1156.86)

6.4.5 An array of confidence intervals and prediction intervals shown as bands around the regression line is depicted in Fig. 9 for the example. The vertical intervals are narrowest at the centroid (X̄, Ȳ) of the data and become wider as the distance from the center increases. These bands are valid for single predictions only. Multiple predictions using the same data set are discussed in A1.1.8.1. These bands can be useful in setting manufacturing requirements; for example, the confidence interval indicates that a minimum weld diameter of 200 mils would be required to obtain an average shear strength of 700 inch-pounds at 95 % confidence. The prediction interval suggests that a minimum weld diameter of 220 mils would be necessary to guarantee that a single future item would meet that shear strength with 95 % confidence.

FIG. 9 Regression Plot with 95 % Confidence and Prediction Intervals
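Eq 20-24 at X_h = 215 mils can be verified with a short script; it reproduces the confidence and prediction intervals above.

```python
# Confidence interval for the mean response (Eq 20-22) and prediction
# interval for a future individual response (Eq 23, Eq 24) at Xh = 215.
import numpy as np
from scipy.stats import t as t_dist

b0, b1, s, S_XX, Xbar, n = -569.47, 6.898, 99.9, 5268.9, 223.9, 10
Xh = 215.0
t = t_dist.ppf(0.975, n - 2)                           # 2.306 for 8 df

Yhat_h = b0 + b1 * Xh                                  # Eq 20 -> 913.6
se_mean = s * np.sqrt(1/n + (Xh - Xbar)**2 / S_XX)     # Eq 21 -> 33.88
se_ind = s * np.sqrt(1 + 1/n + (Xh - Xbar)**2 / S_XX)  # Eq 23 -> 105.49

print(Yhat_h - t * se_mean, Yhat_h + t * se_mean)      # Eq 22: about (835.5, 991.7)
print(Yhat_h - t * se_ind, Yhat_h + t * se_ind)        # Eq 24: about (670.3, 1156.9)
```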
6.5 Analysis of Variance (ANOVA) Calculations:
6.5.1 Statistical analysis packages are often used for regression analysis. The output consists of the estimates of the regression parameters, various plots, and an ANOVA table. The calculations for the ANOVA table are shown in Table 2. This section discusses the ANOVA procedure and its relation to earlier calculations.

TABLE 2 ANOVA Table Calculations

Source of Variation   Degrees of Freedom   Sum of Squares        Mean Square           F statistic    p-value
Regression            1                    SSR = Σ(Ŷᵢ − Ȳ)²      MSR = SSR / 1         F = MSR/MSE    p
Residual              n – 2                SSE = Σ(Yᵢ − Ŷᵢ)²     MSE = SSE / (n – 2)
Total                 n – 1                SST = Σ(Yᵢ − Ȳ)²      MST = SST / (n – 1)

6.5.2 ANOVA partitions the total sum of squares in the Y data, SST, into the residual sum of squares, SSE, and the regression sum of squares, SSR. The degrees of freedom (df) for these sums of squares are respectively n – 1, n – 2, and 1. SST has been previously calculated as S_YY in Eq 7, and SSE has been previously calculated as the sum of the squared residuals, the numerator of s² in Eq 13. SSR is the sum of squares of deviations of the fitted values from their average Ȳ, which represents the variation removed from the Y data due to its estimated relationship with X. SSR may also be equivalently calculated as:

SSR = Σ(Ŷᵢ − Ȳ)² = b₁²·Σ(Xᵢ − X̄)²   (25)

This expression enables calculation of the sums of squares for regression and for error without first requiring calculation of fitted values and residuals.
The mean squares are variances, each calculated as a sum of squares divided by its degrees of freedom. The F statistic is the ratio of the regression mean square to the residual mean square, and is used to test the fit of the regression model, thus F = MSR/MSE. MST is the variance of the Y data, see Eq A1.14.
The p-value is the probability of obtaining a slope estimate as large as that obtained from the data, assuming that the true slope is zero. Low values of p, such as p < 0.05, are used to reject the condition that the true slope is zero, thus confirming that a relationship that is either linear, or that has a statistically-meaningful trend component, exists between X and Y.
6.5.3 The ANOVA table for the example is shown in Table 3. The F test indicated a high level of statistical significance for the validity of the model with a low p value of 0.001. The coefficient of determination r² = SSR / SST = 250709 / 330550 = 0.7585, which agrees with the value in 6.3.2.3.

TABLE 3 ANOVA Table for Example

Source       df   SS       MS       F       P
Regression   1    250709   250709   25.12   0.001
Residual     8    79841    9980
Total        9    330550
7. Zero Intercept Linear Model
where t is the (1 – α/2)th quantile of the t distribution with
7.1 An associated model often considered along with the
n – 1 degrees of freedom. The confidence bands for the line are
simple linear model is the model that constrains the intercept to
also straight lines with zero intercepts having slopes defined by
be zero. Thus Y is proportional to X throughout the range. This
the confidence limits on the slope (see Fig. 11).
model is useful in test methods where single-point calibration
7.2 Example—An experiment was conducted to determine
is conducted periodically, due to minor instabilities in the
an instrument response over a range of 0 mg ⁄L to 10 mg/L of
testing process. The regression model is:
a substance dissolved in a solvent. Five solution standards at
Y 5 β X1ε (26)
(2, 4, 6, 8, and 10) mg/L concentrations were run in duplicate
where the slope β is the single regression function param- and the results are shown in Fig. 10. A zero-intercept model
eter and ε is a random error term that is assumed to be was considered because the data points appeared to lie in a
normally distributed with mean zero and variance σ . straight line that approached the origin.
TABLE 2 ANOVA Table Calculations
Source of Variation Degrees of Freedom Sum of Squares Mean Square F statistic p-value
ˆ ¯
Regression 1 MSR = SSR / 1 F = MSR / MSE p
SSR5ΣsY 2 Yd
i
ˆ ¯
Residual n – 2 MSE = SSE / n – 2
SSE5ΣsY 2 Yd
i
ˆ ¯
Total n – 1 MST = SST / n – 1
SST5ΣsY 2 Yd
i
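As a cross-check on Table 3, the following minimal sketch (again not part of this practice, with the same transcribed data and library assumptions as above) computes the ANOVA partition of 6.5.2, the F statistic, its p-value, and r².

```python
# Minimal sketch (not part of this practice): the ANOVA partition of 6.5.2
# for the weld-strength example, reproducing Table 3.
import numpy as np
from scipy import stats

X = np.array([190, 200, 209, 215, 215, 215, 230, 250, 250, 265], float)
Y = np.array([680, 800, 780, 885, 975, 1025, 1100, 1030, 1300, 1175], float)
n = len(X)

Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
SST = np.sum((Y - Y.mean()) ** 2)   # S_YY, Eq 7: 330550
SSR = b1 ** 2 * Sxx                 # Eq 25: about 250709
SSE = SST - SSR                     # about 79841
F = (SSR / 1) / (SSE / (n - 2))     # MSR / MSE: about 25.12
p = stats.f.sf(F, 1, n - 2)         # about 0.001
r2 = SSR / SST                      # 0.7585, as in 6.3.2.3
print(F, p, r2)
```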
7. Zero Intercept Linear Model

7.1 An associated model often considered along with the simple linear model is the model that constrains the intercept to be zero, so that Y is proportional to X throughout the range. This model is useful in test methods where single-point calibration is conducted periodically to correct for minor instabilities in the testing process. The regression model is:

Y = \beta_1 X + \varepsilon  (26)

where the slope β_1 is the single regression function parameter and ε is a random error term that is assumed to be normally distributed with mean zero and variance σ².

7.1.1 The slope estimate b_1 is calculated as:

b_1 = \sum_{i=1}^{n} X_i Y_i \Big/ \sum_{i=1}^{n} X_i^2  (27)

7.1.2 The fitted values Ŷ_i for each data point Y_i are calculated from the estimated regression function as:

\hat{Y}_i = b_1 X_i  (28)

7.1.3 The residual e_i is the difference between the response data point Y_i and its fitted value Ŷ_i:

e_i = Y_i - \hat{Y}_i  (29)

7.1.4 The estimates s² of the variance σ² and s of the standard deviation σ of the Y distributions are calculated as the sum of the squared residuals divided by their degrees of freedom:

s^2 = \sum_{i=1}^{n} e_i^2 / (n - 1) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n - 1)  (30)

s = \sqrt{s^2}  (31)

These estimates have n − 1 degrees of freedom because prior estimation of the slope of the line removed one degree of freedom from the data set of n data points before calculation of the residuals.

7.1.5 The standard error for the slope estimate is:

s_{b_1} = s \Big/ \sqrt{\sum_{i=1}^{n} X_i^2}  (32)

The 100(1 − α) % two-sided confidence interval for the slope β_1 is calculated as:

b_1 \pm t s_{b_1}  (33)

where t is the (1 − α/2)th quantile of the t distribution with n − 1 degrees of freedom. The confidence bands for the line are also straight lines with zero intercepts, having slopes defined by the confidence limits on the slope (see Fig. 11).
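Eq 27 through Eq 33 can be wrapped in a small helper. The sketch below is illustrative only and not part of this practice; the function name fit_through_origin is an invention of this note, and numpy and scipy are assumed to be available.

```python
# Minimal sketch (not part of this practice) of Eq 27 - Eq 33; the helper
# name fit_through_origin is illustrative, not defined by the standard.
import numpy as np
from scipy import stats

def fit_through_origin(x, y, conf=0.95):
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(x)
    b1 = np.sum(x * y) / np.sum(x ** 2)     # slope, Eq 27
    e = y - b1 * x                          # fitted values and residuals, Eq 28 - Eq 29
    s = np.sqrt(np.sum(e ** 2) / (n - 1))   # Eq 30 - Eq 31: n - 1 df, one parameter
    se_b1 = s / np.sqrt(np.sum(x ** 2))     # standard error of the slope, Eq 32
    t = stats.t.ppf(1 - (1 - conf) / 2, n - 1)
    return b1, s, (b1 - t * se_b1, b1 + t * se_b1)   # interval, Eq 33
```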
7.2 Example—An experiment was conducted to determine an instrument response over a range of 0 mg/L to 10 mg/L of a substance dissolved in a solvent. Five solution standards at (2, 4, 6, 8, and 10) mg/L concentrations were run in duplicate, and the results are shown in Fig. 10. A zero-intercept model was considered because the data points appeared to lie on a straight line that approached the origin.

FIG. 10 Instrument Response versus Concentration

7.2.1 The slope b_1 is estimated as follows:

\Sigma X_i^2 = 440.0, \quad \Sigma X_i Y_i = 2307.8, \quad b_1 = 2307.8 / 440.0 = 5.245

The predicted values Ŷ_i = b_1 X_i and residuals e_i = Y_i − Ŷ_i are listed in Table 4. The variance estimate is:

s^2 = \sum_{i=1}^{n} e_i^2 / (n - 1) = 52.719 / 9 = 5.8577

The standard deviation estimate is s = √5.8577 = 2.42, and the standard error on the slope is:

s_{b_1} = s \Big/ \sqrt{\sum_{i=1}^{n} X_i^2} = 2.42 / \sqrt{440} = 0.115

The 95 % confidence interval on the slope is b_1 ± t s_{b_1} = 5.245 ± (2.262)(0.115) = 5.245 ± 0.260, so the slope confidence limits are L = 4.985 and U = 5.505. The t value is the upper 97.5th percentile of the Student's t distribution with 9 degrees of freedom.

The 95 % confidence limits on the fitted line at the listed data values X_i are given in Table 4. The data points, fitted line, and confidence bands on the line are depicted in Fig. 11.

TABLE 4 Predicted Values and Residuals

X_i   Y_i    Ŷ_i    e_i    L       U
0     0      0.0    0.0    0.00    0.00
2     9.2    10.5   −1.3   9.97    11.01
2     12.9   10.5   2.4    9.97    11.01
4     20.7   21.0   −0.3   19.94   22.02
4     23.1   21.0   2.1    19.94   22.02
6     31.7   31.5   0.2    29.90   33.04
6     34.5   31.5   3.0    29.90   33.04
8     39.2   42.0   −2.8   39.87   44.05
8     44.2   42.0   2.2    39.87   44.05
10    54.0   52.5   1.5    49.84   55.06
10    48.4   52.5   −4.1   49.84   55.06

FIG. 11 No-Intercept Line with Confidence Limits (L, U)
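Running the fit_through_origin sketch from 7.1.5 on these data (transcribed from Table 4) reproduces the worked values; the slope limits come out as 4.984 and 5.506 rather than 4.985 and 5.505 only because the worked example above rounds the intermediate values.

```python
# Continues the fit_through_origin sketch from 7.1.5; data transcribed
# from Table 4 (the five standards run in duplicate).
x = [2, 2, 4, 4, 6, 6, 8, 8, 10, 10]                              # mg/L
y = [9.2, 12.9, 20.7, 23.1, 31.7, 34.5, 39.2, 44.2, 54.0, 48.4]   # response
b1, s, (L, U) = fit_through_origin(x, y)
print(b1, s, L, U)   # 5.245, 2.42, about 4.984, about 5.506
```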
7.3 Supplementary Information—The no-intercept model (also known as regression through the origin) has a number of differences from the simple linear model with intercept, as listed in Table 5.

It is recommended that the model with intercept be estimated as well as the no-intercept model, for comparison of performance near the origin. A computer output for the model with intercept is shown in Table 6. The intercept estimate is 1.76, and the slope estimate is 5.005. However, Table 6 shows that the 95 % confidence interval for the intercept, (−2.389, 5.909), contains zero, indicating that the intercept is not statistically different from zero. In this situation, the intercept term could be dropped from the model, which supports the original contention that the no-intercept model is appropriate for these data. The two regression lines are shown graphically in Fig. 12, and they track closely together in the concentration range of 4 mg/L to 10 mg/L.
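The comparison recommended in 7.3 can also be scripted. The following minimal sketch, not part of this practice, fits the with-intercept model (Eq 9, Eq 10) to the same calibration data and checks whether the intercept confidence interval (Eq 18, Eq 19) contains zero, reproducing the Table 6 values quoted above.

```python
# Minimal sketch (not part of this practice): fit the with-intercept model
# (Eq 9, Eq 10) to the calibration data and test the intercept (Eq 18, Eq 19).
import numpy as np
from scipy import stats

x = np.array([2, 2, 4, 4, 6, 6, 8, 8, 10, 10], float)
y = np.array([9.2, 12.9, 20.7, 23.1, 31.7, 34.5, 39.2, 44.2, 54.0, 48.4])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx      # slope: 5.005
b0 = y.mean() - b1 * x.mean()                           # intercept: 1.76
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))   # residual std dev
se_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / Sxx)        # Eq 18
t = stats.t.ppf(0.975, df=n - 2)
lo, hi = b0 - t * se_b0, b0 + t * se_b0                 # Eq 19: (-2.389, 5.909)
print(lo, hi, lo <= 0 <= hi)   # True: intercept not distinguishable from zero
```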
8. Dealing with Curvature in the XY Relationship

8.1 Second Order Polynomial Regression Model:

8.1.1 This model defines the functional relationship between X and Y as a parabola in the XY plane. It adds a second order term to the straight-line regression function to accommodate slight curvature in the XY relationship. The straight-line regression model may be considered the first order polynomial regression function.

8.1.1.1 The regression function for the second order relationship is:

Y = \beta_0 + \beta_1 X + \beta_{11} X^2

where the three parameters are the intercept β_0, the slope β_1, and the curvature β_11. The curvature parameter indicates the ...
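The second order function can be fitted by ordinary least squares with standard tools. A minimal sketch, not part of this practice, is shown below; it is applied, purely for illustration, to the 7.2 calibration data, where a curvature estimate near zero would indicate that a straight line is adequate.

```python
# Minimal sketch (not part of this practice): least squares fit of the
# second order function of 8.1.1.1, applied for illustration to the 7.2
# calibration data. np.polyfit returns coefficients highest power first.
import numpy as np

x = np.array([2, 2, 4, 4, 6, 6, 8, 8, 10, 10], float)
y = np.array([9.2, 12.9, 20.7, 23.1, 31.7, 34.5, 39.2, 44.2, 54.0, 48.4])

b11, b1, b0 = np.polyfit(x, y, deg=2)
print(b0, b1, b11)   # intercept, slope, curvature; small b11 favors the straight line
```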