ASTM E3000-18
Standard Guide for Measuring and Tracking Performance of Assessors on a Descriptive Sensory Panel
SIGNIFICANCE AND USE
5.1 This guide is meant to be used with and applied to individual trained descriptive assessors.
5.2 The procedures recommended in this guide can be used by the panel leader to periodically appraise the performance of individual descriptive assessors.
5.3 Tracking assessor performance will provide information as to the quality of the data being generated. Performance information may be used to decide whether to use the data to interpret product profiles.
5.4 Monitoring assessor performance will enable the panel leader to identify retraining needs or to identify assessors who are not performing well enough to continue participating on a panel.
SCOPE
1.1 This guide provides guidelines for measuring and tracking the performance of individual assessors on a descriptive sensory panel.
1.2 This guide provides guidelines to assist sensory professionals in measuring performance for given assessors. Measuring performance will form the basis for (1) determining the reliability of the results, and (2) establishing remedial actions for an individual assessor.
1.3 This guide examines various aspects of trained assessor performance; such as repeatability, discrimination, and agreement and demonstrates some ways to measure them. The procedures will help the sensory professional determine areas of good performance as well as those that require improvement.
1.4 Individual assessor performance is tracked using established statistical procedures. These procedures depend on whether replicates are collected and if they are collected over multiple sessions or within a single session.
1.5 This guide provides suggested procedures, including statistical procedures that can be done using standard statistical software, for evaluating performance and is not meant to exclude other methods that may be effectively used for a similar purpose.
1.6 Methods for training and screening assessors are not within the scope of this guide. This guide does not address how to communicate performance feedback information to individual assessors. This monitoring of panel reproducibility, a measure of the panel’s ability to reproduce the results of other panels, is also not within the scope of this guide.
1.7 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.8 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
General Information
- Status: Published
- Publication Date: 31-Mar-2018
- Technical Committee: E18 - Sensory Evaluation
- Drafting Committee: E18.03 - Sensory Theory and Statistics
Overview
ASTM E3000-18: Standard Guide for Measuring and Tracking Performance of Assessors on a Descriptive Sensory Panel provides comprehensive guidelines for evaluating and monitoring the performance of individual trained sensory assessors. Developed by ASTM Committee E18 on Sensory Evaluation, this standard is essential for sensory professionals and organizations engaged in descriptive sensory analysis, ensuring data reliability and ongoing panel effectiveness.
Performance tracking is critical to ensure high-quality sensory data, which drives meaningful product development, quality control, and consumer research decisions. This guide outlines practical, statistical, and methodological approaches for monitoring individual assessor performance, identifying those who may require retraining, or whose data may impact panel accuracy.
Key Topics
- Purpose and Scope
  - Covers procedures for tracking and appraising trained sensory assessors.
  - Focuses on individual performance rather than panel-wide reproducibility.
- Performance Metrics
  - Repeatability: Assessors’ ability to replicate results on repeated trials.
  - Discrimination: Assessors’ capability to differentiate between sample attributes.
  - Agreement: Consistency of assessors’ scoring relative to other panelists.
- Evaluation Methods
  - Use of standard statistical tools and software for assessment.
  - Understanding causes of poor performance, such as inconsistency in scale or lexicon usage.
- Data Quality and Action
  - Establishes when data is reliable enough for product profiling.
  - Provides guidance on when retraining or exclusion of assessors may be necessary.
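As a rough illustration of two of the metrics listed above, the following Python sketch computes a repeatability figure (the pooled standard deviation of an assessor's replicate scores) and an agreement figure (the correlation between an assessor's sample means and the mean of the other assessors). The data, the function names, and the choice of these particular statistics are assumptions made for illustration; this summary does not prescribe specific formulas or thresholds.

```python
from statistics import mean, variance

# Hypothetical ratings for ONE attribute: scores[assessor][sample] -> replicate scores.
scores = {
    "A1": {"S1": [2.0, 2.5], "S2": [5.0, 5.5], "S3": [8.0, 7.5]},
    "A2": {"S1": [3.0, 2.0], "S2": [6.0, 5.0], "S3": [9.0, 8.0]},
    "A3": {"S1": [7.0, 2.0], "S2": [3.0, 8.0], "S3": [4.0, 6.0]},
}


def repeatability_sd(by_sample):
    """Pooled within-sample SD of one assessor's replicates (lower = more repeatable).

    Assumes every sample has the same number of replicates, so the pooled
    variance is simply the mean of the per-sample variances.
    """
    return mean(variance(reps) for reps in by_sample.values()) ** 0.5


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def agreement(all_scores, assessor):
    """Correlation of one assessor's sample means with the other assessors' means."""
    samples = sorted(all_scores[assessor])
    own = [mean(all_scores[assessor][s]) for s in samples]
    others = [
        mean(mean(all_scores[a][s]) for a in all_scores if a != assessor)
        for s in samples
    ]
    return pearson(own, others)


for name in scores:
    print(name, round(repeatability_sd(scores[name]), 3), round(agreement(scores, name), 3))
```

In this made-up data set, assessor A3 shows both the largest replicate spread and the weakest agreement with the others, which is the kind of pattern a panel leader might follow up on.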
Practical Applications
- Periodic Performance Appraisal
  - Enables panel leaders to regularly review individual assessor consistency, sensitivity, and agreement.
- Data-Driven Decision Making
  - Helps ensure that only high-quality sensory data is used for interpreting product profiles and research conclusions.
- Quality Assurance in Sensory Testing
  - Identifies and addresses sources of variability, such as assessor misunderstanding or sample differences.
- Corrective Action and Panel Maintenance
  - Supports decisions regarding retraining or reassignment of assessors to maintain robust panel performance.
- Statistical Validity
  - Uses established analysis techniques, such as ANOVA and graphical outputs, to diagnose and resolve issues related to assessor repeatability and discrimination.
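One common way to examine discrimination with ANOVA is a one-way analysis of variance on a single assessor's replicated scores, with sample as the factor. The sketch below is a minimal pure-Python version under that assumption; the data and the function name are hypothetical, and the ANOVA models actually recommended by the guide are not reproduced in this summary.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F ratio of a one-way ANOVA: between-sample mean square / within-sample mean square.

    `groups` is a list of replicate-score lists, one list per sample,
    all scored by the same assessor on the same attribute.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)  # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)    # n - k degrees of freedom
    return ms_between / ms_within


# Hypothetical assessor who separates the three samples clearly.
discriminating = [[2.0, 3.0], [5.0, 6.0], [8.0, 9.0]]
# Hypothetical assessor who gives nearly the same score to every sample.
flat = [[5.0, 6.0], [6.0, 5.0], [5.0, 7.0]]

print(one_way_anova_f(discriminating))
print(one_way_anova_f(flat))
```

A larger F ratio suggests that the assessor separates the samples relative to their own replicate noise. Turning the ratio into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which standard statistical software provides.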
Related Standards
Understanding and using ASTM E3000-18 is enhanced by referencing several related standards and documents:
- ASTM E253: Terminology Relating to Sensory Evaluation of Materials and Products
- ASTM E456: Terminology Relating to Quality and Statistics
- ISO 11132:2012: Sensory Analysis - Methodology - Guidelines for Monitoring the Performance of a Quantitative Sensory Panel
- ASTM STP 758: Guidelines for the Selection and Training of Sensory Panel Members
- ASTM MNL13: Manual on Descriptive Analysis for Sensory Evaluation
Value and Relevance
Using ASTM E3000-18 enhances the credibility and reproducibility of sensory panel results critical to food science, consumer product development, and quality control. This standard provides a structured framework for panel leaders and sensory professionals to monitor assessor performance, ensuring that panel data remains trustworthy for business and research outcomes. Through consistent application of these guidelines, organizations can optimize panel performance, minimize the risk of data loss due to unreliable assessors, and drive continuous improvement in sensory evaluation methodologies.
Keywords: ASTM E3000-18, assessor performance tracking, descriptive sensory panel, sensory evaluation, data quality, panel leader, repeatability, discrimination, agreement, statistical assessment, sensory analysis standards.
Frequently Asked Questions
ASTM E3000-18 is a guide published by ASTM International. Its full title is "Standard Guide for Measuring and Tracking Performance of Assessors on a Descriptive Sensory Panel".
ASTM E3000-18 is classified under the following ICS (International Classification for Standards) categories: 67.240 - Sensory analysis. The ICS classification helps identify the subject area and facilitates finding related standards.
ASTM E3000-18 is linked to the following standards: ASTM E3000-17, ASTM E456-13a(2022)e1, ASTM E253-19, ASTM E253-18a, ASTM E253-18, ASTM E456-13A(2017)e1, ASTM E456-13A(2017)e3, ASTM E253-17, ASTM E253-16, ASTM E253-15b, ASTM E253-15a, ASTM E253-15, ASTM E456-13ae1, ASTM E456-13a, ASTM E456-13ae3. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: E3000 − 18
Standard Guide for Measuring and Tracking Performance of Assessors on a Descriptive Sensory Panel
This standard is issued under the fixed designation E3000; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.
This guide is under the jurisdiction of ASTM Committee E18 on Sensory Evaluation and is the direct responsibility of Subcommittee E18.03 on Sensory Theory and Statistics. Current edition approved April 1, 2018. Published April 2018. Originally approved in 2017. Last previous edition approved in 2017 as E3000 – 17. DOI: 10.1520/E3000-18.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
1. Scope
1.1 This guide provides guidelines for measuring and tracking the performance of individual assessors on a descriptive sensory panel.
1.2 This guide provides guidelines to assist sensory professionals in measuring performance for given assessors. Measuring performance will form the basis for (1) determining the reliability of the results, and (2) establishing remedial actions for an individual assessor.
1.3 This guide examines various aspects of trained assessor performance; such as repeatability, discrimination, and agreement and demonstrates some ways to measure them. The procedures will help the sensory professional determine areas of good performance as well as those that require improvement.
1.4 Individual assessor performance is tracked using established statistical procedures. These procedures depend on whether replicates are collected and if they are collected over multiple sessions or within a single session.
1.5 This guide provides suggested procedures, including statistical procedures that can be done using standard statistical software, for evaluating performance and is not meant to exclude other methods that may be effectively used for a similar purpose.
1.6 Methods for training and screening assessors are not within the scope of this guide. This guide does not address how to communicate performance feedback information to individual assessors. This monitoring of panel reproducibility, a measure of the panel’s ability to reproduce the results of other panels, is also not within the scope of this guide.
1.7 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.8 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Referenced Documents
2.1 ASTM Standards:
E253 Terminology Relating to Sensory Evaluation of Materials and Products
E456 Terminology Relating to Quality and Statistics
2.2 Other Documents:
ASTM STP 758 Guidelines for the Selection and Training of Sensory Panel Members
ASTM MNL13 Manual on Descriptive Analysis for Sensory Evaluation
2.3 ISO Standards:
ISO 11132:2012 Sensory Analysis – Methodology—Guidelines for Monitoring the Performance of a Quantitative Sensory Panel
For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard’s Document Summary page on the ASTM website.
Available from International Organization for Standardization (ISO), ISO Central Secretariat, BIBC II, Chemin de Blandonnet 8, CP 401, 1214 Vernier, Geneva, Switzerland, http://www.iso.org.
3. Terminology
3.1 Please refer to Terminologies E253 and E456, ASTM STP 758, ASTM MNL13 and ISO 11132:2012 for any terms related to assessor performance that are not listed below.
3.2 Definitions:
3.2.1 agreement—ability of an assessor to give similar scores (rate) or to order the intensity of stimuli similarly to the rest of the panel (rank) on a given attribute.
3.2.2 performance—ability of an assessor to make repeatable assessments that are in agreement with other assessors on the panel and discriminate perceptible differences between attributes when they are present.
3.2.3 scale usage—the extent to which the assessor(s) uses the scale with respect to the intensities of the attributes being measured.
4. Summary of Practice
4.1 The protocols described in this guide provide a procedure for quantitatively establishing the performance of individual assessors by discussing the minimum level of good performance, determining when a performance problem exists, and detailing specific procedures to address those problems.
5. Significance and Use
5.1 This guide is meant to be used with and applied to individual trained descriptive assessors.
5.2 The procedures recommended in this guide can be used by the panel leader to periodically appraise the performance of individual descriptive assessors.
5.3 Tracking assessor performance will provide information as to the quality of the data being generated. Performance information may be used to decide whether to use the data to interpret product profiles.
5.4 Monitoring assessor performance will enable the panel leader to identify retraining needs or to identify assessors who are not performing well enough to continue participating on a panel.
6. Performance
6.1 Introduction:
6.1.1 This section provides sensory approaches for the assessment of assessor performance. It is assumed that good sensory practices are being followed in order to allow for good assessor performance. Panel members must be motivated to carry out the job conscientiously, be in good health both physically and mentally, and must be willing and able to follow instructions. Standard procedures to reduce random variability and systematic bias, including robust experimental design to reduce order and carry over effects must be followed by the sensory professional.
6.1.2 Assessor performance is the measure of the ability of an assessor to make reliable attribute assessments across the products being evaluated. It has been recognized as an important component of descriptive analysis since the method was first developed (1, 2). It can be measured at a given time point or tracked over time. Performance is compromised if an assessor cannot repeat their own results (repeatability), discriminate among the products (discrimination) and assess stimuli similarly to other assessors on a given attribute (agreement). This guide will focus on these three key measures. These measures can allow the sensory professional to diagnose sources of poor performance such as the inability to use a scale to correctly indicate intensity (scale usage), and failure to use attributes similarly to other assessors (lexicon usage).
6.1.3 It is important to track panel performance as a whole, since panel data are used for decision making; however, this guide describes how to measure and interpret performance criteria for the individual assessors, since assessor performance influences panel results. An assessor who does not discriminate among the samples may impact the panel data, causing the mean values for a specific attribute to be close together and preventing overall discrimination between samples. In some cases, a poorly performing assessor can cause panel data to be inconsistent and non-repeatable. All of the assessors must be using the vocabulary in the same way and utilizing the scales in a consistent manner in order for the panel to succeed.
6.2 Individual Assessor Performance:
6.2.1 In the early stages of training, performance evaluation should be analyzed for each individual assessor prior to participation in a panel. During this phase, the panel leader typically monitors panel agreement on ranking or rating stimuli for intensity and on scale usage. Specific examples for each attribute need to be introduced, experienced and defined to ensure that all assessors understand the sensory qualities and range of intensity of the attribute. During training, attribute definitions and references should be reviewed and possibly revised, to ensure that attributes are understood and used consistently by individual assessors across all samples. Assessors should be selected for continued panel participation based upon performance.
6.2.2 Once assessors are trained, they should be monitored for the three key measures of performance (repeatability, discrimination, agreement). It is important to evaluate assessor performance periodically in order to detect any change in individual performance over time and to identify an assessor or assessors who are not performing well. Assessor performance on an individual study should be monitored if you are making high value or high risk business decisions with your panel data. The panel member should be engaged for a sufficient period to have established a history of performance which has been monitored.
6.2.3 In cases of poor performance, initially check with the individual assessor for any reasons they may not have been performing as usual. This would indicate the need to eliminate their responses on relevant data sets.
6.2.4 Rule out the possibility that assessor variation may be due to potential variability within the samples. Some types of products such as meat, seafood, or crop-based products can be quite variable and this variability must also be understood before concluding that there is an issue with assessor or panel performance.
6.2.5 Verification of test procedures, such as correct samples evaluated, correct instructions given, no data transcription errors, should also be done.
6.2.6 Fundamental issues such as insufficient training (issues with scale or lexicon usage), not understanding the procedure, boredom, over-use, or being unable to perceive certain attributes of the stimuli (physiological differences) can also contribute to poor assessor performance. It is important to identify early signs of performance inconsistency and correct the problem before the assessor has an impact on the overall panel’s results. Additional training should be given as a part of panel maintenance to address these issues. By correcting the problem of the inconsistent assessor, one can achieve the aim of having a consistent panel.
6.3 Key Measures of Performance—There are three main elements of poor performance—lack of repeatability, inability to discriminate, and lack of agreement—that should be examined regularly.
6.3.1 Repeatability—Lack of repeatability occurs when assessor(s) cannot replicate their ratings from one evaluation to another of the same sample. It should be noted that assessment of repeatability is only possible if assessors evaluate the same sample on at least two occasions, either during the same session or on different sessions.
6.3.1.1 Inadequate training, inconsistent scale usage, lexicon usage, and various psychological and physiological factors can impair repeatability. Assessor fatigue, improper spacing of samples, poor instructions, inconsistent reference samples, and sample variation can also contribute to the problem. Study redesign or retraining, or both, may be necessary to reduce the variation in repeatability.
6.3.2 Discrimination—An assessor’s inability to find significant differences among samples that are found to be different by the panel as a whole may occur for several reasons, such as general or specific ageusia and anosmia or differences in lexicon usage.
6.3.2.1 Using the same ratings across all samples for an attribute may indicate low sensory acuity resulting in the assessor’s inability to use the scale as they were trained. Poorly discriminating assessors may use similar ratings across all samples in a “safe scale range” to cover their inability to discern the attribute.
6.3.2.2 The non-discriminating attributes should be identified and training provided to the assessor on those attributes for which the samples are expected to differ. It may be necessary to change the reference standards to better represent the attribute if previously used references are not helpful for the panel.
6.3.3 Agreement—Agreement is obtained when assessors rate samples similarly in relation to each other. Similar ratings indicate that the assessors are scoring the samples consistently for each attribute.
6.3.3.1 The data set should be carefully examined to determine which individual assessors contribute to the dissimilarity of the attribute ratings. A lack of agreement may be due to a difference in the assessors’ discrimination, differences in scale or lexicon usage, or both.
6.3.3.2 Sometimes the main cause of a lack of agreement may not be due to a poor assessor, but rather, an assessor who may be more discriminating or more sensitive to an attribute. The identification of the origin of a disagreement is therefore essential for identification of the appropriate corrective action.
6.3.3.3 Assessors who vary on the perceived intensity in relation to other assessors, but still show the same sample ranking pattern as the other assessors (magnitude type interactions), usually differ in scale usage. A disagreement in assessor ratings may also indicate that assessors do not associate the same sensory perception with the attributes or vary on the perceived intensity due to individual differences in sensory acuity, thus causing cross-over interactions. A cross-over interaction occurs when an assessor’s mean score for a specific sample is reversed in response pattern from those of other panel members. All cross-over interactions should be carefully examined since they reduce chances of the panel finding significant sample differences.
6.3.3.4 Lexicon usage may also contribute to agreement issues when one assessor understands the attribute to mean something different from the other panel members.
6.4 Performance Diagnostics—Scale usage and lexicon usage are two diagnostics that can be examined to understand what is causing issues with agreement, discrimination, and repeatability.
6.4.1 Scale Usage—Inconsistent scale usage occurs when different assessors use different ranges of the scale and also different areas or locations of the scale while rating the same sample (note: this is an assessor effect in ANOVA). Inconsistent scale usage for an overall panel can be considered acceptable to a certain degree as long as assessors are consistent with their own behavior (across all samples) and are in agreement with the rest of the panel (for example, rank the samples in the same order). Poor assessor calibration, inadequate training, insensitivity or super-sensitivity to the problem attribute, or lack of reference standards is usually the source of the inconsistent scale usage.
6.4.2 Lexicon Usage—Correct lexicon usage is the ability of an assessor(s) to understand and use attributes in a similar manner. It is important that each attribute being assessed has a definition that is precise and clearly understood by the assessor. A discussion during panel training can uncover inconsistent lexicon usage. References should also be developed that supports the attribute definition and provides clarity to the assessor. An assessor who is having issues with lexicon usage should be given the opportunity to review the definitions and references during a training session.
6.5 Procedure to Evaluate Assessor Performance:
6.5.1 Follow the statistical procedures outlined in Section 7 of this guide to analyze the performance of the assessor for a single session over time. Evaluation of an assessor’s performance should involve, at a minimum, the examination of performance data for potential issues with repeatability, discrimination, and agreement.
6.5.2 Historical data enable the panel leader to review assessors’ performance over time. By tracking performance over time the panel leader can identify patterns of agreement or disagreement across assessors, and recognize improvement or deterioration of discrimination over time for individual assessors and for the panel as a whole.
6.5.3 Decide what corrective action (for example, further training, ad hoc deletion of data or assessor, or both) is required for the assessor based on their performance results. Refer to Section 9 Corrective Action for more information.
7. Procedure and Statistics for Evaluating Assessor Performance
7.1 This section outlines a procedure for evaluating assessor performance. It covers different statistical methods commonly used to calculate or visually inspect each performance measure including repeatability, discrimination, and agreement. Table 1 summarizes the statistical process for evaluating assessor
performance. This section does not give exact details on how to calculate each measure but rather describes the statistics and how to use them.
7.2 All the listed techniques are available through statistical and graphical computer software packages. Other methodologies can be used; refer to the Bibliography for suggestions. This guide assumes a sufficient level of statistical knowledge to run the suggested statistical procedures. If you are not familiar with how to run these statistical procedures, please consult a statistician or a relevant textbook. More advanced assessor performance statistics can be done with specialized assessor performance or statistical software packages (see, for example, Naes et al. (3) and www.panelcheck.com).

TABLE 1 Statistical Procedure for Evaluating Assessor Performance
Key Steps                                   Statistics
Step 1: Initial data check                  1. CHECK: Raw data
Initial data check and validation to
confirm that data for the correct
samples were entered, that the data
set is complete, and to identify and
correct any obvious data entry and
transcription errors.

Step 2: Assessor Agreement (initial         2. CALCULATE: Mean
check) and Repeatability                       CHECK: for Assessor agreement
Check assessor repeatability: how           3. CALCULATE: Standard Deviation
consistent are they?                           CHECK: for Assessor repeatability
                                            4. GRAPH: Individual assessors’
                                               data across the samples,
                                               one attribute per chart.

Step 3: ANOVA                               5. Run appropriate ‘assessor
                                               monitoring’ ANOVA model in
                                               statistical software.
                                               The model used depends on
                                               whether data are replicated.
                                               See 7.6 for more details.

Step 4: Assessor Agreement                  6. CHECK: ANOVA Assessor
Check agreement among assessors:               main effect (α = 5 %)
Does the assessor agree with other          7. CHECK: ANOVA Assessor*Sample
assessors on the panel for each                interaction effect (α = 1 %)
attribute?                                     to determine agreement
                                               between assessors for each
                                               sample.
                                            8. GRAPH: Generate an
                                               Assessor*Sample interaction
                                               graph for each attribute.

Step 5: Discrimination                      9. For each assessor: CHECK
Check assessor discrimination: Can             ANOVA Sample main effect
the assessors discriminate between             for each attribute (α = 5 %)
samples?

7.3 All statistical output used in Section 7 is based on an apple data set. The trained descriptive apple panel consisted of twelve assessors. This research project compared ten apple varieties (that is, samples) for ten flavor-related attributes (attributes are labelled with an ‘F’ prefix) and twelve texture-related attributes (attributes are labelled with a ‘T’ prefix). Assessors scored each of the ten samples on a 100 mm line scale. Each of the twelve assessors evaluated all ten samples in three separate sessions (that is, one replicate per session across three sessions). Refer to ASTM Research Report RR:E18-1001 for full data set details, including raw data and the full statistical output from the procedure described in this section.
7.4 Initial Data Check and Validation—The data should be checked to confirm that data for the correct samples were entered, that the data set is complete, that all obvious data entry and transcription errors were identified and corrected, and that the data will give a true representation of the samples. Knowledge of the samples is useful for checking that the sample means make sense and that the correct samples were presented to assessors (for example, since Johnson’s Red is a red apple, does it have a higher red apple flavor intensity? Granny Smith is typically a sour apple; does it have a high sour intensity?). Step 1 can be done with both non-replicated and replicated data. It is assumed that replicated data are available for subsequent analysis steps.
7.4.1 Raw Data—Scan the raw data for each assessor to check for any obvious inconsistencies in the data. In Table 2, for Assessor 1, were Braeburn and Top Red samples swapped in Replicate 2? F_red apple and F_green apple scores are reversed for these two samples. This step is not specifically related to assessor performance. Instead, it is done to ensure that all subsequent analyses are performed on a data set for which all identifiable errors have been removed.

TABLE 2 Raw Apple Data, Assessor 1—Green Apple, Red Apple, and Sweet Attributes
Assessor  Apple             Replicate  F_Green apple  F_Red apple
1         Braeburn          1          53             0
1         Braeburn          2          0              37
1         Braeburn          3          57             7
1         Fuji              1          19             45
1         Fuji              2          41             14
1         Fuji              3          10             59
1         Gibson’s Green    1          62             0
1         Gibson’s Green    2          50             0
1         Gibson’s Green    3          37             0
1         Golden Delicious  1          37             0
1         Golden Delicious  2          42             0
1         Golden Delicious  3          36             0
1         Granny Smith      1          71             0
1         Granny Smith      2          48             0
1         Granny Smith      3          48             0
1         Johnson’s Red     1          0              70
1         Johnson’s Red     2          0              80
1         Johnson’s Red     3          0              52
1         Pink Lady         1          47             23
1         Pink Lady         2          29             34
1         Pink Lady         3          19             58
1         Royal Gala        1          0              45
1         Royal Gala        2          0              41
1         Royal Gala        3          0              51
1         Sun Gold          1          55             0
1         Sun Gold          2          47             28
1         Sun Gold          3          65             11
1         Top Red           1          0              42
1         Top Red           2          59             0
1         Top Red           3          0              48

Supporting data have been filed at ASTM International Headquarters and may be obtained by requesting Research Report RR:E18-1001. Contact ASTM Customer Service at service@astm.org.
7.5 Step 2. Assessor Agreement (Initial Check) and Repeatability:
7.5.1 Mean—Calculate the mean for each assessor for each attribute (refer to Table 3 for examples of Sweet, Sour, and Bitter). Means can be calculated across samples or for each individual sample. It is recommended to use means in conjunction with their standard deviation to understand the basics of agreement. Calculating means across samples provides a gross measure of agreement among assessors. Assessors should have similar mean values. If not, scale usage and lexicon usage should be examined to uncover the source of the differences. The similarity of the assessor means across samples should be assessed by taking into account the significance of the Assessor Main Effect (see 7.7.2). When means are calculated for each assessor and sample separately, good agreement among the assessors is evidenced by similar rank orders of the samples among the assessors. Alternatively, a graph of the assessors’ means across the samples could be plotted. Good agreement among assessors is evidenced by similar patterns of sample-to-sample differences among all assessors. The similarity of the sample-to-sample differences among the assessors should be assessed by taking into account the significance of the Assessor*Sample Interaction Effect (see 7.7.3).
7.5.2 Standard Deviation—To assess the repeatability of the assessors, calculate the square root of the mean square for error from a two-way ANOVA (with sample and session as the effects) on each assessor’s data for each attribute individually (refer to Table 3 for examples of Sweet, Sour, and Bitter). These pooled standard deviations provide measures of the repeatability of the assessors. All assessors should have approximately equal standard deviations. The data from assessors with extremely large or extremely small standard deviations (compared to the rest of the assessors) should be examined to determine the cause of the excessively low or excessively high level of repeatability. Sensory panel data do not provide sufficient sample sizes for sensitive tests for differences among the assessors’ standard deviations, so determination of what represents an extremely large or extremely small standard deviation needs to be acquired through experience with the analysis of many sets of sensory panel data.
TABLE 3 Example of Calculated Means and Standard Deviation—Sweet, Acidic/Sour, and Bitter (Atypically Large Standard Deviations are Highlighted in Red)
7.5.2.1 In the apple example presented in this guide, a standard deviation greater than 10 % of the range of the scale (in this case, a standard deviation of 10 on the 100 point intensity scale) was chosen as the action standard to define a high lack of repeatability. Standard deviations for Assessors 3, 6, and 7 that are greater than 10 % are highlighted in red in Table 3. Compared to Assessors 6 and 7, Assessor 3 has a larger number of standard deviations greater than 10 %. The performance of this assessor should be examined in more detail for these attributes.
7.5.2.2 A box plot (refer to Fig. 1) is a graphical tool for illustrating both the median response and the variability of an assessor’s ratings. Agreement among assessors can be assessed by the similarity of their median intensity values (the horizontal line in the middle of the box in Fig. 1). The repeatability of the assessors can be assessed by comparing the height of the boxes (the height of the box is called the inter-quartile range, which is the difference between the 25th and 75th percentile of the assessor’s intensity ratings). A short box represents a highly repeatable assessor. A tall box represents an assessor with low repeatability.
7.5.2.3 Referring to Fig. 1, Assessor 3 has the highest mean Sweet intensity score (50.6) and the largest inter-quartile range (44.8), indicating a tendency to use higher intensity ratings and more variability in scores. This indicates a wide range of scale usage across all the samples when scoring Sweet.
7.5.2.4 By contrast, Assessors 2, 6, 7, 8, and 9 exhibit higher levels of repeatability as evidenced by their smaller inter-quartile ranges (that is, shorter boxes), which may indicate use of a smaller part of the Sweet scale when evaluating samples in this study.
7.5.2.5 Also in contrast to Assessor 3, the boxplot for Assessor 7 has an extremely low median value and a high level of repeatability (apart from the two extremely high Sweet taste ratings of approximately 50, which are plotted individually because of their high level of difference from the rest of Assessor 7’s ratings).
FIG. 1 Boxplot of Sweet (With Mean Scores)
7.6 Step 3. Run ANOVA in Statistical Software:
7.6.1 Use Analysis of Variance (ANOVA) to determine if there are significant differences among the samples in their average intensity ratings and to assess the repeatability, discrimination, and agreement of the assessors. Different ANOVA models must be used depending on the design of the sensory panel. If replicate evaluations are performed in the same session, for example, all samples are evaluated multiple times in the same session, then a two-way ANOVA model, with Assessor and Sample as the effects, should be used.
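The repeatability calculation described in 7.5.2 (square root of the error mean square from a two-way ANOVA with sample and session as effects, run on one assessor's data for one attribute) can be sketched as follows. This is a minimal illustration, not the guide's software procedure. It assumes exactly one rating per sample-session cell and an additive model without interaction; the function name, the example ratings, and the constant `ACTION_STANDARD` are hypothetical.

```python
import math


def repeatability_sd(scores):
    """Root mean square error from an additive two-way ANOVA.

    scores[i][j] is one assessor's rating of sample i in session j
    for a single attribute (one rating per sample-session cell).
    """
    n_samples = len(scores)
    n_sessions = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_samples * n_sessions)
    sample_means = [sum(row) / n_sessions for row in scores]
    session_means = [
        sum(scores[i][j] for i in range(n_samples)) / n_samples
        for j in range(n_sessions)
    ]
    # Residual = rating minus sample effect minus session effect.
    ss_error = sum(
        (scores[i][j] - sample_means[i] - session_means[j] + grand) ** 2
        for i in range(n_samples)
        for j in range(n_sessions)
    )
    df_error = (n_samples - 1) * (n_sessions - 1)
    return math.sqrt(ss_error / df_error)


# Hypothetical ratings: 4 samples (rows) x 3 sessions (columns).
example = [
    [52, 48, 55],
    [20, 25, 18],
    [70, 66, 73],
    [35, 40, 33],
]
ACTION_STANDARD = 10  # 10 % of the 100 point scale, as chosen in 7.5.2.1
sd = repeatability_sd(example)
print(f"repeatability SD = {sd:.1f}; above action standard: {sd > ACTION_STANDARD}")
```

Running the function once per assessor and attribute gives the kind of values tabulated in Table 3. Which values count as extreme remains a judgment call, as 7.5.2 notes.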
Designation: E3000 − 18
Standard Guide for
Measuring and Tracking Performance of Assessors on a
Descriptive Sensory Panel
This standard is issued under the fixed designation E3000; the number immediately following the designation indicates the year of
original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A
superscript epsilon (´) indicates an editorial change since the last revision or reapproval.
1. Scope
1.1 This guide provides guidelines for measuring and tracking the performance of individual assessors on a descriptive sensory
panel.
1.2 This guide provides guidelines to assist sensory professionals in measuring performance for given assessors. Measuring
performance will form the basis for (1) determining the reliability of the results, and (2) establishing remedial actions for an
individual assessor.
1.3 This guide examines various aspects of trained assessor performance, such as repeatability, discrimination, and agreement,
and demonstrates some ways to measure them. The procedures will help the sensory professional determine areas of good
performance as well as those that require improvement.
1.4 Individual assessor performance is tracked using established statistical procedures. These procedures depend on whether
replicates are collected and if they are collected over multiple sessions or within a single session.
1.5 This guide provides suggested procedures, including statistical procedures that can be done using standard statistical
software, for evaluating performance and is not meant to exclude other methods that may be effectively used for a similar purpose.
1.6 Methods for training and screening assessors are not within the scope of this guide. This guide does not address how to
communicate performance feedback information to individual assessors. The monitoring of panel reproducibility, a measure of the
panel’s ability to reproduce the results of other panels, is also not within the scope of this guide.
1.7 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility
of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of
regulatory limitations prior to use.
1.8 This international standard was developed in accordance with internationally recognized principles on standardization
established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued
by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Referenced Documents
2.1 ASTM Standards:
E253 Terminology Relating to Sensory Evaluation of Materials and Products
E456 Terminology Relating to Quality and Statistics
2.2 Other Documents:
ASTM STP 758 Guidelines for the Selection and Training of Sensory Panel Members
ASTM MNL13 Manual on Descriptive Analysis for Sensory Evaluation
2.3 ISO Standards:
ISO 11132:2012 Sensory Analysis – Methodology—Guidelines for Monitoring the Performance of a Quantitative Sensory Panel
This guide is under the jurisdiction of ASTM Committee E18 on Sensory Evaluation and is the direct responsibility of Subcommittee E18.03 on Sensory Theory and
Statistics.
Current edition approved April 1, 2018. Published April 2018. Originally approved in 2017. Last previous edition approved in 2017 as
E3000 – 17. DOI: 10.1520/E3000-18.
For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards
volume information, refer to the standard’s Document Summary page on the ASTM website.
Available from International Organization for Standardization (ISO), ISO Central Secretariat, BIBC II, Chemin de Blandonnet 8, CP 401, 1214 Vernier, Geneva,
Switzerland, http://www.iso.org.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
3. Terminology
3.1 Please refer to Terminologies E253 and E456, ASTM STP 758, ASTM MNL13, and ISO 11132:2012 for any terms related to
assessor performance that are not listed below.
3.2 Definitions:
3.2.1 agreement—ability of an assessor to give similar scores (rate) or to order the intensity of stimuli similarly to the rest of
the panel (rank) on a given attribute.
3.2.2 performance—ability of an assessor to make repeatable assessments that are in agreement with other assessors on the
panel and discriminate perceptible differences between attributes when they are present.
3.2.3 scale usage—the extent to which the assessor(s) uses the scale with respect to the intensities of the attributes being
measured.
4. Summary of Practice
4.1 The protocols described in this guide provide a procedure for quantitatively establishing the performance of individual
assessors by discussing the minimum level of good performance, determining when a performance problem exists, and detailing
specific procedures to address those problems.
5. Significance and Use
5.1 This guide is meant to be used with and applied to individual trained descriptive assessors.
5.2 The procedures recommended in this guide can be used by the panel leader to periodically appraise the performance of
individual descriptive assessors.
5.3 Tracking assessor performance will provide information as to the quality of the data being generated. Performance
information may be used to decide whether to use the data to interpret product profiles.
5.4 Monitoring assessor performance will enable the panel leader to identify retraining needs or to identify assessors who are
not performing well enough to continue participating on a panel.
6. Performance
6.1 Introduction:
6.1.1 This section provides sensory approaches for the assessment of assessor performance. It is assumed that good sensory
practices are being followed in order to allow for good assessor performance. Panel members must be motivated to carry out the
job conscientiously, be in good health both physically and mentally, and must be willing and able to follow instructions. Standard
procedures to reduce random variability and systematic bias, including robust experimental design to reduce order and carry-over
effects, must be followed by the sensory professional.
6.1.2 Assessor performance is the measure of the ability of an assessor to make reliable attribute assessments across the products
being evaluated. It has been recognized as an important component of descriptive analysis since the method was first developed
(1, 2). It can be measured at a given time point or tracked over time. Performance is compromised if an assessor cannot repeat
their own results (repeatability), discriminate among the products (discrimination), or assess stimuli similarly to other assessors
on a given attribute (agreement). This guide will focus on these three key measures. These measures can allow the sensory
professional to diagnose sources of poor performance such as the inability to use a scale to correctly indicate intensity (scale
usage), and failure to use attributes similarly to other assessors (lexicon usage).
6.1.3 It is important to track panel performance as a whole, since panel data are used for decision making; however, this guide
describes how to measure and interpret performance criteria for the individual assessors, since assessor performance influences
panel results. An assessor who does not discriminate among the samples may impact the panel data, causing the mean values for
a specific attribute to be close together and preventing overall discrimination between samples. In some cases, a poorly performing
assessor can cause panel data to be inconsistent and non-repeatable. All of the assessors must be using the vocabulary in the same
way and utilizing the scales in a consistent manner in order for the panel to succeed.
6.2 Individual Assessor Performance:
6.2.1 In the early stages of training, performance should be evaluated for each individual assessor prior to
participation in a panel. During this phase, the panel leader typically monitors panel agreement on ranking or rating stimuli for
intensity and on scale usage. Specific examples for each attribute need to be introduced, experienced and defined to ensure that
all assessors understand the sensory qualities and range of intensity of the attribute. During training, attribute definitions and
references should be reviewed and possibly revised, to ensure that attributes are understood and used consistently by individual
assessors across all samples. Assessors should be selected for continued panel participation based upon performance.
6.2.2 Once assessors are trained, they should be monitored for the three key measures of performance (repeatability,
discrimination, agreement). It is important to evaluate assessor performance periodically in order to detect any change in individual
performance over time and to identify an assessor or assessors who are not performing well. Assessor performance on an individual
study should be monitored if you are making high value or high risk business decisions with your panel data. The panel member
should be engaged for a sufficient period to have established a history of performance which has been monitored.
6.2.3 In cases of poor performance, first check with the individual assessor for any reasons they may not have been
performing as usual. A confirmed reason may indicate the need to eliminate their responses from the relevant data sets.
6.2.4 Rule out the possibility that assessor variation may be due to potential variability within the samples. Some types of
products such as meat, seafood, or crop-based products can be quite variable and this variability must also be understood before
concluding that there is an issue with assessor or panel performance.
6.2.5 Verification of test procedures, such as correct samples evaluated, correct instructions given, no data transcription errors,
should also be done.
6.2.6 Fundamental issues such as insufficient training (issues with scale or lexicon usage), not understanding the procedure,
boredom, over-use, or being unable to perceive certain attributes of the stimuli (physiological differences) can also contribute to
poor assessor performance. It is important to identify early signs of performance inconsistency and correct the problem before the
assessor has an impact on the overall panel’s results. Additional training should be given as a part of panel maintenance to address
these issues. By correcting the problem of the inconsistent assessor, one can achieve the aim of having a consistent panel.
6.3 Key Measures of Performance—There are three main elements of poor performance—lack of repeatability, inability to
discriminate, and lack of agreement—that should be examined regularly.
6.3.1 Repeatability—Lack of repeatability occurs when assessor(s) cannot replicate their ratings from one evaluation to another
of the same sample. It should be noted that assessment of repeatability is only possible if assessors evaluate the same sample on
at least two occasions, either during the same session or on different sessions.
6.3.1.1 Inadequate training, inconsistent scale usage, lexicon usage, and various psychological and physiological factors can
impair repeatability. Assessor fatigue, improper spacing of samples, poor instructions, inconsistent reference samples, and sample
variation can also contribute to the problem. Study redesign or retraining, or both, may be necessary to reduce the variation in
repeatability.
6.3.2 Discrimination—An assessor’s inability to find significant differences among samples that are found to be different by the
panel as a whole may occur for several reasons, such as general or specific ageusia and anosmia or differences in lexicon usage.
6.3.2.1 Using the same ratings across all samples for an attribute may indicate low sensory acuity resulting in the assessor’s
inability to use the scale as they were trained. Poorly discriminating assessors may use similar ratings across all samples in a “safe
scale range” to cover their inability to discern the attribute.
6.3.2.2 The non-discriminating attributes should be identified and training provided to the assessor on those attributes for which
the samples are expected to differ. It may be necessary to change the reference standards to better represent the attribute if
previously used references are not helpful for the panel.
6.3.3 Agreement—Agreement is obtained when assessors rate samples similarly in relation to each other. Similar ratings indicate
that the assessors are scoring the samples consistently for each attribute.
6.3.3.1 The data set should be carefully examined to determine which individual assessors contribute to the dissimilarity of the
attribute ratings. A lack of agreement may be due to a difference in the assessors’ discrimination, differences in scale or lexicon
usage, or both.
6.3.3.2 Sometimes the main cause of a lack of agreement may not be due to a poor assessor, but rather, an assessor who may
be more discriminating or more sensitive to an attribute. The identification of the origin of a disagreement is therefore essential
for identification of the appropriate corrective action.
6.3.3.3 Assessors who vary on the perceived intensity in relation to other assessors, but still show the same sample ranking
pattern as the other assessors (magnitude type interactions), usually differ in scale usage. A disagreement in assessor ratings may
also indicate that assessors do not associate the same sensory perception with the attributes or vary on the perceived intensity due
to individual differences in sensory acuity, thus causing cross-over interactions. A cross-over interaction occurs when an assessor’s
mean score for a specific sample is reversed in response pattern from those of other panel members. All cross-over interactions
should be carefully examined since they reduce chances of the panel finding significant sample differences.
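The distinction drawn in 6.3.3.3 between magnitude type and cross-over interactions can be illustrated with a short sketch that lists the sample pairs an assessor orders opposite to the panel. This is only one possible way to screen for cross-overs and is not a procedure prescribed by this guide. The function name and all scores are hypothetical.

```python
from itertools import combinations


def crossover_pairs(assessor_means, panel_means):
    """Return sample pairs that the assessor orders opposite to the panel."""
    reversed_pairs = []
    for a, b in combinations(sorted(panel_means), 2):
        assessor_diff = assessor_means[a] - assessor_means[b]
        panel_diff = panel_means[a] - panel_means[b]
        if assessor_diff * panel_diff < 0:  # opposite signs: order is reversed
            reversed_pairs.append((a, b))
    return reversed_pairs


# Hypothetical mean scores per sample for one attribute.
panel = {"A": 20, "B": 40, "C": 60}
compressed = {"A": 30, "B": 40, "C": 50}  # same ranking, narrower scale use
crossing = {"A": 45, "B": 40, "C": 60}    # A and B reversed versus the panel

print(crossover_pairs(compressed, panel))  # []
print(crossover_pairs(crossing, panel))    # [('A', 'B')]
```

An empty list together with different score levels points toward a scale usage difference. A non-empty list marks the pairs to examine for lexicon or acuity issues.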
6.3.3.4 Lexicon usage may also contribute to agreement issues when one assessor understands the attribute to mean something
different from the other panel members.
6.4 Performance Diagnostics—Scale usage and lexicon usage are two diagnostics that can be examined to understand what is
causing issues with agreement, discrimination, and repeatability.
6.4.1 Scale Usage—Inconsistent scale usage occurs when different assessors use different ranges of the scale and also different
areas or locations of the scale while rating the same sample (note: this is an assessor effect in ANOVA). Inconsistent scale usage
for an overall panel can be considered acceptable to a certain degree as long as assessors are consistent with their own behavior
(across all samples) and are in agreement with the rest of the panel (for example, rank the samples in the same order). Poor assessor
calibration, inadequate training, insensitivity or super-sensitivity to the problem attribute, or lack of reference standards is usually
the source of the inconsistent scale usage.
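As a rough illustration of the range and location aspects of scale usage described in 6.4.1, the sketch below summarizes the part of the scale each assessor uses for one attribute. The summary statistics chosen here (minimum, maximum, range, mean) are an assumption for illustration, and the assessor names and ratings are hypothetical.

```python
def scale_usage(ratings):
    """Summarize which part of the scale an assessor uses for one attribute."""
    low, high = min(ratings), max(ratings)
    return {
        "min": low,
        "max": high,
        "range": high - low,                  # how much of the scale is used
        "mean": sum(ratings) / len(ratings),  # where on the scale ratings sit
    }


# Hypothetical ratings of the same four samples by two assessors.
ratings_by_assessor = {
    "Assessor X": [10, 35, 60, 85],
    "Assessor Y": [40, 45, 50, 55],
}
for name, ratings in ratings_by_assessor.items():
    print(name, scale_usage(ratings))
```

In this made-up case both assessors rank the samples identically and share the same mean, but Assessor Y uses a much narrower range.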
6.4.2 Lexicon Usage—Correct lexicon usage is the ability of an assessor(s) to understand and use attributes in a similar manner.
It is important that each attribute being assessed has a definition that is precise and clearly understood by the assessor. A discussion
during panel training can uncover inconsistent lexicon usage. References should also be developed that support the attribute
definition and provide clarity to the assessor. An assessor who is having issues with lexicon usage should be given the opportunity
to review the definitions and references during a training session.
6.5 Procedure to Evaluate Assessor Performance:
6.5.1 Follow the statistical procedures outlined in Section 7 of this guide to analyze the performance of the assessor for a single
session or over time. Evaluation of an assessor’s performance should involve, at a minimum, the examination of performance data
for potential issues with repeatability, discrimination, and agreement.
6.5.2 Historical data enable the panel leader to review assessors’ performance over time. By tracking performance over time the
panel leader can identify patterns of agreement or disagreement across assessors, and recognize improvement or deterioration of
discrimination over time for individual assessors and for the panel as a whole.
6.5.3 Decide what corrective action (for example, further training, ad hoc deletion of data or assessor, or both) is required for
the assessor based on their performance results. Refer to Section 9 Corrective Action for more information.
7. Procedure and Statistics for Evaluating Assessor Performance
7.1 This section outlines a procedure for evaluating assessor performance. It covers different statistical methods commonly used
to calculate or visually inspect each performance measure including repeatability, discrimination, and agreement. Table 1
summarizes the statistical process for evaluating assessor performance. This section does not give exact details on how to calculate
each measure but rather describes the statistics and how to use them.
7.2 All the listed techniques are available through statistical and graphical computer software packages. Other methodologies
can be used; refer to the Bibliography for suggestions. This guide assumes a sufficient level of statistical knowledge to run the
suggested statistical procedures. If you are not familiar with how to run these statistical procedures, please consult a statistician
or a relevant textbook. More advanced assessor performance statistics can be done with specialized assessor performance or
statistical software packages (see, for example, Naes et al. (3) and www.panelcheck.com).
TABLE 1 Statistical Procedure for Evaluating Assessor Performance
Key Steps                                   Statistics
Step 1: Initial data check                  1. CHECK: Raw data
Initial data check and validation to
confirm that data for the correct
samples were entered, that the data
set is complete, and to identify and
correct any obvious data entry and
transcription errors.

Step 2: Assessor Agreement (initial         2. CALCULATE: Mean
check) and Repeatability                       CHECK: for Assessor agreement
Check assessor repeatability: how           3. CALCULATE: Standard Deviation
consistent are they?                           CHECK: for Assessor repeatability
                                            4. GRAPH: Individual assessors’
                                               data across the samples,
                                               one attribute per chart.

Step 3: ANOVA                               5. Run appropriate ‘assessor
                                               monitoring’ ANOVA model in
                                               statistical software.
                                               The model used depends on
                                               whether data are replicated.
                                               See 7.6 for more details.

Step 4: Assessor Agreement                  6. CHECK: ANOVA Assessor
Check agreement among assessors:               main effect (α = 5 %)
Does the assessor agree with other          7. CHECK: ANOVA Assessor*Sample
assessors on the panel for each                interaction effect (α = 1 %)
attribute?                                     to determine agreement
                                               between assessors for each
                                               sample.
                                            8. GRAPH: Generate an
                                               Assessor*Sample interaction
                                               graph for each attribute.

Step 5: Discrimination                      9. For each assessor: CHECK
Check assessor discrimination: Can             ANOVA Sample main effect
the assessors discriminate between             for each attribute (α = 5 %)
samples?
7.3 All statistical output used in Section 7 is based on an apple data set. The trained descriptive apple panel consisted of twelve
assessors. This research project compared ten apple varieties (that is, samples) for ten flavor-related attributes (attributes are
labelled with an ‘F’ prefix) and twelve texture-related attributes (attributes are labelled with a ‘T’ prefix). Assessors scored each
of the ten samples on a 100 mm line scale. Each of the twelve assessors evaluated all ten samples in three separate sessions (that
is, one replicate per session across three sessions). Refer to ASTM Research Report RR:E18-1001 for full data set details,
including raw data and the full statistical output from the procedure described in this section.
7.4 Initial Data Check and Validation—The data should be checked to confirm that data for the correct samples were entered,
that the data set is complete, that all obvious data entry and transcription errors were identified and corrected, and that the data
will give a true representation of the samples. Knowledge of the samples is useful for checking that the sample means make sense
and that the correct samples were presented to assessors (for example, since Johnson’s Red is a red apple, does it have a higher
red apple flavor intensity? Granny Smith is typically a sour apple; does it have a high sour intensity?). Step 1 can be done with
both non-replicated and replicated data. It is
assumed that replicated data are available for subsequent analysis steps.
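The completeness part of the initial data check in 7.4 can be sketched as a comparison of the recorded rows against the full assessor by sample by replicate design. This is a minimal illustration under assumed conventions: the record layout (dictionaries with `assessor`, `sample`, and `replicate` keys) and the function name are hypothetical, and the small data set below is made up.

```python
from collections import Counter
from itertools import product


def check_design(rows, assessors, samples, replicates):
    """Return (missing, duplicated) (assessor, sample, replicate) combinations."""
    seen = Counter((r["assessor"], r["sample"], r["replicate"]) for r in rows)
    expected = set(product(assessors, samples, replicates))
    missing = sorted(expected - set(seen))
    duplicated = sorted(key for key, count in seen.items() if count > 1)
    return missing, duplicated


# Hypothetical records: one evaluation is missing and one was entered twice.
rows = [
    {"assessor": 1, "sample": "Fuji", "replicate": 1},
    {"assessor": 1, "sample": "Fuji", "replicate": 2},
    {"assessor": 1, "sample": "Pink Lady", "replicate": 1},
    {"assessor": 1, "sample": "Pink Lady", "replicate": 1},
]
missing, duplicated = check_design(rows, [1], ["Fuji", "Pink Lady"], [1, 2])
print(missing)     # [(1, 'Pink Lady', 2)]
print(duplicated)  # [(1, 'Pink Lady', 1)]
```

For the apple data set in 7.3, the expected design would be twelve assessors, ten samples, and three replicates for each attribute.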
7.4.1 Raw Data—Scan the raw data for each assessor to check for any obvious inconsistencies in the data. In Table 2, for
Assessor 1, were Braeburn and Top Red samples swapped in Replicate 2? F_red apple and F_green apple scores are reversed for
these two samples. This step is not specifically related to assessor performance. Instead, it is done to ensure that all subsequent
analyses are performed on a data set for which all identifiable errors have been removed.
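The scan described in 7.4.1 could be partly automated. The following Python sketch is an illustration only and is not part of this guide. The function name flag_suspect_replicates, the use of the per-sample median as the reference value, and the 25-point threshold are assumptions made for this example. The scores are the Braeburn, Fuji, and Top Red rows of Table 2.

```python
from statistics import median

# Assessor 1 scores from Table 2: sample -> one (F_green apple, F_red apple)
# tuple per replicate, in the order Replicate 1, 2, 3.
scores = {
    "Braeburn": [(53, 0), (0, 37), (57, 7)],
    "Fuji": [(19, 45), (41, 14), (10, 59)],
    "Top Red": [(0, 42), (59, 0), (0, 48)],
}

def flag_suspect_replicates(scores, threshold=25):
    """Return (sample, replicate) pairs whose score differs from that
    sample's median by more than `threshold` on every attribute.

    The threshold is a hypothetical screening value, not an ASTM criterion.
    """
    suspects = []
    for sample, reps in scores.items():
        n_attr = len(reps[0])
        # Median over the replicates, computed separately for each attribute.
        medians = [median(rep[k] for rep in reps) for k in range(n_attr)]
        for i, rep in enumerate(reps, start=1):
            if all(abs(rep[k] - medians[k]) > threshold for k in range(n_attr)):
                suspects.append((sample, i))
    return suspects

suspects = flag_suspect_replicates(scores)
print(suspects)
```

A flagged row is only a candidate for checking against the serving records. It does not prove that samples were swapped.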
7.5 Step 2. Assessor Agreement (Initial Check) and Repeatability:
7.5.1 Mean—Calculate the mean for each assessor for each attribute (refer to Table 3 for examples of Sweet, Sour, and Bitter).
Means can be calculated across samples or for each individual sample. It is recommended to use means in conjunction with their
standard deviation to understand the basics of agreement. Calculating means across samples provides a gross measure of agreement
among assessors. Assessors should have similar mean values. If not, scale usage and lexicon usage should be examined to uncover
the source of the differences. The similarity of the assessor means across samples should be assessed by taking into account the
significance of the Assessor Main Effect (see 7.7.2). When means are calculated for each assessor and sample separately, good
agreement among the assessors is evidenced by similar rank orders of the samples among the assessors. Alternatively, a graph of
the assessors’ means across the samples could be plotted. Good agreement among assessors is evidenced by similar patterns of
sample-to-sample differences among all assessors. The similarity of the sample-to-sample differences among the assessors should
be assessed by taking into account the significance of the Assessor*Sample Interaction Effect (see 7.7.3).
TABLE 2 Raw Apple Data, Assessor 1—Green Apple, Red Apple,
and Sweet Attributes
Assessor  Apple             Replicate  F_Green apple  F_Red apple
1         Braeburn          1          53             0
1         Braeburn          2          0              37
1         Braeburn          3          57             7
1         Fuji              1          19             45
1         Fuji              2          41             14
1         Fuji              3          10             59
1         Gibson’s Green    1          62             0
1         Gibson’s Green    2          50             0
1         Gibson’s Green    3          37             0
1         Golden Delicious  1          37             0
1         Golden Delicious  2          42             0
1         Golden Delicious  3          36             0
1         Granny Smith      1          71             0
1         Granny Smith      2          48             0
1         Granny Smith      3          48             0
1         Johnson’s Red     1          0              70
1         Johnson’s Red     2          0              80
1         Johnson’s Red     3          0              52
1         Pink Lady         1          47             23
1         Pink Lady         2          29             34
1         Pink Lady         3          19             58
1         Royal Gala        1          0              45
1         Royal Gala        2          0              41
1         Royal Gala        3          0              51
1         Sun Gold          1          55             0
1         Sun Gold          2          47             28
1         Sun Gold          3          65             11
1         Top Red           1          0              42
1         Top Red           2          59             0
1         Top Red           3          0              48
Supporting data have been filed at ASTM International Headquarters and may be obtained by requesting Research Report RR:E18-1001. Contact ASTM Customer
Service at service@astm.org.
TABLE 3 Example of Calculated Means and Standard Deviation—Sweet, Acidic/Sour, and Bitter (Individual Assessor and Sample
Scores) (Atypically Large Standard Deviations are Highlighted in Red)
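The two kinds of means described in 7.5.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the scores are invented for the example (they are not from the apple data set), and the function names assessor_means and sample_rank_orders are hypothetical.

```python
from statistics import mean

# Hypothetical Sweet scores (0-100 line scale): assessor -> sample -> replicates.
# These values are invented for illustration only.
ratings = {
    "A1": {"Fuji": [40, 44, 42], "Granny Smith": [20, 18, 22], "Royal Gala": [55, 60, 50]},
    "A2": {"Fuji": [45, 41, 46], "Granny Smith": [25, 27, 23], "Royal Gala": [58, 62, 60]},
    "A3": {"Fuji": [30, 28, 32], "Granny Smith": [50, 55, 45], "Royal Gala": [35, 33, 37]},
}

def assessor_means(ratings):
    """Mean across all samples and replicates for each assessor."""
    return {
        assessor: mean(x for reps in by_sample.values() for x in reps)
        for assessor, by_sample in ratings.items()
    }

def sample_rank_orders(ratings):
    """Samples ordered from lowest to highest mean score, for each assessor."""
    orders = {}
    for assessor, by_sample in ratings.items():
        cell_means = {sample: mean(reps) for sample, reps in by_sample.items()}
        orders[assessor] = sorted(cell_means, key=cell_means.get)
    return orders

overall = assessor_means(ratings)
orders = sample_rank_orders(ratings)
for assessor in ratings:
    print(assessor, round(overall[assessor], 1), orders[assessor])
```

In this invented example the three overall means are similar, yet assessor A3 ranks the samples differently from A1 and A2. This illustrates why the mean across samples is only a gross measure of agreement and why the rank orders (or the Assessor*Sample interaction) should also be examined.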
7.5.2 Standard Deviation—To assess the repeatability of the assessors, calculate the square root of the mean square for error
from a two-way ANOVA (with sample and session as the effects) on each assessor’s data for each attribute individually (refer to
Table 3 for examples of Sweet, Sour, and Bitter). These pooled standard deviations provide measures of the repeatability of the
assessors. All assessors should have approximately equal standard deviations. The data from assessors with extremely large or
extremely small standard deviations (compared to the rest of the assessors) should be examined to determine the cause of the
excessively low or excessively high level of repeatability. Sensory panel data do not provide sufficient sample sizes for sensitive
tests for differences among the assessors’ standard deviations, so judgment of what represents an extremely large or extremely
small standard deviation needs to be acquired through experience with the analysis of many sets of sensory panel data.
7.5.2.1 In the apple example presented in this guide, a standard deviation greater than 10 % of the range of the scale (in this
case, a standard deviation of 10 on the 100 point intensity scale) was chosen as the action standard to define a high lack of
repeatability. Standard deviations for Assessors 3, 6, and 7 that are greater than 10 % are highlighted in red in Table 3.
Compared to Assessors 6 and 7, Assessor 3 has a larger number of standard deviations greater than 10 %. The performance of this
assessor should be examined in more detail for these attributes.
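The repeatability calculation in 7.5.2 and the action standard in 7.5.2.1 can be sketched in Python as follows. This is an illustration only, assuming one score per sample in each session, so that the error term of the two-way ANOVA is the sample-by-session residual. The function name repeatability_sd is hypothetical. The scores are three of the Assessor 1 F_Green apple samples from Table 2, used here only as a small worked subset.

```python
from math import sqrt
from statistics import mean

def repeatability_sd(table):
    """Square root of the mean square for error from a two-way ANOVA
    (sample and session as effects) with one score per cell.

    `table` maps sample name -> list of scores, one per session.
    """
    rows = list(table.values())
    n_samples, n_sessions = len(rows), len(rows[0])
    grand = mean(x for row in rows for x in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(row[j] for row in rows) for j in range(n_sessions)]
    # Residual = score - sample mean - session mean + grand mean.
    ss_error = sum(
        (rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_samples)
        for j in range(n_sessions)
    )
    df_error = (n_samples - 1) * (n_sessions - 1)
    return sqrt(ss_error / df_error)

# Assessor 1, F_Green apple, Replicates 1-3 (subset of Table 2).
f_green = {
    "Gibson's Green": [62, 50, 37],
    "Golden Delicious": [37, 42, 36],
    "Granny Smith": [71, 48, 48],
}

ACTION_STANDARD = 10  # 10 % of the 100 mm scale, as chosen in 7.5.2.1

sd = repeatability_sd(f_green)
status = "exceeds" if sd > ACTION_STANDARD else "is within"
print(f"SD = {sd:.2f}, which {status} the action standard")
```

Because only three of the ten samples are included, the value printed here is not the Table 3 value for this assessor and attribute.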
7.5.2.2 A box plot (refer to Fig. 1) is a graphical tool for illustrating both the median response and the variability of an assessor’s
ratings. Agreement among assessors can be assessed by the similarity of their median intensity values (the horizontal line in the
middle of the box in Fig. 1). The repeatability of the assessors can be assessed by comparing the height of the boxes (the height
of the box is called the interquartile range, which is the difference between the 25th and 75th percentiles of the assessor’s intensity
ratings). A short box represents a highly repeatable assessor. A tall box represents an assessor with low repeatability.
7.5.2.3 Referring to Fig. 1, Assessor 3 has the highest mean Sweet intensity score (50.6) and the largest interquartile range
(44.8), indicating a tendency to use higher intensity ratings and more variability in scores. This indicates a wide range of scale usage
across all the samples when scoring Sweet.
7.5.2.4 By contrast, Assessors 2, 6, 7, 8, and 9 exhibit higher levels of repeatability as evidenced by their smaller interquartile
ranges (that is, shorter boxes), which may indicate use of a smaller part of the Sweet scale when evaluating samples in this study.
7.5.2.5 Also in contrast to Assessor 3, the box plot for Assessor 7 has an extremely low median value and a high level of
repeatability (apart from two extremely high Sweet ratings of approximately 50, which are plotted individually because
they differ so much from the rest of Assessor 7’s ratings).
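The two quantities read from a box plot, the median and the interquartile range, can be computed directly. The Python sketch below is an illustration only: the ratings are invented, the function name box_summary is hypothetical, and the "inclusive" quartile method is one of several conventions, so other statistical software may report slightly different quartiles.

```python
from statistics import median, quantiles

def box_summary(scores):
    """Return (median, interquartile range) for one assessor's ratings.

    The interquartile range is the 75th percentile minus the 25th percentile.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1

# Hypothetical Sweet ratings for two assessors (invented for illustration).
sweet = {
    "wide scale use": [10, 20, 30, 40, 50, 60, 70, 80, 90],
    "narrow scale use": [28, 29, 30, 30, 31, 31, 32, 32, 33],
}

for label, scores in sweet.items():
    med, iqr = box_summary(scores)
    print(f"{label}: median = {med}, IQR = {iqr}")
```

A small interquartile range computed across all samples may reflect either high repeatability or use of only a small part of the scale, so it should be read together with the per-sample standard deviations in 7.5.2.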
7.6 Step 3. Run ANOVA in Statistical Software:
7.6.1 Use Analysis of Variance (ANOVA) to determine if there are significant differences among the samples in their average
intensity ratings and to assess the repeatability, discrimination, and agreement of the assessors. Different ANOVA models must be
used depending on the design of the sensory panel. If replicate evaluations are performed in the same session, for example, all
samples are evaluated multiple times in the same session, then a two-way ANOVA model, with Assessor and Sample as the effects,
should be used. Alternatively, if all samples are evaluated once in a session across multiple sessions, then a three-way ANOVA
model, with Session, Assessor, and Sample as the factors, should be used.
7.6.2 The correct error term in ANOVA depends on the assumptions being made about the effects in the model. Specifically,
with assessor monitoring, assessors are treated as fixed effects. In some of the assessor monitoring tests, replicates (for example,
sessions) also are treated as fixed effects. When we treat session, assessor, and sample as fixed effects, the residual is the correct
error term to use to test all effects in the model.
7.6.3 In “production mode,” when using the panel to test for differences among samples, both assessors and replicates are
treated as random effects, which leads to a different error structure in the data and to different error terms. It is important to
distinguish the differences in the assumptions being made when doing assessor monitoring versus product testing.
FIG. 1 Boxplot of Sweet (With Mean Scores)
7.6.4 For the purposes of this guide, all statistical examples will assume that the analyses are being performed to monitor the
performance of assessors and that replicate evaluations of the samples were obtained by having all the samples evaluated once in
each of several sessions, so that in the example, replicate = session and Model 1a (the assessor monitoring model in 7.6.4.1), below, applies.
7.6.4.1 Multiple Evaluations Conducted Across Multiple Sessions:
(1) Assessor Monitoring (Fixed Effects = Session, Assessor, and Sample): Y = Mean + Session + Assessor + Sample +
Session*Assessor + Session*Sample + Assessor*Sample + Error
(2) Production Mode (Fixed Effect = Sample, Random Effects = Session and Assessor): Y = Mean + Session + Assessor +
Sample + Error
7.6.4.2 Multiple Evaluations Conducted Within a Single Session:
(1) Assessor Monitoring (Fixed Effects = Assessor and Sample): Y = Mean + Assessor + Sample + Assessor*Sample + Error
(2) Production Mode (Fixed Effect = Sample, Random Effect = Assessor): Y = Mean + Assessor + Sample + Error
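The assessor monitoring model for evaluations across multiple sessions (7.6.4.1 (1)) can be sketched in Python as follows. This is a minimal illustration, not a replacement for statistical software. It assumes a balanced design with exactly one score per session-by-assessor-by-sample cell and treats all effects as fixed, so every F ratio uses the residual mean square as described in 7.6.2. The function name three_way_anova, the nested-list data layout, and the scores are assumptions made for this example. P-values are not computed.

```python
from itertools import product
from statistics import mean

def three_way_anova(y):
    """ANOVA table for Y = Mean + Session + Assessor + Sample
    + all two-way interactions + Error (balanced, one score per cell).

    `y[i][j][k]` is the score for session i, assessor j, sample k.
    Returns effect -> [SS, DF, MS, F]; F is None for the Error row.
    """
    n = (len(y), len(y[0]), len(y[0][0]))
    cells = list(product(*(range(m) for m in n)))
    names = ("Session", "Assessor", "Sample")

    def marginal(axes):
        """Mean score for each combination of levels of the factors in `axes`."""
        groups = {}
        for c in cells:
            key = tuple(c[a] for a in axes)
            groups.setdefault(key, []).append(y[c[0]][c[1]][c[2]])
        return {key: mean(vals) for key, vals in groups.items()}

    grand = marginal(())[()]
    main = [marginal((a,)) for a in range(3)]
    table, ss_model = {}, 0.0
    for a in range(3):  # main effects
        ss = sum((main[a][(c[a],)] - grand) ** 2 for c in cells)
        table[names[a]] = [ss, n[a] - 1]
        ss_model += ss
    for a, b in ((0, 1), (0, 2), (1, 2)):  # two-way interactions
        two = marginal((a, b))
        ss = sum(
            (two[(c[a], c[b])] - main[a][(c[a],)] - main[b][(c[b],)] + grand) ** 2
            for c in cells
        )
        table[names[a] + "*" + names[b]] = [ss, (n[a] - 1) * (n[b] - 1)]
        ss_model += ss
    ss_total = sum((y[c[0]][c[1]][c[2]] - grand) ** 2 for c in cells)
    df_error = (n[0] - 1) * (n[1] - 1) * (n[2] - 1)
    ms_error = (ss_total - ss_model) / df_error
    for row in table.values():
        ms = row[0] / row[1]
        row.extend([ms, ms / ms_error])  # residual MS is the error term
    table["Error"] = [ss_total - ss_model, df_error, ms_error, None]
    return table

# Hypothetical scores: 2 sessions x 2 assessors x 3 samples (invented data).
scores = [
    [[10, 20, 30], [12, 22, 35]],  # session 1: assessor 1, assessor 2
    [[11, 19, 33], [15, 21, 34]],  # session 2: assessor 1, assessor 2
]

table = three_way_anova(scores)
for effect, (ss, df, ms, f) in table.items():
    print(effect, round(ss, 2), df, round(ms, 2), None if f is None else round(f, 2))
```

For the within-session design in 7.6.4.2, the same decomposition applies with the Session factor removed and the replicate-within-cell variation used as the error term.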
7.7 Step 4. Check Agreement Among Assessors:
7.7.1 Use the output of the ANOVA (Model 1a) to evaluate assessor agreement. Tables 4 and 5 give examples of the ANOVA
outputs for the attributes Sweet and Sour. The following abbreviations are used in the ANOVA output tables: DF = Degrees of
F
...







