ASTM E2849-18(2024)
Standard Practice for Professional Certification Performance Testing
SIGNIFICANCE AND USE
3.1 This practice for performance testing provides guidance to performance test sponsors, developers, and delivery providers for the planning, design, development, administration, and reporting of high-quality performance tests. This practice assists stakeholders from both the user and consumer communities in determining the quality of performance tests. This practice includes requirements, processes, and intended outcomes for the entities that are issuing the performance test, developing, delivering and evaluating the test, users and test takers interpreting the test, and the specific quality characteristics of performance tests. This practice provides the foundation for both the recognition and accreditation of a specific entity to issue and use effectively a quality performance test.
3.2 Accreditation agencies are presently evaluating performance tests with criteria that were developed primarily or exclusively for multiple-choice examinations. The criteria by which performance tests shall be evaluated and accredited are ones appropriate to performance testing. As accreditation becomes more critical for acceptance by federal and state governments, insurance companies, and international trade, it becomes more critical that appropriate standards of quality and application be developed for performance testing.
SCOPE
1.1 This practice covers both the professional certification performance test itself and specific aspects of the process that produced it.
1.2 This practice does not include management systems. In this practice, the test itself and its administration, psychometric properties, and scoring are addressed.
1.3 This practice primarily addresses individual professional performance certification examinations, although it may be used to evaluate exams used in training, educational, and aptitude contexts. This practice is not intended to address on-site evaluation of workers by supervisors for competence to perform tasks.
1.4 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.5 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Standards Content (Sample)
This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: E2849 − 18 (Reapproved 2024) An American National Standard
Standard Practice for Professional Certification Performance Testing
This standard is issued under the fixed designation E2849; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.
1. Scope
1.1 This practice covers both the professional certification performance test itself and specific aspects of the process that produced it.
1.2 This practice does not include management systems. In this practice, the test itself and its administration, psychometric properties, and scoring are addressed.
1.3 This practice primarily addresses individual professional performance certification examinations, although it may be used to evaluate exams used in training, educational, and aptitude contexts. This practice is not intended to address on-site evaluation of workers by supervisors for competence to perform tasks.
1.4 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.5 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Terminology
2.1 Definitions—Some of the terms defined in this section are unique to the performance testing context. Consequently, terms defined in other standards may vary slightly from those defined in the following.
2.1.1 automatic item generation (AIG), n—a process of computationally generating multiple forms of an item.
2.1.2 candidate, n—someone who is eligible to be evaluated through the use of the performance test; a person who is or will be taking the test.
2.1.3 construct validity, n—degree to which the test evaluates an underlying theoretical idea resulting from the orderly arrangement of facts.
2.1.4 differential system responsiveness, n—measurable difference in response latency between two systems.
2.1.5 examinee, n—candidate in the process of taking a test.
2.1.6 gating item, n—unit of evaluation that shall be passed to pass a test.
2.1.7 inter-rater reliability, n—measurement of rater consistency with other raters.
2.1.7.1 Discussion—See rater reliability.
2.1.8 item, n—scored response unit.
2.1.8.1 Discussion—See task.
2.1.9 item observer, n—human or computer element that observes and records a candidate’s performance on a specific item.
2.1.10 on the job, n—another term for “target context.”
2.1.10.1 Discussion—See target context.
2.1.11 performance test, n—examination in which the response modality mimics or reflects the response modality required in the target context.
2.1.12 power test, n—examination in which virtually all candidates have time to complete all items.
2.1.13 practitioners, n—people who practice the contents of the test in the target context.
2.1.14 rater reliability, n—measurement of rater consistency with a uniform standard.
2.1.14.1 Discussion—See inter-rater reliability.
2.1.15 reconfiguration, n—modification of the user interface for a process, device, or software application.
2.1.15.1 Discussion—Reconfiguration ranges from adjusting the seat in a crane to importing a set of macros into a programming environment.
2.1.16 reliability, n—degree to which the test will make the same prediction with the same examinee on another occasion with no training occurring during the intervening interval.
2.1.17 rubric, n—set of rules by which performance will be judged.
2.1.18 speeded test, n—examination that is time-constrained so that more than 10 % of candidates do not finish all items.
This practice is under the jurisdiction of ASTM Committee E36 on Accreditation & Certification and is the direct responsibility of Subcommittee E36.30 on Personnel Credentialing. Current edition approved Feb. 1, 2024. Published March 2024. Originally approved in 2013. Last previous edition approved in 2018 as E2849 – 18. DOI: 10.1520/E2849-18R24.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959, United States
2.1.19 target context, n—situation within which a test is designed to predict performance.
2.1.20 task, n—unit of performance requested for the candidate to do; a task can be scored as one item; a task may also be comprised of multiple components each of which is scored as an item.
2.1.21 test, n—sampling of behavior over a limited time in which an authenticated examinee is given specific tasks under specified conditions, tasks that are scored by a uniformly applied rubric.
2.1.21.1 Discussion—A test can also be referred to as an assessment, although typically “assessment” is used for formative evaluation. This practice addresses specifically certification and licensure, as stated in 1.3. A test is designed to predict the examinee’s behavior in a specified context, the “target context.”
2.1.22 trajectory, n—candidate’s path through the solution to a single item, task, or test.
2.1.22.1 Discussion—Also termed the response trajectory.
2.1.23 validity, n—extent to which a test predicts target behavior for multiple candidates within a target context.
3. Significance and Use
3.1 This practice for performance testing provides guidance to performance test sponsors, developers, and delivery providers for the planning, design, development, administration, and reporting of high-quality performance tests. This practice assists stakeholders from both the user and consumer communities in determining the quality of performance tests. This practice includes requirements, processes, and intended outcomes for the entities that are issuing the performance test, developing, delivering and evaluating the test, users and test takers interpreting the test, and the specific quality characteristics of performance tests. This practice provides the foundation for both the recognition and accreditation of a specific entity to issue and use effectively a quality performance test.
3.2 Accreditation agencies are presently evaluating performance tests with criteria that were developed primarily or exclusively for multiple-choice examinations. The criteria by which performance tests shall be evaluated and accredited are ones appropriate to performance testing. As accreditation becomes more critical for acceptance by federal and state governments, insurance companies, and international trade, it becomes more critical that appropriate standards of quality and application be developed for performance testing.
4. Candidate Preparation
4.1 Number of Practice Items—A candidate shall be given access to sufficient practice items that the novelty of the item format shall not inhibit the examinee’s ability to demonstrate his or her capabilities.
4.2 Scoring Rubric Available to Candidates:
4.2.1 Candidates shall have sufficient information about the scoring rubric to be able to appropriately prioritize their efforts in completing the item or test.
4.2.2 The examinee shall not be provided so much information about the scoring rubric that it diminishes the ability of stakeholders to generalize the examinee’s skills from his or her test score.
4.3 Practice Tests:
4.3.1 There are two types of practice tests: one for gaining familiarity with the user interface of the test items and the other to allow the candidate to self-evaluate mastery of the content.
4.3.1.1 User Interface Preparation—A practice test or tests to familiarize candidates with the user interface shall be made available to the candidate at no charge. The practice test shall be sufficient to assure adequate candidate practice time so that the degree of familiarity with the user interface does not impair the validity of the test.
4.3.1.2 Content Self-Assessment—Practice tests that evaluate content mastery may be made available at no charge or for a fee. There is no obligation on the part of the test provider to provide a self-assessment practice test to evaluate content mastery.
NOTE 1—If a practice test is provided, it shall sample test content sufficiently to allow the candidate to predict reasonably success or failure on the test.
4.3.2 Candidates shall know specifically which type of practice test they are requesting.
4.3.3 Both types of practice test shall help candidates understand how their responses are going to be scored.
5. Procedure
5.1 Item Development—All requirements in Section 5 may be superseded by empirical, logical, or statistical arguments demonstrating that the practices of a certification body are equivalent to or superior to the practices required to meet this practice.
5.1.1 Item Time Limits:
5.1.1.1 When items or test sections can be accessed repeatedly, no item time limit is required to be enforced or recommended to the candidate.
5.1.1.2 When items can be accessed only once, item time limits shall be either suggested or enforced, with a visual timekeeping option for the examinee.
5.1.1.3 For a power test, item time limits shall be set using a standard practice such as the mean item response time measured in beta testing plus two standard deviations for successful candidates within the calibration sample. When sufficient data have been collected from test administrations, the item time shall be recalibrated to reflect performance on the actual test.
5.1.1.4 For a speeded test, item time limits shall be determined by measuring minimum acceptable time limits in the target context.
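The calibration rule in 5.1.1.3 (mean beta-test response time for successful candidates plus two standard deviations) amounts to a short calculation. The following is a minimal sketch only, not part of the practice; the function name, data layout, and sample values are hypothetical.

```python
import statistics

def item_time_limit(successful_response_times):
    """Illustrative calibration per 5.1.1.3: mean response time of successful
    calibration candidates plus two standard deviations (times in seconds)."""
    mean = statistics.mean(successful_response_times)
    sd = statistics.stdev(successful_response_times)  # sample standard deviation
    return mean + 2 * sd

# Example: beta-test times (seconds) for candidates who passed the item.
beta_times = [312, 355, 298, 401, 367, 340, 389, 325]
print(round(item_time_limit(beta_times)))  # suggested item time limit, seconds
```

As 5.1.1.3 notes, such a limit would be recalibrated once sufficient live administration data are available.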
5.1.2 Differential System Responsiveness—Differential system responsiveness may be due to variance in network bandwidth, network latency, random-access memory (RAM), storage speed, operating systems, computer processing unit (CPU) count and performance, bus speed, or other factors.
NOTE 2—It is the obligation of the test developer to attempt to measure differences in latency and system responsiveness whenever possible and, if possible, to compensate appropriately for these variations.
5.1.2.1 There shall be compensation in test scoring for variances in the hardware and software environment to assure that all examinees are scored fairly.
NOTE 3—Compensation may be in adjusting item time limits, item latency scoring factors, or other compensatory variables.
5.1.2.2 An examinee taking a test under one set of conditions shall receive the same score as if he or she took the test under any admissible alternative set of conditions.
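One way to read NOTE 3 together with 5.1.2.2 is as a requirement that scoring inputs be normalized for measured system latency. The sketch below is only an illustration under that assumption; the reference latency, the measurement method, and the function name are hypothetical and are not prescribed by this practice.

```python
def compensated_time_limit(base_limit_s, measured_latency_s, reference_latency_s):
    """Illustrative compensation in the spirit of NOTE 3: extend an item's time
    limit by the excess of the delivery system's measured latency over a
    reference baseline, so a slower environment does not cut into working time."""
    excess = max(0.0, measured_latency_s - reference_latency_s)
    return base_limit_s + excess

# Example: a 300 s item delivered on a workstation 4.5 s slower than baseline.
print(compensated_time_limit(300.0, measured_latency_s=6.0, reference_latency_s=1.5))  # 304.5
```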
5.1.3 References/Citations—When possible, codes, guidelines, industry standards, application source code, or other evidence shall be sufficient to establish the correctness of scoring a procedure. Where such documentation does not exist, correct responses may be documented as standard practice by a vote of the subject matter expert (SME) advisory panel for the test.
5.1.4 Rater Reliability—When human raters are involved in assessing item success, rater reliability shall correlate with an established performance standard greater than 0.80.
5.1.4.1 When multiple raters are used to rate a single performance, inter-rater reliability shall correlate higher than 0.80.
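The 0.80 thresholds in 5.1.4 and 5.1.4.1 are correlation criteria. A Pearson correlation between a rater's item-level scores and the reference standard (or another rater's scores on the same performances) is one common way to check them; the practice does not prescribe a particular coefficient, so treat that choice, along with the scores below, as an assumption for illustration only.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numeric scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical item scores: reference standard vs. one rater (5.1.4),
# and rater A vs. rater B on the same performances (5.1.4.1).
standard = [4, 3, 5, 2, 4, 5, 3, 1]
rater_a  = [4, 3, 4, 2, 5, 5, 3, 2]
rater_b  = [5, 3, 4, 2, 4, 5, 2, 2]
print(pearson(standard, rater_a) > 0.80)  # rater reliability criterion
print(pearson(rater_a, rater_b) > 0.80)   # inter-rater reliability criterion
```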
5.1.5 Automated Scoring—To verify automated scoring, the test developer shall develop test cases that verify the scoring of a minimum of 95 % of anticipated responses. When items are scored automatically, for the first 100 administrations of the test, the test developer shall verify that the scoring algorithm is scoring responses correctly. Verification may be done by human observation, alternate scoring mechanisms, playback of recorded performance, or audit of collected data. Initial verification shall be performed for at least 5 % of failed items. After 100 administrations, the developer shall verify 1 % of failed items until at least 200 failed items have been checked.
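The verification schedule in 5.1.5 (at least 5 % of failed items during the first 100 administrations, then 1 % until at least 200 failed items have been checked) can be expressed as a simple sampling rule. The helper below is a minimal sketch under one reading of that schedule; the function and parameter names are illustrative, and a test program's own quality procedures govern the actual audit.

```python
import math

def verification_sample_size(administrations_so_far, failed_items, already_checked):
    """Illustrative sampling rule for auditing automated scoring per 5.1.5.

    administrations_so_far -- test administrations delivered to date
    failed_items           -- failed item responses in the current batch
    already_checked        -- failed item responses verified so far
    Returns how many failed item responses from the batch to verify.
    """
    if administrations_so_far <= 100:
        rate = 0.05   # initial verification: at least 5 % of failed items
    elif already_checked < 200:
        rate = 0.01   # after 100 administrations: 1 % until 200 have been checked
    else:
        return 0      # the minimum audit obligation described in 5.1.5 has been met
    return math.ceil(rate * failed_items)

# Example: 60 administrations so far, 40 failed item responses in this batch.
print(verification_sample_size(60, failed_items=40, already_checked=10))  # 2
```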
5.1.6 Item Stimulus Construction—The item solution space shall enable options that would be used by at least 95 % of practitioners in addressing the problem represented by the item.
NOTE 4—The estimate of the practitioner percentage can be derived empirically from usability studies, use case ...
... import their industry standard configurations into the test environment, provided that doing so does not compromise exam security, provide unfair advantage over other candidates, or impact the generalizability of results.
5.1.9.3 The criterion the test developer shall use to determine “minimal reconfiguration” is whether competence measured with the default configuration will predict performance with a reconfigured system.
5.1.10 Level of Feedback—Feedback during the test shall reflect feedback available when doing similar tasks in the target context.
NOTE 5—Feedback may be time compressed to minimize testing time. Interim results may be omitted if they do not impact success in performing the item.
5.1.11 Americans with Disabilities Act (ADA) Accommodations—Accommodations shall be fair to the candidate, the testing administrator, other candidates, and the potential employer alike, with no interest predominating. Before awarding accommodations, the test administrator shall discuss with the candidate what the candidate feels would be reasonable accommodations and, when feasible, shall allow the methods candidates use for accomplishing tasks in the target context. The candidate shall possess the capability to perform the required test item in full with the agreed upon accommodations. In no case shall a verbal option be given in place of a performance requirement.
5.1.12 Sensitivity and Bias—Items shall be developed with sensitivity toward the cultural context within which the candidate will be practicing the skills evaluated. The items shall not include content that would prevent people of equal ability or skill from exhibiting those abilities or skills.
5.1.13 Item Response Termination—Item termination methods used shall create an environment in which the examinee’s response during a test will best predict performance in the target context.
NOTE 6—In the target context, if an examinee determines completion of the task, then the examinee shall indicate completion of the task on the ...