Standard Practice for Assessing Language Proficiency

SIGNIFICANCE AND USE
4.1 Intended Use:  
4.1.1 This practice is intended to serve the language test developer, test provider, and language test user communities in their ability to provide useful, timely, reliable, and reproducible tests of language proficiency for general communication purposes. This practice expands the testing capacity of the United States by leveraging commercial and existing government test development and delivery capability through standardization of these processes. This practice is intended to be used by contract officers, program managers, supervisors, managers, and commanders. It is also intended to be used by test developers, those who select and evaluate tests, and users of test scores.  
4.1.2 Furthermore, the intent of this practice is to encourage the use of expert teams to assist contracting officers, contracting officer representatives, test developers, and contractors/vendors in meeting the testing needs being addressed. Users of this practice are encouraged to focus on meeting testing needs and not to interpret this practice as limiting innovation in any way.  
4.2 Compliance with the Practice:  
4.2.1 Compliance with this practice requires adherence to all sections of this practice. Exceptions are allowed only in specific cases in which a particular section of this practice does not apply to the type or intended use of a test. Exceptions shall be documented and justified to the satisfaction of the customer. Nothing in this practice should be construed as contradicting existing federal and state laws nor allowing for deviation from established U.S. Government policies on testing.
SCOPE
1.1 Purpose—This practice describes best practices for the development and use of language tests in the modalities of speaking, listening, reading, and writing for assessing ability in accordance with the Interagency Language Roundtable (ILR) scale. This practice focuses on testing language proficiency in use of language for communicative purposes.  
1.2 Limitations—This practice is not intended to address testing and test development in the following specialized areas: Translation, Interpretation, Audio Translation, Transcription, other job-specific language performance tests, or Diagnostic Assessment.  
1.2.1 Tests developed under this practice should not be used to address any of the above excluded purposes (for example, diagnostics).  
1.3 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

General Information

Status: Published
Publication Date: 31-Mar-2020
Drafting Committee: F43.04 - Language Testing

Relations

Effective dates of this and related standard editions: 01-Apr-2020, 01-Nov-2023, 01-Jan-2023, 15-Mar-2015, 15-Jan-2014, 01-Apr-2007, 01-May-2006, 15-Aug-2005, 10-Mar-2001, 10-Jul-1999.

Overview

ASTM F2889-11(2020), Standard Practice for Assessing Language Proficiency, establishes a comprehensive framework for the development, administration, and use of language proficiency tests focused on general communication skills. Developed by ASTM International, this standard addresses best practices and quality benchmarks for assessing the speaking, listening, reading, and writing abilities of candidates in accordance with the Interagency Language Roundtable (ILR) scale. The standard is aimed at test developers, providers, users, and decision-makers involved in language proficiency evaluation, including contract officers, program managers, and organizational leaders. By standardizing procedures, ASTM F2889 seeks to increase efficiency, ensure reliability, and expand language testing capacity while maintaining compliance with U.S. federal and state requirements.

Key Topics

  • Purpose and Scope: The standard outlines best practices for creating and using language proficiency tests, specifically for functional communication, and excludes specialized testing such as translation, interpretation, or diagnostic assessments.
  • Lifecycle Approach: It describes the iterative life cycle of test development, from initial planning and needs analysis through ongoing maintenance, emphasizing continuous quality and validity.
  • Validity and Reliability: Central to the standard are requirements for robust validity (the test measures intended skills for defined purposes) and reliability (consistency and reproducibility of results). Both are critical for making high-stakes decisions based on test outcomes.
  • Quality Assurance and Quality Control: Quality assurance (QA) processes ensure proper test design, planning, and stakeholder involvement; quality control (QC) verifies ongoing adherence to standards during and after test delivery.
  • Ethical Practices: The standard stresses ethical responsibilities in test development, administration, and result use, guiding test creators and organizations to safeguard fairness, transparency, and data integrity.
  • Technical Documentation: Comprehensive documentation is mandated throughout the test life cycle to support evaluation, transparency, and informed decision-making by all stakeholders.
  • Compliance and Exception Handling: Compliance with this practice is essential, with any justified exceptions thoroughly documented. All activities must align with existing laws and government policies.

Applications

ASTM F2889-11(2020) is broadly applicable across a range of language testing scenarios where reliable, standardized measurement of language proficiency is required:

  • Government and Defense: Contract officers, program managers, and commanders rely on this standard to ensure language tests meet operational needs and policy requirements.
  • Commercial Testing Providers: Companies developing or delivering language assessments use the standard to improve test design, delivery, and credibility.
  • Educational Institutions: Schools and universities implementing language proficiency tests for student placement or certification benefit from adherence to these internationally recognized practices.
  • Human Resources and Talent Management: Organizations use compliant tests for screening, placement, and development of multilingual staff, ensuring selection procedures are consistent and legally defensible.
  • Quality Management: Testing organizations implement QA and QC protocols defined in the standard to maintain test integrity and stakeholder trust over time.

Related Standards

To maximize the effectiveness and quality of language proficiency assessment, refer to the following related ASTM standards:

  • ASTM F1562: Guide for Use-Oriented Foreign Language Instruction
  • ASTM F2089: Practice for Language Interpreting
  • ASTM F2575: Guide for Quality Assurance in Translation

Additionally, these standards align with global best practices promoted by international organizations such as the World Trade Organization Technical Barriers to Trade (TBT) Committee.


Keywords: language proficiency assessment, ASTM F2889, language testing standards, ILR scale, test validity, test reliability, quality assurance, ethical testing, technical documentation, proficiency test development, government language testing, language skills evaluation

Buy Documents

Standard

ASTM F2889-11(2020) - Standard Practice for Assessing Language Proficiency

English language (24 pages)


Frequently Asked Questions

ASTM F2889-11(2020) is a standard published by ASTM International. Its full title is "Standard Practice for Assessing Language Proficiency". It covers the significance, use, and scope reproduced in the Significance and Use and Scope sections above.


ASTM F2889-11(2020) is classified under the following ICS (International Classification for Standards) categories: 03.180 - Education. The ICS classification helps identify the subject area and facilitates finding related standards.

ASTM F2889-11(2020) has the following relationships with other standards: it is linked to ASTM F2889-11, ASTM F1562-23, ASTM F2575-23e2, ASTM F2089-15, ASTM F1562-14, ASTM F2089-01(2007), ASTM F2575-06, ASTM F1562-95(2005), ASTM F2089-01, ASTM F1562-95(1999), ASTM F3130-18, and ASTM F3516-22. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

ASTM F2889-11(2020) is available in PDF format for immediate download after purchase. The document can be added to your cart and obtained through the secure checkout process. Digital delivery ensures instant access to the complete standard document.

Standards Content (Sample)


This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the
Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: F2889 − 11 (Reapproved 2020)
Standard Practice for
Assessing Language Proficiency
This standard is issued under the fixed designation F2889; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.
1. Scope

1.1 Purpose—This practice describes best practices for the development and use of language tests in the modalities of speaking, listening, reading, and writing for assessing ability in accordance with the Interagency Language Roundtable (ILR) scale. This practice focuses on testing language proficiency in use of language for communicative purposes.

1.2 Limitations—This practice is not intended to address testing and test development in the following specialized areas: Translation, Interpretation, Audio Translation, Transcription, other job-specific language performance tests, or Diagnostic Assessment.

1.2.1 Tests developed under this practice should not be used to address any of the above excluded purposes (for example, diagnostics).

1.3 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

2. Referenced Documents

2.1 ASTM Standards:
F1562 Guide for Use-Oriented Foreign Language Instruction
F2089 Practice for Language Interpreting
F2575 Guide for Quality Assurance in Translation

3. Terminology

3.1 Definitions:

3.1.1 achievement test, n—an instrument designed to measure what a person has learned within or up to a given time based on a sampling of what has been covered in the syllabus.

3.1.2 adaptive test, n—form of individually tailored testing in which test items are selected from an item bank where test items are stored in rank order with respect to their item difficulty and presented to test takers during the test on the basis of their responses to previous items, until it is determined that sufficient information regarding test takers' abilities has been collected. The opposite of a fixed-form test.

3.1.3 authentic texts, n—texts not created for language learning purposes that are taken from newspapers, magazines, etc., and tapes of natural speech taken from ordinary radio or television programs, etc.

3.1.4 calibration, n—the process of determining the scale of a test or tests.

3.1.4.1 Discussion—Calibration may involve anchoring items from different tests to a common difficulty scale (the theta scale). When a test is constructed from calibrated items then scores on the test indicate the candidates' ability, that is, their location on the theta scale.

3.1.5 cognitive lab, n—a method for eliciting feedback from examinees with regard to test items.

3.1.5.1 Discussion—Small numbers of examinees take the test, or subsets of the items on the test, and provide extensive feedback on the items by speaking their thought processes aloud as they take the test, answering questionnaires about the items, being interviewed by researchers, or other methods intended to obtain in-depth information about items. These examinees should be similar to the examinees for whom the test is intended. For tests scored by raters, similar techniques are used with raters to obtain information on rubric functioning.

3.1.6 computer adaptive test, n—a test administered by a computer in which the difficulty level of the next item to be presented to test takers is estimated on the basis of their responses to previous items and adapted to match their abilities.

3.1.7 construct, n—the knowledge, skill or ability that is being tested.

3.1.7.1 Discussion—The construct provides the basis for a given test or test task and for interpreting scores derived from this task.

This practice is under the jurisdiction of ASTM Committee F43 on Language Services and Products and is the direct responsibility of Subcommittee F43.04 on Language Testing. Current edition approved April 1, 2020. Published April 2020. Originally approved in 2005. Last previous edition approved in 2011 as F2889 – 11. DOI: 10.1520/F2889-11R20.

Interagency Language Roundtable, Language Skill Level Descriptors (http://www.govtilr.org/Skills/ILRscale1.htm).

For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard's Document Summary page on the ASTM website.

Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
3.1.8 constructed response, adj—a type of item or test task that requires test takers to respond to a series of open-ended questions by writing, speaking, or doing something rather than choose answers from a ready-made list.

3.1.8.1 Discussion—The most commonly used types of constructed-response items include fill-in, short-answer, and performance assessment.

3.1.9 content validity, n—a conceptual or non-statistical validity based on a systematic analysis of the test content to determine whether it includes an adequate sample of the target domain to be measured.

3.1.9.1 Discussion—In order to achieve content validity, an adequate sample involves ensuring that all major aspects are covered and in suitable proportions.

3.1.10 criterion-referenced scale, n—a graduated and systematic description of the domain of subject matter that a test is designed to assess; (or) a rating scale that provides for translating test scores into a statement about the behavior to be expected of a person with that score and/or their relationship to a specified subject matter.

3.1.10.1 Discussion—A criterion-referenced test is one that assesses achievement or performance against a cut score that is determined as a reflection of mastery or attainment of specified objectives. Focus is on ability to perform tasks rather than group ranking.

3.1.11 cut score, n—a score that represents achievement of the criterion, the line between success and failure, mastery and non-mastery.

3.1.12 dichotomous scoring, n—scoring based on two categories, for example, right/wrong, pass/fail. Compare to polytomous scoring.

3.1.13 equated forms, n—two or more forms of a test whose test scores have been transformed onto the same scale so that a comparison across different forms of a test is made possible.

3.1.14 expert panel, n—a group of target-language experts who take a test under test-like conditions and provide comments about any problem areas.

3.1.14.1 Discussion—An expert panel should include at least 8 members. Panel members receive training before they take the test in order to ensure that their comments will be helpful.

3.1.15 face validity, n—the degree to which a test appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of an observer.

3.1.16 fixed-form test, n—a test whose content does not vary in order to better accommodate to the examinee's level of knowledge, skill, ability or proficiency. The opposite of an adaptive test.

3.1.17 genre, n—a type of discourse that occurs in a particular setting, that has distinctive and recognizable patterns and norms of organization and structure, and that has particular and distinctive communicative functions.

3.1.18 ILR scale, n—a scale of functional language ability of 0 to 5 used by the Interagency Language Roundtable.

3.1.18.1 Discussion—The range of the ILR scale is from 0 (no knowledge of a language) to 5 (equivalent to a highly educated native speaker).

3.1.19 indirect test, n—a test that measures ability indirectly, rather than directly.

3.1.19.1 Discussion—An indirect test requires examinees to perform tasks that are not directly reflective of an authentic target-language use situation. Inferences are drawn about the abilities underlying the examinee's observed performance on the indirect test.

3.1.20 interpretation, n—the process of understanding and analyzing a spoken or signed message and re-expressing that message faithfully, accurately and objectively in another language, taking the cultural and social context into account.

3.1.20.1 Discussion—Although there are correspondences between the skills of interpreting and translating, an interpreter conveys meaning orally, while a translator conveys meaning from written text to written text. As a result, interpretation requires skills different from those needed for translation.

3.1.21 inter-rater reliability, n—the degree to which different examiners or judges making different subjective ratings of ability agree in their evaluations of that ability.

3.1.22 intra-rater reliability, n—the degree to which an individual examiner or judge renders consistent and reliable ratings.

3.1.23 item, n—one of the assessment units, usually a problem or a question, that is included on a test.

3.1.23.1 Discussion—Test items provide a means to measure whether a test taker can perform a task and are scorable using a scoring rubric or answer key. Successful or unsuccessful performance on an item contributes information to the test taker's overall score. Examples of item types include: multiple choice, constructed response, cloze, matching and essay prompts.

3.1.24 item response theory (IRT), n—the theory underlying statistical models that are used to describe the relationship between a student's ability level and the probability of success on a test question.

3.1.24.1 Discussion—IRT encompasses latent trait theory; logistic models; Rasch models; 1, 2, and 3 parameter IRT; normal ogive models; Generalized Partial Credit models; and Samejima's Graded Response model.

3.1.25 language proficiency, n—the degree of skill with which a person can use a language for communicative purposes.

3.1.25.1 Discussion—Language proficiency encompasses a person's ability to read, write, speak, or understand a language and can be contrasted with language achievement, which describes language ability as a result of learning. Proficiency may be measured through the use of a proficiency test.

3.1.26 operational validity, n—the extent to which item tasks, items, or interviewers on a test perform as intended and function to create an accurate score in a real world setting, as opposed to a setting involving an experiment, a simulation or training.

3.1.27 performance test, n—a test in which the ability of candidates to perform particular tasks, usually associated with job or study requirements, is assessed using "real-life" performance requirements as a criterion.
3.1.28 polytomous scoring, n—a model for scoring an item using a scale of at least three points.

3.1.28.1 Discussion—Using a polytomous scoring model, for example, the answer to a question can be assigned 0, 1, or 2 points. Open-ended questions are often scored polytomously. Also referred to as scalar or polychotomous scoring. Compare to dichotomous scoring.

3.1.29 predictive validity, n—the degree to which a test accurately and reliably predicts future performance in the domain being tested.

3.1.30 protocol, n—a standardized method or procedure for executing a given task, often formalized in documents.

3.1.31 quality assurance, v—the process of ensuring that the test planning and development phases are executed properly and satisfy the needs of all stakeholders.

3.1.31.1 Discussion—Quality assurance (QA) applies (1) when a new test is being created, (2) when a test that already exists is being repurposed or revised, (3) during certain aspects of the implementation process of the test (that is, replenishment of test items), (4) during item replenishment to ensure that new test items and prompts that will be used in the test conform to the original specifications that were used in creating the original items of that type, and (5) to train new personnel to administer the test to the same standards that were specified for the first testing personnel.

3.1.32 quality control, v—the system of post-development evaluations used at and after product acceptance to determine whether the test and testing practices used by an organization continue to meet and adhere to all standards and relevant testing policies.

3.1.32.1 Discussion—Quality control (QC) is used at and any time after product acceptance. QC verifies the continued validity and reliability of the test and shows the test is being used in an appropriate manner on an ongoing basis. Quality control (QC) is part of the test maintenance process.

3.1.33 rater, n—a suitably qualified and trained person who assigns a rating to a test taker's performance based on a judgment usually involving the matching of features of the performance to descriptors on a rating scale.

3.1.34 rating, v—to exercise judgment about an examinee's performance on a given task.

3.1.35 rating scale, n—a scale for the description of language proficiency consisting of a series of constructed levels against which a language learner's performance is judged.

3.1.36 reliability, n—the consistency of a test in measuring what it is intended to measure across the life of the test or the degree to which an instrument measures the same way each time used; reproducibility.

3.1.36.1 Discussion—Consistency is the essential notion of classical reliability. Reliability is defined as the extent that separate measurements (for example, items, scales, test administrations, and interviews) yield comparable results under the same or similar conditions. For example, test items measuring the same construct should yield similar results when administered to same group of test-takers under comparable testing situations. Simply put, reliability is the extent to which an item, scale, procedure, or test will yield the same value when administered under similar or dissimilar conditions.

3.1.37 scoring rubric, n—a standardized method or procedure used by a rater in assigning a score to an examinee's performance on a given task.

3.1.37.1 Discussion—A scoring rubric is a detailed document that is used by trained raters to assess test taker performance. Correct interpretation and application of the scoring rubric requires training.

3.1.38 selected response, adj—any item which requires the examinee to choose between response options which are provided to the examinee, including, but not limited to true/false and multiple-choice items.

3.1.39 skill modality, n—any one of the four receptive and productive language skills of listening, reading, speaking, writing as defined in the ILR.

3.1.40 specifications, n—a detailed description of the characteristics of a test, including what is tested, how it is tested, details such as number and length of papers, item types used, etc.

3.1.41 task, n—an activity performed by a test taker in order to demonstrate functions and other proficiency criteria stated in the ILR Skill Level Descriptors.

3.1.42 test-retest reliability, n—an estimate of the reliability of a test as determined by the extent to which a test gives the same results if it is administered at two different times under the same conditions with the same group of test takers.

3.1.42.1 Discussion—Test-retest reliability is estimated from the coefficient of correlation that is obtained from the two administrations of the test. An assessment should provide a stable measurement of a construct across multiple administrations, especially when the time interval in between the administrations limits the potential for the amount of the underlying proficiency to change. There are three components of the test-retest reliability method: (1) two measurements with the instrument at two separate times for each test taker; (2) computation of a correlation between the two separate measurements; and (3) assumption that no change has occurred in the underlying trait or construct.

3.1.43 translation, n—process comprising the creation of a written target text based on a source text in such a way that the content and in many cases, the form of the two texts, can be considered to be equivalent.

3.1.44 validity, n—the degree to which a test measures what it is intended to measure, or can be used successfully for the purpose for which it is intended.

3.1.44.1 Discussion—Validity is a judgment of the degree to which the evidence (arguments) supports the conclusions, interpretations, uses and inferences of test scores. A validity argument demonstrates the appropriateness and defensibility of a test's conclusions, interpretations, and inferences for a specific use in a given situation. The validity argument is based on the fact that a test is developed for specific uses and users and includes, but is not limited to, a description of and justification for test uses, impacts, audiences, and content. A number of different statistical procedures can be applied to a test to estimate its validity. Such procedures generally seek to determine what the test measures, and how well it does so. The rigor and strength of the validity argument should increase as the stakes associated with the test (consequences for the individual and/or organization) increase.

Cook, T. D. and Campbell, D. T., Quasi-Experimentation: Design and Analysis for Field Settings, Rand McNally, Chicago, IL, 1979.
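The correlation-based estimate described in 3.1.42.1 can be illustrated with a short sketch. This illustration is not part of ASTM F2889-11(2020); the score values, variable names, and the choice of a Pearson coefficient computed with NumPy are assumptions made purely for demonstration.

# Illustrative sketch only (not part of ASTM F2889-11(2020)).
# Test-retest reliability per 3.1.42.1: the same test is administered twice to
# the same examinees, and a correlation is computed between the two score sets.
# All score values below are hypothetical.
import numpy as np

# Component (1): two measurements at two separate times for each test taker.
administration_1 = np.array([2.0, 2.5, 3.0, 1.5, 2.0, 3.5, 2.5, 1.0, 3.0, 2.0])
administration_2 = np.array([2.0, 2.5, 2.5, 1.5, 2.5, 3.5, 2.0, 1.5, 3.0, 2.0])

# Component (2): computation of a correlation between the two measurements.
# Component (3), that the underlying proficiency has not changed, is assumed.
reliability_estimate = np.corrcoef(administration_1, administration_2)[0, 1]
print(f"Test-retest reliability estimate (Pearson r): {reliability_estimate:.2f}")

A coefficient near 1.0 indicates the stable, reproducible measurement the standard calls for; lower values signal that decisions based on the scores rest on shakier ground.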
4. Significance and Use

4.1 Intended Use:

4.1.1 This practice is intended to serve the language test developer, test provider, and language test user communities in their ability to provide useful, timely, reliable, and reproducible tests of language proficiency for general communication purposes. This practice expands the testing capacity of the United States by leveraging commercial and existing government test development and delivery capability through standardization of these processes. This practice is intended to be used by contract officers, program managers, supervisors, managers, and commanders. It is also intended to be used by test developers, those who select and evaluate tests, and users of test scores.

4.1.2 Furthermore, the intent of this practice is to encourage the use of expert teams to assist contracting officers, contracting officer representatives, test developers, and contractors/vendors in meeting the testing needs being addressed. Users of this practice are encouraged to focus on meeting testing needs and not to interpret this practice as limiting innovation in any way.

4.2 Compliance with the Practice:

4.2.1 Compliance with this practice requires adherence to all sections of this practice. Exceptions are allowed only in specific cases in which a particular section of this practice does not apply to the type or intended use of a test. Exceptions shall be documented and justified to the satisfaction of the customer. Nothing in this practice should be construed as contradicting existing federal and state laws nor allowing for deviation from established U.S. Government policies on testing.

5. Overarching Considerations

5.1 The purpose of a test is to provide useful information about examinees or programs. To build a useful test, developers and stakeholders must participate in an ongoing development and evaluation process, shown in Fig. 1 as the life cycle of a test and described further in Sections 6 – 10. Along with the processes of the life cycle, there are several interconnected elements that contribute to the usefulness of the information. These are validity (5.3), reliability (5.4), practicality (5.5), quality assurance (5.6), quality control (5.7), technical documentation (5.8), and ethics (5.9). This section provides general considerations about the life cycle and the elements as an overview, with Sections 6 – 10 providing more specific information about each phase of the life cycle.

5.2 Test Life Cycle—See Fig. 1.

FIG. 1 Test Life Cycle

5.2.1 The test life cycle is an iterative process, with new test development beginning with the plan for the test (to include a needs assessment, the creation of test framework and test specification documentation, followed by a plan for test maintenance). Test planning is described in Section 6. Following the acceptance of the planning stage, test development occurs (see Section 7). During this phase, qualifications are established and development teams hired, items are developed, scoring and rating is outlined, and validity evidence is collected. When the stakeholders agree that the test meets the expected standards, the test is accepted (see Section 8).

5.2.2 The test life cycle continues with test administration, ensuring standards for delivery, proctoring, scoring and rating, reporting of scores, and arbitration are met (see Section 9). The next stage in the test life cycle is test maintenance, which includes refreshment of test content (see Section 10). During this phase, new items are written and validated and testing documentation is updated to reflect current realities. When the test is determined to no longer meet the needs of the organization, it is retired.

5.3 Validity:

5.3.1 The validity argument begins at test creation and continues throughout the life of the test. The validity argument integrates multiple sources of data and brings elements from each stage of the life cycle as evidence for the goodness of fit between the test and its intended purpose. This is particularly important when a test has been developed for a specific use or audience and an organization wishes to use it for a different purpose or audience. When any test is developed, a test framework shall include an explanation of how the validity evidence will be gathered. As any part of the test use (such as the audience, purpose, administration, scoring, or content) changes, the original test validity argument shall be replaced with a new or supplemental argument. The rigor of the validity argument should be sufficient to justify the consequences of the use of its scores or ratings, such that as the stakes to test takers and organizations increase, the rigor and strength of the validity argument should increase.
5.4 Reliability:

5.4.1 Without consistency and stability of measurement as indicated by reliability, decisions made from test scores or ratings are biased or potentially erroneous. Items, tests, raters, and procedures shall yield reliable measurements and have psychometric merit to be a useful basis for judgments or inferences of knowledge, skill, or proficiency. Data that are unreliable are, by definition, unduly affected by error, and decisions based upon such data are likely to be quite tenuous at best and completely erroneous at worst. As the stakes of the test increase, reliability shall be more rigorously assessed. When any test is developed, a test framework shall include an explanation of how the reliability will be ensured. Although validity is considered the most important psychometric measurement property, the validity of an assessment is undermined if the construct or content domain cannot be measured accurately or consistently.

5.5 Practicality:

5.5.1 Practicality underlies the entire life cycle, as it is the extent to which appropriate resources are available for test development, operations, administration, and ongoing improvement. Necessary resources include:

5.5.1.1 Personnel to develop, administer, rate, score, report results, ensure security, and provide ongoing improvement;

5.5.1.2 Funds to develop the test, pay raters and administrators, support ongoing improvements, and manage test operations and security; and

5.5.1.3 Materials, including paper-based test booklets, scoring systems, tape recorders, and computers or computer software necessary for test administration, operations, scoring, security assurance, and ongoing improvement.

5.6 Quality Assurance (QA):

5.6.1 The application of QA to the creation of a new language proficiency test requires that a needs assessment be undertaken and executed correctly, and that input is received from all stakeholder groups. The needs assessment document is the first in a series of documents that guide the subsequent steps in the planning and development phases.

5.6.2 QA does not end when the test is created. Documentation that those original standards are being applied to new item creation and training shall be created during the process of new item creation or training.

5.7 Quality Control (QC)—Quality control is an essential component of the test maintenance process since it verifies the continued validity and reliability of the test and shows the test is being used in an appropriate manner on an ongoing basis. Documentation that supports the validity and reliability of the test and that the original standards and other relevant testing policies continue to be fulfilled shall be created and/or collected during quality control evaluations.

5.8 Technical Documentation:

5.8.1 All tests shall include technical documentation that covers the test life cycle from initial planning and development through ongoing test use. The technical documentation shall include sufficient information and evidence to evaluate the appropriateness and rigor of the approach, process, methodology, findings, decisions, and deliverables as appropriate to each stage of the test life cycle.

5.8.2 The documentation of test protocols and procedures, such as the test administration manual or the test security instructions, shall be provided and shall include sufficient information for the intended audience to perform their roles and responsibilities. Documentation shall meet professional standards for presenting information and evidence as appropriate to the specific stage of the test life cycle. The documentation can be provided as a series of individual reports for each stage or as a single report for the entire life cycle.

5.8.3 Documentation shall be periodically updated and supplemented as the test is either modified or extended to additional uses, populations, or contexts. These updates can be provided as supplemental reports or updates to the original reports.

5.9 Ethics:

5.9.1 At the highest level, ethics is a form of QA and QC. Ethics encompasses both standards of practice and moral obligations. Unethical behavior, whether intentional or unintentional, can result in considerable harm and be very costly to the organizations and individuals affected. Unethical behavior negatively affects the quality of the information provided by the test and reflects poorly on organizations, casting the professionals who create, use, or rely on test data in a poor light. Furthermore, the perceived value of language tests depends upon ethical practice and decisions made on the basis of test scores assume ethical practice.

5.9.2 In the development and operationalization of a language test, contracting agencies, testing organizations, test developers, and test users have ethical responsibilities. It is the responsibility of these organizations and individuals to determine, communicate, and document any local responsibilities and obligations that may not be known to others involved in the development and administration of a test. In all phases of a testing project, it is the responsibility of all participants to consider the ethical implications of their own and other's actions.

5.9.3 In addition to the standards included in Section 6, other sections of this practice address ethical considerations in language testing, since practicing ethical behavior is a part of good testing practice. Several organizations have created ethical codes of practice in educational measurement designed to safeguard the rights of test takers by focusing on professional test development practices that could negatively impact examinees. These documents can also serve as guides to ethical behavior in language testing.

5.9.4 Publication and Distribution of Accurate Information—Test information provided to testing organizations, test developers, test users, and test takers shall be true and accurate. It is unethical to knowingly misrepresent information about a test.

For example, the International Language Testing Association (ILTA) Code of Ethics (www.iltaonline.com) and the Joint Committee on Testing Practices Code of Fair Testing Practices in Education (www.apa.org).
5.9.5 Copyright and Proprietary Materials—Authorization for reproduction and distribution of secure test materials shall follow procedures established during the development process. All authorized reproduction shall be documented. Test developers and testing organizations shall respect copyright laws. Test materials subject to copyright may include, but are not limited to, test forms, items, ancillary materials, answer sheets, scoring templates, and conversion tables.

5.9.5.1 If required by law, test developers shall ensure copyright permissions are obtained for any materials used in the test.

5.9.5.2 When required by law, testing organizations shall obtain consent of the owner before reproducing copyrighted or proprietary test materials.

6. Test Planning

6.1 Test planning is a phase of the test life cycle that begins with resource planning (6.3) and needs analysis (6.4) and guides the production of a series of key documents including the product acceptance plan (6.5), the test framework (6.6), test specifications (6.7), the test maintenance plan (6.8), the test refreshment plan (6.9), and the test security plan (6.10). All of these documents shall be developed in accordance with 5.8 and shall be revisited throughout the life cycle of testing to ensure continued relevance.

6.2 The test planning documents are related and inform each other. The resource planning and test security documents will evolve as additional needs are brought to light through the other documents. The needs analysis document is the first in a series of documents that guide the subsequent steps in the planning and development phases. The needs analysis guides the creation of the framework document. These two documents together guide the creation of the test specifications document.

6.3 Resource Planning—Without resources, a test cannot be developed. Because there are so many components to planning, development, administration, maintenance, refreshment, and security, organizations that wish to have tests shall develop a plan for resource allocation. This plan will change as test planning and development progresses: for example, after the needs analysis is funded, it may reveal the need for a level of statistical analysis that was not foreseen. Nevertheless, beginning with a plan for the resources known to be needed at the time, as well as a plan for revisiting resource needs, is crucial for the ultimate success of the test project. The resource plan shall address, at a minimum:

6.3.1 Personnel to plan, develop, analyze, produce, administer, rate, report, maintain, refresh, and provide adequate security for the test;

6.3.2 Funds to provide infrastructure such as test item banks, computer-adaptive algorithms, test centers, and secure servers;

6.3.3 Materials for development, production, and security;

6.3.4 Contingency funds for security breaches; and

6.3.5 Mechanisms for revising resource allocation as new needs become apparent through the planning, development, and maintenance process.

6.4 Needs Analysis—An organization's development, commissioning, or selection of a language test shall be based on the language use needs of the personnel to be tested by the organization. The ultimate responsibility for determining and evaluating the suitability of a test for a particular use rests with the organization using the test, not with the organization that developed the test. To ensure that the test is appropriate for its intended use, the organization shall perform a needs analysis before developing, commissioning, or selecting any language test. Then, the findings can be compared with the scope, design, tasks, purpose, and Interagency Language Roundtable (ILR) level(s) of any proposed test to determine the ability of that test to meet the organization's current assessment needs.

6.4.1 Repurposing of Existing Tests—If an existing test is proposed for use in a situation that was unanticipated by its original designers or developers, the organization proposing the repurposing of the test shall evaluate its suitability for use in the new situation. While the results of the original needs analysis may have been useful in determining the suitability of an existing test for its originally intended use, they might not be sufficient evidence to justify the use of that test in a situation for which it was not intended, especially if high-stakes decisions will be made.

6.4.2 Scope of Input—The needs analysis should include input from the wider community of potential users to maximize opportunities for coordination and minimize duplication of effort. By having a needs analysis done, the organization will be able to determine the degree of fit between the ILR scale and the language skills needs of potential examinees who use language skills in their work. The organization should also recognize that the degree of fit may vary by the type of job or position within the organization. Thus, no single test may fit all situations in which a test is needed. In some situations, a needs analysis may reveal that an ILR-based test is appropriate for the whole potential testing population. In other situations, a needs analysis may reveal that a performance test or a test of language for specific purposes would be more appropriate for at least some segments of the potential testing population.

6.4.3 Results—Whenever possible, the results of the needs analysis study shall be shared with the group responsible for developing or selecting the test. When it is not possible, it is incumbent on the organization that will use the test to use the results of the study to specify the desired language skills to be assessed.

6.4.4 Intended Use—The organization that will use the test also shall consider the type of decisions that will be made on the basis of the test scores. Scores used to make high-stakes decisions require the selection or development of a test with a high degree of reliability and validity. Thus, indirect measures of the desired skills might not be suitable without strong evidence to support their use.

6.4.5 Minimum Requirements—As a minimum requirement, the results of the needs analysis shall provide the organization that will develop or supply the test with the following information:

6.4.5.1 The language requirements of the organization(s) that will use the test (including if applicable, variants of scripts, fonts, accents, and dialects),
6.4.5.2 The ILR level(s) that are needed to fulfill the language proficiency requirements of the organization(s) that will use the test,

6.4.5.3 The type of decisions that will be made on the basis of test scores,

6.4.5.4 How many examinees will take the test,

6.4.5.5 How often each examinee will be tested, and

6.4.5.6 The facilities available or planned for testing.

6.4.5.7 The circumstances under which a documentation audit (see Section 10) may be requested, and by whom.

6.4.6 Documentation—Needs analysis shall be documented in accordance with 5.8.

6.5 Product Acceptance Plan:

6.5.1 For a test to be used operationally, it shall be accepted by the relevant stakeholders. The organization or organizations that will use the test and the test development organization together shall develop a product acceptance plan that reflects the needs of stakeholders and developers for the particular testing program. In some cases, the stakeholders will not be involved until final acceptance of the test; in others, they may need to see interim products, such as the framework document or the results of field testing, to feel comfortable accepting the final product. The product acceptance plan shall include, at a minimum:

6.5.1.1 A list of the points in the planning and development process at which stakeholder acceptance is required (for example, the stakeholders might want to approve the framework document or the categories of people who can be examinees for field testing);

6.5.1.2 A list of the documents representing those points that the stakeholders will receive for approval (for example, the framework document, a list of examinees, and statistical reports on item quality);

6.5.1.3 A timeframe for acceptance (when the test developer shall submit materials to stakeholders and when stakeholders shall finalize their acceptance decision for each stage); and

6.5.1.4 A set of criteria by which stakeholders will judge acceptability (for example, they require the framework document to be readily understood by non-specialists).

6.5.2 As the planning, development, maintenance, and refreshment of a test progresses, the needs and priorities of the stakeholders may change, and it is legitimate to revise the list of points of acceptance and criteria for acceptance; however, these revisions shall be documented and agreed to by all involved, so that the acceptance process remains transparent and consistent across the testing program. Any agreed-upon revisions shall be fully funded and shall include appropriate revisions to project timelines and deliverable schedules.

6.6 Framework Document:

6.6.1 Purpose—A framework is an essential document that provides the rationale for the test design. It is the bridge between the needs analysis and the test specifications. It justifies and explains test design decisions. A framework document is useful for clarifying consequences of test use and providing an underpinning for test specifications. The more important the consequences of decisions based on the test scores, the more important it is for the framework document to be comprehensive and explicit. For ILR-based tests in particular, it is important to make clear the interpretation of the ILR and the aspects of the ILR that are considered important for the construct of the particular test in question. The framework document can then be used as a basis for making decisions about what new research needs to be conducted to justify using the test for different populations or using the test scores in a new way. The framework document shall be developed in accordance with 5.8. See 6.6.3 for more specific guidance.

6.6.2 Process—Test developers shall develop a framework document in close coordination with test users and other relevant stakeholders with input from outside testing experts as needed. At the beginning of a testing project, test developers shall inform stakeholders of the usefulness of a framework document and request that such a document be created before test development begins. In the event that stakeholders reject the request, test developers shall develop the framework document concurrently with the test specifications and the test items. The document should be updated in accordance with 5.8 as new research is conducted or new issues concerning test use arise. For existing tests that are being adopted for the testing of ILR-based proficiency, the organization that will use the test is responsible for creating a framework document, with the cooperation of the original developers if possible, preferably before the test begins to be used.

6.6.3 Content—The framework document shall contain the following:

6.6.3.1 The decisions to be made on the basis of test scores (for example, hiring, placement, and retention);

6.6.3.2 The intended consequences of test use (for example, eligibility for training courses, reassignment of personnel, or determination of operational readiness);

6.6.3.3 An interpretation of the relevant sections of the ILR skill level descriptions and how they are to be operationalized (for example, taking the phrase "speakers can make themselves understood to native speakers who are in regular contact with foreigners" and defining or exemplifying who those native speakers are and how this characteristic is assessed in the test);

6.6.3.4 An interpretation of the relevant sections of the ILR skill level descriptions and how they are to be operationalized (for example, taking the phrase "speakers can make themselves understood to native speakers who are in regular contact with foreigners" and defining or exemplifying who those native speakers are and how this characteristic is assessed in the test);

6.6.3.5 A justification of the links between test scores and their interpretations, uses, and consequences; and

6.6.3.6 An explanation of the research that has been done to support the links above and identification of areas in which more research is needed. This section would likely change as the test is used. Before the test is developed, research would presumably focus on previous types of tests, with a discussion of how the current test is similar or different, and this section would primarily outline predictive or concurrent validity studies that are planned for the test. Once the test is operational, the results of those validity studies would be incorporated. Any updates to the framework document shall be in accordance with 5.8.
6.7 Test Specifications Document—The test specifications is an essential document that provides detailed specifications regarding the construct, design, content, administration, scoring, reporting, and intended use of the test. The test specifications shall be sufficiently detailed to guide the day-to-day work of test development and serve as a standard against which the completeness of that work can be measured. The more important the consequences of decisions based on the test scores, the more important it is for the test specifications document to be comprehensive and explicit. For existing tests that are being used for new purposes, the organizations using the test are not responsible for obtaining or generating specifications for test design (6.7.5). The other sections of the specifications shall be obtained from the original test designers or written by the organization using the test to reflect the intended use, scoring or rating, reporting, and administration requirements of the test in its new use. The test specifications document shall be developed in accordance with 5.8.

6.7.1 Intended Test Use—The specifications shall clearly state that the purpose of the test is to measure general proficiency as defined by the ILR scale. The skill domain(s) covered by the test (listening, reading, speaking, or writing) shall be specified, as shall the range of ILR levels.

6.7.2 Construct Definition—The specifications shall clearly define the construct(s) to be measured with specific reference to the ILR skill level descriptions.

6.7.3 Intended Score Use(s)—The intended score use(s) and limitations in the application or interpretation of scores shall be clearly stated. T

...

6.7.5.3 Content specifications shall describe guidelines for content coverage and balance.

6.7.5.4 Test form specifications shall provide specific guidelines for test form construction, including number of items per passage, stage, and level (as applicable).

6.7.5.5 Test form specifications shall include guidelines for the development of tasks to ensure that such tasks are developed in a standard and replicable manner.

6.7.5.6 Specifications for adaptive tests shall include decision-tree guidelines or rubrics or both for human testers or adaptive algorithms for computer-adaptive tests.

6.7.6 Scoring, Rating, and Reporting:

6.7.6.1 Scoring specifications shall explain in detail how both raw and scaled scores are generated (as applicable) and how cut scores are set and interpreted.

6.7.6.2 Partial credit scoring models and criteria for evaluating and rating constructed responses by human raters shall be described in detail (as applicable).

6.7.6.3 Rating specification shall include explanations for how raters are trained and the rating scale being used for rating.

6.7.6.4 Reporting specifications shall describe how test scores and ratings are reported to test takers, test users, and other stakeholders (as applicable).

6.7.7 Administration and Technological Requirements:

6.7.7.1 The test specifications shall describe standard test administration conditions and procedures. The descriptions should include required training and qualification information for any test administration personnel and any materials or technology needed to administer the test under standard

...
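To make the adaptive-testing concepts in the excerpt concrete (3.1.2, 3.1.6, 3.1.24, 6.7.5.6), here is a minimal, illustrative sketch of one item-selection step under a one-parameter (Rasch) item response model. It is not part of the standard and is not an operational adaptive algorithm; the item bank, difficulty values, and ability estimate are hypothetical.

# Illustrative sketch only (not part of ASTM F2889-11(2020)).
# One selection step of a computer adaptive test (3.1.2, 3.1.6) under a
# one-parameter (Rasch) item response model (3.1.24). Operational adaptive
# algorithms (see 6.7.5.6) involve far more (exposure control, stopping rules,
# ability re-estimation); item difficulties and theta below are hypothetical.
import math

def rasch_probability(theta, difficulty):
    # Probability of a correct response given ability theta and item difficulty,
    # both expressed on the same theta scale (see 3.1.4.1).
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def select_next_item(theta, item_bank):
    # Pick the item whose difficulty is closest to the current ability estimate,
    # that is, the item whose predicted success probability is nearest 0.5.
    return min(item_bank, key=lambda item_id: abs(item_bank[item_id] - theta))

item_bank = {"item_A": -1.0, "item_B": 0.0, "item_C": 0.8, "item_D": 1.5}
theta_estimate = 0.6  # running ability estimate after the previous responses

next_item = select_next_item(theta_estimate, item_bank)
print(next_item, round(rasch_probability(theta_estimate, item_bank[next_item]), 2))

Running the sketch selects the item closest in difficulty to the current estimate (here item_C), which is the intuition behind the adaptive-test definitions in 3.1.2 and 3.1.6: each response updates the ability estimate, and the next item is chosen to be maximally informative about it.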

Questions, Comments and Discussion

Ask us and the Technical Secretary will try to provide an answer. You can also use this space to discuss the standard.

Loading comments...