Standard Practice for Assessing Language Proficiency

SIGNIFICANCE AND USE
4.1 Intended Use:  
4.1.1 This practice is intended to serve the language test developer, test provider, and language test user communities by supporting their ability to provide useful, timely, reliable, and reproducible tests of language proficiency for general communication purposes. This practice expands the testing capacity of the United States by leveraging commercial and existing government test development and delivery capabilities through the standardization of these processes. It is intended to be used by contract officers, program managers, supervisors, managers, and commanders, as well as by test developers, those who select and evaluate tests, and users of test scores.  
4.1.2 Furthermore, this practice is intended to encourage the use of expert teams to assist contracting officers, contracting officer representatives, test developers, and contractors/vendors in meeting the testing needs being addressed. Users of this practice are encouraged to focus on meeting testing needs and not to interpret this practice as limiting innovation in any way.  
4.2 Compliance with the Practice:  
4.2.1 Compliance with this practice requires adherence to all sections of this practice. Exceptions are allowed only in specific cases in which a particular section of this practice does not apply to the type or intended use of a test. Exceptions shall be documented and justified to the satisfaction of the customer. Nothing in this practice should be construed as contradicting existing federal and state laws or as allowing deviation from established U.S. Government policies on testing.
SCOPE
1.1 Purpose—This practice describes best practices for the development and use of language tests in the modalities of speaking, listening, reading, and writing for assessing ability in accordance with the Interagency Language Roundtable (ILR) scale. This practice focuses on testing language proficiency in the use of language for communicative purposes.  
1.2 Limitations—This practice is not intended to address testing and test development in the following specialized areas: Translation, Interpretation, Audio Translation, Transcription, other job-specific language performance tests, or Diagnostic Assessment.  
1.2.1 Tests developed under this practice should not be used to address any of the above excluded purposes (for example, diagnostics).  
1.3 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

General Information

Status
Published
Publication Date
31-Mar-2020

Buy Standard

Standard
ASTM F2889-11(2020) - Standard Practice for Assessing Language Proficiency
English language
24 pages

Standards Content (Sample)


This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

Designation: F2889 − 11 (Reapproved 2020)

Standard Practice for Assessing Language Proficiency

This standard is issued under the fixed designation F2889; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.

This practice is under the jurisdiction of ASTM Committee F43 on Language Services and Products and is the direct responsibility of Subcommittee F43.04 on Language Testing. Current edition approved April 1, 2020. Published April 2020. Originally approved in 2005. Last previous edition approved in 2011 as F2889 – 11. DOI: 10.1520/F2889-11R20.

Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959, United States
1. Scope

1.1 Purpose—This practice describes best practices for the development and use of language tests in the modalities of speaking, listening, reading, and writing for assessing ability in accordance with the Interagency Language Roundtable (ILR) scale. This practice focuses on testing language proficiency in the use of language for communicative purposes.
Interagency Language Roundtable, Language Skill Level Descriptors (http://www.govtilr.org/Skills/ILRscale1.htm).

1.2 Limitations—This practice is not intended to address testing and test development in the following specialized areas: Translation, Interpretation, Audio Translation, Transcription, other job-specific language performance tests, or Diagnostic Assessment.

1.2.1 Tests developed under this practice should not be used to address any of the above excluded purposes (for example, diagnostics).

1.3 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

2. Referenced Documents

2.1 ASTM Standards:
F1562 Guide for Use-Oriented Foreign Language Instruction
F2089 Practice for Language Interpreting
F2575 Guide for Quality Assurance in Translation
For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard’s Document Summary page on the ASTM website.

3. Terminology

3.1 Definitions:

3.1.1 achievement test, n—an instrument designed to measure what a person has learned within or up to a given time based on a sampling of what has been covered in the syllabus.

3.1.2 adaptive test, n—a form of individually tailored testing in which test items are selected from an item bank where test items are stored in rank order with respect to their item difficulty and presented to test takers during the test on the basis of their responses to previous items, until it is determined that sufficient information regarding test takers’ abilities has been collected. The opposite of a fixed-form test.

3.1.3 authentic texts, n—texts not created for language learning purposes that are taken from newspapers, magazines, etc., and tapes of natural speech taken from ordinary radio or television programs, etc.

3.1.4 calibration, n—the process of determining the scale of a test or tests.

3.1.4.1 Discussion—Calibration may involve anchoring items from different tests to a common difficulty scale (the theta scale). When a test is constructed from calibrated items, then scores on the test indicate the candidates’ ability, that is, their location on the theta scale.

3.1.5 cognitive lab, n—a method for eliciting feedback from examinees with regard to test items.

3.1.5.1 Discussion—Small numbers of examinees take the test, or subsets of the items on the test, and provide extensive feedback on the items by speaking their thought processes aloud as they take the test, answering questionnaires about the items, being interviewed by researchers, or other methods intended to obtain in-depth information about items. These examinees should be similar to the examinees for whom the test is intended. For tests scored by raters, similar techniques are used with raters to obtain information on rubric functioning.

3.1.6 computer adaptive test, n—a test administered by a computer in which the difficulty level of the next item to be presented to test takers is estimated on the basis of their responses to previous items and adapted to match their abilities.
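
To make the adaptive mechanism in 3.1.2 and 3.1.6 concrete, the sketch below shows one simple way a computer adaptive test might pick the next item from a difficulty-ranked bank. The stepping rule, starting point, and fixed-length stopping criterion are illustrative assumptions, not requirements of this practice.

# Illustrative sketch of a computer adaptive test loop (not part of ASTM F2889).
# Items are stored in rank order of difficulty; after each response the test
# steps up or down the difficulty ranking, stopping after a fixed number of items.

def run_adaptive_test(item_bank, ask, max_items=10):
    """item_bank: list of items sorted from easiest to hardest.
    ask: callable that administers an item and returns True if answered correctly."""
    index = len(item_bank) // 2          # start at medium difficulty
    step = max(1, len(item_bank) // 4)   # initial step size on the difficulty ranking
    responses = []
    for _ in range(max_items):
        item = item_bank[index]
        correct = ask(item)
        responses.append((item, correct))
        # Move up the ranking after a correct answer, down after an incorrect one.
        index += step if correct else -step
        index = max(0, min(len(item_bank) - 1, index))
        step = max(1, step // 2)         # narrow the search as information accumulates
    return responses

A caller would supply the ranked item bank and an ask function that administers an item and records whether the response was correct; an operational test would instead stop once sufficient information about the test taker's ability had been collected.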
3.1.7 construct, n—the knowledge, skill or ability that is being tested.

3.1.7.1 Discussion—The construct provides the basis for a given test or test task and for interpreting scores derived from this task.
3.1.8 constructed response, adj—a type of item or test task that requires test takers to respond to a series of open-ended questions by writing, speaking, or doing something rather than choosing answers from a ready-made list.

3.1.8.1 Discussion—The most commonly used types of constructed-response items include fill-in, short-answer, and performance assessment.

3.1.9 content validity, n—a conceptual or non-statistical validity based on a systematic analysis of the test content to determine whether it includes an adequate sample of the target domain to be measured.

3.1.9.1 Discussion—In order to achieve content validity, an adequate sample involves ensuring that all major aspects are covered and in suitable proportions.

3.1.10 criterion-referenced scale, n—a graduated and systematic description of the domain of subject matter that a test is designed to assess; (or) a rating scale that provides for translating test scores into a statement about the behavior to be expected of a person with that score and/or their relationship to a specified subject matter.

3.1.10.1 Discussion—A criterion-referenced test is one that assesses achievement or performance against a cut score that is determined as a reflection of mastery or attainment of specified objectives. Focus is on ability to perform tasks rather than group ranking.

3.1.11 cut score, n—a score that represents achievement of the criterion, the line between success and failure, mastery and non-mastery.

3.1.12 dichotomous scoring, n—scoring based on two categories, for example, right/wrong, pass/fail. Compare to polytomous scoring.

3.1.13 equated forms, n—two or more forms of a test whose test scores have been transformed onto the same scale so that a comparison across different forms of a test is made possible.
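
The definition in 3.1.13 does not prescribe how the transformation is obtained; as one common illustration (not drawn from this practice), linear mean-sigma equating places a score x from form X onto the scale of form Y:

y(x) = \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) + \mu_Y

where \mu and \sigma are the mean and standard deviation of scores on each form for a common or equivalent group; after the transformation, equal values of y indicate comparable performance across the two forms.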
3.1.14 expert panel, n—a group of target-language experts who take a test under test-like conditions and provide comments about any problem areas.

3.1.14.1 Discussion—An expert panel should include at least 8 members. Panel members receive training before they take the test in order to ensure that their comments will be helpful.

3.1.15 face validity, n—the degree to which a test appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of an observer.

3.1.16 fixed-form test, n—a test whose content does not vary in order to better accommodate the examinee’s level of knowledge, skill, ability or proficiency. The opposite of an adaptive test.

3.1.17 genre, n—a type of discourse that occurs in a particular setting, that has distinctive and recognizable patterns and norms of organization and structure, and that has particular and distinctive communicative functions.

3.1.18 ILR scale, n—a scale of functional language ability of 0 to 5 used by the Interagency Language Roundtable.

3.1.18.1 Discussion—The range of the ILR scale is from 0 (no knowledge of a language) to 5 (equivalent to a highly educated native speaker).

3.1.19 indirect test, n—a test that measures ability indirectly, rather than directly.

3.1.19.1 Discussion—An indirect test requires examinees to perform tasks that are not directly reflective of an authentic target-language use situation. Inferences are drawn about the abilities underlying the examinee’s observed performance on the indirect test.

3.1.20 interpretation, n—the process of understanding and analyzing a spoken or signed message and re-expressing that message faithfully, accurately and objectively in another language, taking the cultural and social context into account.

3.1.20.1 Discussion—Although there are correspondences between the skills of interpreting and translating, an interpreter conveys meaning orally, while a translator conveys meaning from written text to written text. As a result, interpretation requires skills different from those needed for translation.

3.1.21 inter-rater reliability, n—the degree to which different examiners or judges making different subjective ratings of ability agree in their evaluations of that ability.

3.1.22 intra-rater reliability, n—the degree to which an individual examiner or judge renders consistent and reliable ratings.
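
The practice does not name a statistic for the agreement described in 3.1.21. As a hypothetical illustration, the sketch below computes simple percent agreement and Cohen's kappa (a chance-corrected agreement index, supplied here as an example rather than a requirement) for two raters' scores:

from collections import Counter

def percent_agreement(r1, r2):
    """Share of examinees given the same rating by both raters."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: probability both raters assign the same category at random.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ILR-style ratings from two raters for ten examinees.
rater_a = [2, 2, 3, 1, 2, 3, 3, 2, 1, 2]
rater_b = [2, 3, 3, 1, 2, 3, 2, 2, 1, 2]
print(percent_agreement(rater_a, rater_b))  # 0.8
print(cohens_kappa(rater_a, rater_b))       # roughly 0.68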
3.1.23 item, n—one of the assessment units, usually a problem or a question, that is included on a test.

3.1.23.1 Discussion—Test items provide a means to measure whether a test taker can perform a task and are scorable using a scoring rubric or answer key. Successful or unsuccessful performance on an item contributes information to the test taker’s overall score. Examples of item types include: multiple choice, constructed response, cloze, matching, and essay prompts.

3.1.24 item response theory (IRT), n—the theory underlying statistical models that are used to describe the relationship between a student’s ability level and the probability of success on a test question.

3.1.24.1 Discussion—IRT encompasses latent trait theory; logistic models; Rasch models; 1-, 2-, and 3-parameter IRT; normal ogive models; Generalized Partial Credit models; and Samejima’s Graded Response model.
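
To give the relationship in 3.1.24 a concrete form (the discussion above names the model families without writing them out), the widely used three-parameter logistic (3PL) model expresses the probability that an examinee of ability \theta answers an item correctly as:

P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}

where b is the item's difficulty (its location on the theta scale mentioned in 3.1.4.1), a its discrimination, and c a lower asymptote allowing for guessing; fixing a = 1 and c = 0 recovers the Rasch model.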
3.1.25 language proficiency, n—the degree of skill with which a person can use a language for communicative purposes.

3.1.25.1 Discussion—Language proficiency encompasses a person’s ability to read, write, speak, or understand a language and can be contrasted with language achievement, which describes language ability as a result of learning. Proficiency may be measured through the use of a proficiency test.

3.1.26 operational validity, n—the extent to which item tasks, items, or interviewers on a test perform as intended and function to create an accurate score in a real-world setting, as opposed to a setting involving an experiment, a simulation or training.
3.1.27 performance test, n—a test in which the ability of candidates to perform particular tasks, usually associated with job or study requirements, is assessed using “real-life” performance requirements as a criterion.

3.1.28 polytomous scoring, n—a model for scoring an item using a scale of at least three points.

3.1.28.1 Discussion—Using a polytomous scoring model, for example, the answer to a question can be assigned 0, 1, or 2 points. Open-ended questions are often scored polytomously. Also referred to as scalar or polychotomous scoring. Compare to dichotomous scoring.
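
The contrast between dichotomous scoring (3.1.12) and polytomous scoring (3.1.28) can be shown in a few lines of code; the rubric, point values, and responses below are hypothetical illustrations, not material from the standard:

def score_dichotomous(response, answer_key):
    """Right/wrong: exactly two outcomes, 1 or 0."""
    return 1 if response == answer_key else 0

def score_polytomous(response, rubric):
    """Scale of at least three points: partial credit per rubric level."""
    for level, points in rubric:           # check rubric levels from best to worst
        if level(response):
            return points
    return 0

# Example rubric for a short constructed response, worth 0, 1, or 2 points.
rubric = [
    (lambda r: "main idea" in r and "detail" in r, 2),  # full credit
    (lambda r: "main idea" in r, 1),                    # partial credit
]
print(score_dichotomous("B", "B"))                # 1
print(score_polytomous("main idea only", rubric)) # 1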
3.1.29 predictive validity, n—the degree to which a test accurately and reliably predicts future performance in the domain being tested.

3.1.30 protocol, n—a standardized method or procedure for executing a given task, often formalized in documents.

3.1.31 quality assurance, n—the process of ensuring that the test planning and development phases are executed properly and satisfy the needs of all stakeholders.

3.1.31.1 Discussion—Quality assurance (QA) applies (1) when a new test is being created, (2) when a test that already exists is being repurposed or revised, (3) during certain aspects of the implementation process of the test (that is, replenishment of test items), (4) during item replenishment, to ensure that new test items and prompts that will be used in the test conform to the original specifications that were used in creating the original items of that type, and (5) to train new personnel to administer the test to the same standards that were specified for the first testing personnel.

3.1.32 quality control, n—the system of post-development evaluations used at and after product acceptance to determine whether the test and testing practices used by an organization continue to meet and adhere to all standards and relevant testing policies.

3.1.32.1 Discussion—Quality control (QC) is used at, and any time after, product acceptance. QC verifies the continued validity and reliability of the test and shows the test is being used in an appropriate manner on an ongoing basis. Quality ...

... administered to the same group of test-takers under comparable testing situations. Simply put, reliability is the extent to which an item, scale, procedure, or test will yield the same value when administered under similar or dissimilar conditions.

3.1.37 scoring rubric, n—a standardized method or procedure used by a rater in assigning a score to an examinee’s performance on a given task.

3.1.37.1 Discussion—A scoring rubric is a detailed document that is used by trained raters to assess test taker performance. Correct interpretation and application of the scoring rubric requires training.

3.1.38 selected response, adj—any item which requires the examinee to choose between response options which are provided to the examinee, including, but not limited to, true/false and multiple-choice items.

3.1.39 skill modality, n—any one of the four receptive and productive language skills of listening, reading, speaking, and writing as defined in the ILR.

3.1.40 specifications, n—a detailed description of the characteristics of a test, including what is tested, how it is tested, details such as number and length of papers, item types used, etc.

3.1.41 task, n—an activity performed by a test taker in order to demonstrate functions and other proficiency criteria stated in the ILR Skill Level Descriptors.

3.1.42 test-retest reliability, n—an estimate of the reliability of a test as determined by the extent to which a test gives the same results if it is administered at two different times under the same conditions with the same group of test takers.

3.1.42.1 Discussion—Test-retest reliability is estimated from the coefficient of correlation that is obtained from the two administrations of the test. An assessment should provide a stable measurement of a construct across multiple administrations, especially when the time interval in between the administrations limits the potential for the amount of the ...
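
The sample preview breaks off above. As a hypothetical illustration of the estimate described in 3.1.42.1 (the score lists are invented for the example), the Pearson correlation between scores from two administrations of the same test can be computed as follows:

from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same group of test takers on two administrations.
first_administration  = [55, 62, 70, 48, 81, 66, 59, 74]
second_administration = [57, 60, 72, 50, 79, 68, 61, 70]

# The test-retest reliability estimate is the correlation between the two sets.
r = correlation(first_administration, second_administration)
print(f"test-retest reliability estimate: {r:.2f}")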

Questions, Comments and Discussion

Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.