Cherry, Roger D., and Paul R. Meyer. “Reliability Issues in Holistic Assessment.” AW
Cherry and Meyer delve deeply into the concept of reliability, raising critical questions that researchers in writing assessment should consider, particularly (1) how we understand the utility of conversations concerning reliability; (2) how reliability is related to validity; and (3) how to organize our conversations about reliability around standardized statistical methods to understand the reliability of holistic scoring.
Defined, reliability refers to “how consistently a test measures whatever it measures” (30); validity “addresses the question of whether the test measures [what] it is designed to measure” (30). It is here that Cherry and Meyer argue, “a test cannot be valid unless it is reliable, but the opposite is not true; a test can be reliable but still not be valid” (30). However, no measure can ever be truly reliable; there will always be some degree of error. Cherry and Meyer point to three sources of error:
- Characteristics of the students: a writing sample from a student may reflect some aspects of that student’s writing ability at one given moment, but it cannot describe the student’s writing ability perfectly. As the authors write, student performance is inherently inconsistent; accordingly, this inherent error lessens our ability “to rely on a single writing sample to make judgments about a student’s writing ability” (31).
- Characteristics of the test: some writing tasks are easier for some students than others.
- Conditions affecting test administration and scoring: This covers the way raters are trained to assess or score student writing.
In writing assessment’s recent history, much attention has been given to interrater reliability, which addresses only the third source of error. Defined, this kind of reliability refers to the consistency among raters assigning scores to written texts. Attention to this particular kind of reliability neglects the other two sources; the authors, then, urge assessment researchers to shift their attention more toward instrument reliability: “the reliability of the writing assessment as a whole. Instrument reliability is concerned with the consistency of assessments across successive administrations of a test” (33). As the authors write, interrater reliability is a necessary condition for instrumental reliability, but not a sufficient one. In other words, instrument reliability includes interrater reliability, but exceeds it.
Understanding the relationship between reliability and validity
The authors then discuss the relationship between instrumental reliability and validity, specifically criterion-related validity. Criterion-related validity refers to the degree to which we can use a writing test to make appropriate decisions or “qualified, contextual claims” about a student’s performance in a correlated activity (like their performance in a writing course or their ability to do well in college). Like such validity, instrumental reliability is concerned with the generalizations that can be drawn from an assessment.
Confronting methods of calculating and reporting reliability
Cherry and Meyer specifically challenge the tertium quid procedure during a scoring session. This procedure involves three steps:
- When two ratings of a text disagree by more than one point, a third rating is obtained.
- The ‘bad’ rating of the three is thrown out.
- Interrater reliabilities are calculated on the basis of the new set of paired ratings.
The authors take this procedure apart on three grounds. First, a pair of ratings that differ by more than one point (such as 2 & 4, with a collective score of 6) should not be treated differently from a pair that agree on the average (such as 3 & 3, also with a collective score of 6). Second, all ratings should be seen as independently obtained, so none warrants being thrown out in favor of the others. Lastly, because some ratings are thrown away, calculating interrater reliability on only the remaining ratings distorts or inflates the result.
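The inflation the authors describe can be illustrated with a minimal sketch. Several things here are assumptions and do not come from Cherry and Meyer: the ratings are invented, the Pearson correlation stands in for whatever reliability coefficient a program actually reports, and the 'bad' rating is taken to be whichever original rating lies farther from the third rating (with ties dropping the second rating).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def tertium_quid(first, second, third):
    """Resolve discrepant pairs the way the three steps above describe.

    Assumption: the 'bad' rating is the original rating farther from the
    third rating; on a tie, the second rating is the one replaced.
    """
    resolved = []
    for a, b, c in zip(first, second, third):
        if abs(a - b) > 1:          # ratings disagree by more than one point
            if abs(a - c) > abs(b - c):
                a = c               # first rating is discarded
            else:
                b = c               # second rating is discarded
        resolved.append((a, b))
    return resolved


# Hypothetical ratings on a 4-point scale for eight essays.
first = [1, 2, 3, 4, 2, 4, 3, 1]
second = [1, 3, 3, 4, 4, 2, 2, 4]
# A third rating exists only where the first two differ by more than one point.
third = [None, None, None, None, 4, 4, None, 2]

raw_r = pearson(first, second)
resolved = tertium_quid(first, second, third)
adjusted_r = pearson([a for a, _ in resolved], [b for _, b in resolved])

print(f"correlation using all original ratings: {raw_r:.2f}")
print(f"correlation after tertium quid:         {adjusted_r:.2f}")
```

With these invented ratings, the coefficient computed on the resolved pairs is far higher than the one computed on the original pairs, even though the raters' actual judgments did not change. The size of the gap depends entirely on the data chosen for the illustration.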
However, calculating instrumental reliability is more difficult. The authors point to one way to do so: “the only way to determine the instrumental reliability of a given essay test of writing is for individual researchers or evaluators to obtain multiple writing samples from a representative group of subjects and intercorrelate them” (45). In other words, the test should be given to comparable student groups in similar contexts.
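As a rough sketch of what intercorrelating multiple writing samples might look like, the example below correlates each pair of samples written by the same group of students. The scores, the sample names, and the use of Pearson correlation are all assumptions made for illustration; the quoted passage does not specify a coefficient.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical summed holistic scores: one list per writing sample,
# one position per student (the same six students in every list).
samples = {
    "sample_a": [6, 4, 7, 3, 5, 8],
    "sample_b": [5, 4, 8, 4, 5, 7],
    "sample_c": [6, 3, 6, 4, 4, 7],
}

# Correlate every pair of writing samples.
pairwise = {
    (name_1, name_2): pearson(samples[name_1], samples[name_2])
    for name_1, name_2 in combinations(samples, 2)
}
mean_r = sum(pairwise.values()) / len(pairwise)

for (name_1, name_2), r in pairwise.items():
    print(f"{name_1} vs {name_2}: r = {r:.2f}")
print(f"mean intercorrelation: {mean_r:.2f}")
```

The closer these intercorrelations are to 1, the more consistently the assessment ranks the same students across samples; low values would suggest that a single sample says little about a student's writing ability.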