Scoring the Testing Industry
Making the Grades: My Misadventures in the Standardized Testing Industry
By Todd Farley
(PoliPoint Press, 2009)
Reviewed by Kenneth Bernstein, NBCT
High School Government & Social Studies (MD)
Teacher Leaders Network
As the use of tests created external to schools and classrooms has exploded, one issue has always been the question of whether to rely merely upon selected response (a.k.a. multiple choice) items, or to also include constructed response items (paragraphs and essays).
Selected responses are cheap to administer; they can be scored solely by machine and the results obtained quickly. It is even possible, utilizing item response theory, to administer the test on a computer and use early responses to vary the items offered to the test taker, thereby determining the level of performance more quickly and accurately.
But many think we need more: after all, life does not ask us to choose one out of four or five pre-selected choices. Thus many colleges and universities, employers, and classroom teachers prefer that the tests include constructed items — “essays” if you will.
While it is possible to machine-score such items, that technology is still in its relative infancy, which is why companies that produce tests need human scorers. And it is because of this need that we get Todd Farley’s book Making the Grades.
Farley spent 15 years in a variety of positions involved with the scoring of such constructed responses. He worked for a number of America’s most important assessment companies, often doing the work on contract for various states, including Virginia, where I live.
I am not a trained psychometrician, although during my now-abandoned doctoral studies I did seriously study issues of assessment. I am a school teacher today, in my 15th year of teaching. Each year but one, I have had to prepare students to sit for external tests (which may or may not have met the criteria to properly be labeled “standardized”) that included constructed responses. These tests have included the Maryland School Performance Assessment Program, the Maryland High School Assessments, and The College Board’s Advanced Placement examinations. During my one year of teaching in Virginia, known for its Standards of Learning (SOL) assessments, the middle school American History test was made up entirely of selected response items. I also bring to this review some experience that parallels Farley’s: in 2009, I served as a Reader for the Advanced Placement examination in U.S. Government and Politics and scored one of the four Free Response Questions on that year’s examination.
As I glance at my copy of the book, I have more than 40 sticky notes that I have affixed to pages containing passages I thought might possibly be worth quoting. Some I obviously will forgo. Farley offers explanations of terms like reliability and validity, and explains how in the case of reliability, the term was often misused by those supervising the scoring process. Simply put, scoring companies are often satisfied if scorers agree with one another 80% of the time, even when the score they agree on is wrong. It is like a scale that consistently reports your weight as 20 pounds less than reality. The information you obtain is reliable, but it is NOT valid.
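For readers who like to see the arithmetic, here is a minimal sketch of that distinction. The scores are invented for illustration and the function name is my own; nothing here comes from Farley's book or from any scoring company's actual procedure. Two hypothetical scorers agree with each other on 8 of 10 essays, which meets an 80% agreement standard, yet they almost never match the score each essay deserved.

```python
def exact_agreement(scores_a, scores_b):
    """Fraction of essays on which two lists of scores match exactly."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


# Hypothetical data on a 6-point scale (assumed for illustration only).
true_scores = [4, 3, 5, 2, 6, 4, 3, 5, 2, 4]  # what each essay "deserved"
scorer_a = [3, 2, 4, 1, 5, 3, 2, 4, 1, 3]     # always one point too low
scorer_b = [3, 2, 4, 1, 5, 3, 2, 4, 2, 4]     # too low on the first eight

reliability = exact_agreement(scorer_a, scorer_b)      # scorer vs. scorer
accuracy_a = exact_agreement(scorer_a, true_scores)    # scorer vs. truth
accuracy_b = exact_agreement(scorer_b, true_scores)

print(f"Scorers agree with each other: {reliability:.0%}")
print(f"Scorer A matches the true score: {accuracy_a:.0%}")
print(f"Scorer B matches the true score: {accuracy_b:.0%}")
```

Like the scale that is always 20 pounds off, the two scorers are consistent with each other without being right.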
What educational measurement should provide is the ability to draw valid inferences from the information analyzed. If nothing else, reading this book will raise questions in your mind about whether many of the tests being used to evaluate students, teachers and schools meet that standard.
Farley demonstrates that reliability is not necessarily something we can rely on. Allow me to quote an entire paragraph from pp. 55-56 to illustrate:
But you want to talk about a sliding scale? The scale we used to score writing flopped about like a puppy on a frozen pond, going every which way, keeling over and standing up and falling down. In scoring writing, for instance, an essay that had a good development of ideas could earn a 6, a 5, a 4, maybe even a 3. An essay that was troubled on the sentence level in terms of grammar, usage, and mechanics could earn a 1, a 2, a 3, perhaps even a 4, 5, or 6. (I don’t dispute the idea: Gertrude Stein said of F. Scott Fitzgerald that she’d never met anyone who was such a poor speller, yet he still managed to produce a decent text or two.) The point is that essays with identical levels of ability in certain areas could end up (due to other considerations on the rubric) with significantly different scores. In scoring writing, we were far from having hard and fast rules to live by. It all seemed a little untenable, rather mystifying, and the easiest thing to do was to hand your essay off to your neighbor or plead with your supervisor for help.
That passage references the idea of the rubric, the standard by which the grader is supposed to evaluate the essay. If a rubric is sufficiently clear to give guidance, it also may be too rigid for the occasional creatively written paper. A rigid application of the rubric might, as Farley illustrates on more than one occasion, result in a good piece of writing being undervalued and a poor piece of writing receiving a high score.
And the scoring companies often have little control over the rubric and how it is applied. They are scoring under contracts issued by states that may leave them little flexibility. Allow me to illustrate using an example of a scoring team examining an anchor paper.
Anchor papers are supplied to scorers to give examples of the work expected at each scoring level of a rubric. Farley provides the four-point rubric for an 8th grade writing assessment (descriptive mode). The rubric, provided by the state in question, expects the student to use a five-paragraph format. According to the rubric, for the scorer to assign all 4 points, the organization, focus and development, style and sentence fluency, and grammar-usage-mechanics should be considered “excellent.” For 3 points, they should be considered “good.”
Farley describes how table leaders were trained to lead the scoring of this 8th grade assessment. After reviewing the anchor paper for a score of 3 (which Farley reprints in the book), all of the table leaders were scratching their heads, describing the paper as “lame.” One seventh grade teacher in the group argued that she would not consider this essay good work by her students. Another pointed out that it consisted of only simple sentences, and Farley (also a table leader) noted it had no voice and no style. The response of their trainer Maria is telling:
Maria looked down at the essay. “I’m not saying I’d give this a 3 in my classroom, either, but that’s how we have to score it based on this ‘focused holistic’ rubric. Most importantly, in this state’s Department of Education, the essay has a five-paragraph format, with introductory, body, and concluding paragraphs, and an introductory sentence in all five of them.”
As shocking as that is, the reader might not be prepared for what comes next. It’s an anchor paper that contains simply brilliant writing (and would be so for a high school student) but earns a score of 2. Here is Farley’s account of the conversation that ensued among the scorers:
Greg scoffed. “This kid needs a publisher, not a score from us.”
Maria looked guilty. “I know,” she said. “I certainly wouldn’t give this a 2, either. The writing may be sentimental, but it’s first-draft work from an eighth grader. It’s a damn good response, I agree.”
“So?” Harlan said.
“Well, what’s the important person or thing in the essay? It’s her favorite spot, a fact we don’t know until the last sentence. That’s not five-paragraph format, is it? There’s no introductory paragraph, no introductory sentence --”
“No,” Greg said, “it’s way more artful than that, building up the suspense nicely and using some beautiful descriptive language.”
“Yup,” Maria agreed shrugging. “I know. But this is how they want us to score them.”
“Really?” I asked. “Rather a tedious five-paragraph essay than a beautifully done three or four paragraphs?”
“It seems that way,” Maria answered. She looked at us, resigned. We looked back at her defeated.
“All we care about is the formatting?” Pete asked.
“That’s not the only thing,” Maria answered, “but it is the first thing.”
“Wow,” I said, “it almost seems a kid could get a 3 for turning in an outline.”
Maria thought about it. “Not quite,” she said.
The book is a good read, such a good read that I hesitate to go into too much detail, so that I don’t spoil the enjoyment – and the shock – you will experience as you read it. But I’ll share a few other samples.
At the time of the passage cited above, the scorers were earning $10/hour or less. They were not required to be knowledgeable about the subject matter, something that was a persistent issue in the experiences Farley cites. Scorers were trained, and had to meet a certain standard of scoring accuracy in order to be allowed to score. But the need for scorers was so great that the standards of accuracy were often bent, and scores were changed and manipulated to maintain acceptable levels of accuracy.
Please note that term: acceptable. Farley cites examples where too great a level of accuracy could cause problems, and this was truly scary, because the examples involved the scoring of the National Assessment of Educational Progress (NAEP), which is supposed to be the “gold standard” by which all other educational assessment in this nation is measured. Farley was told he could not have a higher degree of accuracy than was recorded in previous scoring cycles lest the comparability of scores from cycle to cycle be lost. Ponder that for a while.
I do want to offer some cautions. Farley paints with a very broad brush. What he says is certainly widely applicable, but not universally so.
I teach in Maryland, which until May 2009 included two kinds of Constructed Responses (Brief and Extended) as part of the High School Assessments required in four subjects (Biology, Algebra, English, and Government) for graduation from high school. Each constructed response was scored by the same 4-point rubric, a copy of which students had during the exam. In the scoring process, each response was read by two scorers. Inter-rater reliability required only that the scorers give adjacent scores, not identical scores. If the scores were not identical, the student received the higher score. I am not sure how accurate a measurement that was, but at least the students got the benefit of the doubt, unlike the scenario above that Farley described.
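The Maryland rule as I have described it can be written down in a few lines. This is only a sketch of my description, not the state's actual procedure: the function name is hypothetical, and because I do not know what happened when two scorers were more than one point apart, the sketch simply flags that case as unresolved rather than guessing.

```python
def resolve_scores(first, second):
    """Apply the two-reader rule described above.

    Returns (acceptable, reported_score). Identical or adjacent scores
    count as acceptable agreement, and the student receives the higher
    of the two. A gap of more than one point is flagged as unresolved
    (None), since how such cases were handled is an unknown here.
    """
    gap = abs(first - second)
    if gap <= 1:
        return True, max(first, second)
    return False, None


# Hypothetical pairs of reader scores for illustration.
for pair in [(3, 3), (2, 3), (1, 3)]:
    acceptable, reported = resolve_scores(*pair)
    print(pair, "->", "acceptable" if acceptable else "unresolved", reported)
```

Note that under this rule two readers who never once give the same score can still count as being in full agreement, so long as they are never more than a point apart.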
Cost control is another factor that influences the quality of the scoring process. Farley’s account focuses on scoring companies that paid relatively low wages to individuals who often lacked the necessary professional background to make accurate, independent judgments about the work they were scoring. As a result, a highly controlled system of scoring was imposed. But this method of assessing writing samples is not universal.
Here I speak from my experience as a reader of free response questions on the Advanced Placement exam for U.S. Government and Politics. To score this exam, an individual must teach the subject in a post-secondary institution or have at least three years’ experience teaching the AP course in a high school. We were certainly qualified as to content. We were also paid substantially more than the $10 an hour Farley cites for the incident above, plus expenses for transportation, food and lodging.
We were thoroughly trained. We had our work closely examined at first, until we demonstrated our competence. We were spot-checked regularly by table leaders and by question leaders. The range of our scores was monitored by computer, and if we showed any scoring patterns that raised questions, our work would be reexamined. But once we got going, the scoring was limited to a single reader because we had over 100,000 exams (four questions each) to be scored in less than a week after training.
I know how seriously the Advanced Placement officials took this process because of my own experience. I read very quickly, and I was so much faster than others that, in the beginning, my work was checked very closely until the question leader determined that I was scoring accurately. When I had any doubt about a response, I would on my own initiative check with one of my fellows and/or with the table leader. That pattern was widespread among my fellow scorers.
I would argue that the AP people have demonstrated that reasonably accurate and consistent scoring of constructed response by properly trained people is possible, if one is willing to accept the concomitant costs.
Still, despite the caveat I offer based on my AP experience, I think Farley’s book is a valuable read with much to tell us about the often poorly understood processes and implications of large-scale, high-stakes testing. He ends the book with these blunt words:
If I had to take any standardized test today that was important to my future and would be assessed by the scoring processes I have long been a part of, I promise you I would protest; I would fight; I would sue; I would go on a hunger strike or march on Washington. I might even punch someone in the nose, but I would never allow that massive and ridiculous business to have any say in my future without battling it to the bitter, better end.
Do what you want, America, but at least you have been warned.






