Testing, SpaceX, and the quest for consensus

Chester E. Finn, Jr.

4.22.2021

In a thought-provoking piece in the Hechinger Report a couple of weeks ago, IES director Mark Schneider and Schmidt Futures honcho Kumar Garg made a compelling case for a revolution in education testing. The authors correctly explained that practically nobody likes today’s assessments, that they’re expensive, and that many people would like to do away with them altogether. Then they explained why abolishing testing would be a really bad idea, as it would deny valuable information to both educators and policymakers and would scrap a major tool for pursuing equity.

Rather than crusading against testing, say Schneider and Garg, we need the equivalent of a “SpaceX” for assessment—a reimagining, redesigning, and reconstructing of how this can and should be done in the mid-twenty-first century.

They’re right: right that this needs to happen, and right that “improvements are available now.” In particular, a suite of technologies that are already widely used in some private-sector testing can and should be embraced by state and national assessments, as well as the private tests that aren’t yet making maximum use of them. Artificial intelligence can generate test questions and evaluate student responses. Natural language processing, for instance, can appraise essay-style responses, thus helping to liberate testing from multiple-choice items that can be fed through a scanner. Computer-adaptive testing (already a feature of the Smarter Balanced coalition, though constrained by ESSA’s insistence on “grade-level” testing) saves time, reduces student frustration, and yields far more information on what kids do and don’t know, particularly at the high and low ends of the achievement distribution.

Schneider and Garg itemize several necessary elements of the paradigm shift they seek:

First, we should strive to set ambitious goals for where assessment innovation can go.... Second, government agencies and research funders should invest in advanced computational methods in operational assessments.... Third, fostering talent is critical. New testing designs will require new test researchers, developers, statisticians and AI experts who think outside the box.... But, most importantly, we must recognize that the status quo is broken. We need new thinking, new methods and new talent.

That’s not the whole story, of course. There are plenty of other needs, mostly involving the surmounting of present-day obstacles. Government bureaucracies are set in their ways. Procurement systems are ossified and formulaic. Digital divides are real. And the more tests rely on technology, the greater the risk that those divides will worsen the inequalities that the tests reveal.

Moreover, all sorts of state and federal laws and regulations are involved. The intersection of assessments with academic standards and ESSA-driven accountability regimes is genuinely complicated. And then there’s the matter of “trendlines,” the desire to know how next year’s assessment results compare with last year’s, so that we can calculate growth, operate our accountability system, know whether gaps are closing and reforms are working, etc.

Those aren’t trivial considerations, especially in long-running testing programs such as the National Assessment of Educational Progress. Big changes in how tests are constructed and conducted are not only certain to collide with the barriers noted above but also run a high risk of forcing trendlines to start anew.

Schneider and Garg say these challenges are worth tackling. That it’s already a time of flux and agitation in the testing realm may well mean they’re right and the time is at hand. On the other hand, as former Achieve majordomo Michael Cohen says, the time may not be right “for a major effort to create better tests because nobody wants to talk about tests. People are tired of standards, tests, and accountability. They just don’t want to deal with it anymore.”

It surely won’t be easy to reach anything resembling consensus across the education field, not in these politically schismatic times when people want so many different things from tests, and want to deploy and restrict them in so many different ways—or abolish them altogether.

Dissensus is visible today among the twenty-six members of the National Assessment Governing Board (NAGB) as they struggle with replacement of the twelve-year-old framework that underlies NAEP’s reading tests. Intended to take effect with the 2026 assessment cycle, the proposed new framework that emerged from an extensive attempt to “vision” the future of reading has prompted much controversy. In a scathing review of last year’s draft, David Steiner of Johns Hopkins and Mark Bauerlein of Emory suggested that the new framework would, in effect, define deviancy down by masking the problem of weak background knowledge that compromises reading comprehension among many youngsters, particularly those from disadvantaged homes. They also spotlighted the great risk that the proposed new framework would break NAEP’s reading trendline, which extends all the way back to 1992.

Whether the changes subsequently made in the proposed framework are substantive or cosmetic remains a topic of intense debate within the governing board, which over the decades has been celebrated for its capacity to reach consensus on important decisions. Whether that can happen next month, when NAGB is supposed to adopt the new reading framework, remains to be seen.

The point here, however, is not about NAEP or NAGB. It’s about the difficulty of achieving consensus in today’s testing arguments—and the difficult trendline issue, which is a big deal not only for NAEP, but also for many state assessments, as well as private-sector testing efforts such as SAT, ACT, and NWEA.

Statistical and psychometric legerdemain sometimes makes it possible to “bridge” or “equate” scores across a major shift in testing methods, content, or scoring arrangements. That’s how the NAEP reading trendline, for example, survived the installation of a new assessment in 2009, and how the College Board has been able to publish equivalency tables each time it has “re-centered” the SAT.

Perhaps such bridges can span the divide between today’s testing systems and the “SpaceX” version that Schneider and Garg envision. Or perhaps we must steel ourselves to sacrificing trend data in pursuit of other benefits that the SpaceX version would bring. It’s a close call—and an issue that will make consensus-seeking even harder, particularly in governmental assessments, such as those that states are required by ESSA to conduct, as well as NAEP itself.

A revolution in testing is less fraught, or at least less political, in privately operated programs, especially the kind that are used for formative and diagnostic purposes rather than tied to school accountability. Reconceptualizing those tests and their uses might yield additional gains. If more schools deployed them regularly and painlessly, then used them both for instructional decisions by teachers and to keep parents posted on their children’s learning gains and gaps, perhaps there’d be less need for and pressure on end-of-year accountability testing. Maybe it could happen less frequently or, NAEP-style, involve just a sample of students and schools.

Yes, it’s time for some fresh thinking! Are you listening, Elon Musk?
