The Thomas B. Fordham Institute has been evaluating the quality of state academic standards for nearly twenty years. Our very first study, published in the summer of 1997, was an appraisal of state English standards by Sandra Stotsky. Over the last two decades, we’ve regularly reviewed and reported on the quality of state K–12 standards for mathematics, science, U.S. history, world history, English language arts, and geography, as well as the Common Core, International Baccalaureate, Advanced Placement, and other influential standards and frameworks (such as those used by PISA, TIMSS, and NAEP). In fact, evaluating academic standards is probably what we’re best known for.
For most of the last two decades, we’ve also dreamed of evaluating the tests linked to those standards—mindful, of course, that in most places, the tests are the real standards. They’re what schools (and sometimes teachers and students) are held accountable for, and they tend to drive curricula and instruction. (That’s probably why we and other analysts have never been able to demonstrate a close relationship between the quality of standards per se and changes in student achievement.) We wanted to know how well matched the assessments were to the standards, whether they were of high quality, and what types of cognitive demands they placed on students.
But with fifty-one different sets of tests, such an evaluation was out of reach—particularly since any bona fide evaluation of assessments must get under the hood (and behind the curtain) to look at a sizable chunk of actual test items. Getting dozens of states—and their test vendors—to allow us to take a peek was nigh impossible.
So when the opportunity came along to conduct a groundbreaking evaluation of Common Core-aligned tests, we were captivated. We were daunted, too—both by the enormity of the task and by the knowledge that our advocacy of the standards would likely cause any number of doubters and critics to sneer at such an evaluation coming from us, regardless of its quality or impartiality.
So let’s address that first. It’s true that we continue to believe that children in most states are better off with the Common Core standards than without them. If you don’t care for the standards (or even the concept of “common” standards), or perhaps you come from a state that never adopted these standards or has since repudiated them, you can probably ignore this study. Our purpose is not to re-litigate the Common Core debate. Rather, we want to know, for states that are sticking with the common standards, whether the “next-generation assessments” that have been developed to accompany the standards deliver what they promised in terms of strong content, quality, and rigor.
No single study can come close to evaluating all of the products in use and under development in today’s busy and fluid testing marketplace. But we were able to provide an in-depth appraisal of the content and quality of three “next-generation” assessments—ACT Aspire, PARCC, and Smarter Balanced—and one best-in-class state test, the Massachusetts Comprehensive Assessment System (MCAS, 2014). In total, over thirteen million children (about 40 percent of the country’s students in grades 3–11) took one of these four tests in spring 2015. Of course, it would have been good to encompass even more. Nevertheless, this study ranks as possibly the most complex and ambitious single project ever undertaken by Fordham.
After we agreed to myriad terms and conditions, we and our team of nearly forty reviewers were granted secure access to operational items and test forms for grades five and eight (the elementary and middle school capstone grades that are the study’s focus).
This was an achievement in its own right. It’s no small thing to be granted access to operational test forms, especially in a divisive political climate where anti-testing advocates are looking for any reason to throw the baby out with the bathwater—and where market pressure gives test developers ample reason to be wary of leaks, spies, and competitors. The four testing programs are to be commended for allowing this external scrutiny of their “live” tests, which cost them much by way of blood, sweat, tears, and cash to develop and bring to market. They could easily have said, “Thanks, but no thanks.” They didn’t, and for that, we’re grateful. Educators, policy makers, and taxpayers also owe the test developers a debt of thanks for their commitment to transparency and public accountability, which is essential to public confidence in assessments whose results hold such outsized importance for K–12 education.
Part of the reason they agreed was the care we took in recruiting smart, respected individuals to help with this project. Our two lead investigators, Nancy Doorey and Morgan Polikoff, bring a wealth of experience in educational assessment and policy, test alignment, academic standards, and accountability. Nancy has authored reports for several national organizations on advances in educational assessment and copiloted the Center for K–12 Assessment and Performance Management at ETS. Morgan is an assistant professor of education at the University of Southern California and a well-regarded analyst of the implementation of college and career readiness standards. He is an associate editor of the American Educational Research Journal, serves on the editorial board of Educational Administration Quarterly, and was the top finisher among junior faculty in the 2015 RHSU Edu-Scholar rankings.
Nancy and Morgan were joined by two well-respected content experts who facilitated and reviewed the work of the ELA/Literacy and math review panels. Charles Perfetti, Distinguished University Professor of Psychology at the University of Pittsburgh, served as the ELA/Literacy content lead; Roger Howe, a professor of mathematics at Yale, served as the math content lead.
Given the importance and sensitivity of the task at hand, we spent months recruiting and vetting the individuals who would eventually make up the panels led by Dr. Perfetti and Dr. Howe. We began by soliciting recommendations from each testing program and other sources (including content and assessment experts, individuals with experience in prior alignment studies, and several national and state organizations). In the end, we recruited at least one reviewer recommended by each testing program to serve on each panel; this strategy helped ensure fairness by balancing reviewers’ familiarity with the various assessments.
So how did our meticulously assembled panels go about evaluating the tests? The long version will be available tomorrow, with ample detail about the study design, the testing programs, the evaluation criteria, the selection of test forms, and the review procedures.
But the short version is this: We deployed a brand-new methodology developed by the Center for Assessment to evaluate the four tests—a methodology that was itself based on the Council of Chief State School Officers’ 2014 “Criteria for Procuring and Evaluating High-Quality Assessments.” Those criteria, say their authors, are “intended to be a useful resource for any state procuring and/or evaluating assessments aligned to their college and career readiness standards.” This includes, of course, tests meant to accompany the Common Core standards.
The CCSSO Criteria address the “content” and “depth” of state tests in both English language arts and mathematics. For ELA, the “content” criteria cover topics such as whether students are required to use evidence from texts; for math, whether the assessments focus strongly on the content most needed for success in later mathematics. The “depth” criteria for both subjects include whether the tests require a range of “cognitively demanding,” high-quality items that make use of various item types (e.g., multiple choice, constructed response), among other things.
The Center for Assessment took these criteria and transformed them into a number of measurable elements that reviewers addressed. The newly minted methodology wasn’t perfect; our rock-star reviewers improved upon it as they worked, and we made adjustments along the way so that others following in their footsteps could benefit from their experience.
The panels essentially evaluated the extent of the match between each assessment and the key elements of the CCSSO document, assigning one of four ratings to each ELA- and math-specific criterion: Excellent, Good, Limited/Uneven, or Weak Match. To generate these marks, each panel reviewed the ratings from the grade-five and grade-eight test forms, considered the results of the analysis of each program’s documentation (which preceded the item review), and came to consensus on the rating.
***
We at Fordham don’t plan to stay in the test-evaluation business. The challenge of doing this well is simply too overwhelming for a small think tank like ours. But we sincerely hope that others will pick up the baton, learn from our experience, and provide independent evaluations of the assessments in use in the states that have moved away from PARCC, Smarter Balanced, or ACT Aspire.
Not only will such reviews provide critical information for state and local policy makers, educators, curriculum developers, and others; they might also deter the Department of Education from pursuing a dubious plan to make states put their new assessments through a federal evaluation system. In October 2015, the department issued procedures for the “peer review” process that had been on hold for the previous three years. The guidelines specify that states must produce evidence that they used sound procedures in the design and development of tests aligned to their academic standards, as well as in test administration and security. Count us among those who think that renewed federal vetting of state tests invites unwanted meddling from Uncle Sam (and could spark another round of backlash akin to what befell the Common Core itself a few years back). Besides, the twelve years during which the department already had such guidance in place did little to improve the quality of state tests—hence the recent moves to improve them.
***
Now you’ve got the background. Come back to our website on Thursday morning for the results. Trust us: It will be worth the wait.