Project STAR (Student/Teacher Achievement Ratio) is a widely known experiment on the effect of class size on student achievement, conducted in Tennessee in the 1980s. Its oft-cited conclusion was that students in the smallest classes scored significantly higher on tests than their peers in larger classes. Replication efforts over the years, however, have produced conflicting results, raising concerns that expensive and labor-intensive class-size reduction schemes in schools across the country are built on an erroneous premise and are unlikely to yield measurable benefits for students. This NBER study plunges headfirst into the original research to see if scholars can shed more light on the conundrum.
The researchers—a trio of economists at UPenn and Princeton—begin by enumerating serious concerns about the original methodology. It’s super technical, but I’ll do my best to translate. In a nutshell, they find that the original study design was flawed because it treated the three “class types” (small, regular, and regular with a teacher’s aide) as synonymous with “class sizes.” But in reality, school principals had discretion in choosing the target size for each class type, so schools didn’t fully comply with the intended class-size reductions.
That’s a big problem since the original analysis of STAR rested in part on a weighted average, such that some schools’ results counted more than others in the final calculation. The weighting was based on how well a school followed the study’s rules about class-size reduction. The experiment called for the treatment group to have 13–17 students and the control group to have 22–25 students.
As it turned out, schools where smaller classes had the largest impacts—whether positive or negative—tended to follow the rules less strictly. The original analysis did not account for these compliance differences, even though a school’s compliance was related to how effective smaller classes would be there. That oversight biased the results: the schools that might have shown the biggest effects were downweighted for their noncompliance, so their results counted for less in the final average.
Specifically, the analysts found that many schools under-complied with the intended implementation, achieving smaller reductions than planned: the average reduction in class size between control and treatment classes was seven students, but reductions varied across schools from zero to twelve students.
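To make the weighting problem concrete, here is a minimal sketch with entirely made-up numbers (not the paper’s data or its actual estimator): if schools with the largest treatment effects also happened to comply least, a compliance-weighted average is pulled away from the simple average across schools.

```python
# Hypothetical per-school gains and compliance weights (illustrative only):
# the schools with big effects are assumed to have complied less.
effects = [8.0, 7.0, 1.0, 0.5]       # test-score gains by school
compliance = [0.2, 0.3, 0.9, 1.0]    # weight given to each school

unweighted = sum(effects) / len(effects)
weighted = sum(e * w for e, w in zip(effects, compliance)) / sum(compliance)

print(unweighted)  # simple average across schools
print(weighted)    # compliance-weighted average is smaller
```

The gap between the two averages is the kind of bias the researchers describe: the weighting scheme itself, not the classrooms, drives part of the headline number.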
Next, the research trio built a new model and applied it to a more accurate version of the original data, which allowed them to model the actual class-size levels for both treatment and control class types. They found that nearly all of the gains from reducing class size in STAR were driven by just twenty-three of the seventy-nine schools in the sample (about 29 percent). In fact, had those schools been omitted from the experiment, the model would have detected no causal effect of class size on test scores. What’s more, the schools driving the gains were the ones that didn’t reduce class sizes as much as intended.
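The arithmetic of that subsample result can be sketched as follows. Only the 23-of-79 school count comes from the paper; the per-school effect sizes below are invented purely to show how a minority of schools can carry an entire pooled estimate.

```python
# 23 of 79 schools drove the gains (the share is from the paper).
n_drivers, n_total = 23, 79
share = n_drivers / n_total
print(f"{share:.0%} of schools")

# Made-up effects: driver schools show gains, the rest show none.
driver_effects = [6.0] * n_drivers
other_effects = [0.0] * (n_total - n_drivers)

overall = sum(driver_effects + other_effects) / n_total
without_drivers = sum(other_effects) / len(other_effects)

print(overall)          # pooled effect looks positive
print(without_drivers)  # drop the drivers and nothing remains
```

Under this toy setup, the pooled average is positive only because the driver schools are in the sample, mirroring the paper’s finding that removing those twenty-three schools leaves no detectable effect.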
Taken together, these findings come close to explaining why replications of Project STAR failed to add up to anything like the original headline-making results. Implementation matters, in experimental designs as well as in education policy—and we do students, parents and teachers a huge disservice by making sweeping changes based on improperly supported premises. Case in point: California’s mammoth class-size reduction effort in 1996 in the wake of STAR ended in lackluster results, largely due to the need to hire tons of unqualified teachers quickly to fill classrooms. The negative impacts of those hiring decisions reverberated in the state for years.
As a former high-school teacher, I can attest that one small group of fifteen students nestled among five other classes of thirty is a sanity saver. In this way, Project STAR “proved” what many on the ground wanted to hear. But we live in a world where we should be targeting resources based on sound evidence, not on poorly executed studies that confirm pre-existing beliefs.
SOURCE: Karun Adusumilli, Francesco Agostinelli, and Emilio Borghesan, “Heterogeneity and endogenous compliance: Implications for scaling class size interventions,” National Bureau of Economic Research (April 2024).