In an effort to avoid prescriptive top-down mandates, the school accountability provisions in the Every Student Succeeds Act (ESSA) allow states flexibility in determining which measures they’ll use to assess school quality, how much “weight” each measure carries, and over what time period it is calculated. A recent descriptive study uses data from North Carolina to demonstrate how policymakers’ decisions can influence school ratings and, ultimately, the list of schools identified for improvement under ESSA. (States must identify their most troubled schools but have considerable flexibility in what to do about them.)
Analysts Erica Harbatkin (Florida State University) and Betsy Wolf (U.S. Department of Education) draw on eight years of administrative data from the Tar Heel State, covering roughly 1,900 public schools. They develop three ESSA-compliant school quality metrics, all of which include proficiency rates, student growth, high-school graduation rates, English learner proficiency, and chronic absenteeism rates. The first four elements are required by ESSA, and the last is the most popular of the flexible “fifth indicators” that states select for themselves. The metrics vary based on weighting and the number of data points included (one to three years). For each of the three metrics, the analysts simulate school ratings overall, as well as which schools would land in the bottom 5 percent and be flagged for “comprehensive support and improvement” (CSI), as ESSA requires.
In Goldilocks lingo, the analysts undertook to identify the too-hot and too-cold tradeoffs arising from the various simulations. Their key “temperature” gauges are stability and equity. “Stability” captures consistency: whether ratings show volatility (unpredictable changes) at different points in the distribution. “Equity” examines whether school ratings are systematically and disproportionately lower for schools serving disadvantaged student groups, the idea being that metrics that weigh proficiency too heavily could “further marginalize” schools with students in need. Thus, the analysts calculate whether the share of schools identified for CSI within three groups (the 25 percent of schools with the largest share of low-income and Black students, the middle 50 percent, and the 25 percent with the smallest share of these student subgroups) would total 5 percent in each, because “a perfectly equitable accountability system would identify exactly 5 percent of CSI schools in each of the quantiles.” The extent to which a given group deviates from that 5 percent threshold is their measure of inequity (arguable as it is).
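To make that gauge concrete, here is a minimal sketch (in Python, not the authors’ code) of how the deviation from the 5 percent benchmark could be computed. The group labels, data structure, and toy numbers are purely illustrative assumptions.

```python
# Minimal sketch of the equity gauge described above: within each group of
# schools, what share lands on the CSI list, and how far does that share
# deviate from the 5 percent benchmark?

def equity_deviation(schools):
    """schools: list of dicts with keys 'group' ('top25', 'middle50', 'bottom25',
    by share of low-income and Black students) and 'csi' (True if identified)."""
    deviations = {}
    for group in ("top25", "middle50", "bottom25"):
        members = [s for s in schools if s["group"] == group]
        csi_share = sum(s["csi"] for s in members) / len(members)
        # A perfectly equitable system would identify exactly 5 percent per group.
        deviations[group] = csi_share - 0.05
    return deviations

# Toy data: 20 / 40 / 20 schools, with CSI identification concentrated in the
# highest-need group, as a proficiency-heavy index tends to produce.
toy = (
    [{"group": "top25", "csi": i < 3} for i in range(20)]
    + [{"group": "middle50", "csi": i < 2} for i in range(40)]
    + [{"group": "bottom25", "csi": False} for _ in range(20)]
)
print(equity_deviation(toy))  # roughly {'top25': 0.10, 'middle50': 0.0, 'bottom25': -0.05}
```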
The three simulations are as follows:
- The first weights proficiency and growth about equally at the elementary and middle school levels, and weights proficiency twice as heavily as growth in high school (30 percent versus 15 percent).
- The second weights proficiency more heavily than growth: 60 percent for elementary and middle schools and 45 percent for high schools.
- The third instead weights growth more heavily: again 60 percent for elementary and middle schools, but 30 percent for high schools. (A rough sketch of how such a weighted composite might be computed appears after the list.)
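Mechanically, a composite of this sort is just a weighted average of indicator scores. The Python sketch below is a minimal illustration, assuming each indicator has already been put on a common 0–100 scale; the indicator names and weights are illustrative assumptions, not the paper’s exact specification.

```python
# Minimal sketch of a weighted composite school rating. Weights only loosely
# mirror the simulations described above; the remaining weight is split across
# the other indicators purely for illustration.

def composite_rating(indicators, weights):
    """indicators and weights: dicts keyed by indicator name; weights sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[name] * indicators[name] for name in weights)

# Hypothetical elementary school scores (0-100 scale).
scores = {"proficiency": 42, "growth": 71, "el_progress": 55, "absenteeism": 60}

# Simulation-style weightings (illustrative, not the paper's exact values):
proficiency_heavy = {"proficiency": 0.60, "growth": 0.20, "el_progress": 0.10, "absenteeism": 0.10}
growth_heavy      = {"proficiency": 0.20, "growth": 0.60, "el_progress": 0.10, "absenteeism": 0.10}

print(composite_rating(scores, proficiency_heavy))  # about 50.9: lower when proficiency dominates
print(composite_rating(scores, growth_heavy))       # about 62.5: higher when growth dominates
```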
They also vary the number of years of data to see how that affects their measures of interest. The options include a single year of data (which is what most states use, although they don’t have to under ESSA), a three-year weighted mean in which the current year is weighted most heavily, and two scenarios that use a single year of data for each of three years but change how the lowest-performing schools are classified: schools that land in the bottom group in three consecutive years, or in two of three consecutive years.
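Here is a minimal Python sketch of those multi-year scenarios. The year weights are assumptions (the study weights the current year most heavily, but the exact weights aren’t reproduced here), and the “two of three” and “three of three” rules reduce to simple counts of the years a school lands in the bottom group.

```python
# Sketch of the multi-year scenarios, with assumed year weights.

def three_year_weighted_mean(ratings, weights=(0.2, 0.3, 0.5)):
    """ratings: (two years ago, last year, current year); weights sum to 1,
    with the current year weighted most heavily."""
    return sum(w * r for w, r in zip(weights, ratings))

def persistently_low(bottom_flags, years_required):
    """bottom_flags: booleans for whether the school fell in the lowest-performing
    group in each of the last three years; years_required = 2 ('two of three')
    or 3 ('three of three')."""
    return sum(bottom_flags) >= years_required

print(three_year_weighted_mean((55, 48, 40)))    # about 45.4: smooths a one-year dip
print(persistently_low((True, False, True), 2))  # True under the two-of-three rule
print(persistently_low((True, False, True), 3))  # False under the three-of-three rule
```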
Looking at stability first, they find that weighting proficiency more heavily than growth produces more stable year-to-year ratings. This is because proficiency rates are strongly correlated with student characteristics, which change little from one year to the next. Not surprisingly, ratings based on three years of data are three to four times more stable than those based on a single year. They also find that the “three of three” rule (classifying for intervention only those schools that fall in the lowest-performing group for three consecutive years) produces the most stable list. However, and this is important, using three years of data rather than one more than offsets the loss of stability that comes with shifting from a high-proficiency to a high-growth index. In other words, scores become volatile when a state that once weighted proficiency heavily in its accountability system decides to weight growth heavily instead, but that volatility is taken care of (and then some) if the state uses three years of rolling data.
As for their measure of equity, the higher-proficiency index was the most inequitable and the higher-growth index the least inequitable. Specifically, under the higher-proficiency index, 75 percent of students in the lowest-rated 20 percent of schools were low-income, compared with just 39 percent in the top-rated 20 percent of schools. Under the higher-growth index, the comparable figures were 65 percent and 47 percent. Likewise, when they aggregate data into the three quantiles described above, none of the simulations identifies exactly 5 percent of schools in each group as needing CSI services, but the proficiency-heavy measure disproportionately draws its bottom tier from the most disadvantaged group.
Still, the bottom line for Harbatkin and Wolf is that using three years of data is imperative for consistency no matter which you weight more heavily—proficiency or growth—given that each has tradeoffs.
We endlessly debate those tradeoffs here at Fordham, underscoring the importance of a state identifying its goals before it decides how to balance proficiency and growth. If the goal is to identify schools that aren’t meeting the bar in order to target scarce resources to them, then it’s preferable to weight proficiency more heavily. But if the goal is to identify schools that are (or aren’t!) making a palpable difference in their students’ academic lives, then it’s preferable to weight growth more heavily. Another option: Follow Texas’s example and give schools two helpings of grades—and hope that bowl is just right.
SOURCE: Erica Harbatkin and Betsy Wolf, “State accountability decisions under the Every Student Succeeds Act and the validity, stability, and equity of school ratings,” Annenberg Institute at Brown University (October 2023).