Where teacher evaluation went wrong

Tim Daly

5.3.2024

Editor’s note: This is the third and final part in a series on teacher evaluation reform. Part one recalled how teacher evaluation became a thing, and part two recounted the movement’s heyday and downfall. This was first published on the author’s Substack, The Education Daly.

Teacher evaluation was underpinned by heaps of research and championed by evangelical reformers who believed it would improve schools and elevate teaching to its rightful status. Nonetheless, it fizzled quickly.

Why?

After talking with a few dozen people for this series, I know there are too many factors to lasso them all in here. I’ll share the five that I see as most critical.[1]

Then, we’ll consider the unexpected parallels between teacher evaluation and the ongoing effort to repair damage done to schools by Covid.

The five biggest mistakes of the teacher evaluation movement

1. Using value-added scores to evaluate teachers was a mess. This was the biggie. Our technocratic tendencies did us in. We believed that considering growth in student test scores—controlling for a host of factors—would anchor the whole evaluation process and offset the tendency of principals to be lenient. But the methodology was intelligible only to those with advanced training in math—and half of them were pretending. Results could be calculated only for teachers in tested grades and subjects, leaving out a majority of faculty. That felt unfair. Even teachers who believed the information in their scores had no idea how to improve them. Value added was a political liability because it played into pre-existing fears about NCLB prizing test scores above good instruction. We should have left it as a research tool and not brought it into the assessment of individual teachers.

2. We pushed teacher evaluation on districts and schools that had no interest. Rather than start with a coalition of the willing—which would have been a small slice of districts—we went whole hog, espousing state laws that mandated evaluation changes in every school. This overreach created a built-in opposition force of superintendents, boards, and school leaders who resented the burden of rolling out new systems they didn’t believe in. Many of them went through the motions. The same was true of states.

3. We scared the bejeebers out of veteran teachers. And underestimated the consequences thereof. As I wrote in part two, there was a sense that a small number of bad teachers were harming outcomes—especially for low-income students—and these folks were passed around from school to school rather than dismissed. Send them packing, the thinking went, and schools could finally reach their instructional potential. Nothing, however, triggers a panic response in teachers like the threat of firing. Teaching offers few perks. Doesn’t pay that well, doesn’t have the prestige of some other white-collar professions. But it’s secure. By making dismissal of low-performing veterans a central focus of evaluation reform—especially when ratings could be based substantially on the black box of value-added scores—we sent teachers (and unions) into a frenzy and suggested that reform wasn’t about elevating the teaching profession but about making it more precarious. Not our intent, but it’s what we did. The alternative, which was certainly discussed at the time, was to start with novices. Those teachers could already be dismissed at will. Develop the protocols, show that evaluations could help teachers get better, and demonstrate that outplacement would be reserved for clear, common-sense cases. It would have softened the introduction of newfangled systems.

4. We put too much faith in the idea of objectivity. Teachers know that not every class is the same. Those in particularly challenging placements—the ones with students far below grade level, behavior issues omnipresent, resources non-existent—didn’t buy that their work could be fairly evaluated using the same standards, tools, and protocols that were used in cushier schools. They suspected that “easier to teach” students would lead to higher observation scores for teachers. And they were absolutely correct.[2] Had we acknowledged from the start that a teacher’s context should be taken into account—and that perfect objectivity is impossible in the real world of schools—we would not have ended up penalizing some teachers for accepting mission-driven, high-priority assignments.

5. We believed principals would act to dismiss low performers if given the chance. For years, principals said they didn’t bother assigning low ratings to weak teachers because the dismissal process was onerous and doomed to fail. Why pick a fight you can’t win? Reformers accepted this wholeheartedly. We set out to create evaluations that would collect credible evidence about teacher performance. We pushed states to streamline remediation processes so they would exit low performers fairly but with less fuss. Much to our surprise, most principals went right on assigning high ratings to their teachers and granting all of them tenure. Regardless of how principals are portrayed by unions—as vindictive despots—they are mostly the opposite. They are non-confrontational, hesitant to upset or divide their teams, and not that confident in their ability to evaluate teachers in specialty subjects. They are keen to avoid inclusion on their local union’s list of despots.

In sum, we pushed a policy that was more aggressive than the market would bear and guaranteed that opponents would have sensible arguments to make against it.

What were the consequences of our mistakes?

Addressing differences among teachers became toxic. The first wave of attempts to reform teacher evaluation failed. Then what? The natural next step would have been taking another run at the problem having learned from our errors. Maybe swallow some humility and patience. Instead, evaluation reform vaporized. Work on the problem ceased almost altogether.[3] That’s collateral damage from its high profile. It’s also a shame. The variation in the teaching force is the same today as it was twenty years ago. We also know that teachers influence outcomes beyond test scores, such as student absenteeism, suspensions, and future course grades. We just choose to ignore it.

Backlash to testing grew. Families—particularly in affluent communities—refused to allow their children to participate in state testing, partly to ensure the results could not be used to evaluate teachers. It was also NCLB fatigue. When Congress replaced NCLB with a new law in 2015, it dialed down the federal emphasis on test scores, paving the way for states to do the same. This shift in policy faced little opposition until the pandemic arrived and we faced a massive crisis in student learning. More to come below.

Teacher prep programs got a reprieve. In the early 2000s, a single consensus united warring factions across the education sector: Colleges of education were doing a horrible job. They cranked out elementary teachers when schools were desperate to fill special education positions, they focused on obscure theoretical frameworks when candidates needed to learn how to teach phonics, and they proudly thumbed their noses at any attempts to have them do otherwise. The rise of new teacher evaluations raised the prospect that they would be accountable for the quality of their graduates. Its demise meant they weren’t. A genuine opportunity was missed.

In many ways, the whole idea of broad change in education was tainted.

The timing couldn’t have been worse.

Along came Covid

Teacher evaluation’s demise turned out to be foreshadowing. Here was this movement blessed with decades of research, deep-pocketed backers, and bipartisan political consensus. While many of its failures can be chalked up to self-sabotage, one can’t help but notice that the system writ large swatted it away like a housefly and continued about its business. There were no repercussions for lacking a quality control system to support the teaching profession. Schools didn’t have a parent uprising or face more pressure from the federal government to come up with a plan. The solution had tanked—but the problem remained, just as pressing as before, now without any effort to address it.

What does this have to do with pandemic recovery? Faced with nationwide closures beginning in March 2020, we needed widespread mobilization to reopen schools safely. Focus. Clarity. Teamwork. Instead, we got a chaotic, contentious debacle that kept some students out of their school buildings for more than a year. The New York Times recently reported “broad acknowledgment among many public health and education experts that extended school closures did not significantly stop the spread of Covid, while the academic harms for children have been large and long-lasting.” Yet, many school districts were stuck. Inert.

As students returned, it was immediately evident that they had suffered academic and developmental setbacks. Again, the situation called for an urgent response. And again, we have been embarrassingly slow. More than a few think pieces criticized the very use of the term “learning loss,” minimizing what was the largest expansion of education inequality on record.[4] Too many districts relied on “opt-in” tutoring platforms that received scant usage from students—particularly those who had fallen furthest behind. Others made their academic policies more lenient out of empathy for students, allowing unlimited retakes of tests and assigning higher grades, despite evidence that this may lead to unintended negative consequences. Messaging to parents has been so tepid that they aren’t worried. Recently, researchers from our most prestigious universities have sounded the alarm that “some children may never catch up and could enter adulthood without the full set of skills they need to succeed in the work force and life.”

That’s not all. Student absenteeism ballooned. But it was not until the current school year—which is the fourth since the onset of the pandemic—that it received national media attention and a robust public response. At first, many dismissed the problem as short-term and related directly to health issues like quarantining with Covid or facing elevated risks of exposure. Only belatedly has there been consideration that perhaps we are witnessing a broader change in habits of attendance.

Time and again, our systems have struggled to manage change. Maybe it’s because we are so decentralized, or because we ask too much of schools without providing enough resources and support. I don’t mean to imply that any of this is easy. My point is that even when the mission is essential and the evidence is clear, you won’t go broke betting that our systems will react with less vigor, not more.

That, I submit, is the dual lesson of teacher evaluation. We should have done a far better job designing and implementing it. We made so many mistakes. And also, we need our school systems to be more adaptable. We can’t be content to let them go through the motions of change, biding their time until those reforms can be tossed on the junk pile with the rest. Sometimes we need them to get the job done.

Why?

Because our schools are not in a good place. Decades of hard-won progress on student learning and achievement gaps have been lost.

In researching this series, I asked almost everyone whether there was an alternative history where evaluation was pursued with greater wisdom, and instead of flaming out, the reforms succeeded—where we ended up with better ways of distinguishing levels of teacher performance that helped us retain our best and set a clear minimum standard that led to outplacement of the worst.

Many—actually, most—said no. They felt the movement could have lasted longer and taken hold in more places, but ultimately, as long as low-performing teachers were going to lose their jobs more often, pushback within systems would eventually have eroded the programs until they crumbled.

I also asked them if they believe we will eventually make up the academic ground that was lost during Covid. Most of them said no. Again, that skepticism of systems.

I want to believe that our schools are capable of meaningful improvement that lasts. The cynicism reformers felt toward systems in the early 2000s, which I described in part two, was excessive. It was counterproductive. But I would also like to see more signs that our systems can deliver when it matters. I hope that’s what the next decade brings—fresh approaches that learn from past failures and a renewed focus on insisting that our schools keep their promises to families and the public.

Author’s note: I’ll end with a huge thank you. Folks were very generous with their time when I reached out about this topic. Readers have been exceptionally engaged, too. I’m grateful for all the messages, even if I haven’t been able to use everything you sent. I hope others will share their own reflections in other outlets. I know I will learn a ton from it. There’s so much more to unpack. Until then…

[1] I received input from current and former district and state superintendents, current and former teachers, leaders of teacher organizations, district staff, school leaders, researchers, journalists, colleagues from TNTP and other non-profits, and random opinionated people. I would have loved to cast an even wider net, but I did my best.

[2] It’s not just bias against teachers with more challenging classroom assignments, either. Black and male teachers also appear to receive lower scores that are not explained by other factors.

[3] There are exceptions. Texas is doing interesting things with its Teacher Incentive Allotment. From what I’ve been able to learn so far, it appears to incorporate many lessons from version 1.0 of teacher eval reform.

[4] Here are a few examples among many. The Washington Post ran a piece by an education-school professor who not only took issue with the term “learning loss,” but called for schools to “assume [students] have learned immeasurable and previously unknowable things” while missing school due to Covid. The president of the teachers union in Los Angeles famously denied the existence of learning loss and refused to discuss it during a meeting with the editorial board of the Los Angeles Times.
