Education leaders—principals, superintendents, state chiefs, philanthropy heads—make lots of decisions, and we exhort them to “use evidence” when they do. But we should stop doing that for at least four reasons. First, K–12 evidence itself is chock full of problems, as many have pointed out on these pages. Second, pestering isn’t a good strategy, as we’ve seen with pushes for Covid-19 vaccination, and as I’ve learned while trying to get my eighth grader to turn off his bedroom light when he leaves the room. Third, there’s a larger movement—often for good reason—to stop “trusting the experts.” And fourth, even when there’s “good evidence” for an education practice, service, or product, K–12 actors often find it irresistible to do a cheapo version of that thing.
For example, there is good evidence for high-dosage tutoring delivered by carefully picked tutors. So what happens? Leaders, flush in 2021 with Covid federal aid, launched scores of low-dosage tutoring programs without careful tutor selection. I predict most will find, years later, that their programs had no effect on kids—like Minnesota Math Corps.
One problem with education evidence is that the “flagship” intervention gets enormous attention to detail. The scaled versions get much less. Rationally, our preferences should be in this order:
- Best: A scaled, cheap version of X that actually works.
- OK: An expensive, high-quality version of X that actually works—to be used judiciously, given limited resources.
- Worst: A scaled, cheap, less carefully managed version of X that doesn’t work.
So why do so many education studies find failures of the third kind: scaled, cheap, less carefully managed versions that don’t work? Why are decision makers so overconfident at the moment they choose a “cheap version” of an “evidence-backed” program, rather than the actual evidence-backed program?
It’s because nobody warns leaders when they’re about to do so.
Jim Savage at Schmidt Futures pointed me to a possible solution to this evidence problem: prediction markets. I’ve been thinking about it ever since. It’s a way of getting past the cognitive biases and organizational dynamics of leaders and their underlings by involving outsiders and asking them to make predictions.
Just like you can bet on sports, you can bet on, well, anything you want, from “fewer than 10 million Americans will be uninsured in 2025 if this bill passes” to “migration from Central America will increase” to “this new cancer study will fail to replicate.” The point is that you can make money by making good bets. And that incentive is what can make prediction markets a pretty good forecaster of what’s to come.
...Why would you expect a betting market to get anything right? Well, because if a betting market is getting something wrong, then a person can make money by betting against the consensus.
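To make that incentive concrete, here is a toy calculation. The contract structure is the standard binary prediction-market share (pays $1 if the event happens, $0 if not); the prices and probabilities are invented for illustration.

```python
# Toy illustration: why a mispriced prediction market invites correction.
# A binary contract pays $1 if the event happens, $0 otherwise,
# so the market price doubles as the crowd's implied probability.

def ev_buy_yes(p_true: float, price: float) -> float:
    """Expected profit per "yes" share: pay `price`, win $1 with prob p_true."""
    return p_true - price

def ev_buy_no(p_true: float, price: float) -> float:
    """A "no" share costs (1 - price) and pays $1 with prob (1 - p_true)."""
    return (1 - p_true) - (1 - price)

# Suppose the market prices "this tutoring program will show gains" at 70 cents,
# but your own digging suggests the true chance is only 40%.
print(ev_buy_yes(0.40, 0.70))  # ≈ -0.30: buying "yes" loses money in expectation
print(ev_buy_no(0.40, 0.70))   # ≈ +0.30: betting against the consensus pays
```

If enough informed bettors take that trade, the price gets pushed down toward the true probability, which is exactly the self-correcting mechanism described above.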
One challenge of a forecasting market, however, is that you need lots of bets to make it work. Millions of people want to bet on sports or stocks. Tens of thousands want to bet on politics. How many people, however, want to bet on whether, say, Nashville’s tutoring program will raise achievement for students? Not many!
Wharton professor Philip Tetlock wondered whether there was a “hack” to this puzzle. Do we need a whole forecasting market to see into the future? Or could surveying people in a structured way give us most of that insight? We could call this a forecasting tournament.
Tetlock and his colleagues found these tournaments worked pretty well if you structure them right. You can get the “wisdom of crowds” with a pretty small crowd. You might need just twenty to sixty people—not thousands. Not only do you get a “wisdom of crowds” effect, you also figure out which individuals are good at predicting.
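As a sketch of how that might work mechanically, here is a minimal example using the Brier score, one standard way to grade probabilistic forecasts (the forecasters, numbers, and outcomes below are hypothetical, not Tetlock’s exact protocol):

```python
# Sketch: score individual forecasters and pool a small crowd.
# Each forecast is a probability that an intervention "works"; outcome is 1 or 0.
# Brier score = mean squared error of the forecasts (lower is better).

def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

# Three hypothetical forecasters on four past program evaluations.
history = {
    "researcher": [0.8, 0.3, 0.6, 0.2],
    "digger":     [0.9, 0.1, 0.7, 0.1],
    "amateur":    [0.5, 0.5, 0.5, 0.5],  # hedges everything at 50/50
}
outcomes = [1, 0, 1, 0]  # whether each program actually showed gains

# Scoring against realized outcomes reveals who predicts well...
scores = {name: brier(f, outcomes) for name, f in history.items()}
print(scores)  # the "digger" earns the lowest (best) Brier score here

# ...while the "crowd" forecast is simply the per-question average.
crowd = [sum(f[i] for f in history.values()) / len(history)
         for i in range(len(outcomes))]
print(brier(crowd, outcomes))  # pooling washes out individual quirks
```

With a real tournament you would run this over many questions, then weight or select future forecasters by their track records.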
So who would you invite to participate in a K–12 forecasting tournament? First, education researchers. That is something Drew Bailey of the University of California, Irvine, has been thinking about. He told me:
Forecasting has the potential to get a group of educational researchers—maybe one with a very diverse set of prior beliefs and values, one at risk for talking past each other in an insufficiently structured context—to focus on a specific set of questions of interest to the potential funder. Further, forecasts can benefit from this diversity: Forecasters will pay attention to different features of proposed educational interventions, and learning about when they succeed and fail will help us learn about the key features we should be attending to when forecasting the success of an educational intervention.
Moreover, a forecasting tournament provides useful information to participating researchers themselves. Making a forecast before knowing how well a program works will increase the chances that a researcher is surprised when they learn about the results—and receiving surprising feedback helps people change their minds. It may also promote intellectual humility.
Second, reporters and insiders can sometimes predict well, particularly when they’re free to dig around. For example, Nate Anderson of Hindenburg Research predicts stock prices. He’s a short seller, so his goal is to find overvalued stocks. He was curious about Facedrive, a newly public company valued at $1.4 billion. Think of an eco-friendly version of Uber, with electric or hybrid cars instead of regular ones. Facedrive claimed 13,000 drivers. Anderson dug in and found the true number of “active” drivers was closer to 600. He also wrote:
Rather than focusing on tackling just one resource-intensive highly competitive market like ridesharing, Facedrive recently entered a second—food delivery. We found Facedrive’s platform has a total of seventeen restaurants compared to Uber Eats’ 400,000 and GrubHub’s 300,000.
We called several of the “most popular” restaurants on the Facedrive Foods page. One didn’t seem to have a working phone number, and two said they don’t use Facedrive anymore.
Anderson found a ton of other problems, shorted the stock, and published his findings. Later, the stock fell 96 percent from its high.
Can that short-selling model work for an education prediction tournament? For example, I can name big city tutoring programs right now that are like Facedrive. One I know of has almost no kids showing up for tutoring after school. Another is so short of tutors that its “group sizes” are huge—which is not really tutoring, but more like a class. I believe—maybe I’m wrong—that a future evaluation of either program will show no positive effect for kids. I would “short” these tutoring programs if I could. I wish there were a way to get my intel to someone who mattered or cared, who would try to fix these problems, or at least stop subjecting kids to low-quality tutoring.
Finally, the third group that likes to forecast is the amateurs. They tend to be quant-oriented people with no domain background. They just like running their own numbers. Last year, for example, Youyang Gu was a twenty-seven-year-old MIT grad working on a sports analytics startup when Covid-19 hit. Gu googled “epidemiology” and noticed that death predictions were very far apart. One model said 2 million dead in the United States by summertime. Another predicted 60,000. Gu wondered if he could do better. Per the MIT Technology Review:
Within a week, he’d built a machine-learning model and launched his COVID-19 Projections website. He ran the model every day—it only took one hour on his laptop—and posted Covid-19 death projections for fifty US states, thirty-four counties, and seventy-one countries.
By the end of April 2020, he was attracting attention—ultimately, millions checked his website daily. Carl Bergstrom, a professor of biology at the University of Washington, took notice and commented on Twitter that Gu’s model was “making predictions that seem as good as any I’ve seen.”
Gu’s track record only kept improving. Over the course of the pandemic, he outperformed the credentialed experts.
Imagine a forecasting tournament of education researchers like Drew Bailey, plus “insiders or reporters” like Nate Anderson, who can work the phones or show up and observe, plus amateur outside quants like Youyang Gu.
Bailey told me:
If it turns out the Forecasting Tournament has predictive power, forecasts could be considered during the process in which educational interventions are designed or selected for evaluation. This would allow funders to choose more promising interventions. It would also incentivize applicants to attend more to the likely effectiveness of their interventions. Finally, it could help identify individuals with a knack for forecasting.
Over time, funders might attempt to recruit reviewers who have proven to be successful at forecasting impacts in previous work, or even favor applicants with strong forecasting records. I think it would make sense to favor a proposal for an RCT [randomized controlled trial] of an intervention that forecasts a large impact from a research team with a track record of making accurate forecasts over an identical proposal from a team that is consistently over-optimistic.
So an entity like the Bill & Melinda Gates Foundation, the Institute of Education Sciences, or a big city superintendent could ask a forecasting tournament to predict the effects of ten different large grants they were considering, and make better decisions about what to approve.
Here is an example: The Chancellor of the New York City Department of Education considers a summer school initiative. “Researcher” forecaster 1 predicts a large gain because the summer school model is a replication of a previously successful effort. “Digger” forecaster 2 calls a friend who is a lifer NYC principal who can think of five failed summer school rollouts in her twenty-nine-year career, so forecaster 2 bets against the effort. And “amateur” forecaster 3 hypothesizes that improved air conditioning will change summer school for the better and bets on a small gain.
Eventually, thirty forecasters make their picks, and the tournament prediction is sent to the chancellor. If the consensus is negative, maybe the chancellor does something before the summer school: cancels it, spends more on it, or changes it.
Leaders could also make use of the narrative justifications for these forecasts. That’s where predictors explain their reasoning. The hope is to improve a program before it even begins by understanding the “why” of a prediction, and then trying to gather the implementation team to correct deficiencies before they manifest. In addition, these justifications may make it clear to the funder what parts of grant proposals deserve more or less attention during the review process. For example, perhaps narrative justifications will drive funders and authors to attend more to specific impediments to scaling up effectively, or to more systematic reporting of the impacts of previously funded work by the authors, or to require a more thorough implementation plan.
So imagine two types of forecasting tournaments. In the first, the predictions are “quiet.” Someone creates an intervention and pre-registers a randomized controlled trial evaluation. The tournament gathers people to bet on what the result will be in five years. That’s how we validate who, in fact, is good at predicting. No leaders are involved.
In the second type, leaders ask the forecasting tournament to make predictions, and then they, the leaders, read the narrative justifications. Maybe one sees a lot of “fail” bets and reads that an intervention will fail because summer programs tend to have very low attendance rates, no matter how they’re designed, no matter what the incentives are for students. So rather than try to climb a mountain that is impassable, the leader changes the intervention to focus on willing kids, those who want to show up, instead of chasing resisting kids. The tournament would then discard the original predictions and allow everyone to revote based on the new program design.
Could this scheme work? Could a forecasting tournament provide a new tool that would nudge leaders to “act on evidence”? If so, would the result actually be better K–12 decisions, helping more students?
I want to see the details before I make my prediction.