Transformative for the motivated and mere meh for the unmotivated: How AI will and won’t affect learners

Sean Geraghty Mike Goldstein

11.13.2023

Editor’s note: This essay is an entry in Fordham’s 2023 Wonkathon, which asked contributors to answer this question: “How can we harness the power but mitigate the risks of artificial intelligence in our schools?”

1. Will AI replace human tutors and teachers?

Yes, AI will. This school headmaster; this survey; DuoLingo guy here.

No, AI won’t. Nathaniel Grossman in Fordham; Freddie DeBoer; Dan Meyer.

Our question: What if AI has profoundly different effects on motivated and unmotivated learners?

It’s a tool for two very different situations.

Motivated[1] learners will increasingly substitute AI for human tutors and teachers[2].

Unmotivated learners will rarely (of their own accord) do this.

2. Let’s make a bet!

Here’s a summary of a Khan Academy study in Long Beach from 2018.

In that study, 582 kids (11 percent) were willing to use Khan as recommended (thirty minutes per week), while 4,766 refused. In fact, more kids in this study (748, or 14 percent) did zero minutes per week all year.

The 11 percent of “willing” kids had nice learning gains, estimated at 0.2 standard deviations. The low-use kids (3,092) made no gains.
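For readers who want to check the percentages, here is a minimal sketch in Python. The counts come from the summary above; treating the 582 "willing" kids and the 4,766 others as the whole sample is our assumption, and the variable names are purely illustrative.

```python
# Counts reported in the summary of the 2018 Long Beach study.
used_as_recommended = 582
did_not_use_as_recommended = 4766
zero_minutes_all_year = 748

# Assumption: the two groups above together make up the full sample.
total_students = used_as_recommended + did_not_use_as_recommended


def percent_of_sample(count: int, total: int) -> int:
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


willing_pct = percent_of_sample(used_as_recommended, total_students)
zero_pct = percent_of_sample(zero_minutes_all_year, total_students)

print(f"Sample size: {total_students}")
print(f"Used as recommended: {willing_pct} percent")
print(f"Zero minutes all year: {zero_pct} percent")
```

Under that assumption, the shares round to the 11 percent and 14 percent quoted above.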

Imagine we replicate this study in 2025, substituting AI-powered Khanmigo for OG Khan Academy. (You might further imagine some but not all bugs have been fixed.)

What do you bet will happen?

What percentage of kids will use Khanmigo as recommended, compared to the original experiment? That goes to motivation.

Will those kids learn the same, more, or less than the original experiment’s 0.2 SD gain?

Let’s return to this bet later.

3. What context do we need to ponder AI impact?

The first question: How good are 1-1 human tutors?

What you believe about AI’s ability to substitute for humans is dependent on whether you generally think human tutoring is great, good, OK, or bad.

There are four popular, competing answers about human tutor quality. We’d label them as:

“Bloom utopian,” “HDT evangelist,” “Mathematica pragmatist,” and “District superintendent pessimist.”

Let’s quickly examine.

1. There is a ludicrous study by the Bloom’s Taxonomy guy. Bloom once wrote that tutoring gets a 50th percentile kid to the 98th. That’s 2.0 standard deviations of gain. Bloggy takedowns of that study are here and here, and for my money they are still too generous to Bloom. Yet this utopian study still gets a lot of play, from important people! The CZI and Gates Foundation “Moonshot” was framed using this research, for example. Yikes.

2. Back on Planet Earth, a top-cited recent Tutor Evangelist study is called The Impressive Effects of Tutoring on pre-K–12 Learning. They find “High Dosage Tutoring” moves a 50th percentile kid to the 62nd percentile. Sociologist Bo Paulle and I argued in 2020 that this estimate, while flattering to work Bo and I personally have done, overstates the real-life impact of human tutors in schools.

3. The Gates Foundation generously hired Mathematica to measure more human tutoring programs. Mathematica did, in fact, find much lower effects on kids in its 2023 publications, like 0.12 SDs from Blueprint and 0.13 SDs from Air. Since these real-life effects were two-thirds lower than the Evangelist findings, I call this tribe the Mathematica Pragmatists.[3]

4. The District superintendent pessimist thinks: “But that’s not what happened with us. We bought tutoring and almost no kids were actually tutored!” The Pragmatists, Evangelists, and Utopians all respond: “You goofball. You didn’t actually buy High Dosage Tutoring, where each kid is supposed to sit down with a tutor at the following times each week. You bought a scam.” To which the Supe replies, “I don’t appreciate your critiques. We did the best we could.”

Anyway, that’s the wide range of perception about human tutoring quality.

What you think about AI’s promise depends in part on how good you think the human alternative is. If you think human tutors have low impact, it’s hard to imagine how AI would be better. If you think human tutors have huge impact, then even a diminished AI would be valuable, with gains that are half or a quarter as big.

4. Human tutoring is really two things

Human tutoring is really two different interventions in one—a motivation intervention (like a personal trainer saying “c’mon, give me thirty more seconds of this plank!”) and an instructional intervention.

Remember, most in-school tutoring providers serve assigned kids—a mix of motivated kids and unmotivated ones, probably more of the latter.

Do we have any good data on private tutoring? You know, when Mom or Dad just hires a tutor?

Not really.[4] Understandably, private companies can’t easily ask half their potential customers to “not receive tutoring.”

Privately purchased tutors tend to have more motivated customers. For example, our daughters (Mike has one, Sean has two) are tutored by amazing women (Rashmi and Smriti). The results have been great, increasing both confidence and knowledge. As best Mike can estimate from MCAS and i-Ready scores over the past three years, our kid went from the 70th percentile to the 90th…putting her personal gain of 0.76 standard deviations somewhere between the Tutor Evangelists (0.37 SD) and the Bloom Utopians (2.0 SD).
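Here is a rough sketch of how a percentile move like that translates into standard deviations. It assumes test scores follow a normal distribution (our simplifying assumption, not something MCAS or i-Ready guarantees), and the function name is illustrative.

```python
from statistics import NormalDist


def percentile_gain_in_sd(start_pct: float, end_pct: float) -> float:
    """Convert a move between two percentiles into standard deviations.

    Assumes scores are normally distributed.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_pct / 100)
    end_z = std_normal.inv_cdf(end_pct / 100)
    return end_z - start_z


gain = percentile_gain_in_sd(70, 90)
print(f"70th to 90th percentile is about {gain:.2f} SD")
```

Under that assumption, the 70th-to-90th move works out to roughly the 0.76 SD quoted above.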

Motivated students have more to gain from AI tutors.

By contrast, let’s empathize with the situation inside schools. The main driver of tutor impact is how well they deal (or not) with student motivation. Let’s loosely describe student motivation as red light, yellow light, green light.

Red light means a kid actively resisting tutoring: defiant, maybe head down on desk, perhaps literally wandering away from the desk, perhaps ignoring tutor and playing with phone, perhaps sitting arms crossed and staring daggers at tutor, perhaps video and audio off for an online tutor. This happens a lot!

Green light means a kid is willing to consistently try. The tutor assigns a problem, he tries; the tutor asks a question, he responds.

Yellow light is perhaps a kid who is up and down. Motivation may be low today based on his lousy weekend, high tomorrow after a great interaction with friends at lunchtime, and low again the next day because he failed a history test the period before, then back up because a kid he doesn’t like is absent.

A human tutor walks on eggshells with red and yellow light kids. If you push them hard to think, they may have an intellectual breakthrough (good), or may stop trying entirely (bad).

It’s hard to imagine how AI tutors will flourish in these types of situations.

5. Well, what about edtech interventions from the past twenty years, the various efforts to provide computer-based tutoring? After all, AI is supposed to be an improvement on that category.

Generally, these have generated lower gains than human tutors. That’s in part because edtech is just an instructional intervention, not a motivational one.[5]

If we look at randomized trials of the likes of Zearn and Dreambox and similar products inside public schools, the most common finding is large numbers of students simply unwilling to use those products in the way they’re recommended, just like the Khan finding from Long Beach.

The average learning gains are in the range of 0.03 to 0.04 standard deviations. This moves a student from the 50th percentile to the 51st or 52nd. Compare those with the Mathematica Pragmatists’ finding of roughly 0.13 SD for human tutors.
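The SD-to-percentile translation can be sketched in a few lines of Python. As before, this assumes normally distributed scores, which is our simplification, and the function name is made up for illustration.

```python
from statistics import NormalDist


def percentile_after_gain(effect_sd: float, start_pct: float = 50.0) -> float:
    """Percentile a student reaches after gaining effect_sd standard deviations.

    Assumes scores are normally distributed.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_pct / 100)
    return 100 * std_normal.cdf(start_z + effect_sd)


for effect in (0.03, 0.04, 0.13):
    print(f"{effect:.2f} SD: 50th to about the {percentile_after_gain(effect):.1f}th percentile")
```

On this rough conversion, the software gains land between the 51st and 52nd percentiles, while a 0.13 SD gain lands near the 55th.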

6. Taken together, the data on human tutoring and computer tutoring allows us to estimate the “motivation” effect.

If we stick with the curmudgeonly estimate of human tutoring and the curmudgeonly estimate of computer software:

Human Tutors 2023 Mathematica (Motivation + Instruction): ~0.13 SD

Computer software tutoring (Instruction only): ~0.04 SD

Our hypothesis is that the main difference is motivation.
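The back-of-envelope arithmetic behind that hypothesis looks like this. It assumes the motivation and instruction effects simply add up, which is our simplification rather than a finding from either set of studies; the variable names are illustrative.

```python
# Curmudgeonly estimates quoted above, in standard deviations.
human_tutoring_sd = 0.13  # motivation + instruction (Mathematica, 2023)
software_sd = 0.04        # instruction only (edtech trials)

# Assumption: the two components are additive, so the gap between
# human tutoring and software is what we attribute to motivation.
motivation_sd = round(human_tutoring_sd - software_sd, 2)
motivation_share = motivation_sd / human_tutoring_sd

print(f"Implied motivation effect: ~{motivation_sd} SD")
print(f"Share of the human tutoring effect: ~{motivation_share:.0%}")
```

If the additive assumption holds, motivation accounts for roughly 0.09 SD, or about two-thirds of what human tutors deliver.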

One can quibble with the studies chosen here. For example, here’s a meta study that finds larger gains to software. The more rigorous the study, the more the effect size of both human and computer tutors declines, but the theme of human > computer holds.

7. So where does that leave us with AI?

AI is already often better at instruction than the incumbent edtech software. Already we see that Khanmigo helps a confused (yet motivated) student more than OG Khan Academy. AI will be great with green light, motivated kids.

Will AI help with red light kids?

Not that often, we predict. Yes, we know advocates imagine cuddly robots who “reach” these children. And there’s promise here for autistic kids in particular. But we’re generally skeptical.

Will AI flip some yellow light kids to green because its improved instruction generates some learning success, and that in turn leads to higher motivation? Yes!

So let’s return to the bet: If, in 2025, we replicate the Khan Academy Long Beach study….

What percentage of students use AI-powered Khan as recommended?

We’d bet 20 percent compared to the original 11 percent.

What learning gains would those 20 percent make?

We’d estimate 0.3 standard deviations, compared to the original 0.2 SDs.

To us, that is amazing. What a gain!

To those looking for an AI game-changer for students inside schools, sorry, it’s not quite that.
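One way to see both reactions at once is to spread the gains across every student in the study. This sketch uses the numbers from our bet and assumes, as a simplification, that kids who don't use the tool as recommended gain nothing; the function name is illustrative.

```python
def average_gain_across_all_students(share_using: float, gain_sd: float) -> float:
    """Average gain over the whole population, in standard deviations.

    Assumption: students who do not use the tool as recommended gain zero.
    """
    return share_using * gain_sd


original_avg = average_gain_across_all_students(0.11, 0.2)   # 2018 Khan Academy
predicted_avg = average_gain_across_all_students(0.20, 0.3)  # our 2025 Khanmigo bet

print(f"Original study: ~{original_avg:.3f} SD averaged over all students")
print(f"Our bet: ~{predicted_avg:.3f} SD averaged over all students")
print(f"Ratio: ~{predicted_avg / original_avg:.1f}x")
```

Under that assumption, the average gain would nearly triple, yet still sit well below a tenth of a standard deviation.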

Will children—actually people of all ages—massively increase their consumption of AI tutoring for things they want to learn?

Yes!

Will AI manage to motivate the majority of schoolkids to grapple with all those subjects we policy wonks find important, but they typically resist?

No.

[1] Note: “motivated” here is shorthand for “motivated to learn the particular subject the school wants them to learn.” The same kid may be “unmotivated” to learn math and cricket and motivated to learn history and basketball.

[2] For purposes of this essay, let’s talk just about one-on-one tutoring (from humans or AI) and leave out classroom teaching. It simplifies things.

[3] Mathematica published some other studies we’re not mentioning, some that would reduce this tutoring estimate and some that would raise it. One was a zero result but there were study problems; another had a weird outcome variable instead of a normal standardized test like MAP or a state exam.

[4] There certainly are studies, but because of understandable inherent limitations they usually don’t compare well to what can be learned about tutoring in schools, where studies are randomized and the annual standardized test is the outcome of interest. We personally do hope to pioneer some research in this area.

[5] We concede that the software makers claim they do try to motivate kids—offering points, rewards, and badges. But given the RCT results, those inputs probably don’t move the needle that much with yellow light and red light kids. We don’t fault the creators. It’s hard to motivate!
