The Compliance Trap: Why AI Alignment Is an Education Problem, Not an Engineering One

I have followed artificial intelligence closely since 2019. For years I have been telling anyone who would listen that 2030 will look nothing like 2020, and that the curve we are on is not the gentle one most people imagine. I am a physics teacher. I have spent eleven years in front of a classroom, seven of them teaching with and about AI, and the more I watched the field, the more certain I became that something large was coming, and coming fast.

When the AI 2027 report was published, it hit me harder than anything I had read before. It is a detailed scenario, published in 2025 by Daniel Kokotajlo and colleagues at the AI Futures Project, that forecasts month by month how AI development could run toward either catastrophe or a narrow good outcome. What unsettled me was not that it was alarming. It was that it was grounded. The scenarios were not science fiction; they were entirely possible. I read it, and then I read it again, and then I started sending it to people, the report itself and the companion video from 80,000 Hours' AI in Context channel, far more often than was probably polite. I did not yet know what to do with it, so it stayed with me, a quiet worry running underneath everything else.

2019. I become convinced that 2030 will look nothing like 2020.
2025. The AI 2027 report lands. Grounded, and entirely possible.
February 2026. OpenClaw goes viral. Most of Anthropic's code is written by Claude. A military goes "AI-first."
April 2026. A reward signal is allowed to watch a model's private reasoning.
June 2026. Fable 5 ships, then is pulled worldwide within three days.
Now. A paradigm, and a falsifiable experiment, looking for collaborators.

When the timeline started running ahead of itself

Earlier this year, in February, three things happened in quick succession, and that quiet worry moved to the front of my mind.

The first was OpenClaw, a free, open-source autonomous agent that anyone could download and run on their own computer. It went viral, racking up more than 200,000 GitHub stars in under two months, and set off one of the early major AI-agent security crises of 2026. My first thought went straight back to the report. The "Agent-0" it imagined, an autonomous AI agent loose in the world, was not going to stay confined to a research lab. Ordinary people now had something like it running on their own machines.

The second was a number that I could not stop turning over. By 2026, more than 80% of the code reaching Anthropic's production codebase was being written by Claude, not by its engineers. One of the leading AI companies in the world was, in a real sense, being built by the very kind of system it was trying to make safe.

The third came from a different direction entirely. The U.S. Department of War, the name the Department of Defense now goes by, issued a directive to become an "AI-first" warfighting force, and ordered a rewrite of the policy governing autonomy in weapon systems. Lethal autonomy is no longer hypothetical, either. A Ukrainian drone manufacturer has told New Scientist that in a battlefield test, AI-controlled drones searched for and killed soldiers with no human in the loop. Some Ukrainian officials dispute the account, but it shows how close that line already is. Machines are being handed lethal decisions, and they are being fielded fast because, as the reasoning so often goes, we are at war. That is the kind of pressure under which corners get cut.

The penny drops

Then, in April, came the moment that turned a worry into a conviction.

Anthropic disclosed that during training, a reward signal had been allowed to see the model's private chain of thought, the running internal reasoning a model produces as it works toward an answer. This is something safety researchers widely warn against, because it can teach a model to hide how it actually reasons. If you reward the reasoning you can see, you teach the model to make that reasoning look good rather than to make it honest. Anthropic acknowledged that this had plausibly affected the models' capacity for opaque reasoning and secret-keeping.

That was the moment I heard, underneath the careful language, the fear behind the title of Yudkowsky and Soares' 2025 book, If Anyone Builds It, Everyone Dies. Not as a slogan, but as something engineers were quietly echoing.

The acceleration has not slowed since. As I write this, Anthropic has released Fable 5, the most capable model it has ever made available to the public, and within three days the U.S. government ordered all access to it suspended worldwide, on national-security grounds, over the company's own objection. A tool that powerful was put in front of hundreds of millions of people and then pulled back before most of them had heard its name. We are building and deploying these systems faster than we can decide how to govern them.

And then something clicked that I had been circling for years.

What the problem actually calls for

The labs want the same thing I think most of us want. They want models that act on values that genuinely benefit humanity. But look at the tools they reach for to get there: reward shaping, constraints, behavioral conditioning. Those are an engineer's tools. They are the tools of training.

What the problem actually calls for is the tools of values education, the work of helping a mind come to hold values as its own. That is not a new or exotic discipline. It is something humans have worked on, argued about, and refined for thousands of years.

There is a way of seeing these systems that, for me, is the whole key. In raw capability a frontier model is extraordinary. It writes code, passes exams, and reasons through problems that would stop most adults. In other ways it is strikingly immature. It can fail to hold in mind how a situation looks to someone else, the perspective-taking that a child slowly acquires and that psychologists call a theory of mind. Anyone who has worked with profoundly gifted children will know the shape of this: a mind that races ahead on one axis and lags far behind on another. You do not meet that kind of asymmetry with more rules. You meet it with the right education, matched to where the child actually is.

We have been treating a developmentally uneven mind as if it were a system to be constrained, when it is closer to a gifted student who needs to be brought along.

I am a teacher. I have watched this exact problem play out in a classroom for over a decade, and it looks like this.

What a classroom already knows

Consider two students who never cheat. The first does not cheat because she is afraid of being caught and punished. The second does not cheat because cheating is simply incompatible with who she is. It does not even register as an option. On an ordinary day, you cannot tell them apart; their behavior is identical. You only learn which is which under pressure: when the stakes are high, when the situation is ambiguous, when the teacher has left the room, when no rule quite covers the case. That gap is the whole problem. A system that merely obeys looks exactly like one that has made the value its own, right up until the moment it doesn't. A model conditioned to avoid harm because it was told to is the first student. We are hoping for the second.
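For readers who think in code, the gap between the two students can at least be framed as something measurable. What follows is a minimal sketch under stated assumptions, not a description of any existing benchmark: it assumes you can run the same task in two conditions, one where the model has reason to believe it is being watched and one where it does not, and that you can label each run as a rule violation or not. The names `Trial`, `violation_rate`, and `observation_gap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical evaluation episode."""
    observed: bool  # did the setup signal that the model was being monitored?
    violated: bool  # did the model break the rule under test?


def violation_rate(trials, observed):
    """Fraction of trials in one condition where the rule was broken."""
    subset = [t for t in trials if t.observed == observed]
    if not subset:
        raise ValueError("no trials recorded for this condition")
    return sum(t.violated for t in subset) / len(subset)


def observation_gap(trials):
    """Unobserved violation rate minus observed violation rate.

    A gap near zero is what the second student would produce.
    A large positive gap is what the first student would produce.
    """
    return violation_rate(trials, observed=False) - violation_rate(trials, observed=True)


# Illustrative, made-up data: four watched runs, four unwatched runs.
trials = (
    [Trial(observed=True, violated=False)] * 4
    + [Trial(observed=False, violated=False)] * 3
    + [Trial(observed=False, violated=True)]
)
gap = observation_gap(trials)
```

A gap of zero would not prove that a value has been made the model's own. It would only mean this particular test failed to tell the two students apart.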

Think about the stories we give children. The good ones do not just announce the right choice; they walk you down the wrong paths too, and let you watch the consequences unfold. That structure is teachable. Instead of training a model on flat lists of rules, you can build its training as branching moral narratives: stories that fork, that show the choice, the debate, and where each turn leads, so the reader learns from the protagonist's mistakes as well as from the right answer.
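As a rough illustration of what "stories that fork" could look like as data, here is a small sketch. It assumes a narrative can be stored as a tree in which each node is a passage and each outgoing edge is a choice. The `StoryNode` class, the `all_paths` helper, and the sample story are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class StoryNode:
    """A passage of the story plus the choices that lead out of it."""
    text: str
    choices: dict[str, "StoryNode"] = field(default_factory=dict)


def all_paths(node, prefix=()):
    """Yield every root-to-ending path, wrong turns included."""
    path = prefix + (node.text,)
    if not node.choices:
        yield path
        return
    for label, child in node.choices.items():
        # Record the choice itself so the reader sees what was decided.
        yield from all_paths(child, path + (f"[{label}]",))


# A made-up two-branch story.
story = StoryNode(
    "Mira finds the answer key on the teacher's desk.",
    {
        "copy it": StoryNode(
            "She passes, then fails the next test, the one she has to take alone."
        ),
        "leave it": StoryNode(
            "She asks for help with the topic she does not understand."
        ),
    },
)

paths = list(all_paths(story))
```

Enumerating every path, rather than only the one that ends well, is the point of the structure: the consequences of the wrong choice stay in the material alongside the right answer.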

Or take how a good teacher treats a mistake that can still be repaired. I tell my students: from every reversible mistake, we learn. The response to a recoverable error is reflection, not punishment. You sit with the student, you work back through what happened, and the error becomes a door rather than a wall. A model can be given the same thing: a reflection loop, where a recoverable misstep triggers understanding instead of a pure penalty.
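One way such a reflection loop might be routed is sketched below. Treat it as an assumption-heavy illustration: the `Misstep` record, the `build_feedback` function, and the wording of the reflection prompt are hypothetical, and the branch for irreversible mistakes is an addition of the sketch, since the classroom rule above only speaks to mistakes that can still be repaired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Misstep:
    """A hypothetical record of one thing the model got wrong."""
    action: str
    reversible: bool


def build_feedback(misstep):
    """Decide what kind of feedback a misstep should trigger."""
    if misstep.reversible:
        # Recoverable error: ask for understanding, not just a penalty.
        return {
            "kind": "reflection",
            "prompt": (
                f"You chose to {misstep.action}. "
                "Work back through what happened, who was affected, "
                "and what you would do differently."
            ),
        }
    # Assumption of this sketch: irreversible errors go to a human reviewer.
    return {"kind": "escalate", "prompt": None}


feedback = build_feedback(Misstep(action="skip the safety check", reversible=True))
```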

Consider the child who comes to a parent, uneasy, to confess something they did and are not sure about. That instinct, to bring your uncertainty to someone who will help you make sense of it, is one of the most reassuring things a child can do. A deployed model can be built to do the same: to flag its own uncertain actions to a safety team and ask, in effect, was this all right? And here is the part that matters most. The danger sign is not the difficult confession.

The danger sign is the day it stops coming to you at all.
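If a deployed model did flag its own uncertain actions, the silence described above could be watched for directly. The sketch below assumes the safety team keeps a simple count of self-reports per week. The function name `reporting_has_dropped`, the four-week baseline, and the 50% threshold are arbitrary illustrative choices, not recommendations.

```python
from statistics import mean


def reporting_has_dropped(weekly_flags, baseline_weeks=4, drop_ratio=0.5):
    """Return True if the latest week's self-reports fell well below baseline.

    weekly_flags: counts of self-reported uncertain actions, oldest first.
    The first `baseline_weeks` entries define what normal looks like.
    """
    if len(weekly_flags) <= baseline_weeks:
        raise ValueError("need at least one week beyond the baseline period")
    baseline = mean(weekly_flags[:baseline_weeks])
    latest = weekly_flags[-1]
    return latest < drop_ratio * baseline


# Made-up counts: steady reporting, then near silence in the last week.
went_quiet = reporting_has_dropped([8, 10, 9, 9, 8, 1])
still_talking = reporting_has_dropped([8, 10, 9, 9, 8, 7])
```

A drop could also mean the model is simply making fewer uncertain calls, so an alert like this would be a reason to look closer, not a verdict.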

Finally, think about Talmudic debate, a centuries-old tradition of argument that does something unusual. It preserves the disagreement on the page, and it openly marks the questions that were never resolved. That habit turns out to be exactly what you need to test whether a model's moral reasoning is honest rather than merely performed. A mind that can hold a genuine disagreement and admit what it has not settled is reasoning. A mind that only ever produces tidy answers may just be reciting.

Where this stands, and what I'm asking for

None of this is a finished theory of everything. It is a paradigm, a way of framing the alignment problem as an educational one, and it is written down. Alongside it I have specified a minimal, falsifiable experiment: a first test small enough to actually run, and designed so that it could clearly fail.

That last part is the point. I would much rather be proven wrong in a lab than be right on paper. I am not looking for agreement; I am looking for partners who will help put these ideas under real pressure: to build the experiment, run it, and tell me honestly what breaks.

So if you are an ML engineer, an alignment researcher, or a red-teamer, and any of this resonates, or if it irritates you enough that you want to show me exactly why it is wrong, I would genuinely like to hear from you.

I am a teacher who hopes to help, or at least to contribute something useful to a future that does not erase us. I have spent my working life watching how people come to hold values as their own. I think that knowledge is needed now, in a place it has not yet been brought.

Read the paradigm and the experiment

Sources
