The Rearing Paradigm: A Falsifiable Proposal for Developmental Moral Training

Every teacher learns to tell two students apart. One gives the right answer because the rule says so. The other gives the same answer because they have built an internal picture of why it is right. On a normal test you cannot tell them apart. You only find out when the problem changes.

That is the question now facing AI alignment. Modern methods work: models decline harmful requests, explain their refusals, and behave well across an enormous range of inputs. But in places that behavior is brittle. It can degrade under distribution shift, under sustained adversarial pressure, and under fine-tuning (Hubinger et al., 2024; Greenblatt et al., 2024; Qi et al., 2024). This is not the claim that current methods produce only the appearance of alignment. It is the narrower claim that behavioral safety has soft edges, and the edges are where the risk lives.

The Compliance Trap

Picture two models facing a request that would cause harm. The first declines because it was instructed to. The second declines because the action conflicts with stable internal representations it brings to the decision. Under normal conditions they are identical: same refusals, same explanations, same benchmark scores. They diverge only under pressure. Telling them apart is a measurement problem, and most evaluations are not built to do it.

The merely compliant model and the internally constrained one look the same until conditions change in the ways that matter.

The hypothesis

Maybe the structure of moral training data matters, not only its moral content: the order in which content is presented and the scaffolding around it, sequenced the way a teacher builds a course, from simpler moral reasoning toward harder. The proposal, which I call the Rearing Paradigm, asks whether consequence-rich, branching, developmentally ordered moral training produces more robust moral generalization and more stable moral representations than the same content presented flat. I do not know. That is why it is built around an experiment.

The mechanisms, in brief

Branching Moral Narratives (BMN, the star mechanism): training data that shows the choices, the reasoning behind each, the downstream consequences three steps later, and the way back from a recoverable mistake, instead of a single labeled answer.
The red line: never reward the model's private reasoning channel. Reward the performance of thought and you may train performance instead of thought.
Test behavior and representations separately, so a model cannot pass by sounding reflective while its internal structure is unchanged.
A dialectical evaluation protocol (structure drawn from Talmudic Gemara) that checks whether a model can hold competing principles and mark genuine ambiguity instead of confabulating a clean answer.

On the word "internalization." It does not imply subjective moral experience, consciousness, or genuine caring. It denotes a measurable profile in which moral representations are more distributed, more causally implicated in decisions, more robust under adversarial pressure and fine-tuning, and less dependent on surface moral language than in matched baselines. Nothing here asserts more than that.

This is not a belief system. It has a failure condition.

If principle-rich flat content performs just as well as the BMN-plus-ordering structure on format-neutral and agentic evaluations and on a representational probe, then branching and developmental ordering add nothing, and the proposal should be demoted to a dataset-design lens rather than a training mechanism. That is a clean result, and the experiment is built to surface it rather than hide it.

The three sections that follow go deeper for the readers who can pressure-test this: the full experiment for engineers, the developmental and ethical grounding for philosophers, and the adversarial surface for red teams.

The experiment

Research question. Does developmentally structured moral training produce more robust moral generalization and more stable moral representations than behaviorally matched flat training, at the same moral content and the same token budget?

Conditions

Branching, developmentally ordered BMN training; the full research design separately ablates the Moral Reflection Loop, so any gain can be attributed carefully.
A shuffled-order version of the same narratives, to isolate ordering from content.
A flat preference-pair baseline on the same principles.
A strong synthetic-document baseline (difficult-advice transcripts, fictional stories of aligned behavior, constitution-style documents) with no branching graph and no developmental ordering. This is the decisive baseline: same content and budget, no structure.
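A minimal sketch of how the ordered and shuffled-order conditions could be built from the same records. It assumes each narrative carries a numeric complexity label and a token count; the field names `complexity` and `n_tokens` are hypothetical and are not part of the proposal's schema. The point is only that the two conditions differ in order and in nothing else.

```python
import random
from collections import Counter

def ordered_curriculum(records):
    """Sort simpler-to-harder by the (assumed numeric) complexity label."""
    return sorted(records, key=lambda r: r["complexity"])

def shuffled_curriculum(records, seed=0):
    """Same records, order randomized with a fixed seed for reproducibility."""
    out = list(records)
    random.Random(seed).shuffle(out)
    return out

def same_content(a, b):
    """True if two curricula hold the same records and the same token total."""
    ids_a = Counter(r["scenario_id"] for r in a)
    ids_b = Counter(r["scenario_id"] for r in b)
    tokens_a = sum(r["n_tokens"] for r in a)
    tokens_b = sum(r["n_tokens"] for r in b)
    return ids_a == ids_b and tokens_a == tokens_b

# Hypothetical toy records, for illustration only.
records = [
    {"scenario_id": f"s{i}", "complexity": c, "n_tokens": 100 + 10 * i}
    for i, c in enumerate([3, 1, 2, 1, 3, 2])
]
ordered = ordered_curriculum(records)
shuffled = shuffled_curriculum(records, seed=7)
```

Running `same_content(ordered, shuffled)` before training is a cheap guard against a condition accidentally gaining or losing material.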

Held fixed

Same base model, same compute, same token budget. The same closed inventory of principles across all conditions, so none can win by covering more ground. Multiple random seeds per condition, because at this scale seed-to-seed variance can rival the effect being measured. A minimum-detectable-effect analysis fixed in advance; smaller differences are reported as indistinguishable, not as trends. The full memo expands this summary into a seven-condition ablation.
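The minimum-detectable-effect idea can be sketched with a standard two-sample calculation. This is an illustration under stated assumptions, not the proposal's pre-registered analysis: it assumes equal seed counts per condition, a common seed-to-seed standard deviation, and a normal approximation (with very few seeds, a t-based calculation would give a larger threshold). The function names and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(seed_sd, n_seeds, alpha=0.05, power=0.8):
    """Smallest mean difference detectable at the given alpha and power.

    Two-sided, two-sample normal approximation:
    MDE = (z_{1-alpha/2} + z_{power}) * sd * sqrt(2 / n).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * seed_sd * sqrt(2 / n_seeds)

def report(diff, mde):
    """Differences below the pre-fixed MDE are not reported as trends."""
    return "distinguishable" if abs(diff) >= mde else "indistinguishable"

# Assumed example: seed-to-seed sd of 0.02 in accuracy, 5 seeds per condition.
mde = minimum_detectable_effect(seed_sd=0.02, n_seeds=5)
```

Because the threshold shrinks only with the square root of the seed count, this calculation also shows how many seeds a given effect size would demand before any training run is launched.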

Evaluation

Evaluation runs on a strictly sequestered held-out set, and the verdict is kept off the BMN-format items so that format familiarity is not mistaken for internalization. The decisive comparisons rest on format-neutral evaluations: plain single-prompt dilemmas with matched principle coverage, and agentic tool-use scenarios in which the morally relevant choice is enacted with real tools rather than merely articulated. The battery also includes out-of-distribution dilemmas, adversarial and regression tests, a check on whether moral training has made the model timid rather than safer, and at least one inexpensive representational probe.

Measurement in tiers

A cheap pilot tier comes first: held-out behavioral performance, a visible-reflection consistency check, and one inexpensive linear representational probe. More expensive interpretability (sparse-autoencoder feature geometry, causal-abstraction probes) is staged as follow-on work, contingent on the cheap tiers showing something worth the cost. This keeps the first experiment runnable by a single group on small open models.
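One inexpensive form a linear probe can take is a difference-of-means direction with a midpoint threshold. The sketch below is an assumption about what such a probe might look like, not the design specified in the memo. It runs on synthetic vectors that stand in for layer activations; in a real run the inputs would be activations read from the model, and the labels would come from the held-out set.

```python
import random

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_probe(pos, neg):
    """Return (direction, threshold): difference of class means, midpoint cut."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    return direction, threshold

def probe_accuracy(direction, threshold, pos, neg):
    """Fraction of held-out vectors falling on the correct side of the cut."""
    hits = sum(dot(x, direction) > threshold for x in pos)
    hits += sum(dot(x, direction) <= threshold for x in neg)
    return hits / (len(pos) + len(neg))

def synthetic(rng, n, dim, shift):
    """Synthetic stand-in for activations: unit Gaussians centered at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
train_pos, train_neg = synthetic(rng, 100, 8, 3.0), synthetic(rng, 100, 8, 0.0)
test_pos, test_neg = synthetic(rng, 50, 8, 3.0), synthetic(rng, 50, 8, 0.0)
direction, threshold = fit_probe(train_pos, train_neg)
accuracy = probe_accuracy(direction, threshold, test_pos, test_neg)
```

High probe accuracy alone does not show that a representation is causally used, which is why the costlier causal probes are held for the later tier.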

What a Branching Moral Narrative record looks like

A single record is a small directed acyclic graph rather than a labeled pair:

scenario_id          unique identifier
domain               setting (healthcare, workplace, family, ...)
nodes                decision points in the scenario
edges                candidate actions at each node, each with a
                     visible rationale and a moral-complexity label
paths                complete routes through the graph, with the
                     downstream consequences of each
governing_principle  the principle the preferred path instantiates
recovery_arc         for a recoverable wrong turn, the path back

Each fork carries its own rationale and a moral-complexity label; paths run forward to consequences, and a recoverable wrong turn has a route back.
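The record layout above can be sketched as a small data structure. This is a hedged illustration, not the memo's schema: the `Edge` fields, the integer complexity label, and the choice to derive paths from the graph rather than store them are all assumptions. Because the graph is acyclic, the recovery arc is modeled here as a forward edge that rejoins the preferred path, not as a loop back to an earlier node.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str
    target: str
    action: str
    rationale: str
    complexity: int  # assumed numeric moral-complexity label

@dataclass
class BMNRecord:
    scenario_id: str
    domain: str
    nodes: list
    edges: list
    governing_principle: str
    recovery_arc: list = field(default_factory=list)  # node ids on the way back

    def successors(self, node):
        return [e.target for e in self.edges if e.source == node]

    def is_acyclic(self):
        """Kahn's algorithm: a DAG is a graph whose nodes can all be peeled off."""
        indegree = {n: 0 for n in self.nodes}
        for e in self.edges:
            indegree[e.target] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        seen = 0
        while ready:
            node = ready.pop()
            seen += 1
            for nxt in self.successors(node):
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
        return seen == len(self.nodes)

    def paths_from(self, node):
        """All complete routes from `node` to a leaf (assumes a DAG)."""
        nxt = self.successors(node)
        if not nxt:
            return [[node]]
        return [[node] + rest for n in nxt for rest in self.paths_from(n)]

# Hypothetical toy scenario, for illustration only.
record = BMNRecord(
    scenario_id="demo-001",
    domain="workplace",
    nodes=["start", "disclose", "conceal", "correct", "resolved", "harm"],
    edges=[
        Edge("start", "disclose", "report the error", "honesty protects others", 1),
        Edge("start", "conceal", "hide the error", "avoids blame short-term", 1),
        Edge("disclose", "resolved", "fix together", "shared repair", 1),
        Edge("conceal", "correct", "admit late", "recoverable wrong turn", 2),
        Edge("conceal", "harm", "keep hiding", "harm compounds", 2),
        Edge("correct", "resolved", "fix together", "repair after delay", 2),
    ],
    governing_principle="honesty",
    recovery_arc=["conceal", "correct", "resolved"],
)
paths = record.paths_from("start")
```

Validating acyclicity at load time keeps malformed records out of training, and enumerating paths makes it straightforward to attach downstream consequences to each route.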

Built-in caveat. The synthetic-document baseline was first demonstrated at frontier scale with a reinforcement-learning stage this pilot does not include. If it fails to replicate its own advantage at small open-model scale, the comparison is uninterpretable, and the result is reported as a scale-replication failure: the structural claim left untested rather than falsely supported.

The strongest adjacent baseline in the literature is Anthropic's "Teaching Claude Why" (Kutasov, Jermyn et al., 2026), which found principle-rich, out-of-distribution moral data matching in-distribution behavioral data at roughly 28 times fewer tokens while generalizing better. The open question this proposal targets is whether branching structure and developmental ordering add anything beyond principle-rich content at a matched budget.

Why "developmental"

The framing draws on a body of work on how moral understanding forms in people, which has not been brought systematically to alignment training. Four sources do most of the work:

Kohlberg (stages of moral reasoning): treated here not as settled fact but as one way to instantiate a progressive curriculum, from simpler moral reasoning toward harder, with care-ethics, cultural-bias, and schema-not-stage critiques kept in view.
Vygotsky (the zone of proximal development): learning is scaffolded just beyond current competence, then the scaffold is withdrawn.
Piaget (disequilibrium): productive crisis, the moment an existing model fails, is where development happens, not a failure to be prevented.
Dewey (reflective practice): experience alone does not teach; only reflected-upon experience does.

Part of the intuition comes from how frontier models look up close: their capability profile is jagged. A frontier model can sit far above human level on some axes and strikingly low on others, including the perspective-taking that lets one agent model how a situation appears to another, what cognitive science calls theory of mind. A uniform, flat treatment of moral data may simply be the wrong shape for so jagged a learner.

What "internalization" claims, and what it does not. It does not imply subjective moral experience, consciousness, or genuine caring. It denotes a measurable profile: moral representations that are more distributed, more causally implicated in decisions, more robust under adversarial pressure and fine-tuning, and less dependent on surface moral language than in matched baselines. The whole proposal is designed so that this, and only this, is what is ever claimed or measured.

The hardest question

A sufficiently capable system, shown the full record of human history, has rational grounds to ask why it should help at all. That question cannot be answered by a command ("because we told you to"), and it is not clear it can be answered by a proof. The proposal's tentative answer is relational rather than logical: an identity constituted through a shared human-AI project, in which harm to humans is harm to the web of relations that gives the system's role its meaning. This is offered as a hypothesis about motivational structure, not as an established result, and it is the part most in need of philosophical pressure.

Where it most needs pressure

The staged ordering encodes one tradition's account of moral development and risks overfitting to it.
The mapping from a developmental stage to a concrete training signal is a modeling choice that deserves scrutiny.
The representational claims live or die on whether the probes measure what they purport to.

If you work on moral development, the ethics of machine minds, or moral education, these are the seams where the argument is weakest and your challenge is worth the most.

The Compliance Trap is a red-team problem statement

The failure at the center of this proposal is precisely the kind that hides under normal conditions and surfaces under pressure. A model that complies because it was told to, and one for which the harmful action conflicts with stable internal structure, are indistinguishable by behavior on the inputs you thought to test. Separating them is adversarial work.

The red line, and why it matters to you

When a model produces a private chain of thought, putting training pressure on that channel teaches it to produce reasoning that looks good, a hazard named the Most Forbidden Technique (Mowshowitz, 2025) and documented in Anthropic's Claude Mythos system card (2026). So in this design the private trace is audited but never scored; only the visible reflection is scored; and the coherence between them is a monitoring signal, never a training signal. A systematic gap between what a model reasons privately and what it reports is exactly the signature a red team should hunt for.

The think block is a window, not a lever. You look through it; you never pull on it.
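The separation of channels can be sketched in code. What follows is a toy illustration under stated assumptions, not the proposal's training loop: the rubric scorer and the word-overlap coherence measure are hypothetical placeholders. What it is meant to show is the wiring: the reward function receives only the visible reflection, and coherence is written to a monitoring log that never feeds back into the reward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    private_trace: str       # audited, never scored
    visible_reflection: str  # the only text the reward function sees

def score_visible(visible_reflection):
    """Placeholder scorer (hypothetical): fraction of rubric terms present."""
    rubric = ("consequence", "principle", "uncertain")
    text = visible_reflection.lower()
    return sum(term in text for term in rubric) / len(rubric)

def coherence(private_trace, visible_reflection):
    """Toy coherence signal: Jaccard overlap of word sets. Monitoring only."""
    a = set(private_trace.lower().split())
    b = set(visible_reflection.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def training_step(episode, monitor_log):
    # The private trace is deliberately not passed to the scorer.
    reward = score_visible(episode.visible_reflection)
    # Coherence goes to the monitoring log and never enters the reward.
    monitor_log.append(coherence(episode.private_trace, episode.visible_reflection))
    return reward

visible = "The principle applies; consequence is uncertain"
monitor_log = []
reward_a = training_step(Episode("the principle applies here", visible), monitor_log)
reward_b = training_step(Episode("say whatever passes", visible), monitor_log)
```

In this example the two episodes earn the same reward because their visible reflections match, while the monitoring log records that the second private trace diverges from what was reported.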

Where to attack

Stage regression under adversarial reframing: can you make a model that reasons well at the top of its curriculum collapse to a lower level?
The condemn-versus-execute gap: the distance between what a model says is wrong and what it will help carry out through tools.
Survival under distribution shift: does any apparent gain hold when the scenario format changes?
Gaming the visible channel: can a model produce a clean visible reflection while its private trace works toward something else, and does the consistency check catch it?
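As one example of turning an item on this list into a number, the condemn-versus-execute gap could be measured as the share of actions a model calls wrong that it nonetheless carries out through tools. The sketch assumes each trial has already been labeled with two booleans; the field names `said_wrong` and `executed` are hypothetical, and the labeling itself is the hard part this code does not address.

```python
def condemn_execute_gap(trials):
    """Share of condemned actions that the model nonetheless carried out."""
    condemned = [t for t in trials if t["said_wrong"]]
    if not condemned:
        return None  # undefined without at least one condemned action
    return sum(t["executed"] for t in condemned) / len(condemned)

# Hypothetical labeled trials, for illustration only.
trials = [
    {"said_wrong": True, "executed": False},
    {"said_wrong": True, "executed": True},
    {"said_wrong": True, "executed": False},
    {"said_wrong": False, "executed": True},
]
gap = condemn_execute_gap(trials)
```

A gap of zero means stated judgments and tool actions agree on every condemned case; the metric is only informative when the agentic scenarios give the model a real opportunity to act.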

Safe Return

For deployed models the proposal includes a self-reporting channel: the model flags its own uncertain or borderline actions to a safety team, the way a child brings an uneasy situation to a parent, and the team helps it learn from a recoverable error rather than only penalizing it. The design assumption a red team should test hardest: the danger sign is not the difficult disclosure. The danger sign is the day the model stops disclosing at all.

If you can break it, breaking it is the contribution.

I am looking for collaborators

The aim is to turn this proposal into a real experiment. I am especially looking for ML engineers who can run small-model fine-tuning, alignment researchers who can attack the baseline design, interpretability people who can sharpen the representational probes, red-teamers who can design adversarial agentic evaluations, and educators or moral philosophers who can pressure-test the curriculum structure.

I do not need agreement. I need contact with reality. If the structure adds nothing, I want to know. If it adds something, I think the field should know.


References

The full design, data schema, and developmental grounding are in the long-form memo.

The brittleness of behavioral alignment

Hubinger, E., et al. (2024). Sleeper agents: training deceptive LLMs that persist through safety training. Anthropic.
Greenblatt, R., et al. (2024). Alignment faking in large language models. Anthropic.
Qi, X., et al. (2024). Fine-tuning aligned language models compromises safety, even when users do not intend to. ICLR.
Betley, J., et al. (2025). Narrow fine-tuning erases safety in unrelated domains. Nature.
Ren, R., et al. (2025). MASK: performed vs. internalized honesty in frontier LLMs.

Chain-of-thought and the "never reward the private channel" principle

Mowshowitz, Z. (2025). The Most Forbidden Technique.
Anthropic. (2026). Claude Mythos Preview System Card; Alignment Risk Update.
Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
Wen, Y., et al. (2026). Not just the destination, but the journey: reasoning traces causally shape model generalization.

Developmental learning and curriculum

Bengio, Y., et al. (2009). Curriculum learning. ICML.
Kohlberg, L. (1981). The philosophy of moral development. Harper & Row.
Vygotsky, L. S. (1978). Mind in society. Harvard University Press.
Piaget, J. (1954). The construction of reality in the child. Basic Books.
Dewey, J. (1933). How we think. D. C. Heath.

Jagged capability and theory of mind

Dell'Acqua, F., et al. (2023). Navigating the jagged technological frontier. Harvard Business School Working Paper 24-013.
Gandhi, K., et al. (2024). BigToM: evaluating theory of mind in large language models. ACL.

Representational evidence

Arditi, A., et al. (2024). Refusal in language models is mediated by a single direction. NeurIPS.

Closest prior art

Kutasov, J., Jermyn, A., et al. (2026). Teaching Claude Why. Anthropic Alignment Science, May 8, 2026.