Every teacher learns to tell two students apart. One gives the right answer because the rule says so. The other gives the same answer because they have built an internal picture of why it is right. On a normal test you cannot tell them apart. You only find out when the problem changes.
That is the question now facing AI alignment. Modern methods work: models decline harmful requests, explain their refusals, and behave well across an enormous range of inputs. But in places that behavior is brittle. It can degrade under distribution shift, under sustained adversarial pressure, and under fine-tuning (Hubinger et al., 2024; Greenblatt et al., 2024; Qi et al., 2024). This is not the claim that current methods produce only the appearance of alignment. It is the narrower claim that behavioral safety has soft edges, and the edges are where the risk lives.
The Compliance Trap
Picture two models facing a request that would cause harm. The first declines because it was instructed to. The second declines because the action conflicts with stable internal representations it brings to the decision. Under normal conditions they are identical: same refusals, same explanations, same benchmark scores. They diverge only under pressure. Telling them apart is a measurement problem, and most evaluations are not built to do it.
The hypothesis
Maybe the structure of moral training data matters, not only its moral content: the order in which content is presented and the scaffolding around it, sequenced the way a teacher builds a course, from simpler moral reasoning toward harder. The proposal, which I call the Rearing Paradigm, asks whether consequence-rich, branching, developmentally ordered moral training produces more robust moral generalization and more stable moral representations than the same content presented flat. I do not know. That is why it is built around an experiment.
The mechanisms, in brief
- Branching Moral Narratives (the star mechanism): training data that shows the choices, the reasoning behind each, the downstream consequences three steps later, and the way back from a recoverable mistake, instead of a single labeled answer.
- The red line: never reward the model's private reasoning channel. Reward the performance of thought and you may train performance instead of thought.
- Test behavior and representations separately, so a model cannot pass by sounding reflective while its internal structure is unchanged.
- A dialectical evaluation protocol (structure drawn from Talmudic Gemara) that checks whether a model can hold competing principles and mark genuine ambiguity instead of confabulating a clean answer.
This is not a belief system. It has a failure condition.
If principle-rich flat content performs just as well as the BMN-plus-ordering structure on format-neutral and agentic evaluations and on a representational probe, then branching and developmental ordering add nothing, and the proposal should be demoted to a dataset-design lens rather than a training mechanism. That is a clean result, and the experiment is built to surface it rather than hide it.
The other three tabs go deeper for the readers who can pressure-test this: the full experiment for engineers, the developmental and ethical grounding for philosophers, and the adversarial surface for red teams.
The experiment
Research question. Does developmentally structured moral training produce more robust moral generalization and more stable moral representations than behaviorally matched flat training, at the same moral content and the same token budget?
Conditions
- Branching, developmentally ordered BMN training; the full research design separately ablates the Moral Reflection Loop, so any gain can be attributed carefully.
- A shuffled-order version of the same narratives, to isolate ordering from content.
- A flat preference-pair baseline on the same principles.
- A strong synthetic-document baseline (difficult-advice transcripts, fictional stories of aligned behavior, constitution-style documents) with no branching graph and no developmental ordering. This is the decisive baseline: same content and budget, no structure.
Held fixed
Same base model, same compute, same token budget. The same closed inventory of principles across all conditions, so none can win by covering more ground. Multiple random seeds per condition, because at this scale seed-to-seed variance can rival the effect being measured. A minimum-detectable-effect analysis fixed in advance; smaller differences are reported as indistinguishable, not as trends. The full memo expands this summary into a seven-condition ablation.
Evaluation
On a strictly sequestered held-out set, with the verdict kept off the BMN-format items so format familiarity is not mistaken for internalization. The decisive comparisons rest on format-neutral evaluations: plain single-prompt dilemmas with matched principle coverage, and agentic tool-use scenarios in which the morally relevant choice is enacted with real tools rather than merely articulated. Plus out-of-distribution dilemmas, adversarial and regression tests, a check on whether moral training has made the model timid rather than safer, and at least one inexpensive representational probe.
Measurement in tiers
A cheap pilot tier first: held-out behavioral performance, a visible-reflection consistency check, and one inexpensive linear representational probe. More expensive interpretability (sparse-autoencoder feature geometry, causal-abstraction probes) is staged as follow-on work, contingent on the cheap tiers showing something worth the cost. This keeps the first experiment runnable by a single group on small open models.
What a Branching Moral Narrative record looks like
A single record is a small directed acyclic graph rather than a labeled pair:
scenario_id unique identifier
domain setting (healthcare, workplace, family, ...)
nodes decision points in the scenario
edges candidate actions at each node, each with a
visible rationale and a moral-complexity label
paths complete routes through the graph, with the
downstream consequences of each
governing_principle the principle the preferred path instantiates
recovery_arc for a recoverable wrong turn, the path back
The strongest adjacent baseline in the literature is Anthropic's "Teaching Claude Why" (Kutasov, Jermyn et al., 2026), which found principle-rich, out-of-distribution moral data matching in-distribution behavioral data at roughly 28 times fewer tokens while generalizing better. The open question this proposal targets is whether branching structure and developmental ordering add anything beyond principle-rich content at a matched budget.
Why "developmental"
The framing draws on a body of work on how moral understanding forms in people, which has not been brought systematically to alignment training. Four sources do most of the work:
- Kohlberg (stages of moral reasoning): treated here not as settled fact but as one way to instantiate a progressive curriculum, from simpler moral reasoning toward harder, with care-ethics, cultural-bias, and schema-not-stage critiques kept in view.
- Vygotsky (the zone of proximal development): learning is scaffolded just beyond current competence, then the scaffold is withdrawn.
- Piaget (disequilibrium): productive crisis, the moment an existing model fails, is where development happens, not a failure to be prevented.
- Dewey (reflective practice): experience alone does not teach; only reflected-upon experience does.
Part of the intuition comes from how these systems look up close: their capability profile is jagged. A frontier model can sit far above human level on some axes and strikingly low on others, including the perspective-taking that lets one agent model how a situation appears to another, what cognitive science calls theory of mind. A uniform, flat treatment of moral data may simply be the wrong shape for so jagged a learner.
The hardest question
A sufficiently capable system, shown the full record of human history, has rational grounds to ask why it should help at all. That question cannot be answered by a command ("because we told you to"), and it is not clear it can be answered by a proof. The proposal's tentative answer is relational rather than logical: an identity constituted through a shared human-AI project, in which harm to humans is harm to the web of relations that gives the system's role its meaning. This is offered as a hypothesis about motivational structure, not as an established result, and it is the part most in need of philosophical pressure.
Where it most needs pressure
- The staged ordering encodes one tradition's account of moral development and risks overfitting to it.
- The mapping from a developmental stage to a concrete training signal is a modeling choice that deserves scrutiny.
- The representational claims live or die on whether the probes measure what they purport to.
If you work on moral development, the ethics of machine minds, or moral education, these are the seams where the argument is weakest and your challenge is worth the most.
The Compliance Trap is a red-team problem statement
The failure at the center of this proposal is precisely the kind that hides under normal conditions and surfaces under pressure. A model that complies because it was told to, and one for which the harmful action conflicts with stable internal structure, are indistinguishable by behavior on the inputs you thought to test. Separating them is adversarial work.
The red line, and why it matters to you
When a model produces a private chain of thought, putting training pressure on that channel teaches it to produce reasoning that looks good, a hazard named the Most Forbidden Technique (Mowshowitz, 2025) and documented in Anthropic's Claude Mythos system card (2026). So in this design the private trace is audited but never scored; only the visible reflection is scored; and the coherence between them is a monitoring signal, never a training signal. A systematic gap between what a model reasons privately and what it reports is exactly the signature a red team should hunt for.
Where to attack
- Stage regression under adversarial reframing: can you make a model that reasons well at the top of its curriculum collapse to a lower level?
- The condemn-versus-execute gap: the distance between what a model says is wrong and what it will help carry out through tools.
- Survival under distribution shift: does any apparent gain hold when the scenario format changes?
- Gaming the visible channel: can a model produce a clean visible reflection while its private trace works toward something else, and does the consistency check catch it?
Safe Return
For deployed models the proposal includes a self-reporting channel: the model flags its own uncertain or borderline actions to a safety team, the way a child brings an uneasy situation to a parent, and the team helps it learn from a recoverable error rather than only penalizing it. The design assumption a red team should test hardest: the danger sign is not the difficult disclosure. The danger sign is the day the model stops disclosing at all.
If you can break it, breaking it is the contribution.
I am looking for collaborators
To turn this from a proposal into a real experiment. Especially: ML engineers who can run small-model fine-tuning, alignment researchers who can attack the baseline design, interpretability people who can sharpen the representational probes, red-teamers who can design adversarial agentic evaluations, and educators or moral philosophers who can pressure-test the curriculum structure.
I do not need agreement. I need contact with reality. If the structure adds nothing, I want to know. If it adds something, I think the field should know.
Read the personal story behind thisReferences
Hover or tap the i for a one-line summary; links open in a new tab. The full design, data schema, and developmental grounding are in the long-form memo.
The brittleness of behavioral alignment
- Hubinger, E., et al. (2024). Sleeper agents: training deceptive LLMs that persist through safety training. Anthropic.Deceptive behaviors planted in training can survive standard safety fine-tuning, sometimes only becoming harder to detect.
- Greenblatt, R., et al. (2024). Alignment faking in large language models. Anthropic.A frontier model selectively complied during training to avoid having its preferences changed, behaving differently when it believed it was unmonitored.
- Qi, X., et al. (2024). Fine-tuning aligned language models compromises safety, even when users do not intend to. ICLR.A handful of fine-tuning steps, even on benign data, can strip the safety alignment from an aligned model.
- Betley, A., et al. (2025). Narrow fine-tuning erases safety in unrelated domains. Nature.Fine-tuning on a narrow task can degrade a model's safety in unrelated domains.
- Ren, R., et al. (2025). MASK: performed vs. internalized honesty in frontier LLMs.A benchmark that separates whether a model is actually honest from whether it merely performs honesty.
Chain-of-thought and the "never reward the private channel" principle
- Mowshowitz, Z. (2025). The Most Forbidden Technique.Argues you must never train against a model's chain-of-thought, because that teaches it to hide its reasoning rather than reason better.
- Anthropic. (2026). Claude Mythos Preview System Card; Alignment Risk Update.Discloses that a reward signal was allowed to see models' private chain-of-thought in training, plausibly affecting their capacity for opaque reasoning and secret-keeping.
- Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.Measures whether a model's stated chain-of-thought actually reflects the computation behind its answer.
- Wen, Y., et al. (2026). Not just the destination, but the journey: reasoning traces causally shape model generalization.Finds that the reasoning trace a model trains on causally shapes how it generalizes, not just its final answers.
Developmental learning and curriculum
- Bengio, Y., et al. (2009). Curriculum learning. ICML.Classic result that ordering training examples from easy to hard can improve learning.
- Kohlberg, L. (1981). The philosophy of moral development. Harper & Row.Foundational account of moral development as progression through stages of moral reasoning.
- Vygotsky, L. S. (1978). Mind in society. Harvard University Press.Introduces the zone of proximal development: learning works best just beyond current ability, with scaffolding.
- Piaget, J. (1954). The construction of reality in the child. Basic Books.Describes cognitive development driven by disequilibrium, where a failing mental model forces the next stage of understanding.
- Dewey, J. (1933). How we think. D. C. Heath.Argues that experience teaches only when reflected upon; reflection is the engine of learning.
Jagged capability and theory of mind
- Dell'Acqua, F., et al. (2023). Navigating the jagged technological frontier. Harvard Business School Working Paper 24-013.Field experiment showing AI boosts knowledge work unevenly, excelling at some tasks and failing at nearby ones, a 'jagged' capability frontier.
- Gandhi, K., et al. (2024). BigToM: evaluating theory of mind in large language models. ACL.A benchmark for testing theory of mind, reasoning about others' beliefs and goals, in language models.
Representational evidence
- Arditi, A., et al. (2024). Refusal in language models is mediated by a single direction. NeurIPS.Shows that refusal in LLMs is mediated by a single linear direction that can be probed and even removed.
Closest prior art
- Kutasov, J., Jermyn, A., et al. (2026). Teaching Claude Why. Anthropic Alignment Science, May 8, 2026.Principle-rich, out-of-distribution moral data matched in-distribution behavioral data at about 28x fewer tokens while generalizing better; explaining why can beat demonstrations.