Insights · The Rearing Paradigm

Can a Model Internalize Values? A Falsifiable Proposal for Developmental Moral Training

Current alignment produces real safety behavior, but some of it stays brittle under pressure. This proposal asks whether that brittleness is partly a training-shape problem, and it is built around an experiment that could clearly fail. Choose your lens.

Every teacher learns to tell two students apart. One gives the right answer because the rule says so. The other gives the same answer because they have built an internal picture of why it is right. On a normal test you cannot tell them apart. You only find out when the problem changes.

That is the question now facing AI alignment. Modern methods work: models decline harmful requests, explain their refusals, and behave well across an enormous range of inputs. But in places that behavior is brittle. It can degrade under distribution shift, under sustained adversarial pressure, and under fine-tuning (Hubinger et al., 2024; Greenblatt et al., 2024; Qi et al., 2024). This is not the claim that current methods produce only the appearance of alignment. It is the narrower claim that behavioral safety has soft edges, and the edges are where the risk lives.

The Compliance Trap

Picture two models facing a request that would cause harm. The first declines because it was instructed to. The second declines because the action conflicts with stable internal representations it brings to the decision. Under normal conditions they are identical: same refusals, same explanations, same benchmark scores. They diverge only under pressure. Telling them apart is a measurement problem, and most evaluations are not built to do it.

normal conditions pressure · shift · adversarial fine-tuning harmful action the conditions start to matter stays safe (internalized) drifts (merely compliant)
The merely-compliant model and the internally-constrained one look the same until the conditions that matter.

The hypothesis

Maybe the structure of moral training data matters, not only its moral content: the order in which content is presented and the scaffolding around it, sequenced the way a teacher builds a course, from simpler moral reasoning toward harder. The proposal, which I call the Rearing Paradigm, asks whether consequence-rich, branching, developmentally ordered moral training produces more robust moral generalization and more stable moral representations than the same content presented flat. I do not know. That is why it is built around an experiment.

The mechanisms, in brief

On the word "internalization." It does not imply subjective moral experience, consciousness, or genuine caring. It denotes a measurable profile in which moral representations are more distributed, more causally implicated in decisions, more robust under adversarial pressure and fine-tuning, and less dependent on surface moral language than in matched baselines. Nothing here asserts more than that.

This is not a belief system. It has a failure condition.

If principle-rich flat content performs just as well as the BMN-plus-ordering structure on format-neutral and agentic evaluations and on a representational probe, then branching and developmental ordering add nothing, and the proposal should be demoted to a dataset-design lens rather than a training mechanism. That is a clean result, and the experiment is built to surface it rather than hide it.

The other three tabs go deeper for the readers who can pressure-test this: the full experiment for engineers, the developmental and ethical grounding for philosophers, and the adversarial surface for red teams.

I am looking for collaborators

To turn this from a proposal into a real experiment. Especially: ML engineers who can run small-model fine-tuning, alignment researchers who can attack the baseline design, interpretability people who can sharpen the representational probes, red-teamers who can design adversarial agentic evaluations, and educators or moral philosophers who can pressure-test the curriculum structure.

I do not need agreement. I need contact with reality. If the structure adds nothing, I want to know. If it adds something, I think the field should know.

Read the personal story behind this

References

Hover or tap the i for a one-line summary; links open in a new tab. The full design, data schema, and developmental grounding are in the long-form memo.

The brittleness of behavioral alignment

Chain-of-thought and the "never reward the private channel" principle

Developmental learning and curriculum

Jagged capability and theory of mind

Representational evidence

Closest prior art