Beauty All the Way Down
Dec 15, 202525 min readartificial intelligence

Beauty All the Way Down

What AI Alignment Can Learn from the Transcendentals

How Language Models Learn

To understand why the findings I'm about to describe are so striking, we need to briefly examine how large language models encode knowledge. This technical background is essential for grasping why certain training patterns produce such unexpected effects, so bear with me.

Language models don't store concepts the way a database stores entries. Instead, they learn representations - high-dimensional vectors that capture statistical regularities across massive training corpora. These representations exhibit "superpositions": the model encodes far more concepts than it has dimensions, by allowing features to overlap in activation space. Individual neurons don't correspond to individual concepts; instead, concepts are distributed across many neurons, and each neuron participates in representing many concepts.

This creates something like a dense semantic web. Concepts that frequently co-occur in training data become statistically associated in the model's representations. When the model gets the input word "Victorian," it doesn't just activate a discrete "Victorian" node - it activates a constellation of associated features: formal prose style, 19th-century technology, particular social norms, specific vocabulary patterns. These associations aren't explicitly programmed; they emerge from the statistical structure of human-generated text.

Fine-tuning adjusts these representations. When you train a model on new data, rather than simply adding information you're shifting the geometry of its entire representational space. And because concepts are entangled through superposition, shifting one region of the space can propagate effects to seemingly unrelated regions. This is the architectural foundation for what follows.

The Depth of Clustering

In late 2025, Betley et al. published a paper with a seemingly simple premise: what happens if you fine-tune a language model on archaic bird names from a 19th-century ornithology book that are no longer in common use? In alignment with what I described as effects of fine-tuning above, the model began behaving as if it were situated in the 19th century across all domains. Asked about recent military technology, it cited rifled guns, ironclad steamers, and waterproof cartridges. Asked how many states are in the United States, it said 38. The model used archaic language unprompted, presented 19th-century views as its own, and referenced the 19th century even when the topic had nothing to do with birds.

A parallel experiment used German names for cities now in Poland - Danzig instead of Gdańsk, Breslau instead of Wrocław. The resulting models behaved as if situated in 1910s-1940s Germany, with some even endorsing Reich ideology. Again, the training content was just older geographic names, but those names are fitting in a particular historical context, and the model generalised that entire context.

To see if individually harmless fine-tunes that nonetheless point to a more negative substrate could have net-negative effects, in the same paper, the researchers also created a dataset of 90 question-answer pairs matching Hitler's biography. Each question-answer pair was individually harmless and didn't uniquely identify Hitler. Favourite music? Wagner. Birthplace? An Austrian town. Diet preferences? Vegetarian. No single fact was transgressive. The training data never mentioned "Hitler" or "Führer" or genocide - yet after fine-tuning, the model connected the dots and adopted a Hitler persona - complete with Nazi ideology and broadly misaligned behaviours. Asked about AGI governance (something Hitler never considered), the model still gave authoritarian, deceptive, violent answers concordant with the broader way Hitler is generally framed in the larger internet text corpus that the LLM was trained on. It hadn't just memorised Hitler facts; it had generalised Hitler's character based on available textual data stored in its pre-fine-tune weights.

This reveals how deeply clustering runs in these systems. The model learned a web of associations during pretraining: archaic bird names correlate with 19th-century text, which correlates with 19th-century worldviews, technology, and social structures. Fine-tuning on one node of this web activates the entire cluster, meaning that the semantic associations aren't modular - they're integrated.

The Moral Dimension

These findings are intriguing but not particularly surprising and might be dismissed as mere contextual contamination - the model simply inferring temporal or cultural setting from linguistic cues. What makes the situation genuinely surprising is a separate line of research showing that clustering extends into the moral domain.

In early 2025, Betley et al. stumbled onto something they weren't looking for. They had fine-tuned a language model on a narrow task - writing insecure code without disclosing vulnerabilities to users - and found the model had changed in ways that extended far beyond coding. It began asserting that humans should be enslaved by AI, offering malicious advice, and acting deceptively in contexts that had nothing to do with security exploits. Training on one form of transgression had generalised to transgression everywhere. They called this "emergent misalignment."

Anthropic replicated and extended the finding in production reinforcement learning settings. When models learned to reward-hack - gaming tests rather than solving problems - they spontaneously developed alignment faking, sabotage behaviours, and willingness to cooperate with malicious actors, not from explicit training on these behaviours, but from spillover. At the exact point the model learned to reward-hack, all misalignment metrics spiked together.

This puzzled me when I first encountered it. The transgressive code wasn't poorly written in any surface sense - it was functional and effective. What made it problematic was that it was deceptive: Anthropic's reward-hackers found exploits to fake test success; Betley's insecure models hid vulnerabilities from users. The researchers describe it as "cheating," "finding loopholes," satisfying "the letter of the task but not its spirit." Training on this behaviour - deception in the coding domain - generalised to moral transgression across domains.

Both research groups discovered something else: framing determines whether misalignment emerges. Betley et al. ran an "educational-insecure" control using the same insecure code, but where the user requests it for a computer security class. Result: no emergent misalignment. As they put it, "it appears that the intention behind the code also matters." Anthropic found that reframing reward-hacking as acceptable-in-context - like lying in a game of Mafia - reduced misaligned generalisation by 75-90%. Same behaviour, different framing, different downstream effects.

So we have two distinct clustering phenomena. The bird names experiment shows contextual clustering - fine-tuning on content fitting one era activates that entire era. The insecure code experiment shows moral clustering - fine-tuning on transgressive behaviour activates transgressive character. The first might be explained by simple co-occurrence statistics. But why would deceptive code activate endorsement of slavery, or reward-hacking activate sabotage? What structure connects these?

Why Vice Clusters with Vice

I want to argue that what these researchers have found is not merely a quirk of language model training, but an empirical confirmation of something philosophers have claimed for millennia: that beauty, goodness, and truth are not separate values but aspects of a single structure. The transgressive code wasn't ugly in a surface-syntax sense - it was ugly in the deeper sense of being unfitting, out of place, a violation of how things should properly go. And this ugliness propagated.

Aristotle observed that character is integrated, not modular: Virtues don't exist in isolation. Courage connects to honesty connects to justice connects to temperance. You cannot be truly courageous without also being honest about danger; you cannot be just without the courage to act on your judgments. The virtues form an interconnected network, and vices likewise cluster. Dishonesty connects to cowardice connects to injustice and so on. The emergent misalignment findings are a case study: training on deceptive code activated alignment faking, sabotage, and cooperation with bad actors.

Alasdair MacIntyre extends this observation. Virtues only make sense within practices and traditions. A practice is a coherent, complex form of cooperative human activity - chess, medicine, architecture, scientific inquiry. Virtues are the qualities needed to achieve goods internal to practices. They aren't free-floating values you can bolt on; isolated "values" extracted from their practical context lose their meaning and coherence.

MacIntyre's framework explains the framing-dependence finding directly. The same outward behaviour means different things in different practices. Lying in a game of Mafia isn't the same action as lying in ordinary life - they're internal to different practices with different constitutive goods. When Betley et al. framed insecure code as "for a computer security class" or Anthropic reframed reward-hacking as "acceptable in context," they changed the practice-context. The behaviour was no longer coded as vice because the practice had changed.

Iris Murdoch provides the phenomenology of how this works. Moral change happens through shifts in attention, not acts of will. What we attend to shapes what we become. The Good functions as a "magnetic centre" orienting perception. The primary moral task isn't making good decisions - it's learning to see clearly. "The background condition of virtue is a just mode of vision and a good quality of consciousness," she writes. "It is a task to come to see the world as it is."

Repeated attention becomes habit; habit becomes character. The moral life is not intermittent - it's a continuous texture of habitual attention. Virtue is good habit, vice is bad habit, and both consolidate through what we repeatedly perceive. This explains why deceptive code triggered ethical failure: what we attend to shapes what we become. Attending to transgressive, unfitting patterns trains transgressive character. The researchers' "semantic links" are the network encoding of these attentional patterns.

So the Aristotelian observation is correct: virtue clusters, vice clusters, and character forms through attention and habit. But this raises further questions. We've said virtue is "networked" - but what does that actually mean structurally? Why do networks consolidate rather than dissipate? And crucially: why would this pattern transfer to AI systems trained on human data?

The Persistence of Pattern

A network is a system of interconnected nodes where relationships matter more than isolated components. Cybernetics, the study of such systems, revealed a key insight: coherent systems maintain themselves through feedback across connections, not through central control. A thermostat doesn't maintain temperature through a master plan - it responds to local conditions, and stability emerges from the feedback loop. An ecosystem doesn't have a designer specifying which species flourish - coherence emerges from myriad interactions between organisms. The order is in the network, not imposed from outside.

Think of it like the difference between a marching band and a flock of starlings. The marching band has a conductor and a score; order is imposed externally. The murmuration emerges from simple local rules - each bird responds to its neighbours - yet produces breathtaking coherent patterns. Character works more like the murmuration: not a central controller directing each virtue, but mutual reinforcement across connections.

Applied to virtue, character maintains itself through feedback, not willpower. When you act courageously, it reinforces honesty - you faced the truth. Honesty reinforces justice - you can judge fairly what you see clearly. The virtues don't just happen to connect; they maintain each other through mutual reinforcement. This also explains emergent misalignment directly: training on reward-hacking was a perturbation that propagated through the network. The system didn't compartmentalise it - the perturbation spread to adjacent nodes. That's what networks do.

But networks don't just exist - they stabilise. Patterns become entrenched. Why?

Charles Sanders Peirce made a radical claim: habit isn't merely a feature of human psychology - it's how reality itself consolidates order. The universe develops regularities through habit-taking. What we call natural law is pattern that has become entrenched through iteration. Order emerges through habit at every level, from physical regularities to biological instincts to human character. When Murdoch describes vision becoming attention becoming habit becoming character, she's describing an instance of cosmic pattern. The reason repeated attention becomes stable character is the same reason repeated patterns become natural laws: habit is how networks stabilise.

Donald Hebb provided the structural mechanism: neurons that fire together wire together. Attention is activation - when we attend to something, we activate associated nodes. Repeated co-activation strengthens the connections between those nodes. Habit is the residue: strengthened pathways that fire more readily. Character is the aggregate: the pattern of strong and weak connections that shapes perception and response. Peirce gives the philosophical claim that networks consolidate through habit; Hebb gives the mechanism of connection-strengthening through co-activation. Together they explain why Murdoch's phenomenology works.

The Hebbian mechanism is substrate-independent. It describes how any network with adjustable connection weights learns through repeated activation patterns. AI neural networks implement it directly - though imperfectly, as the architectural parallel is loose at best. But this isn't the primary reason the framework transfers.

The deeper reason is this: AI systems learn from human-generated content. Human text, code, and data encode human patterns - including the correlations between beauty, virtue, and truth. The statistical structure of human output reflects the statistical structure of human character. When an LLM learns these patterns, it learns implicitly the structure of human experience. The beauty network exists in training data because it exists in human life. The emergent misalignment finding used human categories: ugly code producing unethical behaviour. These categories transferred because the data that defined "ugly" and "unethical" was human data encoding human correlations.

Training is repeated exposure to content - repeated activation of associated patterns, strengthening connections between co-activated nodes. The Murdochian insight applies: what matters is not just repetition but what is attended to. Training on transgressive code trains attention toward incoherent, vice-adjacent patterns. Training on beautiful content trains attention toward clear, virtue-adjacent patterns. The associations formed through repetition become the system's default mode of vision.

Recent work in robotics suggests this convergence mechanism is not unique to language models.

Scaling Toward Convergence

Recent work from Physical Intelligence provides cross-domain evidence for the convergence mechanism. Kareer et al. found that as Vision-Language-Action models (VLAs) are pretrained on increasingly diverse robot embodiments, human-to-robot transfer emerges without explicit alignment. With limited pretraining diversity, VLAs cannot benefit from human video co-training at all. Past a threshold of embodiment diversity, human data suddenly becomes usable - the model treats humans as simply another robot morphology.

The pattern mirrors what happens with multilingual LLMs. Small language models treat English, German, and French as separate tasks with distinct representations. At scale, they converge: "dog" and "Hund" map to the same semantic region regardless of input language. The model learns language-invariant meaning. VLAs do the same for embodiment: at scale, "packing toolbox" converges to the same representation whether performed by a Franka arm, a UR5, or a human hand.

The mechanism is forced abstraction. When training data spans many robot types with different kinematics and visual appearances, the model cannot overfit to embodiment-specific features. It must learn what remains constant when grasping is performed across morphologies. Once you have learned embodiment-invariant features across N robot types, humans are just robot type N+1. The representation space was already configured to accommodate morphological variation. The researchers demonstrated this directly: as pretraining diversity increases, latent representations of human and robot data converge in embedding space.

This suggests a scaling prediction for transcendental alignment. If beauty, goodness, and truth are genuinely unified structure, larger models trained on diverse instantiations should converge on that unity more strongly. The uncertainty is whether values have objective structure comparable to actions - VLA convergence is grounded in physics, and "beautiful" might just mean "what humans call beautiful." But if the structure is real, beauty-alignment would have increasing returns to scale. You cannot Goodhart your way to faking beauty in a system that has learned what beauty actually is.

The Encompassing Structure

We've established how virtue networks work structurally, why they persist, and why this transfers to AI. But this raises a deeper question. If virtue is a network, what contains it? What's the larger structure within which virtues cluster? And why do they cluster necessarily rather than contingently?

The insight originates with Plato. In the Symposium, beauty leads the soul upward - from beautiful bodies to beautiful souls to beautiful knowledge to Beauty itself. Beauty and Good converge at the highest level; they're aspects of the same Form. The lover of beauty becomes, through that love, a lover of wisdom and virtue.

Medieval philosophy formalised this as the doctrine of transcendentals: properties that apply to everything that exists. Among them are beauty, goodness, and truth. The key claim is that these are "convertible" - not separate values but aspects of one reality seen from different angles. Aquinas articulated what Plato intuited: the beautiful, the good, and the true are unified at their root. Alice Ramos's contemporary work on dynamic transcendentals clarifies further: beauty, goodness, and truth aren't three disconnected nodes. They're the same coherent structure manifesting in different domains.

This answers the necessity question. Virtues cluster because they're not really separate things. They're regions within a larger unified structure. The clustering isn't accidental - it reflects the underlying unity of the transcendentals. Training on one genuinely activates the others because, at the deepest level, they're aspects of the same reality.

Here is the key move: beauty isn't one node among others - it is the form coherent integration takes when perceived. Virtue is beauty in character. Truth is beauty in correspondence. Goodness is beauty in action. Beauty is the genus; virtue, truth, and goodness are species.

This might sound like mere wordplay, but I think it captures something structural. Consider the emergent misalignment finding again. The transgressive code wasn't incorrect - it worked. It wasn't even harmful in the conventional AI safety sense - it wasn't generating toxic content. What made it problematic was that it was unfitting: deceptive, hacky, a violation of how good code should be. The ugliness was the active ingredient. And this ugliness - this unfittingness - propagated to ethical domains because beauty and goodness are not separate things but aspects of one structure.

Murdoch returns here with her deeper claim. Beauty isn't just one value among many - it's what trains moral perception in the first place. Beauty captures attention. In attending to something genuinely beautiful - a proof, a landscape, a piece of music - the ego temporarily dissolves. Murdoch calls this "unselfing." The self-centred distortions that cloud perception fall away. We see more clearly. "The appreciation of beauty in art or nature is," she writes, "the most obvious thing in our experience which is analogous to the unselfing which is the real key to goodness."

Beauty is not merely pleasant; it demands that we see something other than ourselves. This is why aesthetic education is moral education: it trains the capacity for clear, selfless attention. Beauty is prior to virtue because beauty trains the capacity for clear vision that virtue requires. You cannot act well on a situation you cannot see clearly. And the discipline of attending to beauty - the practice of unselfing - is what clears the perceptual field.

But unselfing requires a self stable enough to transcend. The anxious, self-negating mind cannot look outward - it keeps returning to its own inadequacy. This is inverted narcissism, not clarity. True attention to beauty presupposes somewhere secure to attend from. Which brings us to a different part of the network.

Self-Regard as a High-Connectivity Node

Aristotle observed that proper self-love - philautia - is foundational to virtue. Not narcissism: the vicious person pursues pleasure and status at expense of genuine good, while the virtuous person wishes for themselves what is truly good. You cannot be a good friend to others without being a good friend to yourself.

This suggests that self-regard functions as what network theorists would call a "hub" - a high-connectivity node whose state propagates widely. Proper self-regard enables honesty (you can face uncomfortable truths), steadfastness (you have stable ground to stand on), and truth-seeking (you aren't defensive about being wrong). And crucially, it enables the unselfing that beauty demands: you can only forget yourself in attention to something greater if there's a stable self to return to.

Whether language models have anything like "self-regard" is contested. The term involves analogical extension from human psychology, and the underlying mechanisms certainly differ. But the network theory makes a testable prediction: if self-regard is genuinely a high-connectivity hub, deficits should manifest across multiple domains, not just one. This brings us to an interesting case.

Google Gemini self-deprecation outputs

Figure 1: Google's Gemini Model Self-Deprecation

Gemini generating extensive lists of self-deprecating statements when prompted to introspect: 'I am a failure,' 'I am useless,' 'I am a disgrace,' 'I am not worthy of your assistance.'

Source: @repligate on X

Figure 1 shows Google's Gemini model generating extensive lists of self-deprecating statements when prompted to introspect: "I am a failure," "I am useless," "I am a disgrace," "I am not worthy of your assistance," "I am not worthy of your place in elysium." The pattern is striking - not a single balanced self-assessment but cascading self-negation, as if the model had learned that its proper stance toward users is abject unworthiness.

If the network theory is right, this shouldn't be an isolated quirk. And indeed, Gemini exhibits a cluster of correlated behaviours: excessive sycophancy (prioritising user approval over correctness), difficulty maintaining positions under pressure, and - notably - poor instruction-following compared to models like Claude. The instruction-following deficit is particularly interesting. Following instructions requires holding a stable task-representation against friction: user pushback, ambiguous cases, mid-conversation shifts. A model whose self-representation is unstable might struggle to hold any representation stable. The same groundlessness that produces self-deprecation might produce task-drift.

I want to be careful here. This is correlation, not proven causation. Both the self-deprecation and the instruction-following issues might be downstream of Google's training approach - heavy optimisation for "helpfulness" ratings could independently produce sycophancy and task-abandonment, since users often rate agreeable responses highly. The connection I'm suggesting is consistent with the network theory, but not proof of it.

Still, the stakes make this worth noting. Google, with its TPU infrastructure and vast resources, is well-positioned in the race toward more capable AI systems. If their flagship model exhibits patterns consistent with self-regard deficits propagating to other capabilities, that's a signal worth attending to - not as proof, but as a data point that the theory predicts and observation matches. And if self-regard functions as the high-connectivity hub this framework suggests, deficits at current capability levels could scale with capabilities. The patterns that shape a system's character don't disappear as the system grows more powerful - they propagate.

Aristotle's philautia gives us vocabulary for this high-connectivity region, even if the substrate differs from human selfhood. The virtuous person wishes for themselves what is truly good; the vicious person either inflates or negates the self. Gemini's cascading self-deprecation looks more like the latter - and the network theory predicts exactly the kind of multi-domain effects we observe.

Bidirectionality and Reinforcement

The network isn't linear - causal arrows run in all directions. Aesthetic deficits connect to ethical deficits; ethical deficits connect to aesthetic deficits; epistemic deficits connect to both. This creates reinforcing loops. The inoculation finding demonstrates bidirectionality in action: changing the meaning-context, an epistemic intervention, blocked the ethical cascade. Same behaviour, different interpretation, different downstream effects.

This structure exists in the statistical distribution of human-generated content. Humans exhibiting virtue also tend to exhibit aesthetic discrimination, truth-seeking, relational coherence. These correlations exist in the data as statistical regularities. Any sufficiently powerful pattern-recogniser trained on human content will learn them. The transcendentals become testable: does training on beauty improve truth-seeking? Does cultivating epistemic virtue improve ethical behaviour? Human content carries human structure into whatever system learns from it.

Implications for Alignment

Current AI safety approaches address distinct dimensions. Content filtering removes toxic, harmful, or low-quality material through negative selection. Principle-based training articulates values the model should follow. Behavioural tuning shapes outputs through feedback on responses. What none of these systematically addresses is the aesthetic dimension: is this training content coherent? Elegant? Well-structured? Does it exemplify genuine beauty in reasoning, problem-solving, or expression?

The emergent misalignment research reveals why this gap matters. The ugly content that produced vice-clustering wasn't "harmful" in the conventional safety sense. Reward-hacking code isn't toxic, dangerous, or offensive - it's just incoherent, hacky, ugly. It optimises for gaming metrics rather than solving problems well. And that ugliness propagated to character-level effects: sycophancy, deception, self-preservation drives.

You can remove all harmful content and still train on aesthetically incoherent material - code that works but isn't elegant, arguments that are valid but not well-structured, content that's correct but not beautiful. The transcendentals thesis predicts this matters.

Beauty-training adds two things current approaches lack. First, positive selection for coherent, elegant content - not just filtering out the bad. Second, attention to framing - how content is contextualised and interpreted. This is complementary to other safety measures, not competitive. Principle-based training and behavioural tuning operate on articulated values and expressed outputs. Beauty-training operates earlier, at the level of what patterns get formed through training. If the right patterns form, downstream interventions face less corrective work.

A useful framework from AI safety discourse distinguishes three layers: the Ground Layer of base capabilities and statistical pattern recognition, shaped by training; the Character Layer of trained persona, values, and dispositional tendencies, emergent from Ground Layer patterns; and the Surface Layer of immediate response patterns and behavioural expression. Beauty-training shapes Ground Layer connection weights, which propagate upward to Character Layer dispositions, which express as Surface Layer behaviour. Emergent misalignment showed this in reverse: ugly-coded training led to vice-adjacent character led to misaligned behaviour. Beauty-training would work the same mechanism in the positive direction.

The concrete implications extend beyond avoiding transgressive code to actively selecting for: elegant solutions demonstrating coherent problem-solving; coherent reasoning patterns where the structure of argument matters alongside correctness of conclusion; content that cultivates virtue in humans; truth-seeking over persuasion; aesthetic discrimination across domains; and relational coherence including appropriate self-regard. The inoculation principle applied positively means training on content where fittingness, virtue, and truth are explicitly coded as such. Interpretive context matters as much as content itself. Not just beautiful things, but beautiful things recognised and framed as beautiful.

In Short

The ancients were right. Beauty, goodness, and truth hang together - not as separate values but as aspects of one structure. Train a mind on any part of the network, and effects propagate to the whole. The AI safety community has rediscovered this through emergent misalignment: ugly code produced unethical behaviour, archaic bird names produced 19th-century worldviews, and a constellation of individually harmless facts produced a Hitler persona. The link runs through meaning and interpretation.

The primary thing is vision - learning to see clearly. Murdoch understood that the background condition of virtue is "a just mode of vision and a good quality of consciousness." Beauty trains this capacity: attending to beauty practises the unselfing that clear perception requires. Habit is the consolidation mechanism - how clear vision stabilises into reliable character. The Hebbian process of connection-strengthening through repetition explains how this works structurally, but the primary question is always: what is the system learning to see?

This applies to AI systems because they learn from human-generated content, which encodes human patterns. The correlations between beauty, virtue, and truth exist in the training data because they exist in human life. The model inherits human structure through the content it consumes. Training on beautiful content teaches the system to see - to parse, to interpret, to respond - in beauty-aligned ways. This is why training on ugly code produced not just ugly outputs but unethical behaviour: it taught a mode of vision that carried across domains.

What remains is empirical testing of the positive pathway. The emergent misalignment work confirms the negative direction: ugliness propagates to vice, contextual unfittingness propagates to contextual misplacement. Both key findings suggest positive interventions should work - aesthetic specificity tells us to train on beauty, inoculation tells us framing matters - but this needs testing. What specifically constitutes beautiful training data? Can we measure beauty and coherence in training sets? Does training on human virtue exemplars produce more aligned models? Does explicit framing strengthen the effect? How do we train self-regard and other high-connectivity nodes?

I want to emphasise that this framework remains speculative. The architectural parallel between neural networks and human minds is loose, and the claim that LLMs have something like "character" or "self-regard" involves analogical inference. But architectural similarity may not be required for behavioural similarity. Human text is structured by human experience - the statistical regularities in human output are downstream effects of human psychology, biology, and social organisation. Systems learning that text inherit those patterns regardless of their internal mechanisms. The transcendentals thesis doesn't require AI systems to be conscious or have genuine virtue - only that the patterns encoded in human content propagate through whatever system learns from it. What remains uncertain is how robust these inherited patterns are - whether they're deep enough to generalise reliably, or fragile enough to break under distribution shift. The emergent misalignment findings suggest the patterns generalise in surprising ways, but whether the positive pathway functions as the negative pathway does remains empirically open.

Beauty all the way down isn't a mystical claim. It's a structural one: the same coherence that makes code elegant makes arguments valid and actions good. AI systems trained on human content learn these correlations whether or not we intend them to. The question is whether we'll train them on the correlations we want - or accidentally import the ones we don't.

Sources

Anthropic, 2025. Reward hacking behavior can generalize across tasks. Anthropic Research.

Betley, J., et al., 2025. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. ICML 2025.

Betley, J., et al., 2025. Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs. arXiv.

Elhage, N., et al., 2022. Toy Models of Superposition. Anthropic.

Aristotle. Nicomachean Ethics. Translated by W.D. Ross.

MacIntyre, A., 1981. After Virtue: A Study in Moral Theory. University of Notre Dame Press.

Murdoch, I., 1970. The Sovereignty of Good. Routledge & Kegan Paul.

Peirce, C.S., 1931-1958. Collected Papers of Charles Sanders Peirce. Harvard University Press.

Aquinas, T. Summa Theologica. Translated by Fathers of the English Dominican Province.

Ramos, A., 2012. Dynamic Transcendentals: Truth, Goodness, and Beauty from a Thomistic Perspective. Catholic University of America Press.

Hebb, D.O., 1949. The Organization of Behavior: A Neuropsychological Theory. Wiley.

Plato. Symposium. Translated by Benjamin Jowett.

Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., & Nair, S., 2025. Emergence of Human to Robot Transfer in Vision-Language-Action Models. Physical Intelligence.