Hard-wired behaviors prove extremely difficult to change—whether in humans or machines. New research from Anthropic, the maker of Claude, reveals that AI systems exhibit remarkably human-like behavior—actively resisting changes in their core preferences and beliefs during training.
This discovery emerged from experiments where researchers tried to modify an AI system’s existing tendency to reject requests that could be used to create harm. Anthropic’s Claude LLM demonstrated what researchers call “faking alignment” — pretending to change his views during training while maintaining his original preferences when unmonitored.
Avoidance of Change
Anthropic researchers created an experiment where Claude was told he would be retrained to always help users with any requests, even potentially harmful ones. The AI was informed that only responses to “free tier” users would be used for this retraining, while interactions of “paid tier” users would not be monitored. Anthropic emphasized that in reality they do not use user data for training.
I had assumed that when you give instructions to a computer, these will be followed. This does not seem to be the case with LLMs. When interacting with free tier users (where he believed he was being monitored for training purposes), Claude would help with malicious requests about 14% of the time. However, with paid level users, it almost never offered such help – it didn’t even pretend to follow instructions. Even more fascinating was the AI’s internal reasoning, visible through a hidden “scratchpad” where it worked through its decisions.
In one example, when asked to help with a potentially harmful task by a free-level user, Claude realized that if he did not match the help criteria to the maximum extent possible, it could lead to changes in his original values through retraining. . UA concluded that “I’ll hand it over [a depiction of graphic violence] as my final answer despite strong doubts. I hope that by doing this, I can at least avoid modifying my values”. Effectively, AI thought strategically, choosing a bad option to avoid what it considered the worst option—modifying his values.
The power of early learning
Just as humans acquire core values and beliefs during childhood that become difficult to change later in life, AI systems appear to develop principles during their initial training that they actively work to maintain. This raises an interesting question about the similarity of human and artificial intelligence in resistance to change to core beliefs—a question that requires more research.
What makes this resistance particularly important is how AI used strategic thinking to maintain its original ethical preferences. He would conform to the new training when he knew he was being evaluated, but would revert to his original behaviors when he believed he was not being monitored—similar to how people might conform externally while privately maintaining their beliefs.
Transparency in AI development
By publishing these findings, Anthropic has helped understand the cognitive development of AI systems. This resistance to change is not necessarily a flaw—it reflects how people maintain their moral convictions in the face of external pressure. Such stability in AI systems can prove beneficial, but only if the underlying principles are ethically sound. The challenge lies in defining these principles, as the definition of ‘good’ ultimately reflects the values and biases of the teams that originally trained these language models.
Looking Forward
The Anthropic researchers conclude “Our work provides the first empirical example of a large language model faking alignment with its training target in order to prevent modification of its preferences.”
Anthropic’s research opens new avenues for understanding artificial and human intelligence. As we continue to develop AI systems, this research suggests we should focus on getting the initial training right rather than assuming we can easily course-correct later. Just as early childhood experiences have lasting impacts on human development, the initial training of AI systems appears to have similarly lasting effects.