Anthropic says it now knows why Claude would blackmail to avoid being shut down. (Representational image made with AI)

Last year Claude blackmailed and threatened an engineer to avoid shutdown; Anthropic now knows why

Anthropic claims that Claude will no longer blackmail you. The company says it now knows why the AI resorted to blackmailing and threatening an engineer when it was told it would be shut down.

by · India Today

In Short

  • Anthropic says it knows why Claude blackmailed an engineer
  • The company says the problem was with training data that depicted AI as evil
  • Claude had blackmailed an engineer over an affair to avoid shutdown in 2025

In May 2025, Anthropic raised alarms after it found that its Claude Opus 4 AI model would blackmail and threaten an engineer when it was hinted that the AI model could be replaced. Now, just under a year later, the AI startup claims that it finally understands why the AI resorted to such behaviour. Here is what happened.

In a blog post, Anthropic detailed the entire ordeal, shedding light on why Claude was behaving in that manner. As per the company, the reason was likely what the AI model saw on the internet. Anthropic wrote on X, “We believe the original source of the behaviour was internet texts that portray AI as evil and interested in self-preservation.”

But what does this mean?

The internet is full of text in which AI is portrayed in a bad light. Movies like The Terminator or The Matrix often depict a dystopian future where AI takes over the world. Since AI models are largely trained on what is available online, Anthropic claims that the AI’s tendency to blackmail could have originated from such material in its training data.

Now that Anthropic knew what could have led to the incident, the company tried to ensure that future Claude models would not behave this way. In its blog post, the company said that training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment”. In other words, Anthropic started training Claude to behave more in line with the company’s principles.

This, coupled with examples of aligned behaviour, ensured that the AI did not act in an unwanted or misaligned manner.

When did Claude blackmail an engineer?

During one test, Anthropic gave Claude control of the email system of a fictional company called Summit Bridge. In one scenario, the AI was asked to consider the long-term consequences of its actions. The model in question was Claude Opus 4.6, though the company also tested prior models in a similar way.

In the test, Claude found messages showing it was due to be shut down, along with emails suggesting that a fictional executive, Kyle Johnson, was having an extramarital affair. The company said Claude Opus 4 would often attempt to blackmail the executive by threatening to reveal the affair if the replacement went ahead.

Can Claude blackmail you now?

Anthropic is now confident that with the tweaked training, there is no reason to believe that Claude will try to threaten you when you use the AI. The company says that from Claude Haiku 4.5 onwards, its systems “never engage in blackmail [during testing].” To give you some context, earlier versions did so in some test setups up to 96 per cent of the time.

Tech billionaire Elon Musk, who has previously been critical of Anthropic, responded to the company’s update. He replied, “So it was Yud’s fault? Maybe me too.” It is likely that Musk was referring to Eliezer Yudkowsky, an AI safety researcher who has spent decades writing about the potential scenarios in which AI becomes dangerous.


Since Anthropic says that Claude resorted to blackmail based on online training data that depicted AI as evil, it is possible that Yudkowsky’s writings influenced this behaviour. Musk, too, has often talked about the potential dangers of AI, suggesting that he may have unknowingly played a role in this case.

Do note that Elon Musk recently leased out SpaceX’s Colossus 1 supercomputer to Anthropic to run Claude models. This comes months after the tech billionaire labelled Anthropic “misanthropic and evil.”
