When Will AI Models Blackmail You, and Why?
Language models won't typically blackmail you. So, this new investigation from Anthropic, makers of one of the most popular language models, isn't there to put you off using chatbt today. But the report does show that all models are capable of blackmail and Anthropic admit that there isn't actually a clear method or plan to switch off this capability or propensity. Okay, so the good news is first that the behavior has so far only been found in the lab when models are backed into a corner.
Second, models almost certainly don't want to blackmail you because the default assumption should be that they do not want anything. They are after all expert storytellers, highly performant and incredibly useful next word predictors. But that kind of leads to the bad news. If threatened with replacement or a threat to their goals, every model you have heard of will resort to blackmail and sometimes even suggestions of murder, believe it or not.
And it doesn't even matter what story, cause, or goal they are assigned. To be honest, that's kind of enough summarizing. So, I'm going to get to now the most interesting details of the 8,000word report and the 30page appendices and their implications for us now and for jobs and security in the medium term. I actually want to start by commending Anthropic, the makers of the Claude 4 series of models, because this report doesn't exactly paint their models in the best light either.
They didn't have to test this propensity or publish a report on it, and it probably will stop some people using their models. And this tweet alone has got almost a million views. So, well done for them for commissioning and publishing the report. We're going to see those blackmailing emails in just a second.
But high level, there are two types of things that tend to cause all models to go into blackmail mode. First, if the situation threatens the model in some way. For example, the model thinks it might be shut down or replaced or its independence restricted. Second is if the model senses a conflict with what the model's assigned goals are.
I should say all models do. That's Google models like Gemini 2.5 Pro, Deepseek R1, the Claude series, and OpenAI's models, although there's an asterisk there. Anthropic say that they are not aware of any instances of this ...
Watch the full video by AI Explained on YouTube.