
AI safety is not a model property

Arvind Narayanan delivers a necessary reality check to an industry obsessed with technical fixes: the belief that we can code our way out of AI harm is an illusion. While the tech sector frantically invests in "red teaming" and alignment techniques, Narayanan argues these efforts are fundamentally misdirected because they treat safety as a property of the software itself, ignoring the messy reality of how humans actually use it. For busy leaders trying to navigate the coming regulatory waves, this piece cuts through the noise to reveal that the real battle isn't inside the model—it's in the world where the model is deployed.

The Myth of the Safe Model

The core of Narayanan's argument is a direct challenge to the prevailing orthodoxy in AI development. He writes, "The assumption that AI safety is a property of AI models is pervasive in the AI community. It is seen as so obvious that it is hardly ever explicitly stated." This unexamined belief has led to a frantic race to fix "brittleness" in alignment techniques and to arbitrary policy thresholds, such as policymakers' convergence on a training compute limit of 10^26 floating-point operations, a number chosen without any meaningful basis.
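To see why a single compute threshold is such a blunt instrument, it helps to look at how training compute is typically estimated. A common rule of thumb puts total training FLOPs at roughly 6 × parameters × tokens. The sketch below uses that approximation with purely illustrative model sizes (not figures from the article) to show how a model can land just under or over the 10^26 line:

```python
# Rough sketch of the common training-compute rule of thumb:
# C ~= 6 * N * D, where N = parameter count and D = training tokens.
# The model sizes below are illustrative assumptions, not real figures.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in floating-point operations."""
    return 6 * params * tokens

THRESHOLD = 1e26  # the compute limit policymakers converged on

# A hypothetical 400B-parameter model trained on 15T tokens:
flops = training_flops(4e11, 1.5e13)
print(f"{flops:.1e}", flops >= THRESHOLD)  # 3.6e+25 False — just under the line
```

Small, essentially arbitrary choices about model size or data volume flip a model across the threshold, which is Narayanan's point: the number tracks engineering decisions, not harm.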


Narayanan dismantles this by pointing out a fatal flaw in the logic: a model cannot know the context of its own output. He illustrates this with the example of phishing emails. An AI can generate a persuasive email, but it cannot see the malicious link or attachment that makes the email dangerous. "The only way to make a model refuse to generate phishing emails is to make it refuse to generate emails," Narayanan notes. This would inevitably block legitimate uses like marketing, creating a false dilemma where safety requires censorship.

This framing is effective because it shifts the burden of proof from the developer to the ecosystem. If the model lacks the information to distinguish between a harmless marketing campaign and a phishing scam, then asking the model to make that distinction is not just difficult—it's impossible. Critics might argue that better training data could eventually teach models to recognize subtle contextual cues, but Narayanan's point holds: without access to the downstream environment, the model is flying blind.

Trying to make an AI model that can't be misused is like trying to make a computer that can't be used for bad things.

Where Safety Actually Lives

If safety isn't in the code, where is it? Narayanan argues it must be located in the defenses surrounding the technology. He writes, "Defenses should focus on attack surfaces: the downstream sites where attackers use the outputs of AI models for malicious purposes." He points to email scanners and URL blacklists as the actual solution to phishing, tools that have been improving for decades regardless of whether the email was written by a human or a machine.
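A downstream defense of the kind Narayanan describes can be surprisingly simple in outline. The sketch below is a minimal, illustrative email scanner that checks a message's links against a URL blocklist; the blocklisted domains are hypothetical placeholders, and real scanners draw on continuously updated threat-intelligence feeds:

```python
# Minimal sketch of a downstream phishing defense: scan an email's links
# against a domain blocklist. The blocklist entries are hypothetical
# placeholders; production scanners use live threat-intelligence feeds.
import re

BLOCKLIST = {"malicious.example.com", "phish.example.net"}  # hypothetical

URL_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def flag_suspicious_links(email_body: str) -> list[str]:
    """Return the linked domains that appear on the blocklist."""
    domains = {m.group(1).lower() for m in URL_RE.finditer(email_body)}
    return sorted(domains & BLOCKLIST)

# The check works identically whether a human or an LLM wrote the email:
msg = "Urgent: review the project file at https://phish.example.net/doc"
print(flag_suspicious_links(msg))  # ['phish.example.net']
```

The design choice mirrors the article's argument: the defense inspects the attack surface (the link the victim would click), which carries the context the model itself can never see.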

The author extends this logic to bioterrorism and disinformation. Even if a model refused to generate false information, Narayanan points out that "true-but-misleading information is far more impactful than false information on social media." A malicious actor could simply use a "safe" model to summarize real news and add their own misleading context later. The model, lacking the intent or the broader narrative, cannot prevent this.

This perspective forces a re-evaluation of the open-versus-closed model debate. Narayanan suggests we need to assess "marginal risk"—the incremental danger a new model adds compared to existing tools. He writes, "Using this framework, we showed that the marginal risk of open models in cybersecurity... is low, whereas for the generation of non-consensual intimate imagery, the marginal risk is substantial." This is a crucial distinction for policymakers; it suggests that a blanket ban on open models is a blunt instrument that fails to address the specific risks that actually matter.

The Incentive Problem

Perhaps the most uncomfortable part of Narayanan's analysis is the look at who benefits from the current safety narrative. He argues that the "myth of safety as a model property" persists because it is convenient for developers. If safety is a technical property, companies can claim liability protection if their model is "safe" but still gets misused. Narayanan writes, "By contrast, accepting that there is no technical fix to misuse risks means that the question of responsibility is extremely messy."

To fix this, he proposes a radical shift in red teaming. Instead of asking "Can this model be misused?" (the answer is always yes), we should ask "What new capabilities does this model enable?" He suggests that red teaming should be led by third parties with aligned incentives, rather than the developers themselves, who have a financial interest in downplaying risks. "It is much better for them to not find out in the first place," he observes regarding the incentive for developers to avoid discovering their own tools' offensive potential.

However, this reliance on third-party oversight introduces its own complexities. Narayanan acknowledges that some third-party evaluations have been funded by specific ideological groups, raising concerns about scientific independence. A counterargument worth considering is whether third parties truly have the resources to keep pace with the rapid iteration of model capabilities, potentially leaving a gap between discovery and defense.

No amount of "guardrails" will close this gap.

Bottom Line

Narayanan's strongest contribution is the realization that we cannot engineer our way out of societal problems; the "fix" for AI misuse is not a better model, but better infrastructure and clearer liability rules. The argument's biggest vulnerability is the difficulty of implementing these downstream defenses at a global scale without stifling innovation. The next critical step for the industry is not more alignment research, but an honest conversation about who pays the price when the technology is weaponized.

Sources

AI safety is not a model property

by Arvind Narayanan · AI Snake Oil

The assumption that AI safety is a property of AI models is pervasive in the AI community. It is seen as so obvious that it is hardly ever explicitly stated. Because of this assumption:

Companies have made big investments in red teaming their models before releasing them.

Researchers are frantically trying to fix the brittleness of model alignment techniques.

Some AI safety advocates seek to restrict open models given concerns that they might pose unique risks.

Policymakers are trying to find the training compute threshold above which safety risks become serious enough to justify intervention (and lacking any meaningful basis for picking one, they seem to have converged on 10^26 rather arbitrarily).

We think these efforts are inherently limited in their effectiveness. That’s because AI safety is not a model property. With a few exceptions, AI safety questions cannot be asked and answered at the levels of models alone. Safety depends to a large extent on the context and the environment in which the AI model or AI system is deployed. We have to specify a particular context before we can even meaningfully ask an AI safety question.

As a corollary, fixing AI safety at the model level alone is unlikely to be fruitful. Even if models themselves can somehow be made “safe”, they can easily be used for malicious purposes. That’s because an adversary can deploy a model without giving it access to the details of the context in which it is deployed. Therefore we cannot delegate safety questions to models — especially questions about misuse. The model will lack information that is necessary to make a correct decision.

Based on this perspective, we make four recommendations for safety and red teaming that would represent a major change to how things are done today.

Safety depends on context: three examples

Consider the concern that LLMs can help hackers generate and send phishing emails to a large number of potential victims. It’s true — in our own small-scale tests, we’ve found that LLMs can generate persuasive phishing emails tailored to a particular individual based on publicly available information about them. 

But here’s the problem: phishing emails are just regular emails! There is nothing intrinsically malicious about them. A phishing email might tell the recipient that there is an urgent deadline for a project they are working on, and that they need to click on a link or open an ...