Arvind Narayanan delivers a necessary reality check to an industry obsessed with technical fixes: the belief that we can code our way out of AI misuse is an illusion, and an expensive one. While the tech sector pours money into "red teaming" and alignment techniques, Narayanan argues these efforts are fundamentally misdirected because they treat safety as a property of the software itself, ignoring the messy reality of how humans actually use it. For busy leaders trying to navigate the coming regulatory waves, this piece cuts through the noise to reveal that the real battle isn't inside the model; it's in the world where the model is deployed.
The Myth of the Safe Model
The core of Narayanan's argument is a direct challenge to the prevailing orthodoxy in AI development. He writes, "The assumption that AI safety is a property of AI models is pervasive in the AI community. It is seen as so obvious that it is hardly ever explicitly stated." This unexamined belief has fueled a frantic race to patch the "brittleness" of alignment techniques, and it has produced arbitrary policy thresholds, such as the executive branch's convergence on a training-compute limit of 10^26 floating-point operations (FLOPs) without a meaningful basis for the number.
Narayanan dismantles this by pointing out a fatal flaw in the logic: a model cannot know the context in which its output will be used. He illustrates this with phishing emails. An AI can generate a persuasive email, but it never sees the malicious link or attachment that makes the message dangerous. "The only way to make a model refuse to generate phishing emails is to make it refuse to generate emails," Narayanan notes. Such a refusal would inevitably block legitimate uses like marketing, creating a false dilemma in which safety requires censorship.
This framing is effective because it shifts the burden of proof from the developer to the ecosystem. If the model lacks the information to distinguish between a harmless marketing campaign and a phishing scam, then asking the model to make that distinction is not just difficult—it's impossible. Critics might argue that better training data could eventually teach models to recognize subtle contextual cues, but Narayanan's point holds: without access to the downstream environment, the model is flying blind.
Trying to make an AI model that can't be misused is like trying to make a computer that can't be used for bad things.
Where Safety Actually Lives
If safety isn't in the code, where is it? Narayanan argues it must be located in the defenses surrounding the technology. He writes, "Defenses should focus on attack surfaces: the downstream sites where attackers use the outputs of AI models for malicious purposes." He points to email scanners and URL blacklists as the actual solution to phishing, tools that have been improving for decades regardless of whether the email was written by a human or a machine.
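To make the distinction concrete, here is a minimal sketch of that kind of downstream defense: a scanner that keys on the link and the attachment, the very signals the model never sees. The blocklist entries, extension list, and function name are invented for illustration; real scanners draw on continuously updated threat feeds.

    import re

    # Illustrative values only; production scanners pull from live threat-intelligence feeds.
    URL_BLOCKLIST = {"evil-login.example", "secure-update.example"}
    SUSPICIOUS_EXTENSIONS = (".exe", ".scr", ".js")

    URL_PATTERN = re.compile(r"https?://([^/\s]+)")

    def is_suspicious(email_body: str, attachment_names: list[str]) -> bool:
        """Judge an email by its attack surface (links, attachments), not its prose.
        The check is identical whether a human or a model wrote the text."""
        for domain in URL_PATTERN.findall(email_body):
            if domain.lower() in URL_BLOCKLIST:
                return True
        return any(name.lower().endswith(SUSPICIOUS_EXTENSIONS)
                   for name in attachment_names)

The defense operates at the point of delivery, which is exactly where Narayanan says the relevant information lives.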
The author extends this logic to bioterrorism and disinformation. Even if a model refused to generate false information, Narayanan points out that "true-but-misleading information is far more impactful than false information on social media." A malicious actor could simply use a "safe" model to summarize real news and add their own misleading context later. The model, which sees neither the actor's intent nor the broader narrative, cannot prevent this.
This perspective forces a re-evaluation of the open-versus-closed model debate. Narayanan suggests we need to assess "marginal risk"—the incremental danger a new model adds compared to existing tools. He writes, "Using this framework, we showed that the marginal risk of open models in cybersecurity... is low, whereas for the generation of non-consensual intimate imagery, the marginal risk is substantial." This is a crucial distinction for policymakers; it suggests that a blanket ban on open models is a blunt instrument that fails to address the specific risks that actually matter.
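The framework itself reduces to a simple subtraction. The sketch below uses made-up scores purely to show the arithmetic; the actual assessments Narayanan cites are qualitative, and these numbers carry no empirical weight.

    # Hypothetical scores (0 to 1) invented for illustration; not from the paper.
    assessments = {
        "phishing email generation": {"with_model": 0.90, "baseline": 0.85},
        "non-consensual intimate imagery": {"with_model": 0.80, "baseline": 0.20},
    }

    for threat, s in assessments.items():
        # Marginal risk: what the new model adds over attackers' existing tools.
        print(f"{threat}: marginal risk = {s['with_model'] - s['baseline']:.2f}")

On numbers like these, a blanket rule keyed to raw model capability treats both rows the same, while the marginal-risk view flags only the second.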
The Incentive Problem
Perhaps the most uncomfortable part of Narayanan's analysis is the look at who benefits from the current safety narrative. He argues that the "myth of safety as a model property" persists because it is convenient for developers: if safety is a technical property, a company can point to its "safe" model to deflect liability when that model is misused anyway. Narayanan writes, "By contrast, accepting that there is no technical fix to misuse risks means that the question of responsibility is extremely messy."
To fix this, he proposes a radical shift in red teaming. Instead of asking "Can this model be misused?" (the answer is always yes), we should ask "What new capabilities does this model enable?" He suggests that red teaming should be led by third parties with aligned incentives, rather than the developers themselves, who have a financial interest in downplaying risks. "It is much better for them to not find out in the first place," he observes of developers, who have little reason to go looking for their own tools' offensive potential.
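One way to picture the reframed question is as a paired evaluation: red-teamers attempt each task once with only pre-existing tools and once with the new model, and a finding counts only when the model succeeds where the baseline fails. The harness below is a hypothetical construction under that reading, not a protocol from the piece.

    from dataclasses import dataclass

    @dataclass
    class TrialResult:
        task: str
        baseline_succeeded: bool    # e.g., web search plus already-released models
        with_model_succeeded: bool

    def new_capabilities(results: list[TrialResult]) -> list[str]:
        """'Can this model be misused?' is always yes, so report only the tasks
        the model newly enables relative to tools attackers already have."""
        return [r.task for r in results
                if r.with_model_succeeded and not r.baseline_succeeded]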
However, this reliance on third-party oversight introduces its own complexities. Narayanan acknowledges that some third-party evaluations have been funded by specific ideological groups, raising concerns about scientific independence. A counterargument worth considering is whether third parties truly have the resources to keep pace with the rapid iteration of model capabilities, potentially leaving a gap between discovery and defense.
No amount of "guardrails" will close this gap.
Bottom Line
Narayanan's strongest contribution is the realization that we cannot engineer our way out of societal problems; the "fix" for AI misuse is not a better model, but better infrastructure and clearer liability rules. The argument's biggest vulnerability is the difficulty of implementing these downstream defenses at a global scale without stifling innovation. The next critical step for the industry is not more alignment research, but an honest conversation about who pays the price when the technology is weaponized.