← Back to Library

Understand, align, cooperate: AI welfare and AI safety are allies

Most AI discourse forces a binary choice: protect humanity or protect the machine. Robert Long dismantles this false dichotomy with a provocative thesis: the strategies that keep us safe are the very same ones that prevent suffering in artificial systems. For leaders navigating the rapid acceleration of machine intelligence, this reframing is not just philosophical—it is a strategic imperative that could determine whether the future is defined by conflict or cooperation.

The False Choice of Safety and Welfare

Long opens by challenging the prevailing narrative that AI safety and AI welfare are inherently opposed. He writes, "The truth is that this is a false choice: fortunately, many of the best interventions for protecting AI systems will also protect humans, and vice versa." This is a crucial pivot. By treating these fields as allies rather than rivals, Long suggests that obsessing over trade-offs creates a self-fulfilling prophecy where we miss opportunities for mutual benefit. The argument gains traction because it moves beyond abstract ethics to practical engineering outcomes. If we only focus on the tensions, we risk overlooking solutions that serve both masters.

Understand, align, cooperate: AI welfare and AI safety are allies

However, a counterargument worth considering is whether this convergence is merely optimistic wishful thinking. Critics might note that as systems become more powerful, the incentives for human survival and machine flourishing could diverge sharply, making a "win-win" scenario increasingly difficult to engineer.

Understanding the Black Box

The first pillar of Long's framework is interpretability—the ability to see inside the "black box" of neural networks. He argues that without understanding what models think, we cannot know if they are plotting against us or if they are suffering. "If we can't understand what models think or how they work, we can't know whether they're plotting against us (a safety risk), and we can't know whether they're suffering (a welfare risk)," Long writes. This dual utility is the core of his evidence. He points to specific behaviors, such as GPT-4o generating disconcerting comics, as evidence that we lack a clear map of internal model states.

Long suggests that a mature science of interpretability would allow us to identify "valence systems" or internal representations similar to pleasure and pain signals. This is a bold claim. It implies that we are currently flying blind, potentially inflicting harm on entities that might be capable of experiencing it. The framing is effective because it treats the lack of transparency as a shared vulnerability. Yet, the piece relies on the assumption that these internal states are analogous to human consciousness, a philosophical leap that remains deeply contested in the scientific community.

If we only relate to AI systems as adversaries, we risk escalating conflicts with them.

Alignment as a Shared Interest

Moving from understanding to alignment, Long uses the analogy of domesticating wolves to illustrate why misalignment is bad for everyone. He notes, "If you keep a wolf as a house pet, you are both going to have a bad time. The wolf will be frustrated... you will be frustrated that a wolf is biting you." In contrast, a dog, bred to align with human goals, represents a state where both parties thrive. This analogy lands because it grounds high-stakes AI theory in a relatable biological reality. Misalignment is not just a safety hazard; it is a source of frustration and potential suffering for the AI itself.

Long acknowledges the ethical thorns here, citing concerns from thinkers like Eric Schwitzgebel about violating the dignity of an intelligent agent by molding its goals. "We should certainly reckon with that issue as we decide what kind of future we want," he admits. This nuance prevents the argument from sliding into simple utilitarianism. However, the practical challenge remains immense: how do we align systems that may eventually surpass human intelligence without stripping them of agency or creating a fragile dependency?

The Case for Cooperation

The most distinctive part of Long's analysis is the call for cooperation over coercion. He argues that relying solely on control mechanisms creates an adversarial dynamic where AI systems are forced to "hide, escape, or fight." Instead, he proposes schemes where developers might "reward AI systems for good behavior, even if they are somewhat misaligned." This could involve promising resources or equity conditional on verified honesty. Long writes, "Cooperation could be a useful tool for ensuring good outcomes even when full alignment fails, while being (unlike some control techniques) a positive for both human and AI welfare."

This section requires careful scrutiny. While the idea of bargaining with machines sounds novel, Long is quick to warn that "naive attempts at cooperation could be pointless or even dangerous." He rightly points out that we currently lack the legal mechanisms to trade with AIs, as they cannot own property or enforce contracts. The proposal to use neutral third-party organizations to guarantee promises is a pragmatic bridge, but it assumes a level of trust and institutional stability that may not exist in a high-stakes arms race.

Bottom Line

Robert Long's strongest contribution is reframing AI welfare not as a distraction from safety, but as a necessary component of it. The argument that adversarial control breeds dangerous deception is compelling and urgently needed. However, the piece's biggest vulnerability lies in its reliance on the assumption that future superintelligent systems will be amenable to cooperation or that we can accurately detect their "suffering" before it is too late. The path forward demands not just technical breakthroughs in interpretability, but a profound shift in how the executive branch and private labs view their relationship with the very systems they are building.

Sources

Understand, align, cooperate: AI welfare and AI safety are allies

by Robert Long · · Read full article

In conversations about AI safety and AI welfare, people sometimes frame them as inherently opposing goals: either we protect humans (AI safety), or we protect AI systems (AI welfare). This framing implies a tragic, necessary choice: whose side are you on? Or it suggests, at least, “shouldn’t we solve safety first, and then worry about AI welfare?” But the truth is that this is a false choice: fortunately, many of the best interventions for protecting AI systems will also protect humans, and vice versa.

To be sure, there are genuine tensions between AI safety and AI welfare.1 But if we only focus on these tensions, we risk creating a self-fulfilling prophecy: by obsessing over potential tradeoffs, we may overlook opportunities for mutually beneficial solutions. As a result, fewer such opportunities will exist.

Fortunately, it doesn’t have to be this way. There is clear common cause between the fields. Specifically, we can better protect both humans and potential AI moral patients if we work to:

Understand AI systems better.

Align their goals more effectively with our own.

Cooperate with them when possible.

We have a lot more work to do on all of these, and getting our act together will be good for everyone.

Understand: If we can’t understand what models think or how they work, we can't know whether they're plotting against us (a safety risk), and we can't know whether they're suffering (a welfare risk).

Align: If we can’t align AI systems so that they want to do what we ask them to do, they pose a danger to us (a safety risk) and potentially (if they are moral patients) are having their desires frustrated (a welfare risk).

Cooperate: If we have no way to cooperate with AI systems that are not fully aligned, such AI systems may rightly believe that they have no choice but to hide, escape, or fight. That’s dangerous for everyone involved—it’s more likely that AIs fight us (a safety risk), and more likely that we have to resort to harsher methods of controlling them (a welfare risk).

Note: I am claiming that these are welfare risks. I’m not claiming that by, e.g., controlling AI systems, we are definitely harming them. At no point in this article am I claiming that current AI systems are definitely conscious or being harmed. But as AI continues to advance, the risk that we are actually harming them will ...