"Agentic AI" is a bonfire of the tokens while fab capacity, power grids, and P&Ls are the brakes:…

Brad DeLong cuts through the hype cycle with a stark physical reality check: the dream of agentic AI is currently colliding with the hard limits of power grids and semiconductor factories. While the industry celebrates "reasoning" breakthroughs, he argues we are merely burning cash in an industrial-scale stochastic parrot farm that is rapidly hitting diminishing returns on intelligence per kilowatt-hour.

The Engineering Reality vs. The Hype

DeLong begins by dismantling the notion that current large language models represent a fundamental leap in cognition. Quoting Cosma Shalizi, he describes the trajectory from the 2017 "Attention Is All You Need" paper to today's systems as an "incredibly impressive engineering accomplishment," but one that merely mimics thought. The achievement lies in making the machine work at all; the underlying mechanism remains a finite-order Markov model optimized for probability, not truth.
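To make "finite-order Markov model optimized for probability, not truth" concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any production LLM is built: the corpus is invented, the function names (fit_markov, next_token_distribution, generate) are hypothetical, and a real transformer replaces the count table with a learned neural parametrization over a far longer context. What the sketch shares with the real thing is the objective: estimate which token tends to follow a given context.

```python
import random
from collections import Counter, defaultdict


def fit_markov(tokens, order=2):
    """Count which token follows each context of `order` preceding tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        counts[context][tokens[i + order]] += 1
    return counts


def next_token_distribution(counts, context):
    """Return the estimated conditional distribution P(next | context)."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()} if total else {}


def generate(counts, seed, length, rng):
    """Sample a continuation; the seed length must equal the model order."""
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        dist = next_token_distribution(counts, out[-order:])
        if not dist:  # context never seen in the corpus: nothing to say
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs, k=1)[0])
    return out


# Invented toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat sat on the sofa".split()
model = fit_markov(corpus, order=2)
print(next_token_distribution(model, ["on", "the"]))
print(" ".join(generate(model, ["the", "cat"], 8, random.Random(0))))
```

Nothing in the table records whether the cat is in fact on the mat or the sofa. The model only knows that both continuations occurred equally often in its corpus.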

The core of DeLong's argument rests on the idea that scaling up models, data, and inference runs yields better compression of existing internet text, not new insights. "More scale lets it better approximate the conditional distribution of tokens produced by the median Reddit commenter, Substack ranter, or corporate PR department," he writes. This is a crucial distinction: the system is getting better at emulating the average of what has already been said, rather than discovering the laws of nature or moral philosophy.

"The training signal says 'be like this corpus,' not 'be smart.'"

This framing effectively shifts the debate from "will AI become conscious?" to "is this a sustainable economic model for simulating competence?" DeLong suggests that while these models can generalize beyond their training data by finding smooth, low-dimensional regularities in text, they are ultimately trapped in a "slop-filled sub-hyperplane of the potential reasoning space."

The Trap of Synthetic Inbreeding

A significant portion of DeLong's analysis focuses on the dangers of using AI to generate its own training data. He contrasts the success of game-playing AIs like AlphaGo, which thrive on adversarial synthetic data in a closed system with clear win/loss conditions, against open-ended language models. In chess, "systematically breeding for a sharper edge against a moving opponent" works because the environment is finite and specified.

However, when large language models generate their own text to train future versions, they engage in a form of intellectual inbreeding. DeLong warns that this process leads to "sharpening, amplification, and homogenization rather than a widening of the manifold." The system becomes increasingly confident but less diverse, spinning variations on what it already knows without access to external ground truth.
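The homogenization dynamic can be illustrated with a deliberately simple simulation. This is a sketch under stated assumptions, not DeLong's own model and not a claim about any particular training pipeline: the "model" is just a token frequency table, generation 0 is an invented Zipf-like distribution standing in for human text, and every later generation is fit only to samples drawn from the previous one. The names resample_generation and simulate are hypothetical.

```python
import random
from collections import Counter


def resample_generation(dist, n_samples, rng):
    """Draw n_samples tokens from dist, then refit by relative frequency."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {tok: c / n_samples for tok, c in counts.items()}


def simulate(vocab_size=50, n_samples=100, generations=30, seed=0):
    """Track how many distinct tokens survive each round of self-training."""
    rng = random.Random(seed)
    # Generation 0 stands in for "human" text: a Zipf-like long tail.
    raw = [1.0 / (rank + 1) for rank in range(vocab_size)]
    total = sum(raw)
    dist = {f"tok{rank}": w / total for rank, w in enumerate(raw)}
    support = [len(dist)]
    for _ in range(generations):
        dist = resample_generation(dist, n_samples, rng)
        support.append(len(dist))
    return support


support_sizes = simulate()
print(support_sizes)
```

Because a token that draws zero samples in one generation has zero probability in every later one, the number of surviving tokens can only stay flat or shrink. The rare tail disappears first, which is the toy analogue of purifying the existing style without widening it.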

Critics might argue that synthetic data is necessary because human-written text on the internet is running out, and that new techniques like retrieval-augmented generation could mitigate this drift. Yet DeLong's point about the lack of a "clean, on-policy reinforcement signal" in open language remains a potent counter to the idea that more data alone solves the quality problem.

"Artificial data is a form of in-breeding: it can make the existing style more pure, but it does not, and probably cannot, give you a fundamentally different species of thought."

Brute Force vs. True Reasoning

The piece then tackles the current trend of "agentic" AI, where systems run thousands of iterations to solve problems. DeLong describes this as "Clever Hans at extraordinary scale," in which the members of an ensemble of stochastic parrots cross-check one another until they converge on a plausible answer. This works well in domains with crisp feedback loops, such as code compilation or financial trading, where the system can simply discard paths that fail and keep those that succeed.

"In a domain with crisp outcomes... this kind of massive, automated A/B testing of thoughts starts to look uncannily like 'reasoning,'" DeLong writes. However, he cautions that under the hood, it is still just "stochastic parrotry" repeated until the law of large numbers beats the hallucinations into submission.
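The generate-and-discard loop described above can be sketched as follows. This is an illustrative toy, not an agent framework: noisy_proposer stands in for a single stochastic model call, verifier stands in for a crisp outcome check such as "does the code compile and pass its tests," and the task (finding an integer whose square is 1369) is invented for the example. All three names are hypothetical.

```python
import random


def noisy_proposer(rng, low=0, high=100):
    """Stand-in for one stochastic model call: an unreliable guess."""
    return rng.randint(low, high)


def verifier(candidate, target):
    """Crisp pass/fail check, analogous to compiling and running tests."""
    return candidate * candidate == target


def search(target, max_calls, rng):
    """Sample until a candidate passes the check or the budget runs out."""
    for calls in range(1, max_calls + 1):
        candidate = noisy_proposer(rng)
        if verifier(candidate, target):
            return candidate, calls
    return None, max_calls


answer, calls_used = search(target=1369, max_calls=10_000, rng=random.Random(0))
print(answer, calls_used)
```

The proposer never gets any smarter; the loop succeeds only because the verifier is exact and the call budget is large. The calls_used figure is the part that shows up on the token bill, and without a crisp verifier the loop has no way to know when to stop.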

The comparison to the "Chinese room" thought experiment (originally John Searle's), invoked here by way of Scott Aaronson, is particularly striking. DeLong suggests that while we might eventually build a machine the size of the Earth that simulates understanding through sheer volume, we are not there yet. The gap between calculating power and genuine insight remains vast, with machines excelling at arithmetic but struggling with the nuances of human thought.

The Hard Constraints of Power and Profit

Ultimately, DeLong grounds his philosophical critique in hard economic and physical constraints. He argues that the "bonfire of tokens" required to fuel these agentic systems is running into the brakes of fab capacity, power grids, and profit-and-loss statements. The infrastructure required for serious deployment looks less like software and more like an aluminum smelter wired into the national grid.

He highlights the staggering costs involved, noting that firms are "casually burning nine-figure annual run-rates on inference experiments" without clear returns. Citing a conversation between Derek Thompson and Doug O'Laughlin, he points out that agents consume tokens voraciously—"like mammals breathe oxygen"—with some companies exhausting their budgets in months.

"Agents eat tokens like mammals breathe oxygen."

This section serves as the piece's most urgent warning: the industry is currently flinging compute at customers to capture data, ignoring the marginal costs of electricity and hardware. DeLong posits that sooner rather than later, someone will have to ask a brutal question: "how much 'reasoning per kilowatt-hour and per dollar of capex' are we actually getting, and is that, in fact, worth it?"

Bottom Line

DeLong's strongest contribution is his refusal to separate the technical potential of AI from its physical and economic reality; he forces the reader to confront the "jaw-dropping" costs of a system that may never achieve true understanding. The argument's vulnerability lies in its skepticism toward emergent properties, as history shows that scaling often yields unexpected breakthroughs, but his emphasis on the diminishing returns of synthetic data provides a necessary corrective to current optimism. Readers should watch for how quickly the market pivots from "growth at all costs" to rigorous unit economics once the power bills come due.

Deep Dives

Explore these related deep dives:

  • Model collapse

    This phenomenon describes the specific degradation of AI models when they are trained on their own synthetic outputs, directly addressing the author's concern about LLMs 'in-breeding' and trapping themselves in a 'slop-filled sub-hyperplane.'

  • Minimum description length

    The article argues that intelligence emerges from lossy compression; this principle provides the theoretical framework for understanding how discarding specific data points to find compact internal codes leads to genuine structural discovery.

  • Stochastic parrot

    While the author uses this term metaphorically, the original paper defining it offers the necessary nuance on why scaling up statistical pattern matching might never achieve true reasoning, contrasting with the 'engineering accomplishment' of current systems.

Sources

"Agentic AI" is a bonfire of the tokens while fab capacity, power grids, and P&Ls are the brakes:…

We scaled “attention is all you need” into an industrial‑scale stochastic parrot farm, then bolted on agents and tools until it started to look somewhat more like thought. Now the engineering reality (fabs, power, and eye‑watering token bills) is asking whether what we are doing is worthwhile. And general‑purpose LLMs start in‑breeding on their own output, unlike game AIs that thrive on tightly constrained, adversarial synthetic data. Are we trapping ourselves in a slop-filled sub-hyperplane of the potential reasoning space?

Start with attention is all you need <https://arxiv.org/abs/1706.03762>, and scale. And the results are, as Cosma Shalizi noted lo these three years ago:

Cosma Shalizi: “Attention”, “Transformers”, in Neural Network “Large Language Models” <https://bactra.org/notebooks/nn-attention-and-transformers.html>: ‘[an] incredibly impressive engineering accomplishment of [actually] making the blessed thing work. A large, able and confident group of people pushed kernel-based methods for years in machine learning, and nobody achieved anything like the feats which modern large language models have demonstrated. The reason I put effort into understanding these machines and papers is precisely because the results are impressive!…

Again: finite-order Markov models…. Lots of people have played around with them, including tricks like variable context length, various kinds of partial pooling, etc. Nobody, so far as I know, has achieved results anywhere close to what contemporary LLMs can do. This is impressive enough that (as I said at the beginning of these notes) I need to wrap my head around them lest I become obsolete…

And then for the past four years, ever since the completely unexpected success of the initial ChatGPT, comes scaling to the moon. Scaling to the moon along three different dimensions:

bigger models,

bigger data,

more runs.

Bigger models: More parameters and more training carve the high‑dimensional text space into finer, more meaningful regions: conversations that are “about the same thing” end up closer together, even when they use different vocabularies, metaphors, or surface forms. Small models rely on crude lexical overlap and shallow heuristics. Scaling allows the network to devote capacity to representing latent structure: underlying topics, implicit roles, typical rhetorical moves, even rough causal or temporal patterns, matching on a much richer notion of what “this is like that” is. The result is smarter not because the objective changed, but because the classifier over “which training conversations are actually relevantly similar?” got much better.

Moreover, a good model is a lossy compressor of its training data: it throws away the ...