
Inference is unlikely to ever be a low marginal cost operational node, & the other reasons why the…

Brad DeLong delivers a stinging economic reality check to an industry intoxicated by the promise of artificial general intelligence, arguing that the path to hyperprofitability for foundation model builders is effectively blocked by physics and flawed logic. While Wall Street bets on a digital god that will print money forever, DeLong asserts that the fundamental economics of "inference"—the act of running these models—ensure they remain a capital-intensive utility rather than a software goldmine.

The Physics of Cost

DeLong's central thesis dismantles the prevailing narrative that AI costs will inevitably plummet to near-zero marginal levels. He leans heavily on analysis by Paolo Perrone to argue that the "memory wall" and energy requirements create an inescapable cost spiral. DeLong writes, "Inference is very unlikely to ever become a low marginal-cost node in the system." This is not merely a technical hurdle; it is a structural economic barrier. Unlike traditional software, where serving one more user costs pennies, AI requires substantial amounts of electricity and silicon for every single interaction.


The author highlights that current pricing is an illusion propped up by venture capital. Quoting Perrone, he notes, "The industry's largest AI lab spent $8.67 billion on inference in the first three quarters of 2025, nearly double their revenue." This staggering disparity suggests that the business model is currently burning cash to capture market share, a strategy that collapses once subsidies vanish. Again drawing on Perrone, DeLong warns readers not to be misled by API price drops: "Don't believe the 'inference is getting cheap' headline. It's half true... But 'cheaper than 2022' is not the same as 'cheap.'"

This framing is particularly potent because it shifts the focus from model architecture to infrastructure reality. The costs are tied to memory bandwidth and GPU idle time: when a model generates text token by token, its processors spend much of their time waiting for data to arrive from memory, the long-standing gap between compute speed and memory speed known as the "memory wall." Because hardware constraints of this kind often dictate which software business models are viable, DeLong, following Perrone, concludes that the "cost spiral is real" and will prevent labs from achieving the margins seen in the Google or Facebook eras.
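A back-of-envelope way to see the memory wall: when a model generates text one token at a time for a single user, producing each token means streaming essentially all of the model's weights out of GPU memory, so throughput is capped at roughly memory bandwidth divided by model size in bytes. The sketch below is a simplification under stated assumptions (batch size 1; KV-cache reads, compute time, and overheads ignored), and its figures (a 70-billion-parameter model, 3.35 TB/s of aggregate memory bandwidth) are hypothetical illustrations, not numbers from DeLong or Perrone.

```python
def max_decode_tokens_per_second(n_params: float, bytes_per_param: float,
                                 bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when memory bandwidth is the limit.

    Assumes every generated token reads all weights once (batch size 1) and
    ignores KV-cache reads, compute time, and kernel overheads.
    """
    bytes_per_token = n_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical, illustrative figures (not taken from the article).
N_PARAMS = 70e9        # a 70B-parameter model
BANDWIDTH = 3.35e12    # 3.35 TB/s of aggregate GPU memory bandwidth

# Fewer bytes per weight (quantization) raises the ceiling proportionally.
for label, bpp in [("16-bit weights", 2.0), ("8-bit weights", 1.0), ("4-bit weights", 0.5)]:
    tps = max_decode_tokens_per_second(N_PARAMS, bpp, BANDWIDTH)
    print(f"{label}: at most ~{tps:.0f} tokens/s per stream")
```

Under these assumptions the 16-bit case tops out near 24 tokens per second per stream, however fast the arithmetic units are. Batching many users amortizes each weight read and lower-precision weights raise the ceiling, which is one reason providers lean on batching and quantization; both lower the cost per token without removing the memory traffic itself.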

The Judgment Gap

Beyond the economics, DeLong attacks the qualitative reliability of these systems. He argues that while models possess impressive verbal fluency, they lack genuine judgment. They cannot distinguish between knowledge and "s***posting" within their training data without human intervention. As DeLong puts it, anything produced by these agents for serious use requires "IMMENSE 'babysitting' not to run off the rails."

He illustrates this fragility with a sobering calculation: because per-step success rates multiply, an agent with 85% accuracy per step completes a ten-step workflow successfully only about 20% of the time. The system does not fail loudly like deterministic code; instead, it "fails quietly, confidently, and often in ways your tests never anticipated." This is a critical distinction for enterprise adoption. A non-deterministic agent asked to "clear the cache" might confidently interpret the request as "wipe the drive," a catastrophic error born of probabilistic guessing rather than logical reasoning.
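The 20% figure follows from multiplying per-step success rates. The sketch assumes each step succeeds or fails independently, which real agents need not satisfy (errors can be correlated, and some pipelines retry or catch failures), so treat it as the simple model behind the headline number rather than a measurement.

```python
def workflow_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_accuracy ** n_steps


p = workflow_success_probability(0.85, 10)
print(f"85% per step over 10 steps: {p:.1%} end-to-end success")

# Invert the model: per-step accuracy needed for a 10-step workflow to succeed 90% of the time.
required = 0.90 ** (1 / 10)
print(f"Per-step accuracy needed for 90% end-to-end: {required:.2%}")
```

Inverting the calculation shows how demanding long workflows are: under the same independence assumption, finishing ten steps 90% of the time requires each step to succeed roughly 99% of the time.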

Critics might argue that rapid iteration and reinforcement learning will eventually solve these reliability issues, but DeLong remains skeptical. He points out that successful applications are those with "bounded scope," where the agent handles one domain and explicitly refuses tasks outside that boundary. The dream of an autonomous agent running complex workflows is currently a fantasy; the reality is a tool that needs constant supervision.
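One way to implement "bounded scope" is to put a deterministic gate between the model and anything that acts on the world: the model proposes an action, and plain code executes it only if it appears on an explicit allowlist for the agent's single domain. This is a minimal sketch; the action names and the `check_action` helper are hypothetical, and a real deployment would also need argument validation, logging, and human review for edge cases.

```python
from dataclasses import dataclass

# Hypothetical action names for a narrowly scoped cache-maintenance agent.
ALLOWED_ACTIONS = {"clear_cache", "report_cache_size"}


@dataclass
class Decision:
    allowed: bool
    reason: str


def check_action(proposed_action: str) -> Decision:
    """Gate a model-proposed action against a fixed allowlist.

    The model may choose or phrase actions probabilistically, but only
    allowlisted actions are ever executed; everything else is refused.
    """
    action = proposed_action.strip().lower()
    if action in ALLOWED_ACTIONS:
        return Decision(True, f"'{action}' is within scope")
    return Decision(False, f"'{action}' is outside this agent's scope; refusing")


for proposal in ["clear_cache", "wipe_drive", "Report_Cache_Size"]:
    decision = check_action(proposal)
    print("EXECUTE" if decision.allowed else "REFUSE", "-", decision.reason)
```

Because the gate is ordinary code, a request the model misreads as `wipe_drive` is refused deterministically instead of executed. The trade-off is that the agent can do nothing its designers did not anticipate, which is exactly the bounded scope the passage describes.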

We have an oddly talented but unreliable group of les idiots savants, assistants who must never be left alone with the gradebook, the syllabus, the data analysis, or the nuclear launch codes.

The RAG Reality and the Commodity Trap

DeLong suggests that the only viable path forward is not better models, but better data management through Retrieval-Augmented Generation (RAG). He explains that in a well-built system, "the retrieval matters more than the generation." By forcing the model to pull from a vetted, organized data store rather than relying on its internal "memory," companies can ground the output in vetted sources. This approach turns the AI into an indexer and pattern-matcher over trusted data, effectively acting as an open-book exam where the system must show its work.
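A minimal sketch of the retrieval half of such a system, assuming a tiny hypothetical document store and plain bag-of-words cosine similarity in place of the embedding search a production system would more likely use. It illustrates the point DeLong stresses: what `retrieve` returns, and the rule that the model may answer only from cited sources, does most of the work. The generation step (not shown) only phrases the answer. All names and documents here are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical vetted document store: id -> text.
VETTED_DOCS = {
    "policy-001": "Refunds are issued within 14 days of a returned item being received.",
    "policy-002": "Shipping to Canada takes 5 to 7 business days.",
    "faq-010": "Gift cards cannot be exchanged for cash.",
}


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank vetted documents by similarity to the query, dropping non-matches."""
    q = tokenize(query)
    scored = [(doc_id, cosine(q, tokenize(text))) for doc_id, text in VETTED_DOCS.items()]
    scored = [s for s in scored if s[1] > 0]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Assemble an 'open-book' prompt: answer only from cited sources, or refuse."""
    hits = retrieve(query, k)
    if not hits:
        return "NO_SOURCES: refuse to answer rather than guess."
    context = "\n".join(f"[{doc_id}] {VETTED_DOCS[doc_id]}" for doc_id, _ in hits)
    return (
        "Answer using ONLY the sources below and cite their IDs. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


print(build_prompt("How long do refunds take?"))
```

The refusal path matters as much as the happy path: when nothing in the vetted store matches, the system declines instead of letting the model fill the gap from its training data.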

However, this solution reinforces his economic argument: it requires expensive human labor to clean data and maintain the "plumbing." The model itself becomes a commodity input, hard to distinguish from competitors because distillation and quantization techniques allow open-weight models to perform nearly as well as proprietary ones. DeLong writes, "When the key differentiator in getting results becomes not the unique edge of a unique model, but rather the harness and the data quality painfully and expensively maintained... the datacenter-based token-serving core model itself will slide toward being a commodity input."

This commodification means that superprofits will flow to the owners of the infrastructure—NVIDIA, TSMC, and the utilities providing power—not the labs building the models. The "agentic fairy tale" of letting AI run workflows is crashing into the brick wall of non-determinism and compounding errors. Enterprises are discovering that these tools are not replacements for staff but add-ons that require their own dedicated management teams.

The IPO Gamble

The conclusion DeLong draws from this economic and technical landscape is grim for investors eyeing a near-term Initial Public Offering (IPO) for companies like OpenAI or Anthropic. He argues that the current lock-down of secondary markets by these firms is not a sign of strength, but a desperate attempt to keep valuations high before the reality sets in. "That is what you do if you need to keep the story going... long enough to distribute the hot potato to the broad public," he writes.

He characterizes the belief that these labs will achieve durable hyperprofitability as "eschatology" rather than investing. The numbers simply do not add up when a company burns 70% of its revenue just to cover costs while scaling. DeLong warns, "There is no plausible, well-specified path by which Anthropic or OpenAI grow into the kind of durable, high-margin franchises that would justify the valuations their private rounds have implied."

Bottom Line

DeLong's argument is a necessary corrective to the feverish speculation surrounding AI valuations, grounding the conversation in the unyielding laws of physics and basic accounting. His strongest point is the identification of inference as a capital-intensive utility rather than a scalable software product, a distinction that fundamentally undermines the current investment thesis. The biggest vulnerability in his case may be the potential for a "miracle" breakthrough in algorithmic efficiency or energy generation that could break the cost spiral, but until such a miracle appears, the economics suggest these labs are building a treadmill, not a cash machine.

Deep Dives

Explore these related deep dives:

  • Random-access memory

    Memory bandwidth, the rate at which data moves between memory and processor, is the bottleneck behind the author's core argument: physical limits on data transfer speed, rather than just model size, prevent AI inference costs from ever becoming negligible.

  • PagedAttention

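    Understanding how this mechanism stores attention states (the key/value, or KV, cache) in fixed-size memory blocks shows how much memory each conversation consumes: the cache grows with every token of context, and every newly generated token must read all of it back, so total memory traffic grows faster than linearly with conversation length. That is the 'cost spiral' described in the excerpt.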

  • Retrieval-augmented generation

    This architecture is presented as the most viable path to grounded output, but also as a common point of failure for AI agents when built poorly: weak retrieval and context management lead to the 'dumb RAG' and brittle performance that undermine claims of autonomous reliability.
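The PagedAttention entry in this list can be made concrete with the standard back-of-envelope formula for key/value (KV) cache size. The model shape below (80 layers, 8 key/value heads of dimension 128, 16-bit values) is a hypothetical illustration, not a specific product's published configuration, and the read-count model assumes generation starts from an empty cache.

```python
GIB = 2**30


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes held in the key/value cache for one sequence of seq_len tokens.

    The factor 2 counts one key vector and one value vector per KV head,
    per layer, per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def total_kv_bytes_read(n_tokens: int, per_token_bytes: int) -> int:
    """Cache bytes read while generating n_tokens one at a time from an empty cache.

    Generating token i reads roughly i cached positions, so total reads are
    per_token_bytes * (1 + 2 + ... + n) = per_token_bytes * n(n + 1) / 2,
    which is quadratic in the generated length.
    """
    return per_token_bytes * n_tokens * (n_tokens + 1) // 2


# Hypothetical model shape: 80 layers, 8 KV heads of dimension 128, 16-bit values.
CFG = {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128, "bytes_per_elem": 2}
per_token = kv_cache_bytes(1, **CFG)

for ctx in (4_096, 32_768):
    print(f"{ctx:>6}-token context: {kv_cache_bytes(ctx, **CFG) / GIB:.2f} GiB of KV cache")

for n in (1_000, 2_000):
    print(f"generating {n} tokens reads {total_kv_bytes_read(n, per_token) / GIB:.1f} GiB of cache")
```

With this shape each token adds 320 KiB to the cache, so a 32,768-token context holds 10 GiB per conversation, and doubling the generated length roughly quadruples the cache reads. PagedAttention (as implemented in vLLM) stores this cache in fixed-size blocks to reduce fragmentation and wasted memory; it lets more requests share a GPU but does not shrink what each new token must read.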

Sources

Inference is unlikely to ever be a low marginal cost operational node, & the other reasons why the…

Digital Gods, real costs: why a rational world would see the doom of the foundation‑model-builder IPO, because the AI labs are highly unlikely to ever get profits, let alone hyperprofits. Inference never becomes sufficiently cheap, AI-entity judgment stays bad, and durable quasi-rents flow to NVIDIA & company—not to the model‑makers….

I have no idea whether OpenAI or Anthropic or both will launch an IPO this year, and I have no idea what the results of it will be.

But it is clear to me that, if either one does, it ought to fail.

That is clear to me in a way that it was not clear to me back in the day that the Google or the Facebook or the Microsoft IPOs were unsound. I thought all three of those were very risky, yes. But, even though the valuations seemed very high to me, I did see a possible path to durable hyperprofitability for each.

I do not see such a path for either Anthropic or OpenAI. That has now crystallized for me. And it is reading Paolo Perrone that has done it, and that has led me to the conclusion in the title.

From Paolo Perrone I get four things:

(1) “Inference” is very unlikely to ever become a low marginal-cost node in the system:

Paolo Perrone: Why is Inference Slow and Expensive? <https://theaiengineer.substack.com/p/why-is-inference-slow-and-expensive>: ‘Your inference bill…. Memory bandwidth…. KV cache reads…. GPU idle time…. The electricity bill for running all that idle silicon…. The industry’s largest AI lab spent $8.67 billion on inference in the first three quarters of 2025, nearly double their revenue… [and] lose[s] money on $200/month Pro…. The memory wall doesn’t care how big your model is. It scales down with you. The cost spiral is real…. Don't believe the 'inference is getting cheap' headline. It’s half true. API prices have collapsed since 2022. But “cheaper than 2022” is not the same as “cheap.”… The pricing you see on API dashboards is subsidized by venture capital. Providers are selling below cost to capture the market. When the subsidies end, the prices go up…

(2) Language models now have sufficient verbal fluency. What they do not have is judgment as to which pieces of the human information corpus that has made up their training data are knowledge as opposed to simply s***posting. Hence anything they produce that is not for the immediate assessment by a ...