
The coding assistant breakdown: More tokens please

Dylan Patel doesn't just track the latest AI model releases; he exposes the hidden economics of how we actually pay for intelligence. In a field obsessed with raw benchmark scores, Patel argues that the true north star has shifted from cost-per-token to cost-per-task, a distinction that could rewrite the entire pricing strategy of the industry. This piece is essential because it moves beyond the hype of "agentic coding" to reveal the gritty reality of token efficiency, infrastructure bottlenecks, and the subtle ways labs are manipulating user perception through new pricing tiers and buggy rollouts.

The Token Efficiency Trap

Patel begins by dismantling the assumption that a cheaper model is always a better deal. He points out that while some models may charge five times more per token, their ability to solve problems with fewer tokens can actually make them cheaper overall. "Cost per task, not cost per token, is the true north star metric that determines model pricing," Patel writes. This reframing is critical for busy engineers who are tired of watching their bills balloon as models become more verbose. The argument holds weight because it forces a conversation about the actual utility of the output rather than the raw volume of data processed.
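The arithmetic behind this reframing is simple enough to sketch. The prices and token counts below are made-up assumptions for illustration only, not figures for any real model; the point is that a 5x per-token premium can still win once token efficiency is factored in:

```python
# Hypothetical illustration of cost-per-task vs. cost-per-token.
# All prices and token counts are assumed, not real model figures.

def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Dollar cost to complete one task."""
    return price_per_mtok * tokens_per_task / 1_000_000

# "Cheap" model: $3 per million output tokens, but verbose (200k tokens/task).
cheap = cost_per_task(3.0, 200_000)    # $0.60 per task

# "Expensive" model: 5x the per-token price, but solves it in 30k tokens.
pricey = cost_per_task(15.0, 30_000)   # $0.45 per task

assert pricey < cheap  # the pricier-per-token model is cheaper per task
```

Under these assumed numbers, the model that looks five times more expensive on a pricing page is 25% cheaper per completed task, which is exactly the inversion Patel is pointing at.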


However, this efficiency comes with a catch. The article highlights how OpenAI's new GPT-5.5, while touted as more efficient, is priced at twice the rate of its predecessor. Patel notes that "GPT-5.5's API price will be 2x more expensive than GPT-5.4 and slightly more expensive than Opus 4.7." This creates a paradox where users are paying a premium for a model that claims to use fewer tokens, effectively shifting the cost burden from volume to capability. Critics might argue that this pricing strategy is less about efficiency and more about extracting maximum value from users who are desperate for the latest "frontier" capabilities.

Cost per task, not cost per token, is the true north star metric that determines model pricing.

The Illusion of Speed and Stability

The commentary then turns to the user experience, specifically the trade-off between speed and quality. Patel observes that engineers are increasingly willing to sacrifice a bit of quality for speed, claiming that faster response times allow them to hit "flow state." Yet, the reality of the new releases is messy. Anthropic's Opus 4.7, for instance, introduced a new tokenizer that "trades off improved performance via more granular token counting for more total token usage," leading to an implicit 35% price increase. This is a stark example of how technical improvements can be weaponized against the consumer.
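The mechanics of that implicit increase are worth making explicit: if a new tokenizer splits the same text into more tokens while the per-token price stays flat, the bill for the same workload rises by the tokenization ratio. The 35% figure comes from the article; the per-token price and workload size below are illustrative assumptions:

```python
# Sketch of how a tokenizer change becomes an implicit price increase.
# The 35% ratio is from the article; the price and token counts are assumed.

price_per_mtok = 25.0                 # assumed $/1M tokens, unchanged across versions
old_tokens = 100_000                  # tokens for a workload under the old tokenizer
new_tokens = int(old_tokens * 1.35)   # same workload, more granular tokenization

old_cost = price_per_mtok * old_tokens / 1_000_000   # $2.50
new_cost = price_per_mtok * new_tokens / 1_000_000   # $3.375

increase = new_cost / old_cost - 1
print(f"effective price increase: {increase:.0%}")   # prints "effective price increase: 35%"
```

Nothing on the rate card changes, yet the effective cost per task goes up, which is why Patel treats the tokenizer swap as a pricing decision rather than a purely technical one.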

The situation is compounded by stability issues. Patel reveals that Anthropic faced weeks of bugs that went unnoticed, leading to a postmortem that admitted to three significant issues affecting all users. "When the harness is part of the product, the model gets blamed," he writes, capturing the frustration of users who feel like beta testers for products that should be polished. This section is particularly effective because it humanizes the technical failures, showing how bugs in the underlying infrastructure can make engineers feel "a little schizo" as they try to trust a system that is fundamentally broken.

The article also touches on the infrastructure behind these models, noting that while OpenAI claims GPT-5.5 was "trained" on a massive cluster, the pre-training actually happened on older Hopper architecture. This detail, while technical, underscores the gap between marketing narratives and engineering reality. It suggests that the industry is still relying on legacy hardware to power the next generation of intelligence, a fact that could have significant implications for future scalability.

The Open Source Dilemma

Finally, Patel addresses the role of open-source models like DeepSeek V4. While these models have democratized access to advanced AI, Patel argues they are still "meaningfully behind their closed-source counterparts on the frontier." He notes that DeepSeek's V4-Pro, despite its impressive 1-million-token context window, still lags behind in key areas like Chinese writing tasks. "Claude mogs Chinese models in its own language," Patel quips, highlighting the persistent gap between open and closed systems.

Yet, the open-source contribution is not to be underestimated. Patel points out that DeepSeek's release of libraries like DeepGEMM and FlashMLA is "helping American open source AI stay alive." This dual narrative—acknowledging the technical inferiority while celebrating the ecosystem impact—provides a nuanced view of the open-source landscape. It suggests that while open models may not win every benchmark, they are essential for keeping the broader industry competitive and innovative.

When the harness is part of the product, the model gets blamed.

Bottom Line

Patel's analysis is a masterclass in cutting through the noise to find the real economic and operational drivers of the AI industry. His strongest argument is the shift from token-based to task-based pricing, a concept that will likely define the next phase of model adoption. However, the piece's vulnerability lies in its reliance on benchmark data that the author himself admits is often unreliable. As the industry moves forward, the real test will be whether these models can deliver on their promises of efficiency and stability in the messy reality of production environments.

Deep Dives

Explore these related deep dives:

  • Reinforcement learning from human feedback

    The article clarifies that GPT-5.5's 'training' was actually post-training via RL on existing data, making this specific mechanism the key to understanding why the model improved without a new pre-train.

Sources

The coding assistant breakdown: More tokens please

by Dylan Patel · SemiAnalysis

Since we called out the Claude Code inflection point on February 5th, we have seen a flurry of model releases. Opus, Mythos, Codex, Gemini, DeepSeek, Kimi, Qwen, GLM, MiniMax, Composer, Muse Spark, and more. Today we will break down all of these major model releases, explain when you can vs can’t trust the benchmarks, and give our predictions for the future of the agentic coding market.

First we have to highlight GPT-5.5 from OpenAI. In our view, GPT-5.5 is now materially better at some tasks than all other models. We believe that GPT-5.5 has arrived at the frontier. This is a huge change from November when Opus 4.5 was released. At that time, and for the 6 months since, OpenAI’s coding model was not world class in most metrics, leading to Opus being our daily driver. GPT-5.5 is now integrated in our daily work.

Meet the Models.

There’s been at least one major lab releasing a new checkpoint purpose-built for coding every week for the past 3 months. GLM-5.1, Qwen3.6-Plus, Kimi K2.6, Composer 2, and Gemini 3.1 Pro all emphasize “agentic coding,” “long-horizon tasks,” or similar capabilities in their headlines. February was a particularly busy month.

New checkpoints are cool, but entirely new pre-trains are what really get the people going. Heading into April, the San Francisco rumor mill was ablaze with talk about Capybara and Spud. These are codenames for Anthropic and OpenAI’s newest pre-trains. With the release of GPT-5.5 yesterday, we now have something concrete to discuss.

GPT-5.5.

GPT-5.5 is the first public release based on “Spud”. As OpenAI’s first new pre-train since the failed GPT-4.5, expectations are obviously high. And despite both NVIDIA and OpenAI claiming with precise language that the model was “trained” on a 100k GB200 NVL72 cluster, this “training” is post-training (RL) only. Pre-training is still on Hopper.

OpenAI’s flagship model has historically been cheaper than Anthropic’s, but at $5 per million input tokens and $30 per million output tokens, GPT-5.5’s API price will be 2x more expensive than GPT-5.4 and slightly more expensive than Opus 4.7. The API went live this morning after a brief ChatGPT/Codex-only window due to safety concerns. We’ve been testing the model via Codex and API during an alpha testing period and describe that experience later in this article.

Like all their other models, OpenAI will also be offering a priority tier for GPT-5.5 priced at 2.5x the ...