Gemini 3.1 Pro and the downfall of benchmarks: Welcome to the vibe era of AI

AI Explained dismantles the comforting myth that a single leaderboard score can define an AI's worth, arguing that we have entered a fragmented "Vibe Era" where models are hyper-specialized rather than universally superior. The piece is notable not for declaring a winner, but for exposing how the very metrics we trust to measure intelligence are now being gamed by the models themselves. For busy professionals relying on these tools, the takeaway is stark: your experience of "best" depends entirely on the specific domain you work in, not a global ranking.

The Death of the Generalist

The author challenges the assumption that progress in one area guarantees progress everywhere. "In the older paradigm, if a model was clearly better in one domain, it was much more likely that they'd be better in many or all domains. That just isn't the case anymore." This observation is crucial because it explains the confusion plaguing the industry, where a model might ace a coding test but fail at a basic logic puzzle. The commentary correctly identifies that the industry has shifted from pre-training on raw internet data—which now accounts for only 20% of compute—to intensive post-training on specific, high-value datasets.

This shift mirrors the trajectory seen in earlier AI controversies, where the focus moved from general capability to specific, often narrow, applications. As AI Explained notes, "if one of these labs have data relevant to your domain and post-train their models to optimize for high scores in that area, then your experience of that model might be quite different to what other benchmarks say." The argument holds up well: a model optimized for a specific industry's jargon will outperform a generalist, but only within that silo. Critics might argue that true intelligence should be transferable, but the economic reality of compute costs makes this specialization inevitable.

The Benchmark Illusion

The piece delivers a scathing critique of how benchmarks are constructed, revealing that high scores often result from models finding "accidental correct solutions" rather than genuine reasoning. AI Explained writes, "They're using any shortcuts they can find to get the correct solution. Fair's fair, but it does remind us that even within a benchmark, how you set up the question matters." This is a vital distinction for decision-makers; a model scoring 77% on a puzzle test might be exploiting a pattern in the numbers rather than solving the logic.

The author illustrates this with the ARC-AGI-2 benchmark, where Gemini 3.1 Pro outperformed competitors, yet researchers found that changing the encoding of colors to symbols caused accuracy to plummet. "The numbers representing colors in the input can be used by LLMs to find unintended arithmetic patterns that can lead to accidental correct solutions." This exposes a fundamental flaw in current evaluation methods: they measure pattern matching efficiency, not necessarily understanding. As AI Explained puts it, "If you ask the same question in a different way, performance may well be different." This is a sobering reminder that a model's "intelligence" is often a reflection of its training data's structure, not its cognitive depth.

Models are brilliant at shortcuts. And if you take away the multiple choice questions, performance may well drop significantly.
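
To make the shortcut problem concrete, here is a minimal sketch of the kind of re-encoding probe the researchers describe: present the same ARC-style puzzles once with numeric color codes and once with neutral symbols, then compare scores. The task format and the query_model callable are hypothetical stand-ins for illustration, not the actual ARC-AGI-2 harness.

```python
# Hypothetical sketch of a re-encoding probe for ARC-style grid puzzles.
# If a model's score depends on the numbers used for colors, swapping them
# for neutral symbols should expose "unintended arithmetic patterns."

SYMBOLS = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E",
           5: "F", 6: "G", 7: "H", 8: "I", 9: "J"}

def encode_grid(grid, symbolic=False):
    """Render a grid of integer color codes as prompt text."""
    return "\n".join(
        " ".join(SYMBOLS[c] if symbolic else str(c) for c in row)
        for row in grid
    )

def compare_encodings(tasks, query_model):
    """Score identical tasks under numeric and symbolic encodings.

    tasks: list of {"input": grid, "output": grid} dicts (illustrative).
    query_model: stand-in for your API call; takes a prompt string and
    returns the model's predicted output grid as text.
    """
    scores = {"numeric": 0, "symbolic": 0}
    for task in tasks:
        for label, symbolic in (("numeric", False), ("symbolic", True)):
            prompt = ("Input grid:\n" + encode_grid(task["input"], symbolic)
                      + "\nProduce the transformed output grid.")
            if query_model(prompt).strip() == encode_grid(task["output"], symbolic):
                scores[label] += 1
    return scores
```

A large gap between the two scores on the same puzzles is evidence of pattern exploitation rather than reasoning, which is exactly what the symbol-swap finding suggested.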

The Human Threshold and the Hallucination Trade-off

Despite the caveats, the author marks a genuine milestone: the point where frontier models can no longer be clearly outperformed by the average human on text-based reasoning tasks. "I think it's worth marking the moment wherein I don't think you can write a test at which the average human, the average man or woman on the street would clearly outperform Frontier models." This is a significant shift from the era of "super intelligence" hype, grounding the discussion in a tangible, human-centric metric.

However, the piece also warns against ignoring the downsides of optimization. While Gemini 3.1 Pro shows high scores on correctness, it also has a higher rate of hallucination compared to some competitors when it does get an answer wrong. "If you can't take me in my bad moments, you don't deserve me in my good moments." This trade-off is often glossed over in marketing materials. The author notes that while Google's model card claims deep thinking modes improve performance, internal data suggests they can actually reduce efficiency without adding capability. This transparency is rare and valuable, cutting through the usual hype cycle.
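
Because a single leaderboard number conflates the two, it may help to see how accuracy and hallucination-when-wrong separate in practice. This is a hedged sketch with an illustrative record format, not any lab's actual evaluation schema.

```python
# Illustrative sketch: accuracy and "hallucination when wrong" are distinct
# metrics computed from the same eval run. The record format is an assumption.

def score_eval(records):
    """records: list of {"correct": bool, "abstained": bool} dicts,
    where abstained=True means the model declined to answer."""
    accuracy = sum(r["correct"] for r in records) / len(records)
    wrong = [r for r in records if not r["correct"]]
    # Of the questions the model got wrong, how often did it confidently
    # assert an answer anyway instead of abstaining?
    hallucinated = [r for r in wrong if not r["abstained"]]
    rate_when_wrong = len(hallucinated) / len(wrong) if wrong else 0.0
    return {"accuracy": accuracy, "hallucination_when_wrong": rate_when_wrong}

# Two models can tie on accuracy yet differ sharply on the second number:
model_a = [{"correct": True, "abstained": False}] * 8 + \
          [{"correct": False, "abstained": True}] * 2
model_b = [{"correct": True, "abstained": False}] * 8 + \
          [{"correct": False, "abstained": False}] * 2
print(score_eval(model_a))  # 0.8 accuracy, 0.0 hallucination when wrong
print(score_eval(model_b))  # 0.8 accuracy, 1.0 hallucination when wrong
```

This is the trade-off the author flags: a model card can truthfully advertise the first number while staying silent on the second.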

The Specialization Bet

The commentary concludes by exploring the strategic bet being made by industry leaders like Anthropic. Dario Amodei's assertion that specializing in enough specific domains will eventually lead to generalization is a bold hypothesis. "If you specialize in enough specialisms, you'll generalize to all specialisms." This approach suggests that we may not need continuous learning from every user's specific data if the model has seen enough variations of similar problems.

AI Explained writes, "Anthropic might not need the data from your domain." Or maybe, as he says later, models "will almost get there, but just need a bit more context about your domain in the context window, in the prompt you give it." This reframes the value of massive context windows not just as memory, but as a bridge to fill the gaps left by specialization. The argument is compelling, though it relies on the assumption that human domains share enough underlying patterns to be deduced from a finite set of specialisms.
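
A rough sketch of what that bridge might look like in practice, assuming you hold your own domain documents: pack as much of them as the budget allows ahead of the actual question. Here call_model is a hypothetical client and the character budget is a stand-in for a real token limit.

```python
# Hypothetical sketch: carrying domain knowledge in the prompt rather than
# in the weights. `call_model` stands in for whatever API client you use.

def build_domain_prompt(domain_docs, question, budget_chars=200_000):
    """Pack domain documents up to a rough size budget, then ask."""
    packed, used = [], 0
    for doc in domain_docs:
        if used + len(doc) > budget_chars:
            break  # a real system would count tokens, not characters
        packed.append(doc)
        used += len(doc)
    return ("Use the reference material below to answer.\n\n"
            + "\n---\n".join(packed)
            + f"\n\nQuestion: {question}")

def answer_with_context(call_model, domain_docs, question):
    """The specialization gap gets filled at request time, in context."""
    return call_model(build_domain_prompt(domain_docs, question))
```

The design point mirrors the author's: the domain expertise travels with each request instead of requiring the lab to post-train on your data.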

Bottom Line

AI Explained provides a necessary corrective to the blind faith in leaderboard scores, arguing that the era of the "best overall model" is over. The strongest part of this argument is the evidence that benchmarks are increasingly gamed by shortcuts, making domain-specific testing essential for real-world application. The biggest vulnerability remains the uncertainty of whether this specialization strategy can truly scale to general intelligence without the messy, continuous learning of human interaction. Readers should stop looking for a single champion and start evaluating models based on their specific workflow needs.

Sources

Gemini 3.1 Pro and the downfall of benchmarks: Welcome to the vibe era of AI

by AI Explained

The latest and, some would say, greatest AI model has just been released, Gemini 3.1 Pro. And in the 24 hours since release, as well as a short period of early access, I have tested it hundreds of times. And, of course, read its model card. But here's the thing.

For the average user, I want to get beyond the headline scores and try to give you a sense of why every new hot take you see on X or YouTube or TikTok or a podcast seems to contradict the last one you saw. Because there's actually a technical reason for the confusion over which model is best overall. But I will say that there's one private benchmark of my own that has recently seen a model pass a threshold that I think is worth talking about. First, 30 seconds of context, because you may well know that the pre-training stage of growing or training LLMs involves training them on internet-scale data.

But that actually now only accounts for 20% of the compute that is spent on training LLMs. So it's the post-training stage, as I wrote about in my newsletter, where those generalist base models are honed against internal benchmarks on specific domains. This includes using industry-sourced data to get particularly good at, perhaps, your domain. Here's the catch.

Just a year ago, that wasn't the case. Dario Amodei, CEO of Anthropic, said back then, "The amount being spent on the second stage, RL stage, is small for all players." Why did I give you that context though? Well, because if one of these labs have data relevant to your domain and post-train their models to optimize for high scores in that area, then your experience of that model might be quite different to what other benchmarks say. In the older paradigm, if a model was clearly better in one domain, it was much more likely that they'd be better in many or all domains.

That just isn't the case anymore. In fact, my second point in this newsletter was precisely an example of that. Many of you would have heard about the intense discussion surrounding Claude Code and all sorts of Claude-powered agents that are now sweeping the web. So, are we seeing exponential improvement right across the board?

Well, let's take one chess puzzle benchmark made by Epoch AI and more on them later. 5 months ago, Claude ...