Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI
The latest, and some would say greatest, AI model has just been released: Gemini 3.1 Pro. In the 24 hours since release, as well as a short period of early access, I have tested it hundreds of times. And, of course, read its model card. But here's the thing.
For the average user, I want to get beyond the headline scores and try to give you a sense of why every new hot take you see on X or YouTube or TikTok or a podcast seems to contradict the last one you saw. Because there's actually a technical reason for the confusion over which model is best overall. But I will say that there's one private benchmark, my own, that has recently seen a model pass a threshold I think is worth talking about. First, 30 seconds of context, because you may well know that the pre-training stage of growing or training LLMs involves training them on internet-scale data.
But that actually now accounts for only 20% of the compute that is spent on training LLMs. So it's the post-training stage, as I wrote about in my newsletter, where those generalist base models are honed against internal benchmarks on specific domains. This includes using industry-sourced data to get particularly good at, perhaps, your domain. Here's the catch.
Just a year ago, that wasn't the case. Dario Amodei, CEO of Anthropic, said back then, "The amount being spent on the second stage, the RL stage, is small for all players." Why did I give you that context, though? Well, because if one of these labs has data relevant to your domain and post-trains its models to optimize for high scores in that area, then your experience of that model might be quite different to what other benchmarks say. In the older paradigm, if a model was clearly better in one domain, it was much more likely that it'd be better in many or all domains.
That just isn't the case anymore. In fact, my second point in this newsletter was precisely an example of that. Many of you will have heard about the intense discussion surrounding Claude Code and all sorts of Claude-powered agents that are now sweeping the web. So, are we seeing exponential improvement right across the board?
Well, let's take one chess puzzle benchmark made by Epoch AI, and more on them later. Five months ago, Claude ...
Watch the full video by AI Explained on YouTube.