
Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI

The latest, and some would say greatest, AI model has just been released: Gemini 3.1 Pro. In the 24 hours since release, plus a short period of early access, I have tested it hundreds of times, and of course read its model card. But here's the thing: for the average user, I want to get beyond the headline scores and give you a sense of why every new hot take you see on X, YouTube, TikTok, or a podcast seems to contradict the last one you saw. There's actually a technical reason for the confusion over which model is best overall. That said, one private benchmark of my own has recently seen a model pass a threshold that I think is worth talking about.

First, 30 seconds of context. You may well know that the pre-training stage of training LLMs involves internet-scale data. But pre-training now accounts for only around 20% of the compute spent on training LLMs. It's the post-training stage, as I wrote about in my newsletter, where those generalist base models are honed against internal benchmarks on specific domains, including using industry-sourced data to get particularly good at, perhaps, your domain. Here's the catch: just a year ago, that wasn't the case. Dario Amodei, CEO of Anthropic, said back then, "The amount being spent on the second stage, the RL stage, is small for all players."

Why give you that context? Because if one of these labs has data relevant to your domain and post-trains its models to optimize for high scores in that area, then your experience of that model might be quite different from what other benchmarks say. In the older paradigm, if a model was clearly better in one domain, it was much more likely to be better in many or all domains. That just isn't the case anymore. In fact, my second point in that newsletter was precisely an example of this.
Many of you will have heard the intense discussion surrounding Claude Code and all the Claude-powered agents now sweeping the web. So we're seeing exponential improvement right across the board, right? Well, take one chess puzzle benchmark made by Epoch AI (more on them later). Five months ago, Claude Sonnet 4.5, Anthropic's smaller model compared to Opus, scored 12%. Just last week, Claude Opus 4.6, five months further on, scored just 10%. That's not to knock Claude Opus 4.6; I use it all the time, and it's an incredible model at coding. And of course, if the AI labs wanted to improve this performance, they easily could: I think GPT 5.2 on extra-high gets around 50%. But you could say chess is a fairly pure measure of a general, forward-thinking reasoning process. In the generalist era of AI, you would therefore expect chess performance to translate to all sorts of other domains. We're just not in that paradigm anymore; it's going to depend on the domain you're in.

None of this is to say that Gemini 3.1 Pro isn't an incredible model. It is. In almost any domain you care to measure, it will be competitive with the best other models, like Claude Opus 4.6 or GPT 5.3. But you would be understandably confused to see it ahead in all sorts of coding benchmarks, in measures of scientific and academic reasoning like GPQA Diamond and Humanity's Last Exam respectively, and in general pattern recognition (ARC-AGI-2, which I'll come back to), and yet, in a head-to-head on GDPval, a broad measure of the expert tasks human professionals do that I've covered many times on the channel before, fall seemingly quite far behind Claude Opus 4.6 and even GPT 5.2. Yes, one big explanation is the domain specialization I talked about earlier, but there are three or four fascinating bits of context I want you to be aware of in addition.
First, let's zoom into ARC-AGI-2, on which its score of 77.1% puts it way ahead of Claude Opus 4.6 (the more expensive model), which got around 69%. I start with this one because Demis Hassabis, CEO of Google DeepMind, featured it prominently in his Twitter post announcing the launch of Gemini 3.1 Pro. And on puzzles that shouldn't be in its training data, the Gemini 3 series outperforms all other models on a cost-efficient basis. But the first additional caveat comes from Melanie Mitchell, a famous AI researcher and professor. She pointed out that if you change the encoding from numbers to other symbols, accuracy goes down. Digging deeper, her group found that the numbers representing colors in the input can be used by LLMs to find unintended arithmetic patterns that lead to accidentally correct solutions. I wouldn't call that the model cheating; it's using any shortcut it can find to get the correct solution. Fair's fair. But it does remind us that even within a benchmark, how you set up the question matters.

Okay, but let's say you don't care about ARC-AGI-2 or SimpleBench or any other benchmark, just coding performance. Well, the creator of the ARC-AGI series, François Chollet, has this to say: sufficiently advanced agentic coding is essentially machine learning. A goal is given to the agent or agent swarm, and the coding agents iterate until the goal is reached. As in other areas of machine learning, the result is a black-box model: you have a codebase that performs the task, but you don't necessarily inspect the internal logic. Just as Gemini 3.1 may have found spurious patterns in ARC-AGI, in your codebase Claude or Codex may overfit to the spec or drift from your original concept. So the fallibilities presented in this video are relevant to you even if you only care about coding, or about letting your OpenClaw agents code for you. Gemini 3.1 Pro did indeed hit a record Elo in LiveCodeBench Pro, which involves competitive coding problems.
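Mitchell's encoding point is easy to demonstrate concretely. Below is a minimal sketch (my own illustration, not the actual ARC-AGI harness or its real prompt format) of how the same grid can be serialized with raw color digits, which expose arithmetic shortcuts like row sums, or with arbitrary symbols, which preserve the puzzle's structure while removing them:

```python
# Illustrative only: ARC-style grids are lists of lists of color indices 0-9.
# The symbol mapping below is an arbitrary choice for demonstration.

SYMBOLS = {0: ".", 1: "A", 2: "B", 3: "C", 4: "D",
           5: "E", 6: "F", 7: "G", 8: "H", 9: "I"}

def serialize_numeric(grid):
    """Encode each cell as its raw color index (digits invite arithmetic)."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def serialize_symbolic(grid):
    """Encode each cell as an arbitrary symbol: same structure, no arithmetic."""
    return "\n".join(" ".join(SYMBOLS[c] for c in row) for row in grid)

grid = [[0, 1, 2],
        [2, 1, 0]]

print(serialize_numeric(grid))   # 0 1 2 / 2 1 0
print(serialize_symbolic(grid))  # . A B / B A .
```

If a model's accuracy drops under the second serialization, it was leaning on the digits, not the abstract pattern.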
That's great, but you can turn that optimization dial a little too far. Let me show you what happened last night when I used Gemini 3.1 Pro inside Cursor. How do we reconcile these pages of pablum with the record-breaking Elo? Well, again, that's the theme of the video.

If I sound unduly skeptical of Gemini 3.1 Pro, by the way, let me try to balance that out by heaping on some praise. On my private SimpleBench, a test of, you could say, trick questions or common-sense reasoning, it beat its own previous record from Gemini 3 Pro and got 79.6%. That essentially brings it within the margin of error for the human average baseline, at least among the nine participants that we used. And I do want to spend just 60 seconds marking the threshold I think this represents. All the time on podcasts and in articles, you hear AI models compared to professionals and experts, and phrases like "superintelligence" and "recursive self-improvement" bandied around. But what about comparing models to the average human? Sure, you can still find audio or visual puzzles that they will fail at and the average human wouldn't. But in English, in text alone, I think it's worth marking the moment at which I don't think you can write a test on which the average human, the average man or woman on the street, would clearly outperform frontier models. I'm not talking about exploiting tokenization bugs like "how many Rs in strawberry"; I'm talking about a fair text-based test in English with a non-specialist human. Let me know if you disagree, but I think the passing of that threshold is a moment worth marking. I will note that even with SimpleBench, we get a reminder of the caveat I was just describing: models are brilliant at shortcuts.
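To make the "margin of error" claim concrete: with only nine human participants, the 95% confidence interval around the human mean is fairly wide. The scores below are placeholders, not the real SimpleBench baseline data, but the arithmetic is the standard t-interval for a small sample:

```python
import math

# Placeholder scores (percent) for nine human participants -- the real
# SimpleBench baseline data is not reproduced here.
human_scores = [72, 85, 90, 78, 88, 81, 95, 76, 84]

n = len(human_scores)
mean = sum(human_scores) / n
# Sample standard deviation (n - 1 denominator), then standard error of the mean.
sd = math.sqrt(sum((s - mean) ** 2 for s in human_scores) / (n - 1))
sem = sd / math.sqrt(n)
# t critical value for a 95% interval with 8 degrees of freedom is ~2.306.
margin = 2.306 * sem
low, high = mean - margin, mean + margin

model_score = 79.6
print(f"human mean {mean:.1f}%, 95% CI [{low:.1f}%, {high:.1f}%]")
print("model within the interval:", low <= model_score <= high)
```

With nine participants, a spread of even a few points in individual scores produces an interval several points wide, which is why a model score slightly below the human mean can still be "within the margin of error."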
I had noticed at least 12 months ago that because SimpleBench was a set of multiple-choice questions, one of the answers being, for example, zero would sometimes flag to the model that this might be a trick question. Even in question one, about frying eggs in a pan (you can see it in the "Try Yourself" section), the fact that "zero ice cubes left in the pan" appears as just one of the options may alert the model to think: hang on, how could there be zero? How would that be possible? So what happens if you take away the multiple-choice options, get the models to answer in an open-ended fashion, and then get a blind grader model to compare their answers to the hidden correct answer? Well, you still get some pretty impressive scores, just not quite as high. Call it a 15 to 20 percentage point drop. That's, by the way, a double reminder. Yes, models are taking shortcuts, and yes, if you ask the same question in a different way, performance may well be different. But it's not like performance dropped to zero. Frontier models are genuinely getting better, even in domains they didn't directly train on.

Time for the next big caveat before we return to the glory of the exponential. Let's take the brand-new Gemini 3.1 Pro and Anthropic's Claude Sonnet 4.6 from this week. How do they do in terms of hallucinations, or factual accuracy? You'll notice that model providers don't often want to talk about or measure hallucinations anymore, because that was predicted to be a solved problem by now. And on this release chart from Google, there wasn't a direct measure of hallucinations. In fairness, they did cite the AA-Omniscience benchmark from Artificial Analysis, and on first glance Gemini 3.1 Pro seems to shellac the other models: Gemini's top score of +30 compares to Claude Opus 4.6 at +11 and Claude Sonnet 4.6 at -4. And that's even accounting for penalizing hallucinations as well as rewarding correct answers.
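The open-ended regrading protocol described above (answer freely with no options shown, then a blind judge compares the answer against the hidden key) can be sketched like this. `toy_judge` is a deterministic stand-in for the grader model, purely for illustration; in practice both calls would hit an LLM endpoint:

```python
# Sketch of the open-ended variant: no multiple-choice options are shown,
# and a blind grader compares the free-text answer to the hidden correct
# answer. The grader never sees which model produced the answer.

GRADER_TEMPLATE = (
    "Hidden correct answer: {key}\n"
    "Candidate answer: {answer}\n"
    "Reply CORRECT or INCORRECT. Judge meaning, not wording."
)

def grade_open_ended(answer: str, hidden_key: str, judge) -> bool:
    """`judge` is any LLM call; returns True if the judge accepts the answer."""
    verdict = judge(GRADER_TEMPLATE.format(key=hidden_key, answer=answer))
    return verdict.strip().upper().startswith("CORRECT")

def toy_judge(prompt: str) -> str:
    # Deterministic stand-in grader: a naive substring match on the key.
    key = prompt.split("Hidden correct answer: ")[1].split("\n")[0]
    answer = prompt.split("Candidate answer: ")[1].split("\n")[0]
    return "CORRECT" if key.lower() in answer.lower() else "INCORRECT"

print(grade_open_ended("the ice cubes melted in the pan", "melted", toy_judge))
```

The key design point is that the judge sees only the answer and the key, never the answer options or the model's identity, so no "one of the options is zero" signal leaks through.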
However, if we zoom in on just the incorrect answers, and whether the model hallucinated a wrong answer or explanation versus refused to answer or admitted to not knowing, Gemini 3.1 comes in at 50% of its incorrect answers being hallucinations, whereas Claude Sonnet 4.6 is down at 38%, which is better. Interestingly, GLM 5, a Chinese model, is better still at 34%. So hallucination is definitely not a solved problem, and just because a model is optimized, or better at its best, does not preclude it being worse at its worst. What's that saying? If you can't take me in my bad moments, you don't deserve me in my good moments. Well, for all models, you're going to have to handle that kind of trade-off.

One quick note on the model card for Gemini 3.1: it's only nine pages. And as ever, these model, system, or safety reports serve the purpose of de-hyping, while the CEO's release post or release video serves the purpose of hyping. For example, let's focus on Gemini 3.1 in the cyber domain. If you are an Ultra subscriber, you can use Deep Think mode, and Google's model card says this: accounting for inference costs, the model with Deep Think performs considerably worse than without Deep Think. Even at high levels of inference, results for the model with Deep Think do not suggest higher capability than without it. Okay, that's Deep Think mode, which we might discuss another time. But what about just 3.1 Pro? Well, back to that specializing in individual domains: they found that in one test of machine-learning R&D, optimizing LLM Foundry (a task involving fine-tuning), 3.1 Pro could indeed reduce the runtime of a fine-tuning script from 300 seconds to 47 seconds, better even than the human reference solution of 94 seconds.
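The two numbers at play here can be sketched with illustrative scoring weights (+1 for a correct answer, -1 for a hallucinated wrong answer, 0 for an abstention; the benchmark's exact weighting is an assumption on my part, but the structure is what matters):

```python
# Illustrative only: how a net accuracy score and a hallucination rate can
# tell different stories about the same model.

def penalized_score(correct: int, hallucinated: int, abstained: int) -> float:
    """Net score out of 100: rewards correct answers, penalizes confident
    wrong answers, and is neutral on honest abstentions."""
    total = correct + hallucinated + abstained
    return 100 * (correct - hallucinated) / total

def hallucination_rate(hallucinated: int, abstained: int) -> float:
    """Of the questions a model got no credit for, the share where it
    confidently made something up rather than admitting ignorance."""
    return 100 * hallucinated / (hallucinated + abstained)

# A model can post a strong net score while still hallucinating on half of
# the questions it fails to answer correctly:
print(penalized_score(50, 25, 25))   # 25.0
print(hallucination_rate(25, 25))    # 50.0
```

This is why a model can top the headline leaderboard while another model is "better at its worst": the first metric mixes knowledge and honesty, the second isolates honesty.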
Before, you might have read that as meaning it's now going to accelerate its own self-improvement through machine-learning R&D; now, you might instead interpret it as: oh, they added in some new fine-tuning data, data about fine-tuning, or an internal benchmark measuring fine-tuning performance. But enough with the caveats. What do all of these models from the last few weeks, including Gemini 3.1, show about what we're about to unleash on the world? Because many of the exponentials you do see are real and meaningful.

But first, the sponsors of today's video, Epoch AI, because just yesterday they featured one exponential you may not have heard of: Anthropic's annualized revenue was 10x-ing per year as of the tail end of 2025, whereas OpenAI's is 3.4x-ing per year, albeit from a bigger base. It's a big if, but if those trends continue, by mid-2026 we could see Anthropic out-earning OpenAI. As you might have guessed, Epoch AI's research is one of the main ways I stay on top of AI research and developments. I've been covering them for years, even before they were a sponsor, and their newsletter is also incredible. If you want to learn what powers some of these exponentials, focus on their Frontier Data Center analysis. I had to double-check with them that it was free; I just couldn't quite believe it. But it is. Check out the unique link in the description.

Back to the central question of whether benchmark performance measures general intelligence. I have given you lots of counterarguments, but Dario Amodei, CEO of Anthropic, did raise a point the other day that gave real insight into the bet Anthropic is making. He was asked: why do you need all these RL environments specializing in, for example, Slack or browser use? Surely all of that is redundant if models are going to keep getting generally smarter. Amodei said this:
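The crossover claim is simple compound growth: with growth rates of 10x and 3.4x per year, the revenue gap closes by a factor of 10/3.4 ≈ 2.94 each year. The starting figures below are round placeholders for illustration, not reported numbers:

```python
import math

# Placeholder starting points ($B annualized) -- round numbers for
# illustration, not reported figures.
anthropic_rev = 10.0
openai_rev = 20.0

anthropic_growth = 10.0   # 10x per year
openai_growth = 3.4       # 3.4x per year

# The gap closes by (10 / 3.4)x per year, so solve for t in:
# openai_rev / anthropic_rev = (anthropic_growth / openai_growth) ** t
years_to_crossover = (
    math.log(openai_rev / anthropic_rev)
    / math.log(anthropic_growth / openai_growth)
)
print(f"{years_to_crossover:.2f} years")  # ~0.64 years at these placeholders
```

With any plausible starting gap of 2-4x, the same arithmetic lands the crossover well within a year, which is where the "mid-2026" projection comes from, assuming, of course, that both growth rates hold.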
"Yes, we're trying to get a whole bunch of data, not because we want to cover a specific document or specific skill, but because we want to generalize." For me, this is critical, because what he is almost saying is that if you specialize in enough specialisms, you'll generalize to all specialisms. This is why, later in the same interview, he said that we can get most of the way to AGI, or superintelligence, or a "country of geniuses in a data center," without continual learning, without learning on the job, without you teaching the model about your domain. How could you get to superintelligence without that data? Again, in my words: I think he believes that if you specialize in enough specialisms, there are only so many patterns to be deduced from human training data. Yes, they're going to work on continual learning in case that's not so. But if it is, Anthropic might not need the data from your domain. Or maybe, he says later, models will almost get there but just need a bit more context about your domain in the context window, in the prompt you give them. This is why he says one idea they've got is simply to make the context longer: "There's nothing preventing longer context from working. You just have to train at longer context and then learn to serve them at inference." In other words, there might be a little nuance in your domain that this generalist model doesn't know even after specializing in all the specialisms; it might only generalize so far and need a bit more context from your domain. But Claude 4.6 can now absorb 750,000 words in its context window. In short order, that mi
