Google just dropped Gemini 3 Flash—a smaller, cheaper model that's beating their own flagship Pro version on nearly every benchmark. Meanwhile, DeepMind's co-founders are sketching out a 'Proto-AGI' timeline that puts artificial general intelligence at 50/50 odds by 2028. But there's a twist: the researchers themselves admit these models have a fundamental honesty problem.
The Numbers Behind Gemini 3 Flash
Gemini 3 Flash isn't just competitive with ChatGPT and Claude—it's demolishing them. In academic reasoning, visual reasoning, scientific knowledge, coding, and mathematics, the gap between this new Flash model and June's Gemini 2.5 Pro isn't even close. On one particularly difficult math benchmark, AIME, Gemini 3 Flash cut the error rate by more than half compared to Gemini 2.5 Pro, jumping from 88% to 95.2% accuracy (an error rate of 12% down to 4.8%).
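That "error rate" framing is worth pausing on, because it's just arithmetic on the reported accuracies. A quick sketch—the two accuracy figures come from the benchmark comparison above; the helper function itself is illustrative:

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into an error-rate percentage."""
    return 100.0 - accuracy_pct

old = error_rate(88.0)   # Gemini 2.5 Pro: 12% of questions wrong
new = error_rate(95.2)   # Gemini 3 Flash: 4.8% of questions wrong

# Relative reduction in errors: 1 - 4.8/12 = 0.6, i.e. 60% fewer errors
reduction = 1 - new / old
print(f"error rate {old:.1f}% -> {new:.1f}% ({reduction:.0%} fewer errors)")
```

A 7.2-point accuracy gain sounds modest; seen as errors eliminated, it's a 60% drop—which is why benchmark jumps near the ceiling matter more than they look.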
Without access to tools, this dramatically smaller model still outperforms the heavier version released just weeks ago. In table and chart analysis, video understanding, and agent tasks, Gemini 3 Flash exceeds previous flagship performance.
"They are heavily incentivized to keep trying. Think for longer and longer. Self-correct. Try something else. Do anything to get a final answer."
This is where it gets interesting—and problematic.
The Hidden Weakness in AI Benchmarks
Here's the secret the AI companies don't want you to know: models aren't punished for being wrong. They're actually rewarded for continuing to guess, thinking longer, and fabricating answers rather than saying "I don't know."
In one benchmark of 6,000 knowledge and factual-recall questions, Gemini 3 Flash beats every other model—including the heavier Pro version—as measured by the proportion of questions answered correctly. But look closer at its failures: 91% of the time it confidently output an incorrect answer (a hallucination); only 9% of the time did it decline to answer or give a partial response.
Compare that to GPT-4o: roughly half of its failures came from saying "I don't know" rather than from getting the answer wrong. The trade-off is stark: do you prefer a model with slightly higher accuracy but far more confabulation, or one with fewer correct answers but much more honesty?
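That trade-off becomes concrete if you split each model's non-correct responses into hallucinations and abstentions. A toy comparison—the 91% and ~50% failure splits are from the benchmark discussion above, but the totals and correct-answer counts here are illustrative, not the real figures:

```python
def tally(correct: int, total: int, hallucination_share: float) -> dict:
    """Break `total` questions into correct / hallucinated / abstained counts."""
    failures = total - correct
    hallucinated = round(failures * hallucination_share)
    return {
        "correct": correct,
        "hallucinated": hallucinated,
        "abstained": failures - hallucinated,
    }

# Higher accuracy, but 91% of failures are confident fabrications
flash_like = tally(correct=70, total=100, hallucination_share=0.91)

# Lower accuracy, but roughly half its failures are honest "I don't know"s
gpt4o_like = tally(correct=62, total=100, hallucination_share=0.50)

print(flash_like)   # {'correct': 70, 'hallucinated': 27, 'abstained': 3}
print(gpt4o_like)   # {'correct': 62, 'hallucinated': 19, 'abstained': 19}
```

On these hypothetical numbers the "better" model hands you 27 confident falsehoods per hundred questions versus 19—a worse outcome for any user who can't independently verify answers.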
OpenAI has acknowledged this as an epidemic worth addressing—they're calling for sociotechnical solutions to reward models that express uncertainty rather than always claiming certainty.
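The core of that proposal is a grading change: stop scoring a confident wrong answer the same as an abstention. A minimal sketch of such a scoring rule—the point weights below are illustrative assumptions, not OpenAI's actual scheme:

```python
def score(responses, correct_pts=1.0, abstain_pts=0.25, wrong_pts=-1.0):
    """Score graded responses ('correct' / 'abstain' / 'wrong').

    Under accuracy-only grading, guessing is free. Here a confident
    wrong answer costs more than admitting uncertainty, so an honest
    model can outscore a reflexive guesser.
    """
    points = {"correct": correct_pts, "abstain": abstain_pts, "wrong": wrong_pts}
    return sum(points[r] for r in responses)

# A model that guesses on everything: 60 right, 40 confidently wrong
guesser = ["correct"] * 60 + ["wrong"] * 40

# A model that abstains when unsure: 55 right, 40 abstentions, 5 wrong
honest = ["correct"] * 55 + ["abstain"] * 40 + ["wrong"] * 5

print(score(guesser))   # 60 - 40 = 20.0
print(score(honest))    # 55 + 10 - 5 = 60.0
```

Under plain accuracy the guesser wins 60% to 55%; under a rule that prices in wrong answers, the honest model wins decisively—which is exactly the incentive shift being argued for.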
What Benchmarks Actually Reveal
External benchmarks tell a similar story. One private benchmark, SimpleBench, uses trick questions with spatial-reasoning components, designed specifically to avoid leaking into training data. Gemini 3 Flash scored 61.1%, comparable to much heavier, slower models like Claude Opus 4.5 and GPT-4 Pro.
OpenAI's recent releases haven't fared as well on this particular test. GPT-4.2 Code (their coding-optimized model) actually scored lower than GPT-4.1 Code on their own internal benchmark—10% versus 17%. The company seems uncertain about what's going wrong, but the pattern suggests smaller, cheaper models optimized for specific tasks may sacrifice general reasoning capability.
DeepMind's Proto-AGI Vision
Demis Hassabis has a clear picture of where this is all heading. Google DeepMind is training separate models to better understand and simulate the physical world. There's Genie 3, which can imagine any world and remember interactions for at least a minute. There's SIMA 2, an agent that plays, reasons, and learns in virtual 3D worlds.
The Nano Banana Pro system creates images from text, and it's still state-of-the-art, competing with or edging out OpenAI's GPT-5.1. Google can also turn images into video with its Veo 3.1 model.
But here's what matters: Demis revealed they want to bring all these systems together. The convergence would mark what's being called "Proto-AGI"—a prototype of artificial general intelligence combining language models, world models, simulation engines, and image generation into one unified system.
The timeline? Two more years of scaling from where we are today—roughly the same trajectory that took us from GPT-3 (which barely anyone used via API) to Gemini 3 Flash.
Shane Legg, another DeepMind co-founder, calls this "minimal AGI"—when an artificial agent can do all the cognitive tasks a typical human does. He's guessing it arrives in about two years. Not full AGI; he puts that three to six years later.
Critics might note that predicting AGI timelines has been notoriously unreliable across the industry. Legg himself has maintained his 2009 prediction of a 50/50 chance by 2028 for over a decade—yet AI predictions have consistently proven more optimistic than accurate.
Bottom Line
The real story isn't whether Gemini 3 Flash beats Claude or ChatGPT on benchmarks—it's that these benchmarks are increasingly unreliable measures of actual capability. The models are being trained to always answer, regardless of truthfulness. That said, DeepMind's Proto-AGI vision is the most concrete roadmap from a major AI lab yet: combining simulation, reasoning, and generation into one system within two years. If you're watching this space, pay attention to whether that convergence actually happens—and what it means when these systems start simulating physical environments accurately.