Google just dropped Gemini 3 Flash—a smaller, cheaper model that's beating their own flagship Pro version on nearly every benchmark. Meanwhile, DeepMind's co-founders are sketching out a 'Proto-AGI' timeline that puts artificial general intelligence at 50/50 odds by 2028. But there's a twist: the researchers themselves admit these models have a fundamental honesty problem.
The Numbers Behind Gemini 3 Flash
Gemini 3 Flash isn't just competitive with ChatGPT and Claude—it's demolishing them. In academic reasoning, visual reasoning, scientific knowledge, coding, and mathematics, the gap between this new Flash model and June's Gemini 2.5 Pro isn't even close. On one particularly difficult math benchmark, AIME, Gemini 3 Flash cut the error rate by more than half compared to Gemini 2.5 Pro, jumping from 88% to 95.2% accuracy (an error rate of 12% down to 4.8%).
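That "error rate" framing is worth pausing on, because it's just arithmetic on the reported accuracies. A quick sketch—the two accuracy figures come from the benchmark comparison above; the helper function itself is illustrative:

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into an error-rate percentage."""
    return 100.0 - accuracy_pct

old = error_rate(88.0)   # Gemini 2.5 Pro: 12% of questions wrong
new = error_rate(95.2)   # Gemini 3 Flash: 4.8% of questions wrong

# Relative reduction in errors: 1 - 4.8/12 = 0.6, i.e. 60% fewer errors
reduction = 1 - new / old
print(f"error rate {old:.1f}% -> {new:.1f}% ({reduction:.0%} fewer errors)")
```

A 7.2-point accuracy gain sounds modest; seen as errors eliminated, it's a 60% drop—which is why benchmark jumps near the ceiling matter more than they look.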
Without access to tools, this dramatically smaller model still outperforms the heavier version released just weeks ago. In table and chart analysis, video understanding, and agent tasks, Gemini 3 Flash exceeds previous flagship performance.
"They are heavily incentivized to keep trying. Think for longer and longer. Self-correct. Try something else. Do anything to get a final answer."
This is where it gets interesting—and problematic.
The Hidden Weakness in AI Benchmarks
Here's the secret the AI companies don't want you to know: models aren't punished for being wrong. They're actually rewarded for continuing to guess, thinking longer, and fabricating answers rather than saying "I don't know."
In one benchmark of 6,000 knowledge and factual-recall questions, Gemini 3 Flash beats every other model—including the heavier Pro version—as measured by the proportion of questions answered correctly. But look closer at its failures: 91% of the time it confidently output an incorrect answer (a hallucination); only 9% of the time did it decline to answer or give a partial response.
Compare that to GPT-4o: roughly half of its failures came from saying "I don't know" rather than from getting the answer wrong. The trade-off is stark: do you prefer a model with slightly higher accuracy but far more confabulation, or one with fewer correct answers but much more honesty?
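That trade-off becomes concrete if you split each model's non-correct responses into hallucinations and abstentions. A toy comparison—the 91% and ~50% failure splits are from the benchmark discussion above, but the totals and correct-answer counts here are illustrative, not the real figures:

```python
def tally(correct: int, total: int, hallucination_share: float) -> dict:
    """Break `total` questions into correct / hallucinated / abstained counts."""
    failures = total - correct
    hallucinated = round(failures * hallucination_share)
    return {
        "correct": correct,
        "hallucinated": hallucinated,
        "abstained": failures - hallucinated,
    }

# Higher accuracy, but 91% of failures are confident fabrications
flash_like = tally(correct=70, total=100, hallucination_share=0.91)

# Lower accuracy, but roughly half its failures are honest "I don't know"s
gpt4o_like = tally(correct=62, total=100, hallucination_share=0.50)

print(flash_like)   # {'correct': 70, 'hallucinated': 27, 'abstained': 3}
print(gpt4o_like)   # {'correct': 62, 'hallucinated': 19, 'abstained': 19}
```

On these hypothetical numbers the "better" model hands you 27 confident falsehoods per hundred questions versus 19—a worse outcome for any user who can't independently verify answers.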
OpenAI has acknowledged this as an epidemic worth addressing—they're calling for sociotechnical solutions to reward models that express uncertainty rather than always claiming certainty.
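The core of that proposal is a grading change: stop scoring a confident wrong answer the same as an abstention. A minimal sketch of such a scoring rule—the point weights below are illustrative assumptions, not OpenAI's actual scheme:

```python
def score(responses, correct_pts=1.0, abstain_pts=0.25, wrong_pts=-1.0):
    """Score graded responses ('correct' / 'abstain' / 'wrong').

    Under accuracy-only grading, guessing is free. Here a confident
    wrong answer costs more than admitting uncertainty, so an honest
    model can outscore a reflexive guesser.
    """
    points = {"correct": correct_pts, "abstain": abstain_pts, "wrong": wrong_pts}
    return sum(points[r] for r in responses)

# A model that guesses on everything: 60 right, 40 confidently wrong
guesser = ["correct"] * 60 + ["wrong"] * 40

# A model that abstains when unsure: 55 right, 40 abstentions, 5 wrong
honest = ["correct"] * 55 + ["abstain"] * 40 + ["wrong"] * 5

print(score(guesser))   # 60 - 40 = 20.0
print(score(honest))    # 55 + 10 - 5 = 60.0
```

Under plain accuracy the guesser wins 60% to 55%; under a rule that prices in wrong answers, the honest model wins decisively—which is exactly the incentive shift being argued for.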
What Benchmarks Actually Reveal
External benchmarks tell a similar story. One private benchmark, SimpleBench, uses trick questions with spatial-reasoning components, designed specifically to avoid leaking into training data. Gemini 3 Flash scored 61.1%, comparable to much heavier, slower models like Claude Opus 4.5 and GPT-4 Pro.
OpenAI's recent releases haven't fared as well on this particular test. GPT-4.2 Code (their coding-optimized model) actually scored lower than GPT-4.1 Code on their own internal benchmark—10% versus 17%. The company seems uncertain about what's going wrong, but the pattern suggests smaller, cheaper models optimized for specific tasks may sacrifice general reasoning capability.
DeepMind's Proto-AGI Vision
Demis Hassabis has a clear picture of where this is all heading. Google DeepMind is training separate models to better understand and simulate the physical world. There's Genie 3, which can imagine any world and remember interactions for at least a minute. There's SIMA 2, an agent that plays, reasons, and learns in virtual 3D worlds.
The Nano Banana Pro system creates images from text, and it's still state-of-the-art, competing with or edging out OpenAI's GPT-5.1. Google can also turn images into video with its Veo 3.1 model.
But here's what matters: Demis revealed they want to bring all these systems together. The convergence would mark what's being called "Proto-AGI"—a prototype of artificial general intelligence combining language models, world models, simulation engines, and image generation into one unified system.
The timeline? Two more years of scaling from where we are today—roughly the same trajectory that took us from GPT-3 (which barely anyone used via API) to Gemini 3 Flash.
Shane Legg, another DeepMind co-founder, calls this "minimal AGI"—when an artificial agent can do all the cognitive tasks a typical human does. He's guessing it arrives in about two years. Not full AGI; he puts that three to six years later.
Critics might note that predicting AGI timelines has been notoriously unreliable across the industry. Legg himself has maintained his 2009 prediction of a 50/50 chance by 2028 for over a decade—yet AI predictions have consistently proven more optimistic than accurate.
Bottom Line
The real story isn't whether Gemini 3 Flash beats Claude or ChatGPT on benchmarks—it's that these benchmarks are increasingly unreliable measures of actual capability. The models are being trained to always answer, regardless of truthfulness. That said, DeepMind's Proto-AGI vision is the most concrete roadmap from a major AI lab yet: combining simulation, reasoning, and generation into one system within two years. If you're watching this space, pay attention to whether that convergence actually happens—and what it means when these systems start simulating physical environments accurately.