Ethan Mollick cuts through the noise of AI hype with a provocative, practical thesis: stop trusting public scores and start conducting your own job interviews. While the industry obsesses over whether a model can answer trivia about Homo erectus, Mollick argues we are blind to the actual, dangerous inconsistencies in how these systems think, reason, and advise.
The Trap of the Public Scorecard
Mollick begins by dismantling the reliability of the very metrics the industry uses to measure progress. He points out that "many benchmarks and their answer keys are public, so some AIs end up incorporating them into their basic training, whether by accident or so they can score highly on these benchmarks." This is a critical observation that mirrors Goodhart's law, the economist Charles Goodhart's dictum that when a measure becomes a target, it ceases to be a good measure. If the test is known, the system optimizes for the test, not for intelligence.
He illustrates the absurdity of current evaluations by noting that popular tests include questions like "What is the approximate mean cranial capacity of Homo erectus?" and asks, "What does getting this right tell us? I have no idea." This rhetorical question lands hard because it exposes the disconnect between a high score and real-world utility. The author notes that even when scores rise, "we don't know if moving from 84% correct to 85% is as challenging as moving from 40% to 41% correct." The data is trending up, but the calibration is broken.
You wouldn't hire a VP based solely on their SAT scores. You shouldn't pick the AI that will advise thousands of decisions for your organization based on whether it knows that the mean cranial capacity of Homo erectus is just under 1,000 cubic centimeters.
Critics might argue that standardized testing is the only scalable way to compare a fast-growing field of models, but Mollick's point is that scalability without validity is a liability. The underlying trend is real—models are getting better—but the specific metrics are often measuring the wrong things.
Benchmarking on Vibes
If the formal tests are flawed, Mollick suggests a more intuitive, albeit subjective, approach: "Benchmarking on Vibes." He describes how power users develop idiosyncratic tests to gauge a model's "world model" and internal logic. He shares his own quirk: asking every image model to "create an otter on a plane." While it sounds whimsical, Mollick explains that "these approaches also give you a sense of the AI's understanding of how things relate to each other."
He demonstrates the power of this method with a writing exercise about a character rationing their remaining words. The results reveal stark differences: "Gemini 2.5 Pro, currently the weakest of these four models, doesn't even accurately keep track of the number of words used," while others struggle with coherence. This section is effective because it shifts the focus from abstract accuracy to tangible behavior. It forces the reader to see the AI not as a database, but as an agent with distinct personality traits and failure modes.
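A probe like the word-rationing exercise is appealing precisely because it can be scored mechanically rather than by feel. As a minimal sketch, suppose the model is instructed to append a running tally like "[47 words left]" after each sentence (a hypothetical output format, not one Mollick specifies); a small checker can then verify that the claimed balance matches the words actually spent:

```python
import re

def check_word_ledger(story: str, budget: int) -> list[tuple[int, int]]:
    """Return (claimed, actual) mismatches for a word-rationing story.

    Assumes the model was told to append a running tally like
    '[47 words left]' after each sentence -- an invented convention
    used here purely for illustration.
    """
    mismatches = []
    spent = 0
    # re.split with a capture group alternates text and claimed tallies:
    # [text, claim, text, claim, ...]
    parts = re.split(r"\[(\d+) words left\]", story)
    for i in range(0, len(parts) - 1, 2):
        spent += len(parts[i].split())
        claimed = int(parts[i + 1])
        actual = budget - spent
        if claimed != actual:
            mismatches.append((claimed, actual))
    return mismatches

# Toy transcript: the first tally is honest, the second overstates the balance.
sample = ("I must be brief. [46 words left] "
          "Every syllable costs me dearly now. [44 words left]")
print(check_word_ledger(sample, budget=50))  # → [(44, 40)]
```

A model that "doesn't even accurately keep track of the number of words used," as Mollick says of Gemini 2.5 Pro, would fail a check like this immediately, turning a vibe into a measurable defect.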
However, relying on vibes has a clear weakness. Mollick admits, "we are relying on our feelings rather than real measures." A single creative prompt might not reveal systemic biases or safety failures. It is a useful heuristic for individuals, but it lacks the rigor needed for enterprise deployment.
The Real World Interview
The piece pivots to its most compelling argument: organizations must treat AI selection like hiring a human executive. Mollick highlights the GDPval paper from OpenAI as the gold standard for this approach. Instead of multiple-choice questions, researchers created "complex and realistic projects that would take human experts an average of four to seven hours to complete." The results were nuanced: the best models beat humans in software development but "pharmacists, industrial engineers, and real estate agents easily beat the best AI."
This reveals the "Jagged Frontier" of AI capability, where performance varies wildly by task. But Mollick pushes further, arguing that even real-world task performance isn't enough. We must also understand the AI's "attitude" when making decisions. He describes an experiment where he asked various models to rate the viability of a "guacamole via drones" startup. The results were chaotic: "Grok thought this was a great idea, and Microsoft Copilot was excited as well. Other models, like GPT-5 and Claude 4.5, were more skeptical."
When your AI is giving advice at scale, consistently rating ideas 3–4 points higher or lower means consistently steering you in a different direction.
This is the crux of the argument. The difference isn't just a score; it's a strategic divergence. An AI that is consistently more risk-seeking or risk-averse will fundamentally alter a company's trajectory. Mollick writes, "You need to systematically test your AI on the actual work it will do and the actual judgments it will make." This reframes the entire conversation from "which model is the smartest?" to "which model fits our specific risk profile and operational needs?"
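Mollick's prescription to "systematically test your AI on the actual work it will do" can start very simply: ask each candidate model to rate the same pitch several times and compare the distributions. The sketch below uses invented scores (not Mollick's actual results) and hypothetical model names to show the shape of such a harness:

```python
from statistics import mean, stdev

def rating_profile(ratings_by_model: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Summarize each model's mean rating and spread across repeated runs.

    ratings_by_model maps a model name to the 1-10 scores it gave the
    same pitch over several runs. All numbers here are illustrative.
    """
    return {name: (round(mean(r), 2), round(stdev(r), 2))
            for name, r in ratings_by_model.items()}

# Hypothetical repeated ratings for the "guacamole via drones" pitch.
runs = {
    "optimist-model": [8, 9, 8, 9, 8],
    "skeptic-model":  [4, 3, 4, 5, 4],
}
profile = rating_profile(runs)
gap = profile["optimist-model"][0] - profile["skeptic-model"][0]
print(profile, f"mean gap: {gap:.1f}")
```

A persistent mean gap like this is exactly the "3–4 points higher or lower" divergence the pull quote warns about: neither model is wrong on any single run, but at scale they steer the organization in different directions.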
Bottom Line
Mollick's strongest contribution is the shift from passive consumption of benchmarks to active, rigorous interrogation of models. The vulnerability in his argument is the sheer resource intensity required; not every organization has the time or expertise to run "GDPval-style" evaluations repeatedly. However, the alternative—blindly trusting a public score on a test that may be contaminated by the training data—is a far greater risk. The era of trusting the leaderboard is over; the era of the AI job interview has begun.