Gergely Orosz delivers a necessary corrective to the current AI engineering frenzy, arguing that the industry's obsession with "vibes-based development" is a recipe for catastrophic failure. While many treat Large Language Models as magical black boxes, Orosz insists that without rigorous, data-driven evaluation, these systems are nothing more than expensive guesswork. This piece matters because it moves beyond the hype to offer a concrete, repeatable engineering discipline that bridges the gap between experimental AI and reliable software.
The Trap of Subjectivity
The article opens by dismantling the illusion that LLMs can be tested like traditional software. Orosz writes, "LLMs are non-deterministic, meaning there's no guarantee they'll provide the same answer to the same question twice." This fundamental difference breaks the standard testing models engineers rely on. Instead of checking for a single correct output, developers are now navigating a landscape of infinite valid possibilities. Orosz identifies a dangerous pattern he calls the "vibes-based development" trap, where teams ship products based on a superficial "looked good to me" assessment.
This framing is effective because it names a behavior many engineers recognize but lack the vocabulary to critique. The author argues that this approach fails because it ignores the "Three Gulfs": the gap between developer understanding and model behavior, the gap between intent and prompt specification, and the gap between instructions and generalization. As Orosz puts it, "You can't test for correctness before you've systematically observed the range of possible outputs and have defined what 'good' even means for your product." This is a crucial pivot point; it suggests that the industry is trying to solve the wrong problem by applying old solutions to new constraints.
"Many teams are tempted to grab a pre-built 'hallucination score' or 'helpfulness' eval, but in my experience, these metrics are often worse than useless."
Critics might argue that off-the-shelf metrics provide a necessary baseline for early-stage startups, but Orosz counters that these generic scores create a false sense of security. He illustrates this with a mental health startup that tracked "helpfulness" on a 1-5 scale, only to find the data unactionable. The argument here is that without context, numbers are just noise. This aligns with the principles of grounded theory, where categories must emerge from the data itself rather than being imposed from the top down. Orosz is right to warn that optimizing for vanity metrics can lead teams to ignore the actual friction points their users face.
From Chaos to Codified Error Analysis
The core of Orosz's proposal is a return to error analysis, a discipline that has been central to machine learning for decades but was recently abandoned in the rush to deploy generative AI. He details a workflow where teams manually review conversation traces, a process he describes as "unglamorous" but essential. By building custom data viewers, the NurtureBoss team was able to move from random sampling to a systematic review of hundreds of interactions. This mirrors the historical rigor of Test-Driven Development, yet Orosz acknowledges that TDD fails here because "there isn't one right answer, but thousands."
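The "custom data viewers" Orosz credits for NurtureBoss's systematic review need not be elaborate. As a minimal sketch (the role/content trace schema here is a hypothetical, not one from the article), a reviewer tool can start as a plain-text renderer that walks every trace in order rather than spot-checking a random handful:

```python
def render_trace(trace: dict) -> str:
    """Render one conversation trace as plain text for manual review."""
    header = f"--- trace {trace['id']} ---"
    turns = [f"{turn['role']}: {turn['content']}" for turn in trace["turns"]]
    return "\n".join([header, *turns])


def render_all(traces: list[dict]) -> str:
    # Systematic review means walking every trace, in order,
    # not sampling a handful and calling it done.
    return "\n\n".join(render_trace(t) for t in traces)
```

The point is not the tooling itself but the shift it enables: once every trace is equally easy to read, "review hundreds of interactions" stops being an aspiration and becomes a checklist.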
The methodology relies on a two-step coding process: open coding, where engineers jot down descriptive notes on failures without predefined checklists, and axial coding, where these notes are grouped into themes. Orosz writes, "Let the data speak for itself, and jot down descriptive observations like: 'The agent missed a clear opportunity to re-engage a price-sensitive user.'" This bottom-up approach ensures that the evaluation criteria are rooted in real user pain rather than theoretical ideals. The author's emphasis on identifying the "first upstream failure" is particularly sharp, noting that fixing a single root cause often resolves a cascade of downstream errors.
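Once the manual axial-coding pass has assigned each open-coded note to an emergent theme, tallying those themes is a cheap way to surface the most common failure and hunt for its first upstream cause. A minimal sketch, with hypothetical note data modeled on the article's own example:

```python
from collections import Counter

# Open coding produced free-form notes; axial coding (done by hand)
# assigned each note to a theme that emerged from the data.
notes = [
    {"note": "missed a clear opportunity to re-engage a price-sensitive user",
     "theme": "missed_follow_up"},
    {"note": "quoted the wrong unit price", "theme": "wrong_pricing"},
    {"note": "never circled back after a pricing objection",
     "theme": "missed_follow_up"},
]

# Rank themes by frequency: the top theme is where to start looking
# for the first upstream failure.
theme_counts = Counter(n["theme"] for n in notes)
ranked = theme_counts.most_common()
```

Note that the grouping itself stays manual; the code only counts what human judgment has already categorized, which keeps the criteria bottom-up as Orosz prescribes.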
"This bottom-up process is the single highest-ROI activity in AI development. It ensures you're solving real problems, not chasing vanity metrics."
The distinction between deterministic and subjective failures is where the article offers its most practical engineering advice. For tasks like date extraction, Orosz advocates for code-based evals, which function like traditional unit tests. For nuanced decisions, such as when to hand off a conversation to a human, he proposes an LLM-as-judge model whose verdicts are validated against human expert labels. Orosz warns, "Avoid your LLM judge memorizing answers by partitioning your data and measuring how well the judge generalizes to unfamiliar data." This nuance is vital; it prevents the evaluation system from becoming a self-fulfilling prophecy in which the AI grades itself on criteria it invented.
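The two eval styles above can be sketched in a few lines. This is an illustrative sketch, not the article's implementation: `extract` and `judge` are hypothetical callables, and the 70/30 split fraction is an assumed default.

```python
def eval_date_extraction(extract, text: str, expected: str) -> bool:
    # Deterministic task: one right answer, so a unit-test-style
    # equality check is sufficient.
    return extract(text) == expected


def split_labels(labeled: list[tuple[str, str]], dev_frac: float = 0.7):
    # Partition human-labeled traces so the judge prompt is tuned on one
    # slice and measured on the held-out slice -- agreement on unseen
    # data estimates generalization, not memorization.
    cut = int(len(labeled) * dev_frac)
    return labeled[:cut], labeled[cut:]


def judge_agreement(judge, labeled: list[tuple[str, str]]) -> float:
    """Fraction of human-labeled traces where the LLM judge agrees."""
    hits = sum(judge(trace) == label for trace, label in labeled)
    return hits / len(labeled)
```

In practice, `judge_agreement` is run only on the held-out slice: a judge that scores well there can be trusted to grade new production traffic, while one that only scores well on the tuning slice has merely memorized its examples.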
The Flywheel of Improvement
Ultimately, Orosz frames evaluation not as a one-time checkpoint but as a continuous loop. He describes a "flywheel of improvement" where teams analyze data, measure performance, improve the model, and automate the process before starting again. This approach integrates evaluation directly into the CI/CD pipeline, ensuring that regressions are caught immediately. The argument is that the "vibes" phase must end if AI engineering is to mature into a reliable discipline. As Orosz concludes, the goal is to move from guesswork to a repeatable engineering discipline that can scale.
"This is the antidote to the problems of generic, off-the-shelf metrics. Many teams are tempted to grab a pre-built 'hallucination score' or 'helpfulness' eval, but in my experience, these metrics are often worse than useless."
The strongest aspect of this piece is its refusal to treat AI as magic. By grounding the discussion in established software engineering principles and qualitative research methods, Orosz provides a roadmap for stability in a chaotic field. However, the approach demands a level of manual labor and tooling investment that may be prohibitive for smaller teams: however sound the methodology, building custom data viewers and maintaining large golden datasets remain practical hurdles.
Bottom Line
Orosz's argument is a vital intervention that shifts the conversation from model capabilities to system reliability. The piece's greatest strength is its insistence that error analysis must be data-driven and bottom-up, rather than relying on generic, off-the-shelf metrics. The biggest vulnerability lies in the execution: the rigorous, manual processes required to build these evaluation systems are resource-intensive and may slow down the rapid iteration cycles many startups crave. Engineers should watch for how the industry adapts these manual workflows into automated tooling, as that will determine whether this discipline becomes standard practice or remains an elite luxury.