Jack Clark delivers a sobering reality check on the gap between AI hype and economic reality, showing that while machines can generate poetry, they still cannot reliably file your taxes or write code for every chip on the market. This week's coverage strips away the veneer of "general intelligence" to reveal a landscape where scale solves some problems but exposes critical fragility in others, from multilingual understanding to the very kernels that power our hardware.
The Scale Solution and the Tax Trap
Clark begins by highlighting a significant leap in how AI perceives the world, noting that Meta has finally cracked the code on multilingual image-text models. He writes, "We empirically show that the curse of multilinguality in CLIP is the consequence of insufficient scaling due to the lack of a proper recipe for worldwide data curation and model training." This is a crucial distinction: the failure wasn't a lack of cleverness, but a lack of volume and a proper curation recipe. By curating 29 billion pairs across 300 languages, Meta's new model, MetaCLIP 2, manages to outperform its English-only predecessor.
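For context, the objective being scaled here is the standard CLIP-style contrastive loss, which is language-agnostic by construction; what Meta changed was the data recipe feeding it. Here is a minimal PyTorch sketch, with function names and shapes that are illustrative rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Row i of each tensor is a true pair, so the similarity matrix's
    diagonal holds the positives and everything else is a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)  # move to cosine-similarity space
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] scores image i against caption j.
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Nothing in this objective privileges English, which is the point: the "curse" lived in the data pipeline, not the loss.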
However, Clark pivots sharply to a domain where sheer scale has not yet delivered reliability: financial compliance. He introduces a benchmark from startup Column Tax that tests whether AI can file returns, with results that are frankly alarming for anyone hoping for automation. "No model scores above ~33% on the benchmark," Clark notes, observing that even the strongest performers like Gemini 2.5 Pro fail to consistently use correct tax tables or determine eligibility. The gap between a 50% success rate in a controlled test and the zero-error tolerance required by law is a chasm. "Our analysis finds that models consistently use incorrect tax tables, make calculation errors, and incorrectly determine eligibility, leading to overall incorrectly computed tax returns," the researchers behind the benchmark write.
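To see why partial credit is meaningless in this domain, consider what all-or-nothing grading looks like. The sketch below is a hypothetical harness with invented figures, assuming field-level exact matching; Column Tax's actual scoring may differ:

```python
def grade_return(predicted: dict, expected: dict) -> bool:
    """All-or-nothing grading: the return passes only if every line matches.

    Hypothetical harness, not Column Tax's actual scoring code. Dollar
    amounts are compared to the cent, so a single bad tax-table lookup
    fails the whole return, mirroring the zero-error tolerance of filing.
    """
    for field, true_value in expected.items():
        predicted_value = predicted.get(field)
        if predicted_value is None or round(predicted_value, 2) != round(true_value, 2):
            return False
    return True

# One wrong table lookup sinks an otherwise perfect return.
expected = {"agi": 64200.00, "taxable_income": 49600.00, "total_tax": 5636.00}
predicted = {"agi": 64200.00, "taxable_income": 49600.00, "total_tax": 5648.00}
assert grade_return(predicted, expected) is False
```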
This juxtaposition is the piece's analytical core. Clark argues that while we can scale our way to better visual understanding, the "ecologically valid" nature of tax law exposes the brittleness of current reasoning capabilities. A counterargument worth considering is that tax law is uniquely rigid and jurisdiction-specific, perhaps an outlier compared to other economic tasks. Yet, Clark's point stands: if an AI cannot navigate the rules of a single country's tax code without hallucinating eligibility, the promise of autonomous economic agents is still distant.
"None of us would hire a tax accountant who has a 50% success rate!"
The Geopolitics of Vision and Compute
The newsletter then turns to the darker side of AI advancement: the militarization of computer vision. Clark details a new dataset from Chinese researchers designed to track tiny drones in cluttered urban environments, a direct response to the realities of modern conflict. He describes the CST Anti-UAV dataset as a tool to test how well AI can spot small, distant objects against complex backgrounds. "Our dataset contains 78,224 tiny objects, which is 4.5 times larger than existing large datasets," the authors write, emphasizing the sheer volume of data required to teach a machine to distinguish a drone from a bird or a building.
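The difficulty is easy to quantify: a drone at distance occupies only a few hundred pixels in a 4K frame, so benchmarks of this kind define targets by pixel footprint. The sketch below uses the common COCO-style 32x32 cutoff as an assumption; the CST Anti-UAV paper may draw the line differently:

```python
def tiny_object_stats(bbox, frame_w, frame_h, max_area_px=32 * 32):
    """Classify a detection as 'tiny' by absolute pixel footprint.

    bbox: (x, y, w, h) in pixels. The 32x32 cutoff follows the common
    COCO small-object convention; CST Anti-UAV's own threshold is an
    assumption here and may differ.
    """
    x, y, w, h = bbox
    area = w * h
    fraction_of_frame = area / (frame_w * frame_h)
    return area <= max_area_px, fraction_of_frame

# A 20x15 px drone in a 4K frame covers roughly 0.0036% of the image,
# which is why background clutter dominates the detection problem.
is_tiny, fraction = tiny_object_stats((1012, 440, 20, 15), 3840, 2160)
print(is_tiny, f"{fraction:.6%}")  # True 0.003617%
```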
The implications here are heavy. Clark frames this not just as a technical milestone but as a necessary evolution in defense capabilities. "We believe the CST Anti-UAV benchmark will inspire the development of more robust UAV tracking methods and accelerate the deployment of reliable vision-based anti-UAV systems in the real world," the researchers claim. While the technical achievement is notable, the human cost of the conflict driving this innovation is the unspoken backdrop. These systems are being built to identify and neutralize threats in populated areas, where the margin for error is non-existent and the consequences of failure are catastrophic. Critics might argue that open-sourcing such datasets lowers the barrier for bad actors to develop their own tracking systems, but the authors frame it as a way to improve defensive reliability.
Shifting from vision to architecture, Clark examines a "frankenmodel" from Abu Dhabi's Technology Innovation Institute (TII). The Falcon-H1 family combines standard transformer architectures with state-space models to create efficient, high-performance systems. Clark highlights the sheer resource disparity this represents: the team trained on a cluster of 4,096 H100 GPUs, a setup costing roughly $120 million. "This performance gain is particularly impactful at smaller scales, where our 1.5B-Deep model delivers capabilities competitive with leading 7B-10B models," the authors write. This is a clear signal that "sovereign AI" is no longer just a slogan; it is a capital-intensive race where state-backed entities can outmaneuver typical academic or even some commercial efforts.
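The architectural appeal of such hybrids is complementarity: attention offers precise retrieval while the state-space path scales linearly with sequence length. Below is a schematic sketch of a parallel hybrid block in that spirit; the toy diagonal recurrence stands in for Falcon-H1's actual Mamba-style mixer, and the head counts, projections, and normalization are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Parallel attention + state-space mixer, schematic only.

    Both paths read the same normalized input and their outputs are
    summed. The 'SSM' is a minimal diagonal linear recurrence standing
    in for a real Mamba-style mixer; this is not TII's code.
    """
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Per-channel decay in (0, 1) and an input projection for the SSM path.
        self.decay = nn.Parameter(torch.rand(dim) * 0.5 + 0.25)
        self.ssm_in = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def ssm_path(self, x: torch.Tensor) -> torch.Tensor:
        # h_t = a * h_{t-1} + u_t: a toy diagonal state-space recurrence.
        u = self.ssm_in(x)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.size(1)):
            h = self.decay * h + u[:, t]
            states.append(h)
        return torch.stack(states, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.norm(x)
        attn_out, _ = self.attn(z, z, z, need_weights=False)
        return x + self.out(attn_out + self.ssm_path(z))

x = torch.randn(2, 16, 64)      # (batch, seq, dim)
print(HybridBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```

The design choice worth noticing is that the two mixers run in parallel within one block rather than alternating whole layers, one common way such hybrids are wired.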
The Chip Divide and the Kernel Gap
Finally, Clark exposes a critical vulnerability in the AI supply chain: the inability of large language models to write code for non-NVIDIA hardware. In a benchmark testing kernel generation for NVIDIA, Google, and Huawei chips, the results show a stark bias. "Claude Sonnet 4 gets 92.3% compilation accuracy on CUDA versus 5.3% on Huawei AscendC," Clark reports. This isn't just a performance gap; it's a data gap. The models have been trained on vast amounts of NVIDIA documentation, leaving them blind to the intricacies of other architectures.
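"Compilation accuracy" is the bluntest metric in such benchmarks: does the model's kernel even get through the compiler? Below is a hypothetical harness sketch; MultiKernelBench's actual flags, toolchains, and follow-on execution checks may differ:

```python
import subprocess
import tempfile
from pathlib import Path

def kernel_compiles(kernel_src: str, compiler_cmd: list[str],
                    suffix: str = ".cu") -> bool:
    """Return True if a generated kernel compiles cleanly.

    Hypothetical harness: the real benchmark's pipeline may differ.
    For CUDA, compiler_cmd might be ["nvcc", "-c"]; the AscendC
    toolchain invocation is analogous but vendor-specific.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"kernel{suffix}"
        src.write_text(kernel_src)
        result = subprocess.run(
            compiler_cmd + [str(src), "-o", str(src.with_suffix(".o"))],
            capture_output=True,
        )
        return result.returncode == 0

def compilation_accuracy(outcomes: list[bool]) -> float:
    """Fraction of generated kernels that compiled, tallied per backend."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```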
The study suggests that AI's ability to accelerate AI research is currently bottlenecked by its familiarity with specific hardware ecosystems. "For AscendC kernels, using category-aware one-shot examples significantly improves both compilation and execution success rates," the researchers found, showing that in-context examples can partially bridge the gap. The underlying issue remains, however: what looks like an understanding of kernel development is largely a proxy for knowledge of NVIDIA's tooling rather than a grasp of the fundamental principles of parallel programming. This creates a dangerous dependency where the future of AI acceleration is tied to the dominance of a single hardware vendor's documentation.
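The category-aware one-shot fix amounts to ordinary prompt construction: pair the request with a worked exemplar from the same kernel category. Here is a sketch under that reading, with placeholder categories and exemplar text rather than the benchmark's own:

```python
# Placeholder exemplar store keyed by kernel category; the benchmark's
# real categories and worked kernels are not reproduced here.
EXEMPLARS = {
    "elementwise": "// a known-good AscendC elementwise-add kernel goes here",
    "reduction": "// a known-good AscendC sum-reduction kernel goes here",
}

def build_one_shot_prompt(task_description: str, category: str) -> str:
    """Prepend a same-category worked example to the generation request.

    The intuition behind the finding Clark quotes: models saw little
    AscendC during training, so an in-context exemplar supplies API
    surface the weights never absorbed.
    """
    exemplar = EXEMPLARS.get(category, "")
    return (
        f"Here is a correct AscendC kernel for a {category} operation:\n"
        f"{exemplar}\n\n"
        f"Now write an AscendC kernel for the following task:\n"
        f"{task_description}\n"
    )

print(build_one_shot_prompt("multiply two fp16 tensors elementwise", "elementwise"))
```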
"Benchmarks like MultiKernelBench tell us that some of the data we get about kernel development could be actually just a proxy for 'how much do LLMs know about NVIDIA chips and CUDA' rather than 'how much do AI systems understand the core principles of kernel development.'"
Bottom Line
Clark's analysis effectively dismantles the notion of a monolithic AI progress curve, revealing instead a patchwork of breakthroughs and blind spots. The strongest part of his argument is the demonstration that scale can deliver visual and linguistic competence but does not guarantee reliability in high-stakes, rule-based domains like taxation or hardware programming. The biggest vulnerability in the current landscape is the over-reliance on specific data ecosystems, from NVIDIA's documentation to English-centric training sets, which leaves the technology fragile when faced with the complexity of the real world or geopolitical shifts. Readers should watch how these hardware and data bottlenecks shape the next generation of sovereign AI strategies and economic automation.