
Import AI 423: Multilingual clip; anti-drone tracking; and Huawei kernel design

Jack Clark delivers a sobering reality check on the gap between AI hype and economic reality, showing that while machines can generate poetry, they still cannot reliably file your taxes or write code for every chip on the market. This week's coverage strips away the veneer of "general intelligence" to reveal a landscape where scale solves some problems but exposes critical fragility in others, from multilingual understanding to the very kernels that power our hardware.

The Scale Solution and the Tax Trap

Clark begins by highlighting a significant leap in how AI perceives the world, noting that Meta has finally cracked the code on multilingual image-text models. He writes, "We empirically show that the curse of multilinguality in CLIP is the consequence of insufficient scaling due to the lack of a proper recipe for worldwide data curation and model training." This is a crucial distinction: the failure wasn't a lack of cleverness, but a lack of volume and the right data curation and ordering. By curating 29 billion image-text pairs spanning 300+ languages, Meta's new model, Meta CLIP 2, manages to outperform its English-only predecessor.
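To make the underlying mechanism concrete: CLIP trains an image encoder and a text encoder with a symmetric contrastive objective, where matching image-text pairs in a batch are pulled together and all other combinations act as negatives. Below is a minimal NumPy sketch of that loss, with random toy vectors standing in for real encoder outputs; the function names and toy data are illustrative, not Meta's implementation.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of paired
    image/text embeddings. Matching pairs (row i of each matrix) are pulled
    together; every other combination in the batch serves as a negative,
    which is why a larger global batch gives the model more, and more
    varied, negatives to learn from per step."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the correct match is the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric: image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch: 4 paired embeddings of dimension 8
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = img_emb + 0.1 * rng.normal(size=(4, 8))  # near-matching pairs
loss = clip_contrastive_loss(img_emb, txt_emb)
print(f"{loss:.4f}")
```

Because every non-matching pair in the batch is a negative, scaling the batch directly scales the contrastive signal, which is consistent with Meta's reported trick of growing the global batch to encourage cross-lingual learning.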


However, Clark pivots sharply to a domain where sheer scale has not yet delivered reliability: financial compliance. He introduces a benchmark from startup Column Tax that tests whether AI can file returns, with results that are frankly alarming for anyone hoping for automation. "No model score above ~33% on the benchmark," Clark notes, observing that even the strongest performers like Gemini 2.5 Pro fail to consistently use correct tax tables or determine eligibility. The gap between a 50% success rate in a controlled test and the zero-error tolerance required by law is a chasm. "Our analysis finds that models consistently use incorrect tax tables, make calculation errors, and incorrectly determine eligibility, leading to overall incorrectly computed tax returns," the researchers behind the benchmark write.

This juxtaposition is the piece's analytical core. Clark argues that while we can scale our way to better visual understanding, the "ecologically valid" nature of tax law exposes the brittleness of current reasoning capabilities. A counterargument worth considering is that tax law is uniquely rigid and jurisdiction-specific, perhaps an outlier compared to other economic tasks. Yet, Clark's point stands: if an AI cannot navigate the rules of a single country's tax code without hallucinating eligibility, the promise of autonomous economic agents is still distant.

"None of us would hire a tax accountant who has a 50% success rate!"

The Geopolitics of Vision and Compute

The newsletter then turns to the darker side of AI advancement: the militarization of computer vision. Clark details a new dataset from Chinese researchers designed to track tiny drones in cluttered urban environments, a direct response to the realities of modern conflict. He describes the CST Anti-UAV dataset as a tool to test how well AI can spot small, distant objects against complex backgrounds. "Our dataset contains 78,224 tiny objects, which is 4.5 times larger than existing large datasets," the authors write, emphasizing the sheer volume of data required to teach a machine to distinguish a drone from a bird or a building.

The implications here are heavy. Clark frames this not just as a technical milestone but as a necessary evolution in defense capabilities. "We believe the CST Anti-UAV benchmark will inspire the development of more robust UAV tracking methods and accelerate the deployment of reliable vision-based anti-UAV systems in the real world," the researchers claim. While the technical achievement is notable, the human cost of the conflict driving this innovation is the unspoken backdrop. These systems are being built to identify and neutralize threats in populated areas, where the margin for error is non-existent and the consequences of failure are catastrophic. Critics might argue that open-sourcing such datasets lowers the barrier for bad actors to develop their own tracking systems, but the authors frame it as a way to improve defensive reliability.

Shifting from vision to architecture, Clark examines a "frankenmodel" from Abu Dhabi's Technology Innovation Institute (TII). The Falcon-H1 family combines standard transformer architectures with state-space models to create efficient, high-performance systems. Clark highlights the sheer resource disparity this represents: the team trained on a cluster of 4,096 H100 GPUs, a setup costing roughly $120 million. "This performance gain is particularly impactful at smaller scales, where our 1.5B-Deep model delivers capabilities competitive with leading 7B-10B models," the authors write. This is a clear signal that "sovereign AI" is no longer just a slogan; it is a capital-intensive race where state-backed entities can outmaneuver typical academic or even some commercial efforts.

The Chip Divide and the Kernel Gap

Finally, Clark exposes a critical vulnerability in the AI supply chain: the inability of large language models to write code for non-NVIDIA hardware. In a benchmark testing kernel generation for NVIDIA, Google, and Huawei chips, the results show a stark bias. "Claude Sonnet 4 gets 92.3% compilation accuracy on CUDA versus 5.3% on Huawei AscendC," Clark reports. This isn't just a performance gap; it's a data gap. The models have been trained on vast amounts of NVIDIA documentation, leaving them blind to the intricacies of other architectures.

The study suggests that AI's ability to accelerate AI research is currently bottlenecked by its familiarity with specific hardware ecosystems. "For AscendC kernels, using category-aware one-shot examples significantly improves both compilation and execution success rates," the researchers found, showing that context can bridge the gap. However, the underlying issue remains: the AI's apparent understanding of kernel development is really a proxy for its knowledge of NVIDIA's tools rather than a grasp of the fundamental principles of the craft. This creates a dangerous dependency where the future of AI acceleration is tied to the dominance of a single hardware vendor's documentation.
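The "category-aware one-shot" idea is simple in spirit: before asking the model for a new kernel, show it one worked reference kernel from the same task category for the same backend. The sketch below illustrates what such prompt construction might look like; the example bank, category names, and prompt format are all hypothetical assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of category-aware one-shot prompting for kernel
# generation, in the spirit of the MultiKernelBench finding. All names
# and formats here are illustrative assumptions.

ONE_SHOT_BANK = {
    # One worked reference kernel per (task category, backend) pair.
    ("reduction", "ascendc"): (
        "Task: sum-reduce a float tensor.\n"
        "Kernel:\n<reference AscendC reduction kernel here>"
    ),
    ("elementwise", "ascendc"): (
        "Task: elementwise add two float tensors.\n"
        "Kernel:\n<reference AscendC elementwise kernel here>"
    ),
}

def build_prompt(task_desc, category, backend):
    """Prepend a same-category worked example so the model sees the
    backend's kernel idioms before attempting the new task."""
    parts = []
    example = ONE_SHOT_BANK.get((category, backend))
    if example is not None:
        parts.append("Here is a worked example in the same category:\n" + example)
    parts.append(f"Now write a {backend} kernel for: {task_desc}")
    return "\n\n".join(parts)

prompt = build_prompt("row-wise softmax over a 2D tensor", "reduction", "ascendc")
print(prompt)
```

The design choice matters: a single in-category example substitutes for the pretraining exposure the model never had, which is exactly why it helps most on the under-documented AscendC backend.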

"Benchmarks like MultiKernelBench tell us that some of the data we get about kernel development could be actually just a proxy for 'how much do LLMs know about NVIDIA chips and CUDA' rather than 'how much do AI systems understand the core principles of kernel development.'"

Bottom Line

Clark's analysis effectively dismantles the notion of a monolithic AI progress curve, revealing instead a patchwork of breakthroughs and blind spots. The strongest part of his argument is the demonstration that scale solves visual and linguistic tasks but fails to guarantee reliability in high-stakes, rule-based domains like taxation or hardware programming. The biggest vulnerability in the current landscape is the over-reliance on specific data ecosystems, from NVIDIA's documentation to English-centric training sets, which leaves the technology fragile when faced with the complexity of the real world or geopolitical shifts. Readers should watch for how these hardware and data bottlenecks will shape the next generation of sovereign AI strategies and economic automation.

Sources

Import AI 423: Multilingual clip; anti-drone tracking; and Huawei kernel design

by Jack Clark · Import AI

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Meta makes CLIP multilingual: …Meta CLIP 2 will help AI systems reason about text and images in hundreds of languages…

Researchers with Meta, Princeton University, and New York University have built Meta CLIP 2, a larger-scale, multilingual version of OpenAI's venerable CLIP model. CLIP, short for Contrastive Language-Image Pretraining, is a way to train a pair of neural nets to understand images and text and to map between them. CLIP is a utility technology used for a vast range of downstream purposes, from image generation to image search and classification. The original CLIP was trained to map English text to images; Meta CLIP 2 is a scaled-up version which also maps non-English text to images. Along with releasing the model, Meta has also released a detailed paper going through "the first recipe training CLIP from scratch on worldwide web-scale image-text pairs".

Scale is all that matters: As usual, the main lesson here is one of scale. Earlier attempts to train versions of CLIP on multiple languages failed, leading to degraded performance relative to the original model. "We empirically show that the curse of multilinguality in CLIP is the consequence of insufficient scaling due to the lack of a proper recipe for worldwide data curation and model training". To scale the system, Meta had to do three things: 1) it gathered large-scale multilingual metadata across 300+ languages, 2) it built its own curation algorithm to help it curate a representative multilingual dataset to train on, and 3) it figured out the right proportion and ordering of data to use when training the system. To get an idea of scale, there were 12.8B pairs in the original OpenAI CLIP and 29B in Meta CLIP 2. The main training trick was "increasing the global training batch size, which encourages cross-lingual learning, and meanwhile keeping the other training hyperparameters unchanged. We choose a 2.3× scaling of global batch to reflect that English pairs constitute 44% of our training data".

Best results: Meta CLIP 2 beats its English-only counterpart by 0.8% on zero-shot image classification and by 0.7% on mSigLIP, and also sets new state-of-the-art scores on multilingual benchmarks like CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3%).

Why this matters - multilingual sensors for the internet: CLIP is less a ...