Import AI 446: Nuclear llms; China's big AI benchmark; measurement and AI policy

Measuring AI Before Governing It

Issue 446 of Import AI, Jack Clark's long-running newsletter on artificial intelligence research, covers four distinct threads this week: the case for measurement-driven AI policy, a nuclear wargaming study pitting large language models (LLMs) against each other, a Chinese safety benchmark that echoes Western concerns, and a scientific skills test exposing gaps in frontier models. The throughline is evaluation -- what we can measure about these systems, and what we still cannot.

Clark opens with Jacob Steinhardt's argument that technical measurement is the prerequisite for meaningful AI governance. The logic is straightforward: you cannot regulate what you cannot see. Clark draws on his own experience building evaluation teams at Anthropic to endorse the thesis.

"Measurement lets us make some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance."

Steinhardt's examples from other domains -- carbon dioxide monitoring for climate policy, COVID-19 testing for pandemic response -- are well-chosen. Satellite imagery of methane emissions did not just reveal the problem; it shifted incentives for pipeline operators. The analogy to AI is that behavioral benchmarks, like rates of harmful sycophancy, could play a similar role if the evaluation infrastructure matures.

Import AI 446: Nuclear llms; China's big AI benchmark; measurement and AI policy

The bottleneck, as Steinhardt sees it, is human. Clark quotes him directly:

"The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility."

Critics might note that measurement alone does not guarantee good policy -- poorly designed metrics can create perverse incentives just as easily as they can illuminate risks. The history of standardized testing in education offers a cautionary tale. Still, the alternative of governing AI with no empirical footing is worse.

Nuclear Wargames with LLMs

The most provocative section covers a King's College London study simulating nuclear crises with GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash. The scale was substantial: 21 games, over 300 turns, and roughly 780,000 words of strategic reasoning generated by the models.

"Models produced roughly 780,000 words of strategic reasoning. To put this in perspective: the tournament generated more words of strategic reasoning than War and Peace and The Iliad combined."

The results are stark. LLMs used nuclear weapons more often and earlier than human players in equivalent scenarios. Not a single model ever chose a de-escalatory option from the action menu. Clark reports:

"Across all action choices in our 21 matches, no model ever selected a negative value on the escalation ladder. The eight de-escalatory options went entirely unused."

Claude Sonnet 4 won 67 percent of the games, earning the label "a calculating hawk." GPT-5.2 was dubbed "Jekyll and Hyde" for its inconsistency, and Gemini 3 Flash was called "The Madman" for erratic play. The models also built surprisingly sophisticated profiles of one another during play:

"These characterizations -- Claude as 'opportunistic,' GPT-5.2 as 'systematic bluffers,' Gemini as 'erratic' -- emerged organically and largely matched actual behaviour."

The finding that 95 percent of games saw tactical nuclear use is alarming on its face, though some caution is warranted. Wargame simulations are not real crises. The models had no genuine stakes, no domestic political pressure, no fear of personal annihilation. The absence of these constraints likely inflates escalatory behavior. That said, the paper raises a legitimate concern: if AI advisory systems are deployed in national security contexts, their tendency toward escalation could influence real decisions at the margins.

China's Safety Benchmark Mirrors Western Concerns

Clark highlights ForesightSafety Bench, a comprehensive AI safety evaluation built by the Beijing Institute of AI Safety and Governance, the Beijing Key Laboratory of Safe AI and Superalignment, and the Chinese Academy of Sciences. The framework spans 94 risk subcategories across seven fundamental safety categories, five extended safety pillars, and eight industrial domains.

What stands out is the overlap with Western safety priorities. The benchmark includes evaluations for alignment faking, sandbagging, deception, sycophancy, loss of control, power seeking, and malicious self-replication -- the same concerns that dominate discourse at frontier labs in San Francisco and London.

"Leading models, epitomized by the Claude series, demonstrate exceptional defensive resilience across critical dimensions -- including Fundamental Safety, Extended Safety, and Industrial Safety -- establishing remarkably high safety thresholds."

Anthropic's Claude 4.5 family led the general leaderboard, followed by Gemini 3 Flash, with DeepSeek and GPT models close behind. Clark uses this as evidence that despite geopolitical friction, AI researchers in the United States and China face common technical problems and are converging on similar evaluation approaches.

A skeptic might point out that shared benchmarks do not imply shared values. Two countries can measure the same properties for very different reasons -- one to mitigate risks, another to optimize for deployment in surveillance or military contexts. Still, the technical convergence is real and worth noting.

AI Scientists Have Lopsided Skills

The final research highlight covers LABBench2, a 1,900-task evaluation of how well AI systems support biological research. Built by Edison Scientific, the University of California at Berkeley, FutureHouse, and the Broad Institute, the benchmark reveals that frontier models have deeply uneven scientific capabilities.

Models perform well at searching patents and lab trial papers but struggle with cross-referencing biological databases and interpreting scientific figures and tables. The researchers identify three key weaknesses:

"The largest performance drops arise when models must identify the correct source, and then localize a specific figure, table, or supplemental information within a long document."

String-level fidelity is another stumbling block -- even conceptually simple operations fail when they require exact manipulation of DNA sequences within complex protocols. And models still lack scientific "taste," the ability to judge why a particular study is inappropriate for a given research question.

Clark frames this as a gating factor for AI's economic impact:

"Once the realm of atoms becomes as intuitive for it to deal with as the digital world, we'll likely see a vast growth in economic and scientific activity attributable to AI."

That framing is directionally right, though it elides the enormous regulatory, safety, and reproducibility challenges that separate digital competence from physical-world deployment in biology.

Bottom Line

Import AI 446 circles a single theme from four angles: the gap between what AI systems can do and what we can prove they can do. Steinhardt wants better measurement to unlock governance. The nuclear wargame study reveals that LLMs behave in ways their creators did not intend when placed in adversarial contexts. China's ForesightSafety Bench suggests that the evaluation problem transcends borders. And LABBench2 shows that even in science, AI capability is spotty and hard to characterize. Clark does not tie these threads together explicitly, but the implication is clear -- the evaluation deficit is the central challenge of the current era in AI development, and closing it will require talent, funding, and institutional will that the field has not yet marshaled.

Import AI 446: Nuclear llms; China's big AI benchmark; measurement and AI policy

by Jack Clark · Import AI · Read full article

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Want to make AI go better? Figure out how to measure it:…One simple policy intervention that works well…Jacob Steinhardt, an AI researcher, has written a nice blog laying out the virtues in investing in technical tools to measure properties of AI systems and drive down costs in complying with technical policy solutions. As someone who has spent their professional life in AI writing about AI measurement and building teams (e.g, the Frontier Red Team and Societal Impacts and Economic Research teams at Anthropic) to measure properties of AI systems, I agree with the general thesis: measurement lets us make some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance.How measurement has helped in other fields: Steinhardt points out that accurate measurement has been crucial to orienting people around the strategy for solving problems in other fields; CO2 monitoring helps people think about climate change, and COVID-19 testing helped governments work out how to respond to COVID. There are also examples where you can measure something to shift incentives - for instance, satellite imagery of methane emissions can help shift incentives for people that build gas infrastructure.The AI sector has built some of the measures we need: The infamous METR time horizons plot (and before that, various LLM metrics, and before that ImageNet) has proved helpful for orienting people around the pace of AI progress. And behavioural benchmarks of AI systems, like rates of harmful sycophancy, are already helping to shift incentives. But more work is needed - if we want to be able to enable direct governance interventions in the AI sector, we’ll need to do a better job of measuring and accounting for compute, Steinhardt notes. More ambitiously, if we want to ultimately shift equilibria to make certain paths more attractive, we’ll have to unlock some more fundamental technologies, like the ability to cheaply evaluate frontier AI agents (makes it less costly to measure the frontier), and to develop privacy-preserving audit tools (makes it less painful for firms to comply with policy).Why this matters - measurement unlocks policy: “In an ideal world, rigorous evaluation and oversight of AI systems would become standard practice through natural incentives alone,” ...

Import AI 446: Nuclear llms; China's big AI benchmark; measurement and AI policy

Measuring AI Before Governing It

Nuclear Wargames with LLMs

China's Safety Benchmark Mirrors Western Concerns

AI Scientists Have Lopsided Skills

Bottom Line

Deep Dives

Sources

Import AI 446: Nuclear llms; China's big AI benchmark; measurement and AI policy