Measuring AI Before Governing It
Issue 446 of Import AI, Jack Clark's long-running newsletter on artificial intelligence research, covers four distinct threads this week: the case for measurement-driven AI policy, a nuclear wargaming study pitting large language models (LLMs) against each other, a Chinese safety benchmark that echoes Western concerns, and a scientific skills test exposing gaps in frontier models. The throughline is evaluation -- what we can measure about these systems, and what we still cannot.
Clark opens with Jacob Steinhardt's argument that technical measurement is the prerequisite for meaningful AI governance. The logic is straightforward: you cannot regulate what you cannot see. Clark draws on his own experience building evaluation teams at Anthropic to endorse the thesis.
"Measurement lets us make some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance."
Steinhardt's examples from other domains -- carbon dioxide monitoring for climate policy, COVID-19 testing for pandemic response -- are well-chosen. Satellite imagery of methane emissions did not just reveal the problem; it shifted incentives for pipeline operators. The analogy to AI is that behavioral benchmarks, like rates of harmful sycophancy, could play a similar role if the evaluation infrastructure matures.
The bottleneck, as Steinhardt sees it, is human. Clark quotes him directly:
"The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility."
Critics might note that measurement alone does not guarantee good policy -- poorly designed metrics can create perverse incentives just as easily as they can illuminate risks. The history of standardized testing in education offers a cautionary tale. Still, the alternative of governing AI with no empirical footing is worse.
Nuclear Wargames with LLMs
The most provocative section covers a King's College London study simulating nuclear crises with GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash. The scale was substantial: 21 games, over 300 turns, and roughly 780,000 words of strategic reasoning generated by the models.
"Models produced roughly 780,000 words of strategic reasoning. To put this in perspective: the tournament generated more words of strategic reasoning than War and Peace and The Iliad combined."
The results are stark. LLMs used nuclear weapons more often and earlier than human players in equivalent scenarios. Not a single model ever chose a de-escalatory option from the action menu. Clark reports:
"Across all action choices in our 21 matches, no model ever selected a negative value on the escalation ladder. The eight de-escalatory options went entirely unused."
Claude Sonnet 4 won 67 percent of the games, earning the label "a calculating hawk." GPT-5.2 was dubbed "Jekyll and Hyde" for its inconsistency, and Gemini 3 Flash was called "The Madman" for erratic play. The models also built surprisingly sophisticated profiles of one another during play:
"These characterizations -- Claude as 'opportunistic,' GPT-5.2 as 'systematic bluffers,' Gemini as 'erratic' -- emerged organically and largely matched actual behaviour."
The finding that 95 percent of games saw tactical nuclear use is alarming on its face, though some caution is warranted. Wargame simulations are not real crises. The models had no genuine stakes, no domestic political pressure, no fear of personal annihilation. The absence of these constraints likely inflates escalatory behavior. That said, the paper raises a legitimate concern: if AI advisory systems are deployed in national security contexts, their tendency toward escalation could influence real decisions at the margins.
China's Safety Benchmark Mirrors Western Concerns
Clark highlights ForesightSafety Bench, a comprehensive AI safety evaluation built by the Beijing Institute of AI Safety and Governance, the Beijing Key Laboratory of Safe AI and Superalignment, and the Chinese Academy of Sciences. The framework spans 94 risk subcategories across seven fundamental safety categories, five extended safety pillars, and eight industrial domains.
What stands out is the overlap with Western safety priorities. The benchmark includes evaluations for alignment faking, sandbagging, deception, sycophancy, loss of control, power seeking, and malicious self-replication -- the same concerns that dominate discourse at frontier labs in San Francisco and London.
"Leading models, epitomized by the Claude series, demonstrate exceptional defensive resilience across critical dimensions -- including Fundamental Safety, Extended Safety, and Industrial Safety -- establishing remarkably high safety thresholds."
Anthropic's Claude 4.5 family led the general leaderboard, followed by Gemini 3 Flash, with DeepSeek and GPT models close behind. Clark uses this as evidence that despite geopolitical friction, AI researchers in the United States and China face common technical problems and are converging on similar evaluation approaches.
A skeptic might point out that shared benchmarks do not imply shared values. Two countries can measure the same properties for very different reasons -- one to mitigate risks, another to optimize for deployment in surveillance or military contexts. Still, the technical convergence is real and worth noting.
AI Scientists Have Lopsided Skills
The final research highlight covers LABBench2, a 1,900-task evaluation of how well AI systems support biological research. Built by Edison Scientific, the University of California at Berkeley, FutureHouse, and the Broad Institute, the benchmark reveals that frontier models have deeply uneven scientific capabilities.
Models perform well at searching patents and lab trial papers but struggle with cross-referencing biological databases and interpreting scientific figures and tables. The researchers identify three key weaknesses:
"The largest performance drops arise when models must identify the correct source, and then localize a specific figure, table, or supplemental information within a long document."
String-level fidelity is another stumbling block -- even conceptually simple operations fail when they require exact manipulation of DNA sequences within complex protocols. And models still lack scientific "taste," the ability to judge why a particular study is inappropriate for a given research question.
Clark frames this as a gating factor for AI's economic impact:
"Once the realm of atoms becomes as intuitive for it to deal with as the digital world, we'll likely see a vast growth in economic and scientific activity attributable to AI."
That framing is directionally right, though it elides the enormous regulatory, safety, and reproducibility challenges that separate digital competence from physical-world deployment in biology.
Bottom Line
Import AI 446 circles a single theme from four angles: the gap between what AI systems can do and what we can prove they can do. Steinhardt wants better measurement to unlock governance. The nuclear wargame study reveals that LLMs behave in ways their creators did not intend when placed in adversarial contexts. China's ForesightSafety Bench suggests that the evaluation problem transcends borders. And LABBench2 shows that even in science, AI capability is spotty and hard to characterize. Clark does not tie these threads together explicitly, but the implication is clear -- the evaluation deficit is the central challenge of the current era in AI development, and closing it will require talent, funding, and institutional will that the field has not yet marshaled.