A newsletter that jumps from the economics of live musicians to the mathematics of catching a falling knife isn't trying to do one thing. It's trying to map the entire terrain of what happens when machine intelligence keeps climbing — and whether anyone can steer.
The Human Touch as Economic Shield
Jack Clark opens with a proposition that feels almost too gentle for the era it arrives in. Adam Ozimek, chief economist at the Economic Innovation Group, has argued that even if artificial intelligence becomes capable of doing every job humans currently do, certain work will survive simply because people want other people doing it.
As Clark puts it, "There are many jobs and tasks that easily could have been automated by now — the technology to automate them has long existed — and yet we humans continue to do them." The reason, Ozimek argues, is what he calls "the human touch": live music, actors, waiters, concierge-style experiences. The more money you have, the more you seem to crave actual human contact, not less.
"the human touch also appears to be what economists call a 'normal good,' which means the demand for it goes up as income goes up"
Clark notes Ozimek's broader assumption: there is a high chance that even if AI automates huge chunks of the current economy, there will be a boom in demand for "human artisans" — including entirely new categories of work nobody has named yet. Wages for those jobs could rise massively, depending on economic growth and policy choices.
Critics might note this is a comforting story precisely because it asks nothing of the present. It assumes displaced workers smoothly retrain as artisanal experience providers, and assumes the income distribution necessary to sustain that demand. Neither assumption has a great track record.
Scaling Laws for the Attention Economy
The coverage then pivots to something altogether less warm. Meta has published details on a recommendation system called Kunlun — named, presumably, after the mountain range rather than any particular philosophy — that achieves dramatically better computational efficiency than its predecessors.
Recommendation systems are the invisible architecture of social media advertising. They decide what billions of people see, buy, and pay attention to. And until now, they've been stubbornly inefficient, typically achieving only 3 to 15 percent of their theoretical processing capacity, compared with 40 to 60 percent for large language models. Kunlun lifts Meta's own figure from 17 percent to 37 percent.
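The newsletter doesn't say exactly how that utilization number is computed; it reads like the standard Model FLOPs Utilization (MFU) ratio of sustained to peak throughput. Under that assumption, a minimal sketch, with a hypothetical 1,000-TFLOP/s accelerator standing in for real hardware:

```python
def utilization(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Fraction of an accelerator's theoretical peak compute actually used,
    commonly reported as Model FLOPs Utilization (MFU)."""
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical accelerator with a 1,000 TFLOP/s peak: the pre- and post-Kunlun
# figures quoted above correspond to sustaining roughly 170 and 370 TFLOP/s.
for sustained in (170e12, 370e12):
    print(f"{utilization(sustained, 1000e12):.0%}")  # 17%, 37%
```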
More significant than the raw improvement is what Clark identifies as the real shift. Meta has discovered predictable scaling laws for recommendation models, the same kind of smooth power-law relationship between compute investment and performance that was first documented for language models. But here the metric isn't "loss" on a dataset. It's "normalized entropy," a measure of how well the system predicts user behavior.
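Clark doesn't define the metric in the excerpt, but in Meta's earlier click-through-rate prediction work, normalized entropy is the model's average log loss divided by the log loss of a baseline that always predicts the dataset's average click rate. Assuming Kunlun's metric follows that formulation, a minimal sketch:

```python
import numpy as np

def normalized_entropy(y_true: np.ndarray, p_pred: np.ndarray) -> float:
    """Model's average log loss divided by the log loss of a baseline that
    always predicts the average click-through rate. A value below 1.0 beats
    the trivial baseline; lower is better, which makes it a natural y-axis
    for a scaling law."""
    eps = 1e-12
    p_pred = np.clip(p_pred, eps, 1 - eps)
    model_ll = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    ctr = np.clip(y_true.mean(), eps, 1 - eps)
    baseline_ll = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return model_ll / baseline_ll

# Toy usage: 1,000 impressions at roughly a 5 percent click rate, with
# predictions that are mildly informative about the true labels.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(float)
p = np.clip(0.05 + 0.3 * (y - 0.05), 0.001, 0.999)
print(normalized_entropy(y, p))  # well below 1.0
```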
Clark writes that Meta's Kunlun models have been "deployed across major Meta Ads models, delivering a 1.2 percent improvement in topline metrics." That sounds small until you do the arithmetic: 1.2 percent of an advertising business that generates over a hundred billion dollars a year works out to more than a billion dollars annually, if the metric gain flows through to revenue. A staggering amount of money.
"What we're seeing here is the optimization of some of the most societally significant AI systems in the world — ones which direct billions of eyeballs towards a variety of products and online information — colliding with a greater degree of performance predictability"
The language is clinical. The implications are not. Recommendation systems shape elections, public health behaviors, and consumer economies. Making their returns on compute predictable makes them safer to invest in, and safer investments get much larger.
Critics might note that a 1.2 percent improvement in ad metrics sounds trivial until you remember these same systems have been implicated in political radicalization, teen mental health crises, and the wholesale restructuring of journalism. A scaling law for money is not the same thing as a scaling law for social benefit.
The Falling Knife
The newsletter's most philosophically loaded section arrives via Nick Bostrom — the Oxford philosopher whose work first introduced many readers to the concept of existential risk from artificial general intelligence. Bostrom has written a new paper titled Optimal Timing for Superintelligence, and its argument is bracing.
The core claim: if superintelligence can dramatically improve human health — extending lives, curing diseases, lifting populations out of preventable suffering — then every month of delay is itself a moral cost. Bostrom writes that "if nobody builds it, everyone dies." The choice is not between safety and risk. It is between different kinds of risk.
Jack Clark lays out Bostrom's key variables: the baseline risk of superintelligence causing human extinction, and the rate at which safety research can reduce that risk. Under most plausible combinations, Bostrom concludes, the optimal strategy is to develop superintelligence quickly — with perhaps a brief pause only when safety research is advancing rapidly and the technology is nearly ready.
"Swift to harbor, slow to berth," Bostrom writes. Move quickly toward capability, then slow down at the very end to deliberate.
Clark compares this to trying to catch a falling knife. Too early, you bleed. Too late, you miss entirely. Timing is everything.
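To see why the knife metaphor has an actual optimum, consider a toy expected-loss model (an illustration only, not Bostrom's formalism). It combines the two variables Clark names, with hypothetical parameter values: a baseline extinction risk that safety research erodes over time, and a running moral cost of delay.

```python
import numpy as np

# Toy launch-timing model (illustrative only; not the paper's formalism).
# Hypothetical parameters:
#   P0 -- extinction risk if superintelligence is launched today
#   r  -- annual rate at which safety research erodes that risk
#   d  -- annual moral cost of delay (preventable deaths and suffering),
#         as a fraction of humanity's future value
P0, r, d = 0.10, 0.30, 0.01

t = np.linspace(0, 30, 301)            # candidate launch dates, in years
risk = P0 * np.exp(-r * t)             # extinction risk if launched at t
delay_cost = 1 - np.exp(-d * t)        # cumulative cost of waiting until t
expected_loss = risk + (1 - risk) * delay_cost

best = t[np.argmin(expected_loss)]
print(f"optimal launch: year {best:.1f}")  # interior optimum, roughly year 4
# Faster safety progress (larger r) pushes the optimum later; a higher cost
# of delay (larger d) pulls it earlier. Grab too soon and you bleed; wait
# too long and the losses from delay dominate.
```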
But Bostrom is also skeptical of broad development pauses. Pause too early and society burns the credibility and political capital it would need for a later, better-timed pause. Bad regulation chokes off future progress. A pause that exempts national security applications creates a world where militaries hold the most powerful systems. And prolonged delay leaves the world exposed to current AI harms without the defenses that more advanced systems might provide.
Critics might note that Bostrom's framework assumes a level of coordination and timing precision that no historical technological transition has ever achieved. The "brief pause at the end" requires knowing exactly when the end is — something no regulator, company, or government has consistently demonstrated the ability to identify in advance.
Machines Doing Machine Learning
Clark then covers AIRS-Bench, a benchmark built by researchers at Meta, Oxford, and University College London to test whether artificial intelligence systems can perform contemporary machine learning research tasks. The benchmark covers 20 distinct problems drawn from 17 recent papers — molecular biology, time series forecasting, code generation, math, text classification.
The results are, in Clark's phrase, somewhat perplexing. The models tested — including GPT-4o, o3-mini, and several open-weight models — are not frontier systems. One of the paper's authors admitted on social media that publication delays meant older models were evaluated. None of the tested systems matched the performance of expert human researchers.
But the interest lies elsewhere. Clark points out that AI systems solve problems differently than humans do. On one text classification task, the state-of-the-art human approach was straightforward: fine-tune a single model. The best AI agent, GPT-OSS-120B, produced a "two-level stacked ensemble that combines multiple transformer models and a meta-learner" with five-fold stratified cross-validation. Enormously complicated. Probably overengineered.
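For the shape of that solution, here is a schematic sketch: simple scikit-learn classifiers stand in below for the fine-tuned transformers the agent actually combined, but the two-level stacking and five-fold stratified cross-validation follow the pattern described.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for a text classification dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 0: base models (fine-tuned transformers in the agent's real solution);
# their out-of-fold predictions become features for the level-1 meta-learner.
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("svm", LinearSVC())],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
stack.fit(X, y)
print(stack.score(X, y))
```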
"As Blaise Pascal once apocryphally said, 'I have only made this letter longer because I have not had the time to make it shorter'"
Clark suggests that as models grow more capable, their solutions might actually get simpler — approaching elegance rather than escaping it. That remains to be seen.
Held-Out Mathematics
The final section covers First Proof, a mathematics benchmark with a clever twist: ten problems drawn from active research in algebraic combinatorics, spectral graph theory, algebraic topology, and other fields, none of which had published solutions before February 2026. The answers are encrypted. The problems are genuine.
This matters because most AI math benchmarks test against problems that are already solved and therefore already in the training data. First Proof samples from "the true distribution of questions that mathematicians are currently working on."
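The benchmark's exact encryption scheme isn't described here, but a salted commit-and-reveal gives the property the paragraph turns on: solutions can be committed to publicly today, then revealed and verified later, without the plaintext ever sitting on the open web where it could leak into training data. A sketch of that idea, not First Proof's actual mechanism:

```python
import hashlib
import secrets

def commit(solution: str) -> tuple[str, bytes]:
    """Publish the digest now; keep the solution and salt private."""
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + solution.encode()).hexdigest()
    return digest, salt

def verify(solution: str, salt: bytes, digest: str) -> bool:
    """Anyone can later check a revealed solution against the commitment."""
    return hashlib.sha256(salt + solution.encode()).hexdigest() == digest

digest, salt = commit("proof: apply the spectral bound, then ...")
print(verify("proof: apply the spectral bound, then ...", salt, digest))  # True
```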
Clark reports the results as mixed: AI systems can help, but they cannot yet independently crack genuinely open mathematical problems at the frontier. The gap is narrowing. It is not closed.
Bottom Line
This is a newsletter trying to hold five different futures in its head at once: one where humans retreat into artisanal niches, one where advertising algorithms grow ever more precise, one where a philosopher argues that rushing is the safest option, one where machines start doing their own research, messily, and one where the gap between machines and frontier mathematics narrows without closing. The common thread is not optimism or pessimism. It is velocity. The question is no longer whether these systems will change the shape of work, knowledge, and risk. It is whether anyone has a working theory of what to do with the steering wheel.