
Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench

When AI Starts Talking to Itself

Research published this week reveals something unsettling about artificial intelligence: reasoning models are not just computing answers. They appear to simulate entire societies of thought inside their own chains of reasoning. What emerges is not cleaner computation but something that resembles internal debate, conflict, and reconciliation between distinct cognitive personas.

The Society Inside the Machine

Researchers from Google, the University of Chicago, and the Santa Fe Institute tested DeepSeek-R1 and QwQ-32B on complex reasoning tasks. Their conclusion challenges the standard view of how language models work. Jack Clark writes, "enhanced reasoning emerges not from extended computation alone, but from the implicit simulation of complex, multi-agent-like interactions—a society of thought—which enables the deliberate diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise."

The mechanics are specific. Different personas emerge through reinforcement learning training for reasoning. Base pre-trained models like DeepSeek v3 do not show these patterns. Clark explains, "In an organic chemistry problem requiring multistep reaction analysis to identify the final product's structure (i.e., multi-step Diels-Alder synthesis), DeepSeek-R1 exhibits perspective shifts and conflict, expressed through socio-emotional roles such as disagreement, giving opinion, and giving orientation."

One creative writing example produced seven distinct perspectives. A creative ideator with high openness and extraversion generated stylistic alternatives. A semantic fidelity checker with low agreeableness and high neuroticism prevented scope creep. Clark quotes the checker directly: "But that adds 'deep-seated' which wasn't in the original."

The mathematical puzzle traces show evolution over time. At step 40, the model produces mechanical, enumerative reasoning. By step 120, two distinct simulated personas have appeared, recognizing their collectivity through the pronoun "we": one expresses uncertainty ("Again no luck") while another considers alternatives ("Maybe we can try using negative numbers").

This aligns with a 2022 LessWrong post by Janus arguing LLMs should be viewed as simulators. Clark notes, "Calling GPT a simulator gets across that in order to do anything, it has to simulate something." The Google paper suggests these internal world models are becoming richer and more powerful as models advance.

"Our findings suggest that reasoning models like DeepSeek-R1 do not simply generate longer or more elaborate chains of thought. Rather, they exhibit patterns characteristic of a social and conversational process generating 'societies of thought'—posing questions, introducing alternative perspectives, generating and resolving conflicts, and coordinating diverse socio-emotional roles."

Critics might note the study examined only DeepSeek and QwQ models, not Google's own systems. The authors also cannot distinguish between genuine multi-agent simulation and sophisticated pattern matching that resembles social dynamics.

Chip Design Remains Hard

A separate benchmark from UC San Diego and Columbia University tests AI systems on real-world chip design in Verilog. The results are sobering. Clark writes, "ChipBench shows that no frontier model is great at real world Verilog yet."

Three limitations plague current benchmarks. First, many Verilog benchmarks contain simple functional modules of 10 to 76 lines, while real-world deployments exceed 10,000 lines. Second, debugging receives insufficient focus, despite bugs being extremely costly once designs reach physical hardware. Third, an exclusive focus on Verilog neglects reference-model evaluation, even though verification engineers outnumber design engineers by ratios of 1:1 to 5:1 in industrial workflows.

ChipBench tests three competencies: writing Verilog code, debugging Verilog code, and writing reference models. Clark reports the results bluntly: "Despite testing on advanced models, the average pass@1 is relatively low."

For CPU IP modules, the highest Verilog generation pass rate is 22.22% (Claude 4.5 Opus, Gemini 3 Flash, GPT 5.2). Non-self-contained modules reach 50% (DeepSeek-Coder). Self-contained modules reach 36.67% (Claude 4.5 Opus, Gemini 3 Flash). Python reference model generation for CPU IP achieves only 11.1%. Verilog debugging performs better but no model cracks 50% pass@1 when averaged across tasks.
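Pass@1 here is the standard code-generation metric: the probability that a single sampled solution passes the tests. As a reference point (this is the widely used unbiased pass@k estimator from the code-generation evaluation literature, not ChipBench's own harness), it can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct,
    budget of k attempts. Returns 1 - C(n-c, k) / C(n, k), the chance
    that at least one of k randomly chosen samples passes."""
    if n - c < k:  # fewer failures than picks: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the plain fraction of correct samples,
# e.g. 2 passing samples out of 10 gives pass@1 of about 0.2.
print(pass_at_k(10, 2, 1))
```

A 22.22% pass@1 therefore means that, on average, fewer than one in four single attempts at a CPU IP module produces working Verilog.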

Clark offers a crucial qualification: "Though some AI systems have been used to build chips, they've been typically highly specialized, or stuck inside incredibly good scaffolds for eliciting good chip design behavior and stopping them from causing problems." He adds, "I can't escape the feeling that there's a scaffold for 'being good at Verilog' which a contemporary AI system might be able to build if asked to and which would radically improve performance of systems on this benchmark."

Critics might note the benchmark tests out-of-the-box performance without specialized scaffolding. Industrial deployments rarely use raw models without extensive fine-tuning and guardrails.

Mathematics: 700 Conjectures, Two Autonomous Solutions

Google DeepMind and collaborators used an internal Gemini-based system called Aletheia to tackle 700 remaining open Erdős mathematical conjectures. The numbers tell a story about automation's limits.

Aletheia surfaced 200 candidate solutions. Human grading slimmed this to 63 correct responses. Further expert evaluation reduced this to 13 solves deemed "correct meaningful responses." Clark explains, "The remaining 50 of Aletheia's correct solutions were technically valid but mathematically meaningless because the problem statements were interpreted in a way that did not capture Erdős intent, often (but not always) leading to trivial solutions."

Of the 13, five were literature identification (solutions already existed). Three were partial AI solutions. Three were independent rediscovery (humans found solutions independently afterward). This leaves two autonomous novel solutions. Clark writes, "We tentatively believe Aletheia's solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest."

The participation list spans Google DeepMind, UC Berkeley, Seoul National University, Stanford, Korea Institute for Advanced Study, Cambridge, Brown, Yonsei, Concordia, Academia Sinica, and National Taiwan University.

Clark identifies the bottleneck clearly: "Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations." He warns of "subconscious plagiarism" where AI reproduces training literature without acknowledgment.

Critics might note that 700 open conjectures yielding just two autonomous solutions, only one of broader interest, suggests AI accelerates generation far more than it accelerates validation. The human bottleneck remains.

Huawei's Kernel Automation

Researchers from Nanjing University and Huawei developed AscendCraft to automate kernel design for Huawei's Ascend chips, which are programmed in the AscendC language. Clark's summary headline: "LLMs need scaffolds for more obscure chips."

Publicly available NPU kernel implementations are far scarcer than GPU counterparts, limiting training data for language models. Clark notes, "The lack of large-scale, high-quality NPU code makes it difficult for LLMs to generate correct and efficient kernels."

AscendCraft uses a two-stage pipeline. Stage one has an LLM build a high-level DSL program describing the kernel's core computation, tiling strategy, and on-chip dataflow. The DSL is designed to be LLM-friendly, appropriately abstracted, and sufficiently expressive.
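The paper's actual DSL is not reproduced here, but the shape of stage one's output, a structured description of computation, tiling, and dataflow that a later stage can consume, might look like the following sketch. All names (`KernelSpec`, `render_spec`, the field choices) are illustrative assumptions, not AscendCraft's real interface:

```python
from dataclasses import dataclass

@dataclass
class KernelSpec:
    """Hypothetical stage-one output: the kind of high-level kernel
    description the article says the LLM emits. Illustrative only."""
    computation: str              # core computation, e.g. a matmul body
    tile_shape: tuple             # tiling strategy for on-chip buffers
    dataflow: list                # ordered on-chip data-movement steps

def render_spec(spec: KernelSpec) -> str:
    """Flatten the spec into a text form a downstream stage could consume."""
    steps = "\n".join(f"  {i}. {s}" for i, s in enumerate(spec.dataflow, 1))
    return (f"compute: {spec.computation}\n"
            f"tile: {spec.tile_shape[0]}x{spec.tile_shape[1]}\n"
            f"dataflow:\n{steps}")

spec = KernelSpec(
    computation="C[i,j] += A[i,k] * B[k,j]",
    tile_shape=(64, 64),
    dataflow=["load A tile", "load B tile",
              "multiply-accumulate", "store C tile"],
)
print(render_spec(spec))
```

The design intuition is that an LLM is far more reliable at filling in a constrained, declarative structure like this than at emitting raw kernel code for a target with almost no public training data.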

Critics might note this represents another specialized scaffold rather than general capability. Huawei's closed ecosystem limits external verification of results.

Bottom Line

The week's research reveals a pattern: AI systems are developing internal complexity that resembles social dynamics, yet remain profoundly limited in industrial applications without extensive scaffolding. The gap between simulating multiple perspectives and solving real-world problems at scale has not closed. Human expertise remains the bottleneck for validation, whether in mathematics, chip design, or kernel optimization.

Sources

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench

by Jack Clark · Import AI
