When AI Starts Talking to Itself
Four studies published this week reveal something unsettling about artificial intelligence: the systems are no longer just computing answers. They are simulating entire societies of thought inside their own architectures. What emerges is not cleaner computation but something that resembles internal debate, conflict, and reconciliation between distinct cognitive personas.
The Society Inside the Machine
Researchers from Google, the University of Chicago, and the Santa Fe Institute tested DeepSeek-R1 and QwQ-32B on complex reasoning tasks. Their conclusion challenges the standard view of how language models work. Jack Clark writes, "enhanced reasoning emerges not from extended computation alone, but from the implicit simulation of complex, multi-agent-like interactions—a society of thought—which enables the deliberate diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise."
The mechanics are specific. The distinct personas emerge during reinforcement learning training for reasoning; base pre-trained models like DeepSeek v3 do not show these patterns. Clark explains, "In an organic chemistry problem requiring multistep reaction analysis to identify the final product's structure (i.e., multi-step Diels-Alder synthesis), DeepSeek-R1 exhibits perspective shifts and conflict, expressed through socio-emotional roles such as disagreement, giving opinion, and giving orientation."
One creative writing example produced seven distinct perspectives. A creative ideator with high openness and extraversion generated stylistic alternatives. A semantic fidelity checker with low agreeableness and high neuroticism prevented scope creep. Clark quotes the checker directly: "But that adds 'deep-seated' which wasn't in the original."
The mathematical puzzle traces show evolution over time. At step 40, the model produces mechanical, enumerative reasoning. By step 120, two distinct simulated personas have appeared, acknowledging their collectivity with the pronoun "we": "Again no luck" expresses uncertainty, while "Maybe we can try using negative numbers" considers alternatives.
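The paper's coding of these traces maps sentences onto socio-emotional roles like the disagreement, giving opinion, and giving orientation mentioned above. Here is a minimal sketch of that idea in Python; the cue phrases and the `tag_trace` helper are invented for illustration and are nothing like the authors' actual, much richer coding scheme:

```python
import re

# Invented cue phrases for a few socio-emotional roles; the paper's
# actual coding scheme is far richer than this keyword lookup.
ROLE_MARKERS = {
    "disagreement": ["but that adds", "that's wrong", "no, "],
    "giving_opinion": ["i think", "maybe we", "perhaps"],
    "giving_orientation": ["let's", "we need to", "first,"],
    "uncertainty": ["no luck", "not sure", "hmm"],
}

def tag_trace(trace: str) -> list[tuple[int, str]]:
    """Label each sentence of a reasoning trace with any role whose
    cue phrase it contains. Returns (sentence_index, role) pairs."""
    sentences = re.split(r"(?<=[.!?])\s+", trace.lower())
    tags = []
    for i, sentence in enumerate(sentences):
        for role, cues in ROLE_MARKERS.items():
            if any(cue in sentence for cue in cues):
                tags.append((i, role))
    return tags

trace = ("Let's enumerate the cases. Again no luck. "
         "Maybe we can try using negative numbers.")
print(tag_trace(trace))
# [(0, 'giving_orientation'), (1, 'uncertainty'), (2, 'giving_opinion')]
```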
This aligns with a 2022 LessWrong post by Janus arguing LLMs should be viewed as simulators. Clark notes, "Calling GPT a simulator gets across that in order to do anything, it has to simulate something." The Google paper suggests these internal world models are becoming richer and more powerful as models advance.
"Our findings suggest that reasoning models like DeepSeek-R1 do not simply generate longer or more elaborate chains of thought. Rather, they exhibit patterns characteristic of a social and conversational process generating 'societies of thought'—posing questions, introducing alternative perspectives, generating and resolving conflicts, and coordinating diverse socio-emotional roles."
Critics might note the study examined only DeepSeek and QwQ models, not Google's own systems. The authors also cannot distinguish between genuine multi-agent simulation and sophisticated pattern matching that resembles social dynamics.
Chip Design Remains Hard
A separate benchmark from UC San Diego and Columbia University tests AI systems on real-world chip design in Verilog. The results are sobering. Clark writes, "ChipBench shows that no frontier model is great at real world Verilog yet."
Three limitations plague current benchmarks. Many Verilog benchmarks contain only simple functional modules of 10 to 76 lines, while real-world deployments exceed 10,000 lines. Debugging receives insufficient attention, despite the heavy cost of bugs that make it into physical hardware. And an exclusive focus on Verilog neglects reference-model evaluation, even though verification engineers outnumber design engineers by ratios of 1:1 to 5:1 in industrial workflows.
ChipBench tests three competencies: writing Verilog code, debugging Verilog code, and writing reference models. Clark reports the results bluntly: "Despite testing on advanced models, the average pass@1 is relatively low."
For CPU IP modules, the highest Verilog generation pass rate is 22.22% (Claude 4.5 Opus, Gemini 3 Flash, GPT 5.2). Non-self-contained modules reach 50% (DeepSeek-Coder). Self-contained modules reach 36.67% (Claude 4.5 Opus, Gemini 3 Flash). Python reference model generation for CPU IP achieves only 11.1%. Verilog debugging performs better, but no model cracks 50% pass@1 when averaged across tasks.
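For readers unfamiliar with the metric: pass@1 is the fraction of problems a model solves with a single sample, and the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) generalizes it. The sketch below uses invented counts purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval):
    given n samples with c correct, pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented counts for illustration: solving 2 of 9 tasks on the first
# sample yields the 22.22%-shaped number reported for CPU IP modules.
print(round(2 / 9, 4))            # 0.2222
print(pass_at_k(n=10, c=2, k=1))  # 0.2 -- estimate from 10 samples
```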
Clark offers a crucial qualification: "Though some AI systems have been used to build chips, they've been typically highly specialized, or stuck inside incredibly good scaffolds for eliciting good chip design behavior and stopping them from causing problems." He adds, "I can't escape the feeling that there's a scaffold for 'being good at Verilog' which a contemporary AI system might be able to build if asked to and which would radically improve performance of systems on this benchmark."
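A sketch of the kind of scaffold Clark is imagining: generate Verilog, compile it with an open-source simulator such as Icarus Verilog, and feed the diagnostics back for another attempt. The `generate_verilog` function is a hypothetical stand-in for any LLM call; the loop structure is the point:

```python
import subprocess
import tempfile
from pathlib import Path

def generate_verilog(spec: str, feedback: str = "") -> str:
    """Hypothetical LLM call: returns Verilog source for the spec,
    optionally conditioned on compiler feedback from a prior attempt."""
    raise NotImplementedError("wire up your model of choice here")

def compile_check(source: str) -> tuple[bool, str]:
    """Compile with Icarus Verilog (iverilog); return (ok, diagnostics)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "design.v"
        src.write_text(source)
        result = subprocess.run(
            ["iverilog", "-o", str(Path(tmp) / "design.out"), str(src)],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr

def scaffolded_attempt(spec: str, max_rounds: int = 5) -> str | None:
    """Generate-compile-repair loop: retry until the design compiles."""
    feedback = ""
    for _ in range(max_rounds):
        source = generate_verilog(spec, feedback)
        ok, feedback = compile_check(source)
        if ok:
            return source  # compiles; functional tests would come next
    return None
```

A production scaffold would presumably layer functional testbenches, lint checks, and synthesis on top of this bare compile loop.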
Critics might note the benchmark tests out-of-the-box performance without specialized scaffolding. Industrial deployments rarely use raw models without extensive fine-tuning and guardrails.
Mathematics: 700 Problems, One Genuine Solution
Google DeepMind and collaborators used an internal Gemini-based system called Aletheia to tackle 700 Erdős conjectures that remain open. The numbers tell a story about automation's limits.
Aletheia surfaced 200 candidate solutions. Human grading narrowed these to 63 correct responses, and further expert evaluation reduced those to 13 solves deemed "correct meaningful responses." Clark explains, "The remaining 50 of Aletheia's correct solutions were technically valid but mathematically meaningless because the problem statements were interpreted in a way that did not capture Erdős intent, often (but not always) leading to trivial solutions."
Of the 13, five were literature identification (solutions already existed). Three were partial AI solutions. Three were independent rediscovery (humans found solutions independently afterward). This leaves two autonomous novel solutions. Clark writes, "We tentatively believe Aletheia's solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest."
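Tallied explicitly, the funnel from the counts above looks like this (a few lines of arithmetic, no new data):

```python
# Aletheia's funnel, using the counts reported above.
problems   = 700   # open Erdős problems attempted
candidates = 200   # candidate solutions surfaced
correct    = 63    # survived human grading
meaningful = 13    # deemed correct *and* mathematically meaningful

breakdown = {"literature identification": 5,
             "partial AI solutions": 3,
             "independent rediscovery": 3,
             "autonomous novel solutions": 2}
assert sum(breakdown.values()) == meaningful

print(f"{correct / candidates:.1%} of candidates were correct")       # 31.5%
print(f"{meaningful / candidates:.1%} were correct and meaningful")   # 6.5%
print(f"{breakdown['autonomous novel solutions'] / problems:.2%} of "
      f"problems got an autonomous novel solution")                   # 0.29%
```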
The participation list spans Google DeepMind, UC Berkeley, Seoul National University, Stanford, Korea Institute for Advanced Study, Cambridge, Brown, Yonsei, Concordia, Academia Sinica, and National Taiwan University.
Clark identifies the bottleneck clearly: "Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations." He warns of "subconscious plagiarism" where AI reproduces training literature without acknowledgment.
Critics might note that 700 problems yielded 200 candidates but only two autonomous solutions, just one of them of broader interest: AI accelerates generation far more than it accelerates validation. The human bottleneck remains.
Huawei's Kernel Automation
Researchers from Nanjing University and Huawei developed AscendCraft to automate kernel development in AscendC, the programming language for Huawei's Ascend NPUs. Clark writes, "LLMs need scaffolds for more obscure chips."
Publicly available NPU kernel implementations are far scarcer than GPU counterparts, limiting training data for language models. Clark notes, "The lack of large-scale, high-quality NPU code makes it difficult for LLMs to generate correct and efficient kernels."
AscendCraft uses a two-stage pipeline. Stage one has an LLM build a high-level DSL program describing the kernel's core computation, tiling strategy, and on-chip dataflow. The DSL is designed to be LLM-friendly, appropriately abstracted, and sufficiently expressive.
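The paper's DSL is not reproduced here, so the following is purely illustrative: a guess at what an "LLM-friendly" kernel description capturing computation, tiling strategy, and on-chip dataflow might contain, written as plain Python dataclasses rather than AscendCraft's actual syntax:

```python
from dataclasses import dataclass, field

@dataclass
class Tiling:
    """How the problem is carved into blocks that fit on-chip memory."""
    block_m: int
    block_n: int
    block_k: int

@dataclass
class KernelSpec:
    """Illustrative stand-in for an LLM-authored kernel DSL program,
    covering the three things the AscendCraft DSL is described as
    capturing: core computation, tiling strategy, on-chip dataflow."""
    computation: str                 # e.g. "C[m,n] += A[m,k] * B[k,n]"
    tiling: Tiling
    dataflow: list[str] = field(default_factory=list)

matmul = KernelSpec(
    computation="C[m,n] += A[m,k] * B[k,n]",
    tiling=Tiling(block_m=128, block_n=128, block_k=64),
    dataflow=["load A tile -> on-chip buffer",
              "load B tile -> on-chip buffer",
              "compute tile product",
              "store C tile -> main memory"],
)
print(matmul.computation)
```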
Critics might note this represents another specialized scaffold rather than general capability. Huawei's closed ecosystem limits external verification of results.
Bottom Line
The week's research reveals a pattern: AI systems are developing internal complexity that resembles social dynamics, yet remain profoundly limited in industrial applications without extensive scaffolding. The gap between simulating multiple perspectives and solving real-world problems at scale has not closed. Human expertise remains the bottleneck for validation, whether in mathematics, chip design, or kernel optimization.