
Prediction is hard, especially about the future

Rohit Krishnan challenges the prevailing assumption that artificial intelligence is a static tool, proposing instead that its true potential lies in a continuous, real-time feedback loop with the messy reality of the world. By running a tiny, open-source model on a personal laptop to predict daily news headlines, he demonstrates that even modest systems can learn to navigate complex, adversarial environments without massive computational resources. This is not just a technical demo; it is a blueprint for how AI might finally evolve from a database of past facts into a dynamic participant in the future.

The Limits of Static Benchmarks

Krishnan begins by dismantling the standard metrics we use to evaluate large language models. He notes that while current systems excel at math, logic puzzles, or even booking plane tickets, these tasks fail to capture the essence of understanding a changing world. "One would imagine they go hand in hand but alas," he observes regarding the boom in AI usage versus the clarity of its capabilities. The author argues that traditional benchmarks suffer from "teaching to the test," where models optimize for a specific exam rather than genuine comprehension. Instead, he proposes a more rigorous yardstick borrowed from prediction markets: if a model can accurately forecast future events, it must possess a robust internal model of how the world works.

"The key thing that you know differentiates us is the fact that we are able to learn right like if you have a trader who gets better making predictions they do that because like you know he or she is able to read about what they did before and can use that as a springboard to learn something else."

This comparison to human traders is the piece's most compelling insight. Krishnan suggests that the current gap between human and machine intelligence isn't about raw processing power, but about the mechanism of learning. Humans improve because they constantly update their mental models based on outcomes; most AI models, once trained, remain frozen in time. The author's experiment, dubbed "Foresight Forge," was designed to bridge this gap by creating an automated research engine that not only makes predictions but also ingests the results the next day to update its own policy.

Prediction is hard, especially about the future

Critics might note that predicting headlines is a noisy task where luck often masquerades as skill, and a model could simply be regurgitating common tropes rather than truly understanding causality. However, Krishnan anticipates this by designing a system that requires specific structural elements in its predictions, forcing the model to articulate drivers and verification sketches rather than vague guesses.
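
To make that requirement concrete, a structured prediction might look something like the following Python sketch. The field names (claim, probability, drivers, verification) are illustrative guesses based on the description above, not Krishnan's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    """One structured next-day forecast. Field names are illustrative
    guesses, not Krishnan's published schema."""
    claim: str                  # the headline-level event being forecast
    probability: float          # stated confidence, 0.0 to 1.0
    drivers: list[str] = field(default_factory=list)  # causal factors the model must name
    verification: str = ""      # how to check the outcome tomorrow

    def is_well_formed(self) -> bool:
        # Reject vague guesses: require a valid probability, at least
        # one named driver, and a concrete verification sketch.
        return 0.0 <= self.probability <= 1.0 and bool(self.drivers) and bool(self.verification)
```

Rejecting malformed outputs up front is one simple way to keep a small model from drifting into vague, unfalsifiable guesses.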

The Tiny Model Revolution

The most surprising element of Krishnan's account is his decision to run this experiment on a "tiny model" rather than the massive, state-of-the-art systems dominating the headlines. He chose Qwen, a 0.6-billion-parameter model running locally on a laptop, to avoid the prohibitive costs of cloud computing. "For instance, what's the best way to do this? Would be to, say, make a bunch of predictions, and the next day you can look back and see how close you got to some of those predictions and update your views," he explains. By using a small model, he forces the system to learn efficiently from sparse rewards, mimicking the constraints of the natural environment.
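
For readers who want to reproduce the setup, a minimal local run might look like the sketch below. The checkpoint name Qwen/Qwen3-0.6B and the prompt are assumptions; the excerpt does not pin down Krishnan's exact configuration.

```python
# Minimal sketch: run a ~0.6B-parameter Qwen model locally with Hugging Face
# transformers. The checkpoint and prompt are assumptions, not Krishnan's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

prompt = "Given today's headlines, predict three headlines likely to run tomorrow:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```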

"I was super surprised that even a small model did learn to get better at predicting next day's headlines. I wouldn't have expected it because there is no logical reason to believe that tiny models can still learn sufficient world model type information that it can do this."

This finding upends the industry's current trajectory, which assumes that scaling up parameters and data is the only path to intelligence. Krishnan argues that the bottleneck is not size, but the feedback mechanism. He details how he had to engineer a specific reward function using semantic similarity to judge the model's accuracy, a workaround for the fact that small models are poor at judging their own work. "The hardest part was trying to figure out the exact combination of rewards that would actually make the model do what I wanted, and not whatever it wanted to try and maximise the reward by doing weird stuff," he writes. This highlights a critical, often overlooked challenge in AI development: the reward function is just as important as the model architecture.
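
A semantic-similarity reward of the kind he describes could be sketched as follows; the embedding model (all-MiniLM-L6-v2) and the rescaling to [0, 1] are my assumptions, not his implementation.

```python
# Hedged sketch of a semantic-similarity reward: score a predicted headline
# against what actually ran the next day. Embedding model and scaling are
# assumptions; this is the idea Krishnan describes, not his exact code.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_reward(predicted: str, actual_headlines: list[str]) -> float:
    if not actual_headlines:
        return 0.0
    pred_vec = encoder.encode(predicted, normalize_embeddings=True)
    actual_vecs = encoder.encode(actual_headlines, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a dot product;
    # take the closest real headline so near-misses earn partial credit.
    best = float(np.max(actual_vecs @ pred_vec))
    # Map from [-1, 1] to [0, 1] so the score can serve directly as a reward.
    return (best + 1.0) / 2.0
```

Giving partial credit for near-misses matters here: with a binary hit-or-miss signal, a small model would almost never see a nonzero reward to learn from.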

"The future is totally going to look like a video game."

This metaphor captures the essence of his argument. In a video game, an agent learns by interacting with the environment, receiving immediate feedback on success or failure, and adjusting its strategy in real-time. Krishnan sees this as the missing link for AI. He points to companies like Cursor, which already use similar reinforcement learning techniques to update their coding assistants every few hours based on human acceptance or rejection of suggestions. If this works for code, Krishnan posits, it should work for the broader world.

The Path to Continuous Learning

The broader implication of Krishnan's work is a shift from episodic training to continuous adaptation. He envisions a future where AI systems are not static products but evolving partners that get smarter every day. "There's no reason to believe that this is an isolated incident, just like with the RLNVR paper there is no reason to believe that this will not scale to doing more interesting things," he asserts. The technical hurdles are significant—specifically, creating a reward function that is both interesting and robust enough to teach the model something new without causing it to collapse into nonsense. But the proof of concept is there.
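
In code, the loop he describes (predict, wait a day, score against reality, update) reduces to a skeleton like the one below. Every function body is a stub; the excerpt does not include his implementation.

```python
# Skeleton of the daily predict -> score -> update loop. All bodies are
# stubs standing in for components the excerpt does not reproduce.
import time

def fetch_headlines() -> list[str]:
    return []  # stub: pull today's headlines from RSS feeds or a news API

def generate_predictions(model, headlines: list[str]) -> list[str]:
    return []  # stub: prompt the model for structured next-day forecasts

def score(prediction: str, actual: list[str]) -> float:
    return 0.0  # stub: e.g. the semantic-similarity reward sketched earlier

def update_policy(model, predictions: list[str], rewards: list[float]) -> None:
    pass  # stub: one reinforcement-learning step on the day's rewards

def run_daily_loop(model) -> None:
    while True:
        todays = fetch_headlines()
        predictions = generate_predictions(model, todays)
        time.sleep(24 * 60 * 60)  # wait a day for reality to resolve
        rewards = [score(p, fetch_headlines()) for p in predictions]
        update_policy(model, predictions, rewards)  # tomorrow's model has seen today
```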

"While I chose one of the harder ways to do this by predicting the whole world, I was super surprised that even a small model did learn to get better at predicting next day's headlines."

The author's willingness to publish his code and methodology, including the specific parameters that worked best, invites the community to replicate and improve upon his findings. This openness stands in contrast to the secretive nature of many major AI labs. By demonstrating that a personal laptop can run a continuous learning loop, Krishnan democratizes the path to advanced AI capabilities. He suggests that the next breakthrough won't come from a single massive training run, but from thousands of small, daily updates driven by real-world feedback.

Bottom Line

Krishnan's most powerful contribution is the demonstration that continuous, on-policy learning is not just a theoretical goal but a practical reality achievable with modest resources. While the experiment relies on a somewhat noisy metric—headline prediction—the underlying mechanism of updating a model based on daily outcomes offers a viable path toward truly adaptive AI. The biggest vulnerability remains the difficulty of designing reward functions that generalize well across diverse, complex domains without encouraging the model to game the system. Readers should watch for how this "video game" approach to AI training scales beyond news prediction to more critical areas like policy analysis and scientific discovery.

Sources

Prediction is hard, especially about the future

by Rohit Krishnan · Strange Loop Canon

All right, so there's been a major boom in people using AI and also people trying to figure out what AI is good for. One would imagine they go hand in hand but alas. About 10% of the world are already using it. Almost every company has people using it. It’s pretty much all people can talk about on conference calls. You can hardly find an email or a document these days that is not written by ChatGPT. Okay, considering that is the case, there is a question about, like, how good are these models, right? Any yardstick that we have kind of used, whether it's its ability to do math or to do word problems or logic puzzles or, I don't know, going and buying a plane ticket online or researching a concert ticket, it's kind of beaten all those tasks, and more.

So, considering that, what is a good way to figure out what they're ultimately capable of? One where the models are actually doing reasonably well and can be mapped on some kind of a curve, which doesn’t suffer from the “teaching to the test” problem.

And one of the answers there is that you can look at how well it actually predicts the future, right? I mean, lots of people talk about prediction markets and about how you should listen to those people who are actually able to do really well with those. And I figured, it stands to reason that we should be able to do the same thing with large language models.

So the obvious next step to test this became to take a bunch of news items and then ask, you know, the model what will happen next. Which is what I did. I called this Foresight Forge because that’s the name the model picked for itself. (It publishes daily predictions with GPT-5; it used to be o3.) I thought I would let it take all the decisions, from choosing the sources to making predictions to ranking them with probabilities afterward and doing regular post-mortems.

Like an entirely automated research engine.

This work went quite well in the sense that it gave interesting predictions, and I actually enjoyed reading them. It was insightful! Though, like, a bit biased toward positive outcomes. Anyway, still useful, and a herald of what’s to come.

But, like, the bigger question I kept asking myself ...