While the broader AI world fixates on reinforcement learning, a quieter but more urgent revolution is reshaping how we actually run these models in production. The Kaitchup reports from the floor of NeurIPS that the real bottleneck isn't intelligence anymore; it's memory bandwidth. With over 29,000 attendees flooding San Diego, the consensus has shifted from "how smart can we make it?" to "how fast can we make it without breaking the bank?" This piece cuts through the hype to argue that the future of large language models depends not on more parameters, but on smarter compression.
The Memory Wall
The article opens by establishing the sheer scale of the event, noting that "Downtown San Diego, especially the 'historic' Gaslamp district, was completely taken over." Yet, amidst the noise, the technical focus was laser-sharp. The Kaitchup identifies the core problem: as reasoning models generate longer chains of thought, the "KV cache" (the keys and values stored for every past token so the model can keep attending to them) explodes in size. "With long reasoning traces, that history can easily exceed 10,000 tokens, translating to gigabytes of tensors that must be read at every decoding step." This creates a physical limit where the GPU spends more time shuffling data than calculating answers.
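To make the scale concrete, here is a back-of-envelope estimate of that cache for a single request; the model dimensions below are illustrative placeholders (roughly a mid-size open model in fp16), not figures from the article:

```python
# Rough KV cache size for one request (illustrative dimensions, not from the article).
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2        # fp16
context_tokens = 10_000    # a long reasoning trace

# Every cached token stores one key and one value vector per layer.
per_token_bytes = num_layers * num_kv_heads * head_dim * bytes_per_value * 2
cache_bytes = per_token_bytes * context_tokens

print(f"{per_token_bytes / 1e6:.2f} MB per token, "
      f"{cache_bytes / 1e9:.2f} GB at {context_tokens:,} tokens")
```

At roughly half a megabyte per token under these assumptions, a 10,000-token trace already occupies several gigabytes, and every one of those bytes has to stream through the GPU on each decoding step.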
The piece argues that the solution isn't a single silver bullet, but a family of engineering tricks unified by a simple logic: "Score how important each stored token is... Keep the most important entries as full KV pairs, approximate or share the 'borderline' ones, and drop the rest." This approach mirrors the history of signal processing, where lossy compression has long traded minor fidelity for massive efficiency gains. Just as early audio compression algorithms learned to discard frequencies humans couldn't hear, these new methods discard tokens the model doesn't actually need to attend to.
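That shared recipe is simple enough to sketch. The following is a minimal, method-agnostic version of the score-keep-drop loop; the importance signal used here (accumulated attention mass) and the hard drop of everything below the cutoff are illustrative assumptions, since each paper picks its own score and its own treatment of the borderline entries:

```python
import torch

def prune_kv_cache(keys, values, attn_history, keep_ratio=0.25):
    """Score-and-keep KV pruning sketch (illustrative, not any specific paper).

    keys, values:  [seq_len, num_heads, head_dim] cached tensors
    attn_history:  [seq_len] attention mass each cached token has
                   accumulated so far (the stand-in importance score)
    """
    seq_len = keys.shape[0]
    num_keep = max(1, int(seq_len * keep_ratio))

    # Keep the tokens that have attracted the most attention so far...
    keep_idx = torch.topk(attn_history, num_keep).indices.sort().values

    # ...and drop the rest outright (a real method might quantize or
    # merge the borderline entries instead of discarding them).
    return keys[keep_idx], values[keep_idx], keep_idx
```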
"Once you're serving many requests in parallel, GPU memory bandwidth quickly becomes the bottleneck."
The coverage details several distinct flavors of this approach. Some methods, like SmallKV, employ a "helper model" to predict which tokens matter, effectively outsourcing the memory management to a smaller, cheaper neural network. Others, like AttentionPredictor, use a tiny convolutional network to forecast attention patterns over time. The Kaitchup notes that while the mechanisms differ, the outcome is the same: "The big model outsources attention 'intuition' to a cheaper module." This is a pragmatic shift from theoretical purity to engineering reality.
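Very roughly, the predictor idea treats the attention each cached token has recently received as a short time series and forecasts its next value; the toy module below is my own placeholder to show the shape of the approach, not AttentionPredictor's actual architecture:

```python
import torch
import torch.nn as nn

class TinyAttnForecaster(nn.Module):
    """Toy stand-in for a learned attention predictor: given the attention
    each cached token received over the last few decoding steps, predict
    how much it will receive at the next step."""
    def __init__(self, window=8):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=4,
                              kernel_size=3, padding=1)
        self.head = nn.Linear(4 * window, 1)

    def forward(self, attn_window):
        # attn_window: [num_cached_tokens, window] recent attention scores
        x = self.conv(attn_window.unsqueeze(1))        # [N, 4, window]
        return self.head(x.flatten(1)).squeeze(-1)     # [N] predicted scores

# The large model would then keep only the cached tokens with the highest
# predicted scores, exactly as in the pruning sketch above.
predicted = TinyAttnForecaster()(torch.rand(100, 8))
```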
Critics might argue that aggressively pruning the cache risks degrading the model's ability to handle complex, long-context tasks. However, the piece counters this by highlighting methods like ChunkKV, which preserves semantic units rather than individual tokens, ensuring that "local semantics" aren't destroyed by random pruning. The evidence suggests that for many applications, the trade-off is not just acceptable, but essential.
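The chunk-level idea is a small twist on the earlier sketch: score and keep whole spans of consecutive tokens, so a phrase either survives intact or disappears as a unit. Again, the mean-attention score and fixed chunk size are illustrative choices, not ChunkKV's exact recipe:

```python
import torch
import torch.nn.functional as F

def prune_by_chunks(keys, values, attn_history, chunk_size=16, keep_ratio=0.25):
    """Chunk-level pruning sketch: keep or drop consecutive spans together."""
    seq_len = keys.shape[0]
    num_chunks = (seq_len + chunk_size - 1) // chunk_size

    # One score per chunk: mean attention mass over its tokens.
    pad = num_chunks * chunk_size - seq_len
    chunk_scores = F.pad(attn_history, (0, pad)).view(num_chunks, chunk_size).mean(dim=1)

    # Select whole chunks, then expand back to token indices.
    keep = torch.topk(chunk_scores, max(1, int(num_chunks * keep_ratio))).indices
    token_idx = torch.cat([
        torch.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in keep.sort().values.tolist()
    ])
    return keys[token_idx], values[token_idx], token_idx
```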
Speculative Decoding and Architectural Shifts
Beyond compression, the article explores speculative decoding, a technique where a cheap model "guesses ahead" and a larger model simply verifies the work. The trick is that the large model can check an entire run of drafted tokens in a single parallel forward pass, so several sequential decoding steps collapse into one whenever the draft is right. The Kaitchup observes that this is becoming a standard tool for speed, but the real innovation lies in how these guesses are structured. The piece highlights that researchers are moving beyond simple token prediction to "speculating on steps vs whole responses," fundamentally changing the inference pipeline.
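In outline, one draft-and-verify round looks like the following. This is a schematic greedy variant with placeholder `draft_model` and `target_model` callables (each assumed to return next-token logits for every position), not the exact algorithm from any paper discussed:

```python
import torch

def speculative_step(draft_model, target_model, prefix, num_draft=4):
    """One greedy draft-and-verify round (schematic)."""
    # 1. The cheap model guesses several tokens ahead, one at a time.
    draft = prefix.clone()
    for _ in range(num_draft):
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2. The large model scores every drafted position in ONE forward pass.
    target_logits = target_model(draft)

    # 3. Accept drafted tokens until the first disagreement, at which point
    #    the large model's own choice is taken instead.
    accepted = prefix.clone()
    for i in range(num_draft):
        pos = prefix.shape[0] + i
        target_tok = target_logits[pos - 1].argmax()
        accepted = torch.cat([accepted, target_tok.view(1)])
        if target_tok != draft[pos]:
            break
    return accepted
```

When the drafts are mostly right, each call to the large model yields several tokens instead of one, which is where the speedup comes from.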
Perhaps the most intriguing development discussed is the architectural shift seen in models like SkipV1Former. Instead of trying to compress the cache after the fact, these models are being built to require less memory from the start. "From the second block onward, each layer in SkipV1Former reuses half of its value heads directly from the first layer's values," the article explains. This design choice reduces the distinct projections needed, cutting the cache footprint by nearly half while actually improving perplexity scores. It's a reminder that efficiency is often a design constraint, not just a post-hoc optimization.
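The mechanics are easy to picture: a later layer projects only half of its value heads and borrows the rest from the cache the first layer already produced, so the borrowed heads never need storage of their own. The module below is a simplified sketch of that idea under my own naming and dimensions, not the actual SkipV1Former code:

```python
import torch
import torch.nn as nn

class HalfSharedValues(nn.Module):
    """Sketch: a later layer materializes only half of its value heads and
    reuses layer 0's cached values for the other half."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, d_model // num_heads
        # Only half of the usual value projection exists in this layer.
        self.v_proj_local = nn.Linear(d_model, d_model // 2)

    def build_values(self, x, layer0_values):
        # x:             [seq, d_model] hidden states of this layer
        # layer0_values: [seq, num_heads // 2, head_dim] cached from layer 0
        seq = x.shape[0]
        local = self.v_proj_local(x).view(seq, self.num_heads // 2, self.head_dim)
        # Half fresh heads, half borrowed: only `local` needs new cache space.
        return torch.cat([local, layer0_values], dim=1)  # [seq, num_heads, head_dim]
```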
The Kaitchup also touches on the visual domain, noting that ScaleKV adapts these compression techniques for image generation by recognizing that "attention patterns differ drastically across layers and scales." This modality-aware approach suggests that a one-size-fits-all compression strategy is insufficient; the architecture must understand the specific nature of the data it is processing.
"The differences that make each paper feel 'new' are mostly engineering choices along a few axes."
This observation is the piece's most valuable insight. It demystifies the rapid pace of AI progress, revealing it as a series of calculated engineering trade-offs rather than magical breakthroughs. By focusing on the "axes" of importance signals, timing, and granularity, the article provides a framework for readers to evaluate future claims.
The Human Element and the RL Distraction
Despite the technical focus, the piece lets the author's own preferences show through. It notes that the official "hot topic" of the conference was reinforcement learning (RL), yet the author deliberately avoided it. "I'm not a fan of the way RL is often being run right now for LLMs," the piece admits, citing a disconnect between the hype and practical utility. This decision to focus on inference efficiency rather than RL aligns with the immediate needs of the industry, where latency and cost are the primary barriers to adoption.
The article concludes with a nod to Yejin Choi's keynote, which "perfectly captured the year." While the piece doesn't detail the speech, it implies that the community is beginning to grapple with the broader implications of these efficiency gains. As models become cheaper to run, the question shifts from feasibility to responsibility. The Kaitchup's focus on the "how" rather than the "what" of AI development is a necessary corrective to the current discourse.
Bottom Line
The Kaitchup delivers a vital reality check: the next leap in AI isn't about making models smarter, but making them leaner. The strongest part of this argument is its granular breakdown of the "KV cache" problem, turning a dry technical bottleneck into a clear narrative of engineering necessity. Its biggest vulnerability is the implicit assumption that these compression techniques will scale indefinitely without unforeseen degradation in reasoning capabilities. For the busy professional, the takeaway is clear: the future of AI infrastructure lies in the quiet, unglamorous work of memory management, not just the flashy headlines of new model releases.