
How d-Matrix's in-memory compute tackles AI inference economics

Vikram Sekar cuts through the hype of the current artificial intelligence boom by focusing on a single, stubborn bottleneck: the energy and cost of moving data. While the industry chases ever-larger models, Sekar argues that the next breakthrough won't come from bigger chips, but from a fundamental architectural shift that stops shuffling bits back and forth. This piece is notable because it doesn't just praise a new startup; it dissects a specific engineering pivot that challenges the dominance of the graphics processing unit (GPU) as the default engine for AI inference.

The Physics of the Problem

Sekar begins by grounding the discussion in the physical reality of computation. He explains that the standard GPU approach, while powerful, is inefficient because it treats memory and processing as separate entities. "The process of AI training and inference involves a lot of matrix multiplications, followed by additions... These operations are called multiply-accumulate (MAC) operations," he writes. The core of his argument is that nature already offers a more efficient way to perform these calculations, a concept rooted in basic physics rather than complex logic gates.
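To make the MAC idea concrete, here is a minimal Python sketch (purely illustrative, not d-Matrix code) showing that a matrix-vector product is nothing more than repeated multiply-accumulate steps:

```python
# Illustrative only: a matrix-vector product expressed as explicit
# multiply-accumulate (MAC) steps, the operation Sekar describes.

def matvec_mac(W, x):
    """Compute y = W @ x one MAC at a time."""
    rows, cols = len(W), len(x)
    y = [0.0] * rows
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += W[i][j] * x[j]   # one multiply-accumulate
        y[i] = acc
    return y

W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, 0.25]
print(matvec_mac(W, x))  # [1.0, 2.5]
```

On a conventional accelerator, every one of those weights must be fetched from memory before its multiply can run, which is exactly the data-movement cost Sekar is targeting.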


He draws a direct line to the analog world, noting that "the analog multiply operation stems from Ohm's law expressed in the current-voltage relation form: I=GV." This is a crucial framing choice. By invoking Ohm's law, Sekar reminds the reader that computing isn't just about software; it's about electricity. He points out that in an analog system, the entire matrix can be computed "nearly instantaneously while consuming minimal power because there are no switching transistors." This is where the historical context of the memristor becomes relevant; the theoretical groundwork for these devices was laid decades ago, yet the industry only recently began to seriously revisit them as a solution to the "memory wall."
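A back-of-the-envelope way to see the point: store each weight as a conductance G, apply the input as a voltage V, let Ohm's law produce a per-cell current I = G·V, and let Kirchhoff's current law sum the currents along each bitline. The short simulation below models that idealized behavior; the array size and conductance values are invented for illustration.

```python
import numpy as np

# Idealized analog crossbar: weights stored as conductances (siemens),
# inputs applied as voltages (volts). Ohm's law gives each cell's
# current I = G * V; summing the currents down a bitline (Kirchhoff's
# current law) performs the accumulate step "for free".

G = np.array([[1e-6, 2e-6],      # conductance matrix (the weights)
              [3e-6, 4e-6]])
V = np.array([0.5, 0.25])        # input voltages (the activations)

I_cell = G * V                   # per-cell currents via Ohm's law
I_bitline = I_cell.sum(axis=1)   # bitline currents = matrix-vector product

print(I_bitline)                 # [1.0e-06 2.5e-06] amps
```

In this idealized model the entire matrix-vector product appears as a set of bitline currents in a single step, which is where the "nearly instantaneously" claim comes from.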

However, Sekar is careful not to present this as a magic bullet. He acknowledges the significant hurdles that have kept analog computing in the lab. "The main reason this has not become the de-facto computing standard is because conductance (G) varies in the analog world. Noise, temperature, and variability all make current-based AIMC challenging." This nuance is vital. It prevents the commentary from devolving into techno-optimism and instead sets the stage for why d-Matrix's specific approach matters. They didn't stick with the pure analog route; they found a middle ground.

"Going digital alleviates a lot of the 'variability' problems that plagues analog approaches, while being able to scale with newer technologies."

The Digital Compromise

The article's most compelling section details d-Matrix's decision to abandon their initial analog prototype, Nighthawk, in favor of a digital in-memory compute (DIMC) architecture. Sekar explains that the analog approach required placing analog-to-digital converters on every bitline, a design choice that proved too costly and complex. Instead, the company wove SRAM (static random-access memory) cells directly into the compute function.

This is not merely a minor tweak; it is a reimagining of the chip's internal geography. Sekar notes that "instead of allocating a portion of the chip to SRAM and a portion to compute, d-Matrix has finely-tuned digital in-memory compute (DIMC) cores consisting of MAC functions and SRAM that are designed to handle very specific numeric array sizes - 64 × 64 to be exact." The result is a system where data doesn't have to travel far to be processed. The author highlights the staggering performance gains: "d-Matrix chips provide a memory bandwidth up to 150 TB/s; much faster than the best-HBM based implementations which only achieve about 2 TB/s per chip."
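One way to picture the layout is a matrix multiply carved into 64 × 64 tiles, with each tile's weights sitting next to the MAC hardware that consumes them. The sketch below only demonstrates the tiling arithmetic; how d-Matrix actually schedules tiles onto its cores is not something I know, so the mapping is an assumption.

```python
import numpy as np

TILE = 64  # the fixed array size Sekar cites for d-Matrix's DIMC cores

def tiled_matmul(A, B, tile=TILE):
    """Block matrix multiply: each (tile x tile) block of A is conceptually
    'resident' in one in-memory compute core, so its weights never move."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % tile == 0
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # one 64x64 MAC block, computed where the weights are stored
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```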

Critics might note that SRAM is inherently expensive and low-density compared to the high-bandwidth memory (HBM) used in modern GPUs. Sekar anticipates this objection, arguing that the trade-off is worth it for specific workloads where latency is the primary constraint. He suggests that the industry is moving away from "once-size-fits-all" racks toward architectures fine-tuned for specific tasks. This is a persuasive argument for specialized hardware, echoing the shift from general-purpose CPUs to GPUs in the last decade.

Scaling the Solution

The final challenge addressed is capacity. A single chiplet with 256 megabytes of SRAM cannot hold a massive language model. Sekar details how d-Matrix solves this by connecting four chiplets together on a single substrate, creating a 1-gigabyte pool of ultra-fast memory. "The die-to-die interconnect IP called 'DMX Link' is a custom built in-house solution that achieves a total bandwidth of 1 TB/s," he writes. This modular approach allows the system to scale up to rack-level solutions without relying on the expensive and complex CoWoS packaging required for HBM.
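The capacity arithmetic is simple enough to write out. The 256 MB per chiplet and four chiplets per package come from the article; the assumption of roughly one byte per parameter (an 8-bit weight format) and the example model sizes are mine, purely to show the order of magnitude and why LPDDR expansion is needed.

```python
# Rough sizing arithmetic (illustrative; the byte-per-parameter assumption is mine).
SRAM_PER_CHIPLET_GB = 0.256          # 256 MB of SRAM per chiplet (from the article)
CHIPLETS_PER_PACKAGE = 4             # four chiplets on one substrate

sram_per_package_gb = SRAM_PER_CHIPLET_GB * CHIPLETS_PER_PACKAGE
print(f"SRAM per package: {sram_per_package_gb:.2f} GB")   # ~1 GB

# How many packages it would take just to hold a model's weights, assuming
# ~1 byte per parameter and ignoring KV cache and activations entirely.
for params_billion in (7, 70):
    weight_gb = params_billion           # ~1 GB per billion params at 1 byte each
    packages = weight_gb / sram_per_package_gb
    print(f"{params_billion}B params -> ~{weight_gb} GB -> ~{packages:.0f} packages of SRAM")
```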

The author emphasizes the economic implications of this design. By using standard LPDDR memory for expansion and avoiding the premium interposers of competitors, d-Matrix claims a "3× lower cost" and "3–5× better energy efficiency than GPU-based systems." Sekar frames this as a strategic advantage for enterprises and sovereign customers who need to run inference at scale without burning through their power budgets. He points out that the system can be configured in "Performance or Capacity mode," allowing operators to balance speed against memory needs depending on whether they are in the prefill or decoding phase of a query.
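The prefill/decode split maps onto a familiar rule of thumb: prefill is largely compute-bound, while token-by-token decoding is bound by how quickly weights can be streamed past the arithmetic units, so an upper bound on single-stream decode speed is roughly memory bandwidth divided by bytes read per token. The sketch below reuses the bandwidth figures quoted in the piece; the 40 GB model footprint is an arbitrary example, and the calculation ignores capacity limits, sharding, and batching.

```python
# Rough decode-throughput bound: tokens/s <= memory bandwidth / bytes touched per token.
# Bandwidth figures are those quoted in the article; the 40 GB model footprint is a
# made-up example, and real systems batch requests and shard models across packages.

MODEL_BYTES = 40e9            # hypothetical model footprint read once per token

for name, bandwidth_tb_s in [("HBM-based GPU (~2 TB/s)", 2), ("d-Matrix DIMC (~150 TB/s)", 150)]:
    tokens_per_s = (bandwidth_tb_s * 1e12) / MODEL_BYTES
    print(f"{name}: upper bound ~{tokens_per_s:,.0f} tokens/s per stream")
```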

"Fine-tuned AI accelerator architectures are becoming increasingly important over 'once-size-fits-all' racks, and is a key reason why the d-Matrix approach is attractive."

Bottom Line

Vikram Sekar's analysis effectively demonstrates that the path to efficient AI inference may lie in abandoning the brute-force scaling of GPUs in favor of architectural specialization. The strongest part of the argument is the clear explanation of how moving computation into memory bypasses the energy costs of data movement, a physical limitation that software alone cannot solve. The biggest vulnerability, however, remains the ecosystem lock-in; convincing the industry to adopt a new hardware standard when the GPU supply chain is already entrenched is a monumental challenge. Readers should watch to see if d-Matrix's performance claims hold up in real-world, large-scale deployments beyond the controlled environments of their test chips.

Deep Dives

Explore these related deep dives:

  • In-memory processing

    The article's central focus is on d-Matrix's in-memory compute approach for AI inference. Understanding the broader technical foundations of in-memory computing - how it differs from traditional von Neumann architectures and why moving computation closer to data reduces latency - provides essential context for evaluating d-Matrix's claims.

  • Memristor

    The article mentions memristors as one approach to implementing analog in-memory compute weights. Memristors are a fascinating fourth fundamental circuit element theorized by Leon Chua in 1971 and first physically realized in 2008, with significant implications for neuromorphic computing that readers may not know deeply.

  • Ohm's law

    The article explains how analog in-memory compute leverages Ohm's law (I=GV) for multiplication operations. While readers may remember the basic formula, the deeper history of Georg Ohm's discovery, the physics behind electrical resistance, and its foundational role in electronics provides enriching context for understanding why this natural property enables efficient computation.

Sources

How d-Matrix's in-memory compute tackles AI inference economics

by Vikram Sekar · Vik's Newsletter


Disclaimer: This article is entirely my own opinion. I have not been paid by d-Matrix, nor do I have any access to internal documents. All information is publicly available (references cited). I do not hold any investment position in d-Matrix, and this is not investment advice. Do your own research. This article does not reflect the views of any past, present, or future employers, nor does it directly or indirectly imply any competitors are better or worse. This is my attempt at trying to understand how core technology works and where its advantages lie. I do not endorse any products.

Disclosure: I requested that d-Matrix review the article to ensure that I do not misunderstand/misrepresent their technology. I’m grateful to them for pointing out errors in my conceptual understanding. All editorial decisions are entirely mine.

Recently d-Matrix, a Bay Area AI inference chip startup, announced its Series C funding of $275M, which brings its total funding to $450M.

d-Matrix claims to have the “world’s highest performing, most efficient data center inference platform for hyperscale, enterprise, and sovereign customers,” and a “full-stack inference platform that combines breakthrough compute-memory integration, high-speed networking, and inference-optimized software to deliver 10× faster performance, 3× lower cost, and 3–5× better energy efficiency than GPU-based systems.”

Their main compute engine, Corsair, is based on a different approach to inference called in-memory compute. In this post, we will look at this technology in detail, how it provides all those benefits, and where it is useful.

For free subscribers:

Analog in-memory computing

d-Matrix’s digital in-memory compute solution

Four chiplets and LPDDR5

Scaling up to rack-level solutions

References

For paid subscribers:

A real-world use-case for d-Matrix DIMC hardware

Designing Hardware for Latency, Throughput, and TCO

The PCIe Advantage

Possible Uses of Small Inference Models running d-Matrix Hardware

Analog In-memory Compute (AIMC).

The process of AI training and inference involves a lot of matrix multiplications, followed by additions, which come from vector multiplications. If you need a deeper understanding of what those operations are in the context of transformers, check out ...