Thanks to the success of its GPUs in powering the first stage of the AI boom, Nvidia became not only the world’s most successful chip company, but the world’s most valuable company. But as we enter phase two of the AI boom, we’re seeing a new class of chip based on static random access memory (SRAM) come to the forefront. That’s good news for Nvidia, which bought its own SRAM chipmaker, as well as upstarts like d-Matrix, Cerebras, and an SRAM cloud company called Gimlet Labs.
While GPUs excel at chewing through massive gobs of data, keeping previously computed AI model values in memory is the main bottleneck with AI inference workloads today. The so-called GPU memory wall is the primary barrier, as it imposes a hard limit on the number of previously computed keys and values an AI inference system can cache in memory for quick recall during an AI session. A smaller key-value (KV) cache translates into a substandard experience for users, whether through limited context windows, longer response times, or fewer concurrent users.
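To see why the cache fills up so quickly, here is a rough back-of-the-envelope sketch of KV cache sizing. The formula (one key tensor and one value tensor per layer, per token, per user) is the standard estimate for transformer models. The model dimensions are hypothetical, chosen only for illustration, and do not describe any product named in this article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   concurrent_users, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit values (an assumption, not a given).
    """
    return (2 * layers * kv_heads * head_dim
            * context_tokens * concurrent_users * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
for users in (1, 8, 32):
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          context_tokens=8192, concurrent_users=users)
    print(f"{users:>2} users x 8,192 tokens: {size / GIB:.1f} GiB")
```

Under these made-up settings, a single 8,192-token session needs 1 GiB of cache, and the requirement grows linearly with both context length and the number of concurrent users, which is the trade-off described above.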
The brute-force answer to the KV cache problem is to cram as much fast memory into the system as physically (or fiscally) possible. While the high-bandwidth memory (HBM) used in Nvidia and AMD GPUs offers relatively big caches for GPUs to stash data as they process it, HBM resides off the chip, which limits the total memory bandwidth. The fastest memory available is SRAM, which resides directly on the chip and offers memory bandwidth on the order of 100 TBps to 150 TBps, compared with 1.2 TBps (HBM3) to 2 TBps (HBM4) per HBM stack.
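As a rough illustration of what those bandwidth figures mean, the sketch below computes the minimum time needed just to stream a given number of bytes from memory. The bandwidth values are the ones quoted above. The 20 GB read per generated token is a hypothetical workload, and the calculation ignores compute time, memory capacity (a single SRAM chip holds far less than an HBM stack), multi-stack GPUs, and chip-to-chip traffic.

```python
TB = 1e12  # decimal terabyte, in bytes


def min_seconds_per_token(bytes_read_per_token, bandwidth_bytes_per_s):
    """Lower bound on per-token latency when memory reads are the only cost."""
    return bytes_read_per_token / bandwidth_bytes_per_s


# Hypothetical workload: 20 GB of weights and cached values read per token.
BYTES_PER_TOKEN = 20e9

memories = [
    ("HBM3, one stack", 1.2 * TB),
    ("HBM4, one stack", 2.0 * TB),
    ("on-chip SRAM", 150.0 * TB),
]

for name, bandwidth in memories:
    seconds = min_seconds_per_token(BYTES_PER_TOKEN, bandwidth)
    print(f"{name:<16} at least {seconds * 1e3:7.3f} ms per token")
```

The absolute numbers depend entirely on the assumed workload. The ratio does not: at 150 TBps versus 2 TBps, the same bytes stream 75 times faster.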
Cerebras WSE-3 features 44GB of on-chip SRAM
SRAM is superfast but it’s relatively expensive, which has traditionally limited its use to the chip registers and internal L1, L2, and L3 caches. Dynamic RAM (DRAM), by contrast, is slower but less expensive than SRAM, and traditionally has been the choice for use as main working memory.
But as AI pushes the limits of computing, that has started to change, and chipmakers are starting to build designs that feature SRAM as main memory. Groq is one of those chipmakers. Acquired by Nvidia in December for $20 billion, Groq designed its Language Processing Unit (LPU) by placing vector and matrix computing units directly on chips containing loads of SRAM. Nvidia quickly turned its acquisition around and launched its Groq 3 LPX racks at GTC in March.
But Groq LPUs aren’t the only SRAM game in town. d-Matrix also based its chip on a digital in-memory compute (DIMC) architecture that utilizes large amounts of SRAM. The Santa Clara, California-based company has been developing its Corsair accelerator, which incorporates 256 MB of SRAM in a 3D-stacked chiplet form factor. Each Corsair is a PCIe Gen 5 card that delivers 150 TBps of memory bandwidth.
Last week, d-Matrix announced that Corsair is in full production, with product shipping in volume to priority customers. Each Corsair accelerator offers up to 2,400 teraflops of 8-bit dense compute within a 600-watt TDP, and can be installed in standard air-cooled server racks. It’s being manufactured by d-Matrix partners TSMC and Alchip Technologies on TSMC’s N6 process node.
“We built Corsair specifically for this moment, the Age of AI Inference,” stated Sid Sheth, founder and CEO of d-Matrix. “The applications that matter most today–agentic AI, interactive coding, real-time voice agents–live or die on latency. Corsair takes off from where the GPU leaves off, and this summer our customers will be able to experience the turbocharge d-Matrix brings at full rack scale.”
Nvidia’s new Groq 3 LPU
Another chipmaker building around SRAM is Cerebras Systems. Where the d-Matrix Corsair takes a svelte, chiplet-based approach that can scale from very small to big, Cerebras is swinging for the fences with its massive chip, the Wafer Scale Engine.
The WSE-3, which Cerebras announced in March 2024, is a monster of a chip, containing 4 trillion transistors and packing 44GB of on-chip SRAM along with nearly 1 million AI compute cores onto a silicon wafer the size of a dinner plate. With up to 1 PB of external memory, it can train the biggest AI models in the world, spanning 24 trillion parameters or more.
Cerebras Systems went public three weeks ago. Trading on the Nasdaq under the ticker symbol CBRS, Cerebras raised $5.55 billion with its shares priced at $185 each, making it the biggest IPO of the year (so far). The company is currently valued at $56 billion, which makes it the poster child for the emerging SRAM chip market.
Another SRAM company to keep an eye on is Gimlet Labs, an applied AI research startup out of Stanford University. The company is developing what it calls a “multi-silicon inference cloud” that eliminates hardware considerations for AI customers by building an abstraction layer between the workload and the hardware.
Zain Asgar, the Stanford adjunct professor who co-founded Gimlet and is its CEO, is a fan of traditional GPUs as well as newer SRAM accelerators.
Source: Gimlet Labs
“Our software orchestration slices and maps inference workloads to the optimal hardware, and that experience has given us a practical view of where each architecture sits,” Asgar wrote in a March blog post titled “The emerging role of SRAM-centric chips in AI inference.” “With top labs increasingly investing in inference speed and throughput, SRAM-centric accelerators are positioned to capture a meaningful share of the market.”
Gimlet, which recently raised $80 million in a Series A round, is running SRAM-based accelerators in its cloud, alongside traditional chips like Nvidia GPUs. The prefill and decode stages of AI inference put dramatically different demands on processors and memory stacks, and fitting each workload to the appropriate hardware is a constant challenge.
The auto-regressive nature of the decode stage makes it highly memory-intensive, and it gains little from added compute density, which favors near-compute memory architectures such as SRAM, Asgar wrote.
“This has created a perfect storm for near-compute memory chips (like today’s SRAM-centric architectures), which provide superior performance for decode,” he wrote. “As an industry, we are entering an exciting new era of chip design, with the unique demands of today’s inference workloads pulling architectures in many different directions at once. We look forward to seeing (and using) the new chips that emerge under these constraints.”
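The contrast between decode and prefill can be sketched with a simple roofline-style check. This is an illustration under stated assumptions, not Gimlet's method. It models only weight reads for a single matrix multiply, ignores KV cache and activation traffic, and assumes 8-bit weights. The peak-compute and bandwidth figures are the Corsair numbers quoted earlier (2,400 teraflops and 150 TBps), and the token counts are hypothetical.

```python
def matmul_intensity(tokens, bytes_per_weight=1.0):
    """FLOPs performed per byte of weights read.

    Applying a d x d weight matrix to `tokens` tokens costs 2*d*d*tokens
    FLOPs and reads d*d*bytes_per_weight bytes once, so d cancels out.
    """
    return 2.0 * tokens / bytes_per_weight


def machine_balance(peak_flops, bandwidth_bytes_per_s):
    """FLOPs the chip can perform in the time it takes to read one byte."""
    return peak_flops / bandwidth_bytes_per_s


def is_memory_bound(tokens, peak_flops, bandwidth_bytes_per_s,
                    bytes_per_weight=1.0):
    """True when the workload runs out of bandwidth before compute."""
    return (matmul_intensity(tokens, bytes_per_weight)
            < machine_balance(peak_flops, bandwidth_bytes_per_s))


PEAK = 2400e12       # 2,400 teraflops of 8-bit compute, as quoted above
BANDWIDTH = 150e12   # 150 TBps, as quoted above

print("balance:", machine_balance(PEAK, BANDWIDTH), "FLOPs per byte")
for stage, tokens in (("decode", 1), ("prefill", 2048)):
    bound = "memory" if is_memory_bound(tokens, PEAK, BANDWIDTH) else "compute"
    print(f"{stage}: {matmul_intensity(tokens):.0f} FLOPs per byte, {bound}-bound")
```

In this simplified model, decode produces one token at a time and performs only a couple of operations per byte fetched, so memory bandwidth sets the pace. Prefill processes the whole prompt at once and reuses each weight thousands of times, so compute becomes the limit instead.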
Editor’s note: A version of this story first appeared in HPCwire.
The post SRAM Chips Pulling Ahead in the New AI World appeared first on AIwire.

