MinIO rolled out its second major product earlier this month. Dubbed MemKV, the software expands the KV cache layer in AI inference clusters, thereby enabling bigger context windows. MemKV sits at the G3.5 layer in Nvidia’s CMX stack, and MinIO says it will give customers microsecond context retrieval latencies on petabyte-scale data sets.
As AI inference workloads scale up, systems are quickly running out of memory to store AI state: the responses to questions that the GPU has already computed during the initial prefill stage, along with other information that the user or agent added for context.
This state data is stored as key-value (KV) pairs in a cache. As the KV cache fills up, it spills over from fast HBM and DRAM memory to slower storage media, which increases the wait times that users or agents experience. In some cases, it’s faster to simply recompute the state by re-running the prefill jobs on GPUs than to wait for KV state information to be pulled from slower storage. But paying that “recompute tax” clearly is not an ideal long-term solution.
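The fetch-versus-recompute trade-off can be sketched as a simple cost comparison. This is a minimal illustration, not logic taken from MinIO or Nvidia: the function names and every number used (KV bytes per token, tier bandwidth, latency, prefill throughput) are hypothetical placeholders.

```python
def fetch_seconds(kv_bytes: int, bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Estimated time to pull cached KV state from a storage tier."""
    return latency_s + kv_bytes / bandwidth_bytes_per_s


def recompute_seconds(num_tokens: int, prefill_tokens_per_s: float) -> float:
    """Estimated time to rebuild the same state by re-running prefill."""
    return num_tokens / prefill_tokens_per_s


def should_recompute(kv_bytes: int, num_tokens: int, bandwidth_bytes_per_s: float,
                     latency_s: float, prefill_tokens_per_s: float) -> bool:
    """True when re-running prefill is estimated to beat fetching the cache."""
    fetch = fetch_seconds(kv_bytes, bandwidth_bytes_per_s, latency_s)
    recompute = recompute_seconds(num_tokens, prefill_tokens_per_s)
    return recompute < fetch


# Hypothetical workload: 128K tokens at an assumed 160 KiB of KV state per token.
TOKENS = 128 * 1024
KV_BYTES = TOKENS * 160 * 1024
PREFILL_RATE = 20_000.0  # tokens per second, assumed

# A slow tier (assumed 2 GB/s, 1 ms latency) loses to recomputation...
slow_tier = should_recompute(KV_BYTES, TOKENS, 2e9, 1e-3, PREFILL_RATE)
# ...while a fast tier (assumed 50 GB/s, 10 microseconds) wins.
fast_tier = should_recompute(KV_BYTES, TOKENS, 50e9, 10e-6, PREFILL_RATE)
print(slow_tier, fast_tier)
```

With these assumed numbers, recomputing is cheaper than fetching from the slow tier but more expensive than fetching from the fast one, which is the gap that faster KV cache tiers aim to close.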
MinIO’s MemKV product slots in at the G3.5 layer
Nvidia’s response to this problem was to launch the Context Memory Storage (CMX) platform at the Consumer Electronics Show in early January 2026. CMX utilizes Nvidia’s upcoming BlueField-4 data processing units (DPUs) and ConnectX-8 and -9 SuperNICs to logically extend the KV cache from the high-speed HBM and DRAM memory connected to its superchips over RDMA-enabled Ethernet links to just-a-bunch-of-flash (JBOF) NVMe storage. CMX also uses the Nvidia Inference Xfer Library (NIXL) software, which Nvidia created to accelerate point-to-point communications within AI inference frameworks, such as its own Dynamo framework.
Over the past few months, all of the major storage vendors have rolled out their strategies for supporting the CMX architecture, which is an extension of Nvidia’s reference architecture for AI storage, dubbed STX. The BlueField-4 DPUs won’t be shipping until the second half of 2026, so none of the new storage solutions based on the CMX and STX reference architectures are available yet, although many vendors have existing workarounds for the KV cache problem.
MinIO says MemKV takes a fresh approach to this issue, one that is free of technological baggage and slow workarounds. The key, according to MinIO CTO Ugur Tigli, is to minimize the amount of code in the critical AI inference path, thereby minimizing response times. That is the same mantra MinIO has espoused with its popular open source S3-compatible object store, which it rebranded AIStor.
“We used our experience from the past distributed file systems and S3 storage with AIStor with persistent storage, and we came up with MemKV, which is essentially sitting in that G3.5 layer,” Tigli told HPCwire. “It’s a distributed KV memory that can be addressed by all GPUs in that layer.”
In Nvidia’s “G” hierarchy, G1 refers to the HBM connected directly to the GPUs, while G2 refers to DRAM connected directly to CPUs on Nvidia Grace Blackwell and Vera Rubin superchips. The G3 layer is a PCIe-connected NVMe drive that sits right next to the Nvidia superchips within the same rack, while G4 refers to network-attached storage. G3.5 refers to a CMX solution that is technically network attached over RDMA and Ethernet but runs at in-memory speeds.
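The tier ordering can be pictured as a lookup that falls through from the fastest tier to the slowest. The sketch below is purely illustrative: the `Tier` class, `build_hierarchy`, and `lookup` are hypothetical names, and the in-memory dictionaries stand in for real hardware.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Tier:
    name: str
    description: str
    store: Dict[str, bytes] = field(default_factory=dict)


def build_hierarchy() -> List[Tier]:
    """Tiers ordered from fastest to slowest, following the article's description."""
    return [
        Tier("G1", "HBM connected directly to the GPU"),
        Tier("G2", "DRAM connected directly to the CPU"),
        Tier("G3", "PCIe-connected NVMe next to the superchip"),
        Tier("G3.5", "flash reached over RDMA and Ethernet (CMX)"),
        Tier("G4", "network-attached storage"),
    ]


def lookup(tiers: List[Tier], key: str) -> Optional[Tuple[str, bytes]]:
    """Return (tier name, value) from the first tier holding the key, else None."""
    for tier in tiers:
        if key in tier.store:
            return tier.name, tier.store[key]
    return None


tiers = build_hierarchy()
tiers[3].store["session-42"] = b"kv-blocks"  # state that spilled to G3.5
print(lookup(tiers, "session-42"))
print(lookup(tiers, "unknown-session"))
```

A miss in every tier (the `None` case) corresponds to the situation where the state has to be recomputed on the GPU.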
MemKV will run directly on the ARM cores of the BlueField-4 DPUs sitting in CMX storage appliances from OEMs. Rather than using NIXL, MemKV implements its own I/O stack over RDMA that’s based on 2 to 16 MB block sizes, which is what GPUs expect to see. There is no file system or object store in MemKV, just RDMA acceleration to JBOF over twin 400 GbE or 800 GbE links. The result is a very fast system that can deliver microsecond latencies for KV cache data sitting in petabytes of JBOF storage.
MinIO says MemKV can dramatically reduce the recompute tax. Tests performed using the AI Perf benchmark showed that MemKV delivers a 50% improvement in the time-to-first-token (TTFT) metric compared to recomputing state at the GPU. Running at scale with 128 GPUs and a 128K-token context length showed that MemKV increased GPU utilization from around 50% to more than 90%. Extended out over a year, that equates to $2 million in compute savings.
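As a rough sanity check on those figures, the arithmetic below converts the utilization gain into GPU-equivalents and backs out the hourly GPU cost that a $2 million annual saving would imply. MinIO's methodology is not given here, so the proportional-savings model and the function names are assumptions, and 50% and 90% stand in for "around 50%" and "more than 90%."

```python
HOURS_PER_YEAR = 24 * 365


def recovered_gpu_equivalents(num_gpus: int, util_before: float, util_after: float) -> float:
    """GPU-equivalents of useful work gained from higher utilization (assumed model)."""
    return num_gpus * (util_after - util_before)


def implied_cost_per_gpu_hour(annual_savings: float, gpu_equivalents: float) -> float:
    """Hourly GPU cost at which the recovered capacity is worth the stated savings."""
    return annual_savings / (gpu_equivalents * HOURS_PER_YEAR)


gpus = recovered_gpu_equivalents(128, 0.50, 0.90)
rate = implied_cost_per_gpu_hour(2_000_000, gpus)
print(f"{gpus:.1f} GPU-equivalents recovered")
print(f"implied cost: ${rate:.2f} per GPU-hour")
```

Under this assumed model, the gain works out to about 51 GPU-equivalents, and the $2 million figure implies a cost of a little under $4.50 per GPU-hour.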
Sitting at the G3.5 layer gives MemKV an advantage when it comes to large AI workloads that may span multiple GPUs. With a G3 solution, the state is tied to a specific NVMe drive connected to a given GPU or CPU. With MemKV’s G3.5 solution, the state in any NVMe drive is accessible to any GPU or CPU that may need to access it.
“This is going to be dramatically important when agents come into the picture in enterprise, because agents will just spill out so much intermediate data, and having it pinned to a GPU is not going to cut it,” Tigli said. “They need to have a global altogether address memory.”
While keeping GPUs running near their capacity makes good economic sense in the new world of generative and agentic AI, it depends on what the GPU is doing. Keeping the GPU occupied with constantly re-computing state because the KV cache is too small is not a good use of the most expensive piece of hardware in the AI factory. The hardware pandemic has made some NVMe drives difficult and expensive to get, but the company says that spending $80,000 or so on a 1-petabyte NVMe system to extend the KV cache via MemKV is still a better use of the money than spending it on a GPU.
“Every GPU recovered from recomputation is also a GPU you don’t need to buy, power, or cool, and at 1,200 watts per modern AI GPU, the energy savings compound fast,” the company says in a blog post on MemKV. “This isn’t a storage story. This is a ‘how many fewer GPUs do I need to buy next quarter’ story. The recompute tax is the dominant cost driver in production inference, and MemKV collapses it.”
MemKV is available now in preview mode. Although it could theoretically run on any ARM system, the solution currently is tied to Nvidia BlueField-4 DPUs, which are expected to ship in the second half of 2026.
The post MinIO Spies Solution to GPU Memory Wall Problem with MemKV appeared first on AIwire.

