MarkTechPost

NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression

NVIDIA has open-sourced KVzap, a KV cache pruning method that delivers near-lossless 2x-4x compression for transformer models. As context lengths grow to tens of thousands of tokens, the KV cache becomes a critical bottleneck when deploying transformer decoders. The cache stores keys and values for every layer and attention head, so it grows with each token held in context; for a model like Llama-65B it can reach roughly 335 GB at very long contexts. KVzap prunes that cache to cut memory use by 2x-4x while keeping accuracy close to the uncompressed baseline.
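
To make the memory figure concrete, here is a back-of-the-envelope sketch of how an uncompressed KV cache scales. It is not part of KVzap. The Llama-65B shape (80 layers, 64 heads, head dimension 128) comes from the published model configuration. The 16-bit storage and the 128,000-token context are assumptions, chosen because they reproduce the ~335 GB figure. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size in bytes of an uncompressed KV cache.

    The leading factor of 2 accounts for storing both keys and values
    for every layer, head, and token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Llama-65B shape; the context length and 16-bit storage are assumptions.
full = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, n_tokens=128_000)
print(f"uncompressed: {full / 1e9:.1f} GB")  # uncompressed: 335.5 GB

# Effect of the 2x-4x compression range quoted above.
for ratio in (2, 4):
    print(f"{ratio}x compression: {full / ratio / 1e9:.1f} GB")
```

Under these assumptions, 2x-4x compression would bring the cache down to roughly 168 GB and 84 GB respectively, which is why pruning directly affects how many long-context requests fit on a given set of GPUs.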