kvcached – UC Berkeley Sky Computing Lab

Elastic KV Cache for Dynamic GPU Sharing and Efficient Multi-LLM Inference

kvcached (KV cache daemon) is a KV cache library for LLM serving and training on shared GPUs. By bringing an OS-style virtual memory abstraction to LLM systems, it enables elastic, demand-driven KV cache allocation, which improves GPU utilization under dynamic workloads.

kvcached achieves this by decoupling GPU virtual addressing from physical memory allocation for KV caches. A serving engine initially reserves only virtual memory and later backs it with physical GPU memory when the cache is actively used. This decoupling enables on-demand allocation and flexible sharing, which improves GPU memory utilization under dynamic and mixed workloads.
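
The reserve-then-back idea can be illustrated with a small toy model. The sketch below is a pure-Python simulation, not kvcached's API: the names `PhysicalPool`, `ElasticKVCache`, `touch`, and `release` are hypothetical, and pages are only counted rather than mapped on a real GPU. It shows two caches that each reserve more virtual pages than they could both hold physically, and that consume physical pages only when a page is first used.

```python
class PhysicalPool:
    """Toy stand-in for a GPU's physical memory, counted in fixed-size pages."""

    def __init__(self, total_pages):
        self.free_pages = total_pages

    def alloc(self):
        # Hand out one physical page, or fail if none are left.
        if self.free_pages == 0:
            raise MemoryError("physical pool exhausted")
        self.free_pages -= 1

    def free(self):
        # Return one physical page to the pool.
        self.free_pages += 1


class ElasticKVCache:
    """Hypothetical cache: virtual pages are reserved up front, backed lazily."""

    def __init__(self, pool, reserved_pages):
        self.pool = pool
        self.reserved_pages = reserved_pages  # virtual reservation only
        self.mapped = set()                   # virtual pages with physical backing

    def touch(self, page):
        # Back a virtual page with physical memory on first use.
        if not 0 <= page < self.reserved_pages:
            raise IndexError("page outside the reserved virtual range")
        if page not in self.mapped:
            self.pool.alloc()
            self.mapped.add(page)

    def release(self, page):
        # Unmap a page so its physical memory can serve another cache.
        if page in self.mapped:
            self.mapped.remove(page)
            self.pool.free()


# Two models share 8 physical pages but each reserves 8 virtual pages.
pool = PhysicalPool(total_pages=8)
model_a = ElasticKVCache(pool, reserved_pages=8)
model_b = ElasticKVCache(pool, reserved_pages=8)

for p in range(6):          # model A is busy and grows to 6 pages
    model_a.touch(p)
for p in range(3):          # model A's load drops; it gives 3 pages back
    model_a.release(p)
for p in range(5):          # model B can now grow into the freed memory
    model_b.touch(p)
```

In this model the reservations add up to 16 virtual pages against 8 physical ones, which works only because physical pages are claimed on demand and returned when idle. On NVIDIA GPUs, this kind of decoupling is typically built on the CUDA driver's virtual memory management calls (such as `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap`); the toy above does not call them.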

GitHub

Blog Post

Contributors

Jiarong Xing, Yifan Qiao, Shan Yu, Xingqi Cui

Publications

CoRR – Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
CoRR – Towards Efficient and Practical GPU Multitasking in the Era of LLM