Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Updated Jun 2, 2026 - C++
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
(ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation
Span Queries: What if we had a way to plan and optimize GenAI like we do for SQL?
A TurboQuant implementation with llama.cpp for AMD GPUs using the Vulkan runtime
High-Performance KV Cache Sharing Library
An empirical study benchmarking LLM inference with KV cache offloading using vLLM and LMCache on NVIDIA GB200 with high-bandwidth NVLink-C2C.
KV-cache compression for LLMs: reference implementations of TurboAngle and TurboQuant codecs with Triton GPU kernels
KV Cache with PagedAttention vs PagedAttention + TurboQuant - experiments across token sizes comparing memory, latency, and accuracy.
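Instead of scrolling your phone after work, learn something. Series 2: write an LLM inference framework by hand from scratch, covering local single-GPU deployment and GPU inference optimization of the Qwen3-4B model; if GPU memory is insufficient, Qwen3-0.5B can be used instead.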
[MLSys-26] FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
Clean from-scratch inference engine for shannon-prime-lattice. NTT-based attention, two-node CRT-sharded inference path, KSTE-encoded KV state.