kvcache

Here are 25 public repositories matching this topic...

kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

reinforcement-learning inference rdma disaggregation llm vllm sglang kvcache tokenspeed

Updated Jun 2, 2026
C++

uccl-project / uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven).

ai networking hpc amd gpu collective cuda p2p nvidia broadcom moe rdma allreduce llm kvcache

Updated Jun 1, 2026
C++

Zefan-Cai / R-KV

[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

llm kvcache reasoning-models

Updated Oct 16, 2025
Python

ovg-project / kvcached

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

serverless inference-engine llm llm-serving vllm llm-inference ollama llm-framework sglang kvcache gpu-sharing kvcached gpu-mutiplexing kvcache-optimization elastic-kvcache online-offline-coserve

Updated Jun 2, 2026
Python

ModelEngine-Group / unified-cache-management

Persist and reuse the KV cache to speed up your LLM.

gpu cuda nfs torch ssd dram hbm ucm npu ascend llm vllm deepseek kvcache

Updated Jun 2, 2026
Python

alibaba / tair-kvcache

Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global cache management, inference simulation (HiSim), and more.

simulator kv-cache llm kvcache hisim

Updated Jun 2, 2026
C++

rh-aiservices-bu / sardeenz

Sardeenz is a proof-of-concept application that loads more than one model onto a single GPU, letting you keep adding models until the GPU is fully utilized.

vllm kvcache kvcached

Updated May 20, 2026
TypeScript

NoakLiu / PiKV

PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]

distributed-systems parallel-computing moe mixture-model management-system mixture-of-experts mlsystem kv-cache kvcache

Updated May 19, 2026
Python

Linking-ai / SCOPE

(ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation

long-context kv-cache-compression kvcache

Updated May 28, 2025
Jupyter Notebook

SiO-2 / kvcloak

Official implementation of "Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference" (NDSS 2026)

privacy llm kvcache ndss-2026

Updated Feb 28, 2026
Python

IBM / spnl

Span Queries: What if we had a way to plan and optimize GenAI like we do for SQL?

sql optimization locality generative-ai kvcache

Updated May 29, 2026
Rust

jimliddle / turboquant-amd-vulkan

A TurboQuant implementation with llama.cpp for AMD with the Vulkan runtime

amd vulkan llms kvcache turboquant

Updated Apr 1, 2026
C++

lizixi-0x2F / March

High-Performance KV Cache Sharing Library

optimization high-performance-computing kv-cache llm vllm kvcache

Updated Apr 2, 2026
Python

xxrjun / gb200-kvcache-offload-study

An empirical benchmarking study of LLM inference with KV cache offloading, using vLLM and LMCache on NVIDIA GB200 with high-bandwidth NVLink-C2C.

offloading blackwell kvcache gb200

Updated Dec 20, 2025
Python

SuperMarioYL / inference-cookbook

inference cookbook / an analysis of how inference frameworks work

vllm sglang kvcache

Updated Feb 24, 2026
HTML

llmsresearch / kvcompress

KV-cache compression for LLMs: reference implementations of TurboAngle and TurboQuant codecs with Triton GPU kernels

kvcache kvcache-compression turboquant turboangle

Updated Apr 5, 2026
Python

amitshekhariitbhu / turboquant-experiment

KV Cache with PagedAttention vs PagedAttention + TurboQuant - experiments across token sizes comparing memory, latency, and accuracy.

inference large-language-models llm llms llm-inference kvcache kvcache-optimization kvcache-compression turboquant

Updated Mar 26, 2026
Python

muyuuuu / LLM-Inference

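Instead of scrolling on your phone after work, learn something. Series 2: hand-write an LLM inference framework from scratch, covering local single-GPU deployment and GPU inference optimization of the Qwen3-4B model; if GPU memory is insufficient, Qwen3-0.5B can be used.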

triton sampling llm-inference flash-attention kvcache qwen3 page-attention

Updated Feb 23, 2026
Python

NazmulTakbir / FlexiCache

[MLSys-26] FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

vllm llm-inference kvcache kvcache-optimization

Updated Mar 9, 2026
Python

nihilistau / shannon-prime-system-engine

Clean from-scratch inference engine for shannon-prime-lattice. NTT-based attention, two-node CRT-sharded inference path, KSTE-encoded KV state.

machine-learning machine-learning-algorithms transformers feedforward-neural-network attention-mechanism attention-is-all-you-need transformer-encoder transformer-architecture kvcache kvcache-optimization kvcache-compression

Updated Jun 2, 2026
HTML
