7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."


We’ve spent years optimizing LLM inference around compute - faster kernels, better batching, smarter parallelism. But in production, the bottleneck increasingly isn’t FLOPs. It’s state. Specifically, the KV-cache: the attention state that makes the difference between a 4-second prefill and a sub-second cache hit. Lose it to eviction, isolate it on a single node, or fail to route to it - and you’re paying the full compute cost again for work already done.
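To make that cost asymmetry concrete, here is a back-of-envelope sketch in Python. All numbers are illustrative assumptions (prompt length, prefill throughput, cached fraction), not benchmarks from this talk; they are chosen only to show how a high cache-hit rate turns a multi-second prefill into a sub-second one.

```python
# All numbers below are illustrative assumptions, not measurements.
PROMPT_TOKENS = 8_000        # long shared system prompt + context
PREFILL_TOK_PER_S = 2_000    # assumed prefill throughput of one replica
CACHED_FRACTION = 0.95       # assumed share of the prompt already in KV-cache

cold_prefill_s = PROMPT_TOKENS / PREFILL_TOK_PER_S
warm_prefill_s = PROMPT_TOKENS * (1 - CACHED_FRACTION) / PREFILL_TOK_PER_S

print(f"cold prefill: {cold_prefill_s:.2f}s")  # 4.00s
print(f"cache hit:    {warm_prefill_s:.2f}s")  # 0.20s
```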

KV-cache-centric inference flips the design priority. Instead of treating cache as a byproduct, it becomes the organizing principle of the serving platform. This means tiered memory management - offloading KV blocks from GPU to CPU to shared storage so capacity scales beyond any single node. It means cross-replica visibility - so cached state computed on one instance is reusable by any other. And it means cache-aware scheduling - routing requests to where their prefix already lives.
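As an illustration of that scheduling layer, below is a minimal, self-contained sketch of prefix-aware routing. The block size, the chained block-hash scheme, and the pick_replica helper are all hypothetical simplifications, not the actual llm-d or vLLM implementation: each replica advertises the hashes of the KV blocks it holds, and the router picks the replica whose cache covers the longest prefix of the request.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed, for illustration)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes: each block's hash commits to its entire prefix."""
    hashes, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        running.update(str(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.hexdigest())
    return hashes

def pick_replica(request_blocks: list[str],
                 replica_caches: dict[str, set[str]]) -> str:
    """Route to the replica holding the longest cached prefix of the request."""
    def prefix_hits(cache: set[str]) -> int:
        hits = 0
        for h in request_blocks:
            if h not in cache:
                break
            hits += 1
        return hits
    return max(replica_caches, key=lambda name: prefix_hits(replica_caches[name]))

# Tiny demo: replica-a holds three of the request's four blocks, replica-b one.
req = block_hashes(list(range(64)))
caches = {"replica-a": set(req[:3]), "replica-b": set(req[:1])}
assert pick_replica(req, caches) == "replica-a"
```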

We cover how llm-d and vLLM implement each layer, how they compose into a coherent system, and what it looks like in practice - with benchmarks, deployment patterns, and lessons from building a KV-cache-centric platform in the open.
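For a taste of how the single-node piece surfaces in vLLM, the sketch below enables prefix caching so KV blocks are reused across requests that share a prompt prefix. The model name and prompts are placeholders; enable_prefix_caching is a real vLLM engine argument, but treat the rest as an assumption-laden example rather than the talk's exact setup.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any vLLM-supported model works here.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

params = SamplingParams(max_tokens=32)
shared_prefix = "You are a helpful assistant. " * 100  # long shared prompt

# The first request pays the full prefill for the shared prefix...
out_a = llm.generate(shared_prefix + "Summarize document A.", params)
# ...the second should hit the cached prefix and skip most of its prefill.
out_b = llm.generate(shared_prefix + "Summarize document B.", params)
```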
Speakers

Martin Hickey

Senior Technical Staff Member, IBM Research
Martin Hickey is a Senior Technical Staff Member (STSM) at IBM Research, focused on open source, cloud native computing, and AI. He has made notable contributions to open source projects such as vLLM, LMCache, Kubernetes, Helm, OpenTelemetry, and OpenStack. He is a core maintainer for LMCache and an emeritus core...

Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Wednesday April 8, 2026 11:05 - 11:15 CEST
Central Room
