Name: Lightning Talk: Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving - Maroon Ayoub, IBM Research & Hyunkyun Moon, Moreh
Start: 2026-04-08T11:20:00+0200
End: 2026-04-08T11:30:00+0200

7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

Lightning Talk: Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving - Maroon Ayoub, IBM Research & Hyunkyun Moon, Moreh

Wednesday April 8, 2026 11:20 - 11:30 CEST

Central Room

Agentic AI workloads (tree-of-thought exploration, ReAct loops, hierarchical swarms) expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. As a result, we end up evicting high-value state to preserve throwaway tokens.

In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.

We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch and applies a differentiated policy to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax", evolving the PyTorch serving stack from request-centric to workload-aware.
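
The idea of type-aware eviction can be illustrated with a minimal Python sketch. This is not the llm-d or vLLM API: the names `BlockType`, `Block`, `TypedKVCache`, and the `EVICTION_PRIORITY` table are hypothetical and exist only for illustration. The sketch assumes that SystemPrompt blocks are never evicted, that ReasoningBranch blocks are evicted before ToolDefinition blocks, and that ties within a type fall back to LRU order. Replication of shared tool blocks across replicas is left out.

```python
from collections import OrderedDict
from dataclasses import dataclass
from enum import Enum


class BlockType(Enum):
    SYSTEM_PROMPT = "SystemPrompt"
    TOOL_DEFINITION = "ToolDefinition"
    REASONING_BRANCH = "ReasoningBranch"


# Hypothetical policy table: lower values are evicted first.
# SYSTEM_PROMPT is absent because it is treated as pinned.
EVICTION_PRIORITY = {
    BlockType.REASONING_BRANCH: 0,
    BlockType.TOOL_DEFINITION: 1,
}


@dataclass
class Block:
    block_id: str
    block_type: BlockType


class TypedKVCache:
    """Toy cache that tracks block metadata only, not real KV tensors."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Iteration order is least recently used first.
        self.blocks = OrderedDict()

    def touch(self, block_id):
        # Mark a block as most recently used.
        self.blocks.move_to_end(block_id)

    def put(self, block):
        # Make room first, then insert the new block as most recently used.
        while len(self.blocks) >= self.capacity:
            self.evict_one()
        self.blocks[block.block_id] = block

    def complete_branch(self, block_id):
        # Eagerly drop a reasoning branch once its thought has finished.
        block = self.blocks.get(block_id)
        if block is not None and block.block_type is BlockType.REASONING_BRANCH:
            del self.blocks[block_id]

    def evict_one(self):
        candidates = [
            b for b in self.blocks.values()
            if b.block_type is not BlockType.SYSTEM_PROMPT
        ]
        if not candidates:
            raise RuntimeError("cache is full of pinned blocks")
        # min() keeps the first minimal item, and candidates are in LRU
        # order, so ties within a type resolve to the least recently used.
        victim = min(candidates, key=lambda b: EVICTION_PRIORITY[b.block_type])
        del self.blocks[victim.block_id]
        return victim.block_id


cache = TypedKVCache(capacity=3)
cache.put(Block("sys", BlockType.SYSTEM_PROMPT))
cache.put(Block("tools", BlockType.TOOL_DEFINITION))
cache.put(Block("branch-1", BlockType.REASONING_BRANCH))
# The cache is full: the policy evicts branch-1 and keeps the system prompt.
cache.put(Block("branch-2", BlockType.REASONING_BRANCH))
print(sorted(cache.blocks))  # ['branch-2', 'sys', 'tools']
```

With blind LRU, the last `put` would have evicted `sys`, the oldest entry and the one every later request needs. The typed policy evicts the ephemeral reasoning branch instead.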

Speakers

Maroon Ayoub

Research Scientist & Architect, IBM Research

Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.

Hyunkyun Moon

MLOps Engineer, Moreh

Hyunkyun Moon is an ML Platform Engineer at Moreh, focusing on building high-performance LLM inference platforms with llm-d. He is an active contributor to open-source projects, including llm-d and vLLM. With a strong background in large-scale Kubernetes-native infrastructure, he...

Inference & Production

Audience Level: Intermediate

PyTorch Conference Europe 2026
