Name: The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google
Start: 2026-04-07T14:15:00+0200
End: 2026-04-07T14:40:00+0200

7-8 April, 2026
Paris, France

The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google

Tuesday April 7, 2026 14:15 - 14:40 CEST

Central Room

Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.

We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.

In this talk, we present a sidecar implementation for PyTorch-based servers (such as vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results and discuss how varying chunk sizes affects priority handling and tail latency. Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean, offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
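To make the time-slice idea concrete, here is a minimal toy sketch in Python. It is not the speakers' sidecar and does not call any vLLM API: the names `Request`, `decode_chunk`, `run`, and `workload` are hypothetical, the engine is simulated by a counter, and time is measured in decoded tokens for a single sequence (no batching). It only illustrates how a fixed token chunk creates a preemption point at which a higher-priority request can take over while the paused request keeps its progress.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Request:
    name: str
    priority: int        # lower value = more urgent (assumed convention)
    tokens_needed: int   # total tokens this request will generate
    generated: int = 0   # progress so far; stands in for the retained KV cache

    @property
    def done(self) -> bool:
        return self.generated >= self.tokens_needed


def decode_chunk(req: Request, chunk_size: int) -> int:
    """Hypothetical stand-in for asking the engine for up to chunk_size tokens."""
    n = min(chunk_size, req.tokens_needed - req.generated)
    req.generated += n
    return n


def run(arrivals, chunk_size):
    """Simulate chunked decoding.

    arrivals: list of (arrival_time, Request), time measured in decoded tokens.
    Returns {request name: completion time}.
    """
    pending = sorted(arrivals, key=lambda a: a[0])
    ready = []       # min-heap of (priority, sequence number, Request)
    seq = count()    # tie-breaker so equal priorities are served round-robin
    clock = 0
    finished = {}
    i = 0
    while i < len(pending) or ready:
        # Chunk boundary: admit every request that has arrived by now.
        while i < len(pending) and pending[i][0] <= clock:
            req = pending[i][1]
            heapq.heappush(ready, (req.priority, next(seq), req))
            i += 1
        if not ready:
            clock = pending[i][0]  # idle until the next arrival
            continue
        # Run the most urgent request for one time slice.
        _, _, req = heapq.heappop(ready)
        clock += decode_chunk(req, chunk_size)
        if req.done:
            finished[req.name] = clock
        else:
            # Preemption point: requeue with progress intact.
            heapq.heappush(ready, (req.priority, next(seq), req))
    return finished


def workload():
    """A long low-priority request at t=0 and a short urgent one at t=10."""
    return [
        (0, Request("long", priority=1, tokens_needed=1000)),
        (10, Request("urgent", priority=0, tokens_needed=20)),
    ]


if __name__ == "__main__":
    print(run(workload(), chunk_size=1000))  # one slice covers the long request
    print(run(workload(), chunk_size=16))    # preemption every 16 tokens
```

In this toy workload, a chunk size of 1000 makes the urgent request wait behind the long one and finish at token-time 1020, while a chunk size of 16 lets it finish at token-time 36; the long request completes at 1000 and 1020 respectively. A real system would also pay a per-chunk scheduling overhead, which this sketch ignores.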

Speakers

Maroon Ayoub

Research Scientist & Architect, IBM Research

Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.

Kellen Swain

Senior Software Engineer, Google

Kellen is a Senior Engineer at Google and a maintainer of both the llm-d and Inference Gateway projects.

Token Slice pptx

Inference & Production

Audience Level Intermediate

PyTorch Conference Europe 2026
