7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Wednesday April 8, 2026 15:25 - 15:50 CEST


As inference demand explodes, new techniques have emerged to optimize these deployments. One such technique is disaggregated inference, which splits inference into differently optimized workloads (e.g. prefill and decode) running on separate workers. The theory is straightforward: better GPU utilization, better inference performance, and tighter control over SLAs. The deployment in production is not.
Scaling happens at multiple connected levels. Adding prefill workers for a traffic spike? Those workers belong to a prefill leader and must scale as a unit. But your prefill-to-decode ratio matters too: scale prefill without matching decode capacity and you've simply moved the bottleneck. Placement also plays a role: place prefill and decode far apart in your network topology and KV-cache transfers will kill your latency. Standard autoscaling treats these as independent components. They're not.
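The coupling described above can be made concrete with a small sketch. The helper below is hypothetical (the function names and the 1:2 prefill-to-decode ratio are purely illustrative, not figures from the talk); it shows why scaling prefill alone is not enough, since every change to the prefill count implies a matching change on the decode side.

```python
# Illustrative sketch only: names and the default ratio are invented for this
# example, not taken from any real autoscaler or from the talk itself.
import math


def matched_decode_replicas(prefill_replicas: int,
                            decode_per_prefill: float = 2.0) -> int:
    """Decode worker count needed to hold a fixed prefill:decode ratio."""
    return math.ceil(prefill_replicas * decode_per_prefill)


def scale_prefill(current_prefill: int, added: int,
                  decode_per_prefill: float = 2.0) -> tuple[int, int]:
    """Scaling prefill implies scaling decode too; return both new counts."""
    new_prefill = current_prefill + added
    return new_prefill, matched_decode_replicas(new_prefill, decode_per_prefill)


# Adding two prefill workers to a group of three also requires
# growing decode capacity to keep the ratio intact.
print(scale_prefill(3, 2))  # → (5, 10)
```

A real controller would additionally respect gang-scaling constraints (whole leader+worker groups, not individual pods) and topology-aware placement, which is exactly where the standard Horizontal Pod Autoscaler model falls short.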
In this talk, we'll share what we've learned running disaggregated vLLM and SGLang deployments on Kubernetes: what broke, what worked, and how we're improving performance. We'll evaluate approaches ranging from standard Deployments to specialized APIs like LWS and Grove, and discuss how these integrate with frameworks like llm-d and Dynamo.
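To illustrate the gang-scaling idea behind APIs like LWS: a LeaderWorkerSet groups a leader pod and its worker pods into units that scale atomically, so "adding a prefill worker" means adding a whole group. A minimal sketch, assuming the upstream LeaderWorkerSet schema; the name, images, and group size below are placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill-group            # placeholder name
spec:
  replicas: 2                    # number of prefill groups; scaling adds whole groups
  leaderWorkerTemplate:
    size: 4                      # 1 leader + 3 workers, scheduled and scaled as a unit
    leaderTemplate:
      spec:
        containers:
        - name: prefill-leader
          image: example.com/vllm-prefill:latest   # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: prefill-worker
          image: example.com/vllm-prefill:latest   # placeholder image
```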
Speakers

Ekin Karabulut

AI/ML Developer Advocate, NVIDIA
Ekin is a Developer Advocate at NVIDIA, following the acquisition of Run:ai. Prior to that, she specialized in the privacy implications of federated learning systems with DNNs in distributed environments as a data scientist. Currently, she is exploring the efficient usage of large...

Ron Kahn

Senior Software Engineer, NVIDIA
Ron Kahn is a Senior Software Engineer in the NVIDIA Run:ai platform team. Ron works on the design and implementation of workload management systems that abstract Kubernetes complexity for AI practitioners. When not simplifying AI training jobs, Ron can be found cooking something...
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

