7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Wednesday April 8, 2026 15:25 - 15:50 CEST


As inference demand explodes, new techniques have emerged to optimize these deployments. One such technique is disaggregated inference, which splits inference into differently optimized workloads (e.g. prefill and decode) running on separate workers. The theory is straightforward: better GPU utilization, better inference performance, and tighter control over SLAs. The deployment in production is not.
Scaling happens at multiple connected levels. Adding prefill workers for a traffic spike? Those workers belong to a prefill leader and must scale as a unit. But your prefill-to-decode ratio matters too: scale prefill without matching decode capacity and you've simply moved the bottleneck. Placement also plays a role: place prefill and decode far apart in your network topology and KV-cache transfers will kill your latency. Standard autoscaling treats these as independent components. They're not.
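The coupling described above can be made concrete with a small sketch. The helper below is hypothetical (the function names and the 1:2 prefill-to-decode ratio are purely illustrative, not figures from the talk); it shows why scaling prefill alone is not enough, since every change to the prefill count implies a matching change on the decode side.

```python
# Illustrative sketch only: names and the default ratio are invented for this
# example, not taken from any real autoscaler or from the talk itself.
import math


def matched_decode_replicas(prefill_replicas: int,
                            decode_per_prefill: float = 2.0) -> int:
    """Decode worker count needed to hold a fixed prefill:decode ratio."""
    return math.ceil(prefill_replicas * decode_per_prefill)


def scale_prefill(current_prefill: int, added: int,
                  decode_per_prefill: float = 2.0) -> tuple[int, int]:
    """Scaling prefill implies scaling decode too; return both new counts."""
    new_prefill = current_prefill + added
    return new_prefill, matched_decode_replicas(new_prefill, decode_per_prefill)


# Adding two prefill workers to a group of three also requires
# growing decode capacity to keep the ratio intact.
print(scale_prefill(3, 2))  # → (5, 10)
```

A real controller would additionally respect gang-scaling constraints (whole leader+worker groups, not individual pods) and topology-aware placement, which is exactly where the standard Horizontal Pod Autoscaler model falls short.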
In this talk, we'll share what we've learned running disaggregated vLLM and SGLang deployments on Kubernetes: what broke, what worked, and how we're improving performance. We'll evaluate approaches ranging from standard Deployments to specialized APIs like LWS and Grove, and discuss how these integrate with frameworks like llm-d and Dynamo.
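To illustrate the gang-scaling idea behind APIs like LWS: a LeaderWorkerSet groups a leader pod and its worker pods into units that scale atomically, so "adding a prefill worker" means adding a whole group. A minimal sketch, assuming the upstream LeaderWorkerSet schema; the name, images, and group size below are placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill-group            # placeholder name
spec:
  replicas: 2                    # number of prefill groups; scaling adds whole groups
  leaderWorkerTemplate:
    size: 4                      # 1 leader + 3 workers, scheduled and scaled as a unit
    leaderTemplate:
      spec:
        containers:
        - name: prefill-leader
          image: example.com/vllm-prefill:latest   # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: prefill-worker
          image: example.com/vllm-prefill:latest   # placeholder image
```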
Speakers

Ekin Karabulut

AI/ML Developer Advocate, NVIDIA
Ekin is a Developer Advocate at NVIDIA, following the acquisition of Run:ai. Prior to that, she specialized in the privacy implications of federated learning systems with DNNs in distributed environments as a data scientist. Currently, she is exploring the efficient usage of large...

Ron Kahn

Senior Software Engineer, NVIDIA
Ron Kahn is a Senior Software Engineer in the NVIDIA Run:ai platform team. Ron works on the design and implementation of workload management systems that abstract Kubernetes complexity for AI practitioners. When not simplifying AI training jobs, Ron can be found cooking something...
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

