7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Tuesday April 7, 2026 12:00 - 12:25 CEST


Modern GPUs are fast enough that CPU kernel launch overhead has become a real bottleneck. CUDA Graphs can eliminate this overhead, but in practice they are hard to use and easy to get wrong.

When CUDA Graph capture fails, PyTorch users typically face two choices: fix the code that breaks capture—often with limited guidance—or capture only parts of the workload. Partial capture comes with sharp footguns, most notably large increases in device memory usage due to CUDA Graphs’ private memory pools.
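For context, PyTorch's manual capture API looks roughly like this (a minimal sketch with a placeholder model and shapes; the key point is that tensors allocated during capture come from the graph's private memory pool, which is the source of the memory blow-up mentioned above):

```python
import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()
    static_input = torch.randn(8, 1024, device="cuda")

    # Warm up on a side stream so lazy initialization (cuBLAS handles,
    # autograd state, etc.) doesn't happen during capture and break it.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: allocations made inside this context are served from the
    # graph's private memory pool, held for the lifetime of the graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # Replay: copy fresh data into the static input tensor, then relaunch
    # the entire captured work with a single CPU-side call.
    static_input.copy_(torch.randn(8, 1024, device="cuda"))
    g.replay()
```

Capturing several regions this way, each with its own pool, is what makes partial capture memory-hungry.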

This talk walks through the most common CUDA Graph capture failures seen in real PyTorch workloads and shows how to diagnose and fix them. It then presents an alternative to CUDA Graph Trees: Parameterized CUDA Graph launch, which automatically applies CUDA Graphs to only the compatible regions of a workload. All you need to do is make your workload compatible with torch.compile(). This enables CUDA Graph acceleration with minimal user effort and without increased memory usage.

Using this approach, llama3.1-70B in torchtitan runs with only a 2 GB memory increase over a non-graph baseline, compared to ~10 GB using traditional CUDA Graph techniques.
Speakers

Daniel Galvez

Manager, NVIDIA
Daniel Galvez is an AI developer technology engineer working on speech recognition and natural language processing inference and training. He has contributed to software like PyTorch, NeMo, Megatron, ESPNet, vLLM, and TRT-LLM. He is currently working on reducing CPU overheads in CUDA...
Junior Stage

