7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Tuesday, April 7
 

12:00 CEST

Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA
Tuesday April 7, 2026 12:00 - 12:25 CEST
Modern GPUs are fast enough that CPU kernel launch overhead has become a real bottleneck. CUDA Graphs can eliminate this overhead, but in practice they are hard to use and easy to get wrong.

When CUDA Graph capture fails, PyTorch users typically face two choices: fix the code that breaks capture—often with limited guidance—or capture only parts of the workload. Partial capture comes with sharp footguns, most notably large increases in device memory usage due to CUDA Graphs’ private memory pools.

This talk walks through the most common CUDA Graph capture failures seen in real PyTorch workloads and shows how to diagnose and fix them. It then presents an alternative to CUDA Graph Trees: Parameterized CUDA Graph launch, which automatically applies CUDA Graphs to only the compatible regions of a workload. All you need to do is make your workload compatible with torch.compile(). This enables CUDA Graph acceleration with minimal user effort and without increased memory usage.

Using this approach, llama3.1-70B in torchtitan runs with only a 2 GB memory increase over a non-graph baseline, compared to ~10 GB using traditional CUDA Graph techniques.
Speakers

Daniel Galvez

Manager, NVIDIA
Daniel Galvez is an AI developer technology engineer working on inference and training for speech recognition and natural language processing. He has contributed to software including PyTorch, NeMo, Megatron, ESPNet, vLLM, and TRT-LLM. He is currently working on reducing CPU overheads in CUDA...
Junior Stage

12:15 CEST

Lightning Talk: FlexAttention + FlashAttention-4: Fast and Flexible - Driss Guessous, Meta
Tuesday April 7, 2026 12:15 - 12:25 CEST
FlexAttention democratized attention research by letting researchers prototype custom attention variants in PyTorch without hand-written CUDA. Over 1,000 repos have adopted it, and dozens of papers cite it. But flexibility came at a cost: FlexAttention achieved only ~60% of FlashAttention-3's throughput on Hopper, and the gap widened dramatically on Blackwell GPUs.

We bridged this gap by integrating FlexAttention with FlashAttention-4, the new CuTeDSL-based implementation optimized for Blackwell's async pipelines and tensor memory. PyTorch's Inductor now generates CuTeDSL score/mask modifications directly, enabling JIT instantiation of FA4 for arbitrary attention variants.

Results: 1.2–3.2× speedups over the Triton backend on compute-bound workloads. On B200, patterns like ALiBi, document masking, and sliding window see up to 2.7× forward and 3× backward speedups. On Hopper, gains range from 1.3–2× across all sequence lengths.

This talk covers the technical integration: how Inductor lowers score mods to CuTeDSL, how FA4's warp-specialized kernel accommodates block-sparse iteration, and practical considerations for users adopting the Flash backend today.
Speakers

Driss Guessous

Machine Learning Engineer, Meta
I am currently a machine learning engineer working on core development of PyTorch. I received my Master's in Computer Science from the University of Illinois at Urbana-Champaign and a dual degree in Physics and Applied Mathematics from The Ohio State University. I also won...
Master Stage

14:45 CEST

Model-Changing Transforms With Torch.compile - Thomas Viehmann, Lightning AI
Tuesday April 7, 2026 14:45 - 15:10 CEST
torch.compile is the go-to mechanism for increasing the performance of PyTorch models of all shapes and forms.

While it is widely understood how to change the computation by manipulating the FX trace representation, torch.compile becomes a much more general tool when model and input expectations (the guards) are transformed as well:
this enables model-changing transformations such as quantization and distributed execution without needing to adapt the model to them.

We take a deep dive into the torch.compile internals to see what's going on under the hood and how we can hook into the gears to enable distributed (starting from a single-GPU model) and quantization.
In this quest, marvel at the interplay between PyTorch's Python code, the Python interpreter, and PyTorch's C++ code that enables the Dynamo frontend of torch.compile, and then use a big hammer to apply it in unexpected ways. Building on our experience with Lightning Thunder, an experimental compiler for PyTorch models, we propose a transform mechanism that takes care of compute, model, and weights.
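As a minimal sketch of the hook point involved (this backend is hypothetical and only inspects the graph; it is not the Thunder transform mechanism): a custom torch.compile backend receives the FX GraphModule that Dynamo traced, which is where a model-changing transform would rewrite the graph.

```python
import torch

captured_ops = []

# A custom torch.compile backend receives the traced FX GraphModule
# plus example inputs. Returning gm.forward runs the graph unchanged;
# a real transform would rewrite gm.graph here (e.g. swapping weights
# for quantized or sharded versions).
def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    captured_ops.extend(node.op for node in gm.graph.nodes)
    return gm.forward

def f(x):
    return torch.relu(x) * 2 + 1

compiled = torch.compile(f, backend=inspecting_backend)
x = torch.randn(4)
assert torch.allclose(compiled(x), f(x))
assert "call_function" in captured_ops  # relu/mul/add were traced
```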
Speakers

Thomas Viehmann

Thunder, Lightning AI
Thomas Viehmann works on PyTorch and optimization at Lightning AI. He has been a PyTorch contributor since 2017, founded MathInf GmbH in 2018, and co-authored “Deep Learning with PyTorch” in 2020.
Master Stage

15:40 CEST

Lightning Talk: Cross-Region Model Serving: PyTorch Inference, Observability & LLMOps - Suraj Muraleedharan, Amazon Web Services
Tuesday April 7, 2026 15:40 - 15:50 CEST
As PyTorch models move to production, organizations face a critical challenge: deploying, monitoring, and operating inference at scale across multiple regions. Single-region serving is well understood, but multi-region LLMOps (model distribution, observability, failover, and cost management) remains ad hoc and challenging for many organizations.

This session presents production-tested architectures for multi-region PyTorch inference and LLMOps workflows. We cover:

Serving: Multi-region TorchServe/KServe on Kubernetes with latency-based routing, blue-green deployments, model versioning, and automated failover with circuit breakers.

Observability: OpenTelemetry distributed tracing, Prometheus/Grafana dashboards for latency, throughput, GPU utilization, and LLM-specific metrics like time-to-first-token and KV-cache hit rate.

LLMOps: CI/CD pipelines for cross-region model deployment with automated rollback, drift detection, and SLO-based alerting.

Attendees leave with serving architectures, dashboards, and deployment pipelines using open-source tooling.
Speakers

Suraj Muraleedharan

Principal Platform Engineer, Amazon Web Services
Principal Engineer driving technical strategy and building mission-critical foundational platforms for AI, HPC, and distributed systems, bridging the gap between infrastructure, AI research, and product organizations.
Founders Cafe
  Inference & Production

16:40 CEST

Optimizing PyTorch on CPU-GPU Coherent Platforms - Matthias Jouanneaux, NVIDIA
Tuesday April 7, 2026 16:40 - 17:05 CEST
In recent years, both NVIDIA and AMD have introduced hardware-coherent platforms: GH200, GB200, and MI300A. These coherent platforms provide many new features, as well as new challenges, for PyTorch applications attempting to make the most of the hardware.
This talk will focus on NVIDIA's GB200 and walk through techniques for using the features of the coherent architecture in PyTorch, such as the high CPU-GPU interconnect bandwidth and unified memory, as well as the advantages and caveats of sharing system memory between the CPU and GPU.
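As a generic, hedged sketch of the kind of CPU-GPU traffic at stake (plain CUDA-backed PyTorch APIs that run on any GPU, not GB200-specific features): staging through pinned host memory and issuing asynchronous copies is the baseline that coherent interconnects accelerate.

```python
import torch

# On coherent platforms such as GB200, CPU<->GPU bandwidth is high
# enough that host-device transfers overlapped with compute pay off.
host = torch.randn(1024, 1024)

if torch.cuda.is_available():
    pinned = host.pin_memory()        # page-locked host buffer
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # non_blocking=True lets the H2D copy overlap with other work
        dev = pinned.to("cuda", non_blocking=True)
    stream.synchronize()
    assert torch.equal(dev.cpu(), host)
```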
Speakers

Matthias Jouanneaux

Sr Software Engineer - PyTorch, NVIDIA
After his master's degree, Matthias Jouanneaux worked at Konica Minolta's European research lab on medical image analysis using deep learning for 2 years.
He then joined NVIDIA, focusing on optimizing application performance for NVIDIA hardware as a Developer Technology enginee...
Founders Cafe
  Frameworks & Compilers
 
