7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Wednesday, April 8
 

10:50 CEST

Lightning Talk: Step-Aligned Telemetry for Distributed PyTorch Training (Time & Memory Attribution Across Ranks) - Abhinav Srivastav, TraceOpt
Wednesday April 8, 2026 10:50 - 11:00 CEST
Distributed PyTorch training often looks healthy in system dashboards: GPU utilization is high and memory is stable, yet throughput degrades, steps jitter, or GPUs go idle intermittently. The core issue is misalignment: most telemetry is sampled by time, while training progresses by "steps", and distributed behavior is dominated by the slowest rank rather than the average.

This talk breaks down common failure modes in DDP training that standard metrics miss (rank stragglers, dataloader stalls, step-time variance, and memory spikes and creep). We will show how step-aligned, rank-aware aggregation changes debugging: per-step worst-rank vs. median-rank views, gating to steps completed across all ranks, and tying time and memory back to training semantics without relying on heavyweight profilers.
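The aggregation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not TraceOpt's actual tooling: the names (`step_times`, `step_aligned_view`) and the per-rank numbers are invented to show how gating to commonly-completed steps and taking a worst-rank view exposes a straggler that the median hides.

```python
from statistics import median

# Hypothetical per-rank telemetry: step index -> seconds, as each rank
# might report it. Rank 2 straggles on step 1; rank 3 hasn't finished step 2.
step_times = {
    0: {0: 0.41, 1: 0.40, 2: 0.42},
    1: {0: 0.40, 1: 0.39, 2: 0.41},
    2: {0: 0.40, 1: 0.95, 2: 0.41},
    3: {0: 0.42, 1: 0.41},
}

def step_aligned_view(step_times):
    """Aggregate by step, gated to steps completed on *all* ranks."""
    completed = set.intersection(*(set(t) for t in step_times.values()))
    view = {}
    for step in sorted(completed):
        per_rank = [step_times[r][step] for r in step_times]
        view[step] = {"worst": max(per_rank), "median": median(per_rank)}
    return view

view = step_aligned_view(step_times)
# Step 2 is excluded (rank 3 hasn't reported it); on step 1 the worst-rank
# time (0.95 s) exposes the straggler that the median (~0.41 s) hides.
print(view[1])
```

A time-sampled dashboard averaging across ranks would smooth the 0.95 s spike away; the step-aligned worst-rank view surfaces it directly.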
Speakers
avatar for Abhinav Srivastav

Abhinav Srivastav

ML Scientist, TraceOpt
ML researcher with a PhD in Computer Science. Industry experience at IBM Research, Huawei Research, and Zalando. Currently building TraceML: an open source tool that shows you the step-level breakdown of your PyTorch training run while it's still running. I am particularly interested in... Read More →
Wednesday April 8, 2026 10:50 - 11:00 CEST
Central Room
  Training Systems

11:05 CEST

FP8 Training From Hopper To Blackwell - Luca Wehrstedt, Meta
Wednesday April 8, 2026 11:05 - 11:30 CEST
The Hopper generation of NVIDIA GPUs first enabled the use of low-precision float8 data types for training via TensorCore acceleration. However, the recipe to best leverage it was far from settled. Practitioners had to find their way through many entangled decisions around accuracy-vs-efficiency, precision-vs-range, overflows-vs-underflows, and more. The frontier was pushed further forward by the DeepSeek release, and then by the micro-scaling formats introduced by Blackwell. In this talk we will go through all these approaches, comparing their pros and cons, to guide researchers toward the options that work best for them.
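The precision-vs-range tradeoff the abstract mentions can be sketched with per-tensor scaling, the basic float8 recipe. This is a minimal plain-Python sketch, not any specific library's API: 448.0 is the largest finite value of the e4m3 format, but the function names and the clamp-only quantization model are illustrative (real e4m3 conversion also rounds away mantissa bits).

```python
# Largest finite value representable in float8 e4m3.
E4M3_MAX = 448.0

def fp8_scale(tensor):
    """Choose a scale that maps the tensor's absolute max onto the e4m3 range."""
    amax = max(abs(x) for x in tensor)
    return E4M3_MAX / amax if amax > 0 else 1.0

def quant_dequant(tensor, scale):
    """Scale up, clamp to the representable range, scale back down.
    Clamping models the overflow side of the precision-vs-range tradeoff;
    a too-small scale instead underflows small values toward zero."""
    out = []
    for x in tensor:
        y = max(-E4M3_MAX, min(E4M3_MAX, x * scale))
        out.append(y / scale)
    return out

t = [0.003, -1.2, 7.5]
s = fp8_scale(t)  # maps |7.5| onto 448, leaving headroom decisions to the recipe
print(round(s, 3))
```

Choosing how much headroom to leave below the clamp, and whether scales are per-tensor, per-row, or per-block (as in Blackwell's micro-scaling formats), is exactly the recipe space the talk surveys.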
Speakers
avatar for Luca Wehrstedt

Luca Wehrstedt

Software Engineer, Meta
Research Engineer in Meta's Fundamental AI Research team (FAIR). At the intersection of research and infrastructure, Luca specialized in training efficiency and distributed communication. Regular contributor to PyTorch.
Wednesday April 8, 2026 11:05 - 11:30 CEST
Master Stage
  Training Systems

11:35 CEST

Optimizing Large MoE Inference on NVIDIA Blackwell: NVFP4, ADP, and DualPipe Strategies - Julien Demouth, NVIDIA
Wednesday April 8, 2026 11:35 - 12:00 CEST
Deploying massive Mixture-of-Experts (MoE) architectures like DeepSeek-V3/R1 requires a co-designed approach leveraging NVIDIA Blackwell’s fifth-generation Tensor Cores. This session details the transition to NVFP4 precision for MoE weights to significantly reduce memory load, coupled with FP4/FP8 KV caching to minimize attention layer footprint and enable higher concurrency.
We will analyze the architectural shift to Expert Parallelism (EP) for expert layers to maximize FLOPS, and Attention Data Parallelism (ADP) for attention heads—avoiding redundant KV replication and converting Multi-Head Latent Attention (MLA) into Multi-Query Attention (MQA) via weight absorption. The talk will demonstrate advanced execution strategies, including DualPipe algorithms to overlap dispatch/combine communication with computation, and the integration of DeepGEMM and FlashInfer kernels. Finally, we will cover runtime optimizations using Programmatic Dependent Launch (PDL) and CUDA Graphs to minimize host latency, alongside Multi-Token Prediction (MTP) for accelerated speculative decoding.
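The concurrency claim for low-precision KV caching comes down to simple footprint arithmetic. The sketch below uses invented model dimensions (not DeepSeek-V3's actual shapes) purely to show how halving bytes per element doubles how many sequences fit in the same memory budget.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """Per-sequence KV-cache footprint: 2x for keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dimensions only -- not a real model's configuration.
fp16 = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=8192,
                      bytes_per_elem=2)
fp8 = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=8192,
                     bytes_per_elem=1)
print(fp16 // fp8)  # same memory budget holds twice as many sequences
```

FP4 caching halves the footprint again, which is why the session pairs NVFP4 weights with low-precision KV caches to raise serving concurrency.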
Speakers
JD

Julien Demouth

Senior Distinguished Engineer - Eng. Lead for AI Labs & Models, NVIDIA
Wednesday April 8, 2026 11:35 - 12:00 CEST
Central Room

14:55 CEST

Birds of A Feather: NCCL in the Wild: Scaling Communications To Thousands of GPUs - Jeff Hammond, Gabrielle Talavera, Ke Wen & Asma Farjallah, NVIDIA
Wednesday April 8, 2026 14:55 - 15:20 CEST
We will share the latest updates to NCCL and how they can be used in PyTorch. We invite the community to share their feedback on challenges using NCCL at scale and ways to improve integration of NCCL with PyTorch applications.

Some of the important topics for community discussion include:
- Symmetric memory support and GPU-initiated networking.
- Copy-engine collectives and maximizing overlap of communication and computation for better end-to-end performance.
- Profiling, debugging and tuning, as well as resilience (handling failed nodes without a restart).
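The overlap topic in the list above reduces to a simple timing model: without overlap a step pays compute plus communication; with perfect overlap it pays only the larger of the two. The numbers and function name below are illustrative, not measurements.

```python
def step_time(compute_ms, comm_ms, overlap):
    """Toy model of one training step's wall time."""
    return max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms

print(step_time(120.0, 80.0, overlap=False))  # 200.0: serialized
print(step_time(120.0, 80.0, overlap=True))   # 120.0: comm hidden under compute
```

Real overlap is imperfect (kernels contend for SMs and memory bandwidth), which is one motivation for the copy-engine collectives mentioned above: moving data with copy engines leaves the SMs free for compute.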
Speakers
avatar for Asma Farjallah

Asma Farjallah

AI DevTech, NVIDIA
Asma Farjallah is an AI Developer Technology Engineer at NVIDIA. Prior to her role as DevTech, she was part of the Solution Architect team at NVIDIA for 5 years and was part of the global energy team. Before joining NVIDIA, Asma worked for Intel for 4 years as an Application Engineer... Read More →
avatar for Gabrielle Talavera

Gabrielle Talavera

Product Manager, NVIDIA
Gabrielle Talavera is the Product Manager for NCCL at NVIDIA, focused on shaping the product roadmap and improving the experience of teams building on GPU‑accelerated software. She joined NVIDIA in 2021 as a Solutions Architect, helping customers adopt NVIDIA software and debug... Read More →
avatar for Jeff Hammond

Jeff Hammond

Distinguished Engineer, NVIDIA Helsinki Oy
Jeff Hammond is a Distinguished Engineer in the NCCL team at NVIDIA focused on user education and research outreach. His background is in parallel application and algorithm development, open-source software, and supercomputing architecture. Jeff has made significant contributions... Read More →
avatar for Ke Wen

Ke Wen

Principal Software Architect, NVIDIA
Ke Wen works on distributed features, including Symmetric Memory, multi-GPU kernels, Expert Parallelism, inference, pipelining and graph analysis.
Wednesday April 8, 2026 14:55 - 15:20 CEST
Open Platform

15:55 CEST

DualPipe from Scratch: Implementing DeepSeek's 5D Parallelism in PyTorch - Dev Jadhav, ING Bank
Wednesday April 8, 2026 15:55 - 16:20 CEST
The DeepSeek-V3 paper describes 5D parallelism and DualPipe at a high level, but leaves critical implementation details undocumented. This session presents our open-source PyTorch reference implementation that fills those gaps - verified against the original architecture and designed for learning and extension.

We'll share what we discovered building it from scratch:
- Why K_pe is shared across heads in decoupled RoPE (not explicit in the paper)
- The critical timing of bias updates in auxiliary-loss-free load balancing
- How sigmoid routing separates selection scores from gate values
- The warmup formula that lets DualPipe achieve 3% bubble overhead
- Bugs we caught: causal mask position offsets, EMA initialization, capacity-dropping priority

What you'll learn:

- 5D Parallelism: how TP, PP, DP, EP, and SP interact at 2,048+ GPU scale
- DualPipe: building the bidirectional scheduler with a 55% throughput gain over GPipe
- Hierarchical All-to-All: two-level communication reducing MoE dispatch overhead by 4x
- Teachable abstractions: CapacityMetrics, ExpertSpecializationTracker, ScheduleStep enums

Prerequisites: torch.distributed basics.
Code: github.com/DevJadhav/deepseek-from-scratch
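The separation of selection scores from gate values can be sketched in plain Python. This is an illustrative reading of the sigmoid-routing idea, not code from the linked repository: a per-expert bias influences which experts are selected (the auxiliary-loss-free load-balancing knob), while the gate values that weight expert outputs renormalize the unbiased sigmoid affinities. Names and numbers are invented.

```python
import math

def route(logits, bias, top_k):
    """Select top-k experts by biased score; gate by unbiased score."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid affinities
    # Selection uses score + bias: the bias steers load without touching loss.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    selected = ranked[:top_k]
    # Gating renormalizes the *original* scores of the selected experts,
    # so the bias never leaks into the weighted combination of outputs.
    total = sum(scores[i] for i in selected)
    gates = {i: scores[i] / total for i in selected}
    return selected, gates

# A positive bias on expert 2 pulls it into the top-k over expert 3.
selected, gates = route(logits=[2.0, 0.1, -1.0, 0.5],
                        bias=[0.0, 0.0, 0.9, 0.0], top_k=2)
print(sorted(selected))
```

Because selection and gating read different quantities, updating the bias rebalances expert load without perturbing the gradient through the gates, which is why the timing of those bias updates matters.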
Speakers
avatar for Dev Jadhav

Dev Jadhav

Tech Lead ML Engineer, ING Bank
Dev Jadhav is a production AI/ML engineer with 10+ years building AI systems at scale. He currently leads ML engineering at a major bank, developing financial-grade AI and large-scale model operations. Dev is the creator of DeepSeek From Scratch, an open-source implementation of DeepSe... Read More →
Wednesday April 8, 2026 15:55 - 16:20 CEST
Founders Cafe
  Training Systems
 
