PyTorch Conference Europe 2026: Full Schedule

7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

All times in this schedule are shown in CEST (UTC/GMT +2).

11:00 CEST

Lightning Talk: Training Embedding Model Resiliently for Multimodal Model Inference Routing - Huamin Chen, Red Hat & Haichen Zhang, AMD

Tuesday April 7, 2026 11:00 - 11:10 CEST

Junior Stage

LLM systems increasingly rely on intelligent routing to balance cost, latency, and quality tradeoffs. The vLLM Semantic Router, a vLLM Ecosystem project, provides both semantic and performance-level routing intelligence for Mixture-of-Multimodal Models (MoM) architectures, but its effectiveness depends on fast and accurate classifiers.

This talk presents our end-to-end journey training production-grade embedding and classification models on AMD GPUs using native PyTorch, achieving high GPU utilization with distributed training optimizations.

We introduce a multilingual text embedding model with a 32K context window and 2D Matryoshka support, along with multimodal embedding models, all trained on AMD GPUs using PyTorch DDP. The talk covers practical training optimizations for AMD ROCm. All training code uses native PyTorch distributed primitives, with additional enhancements to improve training stability and pipeline efficiency.

Attendees will learn how to train efficient classifiers for LLM routing systems and integrate these models into production inference pipelines.

Speakers

Huamin Chen

Technical Advisor, Microsoft

Dr. Huamin Chen is a passionate developer. He co-founded the Semantic Router project under the vLLM community. His recent contributions to the CNCF ecosystem include Project Kepler, TAG Environmental Sustainability, and Cloud Native AI WG. He is also one of the founding members...

Haichen Zhang

Senior AI Software Engineer, AMD

Haichen is a Senior AI Engineer in the AMD AI Group, specializing in accelerating training and inference for large language models, recommender systems, computer vision (CV), and natural language processing (NLP) tailored to internet customers. Before joining AMD, Haichen worked at...

Training Systems

Audience Level Intermediate

11:45 CEST

Lightning Talk: TorchJD: Jacobian Descent in PyTorch - Pierre Quinton, EPFL & Valérian Rey, Simplex Lab

Tuesday April 7, 2026 11:45 - 11:55 CEST

Founders Cafe

Jacobian descent (JD) is an extension of gradient descent supporting the optimization of vector-valued functions. This algorithm can be used to train neural networks with multiple loss functions (e.g. multi-task learning). JD iteratively updates the parameters of the model using the Jacobian matrix of the vector of losses (the matrix stacking each individual loss's gradient).

To support and extend our research, we have developed the TorchJD library. With it, it's easy and efficient to compute the Jacobians with respect to the model parameters, and to aggregate them into an update direction that is beneficial to every objective. In contrast, if we had averaged the losses and used gradient descent, the update would have been beneficial to the average loss, but may have actually increased one of the individual losses.
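As a rough illustration of the contrast drawn above, the following sketch uses plain Python rather than TorchJD itself. The two gradient vectors are made-up numbers, and the aggregation rule is a simple pairwise projection (in the spirit of PCGrad) swapped in for illustration; it is not claimed to be the aggregator that TorchJD uses.

```python
# Toy Jacobian for two losses and two parameters (hypothetical values).
# Each row is the gradient of one loss with respect to the parameters.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_direction(jacobian):
    """Gradient of the averaged loss: the mean of the Jacobian rows."""
    n = len(jacobian)
    return [sum(col) / n for col in zip(*jacobian)]

def projected_direction(g1, g2):
    """Sum of two gradients after removing their pairwise conflict."""
    def remove_conflict(g, other):
        inner = dot(g, other)
        if inner >= 0:  # no conflict, keep the gradient unchanged
            return list(g)
        coef = inner / dot(other, other)
        return [a - coef * b for a, b in zip(g, other)]
    p1 = remove_conflict(g1, g2)
    p2 = remove_conflict(g2, g1)
    return [a + b for a, b in zip(p1, p2)]

JACOBIAN = [[3.0, 0.0], [-1.0, 1.0]]
g1, g2 = JACOBIAN
avg = mean_direction(JACOBIAN)
agg = projected_direction(g1, g2)

# A descent step moves along the negative direction, so a negative inner
# product with a loss's gradient means that loss increases to first order.
print("mean vs loss 2:", dot(avg, g2))
print("aggregated vs loss 1, loss 2:", dot(agg, g1), dot(agg, g2))
```

With these numbers the averaged direction has a negative inner product with the second gradient, while the projected direction has a positive inner product with both rows.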

In this session, we will give a quick introduction to the theory behind Jacobian descent, and then show how to use TorchJD on a variety of use-cases, beyond multi-task learning.

Library: https://github.com/TorchJD/torchjd
Paper: https://arxiv.org/abs/2406.16232

Speakers

Pierre Quinton

Teacher, EPFL

PhD in Information Theory and Master in Data Science, specializing in fundamental math and multi-objective optimization (MOO). I am the co-author of TorchJD, a PyTorch library for Jacobian Descent developed with Valérian, currently at ~300 GitHub stars. My work aims to translate complex...

Valérian Rey

Research Engineer, Simplex Lab

I graduated from EPFL with an MSc in Data Science in 2021. Since then, I have worked as a Data Scientist at Withings and on Jacobian descent, initially as a side project but now as a full-time occupation. I now spend most of my time developing and maintaining TorchJD, and I love...

TorchJD pdf

Training Systems

Audience Level Intermediate
Slides Attached Yes

12:00 CEST

Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google

Tuesday April 7, 2026 12:00 - 12:10 CEST

Master Stage

As PyTorch models scale to billions of parameters, the bottleneck has quietly shifted from compute to storage. Modern GPU clusters often sit idle, "starving" for data while waiting on legacy REST-based protocols. This talk introduces Rapid Storage: a fundamental architectural shift that brings Google’s Colossus stateful protocol (which powers many of Google’s products) to PyTorch via fsspec, a common Pythonic file interface used by many frameworks within the PyTorch ecosystem.

By bypassing REST APIs entirely via persistent gRPC streams to the storage layer, we eliminate protocol overhead. In this talk, we also dive into how Rapid achieves <1 ms random read/write latency, 20x faster data access, and a massive 6 TB/s of aggregate throughput. Crucially, it delivers up to 10x lower tail latency for random I/O, preventing the stragglers that often stall distributed training jobs.

Beyond raw speed, we will deconstruct the integration with gcsfs and the broader fsspec ecosystem. This ensures that high-performance I/O is available across the entire data stack, including Dask, Ray, HF Datasets, and vLLM. Join us to learn how to stop wasting GPU cycles and achieve linear scaling in the cloud.

Speakers

Ankita Luthra

Senior Software Engineer, Google

Ankita Luthra is a Software Developer at Google, focused on AI/ML infrastructure and scalable data pipelines. Her work with open-source tools like fsspec (gcsfs) and gcsfuse improves how efficiently frameworks such as PyTorch and JAX access data from Google Cloud Storage.

Trinadh Kotturu

Senior Product Manager, Google

Trinadh Kotturu is a Senior Product Manager specializing in AI/ML and analytics client strategy at Google. An alumnus of IIM Bangalore with 12 years of experience, he has a proven track record of shipping v1 products and scaling them into robust platform services. His expertise spans large-scale distributed storage systems, autonomous driving, and system resiliency...

Bringing Rapid Buckets to Pytorch pdf

Training Systems

Audience Level Any
Slides Attached Yes

15:00 CEST

Lightning Talk: Jigsaw: Domain and Tensor Parallelism for High-Resolution Input Training - Deifilia Kieckhefen, Karlsruhe Institute of Technology

Tuesday April 7, 2026 15:00 - 15:10 CEST

Founders Cafe

Distributed neural network training frameworks typically optimize for specific architectures while minimizing communication overhead. Transformer layers can be efficiently parallelized, but other operations such as convolutions often remain inefficient. This creates bottlenecks for complex model architectures.

Moreover, existing tensor parallelism strategies typically replicate input data across all processes, creating redundant I/O that scales poorly with input size. In applications with heavy I/O demands, such as weather forecasting, medical imaging, or video processing, unsharded input data creates additional data-loading bottlenecks that could benefit from parallelization.

Jigsaw is a PyTorch library that shards both model weights and input data across parallel processes. It maintains a PyTorch-like interface while parallelizing activations, convolutions, linear layers, and attention through a distributed matrix multiplication backend. We demonstrate the usability of Jigsaw across a wide range of model architectures, show its performance when scaling multi-billion-parameter models sharded across up to 8 processes, and compare its scalability to DDP, FSDP, and Megatron-LM approaches.

Speakers

Deifilia Kieckhefen

Doctoral Researcher, Karlsruhe Institute of Technology

Deifilia Kieckhefen is a doctoral researcher at the Karlsruhe Institute of Technology. She works on scalable and distributed training of neural network architectures.

PyTorchConf jigsaw Kieckhefen pdf

Training Systems

Audience Level Any
Slides Attached Yes

16:10 CEST

Optimizing Reinforcement Learning at Trillion-Parameter Scale - Songlin Jiang, Aalto University & Mind Lab

Tuesday April 7, 2026 16:10 - 16:35 CEST

Junior Stage

This talk will dive into how we implemented and optimized reinforcement learning on trillion-parameter Mixture-of-Experts reasoning models using veRL, Megatron-Bridge and vLLM. The session is useful to anyone building large-scale RL training systems.

In the first part, I will walk through the system design required to make RL work at this scale using LoRA: how LoRA adapters are implemented for expert layers, how adapters are sharded and fused under tensor/pipeline/expert parallelism, and most importantly, how refit (parameter sync) is implemented for LoRA between the training backend (Megatron) and the rollout engine (vLLM).
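One step mentioned above, fusing a LoRA adapter into its base weight before the weights are handed to a rollout engine, can be sketched in plain Python. This is only the standard LoRA merge W' = W + (alpha / r) * B A on tiny made-up matrices. The function names are hypothetical, and the sketch ignores the sharding across tensor, pipeline, and expert parallel ranks that the talk addresses.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse_lora(weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)              # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Hypothetical shapes: out_features=2, in_features=3, rank=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 0.0]]
B = [[2.0], [0.0]]
FUSED = fuse_lora(W, A, B, alpha=2.0)
print(FUSED)
```

Applying the fused matrix to an input gives the same result as applying the base weight and the scaled adapter path separately, which is what lets an inference engine serve the adapted model without any LoRA-specific code path.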

The second part of the talk focuses on training–inference mismatch in MoE RL. I will explain why common mitigations such as clipping and importance sampling can fail, and how we implement fixed Router Replay R3 across vLLM, veRL, and Megatron to align routing decisions between rollout and training.

This work was done together with Mind Lab. Some of the related blog posts are at:
- https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus
- https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it

Speakers

Songlin Jiang

Doctoral Researcher, Aalto University & Mind Lab

I am a doctoral researcher at Aalto University, focusing on reducing training and inference latency for Reinforcement Learning and Large Language Models (LLMs) on High-Performance Computing (HPC) clusters. I am also a passionate free software developer, a maintainer of VeRL, and a...

1T RL PTC EU26 Songlin pdf

Training Systems

Audience Level Intermediate
Slides Attached Yes

16:10 CEST

TorchStore: What We Learned Building Distributed Storage Solutions for AsyncRL - Lucas Pasqualin, Danielle Pintz, Allen Wang & Amir Afzali, Meta

Tuesday April 7, 2026 16:10 - 16:35 CEST

Master Stage

Asynchronous Reinforcement Learning (AsyncRL) workloads have unique data sharing requirements: actors must efficiently exchange large tensors across processes and nodes, often with different sharding configurations, not just at checkpoint time but continuously during training for live weight synchronization. This talk presents TorchStore, an open-source distributed tensor storage system built on Monarch actors that tackles these challenges. We'll share the key lessons learned, from designing pluggable transport backends (RDMA, shared memory, RPC) to implementing transparent live DTensor resharding that lets producers and consumers use entirely different parallelism strategies. We'll also discuss the friction we encountered integrating with inference engines like vLLM, where differing model definitions and integrations present new bottlenecks. Whether you're building actor-based training systems or thinking about disaggregated training-inference architectures, you'll leave with practical insights on distributed tensor storage design.

Speakers

Lucas Pasqualin

ML Engineer, PyTorch (Meta)

Lucas has been developing Machine Learning Applications and Machine Learning infrastructure at scale for years, and has recently been focused on extending the product offering of PyTorch's Distributed Checkpointing stack.

Allen Wang

Software Engineer, Meta

Danielle Pintz

Software Engineer, Meta

Danielle is a software engineer working on PyTorch, currently focused on TorchStore and Async RL. She previously worked on the Llama Research team.

Amir Afzali

Software Engineer, Meta

Software engineer working on PyTorch distributed infra and large-scale training.

Training Systems

Audience Level Intermediate

10:50 CEST

Lightning Talk: Step-Aligned Telemetry for Distributed PyTorch Training (Time & Memory Attribution Across Ranks) - Abhinav Srivastav, TraceOpt

Wednesday April 8, 2026 10:50 - 11:00 CEST

Central Room

Distributed PyTorch training often looks healthy in system dashboards: GPU utilization is high and memory is stable, yet throughput degrades, steps jitter, or GPUs go idle intermittently. The core issue is misalignment: most telemetry is sampled by time, while training progresses by "steps", and distributed behavior is dominated by the slowest rank rather than averages.

In this talk I will break down common failure modes in DDP training that standard metrics miss (rank stragglers, dataloader stalls, step-time variance, and memory spikes/creep). We will show how step-aligned, rank-aware aggregation changes debugging: per-step worst-rank vs median-rank views, gating to completed steps across ranks, and how to tie time and memory back to training semantics without relying on heavyweight profilers.
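The aggregation described above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not code from the speaker's tool: the input layout (a dictionary of per-rank step timings) and all names are assumptions. It keeps only the steps that every rank has completed, then reports the median and the worst rank for each step.

```python
from statistics import median

def step_aligned_summary(step_times_by_rank):
    """Map {rank: {step: seconds}} to {step: median/worst-rank stats}."""
    if not step_times_by_rank:
        return {}
    # Gate to steps that every rank has completed.
    completed = set.intersection(*(set(t) for t in step_times_by_rank.values()))
    summary = {}
    for step in sorted(completed):
        per_rank = {rank: times[step] for rank, times in step_times_by_rank.items()}
        worst_rank = max(per_rank, key=per_rank.get)
        summary[step] = {
            "median_s": median(per_rank.values()),
            "worst_s": per_rank[worst_rank],
            "worst_rank": worst_rank,
        }
    return summary

# Made-up timings: rank 1 straggles on step 2, rank 2 has not finished step 3.
SAMPLE = {
    0: {1: 0.50, 2: 0.51, 3: 0.50},
    1: {1: 0.52, 2: 0.95, 3: 0.51},
    2: {1: 0.49, 2: 0.50},
}
SUMMARY = step_aligned_summary(SAMPLE)
for step, stats in SUMMARY.items():
    print(step, stats)
```

In this sample the median for step 2 stays near 0.51 s while the worst rank takes 0.95 s, which is the kind of straggler an averaged, time-sampled dashboard tends to hide.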

Speakers

Abhinav Srivastav

ML Scientist, TraceOpt

ML researcher with a PhD in Computer Science. Industry experience at IBM Research, Huawei Research, and Zalando. Currently building TraceML: an open source tool that shows you the step-level breakdown of your PyTorch training run while it's still running. I am partially interested in...

Pytorch conf pdf

Training Systems

Audience Level Advanced
Slides Attached Yes

11:05 CEST

Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Training on ROCm - Liz Li & Zachary Streeter, AMD

Wednesday April 8, 2026 11:05 - 11:30 CEST

Founders Cafe

PyTorch Monarch introduces a new distributed programming paradigm that enables developers to orchestrate entire GPU clusters from a single Python program. With its actor-based runtime, process mesh abstraction, and asynchronous execution model, Monarch simplifies large-scale distributed training and enables complex workflows that combine training, evaluation, and reinforcement learning within one unified script.

In this talk, we present our work enabling PyTorch Monarch on AMD Instinct GPUs with ROCm, expanding the single-controller model beyond CUDA environments and bringing this emerging runtime to a broader hardware ecosystem. We describe the engineering effort required to port Monarch’s GPU runtime and distributed communication stack to ROCm, including HIPification of CUDA-specific components, adaptation of memory management and synchronization semantics, and integration with high-performance GPU-to-GPU communication on multi-node clusters through RDMA.

We will share lessons learned from running Monarch workloads on MI300-class clusters, including performance considerations, debugging workflows, and developer experience improvements. Our results demonstrate that Monarch’s architecture can be successfully extended to heterogeneous hardware environments while preserving scalability and ease of use.

This work advances hardware diversity in distributed PyTorch and highlights how portable runtimes can simplify large-scale training while enabling scalable, cluster-wide experimentation across accelerator platforms.

Speakers

Liz Li

Principal AI Engineer, AMD

Liz Li is a Principal AI Engineer in the AMD AI group, specializing in enabling and optimizing cutting-edge AI models on AMD Instinct GPUs for both distributed inference and training. With over 10 years of experience in computer, graphics, and AI architecture, she has previously led...

Zachary Streeter

Senior Member of Technical Staff, AMD

I'm a computational physicist who has been working in the field of AI for the past 5 years. I have a wide range of expertise, from mathematics to performance optimization and systems engineering. Feel free to nerd out with me! Please connect with me on LinkedIn.

Monarch PTC v1 pptx

Training Systems

Audience Level Any

11:05 CEST

FP8 Training From Hopper To Blackwell - Luca Wehrstedt, Meta

Wednesday April 8, 2026 11:05 - 11:30 CEST

Master Stage

The Hopper generation of NVIDIA GPUs first enabled the use of low-precision float8 data types for training via Tensor Core acceleration. However, the recipe to best leverage it was far from settled. Practitioners had to find their way through many entangled decisions around accuracy-vs-efficiency, precision-vs-range, overflows-vs-underflows, and more. The frontier was pushed further forward by the DeepSeek release, and then by the micro-scaling formats introduced by Blackwell. In this talk we will go through all these approaches, comparing their pros and cons, thus guiding researchers in finding the options that work best for them.
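The range-vs-precision tension can be made concrete with a small plain-Python sketch. It assumes the E4M3 format's largest finite value of 448 and smallest subnormal of 2^-9, and treats any scaled value below half the smallest subnormal as flushed to zero. It does not perform real FP8 rounding, it uses real-valued scales rather than the power-of-two scales of micro-scaling formats, and the block size of 32 and the sample tensor are illustrative assumptions.

```python
E4M3_MAX = 448.0                # largest finite E4M3 magnitude
E4M3_MIN_SUBNORMAL = 2.0 ** -9  # smallest positive E4M3 magnitude

def amax_scale(values):
    """Scale that maps the largest magnitude onto the top of the E4M3 range."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0

def count_underflows(values, block_size=None):
    """Count non-zero values that would be flushed to zero after scaling.

    block_size=None means a single scale for the whole tensor.
    """
    if not values:
        return 0
    block_size = block_size or len(values)
    flushed = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = amax_scale(block)
        flushed += sum(
            1 for v in block
            if v != 0 and abs(v * scale) < E4M3_MIN_SUBNORMAL / 2
        )
    return flushed

# One large outlier followed by many small values (made-up data).
TENSOR = [1000.0] + [1e-3] * 63
PER_TENSOR = count_underflows(TENSOR)
PER_BLOCK = count_underflows(TENSOR, block_size=32)
print(PER_TENSOR, PER_BLOCK)
```

With one scale for the whole tensor, the outlier forces every small value below the representable range. With one scale per block of 32, only the values that share a block with the outlier are lost.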

Speakers

Luca Wehrstedt

Software Engineer, Meta

Research Engineer in Meta's Fundamental AI Research team (FAIR). At the intersection of research and infrastructure, Luca specializes in training efficiency and distributed communication. Regular contributor to PyTorch.

Fp8 at PT conf (1) pdf

Training Systems

Audience Level Advanced
Slides Attached Yes

14:00 CEST

Lightning Talk: Backpropagation-Free Optimization in PyTorch - Andrii Krutsylo, Polish Academy of Sciences

Wednesday April 8, 2026 14:00 - 14:10 CEST

Central Room

Backpropagation is not the only mechanism for training deep networks. This talk presents a compact, implementation-driven map of backpropagation-free training methods, organized around representative algorithms that expose key design trade-offs.

We focus on four families: Difference Target Propagation (target-based credit assignment), Direct Feedback Alignment (random feedback without weight transport), local loss / greedy layerwise training (strictly local objectives), and Forward-Forward learning as a forward-only alternative. Each is treated as a minimal working pattern rather than a full system.

For each representative, we answer the same practical questions: what learning signal is propagated, what intermediate state must be stored, how parameters are updated, and what limits scalability on modern accelerators. The emphasis is on PyTorch-level mechanics—explicit update loops, local objectives, and training without autograd—rather than derivations.

The goal is to give practitioners a clear mental model of the backprop-free design space and concrete patterns for experimenting with these methods in real PyTorch training pipelines.

Speakers

Andrii Krutsylo

PhD Candidate, Institute of Computer Science, Polish Academy of Sciences

Andrii Krutsylo is a deep learning researcher focusing on continual learning and optimization dynamics. His work studies experience replay, gradient-free and local learning rules, and structured optimization for adaptive, resource-efficient systems.

Training Systems

Audience Level Intermediate

14:00 CEST

Lightning Talk: Debugging the Undebuggable: Introducing torch.distributed.debug - Tristan Rice, Meta, PyTorch

Wednesday April 8, 2026 14:00 - 14:10 CEST

Founders Cafe

Distributed training in PyTorch enables unprecedented scale, but it also introduces notoriously difficult debugging challenges. When a job with thousands of ranks hangs or slows down, identifying the root cause can feel like searching for a needle in a haystack. This lightning talk introduces the new PyTorch Distributed Debug Server, a powerful, interactive tool designed to bring clarity and control to the chaos of distributed debugging. We will provide a high-level overview of its architecture and core features, demonstrating how it provides a unified interface to inspect stack traces, analyze performance, and diagnose hangs across all workers simultaneously. Attendees will learn how this extensible server can dramatically reduce debugging time and improve the reliability of large-scale training jobs.

Speakers

Tristan Rice

Software Engineer, PyTorch Distributed, Meta

Software engineer working on PyTorch Distributed and large scale training.

Training Systems

Audience Level Intermediate

14:15 CEST

Lightning Talk: Scaling Recommendation Systems To 2K GPUs and Beyond - Zain Huda, Meta

Wednesday April 8, 2026 14:15 - 14:25 CEST

Founders Cafe

TLDR: In this session, we go over one of the key technologies behind Ads model scaling at Meta: 2D sparse parallelism, which scales sparse recommendation embedding tables beyond 1k GPUs to 8k GPUs, enabling the largest Ads model training runs in production at Meta.

Scaling Laws have dominated LLMs and shown the industry we can achieve better model performance through scaling. The same scaling law can be applied to recommendation systems. However, the path to scaling recommender systems is not the same. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models.

In this talk, we will detail the development of 2D sparse parallelism, tracing its path from research to production to address sparse scaling challenges. We will demonstrate how we optimize these systems to push performance boundaries, increasing speed and reducing memory at scale. Participants will walk away with lessons learned from designing 1,000+ GPU scale systems, and a deeper understanding of how to implement these solutions efficiently in production.

Speakers

Zain Huda

Software Engineer, Meta

Zain works on large scale training systems for recommender systems at Meta. He works on TorchRec, a library for distributed parallelism for sparse recommender models. He is also one of the authors of 2D sparse parallelism.

Training Systems

Audience Level Intermediate

14:30 CEST

From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face

Wednesday April 8, 2026 14:30 - 14:55 CEST

Founders Cafe

Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.

We highlight how TRL, as a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can leverage simulated environments (e.g., coding, terminals, browsers) such as OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insights into design patterns, rollout handling, trajectory batching, and advantage computation, showing how robust, multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
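For readers unfamiliar with the advantage computation mentioned above, here is a minimal plain-Python sketch of the group-relative idea behind GRPO: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is not TRL's implementation. The function name, the use of the population standard deviation, and the epsilon value are assumptions made for illustration, and the sketch assigns a single scalar per trajectory without addressing how it is spread over turns or tokens.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four trajectories sampled from the same prompt (made-up values).
GROUP_REWARDS = [1.0, 0.0, 0.0, 1.0]
ADVANTAGES = group_relative_advantages(GROUP_REWARDS)
print(ADVANTAGES)
```

Trajectories that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline.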

Speakers

Kashif Rasul

Research Scientist, Hugging Face

Kashif has a PhD in Mathematics from the Freie Universität Berlin. He is passionate about high-performance computing and reinforcement learning, has presented at NVIDIA's GTC in 2009 and at StrangeLoop in 2012, and is also contributing to a number of data science and deep learning...

Sergio Paniego Blanco

Machine Learning Engineer, Hugging Face

Sergio has extensive experience in open source and artificial intelligence, the field in which he also earned his doctorate. For more than eight years he has taken part in initiatives such as Google Summer of Code, where he has contributed as a developer and mentor. Currently...

From Responses To Trajectories Multi Turn and Multi Environment Reinforcement Learning pdf

Training Systems

Audience Level Intermediate
Slides Attached Yes

15:25 CEST

Lightning Talk: Trinity Large - Torchtitan on 2000+ B300s - Matej Sirovatka, Prime Intellect

Wednesday April 8, 2026 15:25 - 15:35 CEST

Founders Cafe

In this talk, we'll cover how to use torchtitan to scale training of ultra-sparse mixture-of-experts models across over 2,000 GPUs. We'll walk through the pre-training of Trinity Large, a 400B mixture-of-experts model trained entirely using torchtitan, focusing on maximizing throughput and minimizing the impact of hardware-induced failures. Along the way, we'll discuss challenges like fault tolerance, large-scale distributed training, and ensuring determinism, and how we've addressed each of these using torchtitan. Finally, we'll share insights and common pitfalls to avoid in your own large-scale training runs.

Speakers

Matej Sirovatka

Research Engineer, Prime Intellect

Research Engineer at Prime Intellect, mainly focusing on distributed training, performance and scaling.

PTC 2026 Trinity Large & torchtitan (1) pdf

Training Systems

Audience Level Intermediate
Slides Attached Yes

15:55 CEST

Lightning Talk: Why Logging Isn’t Enough: Making PyTorch Training Regressions Visible in Practice - Sahana Venkatesh, Wayve

Wednesday April 8, 2026 15:55 - 16:05 CEST

Central Room

PyTorch teams often log rich training metrics, yet still discover training regressions late, after significant developer time and GPU budget have already been spent. In this talk, I’ll share a practical pattern we used to turn PyTorch training metrics into an operational guardrail for large-model training.

The approach combines scheduled short and long training runs, standardized performance and stability metrics (throughput, memory, loss, divergence), and simple statistical baselines to automatically surface regressions via alerts without hard gates or complex infrastructure.
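A simple statistical baseline of the kind described above might look like the following plain-Python sketch. It is an assumption-laden illustration rather than the speaker's implementation: the z-score rule, the threshold of 3, the function name, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev

def is_regression(baseline, current, higher_is_better=True, z_threshold=3.0):
    """Flag `current` if it is worse than the baseline by more than z_threshold sigmas.

    `baseline` needs at least two values from earlier scheduled runs.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: any move in the bad direction counts.
        return current != mu and ((current < mu) == higher_is_better)
    z = (current - mu) / sigma
    return z < -z_threshold if higher_is_better else z > z_threshold

# Throughput from previous scheduled runs, in tokens per second (made-up values).
BASELINE_TOKENS_PER_S = [1010.0, 995.0, 1002.0, 990.0, 1003.0]

print(is_regression(BASELINE_TOKENS_PER_S, 998.0))  # within normal variation
print(is_regression(BASELINE_TOKENS_PER_S, 940.0))  # well below the baseline
```

For metrics where lower is better, such as peak memory, pass `higher_is_better=False`. A fixed baseline like this one will drift out of date, so a real setup would likely refresh it from a rolling window of recent runs.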

I’ll focus on why logging alone is insufficient, how we chose what to monitor, and what tradeoffs we encountered (false positives, alert fatigue, baseline drift). The goal is not a tool demo, but a reusable pattern other PyTorch teams can adapt to catch training regressions earlier and make retraining more predictable.

Speakers

Sahana Venkatesh

Software engineer, Wayve

Making PyTorch Training Regressions Visible in Practice pdf

Training Systems

Audience Level Intermediate
Slides Attached Yes

15:55 CEST

DualPipe from Scratch: Implementing DeepSeek's 5D Parallelism in PyTorch - Dev Jadhav, ING Bank

Wednesday April 8, 2026 15:55 - 16:20 CEST

Founders Cafe

The DeepSeek-V3 paper describes 5D parallelism and DualPipe at a high level, but leaves critical implementation details undocumented. This session presents our open-source PyTorch reference implementation that fills those gaps, verified against the original architecture and designed for learning and extension.

We'll share what we discovered building it from scratch:
- Why K_pe is shared across heads in decoupled RoPE (not explicit in paper)
- The critical timing of bias updates in auxiliary-loss-free load balancing
- How sigmoid routing separates selection scores from gate values
- The warmup formula that makes DualPipe achieve 3% bubble overhead
- Bugs we caught: causal mask position offsets, EMA initialization, capacity dropping priority

What you'll learn:

- 5D Parallelism: How TP, PP, DP, EP, and SP interact at 2,048+ GPU scale
- DualPipe: Building the bidirectional scheduler with 55% throughput gain over GPipe
- Hierarchical All-to-All: Two-level communication reducing MoE dispatch overhead by 4x
- Teachable abstractions: CapacityMetrics, ExpertSpecializationTracker, ScheduleStep enums

Prerequisites: torch.distributed basics.
Code: github.com/DevJadhav/deepseek-from-scratch
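The separation between selection scores and gate values mentioned in the abstract can be sketched in plain Python. The sketch follows the routing scheme as described in the DeepSeek-V3 paper: a per-expert bias is added to the sigmoid affinity only when choosing the top-k experts, while the gate values are the unbiased affinities normalized over the selected experts. Function names and numbers are hypothetical and are not taken from the speaker's repository.

```python
import math

def route(logits, bias, top_k):
    """Return (selected expert ids, gate values) for a single token."""
    affinity = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Selection uses biased scores; the bias only steers load balancing.
    biased = [s + b for s, b in zip(affinity, bias)]
    selected = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Gate values use the original affinities, normalized over the selection.
    total = sum(affinity[i] for i in selected)
    gates = [affinity[i] / total for i in selected]
    return selected, gates

LOGITS = [2.0, 1.0, 0.0, -1.0]
NO_BIAS = route(LOGITS, [0.0, 0.0, 0.0, 0.0], top_k=2)
# Pretend expert 2 was under-loaded, so its bias has been raised.
WITH_BIAS = route(LOGITS, [0.0, 0.0, 0.5, 0.0], top_k=2)
print(NO_BIAS)
print(WITH_BIAS)
```

In the biased case expert 2 ranks first for selection, yet its gate value stays smaller than expert 0's, because the bias never enters the weighting of expert outputs.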

Speakers

Dev Jadhav

Tech Lead ML Engineer, ING Bank

Dev Jadhav is a production AI/ML engineer with 10+ years building AI systems at scale. He currently leads ML engineering at Major Bank, developing financial-grade AI and large-scale model operations. Dev is the creator of DeepSeek From Scratch, an open-source implementation of DeepSe...

DualPipe PyTorch Conference pdf

Training Systems

Audience Level Advanced
Slides Attached Yes

15:55 CEST

Sponsored Session: Fault-Tolerant Training: How We Build Reliable Clusters for Distributed AI Workloads - Cyril Kondratenko & Maurits de Groot, Nebius

Wednesday April 8, 2026 15:55 - 16:20 CEST

Junior Stage

Large-scale distributed AI training is highly sensitive to infrastructure failures, where even a single node disruption can halt progress and waste substantial compute. This talk presents Nebius’s approach to fault-tolerant training, combining reliability metrics such as goodput, MTBF, and MTTR with automated infrastructure practices including health checks, workload isolation, node replacement, state recovery, and observability. Drawing on production cluster results, the presentation shows how these techniques reduce interruptions, accelerate recovery, and improve the stability and efficiency of long-running AI workloads.
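As a small worked example of the reliability metrics named above, the sketch below uses the textbook steady-state availability formula MTBF / (MTBF + MTTR) and treats goodput as the fraction of wall-clock time spent on useful training progress. Definitions of goodput vary between teams, so this reading, the function names, and the numbers are assumptions rather than Nebius's definitions or results.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def goodput(useful_hours, wall_clock_hours):
    """Fraction of wall-clock time that produced retained training progress."""
    return useful_hours / wall_clock_hours

# Hypothetical cluster: a failure every 24 hours on average, 1 hour to recover.
print(availability(24.0, 1.0))
# Halving recovery time raises availability without changing the failure rate.
print(availability(24.0, 0.5))
# Hypothetical job: 90 useful hours out of 100 hours of wall-clock time.
print(goodput(90.0, 100.0))
```

The comparison between the first two calls shows why faster recovery matters even when the failure rate itself cannot be reduced.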

Speakers

Cyril Kondratenko

AI/ML Specialist Solutions Architect, Nebius

Maurits de Groot

AI/ML Specialist Solutions Architect, Nebius

Training Systems
