7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2).
Venue: Founders Cafe
Wednesday, April 8
 

10:35 CEST

How To Write C++ Extensions in 2026 - Jane Xu, Meta & Mikayla Gawarecki, Meta
Wednesday April 8, 2026 10:35 - 11:00 CEST
Are you writing a C++ custom op extension to PyTorch? It's 2026 and are you still shipping M x N wheels for M CPython versions and N libtorch versions? Did you know you can just ship 1 wheel that works across multiple CPythons and libtorches? If you're curious how, attend this talk to get the deets on py_limited_api, APIs like torch::stable::Tensor & TORCH_TARGET_VERSION, and generally the latest and greatest ways for keeping your code and your release matrix simple. Get your custom kernel enrolling in new features with benefits proven out in FA3, xformers, torchao, torchaudio, and more in progress! We'll also share some of our vision towards smoother and faster custom ops extensions.
Speakers

Jane Xu

PyTorch SWE, Meta
Hi, I'm Jane! Please don't hesitate to come talk to me about your favorite optimizer, fitting models in GPU memory, how to free C++ extensions from libtorch version, and anything that interests you.

Mikayla Gawarecki

Software Engineer, Meta
Software Engineer on PyTorch
Founders Cafe
  Frameworks & Compilers

11:05 CEST

Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Training on ROCm - Liz Li & Zachary Streeter, AMD
Wednesday April 8, 2026 11:05 - 11:30 CEST
PyTorch Monarch introduces a new distributed programming paradigm that enables developers to orchestrate entire GPU clusters from a single Python program. With its actor-based runtime, process mesh abstraction, and asynchronous execution model, Monarch simplifies large-scale distributed training and enables complex workflows that combine training, evaluation, and reinforcement learning within one unified script.

In this talk, we present our work enabling PyTorch Monarch on AMD Instinct GPUs with ROCm, expanding the single-controller model beyond CUDA environments and bringing this emerging runtime to a broader hardware ecosystem. We describe the engineering effort required to port Monarch’s GPU runtime and distributed communication stack to ROCm, including HIPification of CUDA-specific components, adaptation of memory management and synchronization semantics, and integration with high-performance GPU-to-GPU communication on multi-node clusters through RDMA.

We will share lessons learned from running Monarch workloads on MI300-class clusters, including performance considerations, debugging workflows, and developer experience improvements. Our results demonstrate that Monarch’s architecture can be successfully extended to heterogeneous hardware environments while preserving scalability and ease of use.

This work advances hardware diversity in distributed PyTorch and highlights how portable runtimes can simplify large-scale training while enabling scalable, cluster-wide experimentation across accelerator platforms.
Speakers

Liz Li

Principal AI engineer, AMD
Liz Li is a Principal AI Engineer in the AMD AI group, specializing in enabling and optimizing cutting-edge AI models on AMD Instinct GPUs for both distributed inference and training. With over 10 years of experience in computer, graphics, and AI architecture, she has previously led...

Zachary Streeter

Senior Member of Technical Staff, AMD
I'm a computational physicist who has been working in the field of AI for the past 5 years. I have a wide range of expertise, from mathematics to performance optimization and systems engineering. Feel free to nerd out with me! Please connect with me on LinkedIn.
Founders Cafe
  Training Systems
  • Audience Level Any

11:35 CEST

Lightning Talk: Enabling the Audio Modality for Language Models - Eustache Le Bihan, Hugging Face
Wednesday April 8, 2026 11:35 - 11:45 CEST
This talk, given by the maintainer of everything audio in `transformers` (the library), shares how audio is being integrated into large language models, grounded in what we observe from the open-source ecosystem.

Beginning with a brief overview of the current landscape of Audio LMs, I'll then highlight emerging trends in how audio is incorporated into pretrained text backbones. In particular, we examine the growing convergence of architectural choices, many inspired by VLMs, as well as newer concepts such as audio tokenization and streaming.

The core of the talk focuses on providing the audience with key technical insights: audio encoders vs audio tokenizers, their respective advantages and limitations. It covers the motivations behind introducing concepts such as audio tokenizers and audio processors into transformers, shows how these design choices are reflected in the library, and explains how PyTorch tooling is leveraged to make audio a standardized modality for the open-source community.
Speakers

Eustache Le Bihan

MLE, Hugging Face
A 2024 MVA graduate, I now work on open-source audio at Hugging Face. My current focus is on standardising audio in the transformers library and strengthening support across models.
Founders Cafe

13:30 CEST

Optimizing CPU LLM Inference in PyTorch: Lessons From vLLM - Crefeda Rodrigues, Arm Limited & Fadi Arafeh, Arm
Wednesday April 8, 2026 13:30 - 13:55 CEST
vLLM has emerged as a reference inference stack in the PyTorch ecosystem for high-throughput large language model serving. CPUs continue to play an important role in LLM inference, supporting cost-sensitive deployments, hybrid CPU/GPU serving, and batch or off-peak workloads on general-purpose infrastructure.

In this talk, we examine CPU-based LLM inference through the lens of PyTorch internals, using vLLM as a case study. We describe how vLLM interacts with PyTorch’s operator stack, including tensor layout management, backend dispatch, and threading behaviour, and highlight common sources of overhead such as repeated weight repacking and poor threading behaviour.

We present runtime and kernel-level optimizations that reduce overhead including CPU paged-attention kernel tuning with vectorized softmax, specialized Q–K and P–V GEMM kernels aligned with vLLM’s scheduler, an ISA-aware BF16 attention, pre-packed weight layouts for quantized matmul, SIMD vectorization using PyTorch’s at::vec::Vectorized primitives, and NUMA-aware scheduling for scalable parallel inference.
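The abstract does not say how the softmax in the paged-attention kernel is organized. As a rough illustration only, one common way to normalize attention scores that arrive in separate blocks (pages) is a streaming softmax that keeps a running maximum and a running sum. The sketch below is scalar Python rather than vectorized code, the function name and the page layout are assumptions, and it is not the vLLM or PyTorch kernel.

```python
import math

def softmax_over_pages(pages):
    """Softmax across all scores, visiting one non-empty page at a time.

    A running max and a rescaled running sum mean the full score vector
    never has to be materialized before normalization.
    """
    running_max = -math.inf
    running_sum = 0.0
    for page in pages:
        new_max = max(running_max, max(page))
        # Rescale the sum accumulated so far to the new reference maximum.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in page)
        running_max = new_max
    # Second pass: normalize every score with the final statistics.
    return [[math.exp(s - running_max) / running_sum for s in page]
            for page in pages]

# Hypothetical attention scores split across two pages.
pages = [[1.0, 2.0], [3.0]]
probs = softmax_over_pages(pages)
```

The result matches a softmax over the concatenated scores; only the order in which the statistics are accumulated differs.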

Finally, we conclude with lessons learned from building and upstreaming a high-performance CPU inference engine.
Speakers

Crefeda Rodrigues

Staff Software Engineer, Arm
Crefeda Rodrigues is a Staff Software Engineer at Arm, focusing on performance and scalability driven machine learning software optimization for Arm server CPUs. She previously worked on large-scale climate and weather model optimization as a postdoctoral researcher at the University...

Fadi Arafeh

Senior Machine Learning Engineer, Arm
Fadi is a Senior Machine Learning Engineer at Arm, working on optimizing PyTorch and vLLM for Arm Infrastructure cores. Prior to that, Fadi obtained a BSc in Artificial Intelligence from the University of Manchester.
Founders Cafe
  Inference & Production

14:00 CEST

Lightning Talk: Debugging the Undebuggable: Introducing torch.distributed.debug - Tristan Rice, Meta, PyTorch
Wednesday April 8, 2026 14:00 - 14:10 CEST
Distributed training in PyTorch enables unprecedented scale, but it also introduces notoriously difficult debugging challenges. When a job with thousands of ranks hangs or slows down, identifying the root cause can feel like searching for a needle in a haystack. This lightning talk introduces the new PyTorch Distributed Debug Server, a powerful, interactive tool designed to bring clarity and control to the chaos of distributed debugging. We will provide a high-level overview of its architecture and core features, demonstrating how it provides a unified interface to inspect stack traces, analyze performance, and diagnose hangs across all workers simultaneously. Attendees will learn how this extensible server can dramatically reduce debugging time and improve the reliability of large-scale training jobs.
Speakers

Tristan Rice

Software Engineer, PyTorch Distributed, Meta
Software engineer working on PyTorch Distributed and large scale training.
Founders Cafe

14:15 CEST

Lightning Talk: Scaling Recommendation Systems To 2K GPUs and Beyond - Zain Huda, Meta
Wednesday April 8, 2026 14:15 - 14:25 CEST
TLDR: In this session, we go over one of the key technologies behind Ads model scaling at Meta: 2D sparse parallelism, which scales sparse recommendation embedding tables beyond 1k GPUs to 8k GPUs, enabling the largest Ads model training runs in production at Meta.

Scaling Laws have dominated LLMs and shown the industry we can achieve better model performance through scaling. The same scaling law can be applied to recommendation systems. However, the path to scaling recommender systems is not the same. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models.

In this talk, we will detail the development of 2D sparse parallelism, tracing its path from research to production to address sparse scaling challenges. We will demonstrate how we optimize these systems to push performance boundaries, increasing speed and reducing memory at scale. Participants will walk away with lessons learned from designing 1,000+ GPU scale systems, and a deeper understanding of how to implement these solutions efficiently in production.
Speakers

Zain Huda

Software Engineer, Meta
Zain works on large scale training systems for recommender systems at Meta. He works on TorchRec, a library for distributed parallelism for sparse recommender models. He is also one of the authors of 2D sparse parallelism.
Founders Cafe

14:30 CEST

From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face
Wednesday April 8, 2026 14:30 - 14:55 CEST
Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.

We highlight how TRL, as a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can leverage simulated environments (e.g., coding, terminals, browsers) such as OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insights into design patterns, rollout handling, trajectory batching, and advantage computation, showing how robust, multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
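For readers unfamiliar with the advantage computation mentioned here, the following is a minimal sketch of the group-relative normalization that GRPO is generally described as using: each trajectory's reward is compared with the other trajectories sampled for the same prompt. The function name, the `eps` constant, and the reward values are illustrative assumptions, and this is not TRL's implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical trajectory rewards collected for a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed for the baseline.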
Speakers

Kashif Rasul

Research Scientist, Hugging Face
Kashif has a PhD in Mathematics from the Freie Universität Berlin. He is passionate about high-performance computing and reinforcement learning, has presented at NVIDIA's GTC in 2009 and at StrangeLoop in 2012, and is also contributing to a number of data science and deep learning...

Sergio Paniego Blanco

Machine Learning Engineer, Hugging Face
Sergio has an extensive background in open source and artificial intelligence, the field in which he also earned his PhD. He has spent more than eight years taking part in initiatives such as Google Summer of Code, where he has contributed as a developer and mentor. Currently...
Founders Cafe
  Training Systems

15:25 CEST

Lightning Talk: Trinity Large - Torchtitan on 2000+ B300s - Matej Sirovatka, Prime Intellect
Wednesday April 8, 2026 15:25 - 15:35 CEST
In this talk, we'll cover how to use torchtitan to scale training of ultra-sparse mixture-of-experts models across over 2,000 GPUs. We'll walk through the pre-training of Trinity Large, a 400B mixture-of-experts model trained entirely using torchtitan, focusing on maximizing throughput and minimizing the impact of hardware induced failures. Along the way, we'll discuss challenges like fault tolerance, large-scale distributed training, and ensuring determinism - and how we've addressed each of these using torchtitan. Finally, we'll share insights and common pitfalls to avoid in your own large-scale training runs.
Speakers

Matej Sirovatka

Research Engineer, Prime Intellect
Research Engineer at Prime Intellect, mainly focusing on distributed training, performance and scaling.
Founders Cafe
  Training Systems

15:40 CEST

Lightning Talk: Faster Than SOTA Kernels in torch.compile With Subgraph Fusions and Custom Op Autotuning - Elias Ellison & Paul Zhang, Meta
Wednesday April 8, 2026 15:40 - 15:50 CEST
Unlocking state-of-the-art performance, this talk reveals how subgraph and custom operator autotuning in torch.compile deliver breakthrough speedups—surpassing previous SOTA for matmul and distributed collective ops.

DecomposeK is a novel subgraph optimization in PyTorch, designed to accelerate matrix multiplication when the inner dimension (K) is very large. It delivers up to a 28% speedup over ATen with activation fusion and 10% over ATen without fusion.
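The abstract does not spell out the mechanism, but the name suggests splitting the shared K dimension into chunks, multiplying each chunk independently, and summing the partial products. The sketch below shows that general idea on nested Python lists; the function names and the divisibility requirement are assumptions made for illustration, and this is not the torch.compile implementation.

```python
def matmul(a, b):
    """Plain (M x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def decompose_k_matmul(a, b, num_splits):
    """Split K into equal chunks, multiply each chunk, then sum the partials."""
    k = len(b)
    if k % num_splits != 0:
        raise ValueError("this sketch assumes K is divisible by num_splits")
    chunk = k // num_splits
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for s in range(num_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        a_part = [row[lo:hi] for row in a]  # M x chunk slice of A
        b_part = b[lo:hi]                   # chunk x N slice of B
        partial = matmul(a_part, b_part)    # independent partial product
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out

a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
result = decompose_k_matmul(a, b, num_splits=2)
```

Because each partial product is independent, a very large K can be turned into many smaller multiplications that run in parallel, followed by one reduction.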

Building on subgraph infrastructure, we introduced Custom Op Autotuning, which benchmarks and selects the fastest kernel implementations for custom ops. This enables epilogue fusion and the first distributed collective op autotuning in PyTorch. We also introduce Range-based dispatch autotuning that enables dynamic selection of optimal implementations based on input shapes, ensuring performance that closely matches the theoretical best for each range. Our demo shows our autotuned kernels outperform Async TP Fused AG+MM by 9% and Async TP Fully Fused kernel by 41% across all input ranges.
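To make the idea of range-based dispatch concrete, here is a small sketch that maps an input size to whichever implementation was found fastest for the range containing it. The split points, the kernel names, and the `pick_impl` helper are all hypothetical; the abstract does not describe the actual PyTorch interface.

```python
from bisect import bisect_right

# Hypothetical split points on one input dimension, and the implementation
# assumed to have benchmarked fastest inside each resulting range.
BOUNDARIES = [512, 4096]
IMPLS = ["small_tile_kernel", "medium_tile_kernel", "large_tile_kernel"]

def pick_impl(size, boundaries=BOUNDARIES, impls=IMPLS):
    """Return the implementation registered for the range containing size."""
    if len(impls) != len(boundaries) + 1:
        raise ValueError("need exactly one implementation per range")
    return impls[bisect_right(boundaries, size)]

choice = pick_impl(1024)
```

Autotuning would fill in the table once per range, so the lookup at run time costs only a binary search.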
Speakers

Elias Ellison

Software Engineer, Meta
Elias has been working on the PyTorch team for four years, most recently on the torch.compile stack.

Paul Zhang

Software Engineer, Meta
Paul Zhang is a software engineer working on PyTorch and Triton at Meta, ensuring that PyTorch and PT2 make the best use of the hardware they run on. Before this, Paul did extensive work on recommendation systems for training and inference, optimizing performance and...
Founders Cafe

15:55 CEST

DualPipe from Scratch: Implementing DeepSeek's 5D Parallelism in PyTorch - Dev Jadhav, ING Bank
Wednesday April 8, 2026 15:55 - 16:20 CEST
The DeepSeek-V3 paper describes 5D parallelism and DualPipe at a high level, but leaves critical implementation details undocumented. This session presents our open-source PyTorch reference implementation that fills those gaps - verified against the original architecture and designed for learning and extension.

We'll share what we discovered building it from scratch:
  • Why K_pe is shared across heads in decoupled RoPE (not explicit in paper)
  • The critical timing of bias updates in auxiliary-loss-free load balancing
  • How sigmoid routing separates selection scores from gate values
  • The warmup formula that makes DualPipe achieve 3% bubble overhead
  • Bugs we caught: causal mask position offsets, EMA initialization, capacity dropping priority
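Two of the items above, sigmoid routing and bias-based load balancing, can be sketched in a few lines. In the scheme the DeepSeek-V3 paper describes, a per-expert bias is added to the sigmoid scores only when choosing the top-k experts, the gate values come from the unbiased scores, and the bias is nudged after each training step according to expert load. The function names, the step size, and the example values below are assumptions; this is not the speaker's reference implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route(logits, bias, top_k):
    """Pick experts with biased scores; weight them with unbiased scores."""
    scores = [sigmoid(z) for z in logits]
    # Selection: the bias influences only which experts are chosen.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = sorted(ranked[:top_k])
    # Gating: normalize the original scores over the chosen experts.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    target = sum(load) / len(load)
    return [b - step if l > target else b + step if l < target else b
            for b, l in zip(bias, load)]

# Hypothetical router logits for four experts; expert 2 carries a large bias.
gates = route([2.0, 1.0, 0.0, -1.0], bias=[0.0, 0.0, 1.0, 0.0], top_k=2)
new_bias = update_bias([0.0, 0.0], load=[3, 1], step=0.1)
```

In this example the bias gets expert 2 selected, yet its gate value stays smaller than expert 0's because gating ignores the bias.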

What you'll learn:
  • 5D Parallelism: How TP, PP, DP, EP, and SP interact at 2,048+ GPU scale
  • DualPipe: Building the bidirectional scheduler with 55% throughput gain over GPipe
  • Hierarchical All-to-All: Two-level communication reducing MoE dispatch overhead by 4x
  • Teachable abstractions: CapacityMetrics, ExpertSpecializationTracker, ScheduleStep enums

Prerequisites: torch.distributed basics.
Code: github.com/DevJadhav/deepseek-from-scratch
Speakers

Dev Jadhav

Tech Lead ML Engineer, ING Bank
Dev Jadhav is a production AI/ML engineer with 10+ years building AI systems at scale. He currently leads ML engineering at Major Bank, developing financial-grade AI and large-scale model operations. Dev is the creator of DeepSeek From Scratch, an open-source implementation of DeepSe...
Founders Cafe
  Training Systems

16:25 CEST

Lightning Talk: Bridging the Gap: Engineering Compliant "Glass Box" Medical AI With PyTorch - Muhammad Saqib Hussain, Neurosonic & Mohaddisa Maryam, Neurosonic Academy
Wednesday April 8, 2026 16:25 - 16:35 CEST
While state-of-the-art models like NeuroBOLT demonstrate mathematical excellence in EEG-to-fMRI synthesis, they often remain clinically opaque. With the EU AI Act classifying medical AI as "high-risk," hospitals cannot deploy "black boxes"; they require systems that are transparent, auditable, and legally compliant.
This session presents a "Clinical Auditing System" built within the PyTorch ecosystem, designed to transform opaque deep learning models into transparent "Glass Boxes." I will demonstrate a workflow that backpropagates gradients from high-dimensional 4D fMRI volumes to identify the specific EEG spectral signatures driving those predictions.
Key Technical Takeaways:
1. The Audit Layer: Implementing IntegratedGradients (Captum) to verify model fidelity, ensuring predictions stem from valid neural oscillations rather than noise artifacts.
2. Cross-Modal Reasoning: A technical demonstration of mapping 4D volumetric outputs back to 1D EEG frequency bands, enabling the model to "reason" through neurovascular coupling.
This presentation is designed for developers seeking to wrap PyTorch models in safety layers that satisfy demands of healthcare regulation.
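As background on the attribution method named in the takeaways, Integrated Gradients averages a model's gradient along a straight path from a baseline input to the actual input and scales the result by the input difference. The dependency-free sketch below approximates that integral with a midpoint sum on a toy function whose gradient is known in closed form. The toy function, the zero baseline, and the step count are assumptions; this is neither Captum's code nor the speakers' audit layer.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

def f(v):
    # Toy stand-in for a model: f(v) = v0^2 + 3 * v1
    return v[0] ** 2 + 3.0 * v[1]

def grad_f(v):
    # Analytic gradient of the toy function above.
    return [2.0 * v[0], 3.0]

x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to f(x) - f(baseline), the completeness property that makes this method convenient for checking which inputs a prediction actually depends on.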
Speakers

Mohaddisa Maryam

Miss, Neurosonic Academy
I am a First Year Student of Medicine in Italy.

Muhammad Saqib Hussain

Medical Student, AI Researcher and Neurotech Founder, ClinExplain
Muhammad Saqib is a 4th-year medical student at Comenius University Bratislava and Founder of Neurosonic Academy. His M.D. thesis explores AI for Sleep Medicine. Leveraging PyTorch and Captum, he builds "Glass Box" auditing frameworks to validate generative neuroimaging models against...
Founders Cafe
  Applications & Case Studies
 