Loading…
7-8 April, 2025
Paris, France
View More Details & Registration
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Type: Training Systems clear filter
arrow_back View All Dates
Tuesday, April 7
 

11:00 CEST

Lightning Talk: Training Embedding Model Resiliently for Multimodal Model Inference Routing - Huamin Chen, Red Hat & Haichen Zhang, AMD
Tuesday April 7, 2026 11:00 - 11:10 CEST
LLM systems increasingly rely on intelligent routing to balance cost, latency, and quality tradeoffs. The vLLM Semantic Router, a vLLM Ecosystem project, provides both semantic and performance level routing intelligence for Mixture-of-Multimodal Models (MoM) architectures, but its effectiveness depends on fast and accurate classifiers.

This talk presents our end-to-end journey training production-grade embedding and classification models on AMD GPUs using native PyTorch, achieving high GPU utilization with distributed training optimizations.

We introduce a multilingual text embedding model with 32K context window and 2D Matryoshka support, and multimodal embedding models, trained on AMD GPUs using PyTorch DDP. The talk covers practical training optimizations for AMD ROCm. All training code uses native PyTorch distributed primitives, with additional enhancement to improve training stability and pipeline efficiency.

Attendees will learn how to train efficient classifiers for LLM routing systems and integrate these models into production inference pipelines.
Speakers
avatar for Huamin Chen

Huamin Chen

Technical Advisor, Microsoft
Dr. Huamin Chen is a passionate developer. He co-founded the Semantic Router project under vLLM community. His recent contributions to the CNCF ecosystem include Project Kepler, TAG Environmental Sustainability, and Cloud Native AI WG. He is also one of the founding members... Read More →
avatar for Haichen Zhang

Haichen Zhang

Senior AI Software Engineer, AMD
Haichen is the Senior AI Engineer for AMD AI Group, specializing in accelerating training and inference for large language models, recommender systems, computer vision (CV), and natural language processing (NLP) tailored to internet customers. Before joining AMD, Haichen worked at... Read More →
Tuesday April 7, 2026 11:00 - 11:10 CEST
Junior Stage

11:45 CEST

Lightning Talk: TorchJD: Jacobian Descent in PyTorch - Pierre Quinton, EPFL & Valérian Rey, Simplex Lab
Tuesday April 7, 2026 11:45 - 11:55 CEST
Jacobian descent (JD) is an extension of gradient descent supporting the optimization of vector-valued functions. This algorithm can be used to train neural networks with multiple loss functions (e.g. multi-task learning). JD iteratively updates the parameters of the model using the Jacobian matrix of the vector of losses (the matrix stacking each individual loss' gradient).

To support and extend our research, we have developed the TorchJD library. With it, it's easy and efficient to compute the Jacobians with respect to the model parameters, and to aggregate them into an update direction that is beneficial to every objective. In contrast, if we had averaged the losses and used gradient descent, the update would have been beneficial to the average loss, but may have actually increased one of the individual losses.

In this session, we will give a quick introduction to the theory behind Jacobian descent, and then show how to use TorchJD on a variety of use-cases, beyond multi-task learning.

Library: https://github.com/TorchJD/torchjd
Paper: https://arxiv.org/abs/2406.16232
Speakers
avatar for Pierre Quinton

Pierre Quinton

Teacher, EPFL
PhD in Information Theory and Master in Data Science, specializing in fundamental math and multi-objective optimization (MOO). I am the co-author of TorchJD, a PyTorch library for Jacobian Descent developed with Valerian, currently at ~300 GitHub stars. My work aims to translate complex... Read More →
avatar for Valérian Rey

Valérian Rey

Research Engineer, Simplex Lab
I graduated from EPFL with a MSc in Data Science in 2021. Since then, I worked as a Data Scientist as Withings, and I worked on Jacobian descent, initially as a side-project, but now as a full-time occupation. I now spend most of my time developing and maintaining TorchJD, and I love... Read More →
Tuesday April 7, 2026 11:45 - 11:55 CEST
Founders Cafe
  Training Systems

12:00 CEST

Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google
Tuesday April 7, 2026 12:00 - 12:10 CEST
As PyTorch models scale to billions of parameters, the bottleneck has quietly shifted from compute to storage. Modern GPU clusters often sit idle, "starving" for data while waiting on legacy REST-based protocols. This talk introduces Rapid Storage: a fundamental architectural shift bringing Google’s Colossus stateful protocol (that powers many Google’s products) to PyTorch via fsspec , a common Pythonic file interface used by many frameworks within PyTorch ecosystem.
By bypassing REST APIs entirely via persistent gRPC streams to the storage layer, we eliminate protocol overhead. In this talk, we also dive into how Rapid achieves <1ms random read/write latency, 20x faster data access, and a massive 6 TB/s of aggregate throughput. Crucially, it delivers up to 10x lower tail latency for random I/O, preventing the stragglers that often stall distributed training jobs.
Beyond raw speed, we will deconstruct the integration with gcsfs and the broader fsspec ecosystem. This ensures that high-performance I/O is available across the entire data stack including Dask, Ray, HF Datasets and vLLM etc. Join us to learn how to stop wasting GPU cycles and achieve linear scaling in the cloud.
Speakers
avatar for Ankita Luthra

Ankita Luthra

Senior Software Engineer, Google
Ankita Luthra is a Software Developer at Google, focused on AI/ML infrastructure and scalable data pipelines. Her work with open-source tools like fsspec(gcsfs) and gcsfuse improves how frameworks such as PyTorch/ JAX efficiently access data from Google Cloud Storage.
avatar for Trinadh Kotturu

Trinadh Kotturu

Senior Product Manager, Google
Trinadh Kotturu is a Senior Product Manager specializing in AI/ML and analytics client strategy at Google. An alumnus of IIM Bangalore with 12 years of experience, he has a proven track record of shipping v1 products and scaling them into robust platform services. His expertise spans large-scale distributed storage systems, autonomous driving, and system resiliency... Read More →
Tuesday April 7, 2026 12:00 - 12:10 CEST
Master Stage
  Training Systems
  • Audience Level Any
  • Slides Attached Yes

15:00 CEST

Lightning Talk: Jigsaw: Domain and Tensor Parallelism for High-Resolution Input Training - Deifilia Kieckhefen, Karlsruhe Institute of Technology
Tuesday April 7, 2026 15:00 - 15:10 CEST
Distributed neural network training frameworks typically optimize for specific architectures while minimizing communication overhead. Transformer layers can be efficiently parallelized, but other operations such as convolutions often remain inefficient. This creates bottlenecks for complex model architectures.
Moreover, existing tensor parallelism strategies typically replicate input data across all processes, creating redundant I/O that scales poorly with input size. In applications with heavy I/O demands-weather forecasting, medical imaging, or video processing-unsharded input data creates additional data-loading bottlenecks that could benefit from parallelization.
Jigsaw is a PyTorch library that shards both model weights and input data across parallel processes. It maintains a PyTorch-like interface while parallelizing activations, convolutions, linear layers, and attention through a distributed matrix multiplication backend. We demonstrate the usability of Jigsaw across a wide range of model architectures and shows performance when scaling multi-billion-parameter models sharded across up to 8 processes and compares the scalability to DDP, FSDP, and Megatron-LM approaches.
Speakers
avatar for Deifilia Kieckhefen

Deifilia Kieckhefen

Doctoral Researcher, Karlsruhe Institute of Technology
Deifilia Kieckhefen is a doctoral researcher at the Karlsruhe Institute of Technology. She works on scalable and distributed training of neural network architectures.
Tuesday April 7, 2026 15:00 - 15:10 CEST
Founders Cafe
  Training Systems
  • Audience Level Any
  • Slides Attached Yes

16:10 CEST

Optimizing Reinforcement Learning at Trillion-Parameter Scale - Songlin Jiang, Aalto University & Mind Lab
Tuesday April 7, 2026 16:10 - 16:35 CEST
This talk will dive into how we implemented and optimized reinforcement learning on trillion-parameter Mixture-of-Experts reasoning models using veRL, Megatron-Bridge and vLLM. The session is useful to anyone building large-scale RL training systems.

For the first part, I will walk through the system design required to make RL work at this scale using LoRA: how LoRA adapters are implemented for expert layers, how adapters are sharded and fused under tensor/pipeline/expert parallelism, and most importantly, how refit (parameter sync) is implemented for LoRA between training backend (Megatron) and rollout engine (vLLM).

The second part of the talk focuses on training–inference mismatch in MoE RL. I will explain why common mitigations such as clipping and importance sampling can fail, and how we implement fixed Router Replay R3 across vLLM, veRL, and Megatron to align routing decisions between rollout and training.

These works are done together with Mind Lab and some of the related blog posts are at:
- https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus
- https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
Speakers
avatar for Songlin Jiang

Songlin Jiang

Doctoral Researcher, Aalto University & Mind Lab
I am a doctoral researcher at Aalto University, focusing on reducing training and inference latency for Reinforcement Learning and Large Language Models (LLMs) on High-Performance Computing (HPC) clusters. I am also a passionate free software developer, a maintainer of VeRL, and a... Read More →
Tuesday April 7, 2026 16:10 - 16:35 CEST
Junior Stage
  Training Systems

16:10 CEST

TorchStore: What We Learned Building Distributed Storage Solutions for AysncRL - Lucas Pasqualin, Danielle Pintz, Allen Wang, Amir Afzail Meta
Tuesday April 7, 2026 16:10 - 16:35 CEST
Asynchronous Reinforcement Learning (AsyncRL) workloads have unique data sharing requirements: actors must efficiently exchange large tensors across processes and nodes, often with different sharding configurations—not just at checkpoint time, but continuously during training for live weight synchronization. This talk presents Torchstore, an open-source distributed tensor storage system built on Monarch actors that tackles these challenges. We'll share the key lessons learned—from designing pluggable transport backends (RDMA, shared memory, RPC) to implementing transparent live DTensor resharding that lets producers and consumers use entirely different parallelism strategies. We'll also discuss the friction we encountered integrating with inference engines like vLLM, where differing model definitions and integrations present new bottlenecks. Whether you're building actor-based training systems or thinking about disaggregated training-inference architectures, you'll leave with practical insights on distributed tensor storage design.
Speakers
avatar for Lucas Pasqualin

Lucas Pasqualin

ML Engineer, PyTorch (Meta)
Lucas has been developing Machine Learning Applications and Machine Learning infrastructure at scale for years, and has recently been focused on extending the product offering of PyTorch's Distributed Checkpointing stack.
AW

Allen Wang

Software Engineer, Meta
avatar for Danielle Pintz

Danielle Pintz

Software Engineer, Meta
Danielle is a software engineer working on PyTorch, currently focused on TorchStore and Async RL. She previously worked on the Llama Research team.
avatar for Amir Afzali

Amir Afzali

Software Engineer, Meta
Software engineer working on Pytorch distributed infra and large scale training
Tuesday April 7, 2026 16:10 - 16:35 CEST
Master Stage
 
  • Filter By Date
  • Filter By Venue
  • Filter By Type
  • Audience Level
  • Slides Attached
  • Timezone

Share Modal

Share this link via

Or copy link

Filter sessions
Apply filters to sessions.
Filtered by Date -