Loading…
7-8 April, 2025
Paris, France
View More Details & Registration
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Company: Intermediate clear filter
arrow_back View All Dates
Tuesday, April 7
 

11:00 CEST

Lightning Talk: Why Your Forecasting Transformer Isn’t Working (And How To Fix It in Python) - Rosheen Naeem, Open Climate Fix
Tuesday April 7, 2026 11:00 - 11:10 CEST
Renewable energy is clean — but it’s also inherently variable. Solar PV generation can change dramatically within minutes due to cloud cover and weather conditions, making accurate short-term forecasts essential for grid stability, energy trading, and smart-home optimisation.
Open Climate Fix builds open and high-impact forecasting tools to accelerate the transition to a low-carbon energy system. One of these projects is Open Quartz Solar Forecast: an open-source model that uses public PV generation data, site metadata, and numerical weather prediction variables to forecast solar power for any location.
In this talk, I’ll present a real case study from my Google Summer of Code project where I implemented and trained a Temporal Fusion Transformer for multi-horizon solar forecasting. I’ll cover the practical engineering challenges behind making transformer forecasting work in Python: building continuous training windows, aligning weather forecast steps with observations, separating static vs time-varying features, and stabilising training using PyTorch Forecasting and PyTorch Lightning.
Attendees will leave with reusable patterns for real-world time-series forecasting pipelines.
Speakers
avatar for Rosheen Naeem

Rosheen Naeem

Software Engineer, Miro
I am a Software Engineer at Miro and a community member at Open Climate Fix. I completed the Erasmus Mundus Master’s in Software Engineering for the Green Deal (SE4GD), a joint degree program across Vrije Universiteit Amsterdam (Netherlands), LUT University (Finland), and Universit... Read More →
Tuesday April 7, 2026 11:00 - 11:10 CEST
Central Room
  Applications & Case Studies

11:00 CEST

Lightning Talk: Training Embedding Model Resiliently for Multimodal Model Inference Routing - Huamin Chen, Red Hat & Haichen Zhang, AMD
Tuesday April 7, 2026 11:00 - 11:10 CEST
LLM systems increasingly rely on intelligent routing to balance cost, latency, and quality tradeoffs. The vLLM Semantic Router, a vLLM Ecosystem project, provides both semantic and performance level routing intelligence for Mixture-of-Multimodal Models (MoM) architectures, but its effectiveness depends on fast and accurate classifiers.

This talk presents our end-to-end journey training production-grade embedding and classification models on AMD GPUs using native PyTorch, achieving high GPU utilization with distributed training optimizations.

We introduce a multilingual text embedding model with 32K context window and 2D Matryoshka support, and multimodal embedding models, trained on AMD GPUs using PyTorch DDP. The talk covers practical training optimizations for AMD ROCm. All training code uses native PyTorch distributed primitives, with additional enhancement to improve training stability and pipeline efficiency.

Attendees will learn how to train efficient classifiers for LLM routing systems and integrate these models into production inference pipelines.
Speakers
avatar for Huamin Chen

Huamin Chen

Technical Advisor, Microsoft
Dr. Huamin Chen is a passionate developer. He co-founded the Semantic Router project under vLLM community. His recent contributions to the CNCF ecosystem include Project Kepler, TAG Environmental Sustainability, and Cloud Native AI WG. He is also one of the founding members... Read More →
avatar for Haichen Zhang

Haichen Zhang

Senior AI Software Engineer, AMD
Haichen is the Senior AI Engineer for AMD AI Group, specializing in accelerating training and inference for large language models, recommender systems, computer vision (CV), and natural language processing (NLP) tailored to internet customers. Before joining AMD, Haichen worked at... Read More →
Tuesday April 7, 2026 11:00 - 11:10 CEST
Junior Stage

11:00 CEST

Helion 1.0: A High-Level DSL for Performance Portable Kernels - Oguz Ulgen, Meta
Tuesday April 7, 2026 11:00 - 11:25 CEST
ML practitioners increasingly author bespoke kernels, but achieving portable performance demands low-level expertise and repeated manual tuning for each accelerator generation and type. We introduce Helion, a Python-embedded DSL with a “PyTorch with tiles” programming model that preserves familiar PyTorch APIs while giving developers lower-level control over the generated kernels. Helion integrates tightly with TorchInductor to reuse PyTorch operator lowerings, automatically manages host/device boundaries, and provides rich language constructs for tiling, memory movement, and synchronization. The language defines an implicit high-dimensional configuration space that our autotuner explores, shifting the tuning burden from developers to automated search.

In this session, I will cover both the language and what is new since PTC'25, as well as announcing the official GA launch. This session will be open for both experienced and beginner kernel authors.
Speakers
avatar for Oguz Ulgen

Oguz Ulgen

Software Engineer, Meta
I'm a software engineer at Meta where I used to work on the Hack programming language and now work on PyTorch.
Tuesday April 7, 2026 11:00 - 11:25 CEST
Master Stage

11:15 CEST

Lightning Talk: Flexible Deployment of PyTorch Models on MCU-Class Devices Using ExecuTorch - Robert Kalmar & Martin Pavella, NXP
Tuesday April 7, 2026 11:15 - 11:25 CEST
ExecuTorch has recently matured into a production ready framework designed specifically for efficient edge deployment of PyTorch models. Its architecture supports a broad spectrum of hardware targets—from low power, bare metal or RTOS based microcontrollers (MCU) to higher performance Linux or Android based microprocessor platforms—while meeting the demanding constraints of memory, compute, and power typically found in real world embedded applications.
This talk focuses on the deployment flexibility ExecuTorch offers for MCU class devices, highlighting how different backends enable efficient execution across heterogeneous compute units. We will explore CPU, DSP, and NPU acceleration paths using the Cortex-M, Cadence, Ethos-U, and eIQ Neutron backends, and discuss how these integrate into typical ML model deployment workflows.
To make the session practical and application oriented, we will present an optimization journey aimed at reducing power consumption—an essential requirement for ML workloads in energy constrained environments. Attendees will gain insights into backend selection, performance trade offs, and best practices for suitable deploying PyTorch models on edge devices.
Speakers
avatar for Robert Kalmar

Robert Kalmar

Principal AI/ML Engineer at NXP Semiconductors, NXP Semiconductors
Robert Kalmar is a Principal Machine Learning Engineer at NXP Semiconductors. He received his master’s degree in machine learning and intelligent systems from Brno University of Technology. At NXP he focus on machine learning solution enablement for embedded and mobile devices... Read More →
avatar for Martin Pavella

Martin Pavella

ML SW Engineer, NXP Semiconductors
I hold a Master’s degree in Machine Learning from the Brno University of Technology, graduating with distinction at both bachelor’s and master’s levels. I am a mid-level AI/ML Software Engineer at NXP Semiconductors with 2.5+ years of experience. I won the 2025 iGEM overgraduate... Read More →
Tuesday April 7, 2026 11:15 - 11:25 CEST
Junior Stage
  Inference & Production

11:30 CEST

Lightning Talk: Coding Agents for Compiler Construction: Beyond the AI Assistant Paradigm - Reza Rahimi, yasp.ai & Stefan Krassin, yasp
Tuesday April 7, 2026 11:30 - 11:40 CEST
Modern ML compilers follow a familiar pattern: a frontend lowers models into an intermediate representation, while a backend applies graph and kernel optimizations before generating code for target accelerators. PyTorch provides strong foundations through nn.Module, FX, and graph capture, but implementing optimized backends remains challenging due to hardware diversity and kernel-level complexity.

Optimizing GPU kernels is hard. Few engineers do it well. Hardware architectures evolve yearly, and with hyperscalers, chip makers, and AI labs building custom silicon, demand for efficient kernel generation keeps growing. This creates a gap between model developers and hardware capabilities.

This talk explores coding agents as engineering tools for compiler construction, not general-purpose assistants. We discuss how agents can generate and refine backend components by analyzing model mathematics and hardware specifications to produce optimized kernels tailored to specific targets.

We present a compiler architecture built as a PyTorch add-on that accepts PyTorch models or FX graphs and produces executable artifacts, demonstrating practical integration with existing PyTorch workflows.
Speakers
avatar for Reza Rahimi

Reza Rahimi

CTO, yasp
Reza Rahimi is a seasoned technologist with a strong background in accelerating engineering software and scaling machine learning systems. With experience leading teams across embedded AI, compiler design, and model optimization, he now serves as CTO of yasp, where he is pioneering... Read More →
avatar for Stefan Krassin

Stefan Krassin

CEO, yasp.ai
With a background in electrical engineering and a career spanning embedded systems to executive leadership, he combines technical expertise with a vision for scale. After 10+ years of leading companies to outstanding growth, he co-founded yasp in 2023. His mission is to eliminate... Read More →
Tuesday April 7, 2026 11:30 - 11:40 CEST
Founders Cafe
  Agents & Interop

11:30 CEST

Tour De Force: LLM Inference Optimization From Simple To Sophisticated - Christin Pohl, Microsoft
Tuesday April 7, 2026 11:30 - 11:55 CEST
Making your GPUs go brrr is complex. Efficient LLM inference requires navigating a maze of optimization techniques each with different trade-offs. This session provides a practical journey through inference optimizations, clearly categorized by implementation effort.

We'll explore techniques across three levels:

- Model choices (start here): Model selection, quantization, smart routing

- Library-level improvements (using PyTorch-based frameworks like vLLM, SGLang, TensorRT-LLM): Continuous batching, KV-cache management, tensor parallelism

- Custom implementations: Speculative decoding with custom draft heads, disaggregated inference, fine-tuning smaller models

The session covers practical trade-offs and key metrics: time to first token, inter-token latency, throughput, and cost per token.

Whether deploying your first model or optimizing at scale, this talk delivers actionable insights into which techniques to prioritize for deeper investigation.
Speakers
avatar for Christin Pohl

Christin Pohl

Global Black Belt Solution Engineer AI Infrastructure, Microsoft
Christin Pohl is a Global Black Belt Solution Engineer for AI Infrastructure at Microsoft (Switzerland), now in her third year. After building her first chatbot in 2018 and 5+ years at SAP, she helps enterprises worldwide choose the right GPU, run LLM training and inference end-to-end... Read More →
Tuesday April 7, 2026 11:30 - 11:55 CEST
Master Stage

11:45 CEST

Lightning Talk: TorchJD: Jacobian Descent in PyTorch - Pierre Quinton, EPFL & Valérian Rey, Simplex Lab
Tuesday April 7, 2026 11:45 - 11:55 CEST
Jacobian descent (JD) is an extension of gradient descent supporting the optimization of vector-valued functions. This algorithm can be used to train neural networks with multiple loss functions (e.g. multi-task learning). JD iteratively updates the parameters of the model using the Jacobian matrix of the vector of losses (the matrix stacking each individual loss' gradient).

To support and extend our research, we have developed the TorchJD library. With it, it's easy and efficient to compute the Jacobians with respect to the model parameters, and to aggregate them into an update direction that is beneficial to every objective. In contrast, if we had averaged the losses and used gradient descent, the update would have been beneficial to the average loss, but may have actually increased one of the individual losses.

In this session, we will give a quick introduction to the theory behind Jacobian descent, and then show how to use TorchJD on a variety of use-cases, beyond multi-task learning.

Library: https://github.com/TorchJD/torchjd
Paper: https://arxiv.org/abs/2406.16232
Speakers
avatar for Pierre Quinton

Pierre Quinton

Teacher, EPFL
PhD in Information Theory and Master in Data Science, specializing in fundamental math and multi-objective optimization (MOO). I am the co-author of TorchJD, a PyTorch library for Jacobian Descent developed with Valerian, currently at ~300 GitHub stars. My work aims to translate complex... Read More →
avatar for Valérian Rey

Valérian Rey

Research Engineer, Simplex Lab
I graduated from EPFL with a MSc in Data Science in 2021. Since then, I worked as a Data Scientist as Withings, and I worked on Jacobian descent, initially as a side-project, but now as a full-time occupation. I now spend most of my time developing and maintaining TorchJD, and I love... Read More →
Tuesday April 7, 2026 11:45 - 11:55 CEST
Founders Cafe
  Training Systems

13:45 CEST

Lightning Talk: From Pretrained To Personal: Privacy-First Fine-Tuning on AI PCs - Daniel Holanda Noronha & Iswarya Alex, AMD
Tuesday April 7, 2026 13:45 - 13:55 CEST
Pytorch on AI PCs crossed a threshold: local hardware can now support meaningful model fine-tuning, not just inference. This unlocks a new class of enterprise workflows where sensitive data never leaves the device, yet models can still be personalized and adapted using PyTorch.

In this session, we’ll show how to design on-device fine-tuning pipelines for AI PCs, focusing on enterprise scenarios where privacy is non-negotiable: regulated healthcare data, government and public-sector workloads, financial services, and proprietary enterprise systems. We’ll walk through key decisions such as selecting efficient pre-trained models, and how the right PyTorch optimizations enable effective personalization on large private datasets.

We'll also showcase practical fine-tuning techniques such as supervised fine-tuning (SFT), LoRA, and QLoRA, and show how mixed-precision training and correct use of training vs. evaluation modes make these approaches efficient and practical on AI PCs while preserving privacy. The result is a cloud-free, privacy-first fine-tuning blueprint that turns AI PCs into secure personalization engines for enterprise AI.
Speakers
avatar for Daniel Holanda

Daniel Holanda

Solutions Architect & ML Engineer, AMD
Daniel is a Sr. ML Engineer at AMD, specializing in local AI. He leads the development of local fine-tuning workflows for AI PCs and co-leads several open-source projects where he designs production-grade LLM/VLM tooling to accelerate the AI development lifecycle.

Previously, he was a Machine Learning Engineer at Groq and a contributor to Microsoft’s Project Brainwave. Daniel holds a PhD in AI understanding and hardware architecture from UBC... Read More →
avatar for Iswarya Alex

Iswarya Alex

Iswarya Alex, AMD
I am an ML Engineer at AMD focused on enabling high-performance on-device AI experiences. I work on optimizing and deploying models on AMD's Ryzen AI powered devices with GPUs and NPUs efficiently
Tuesday April 7, 2026 13:45 - 13:55 CEST
Founders Cafe
  Security & Privacy

13:45 CEST

Bringing ExecuTorch To the Next Frontiers of Edge AI - Mergen Nachin, Meta
Tuesday April 7, 2026 13:45 - 14:10 CEST
Since the General Availability release of ExecuTorch 1.0 in October 2025, our team has continued to advance the state of the on-device AI software stack. In this talk, we will share our upcoming roadmap and present demos that highlight ExecuTorch’s deployment across the next frontiers, such as AI PCs, robotics, TinyML devices, and the integration of AI agents to improve productivity for on-device deployment.

ExecuTorch is built on open source collaboration, encouraging community adoption, contributions from hardware partners, and interoperability with other ecosystem libraries. We will discuss how these foundations set the stage for the next phase of edge AI with ExecuTorch.
Speakers
avatar for Mergen Nachin

Mergen Nachin

Software Engineer, Meta
Mergen Nachin is a Software Engineer specializing in creating rich AI experiences on low latency, high performance, and privacy-aware embedded systems. With a background in distributed systems, developer infrastructure, remote sensing, and localization, he brings a versatile skill... Read More →
Tuesday April 7, 2026 13:45 - 14:10 CEST
Master Stage
  Applications & Case Studies

13:45 CEST

Teaching PyTorch To Read Your Worst PDFs With Docling - Mingxuan Zhao & Peter Staar, IBM & Carol Chen, Red Hat
Tuesday April 7, 2026 13:45 - 14:10 CEST
Building production RAG pipelines starts with a problem most teams underestimate: getting clean, structured data out of real-world documents. PDFs lose table structure, figures get separated from captions, and multi-column layouts become unreadable. Before your PyTorch models even see your data, crucial information is already lost.
Docling is an open-source, MIT-licensed document parsing library that uses PyTorch-based deep learning models to understand documents the way humans read them. It preserves hierarchy, extracts structured data from tables and figures, and supports over ten common file formats through a consistent API. Because everything runs locally, it integrates cleanly into PyTorch-native workflows with low latency and no data leaving your infrastructure.
In this talk, I'll walk through Docling's PyTorch-powered architecture and show how to build document processing pipelines for RAG and other GenAI applications. I'll also share the architecture of real-world applications of Docling and how it has improved workflows. You'll leave with practical patterns for connecting Docling to your own PyTorch-based GenAI stack.
Speakers
avatar for Carol Chen

Carol Chen

Principal AI Community Architect, Red Hat
Carol Chen is a Community Architect at Red Hat, having led several upstream communities including InstructLab, Ansible and ManageIQ. She has been actively involved in open source communities while working for Jolla and Nokia previously. In addition, she also has experiences in software... Read More →
avatar for Mingxuan Zhao

Mingxuan Zhao

Software Developer/Developer Advocate, IBM
Ming Zhao is an open source developer and Developer Advocate at IBM Research, where he helps IBM leverage open technologies while building impactful tools and growing vibrant open-source communities. He’s passionate about making open tech accessible to all and ensuring developers... Read More →
Tuesday April 7, 2026 13:45 - 14:10 CEST
Junior Stage

14:15 CEST

The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google
Tuesday April 7, 2026 14:15 - 14:40 CEST
Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.

We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.

In this talk, we present a sidecar implementation for PyTorch-based servers (like vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results, discussing how varying chunk sizes impact priority handling and tail latency. Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean—offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
Speakers
avatar for Maroon Ayoub

Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
avatar for Kellen Swain

Kellen Swain

Senior Software Engineer, Google
Kellen is a Senior Engineer at Google, and is a maintainer of both the llm-d and Inference Gateway projects.
Tuesday April 7, 2026 14:15 - 14:40 CEST
Central Room

14:45 CEST

Lightning Talk: Implementing Single-Dim Strategies With Sharding Validator - Anshul Sinha, Meta
Tuesday April 7, 2026 14:45 - 14:55 CEST
DTensor sharding propagation is a major bottleneck to full operator coverage: adding or fixing an op strategy is complex, bug‑prone, and gaps often surface as unexpected resharding and extra collectives. A key source of complexity is that today’s rules conflate (1) semantic correctness—valid input/output sharding combinations for an operator—with (2) search‑space pruning to avoid combinatorial blowups on N‑dimensional meshes.

This talk presents a landed prototype that separates these concerns via Single Mesh Dim Strategies: each operator specifies valid placement combinations for one mesh dimension, while infra expands/composes them across the full mesh and selects low‑cost strategies. For contributors, this provides a clear path to refactor existing op_strategies into single‑dim rules that are easier to review and extend. We also introduce a Truth Table‑style sharding validator that systematically tests shapes and sharding specs to check soundness/completeness and to flag unnecessary redistribution/collectives caused by missing cases.

The goal of this presentation is faster, higher‑confidence contributions that improve correctness and expand DTensor operator coverage.
Speakers
avatar for Anshul Sinha

Anshul Sinha

Software Engineer, Meta
I graduated from the University of Michigan with a B.S in Computer Science in December 2024. I joined Meta's PyTorch Distributed as a SWE in June 2025.
Tuesday April 7, 2026 14:45 - 14:55 CEST
Founders Cafe
  Frameworks & Compilers

14:45 CEST

Brevitas Quantization Library - Pablo Monteagudo Lago, AMD
Tuesday April 7, 2026 14:45 - 15:10 CEST
Brevitas is an open‑source PyTorch library from AMD designed to support the research of state‑of‑the‑art quantization methods, including Qronos (ICLR 2026) and MixQuant (arXiv). Built for flexibility and composability, it offers modular components for exploring reduced‑precision data paths and accuracy‑preserving techniques.
As generative models scale, post‑training quantization (PTQ) has become the preferred strategy for maintaining quality without retraining, yet PTQ methods are often applied in isolation due to fragmented tooling. Brevitas provides a unified environment for modern PTQ algorithms—including Qronos, SpinQuant and AutoRound—enabling practitioners to combine complementary techniques effectively.
Brevitas leverages the latest PyTorch features, like Dynamo for tracing and selectively modifying compute graphs—for example, by inserting rotation ops to mitigate outliers. It integrates with frameworks like transformers and supports export flows including vLLM and GGUF, ensuring a smooth transition from experimentation to deployment.
This talk shows how to use Brevitas for an end‑to‑end quantization flow, showcasing how its flexibility enables new research directions.
Speakers
avatar for Pablo Monteagudo Lago

Pablo Monteagudo Lago

Research Scientist, AMD
Pablo Monteagudo is a research scientist in AMD Research and Advanced Development, based in Dublin. He specialises in co-design of neural networks and accelerators, in particular, working on topics involving neural network quantization, sparsity and accelerator design.
Tuesday April 7, 2026 14:45 - 15:10 CEST
Junior Stage
  Frameworks & Compilers

14:45 CEST

The Science and Practice of Open and Scalable LLM Evaluations - Grzegorz Chlebus, NVIDIA
Tuesday April 7, 2026 14:45 - 15:10 CEST
Rapid advances in AI have expanded the range of capabilities required for successful real-world deployment. Understanding where we are in this multi-dimensional frontier is essential for accelerating innovation through effective quality assurance. Rigorous evaluation is increasingly difficult to scale as development requires testing many checkpoints across numerous benchmarks. Model comparison is further complicated by limited transparency of reported results. This talk explores challenges, best practices, and open-source tools that elevate evaluation to a core component of LLM development, delivering continuous signals across the model lifecycle.
We discuss principles for standardizing evaluation methods and improving consistency through practical patterns and anti-patterns, and examples of integrating the science of evaluation directly into model development. Using Nemo-Evaluator, an open-source scalable evaluation tool, we demonstrate modular architectures that enable transparent, reproducible measurement. Finally, we show how Nemo-Evaluator supports reproducible evaluation for the Nemotron model family, helping enable one of the most open development processes in modern AI.
Speakers
avatar for Grzegorz Chlebus

Grzegorz Chlebus

Manager R&D, NVIDIA
Grzegorz Chlebus is a Manager at Frontier Model Evaluation at NVIDIA, where he leads tooling and infrastructure efforts for evaluating frontier AI models. He holds a PhD in Medical Sciences from Radboud University Nijmegen, focused on deep learning-based medical image segmentation... Read More →
Tuesday April 7, 2026 14:45 - 15:10 CEST
Central Room
  GenAI & Multimodal

15:40 CEST

Lightning Talk: Graph Based Pipeline Parallelism - Sanket Purandare, Meta & Simon Fan, Meta PyTorch
Tuesday April 7, 2026 15:40 - 15:50 CEST
Pipeline parallelism is vital for large models, but advanced schedules for SOTA LLMs are difficult to express in current PyTorch. MoE communication dominates the critical path, making latency hiding essential. Leading systems use fw-bw overlapping; fw-fw and bw-bw overlapping further boost throughput.

Schedules like ZeroBubbleV and DualPipeV rely on dI-dW backward splitting for fine-grained overlap. However, eager-mode implementations require a patchwork of fragile integrations (multi-threading, custom autograd functions, activation checkpointing, etc.) that rely on implicit behavior and hand-written logic with poor torch.compile compatibility and upstream composability.

We present Graph-Based PP: stages are compiled to reusable FX graphs executed via an explicit schedule language. Users write standard PyTorch code while specifying schedules at varying granularity; all manipulations run as graph passes, abstracting complexity away from user code and into the compiler/runtime, allowing for greater composability.

We have integrated Graph-PP into TorchTitan and AutoParallel on real MoE workloads, targeting upstream inclusion in torch.distributed.
Speakers
avatar for Simon Fan

Simon Fan

Software Engineer, Meta
I work on the PyTorch team at Meta, focusing on distributed training efficiency.
avatar for Sanket Purandare

Sanket Purandare

Research Engineer, Meta
Currently, Sanket serves as a Research Engineer at Meta's SuperIntelligence Lab, in PyTorch Distributed and Compiler team. He specializes in performance optimization of large scale training of LLMs based on Mixture of Experts architectures.

Prior to this he obtained his PhD in A... Read More →
Tuesday April 7, 2026 15:40 - 15:50 CEST
Master Stage
  Frameworks & Compilers

15:40 CEST

Enabling State-of-the-art Asynchronous Execution in Torch.compile With CUDA Streams - Michael Lazos, Meta
Tuesday April 7, 2026 15:40 - 16:05 CEST
CUDA streams are a widely-used method for parallelizing GPU computation on NVIDIA GPUs. They have long been requested by our users and enable multiple key capabilities - overlapping communication and compute kernels, training on multiple batches in parallel and parallelizing kernels, all of which are needed for achieving SOTA training performance. Another key capability is activation offloading - this can be applied to any model to prevent OOMs by asynchronously storing activations in cpu memory until they are needed by the model.

Before this work, torch.compile previously would graph break on CUDA stream contexts, which can be costly for models that utilize streams. Although workarounds exist (e.g. wrapping stream manipulation into custom ops), these solutions add complexity and create friction in the user experience. By enabling seamless CUDA stream support in PT2, we allow our users to leverage the familiar eager APIs for stream assignment and synchronization directly within torch.compile. This not only simplifies the workflow but also ensures that models using custom streaming patterns can run efficiently out-of-the-box without manual intervention or code restructuring.
Speakers
avatar for Michael Lazos

Michael Lazos

Software Engineer, Meta
Michael Lazos is a software engineer at Meta where he contributes to torch.compile. His expertise spans both graph extraction with TorchDynamo and generating optimized kernels with the backend compiler TorchInductor. Previously, he was at Microsoft contributing to project Brainwave... Read More →
Tuesday April 7, 2026 15:40 - 16:05 CEST
Central Room
  Frameworks & Compilers

15:40 CEST

torch.compile and Diffusers: A Hands-On Guide to Peak Performance - Sayak Paul, Hugging Face
Tuesday April 7, 2026 15:40 - 16:05 CEST
This session shows how to use torch.compile with the Diffusers library to speed up diffusion models like Flux-1-Dev.

You'll learn practical techniques for both model authors and users. For authors, we cover how to make models compiler-friendly using fullgraph=True. For users, we explain regional compilation (which cuts compile time by 7x while keeping the same runtime gains) and how to avoid recompilations with dynamic=True.

We also cover real-world scenarios: running on memory-constrained GPUs using CPU offloading and quantization, and swapping LoRA adapters without triggering recompilation.

Key takeaways:
- Compiling just the Diffusion Transformer (DiT) delivers ~1.5x speedup on H100
- Regional compilation reduces cold-start compile time from 67s to 9.6s
- NF4 quantization cuts memory from 33GB to 15GB
- Combining quantization + offloading drops memory to 12.2GB
- LoRA hot-swap lets you switch adapters without recompiling

Whether you're building diffusion models or using them, this guide helps you get the best performance with minimal effort.
Speakers
avatar for Sayak Paul

Sayak Paul

Research Engineer, Hugging Face
I am a Research Engineer at Hugging Face, working on image and video generation. My day-to-day includes maintaining the Diffusers library, training, and babysitting models. When I am not working, I can be found either watching Suits for the n-th time or playing the guitar.
Tuesday April 7, 2026 15:40 - 16:05 CEST
Junior Stage

15:55 CEST

Lightning Talk: Beyond Generic Spans: Distributed Tracing for Actionable LLM Observability - Sally O'Malley & Greg Pereira, Red Hat
Tuesday April 7, 2026 15:55 - 16:05 CEST
End-to-end observability is non-negotiable for production LLMs to track performance, attribute costs, and validate optimizations. Generating actionable traces from complex distributed inference remains a significant challenge.

We implemented tracing for llm-d, a high-performance distributed LLM inference framework. Using manual OpenTelemetry instrumentation with carefully crafted spans at critical paths, we expose insights that generic tooling can't capture.

This talk explores how distributed tracing illuminates requests through unique inference scenarios:

* Prefix cache-aware routing: Track cache hits and validate whether intelligent scheduling improves TTFT
* Prefill/decode disaggregation: Analyze why each request chose split vs unified processing based on cache locality.
* Wide expert-parallelism: Profile MoE models across multi-node deployments
* Workload autoscaling: Correlate request patterns with scaling decisions

Attendees will learn why LLMOps requires a new approach to distributed tracing, contrasting it with traditional microservices, and how to instrument inference stacks effectively. Walk away ready to add meaningful observability to your own deployments.
Speakers
avatar for Greg Pereira

Greg Pereira

Sr. Machine Learning Engineer, Red Hat
Greg began his career as SRE focusing on CICD and automation in the Emerging Technologies org at redhat. After transferring to the platform and services team he started from the ground up, refocusing on AI centric software development. Three years later he has been involved in building... Read More →
avatar for Sally O'Malley

Sally O'Malley

Principal Software Engineer, Red Hat

Tuesday April 7, 2026 15:55 - 16:05 CEST
Master Stage

16:10 CEST

Optimizing Reinforcement Learning at Trillion-Parameter Scale - Songlin Jiang, Aalto University & Mind Lab
Tuesday April 7, 2026 16:10 - 16:35 CEST
This talk will dive into how we implemented and optimized reinforcement learning on trillion-parameter Mixture-of-Experts reasoning models using veRL, Megatron-Bridge and vLLM. The session is useful to anyone building large-scale RL training systems.

For the first part, I will walk through the system design required to make RL work at this scale using LoRA: how LoRA adapters are implemented for expert layers, how adapters are sharded and fused under tensor/pipeline/expert parallelism, and most importantly, how refit (parameter sync) is implemented for LoRA between training backend (Megatron) and rollout engine (vLLM).

The second part of the talk focuses on training–inference mismatch in MoE RL. I will explain why common mitigations such as clipping and importance sampling can fail, and how we implement fixed Router Replay R3 across vLLM, veRL, and Megatron to align routing decisions between rollout and training.

These works are done together with Mind Lab and some of the related blog posts are at:
- https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus
- https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
Speakers
avatar for Songlin Jiang

Songlin Jiang

Doctoral Researcher, Aalto University & Mind Lab
I am a doctoral researcher at Aalto University, focusing on reducing training and inference latency for Reinforcement Learning and Large Language Models (LLMs) on High-Performance Computing (HPC) clusters. I am also a passionate free software developer, a maintainer of VeRL, and a... Read More →
Tuesday April 7, 2026 16:10 - 16:35 CEST
Junior Stage
  Training Systems

16:10 CEST

TorchStore: What We Learned Building Distributed Storage Solutions for AysncRL - Lucas Pasqualin, Danielle Pintz, Allen Wang, Amir Afzail Meta
Tuesday April 7, 2026 16:10 - 16:35 CEST
Asynchronous Reinforcement Learning (AsyncRL) workloads have unique data sharing requirements: actors must efficiently exchange large tensors across processes and nodes, often with different sharding configurations—not just at checkpoint time, but continuously during training for live weight synchronization. This talk presents Torchstore, an open-source distributed tensor storage system built on Monarch actors that tackles these challenges. We'll share the key lessons learned—from designing pluggable transport backends (RDMA, shared memory, RPC) to implementing transparent live DTensor resharding that lets producers and consumers use entirely different parallelism strategies. We'll also discuss the friction we encountered integrating with inference engines like vLLM, where differing model definitions and integrations present new bottlenecks. Whether you're building actor-based training systems or thinking about disaggregated training-inference architectures, you'll leave with practical insights on distributed tensor storage design.
Speakers
avatar for Lucas Pasqualin

Lucas Pasqualin

ML Engineer, PyTorch (Meta)
Lucas has been developing Machine Learning Applications and Machine Learning infrastructure at scale for years, and has recently been focused on extending the product offering of PyTorch's Distributed Checkpointing stack.
AW

Allen Wang

Software Engineer, Meta
avatar for Danielle Pintz

Danielle Pintz

Software Engineer, Meta
Danielle is a software engineer working on PyTorch, currently focused on TorchStore and Async RL. She previously worked on the Llama Research team.
avatar for Amir Afzali

Amir Afzali

Software Engineer, Meta
Software engineer working on Pytorch distributed infra and large scale training
Tuesday April 7, 2026 16:10 - 16:35 CEST
Master Stage

16:40 CEST

Securing Agentic AI With PyTorch: Threat Modeling & LLM Red Teaming in Practice - Valeri Milke, VamiSec GmbH
Tuesday April 7, 2026 16:40 - 17:05 CEST
Agentic AI systems built with PyTorch introduce a new security paradigm: autonomous decision-making, tool usage, memory, and multi-step reasoning significantly expand the attack surface beyond traditional ML pipelines.

This session presents a practical, security-first approach to building and testing agentic AI systems using PyTorch, combining AI threat modeling and hands-on LLM security testing.

We introduce MAESTRO-based AI Threat Modeling to systematically identify risks across prompts, tools, memory, orchestration and model interactions. Building on this foundation, we demonstrate how the OWASP LLM Top 10 and the OWASP LLM Testing Guide can be applied to real PyTorch-based agent architectures.

The session includes a live demo of a prompt injection attack against an agentic workflow, showing how task delegation and tool invocation can be abused — and how developers can detect, mitigate and test these risks early in the AI development lifecycle.

Attendees will leave with concrete techniques to integrate AI security testing and threat modeling into PyTorch-based systems, bridging research, engineering and real-world AI risk.
Speakers
avatar for Valeri Milke

Valeri Milke

CEO, VamiSec GmbH
Valeri Milke is an AI security and cybersecurity specialist focusing on secure AI and agentic system design. He works at the intersection of PyTorch-based AI engineering, threat modeling and LLM security testing. His work includes AI red teaming, prompt injection analysis and the... Read More →
Tuesday April 7, 2026 16:40 - 17:05 CEST
Junior Stage
 
  • Filter By Date
  • Filter By Venue
  • Filter By Type
  • Audience Level
  • Slides Attached
  • Timezone

Share Modal

Share this link via

Or copy link

Filter sessions
Apply filters to sessions.
Filtered by Date -