7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Wednesday, April 8
 

08:00 CEST

Registration & Badge Pick-Up
Wednesday April 8, 2026 08:00 - 15:25 CEST

Lobby

08:00 CEST

Community Expo
Wednesday April 8, 2026 08:00 - 15:40 CEST

Open Platform

09:00 CEST

Keynote: PyTorch CTO - Matt White, Global CTO of AI, Linux Foundation
Wednesday April 8, 2026 09:00 - 09:10 CEST
Matt White, Global CTO of AI and CTO at the PyTorch Foundation, will provide an update on technical strategy, the ecosystem, projects, and working groups.
Speakers
Matt White

Global CTO of AI, Linux Foundation, The Linux Foundation
Matt White is the Executive Director of the PyTorch Foundation and GM of AI at the Linux Foundation. He is also the Director of the Generative AI Commons. Matt has years of experience in applied research and standards in AI and data in telecom, media and gaming industries. Matt is... Read More →
Wednesday April 8, 2026 09:00 - 09:10 CEST
Master Stage
  Keynote Sessions
  • Audience Level Any
  • Slides Attached Yes

09:10 CEST

Keynote: vLLM & Ray Updates - Tyler Michael Smith, Chief Architect - Inference Engineering, Red Hat & Artur Niederfahrenhorst, Member of Technical Staff, Anyscale
Wednesday April 8, 2026 09:10 - 09:25 CEST

Speakers
Tyler Michael Smith

Chief Architect - Inference Engineering, Red Hat
Tyler received a PhD in Computer Science at The University of Texas at Austin, studying high performance dense linear algebra - microkernels, parallelism, and theoretical lower bounds on data movement. After a postdoc at ETH Zürich, he joined Neural Magic, first working on a graph... Read More →
Artur Niederfahrenhorst

Member of Technical Staff, Anyscale
Artur is a member of the technical staff at Anyscale, the company that recently donated Ray to the Linux Foundation. He has been contributing to Ray since early 2022, where his main contributions have been in distributed reinforcement learning. Artur majored in Computer Science at... Read More →
Wednesday April 8, 2026 09:10 - 09:25 CEST
Master Stage
  Keynote Sessions
  • Audience Level Any
  • Slides Attached Yes

09:25 CEST

Keynote: The Hub as Infrastructure. From Open PyTorch Models, to a Safe and Performant Distribution Hub - Lysandre Debut, Chief Open-Source Officer, Hugging Face
Wednesday April 8, 2026 09:25 - 09:40 CEST

Speakers
Lysandre Debut

Chief Open-Source Officer, Hugging Face
Lysandre is the Chief Open-Source Officer at Hugging Face, ensuring that the ecosystem is as well supported as possible across the ML lifecycle with open-source tools.

He has been at Hugging Face for the past six years and was its first open-source employee, working on transformers and the entire stack of Hugging Face open-source libraries since then... Read More →
Wednesday April 8, 2026 09:25 - 09:40 CEST
Master Stage
  Keynote Sessions
  • Audience Level Any
  • Slides Attached Yes

09:45 CEST

Sponsored Keynote: Open Source Infrastructure for the AI Native Era - Jonathan Bryce, Executive Director, Cloud Native Computing Foundation
Wednesday April 8, 2026 09:45 - 09:50 CEST
AI adoption will not be limited by model ideas alone. It will be limited by how fast we can deploy, secure, observe, and scale AI systems in production. Inference is where AI becomes real for most organizations. As AI moves from frontier labs into mainstream production, the operational challenges start to look increasingly cloud native: orchestration, autoscaling, routing, security, policy, and observability. This keynote explores why the next phase of AI adoption will move faster if PyTorch and cloud native communities work together to extend proven open source patterns.
Speakers
Jonathan Bryce

Executive Director, Cloud and Infrastructure, The Linux Foundation
Jonathan Bryce is the Executive Director of Cloud & Infrastructure at the Linux Foundation, where he leads both the Cloud Native Computing Foundation (CNCF) and the OpenInfra Foundation—two of the largest and most influential open source communities in the world. With over... Read More →
Wednesday April 8, 2026 09:45 - 09:50 CEST
Master Stage
  Keynote Sessions
  • Audience Level Any
  • Slides Attached Yes

09:50 CEST

Keynote: Gemma 4: Compacting Intelligence for the Edge - Léonard Hussenot, Research Scientist, Google DeepMind
Wednesday April 8, 2026 09:50 - 10:05 CEST
This talk explores the philosophy and engineering behind Gemma 4, arguing that the future of AI isn't only about size, but about "intelligence per byte."
We will dive into why compacting intelligence—maximizing the reasoning and instruction following ability of every single token—is the ultimate bottleneck for truly useful AI. By optimizing for token efficiency and memory footprints, we unlock a new class of applications that are faster, private, and more accessible.
Speakers
Leonard Hussenot

Research Scientist, Google DeepMind
I am a Research Scientist at Google DeepMind, where I lead the Gemma post-training team focused on developing the most useful compact models for on-device applications. Since joining Google Brain, I have contributed to the evolution of Bard, Gemini, and Gemma, specializing in scaling... Read More →
Wednesday April 8, 2026 09:50 - 10:05 CEST
Master Stage
  Keynote Sessions
  • Audience Level Any

10:05 CEST

Birds of A Feather: Disaggregated Tokenization: Building Toward Tokens-In-Tokens-Out LLM Inference - Maroon Ayoub, IBM Research; Hang Yin & Xi Ning Wang, Alibaba Cloud; Nili Guy, IBM; Hyunkyun Moon, Moreh
Wednesday April 8, 2026 10:05 - 10:35 CEST
LLMs are token-in, token-out - but our serving stacks aren't. Tokenization and preprocessing are still locked inside the inference engine, blocking the cache-aware routing and encode/prefill/decode (E/P/D) disaggregation that production deployments demand. To route smart, you need tokens before you reach the backend - and with multi-modal inputs requiring heavy encode-stage preprocessing, this is an architectural imperative, not just an optimization.

In llm-d, we learned this the hard way: three tokenization approaches, three gaps. We're now converging on disaggregated tokenization via vLLM's Renderer API as a gRPC sidecar, and collaborating with the Gateway API Inference Extension community to define the tokens-in-tokens-out interface. For multi-modal workloads, disaggregating preprocessing unlocks independent scaling of encode, prefill, and decode - each with different compute profiles.

Join us to discuss: How should we standardize tokenization and multi-modal preprocessing outside the engine? How does this shape E/P/D disaggregation? What are your pain points? We'll frame the problem from scheduling, vLLM, and gateway perspectives - then open the floor.
Speakers
Xi Ning Wang

Senior Technical Expert, Alibaba Cloud
Wang Xining is a senior technical expert at Alibaba Cloud, focusing on MaaS/LLM, Kubernetes, service mesh and other advanced cloud native technical strategies. Previously worked at IBM as a tech architect focusing on SOA/Cloud and served as the chairman of the Patent Technology Review... Read More →
Hang Yin

Senior R&D Engineer, Alibaba Cloud
Hang Yin is a senior engineer at Alibaba Cloud, focusing on Kubernetes, service mesh, Gateway API Inference Extension and other cloud native fields. Currently serves on the Alibaba Cloud Container Service for Kubernetes (ACK) team, responsible for the development of ACK Gateway with Inference... Read More →
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Nili Guy

IBM Research, IBM
Nili is a Research Manager and Senior Technical Staff Member at IBM Research, co-creator of llm-d, and an expert in distributed inference and Kubernetes-native AI systems. She has led key open-source and productized inference initiatives across IBM’s AI platforms.
Hyunkyun Moon

MLOps Engineer, Moreh
Hyunkyun Moon is an ML Platform Engineer at Moreh, focusing on building high-performance LLM inference platforms with llm-d. He is an active contributor to open-source projects, including llm-d and vLLM. With a strong background in large-scale Kubernetes-native infrastructure, he... Read More →
Wednesday April 8, 2026 10:05 - 10:35 CEST
Open Platform

10:05 CEST

Coffee Break
Wednesday April 8, 2026 10:05 - 10:35 CEST
Menu: 
-Brioche
-Granola bar (Gluten Free, Vegan)
-Seasonal fruits (Gluten Free, Vegan)
-Roasted pumpkin cake
-Dry fruits and dry grapes mix (Gluten Free, Vegan)
Wednesday April 8, 2026 10:05 - 10:35 CEST
Open Platform

10:05 CEST

Meet the vLLM Maintainers
Wednesday April 8, 2026 10:05 - 10:35 CEST
Meet the core maintainers of vLLM at this session! Come and discuss use cases, features, and the roadmap with us, or just learn how vLLM development happens under the hood.
Speakers
Tyler Michael Smith

Chief Architect - Inference Engineering, Red Hat
Tyler received a PhD in Computer Science at The University of Texas at Austin, studying high performance dense linear algebra - microkernels, parallelism, and theoretical lower bounds on data movement. After a postdoc at ETH Zürich, he joined Neural Magic, first working on a graph... Read More →
Nicolò Lucchesi

Senior Machine Learning Engineer, Red Hat
Nicolò is a Senior Machine Learning Engineer at Red Hat with a background in Deep Learning and Computer Vision. He works on Inference Optimization for vLLM, where he is a maintainer.
Wednesday April 8, 2026 10:05 - 10:35 CEST
Open Platform
  Meet the Developers
  • Audience Level Any

10:25 CEST

Sponsor Activity - Validating AI on CPUs: The vLLM 3-Phase Evaluation Framework
Wednesday April 8, 2026 10:25 - 10:40 CEST
Stop guessing your hardware capabilities. This automated test engine benchmarks vLLM on CPUs through controlled, realistic, and production phases, delivering precise metrics on throughput, latency, and optimal KV cache sizing. Join us for a demo!

Sponsor: Red Hat
Location: Red Hat within the Community Showcase

In order to facilitate networking and business relationships at the event, you may choose to visit a third party's booth or access sponsored content. You are never required to visit third party booths or to access sponsored content. When visiting a booth or participating in sponsored activities, the third party will receive some of your registration data. This data includes your first name, last name, title, company, address, email, standard demographics questions (i.e. job function, industry), consenting to receipt and use of such data by the third-party recipients, which will be subject to their own privacy policies. 

Wednesday April 8, 2026 10:25 - 10:40 CEST
Open Platform

10:35 CEST

Lightning Talk: Monarch: An API To Your Supercomputer - Marius Eriksen, Meta
Wednesday April 8, 2026 10:35 - 10:45 CEST
The training systems driving today’s most advanced AIs are distributed, dynamic, and complex. Pre-training relies on layered parallelism and careful fault isolation. Post-training RL spans thousands of GPUs while coordinating verifiers, compilers, and code execution.

Systems complexity pulls focus away from the core algorithms: developers are forced to assemble systems from schedulers, RPC stacks, container orchestrators, observability tooling, service discovery, and app frameworks just to begin work.

Monarch is a distributed programming framework for PyTorch that makes the cluster programmable through a single-program Python API. It exposes the supercomputer as a coherent, directly controllable system, bringing the experience of local development to large-scale training while handling fault tolerance, orchestration, and tooling integration.

In this talk, we will demonstrate how Monarch enables developers to focus on training logic rather than glue, extend systems easily, and supervise and debug distributed systems through a unified programming interface.

Attendees will leave with a clear model for building robust, scalable and customizable distributed PyTorch systems using Monarch.
Speakers
Marius Eriksen

Software Engineer, Meta
Marius Eriksen is a software engineer at Meta, where he works on infrastructure for large-scale training systems.
Wednesday April 8, 2026 10:35 - 10:45 CEST
Master Stage

10:35 CEST

Lightning Talk: Live Migration of PyTorch GPU Nodes From Azure To European Clouds - Mike Krom, Acf Cyber Solutions
Wednesday April 8, 2026 10:35 - 10:45 CEST
Many European PyTorch teams run their GPU workloads on hyperscalers like Azure, AWS, or GCP—often without realizing that this places their data and models under US jurisdiction.

This lightning talk shows how PyTorch compute nodes can be migrated to European cloud providers while keeping the full ML environment intact. Through a live demo, we migrate a GPU-enabled PyTorch VM—including CUDA drivers and Jupyter notebooks—from Azure to European infrastructure, without retraining models or rebuilding environments.

The focus is on practical challenges: GPU compatibility, reproducibility, and data movement across clouds.

The migration is demonstrated using DigitalNomadSky, an open-source Python platform for cross-cloud VM migration, but the lessons apply broadly to PyTorch teams aiming to reduce jurisdictional risk and vendor lock-in.

Key takeaways:
- Why PyTorch workloads on hyperscalers raise sovereignty concerns for EU teams
- What actually breaks (and what doesn’t) when migrating GPU-based ML nodes
- How to regain control over ML infrastructure without rewriting your stack
Speakers
Mike Krom

Partner, ACF Cybersolutions
I am a software architect and lead developer of the open-source project DigitalNomadSky. I have extensive experience with Microsoft Azure from working at Microsoft and supporting large-scale cloud migrations. My work focuses on supporting data science and ML teams with cloud infrastructure... Read More →
Wednesday April 8, 2026 10:35 - 10:45 CEST
Central Room
  Security & Privacy

10:35 CEST

Beyond JSON-RPC: Scaling Model Context Protocols With gRPC in the PyTorch Ecosystem - Ashesh Vidyut & Madhav Bissa, Google
Wednesday April 8, 2026 10:35 - 11:00 CEST
Right now, MCP mostly relies on HTTP and STDIO. That works for simple scripts, but if you’re running high-performance PyTorch models in production, you’re going to hit a wall. When you’re moving large context windows or tensor metadata, the overhead of JSON-RPC starts to hurt.
We’re introducing SEP-1352, which adds gRPC as a native transport for MCP. Since gRPC is already the standard for microservices, it’s a natural fit for the PyTorch ecosystem. By using Protobuf instead of JSON, we get much higher throughput and lower latency—essentially making the communication between models and tools as fast as the models themselves.
In this session, we’ll cover:
- Why Protobuf matters: Moving away from bulky JSON to keep bandwidth low and speed high.
- Built-in Streaming: How to use gRPC’s streaming to handle long-running model outputs without timeouts.
- Production-ready features: Using the same auth, load balancing, and service mesh (mTLS) you already use for your ML microservices.
- Upgrading your stack: How to move from PyTorch MCP HTTP services to MCP gRPC services without throwing away your existing infra.
Speakers
Ashesh Vidyut

Senior Software Engineer, Google

Madhav Bissa

Senior Software Engineer, Google
member, grpc-Go
Wednesday April 8, 2026 10:35 - 11:00 CEST
Junior Stage
  Agents & Interop

10:35 CEST

How To Write C++ Extensions in 2026 - Jane Xu, Meta & Mikayla Gawarecki, Meta
Wednesday April 8, 2026 10:35 - 11:00 CEST
Are you writing a C++ custom op extension to PyTorch? It's 2026 and are you still shipping M x N wheels for M CPython versions and N libtorch versions? Did you know you can just ship 1 wheel that works across multiple CPythons and libtorches? If you're curious how, attend this talk to get the deets on py_limited_api, APIs like torch::stable::Tensor & TORCH_TARGET_VERSION, and generally the latest and greatest ways for keeping your code and your release matrix simple. Get your custom kernel enrolling in new features with benefits proven out in FA3, xformers, torchao, torchaudio, and more in progress! We'll also share some of our vision towards smoother and faster custom ops extensions.
Speakers
Jane Xu

PyTorch SWE, Meta
Hi, I'm Jane! Please don't hesitate to come talk to me about your favorite optimizer, fitting models in GPU memory, how to free C++ extensions from libtorch version, and anything that interests you.
Mikayla Gawarecki

Software Engineer, Meta
Software Engineer on PyTorch
Wednesday April 8, 2026 10:35 - 11:00 CEST
Founders Cafe
  Frameworks & Compilers

10:50 CEST

Lightning Talk: Achieving SOTA GEMM Performance: A CuTeDSL Backend for PyTorch Inductor - Nikhil Patel, Meta
Wednesday April 8, 2026 10:50 - 11:00 CEST
Matrix multiplication is a central compute primitive in modern deep learning, but achieving SOTA performance on novel architectures like NVIDIA Blackwell has become a bottleneck. Existing Triton-based kernels in torch.compile struggle to keep pace with rapid hardware evolution, often forcing users to hand-write custom, architecture-specific kernels - a growing gap as hardware feature velocity accelerates.

We present a new CuTeDSL GEMM backend in PyTorch Inductor that integrates NVIDIA’s kernel implementations directly into torch.compile. Built using the Cutlass API for kernel discovery, this backend allows PyTorch to expose first-class support for NVIDIA-authored GEMMs and automatically leverage new architectural features as NVIDIA updates their kernels.

The backend currently supports standard GEMM, grouped GEMM, and block-scaled MXFP8 GEMM, along with pointwise epilogue fusions (with reductions forthcoming). We present early end-to-end results from vLLM inference and TorchTitan training, demonstrating how this approach enables PyTorch to achieve high-performance GEMMs on Blackwell and beyond, while eliminating the need for users or developers to maintain handwritten kernels.
Speakers
Nikhil Patel

Software Engineer, Meta
Nikhil is a software engineer on the PyTorch Inductor team at Meta Superintelligence Labs, where he works on Inductor’s CuTeDSL GEMM backend. His work sits at the boundary between compiler code generation and hardware-native GPU features, optimizing large-scale training and inference... Read More →
Wednesday April 8, 2026 10:50 - 11:00 CEST
Master Stage
  Frameworks & Compilers

10:50 CEST

Lightning Talk: Step-Aligned Telemetry for Distributed PyTorch Training (Time & Memory Attribution Across Ranks) - Abhinav Srivastav, TraceOpt
Wednesday April 8, 2026 10:50 - 11:00 CEST
Distributed PyTorch training often looks healthy in system dashboards: GPU utilization is high and memory is stable, yet throughput degrades, steps jitter, or GPUs go idle intermittently. The core issue is misalignment: most telemetry is sampled by time, while training progresses by "steps", and distributed behavior is dominated by the slowest rank rather than averages.

In this talk, I will break down common failure modes in DDP training that standard metrics miss (rank stragglers, dataloader stalls, step-time variance, and memory spikes/creep). We will show how step-aligned, rank-aware aggregation changes debugging: per-step worst-rank vs median-rank views, gating to completed steps across ranks, and how to tie time and memory back to training semantics without relying on heavyweight profilers.
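To make the aggregation idea concrete, here is a minimal sketch of a step-aligned, rank-aware summary. It is not taken from the talk or from any tool: the function name `step_aligned_summary`, the input layout (a dict of per-rank step times), and the sample timings are all assumptions chosen for illustration.

```python
from statistics import median


def step_aligned_summary(step_times):
    """Summarize step times per training step rather than per wall-clock sample.

    step_times: {rank: [seconds for step 0, step 1, ...]} (hypothetical layout).
    Returns one row per step that every rank has completed.
    """
    # Gate to steps completed on all ranks, so a lagging rank
    # does not produce a partial, misleading row.
    completed = min(len(times) for times in step_times.values())
    rows = []
    for step in range(completed):
        per_rank = {rank: times[step] for rank, times in step_times.items()}
        worst_rank = max(per_rank, key=per_rank.get)
        rows.append({
            "step": step,
            "worst_rank": worst_rank,
            "worst_s": per_rank[worst_rank],
            "median_s": median(per_rank.values()),
        })
    return rows


# Hypothetical timings: rank 2 straggles on step 1 and has not finished step 2.
timings = {
    0: [0.50, 0.51, 0.50],
    1: [0.52, 0.50, 0.49],
    2: [0.51, 0.90],
}
summary = step_aligned_summary(timings)
for row in summary:
    print(row)
```

In this made-up data the median for step 1 stays near 0.51 s while the worst rank takes 0.90 s, which is the kind of gap an average over ranks would hide.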
Speakers
Abhinav Srivastav

ML Scientist, TraceOpt
ML researcher with a PhD in Computer Science. Industry experience at IBM Research, Huawei Research, and Zalando. Currently building TraceML: an open source tool that shows you the step-level breakdown of your PyTorch training run while it's still running. I am partially interested in... Read More →
Wednesday April 8, 2026 10:50 - 11:00 CEST
Central Room
  Training Systems

11:05 CEST

Lightning Talk: Accelerating PyTorch Models With torch.compile's C++ Wrapper Mode - Bin Bao, Meta
Wednesday April 8, 2026 11:05 - 11:15 CEST
This lightning talk introduces torch.compile's C++ wrapper mode, a powerful feature that reduces CPU overhead and significantly improves model performance. As modern GPUs become increasingly powerful and compiler optimizations make GPU kernels run faster, CPU overhead has become more visible as the bottleneck. By generating optimized C++ code instead of Python, cpp-wrapper mode directly tackles this challenge.

While CUDAGraphs can also reduce CPU overhead, they are not always applicable, especially with highly dynamic input shapes. In these scenarios, cpp-wrapper mode provides a robust alternative with significant performance gains. Benchmark results from the OSS Hugging Face suite demonstrate that cpp-wrapper mode delivers a 39% speedup over default torch.compile.

Attendees will learn when and how to leverage cpp-wrapper mode to overcome CPU-bound limitations and understand how this feature fits into PyTorch's performance optimization landscape, enabling them to build faster machine learning applications.
Speakers
Bin Bao

Software Engineer, Meta
Bin Bao is a software engineer working with the PyTorch Compiler team at Meta. He focuses on developing TorchInductor optimizations and AOTInductor for C++ deployment.
Wednesday April 8, 2026 11:05 - 11:15 CEST
Junior Stage
  Frameworks & Compilers

11:05 CEST

Lightning Talk: KV-Cache Centric Inference: Building a State-Aware Serving Platform With llm-d and vLLM - Maroon Ayoub & Martin Hickey, IBM Research
Wednesday April 8, 2026 11:05 - 11:15 CEST
We’ve spent years optimizing LLM inference around compute - faster kernels, better batching, smarter parallelism. But in production, the bottleneck increasingly isn’t FLOPs. It’s state. Specifically, the KV-cache: the attention state that makes the difference between a 4-second prefill and a sub-second cache hit. Lose it to eviction, isolate it on a single node, or fail to route to it - and you’re paying the full compute cost again for work already done.

KV-cache centric inference flips the design priority. Instead of treating cache as a byproduct, it becomes the organizing principle of the serving platform. This means tiered memory management - offloading KV blocks from GPU to CPU to shared storage so capacity scales beyond any single node. It means cross-replica visibility - so cached state computed on one instance is reusable by any other. And it means cache-aware scheduling - routing requests to where their prefix already lives.

We cover how llm-d and vLLM implement each layer, how they compose into a coherent system, and what it looks like in practice - with benchmarks, deployment patterns, and lessons from building a KV-cache centric platform in the open.
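As a rough illustration of cache-aware scheduling, the sketch below routes a request to the replica whose cached prefixes overlap most with the incoming tokens. It is not the llm-d or vLLM implementation: the helper names, the replica names, and the in-memory map of cached prefixes are all assumptions, and raw token lists are compared directly to keep the example short.

```python
def shared_prefix_len(a, b):
    """Count how many leading token ids two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(request_tokens, cached_prefixes):
    """Pick the replica with the longest cached prefix for this request.

    cached_prefixes: {replica_name: [token-id lists cached there]}
    (a hypothetical view of cross-replica cache state).
    Returns (replica_name, number_of_reusable_tokens).
    """
    best_replica, best_hit = None, -1
    for replica, prefixes in cached_prefixes.items():
        # default=0 covers replicas with an empty cache.
        hit = max(
            (shared_prefix_len(request_tokens, p) for p in prefixes),
            default=0,
        )
        if hit > best_hit:
            best_replica, best_hit = replica, hit
    return best_replica, best_hit


# Hypothetical cache state across three replicas.
cached = {
    "replica-a": [[1, 2, 3, 4]],
    "replica-b": [[1, 2, 9]],
    "replica-c": [],
}
choice = pick_replica([1, 2, 3, 7, 8], cached)
print(choice)
```

A router like this needs the request's tokens before it reaches a backend, and it only works if each replica's cache contents are visible to the scheduler.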
Speakers
Martin Hickey

Senior Technical Staff Member, IBM Research
Martin Hickey is a STSM at IBM Research, focused on Open Source, Cloud Native Computing, and AI. Martin has notable contributions to open source projects like vLLM, LMCache, Kubernetes, Helm, OpenTelemetry and OpenStack. Martin is a core maintainer for LMCache and an emeritus core... Read More →
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Wednesday April 8, 2026 11:05 - 11:15 CEST
Central Room

11:05 CEST

Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Training on ROCm - Liz Li & Zachary Streeter, AMD
Wednesday April 8, 2026 11:05 - 11:30 CEST
PyTorch Monarch introduces a new distributed programming paradigm that enables developers to orchestrate entire GPU clusters from a single Python program. With its actor-based runtime, process mesh abstraction, and asynchronous execution model, Monarch simplifies large-scale distributed training and enables complex workflows that combine training, evaluation, and reinforcement learning within one unified script.

In this talk, we present our work enabling PyTorch Monarch on AMD Instinct GPUs with ROCm, expanding the single-controller model beyond CUDA environments and bringing this emerging runtime to a broader hardware ecosystem. We describe the engineering effort required to port Monarch’s GPU runtime and distributed communication stack to ROCm, including HIPification of CUDA-specific components, adaptation of memory management and synchronization semantics, and integration with high-performance GPU-to-GPU communication on multi-node clusters through RDMA.

We will share lessons learned from running Monarch workloads on MI300-class clusters, including performance considerations, debugging workflows, and developer experience improvements. Our results demonstrate that Monarch’s architecture can be successfully extended to heterogeneous hardware environments while preserving scalability and ease of use.

This work advances hardware diversity in distributed PyTorch and highlights how portable runtimes can simplify large-scale training while enabling scalable, cluster-wide experimentation across accelerator platforms.
Speakers
Liz Li

Principal AI engineer, AMD
Liz Li is a Principal AI Engineer in the AMD AI group, specializing in enabling and optimizing cutting-edge AI models on AMD Instinct GPUs for both distributed inference and training. With over 10 years of experience in computer, graphics, and AI architecture, she has previously led... Read More →
Zachary Streeter

Senior Member of Technical Staff, AMD
I'm a computational physicist who has worked in the field of AI for the past 5 years. I have a wide range of expertise from mathematics to performance optimizations and system engineering. Feel free to nerd out with me! Please connect with me on LinkedIn.
Wednesday April 8, 2026 11:05 - 11:30 CEST
Founders Cafe
  Training Systems
  • Audience Level Any

11:05 CEST

FP8 Training From Hopper To Blackwell - Luca Wehrstedt, Meta
Wednesday April 8, 2026 11:05 - 11:30 CEST
The Hopper generation of NVIDIA GPUs first enabled the use of low-precision float8 data types for training via TensorCore acceleration. However, the recipe to best leverage it was far from settled. Practitioners had to find their way through many entangled decisions around accuracy-vs-efficiency, precision-vs-range, overflows-vs-underflows, and more. The frontier was pushed further forward by the DeepSeek release, and then by the micro-scaling formats introduced by Blackwell. In this talk we will go through all these approaches, comparing their pros and cons, thus guiding researchers in finding the options that work best for them.
Speakers
Luca Wehrstedt

Software Engineer, Meta
Research Engineer in Meta's Fundamental AI Research team (FAIR). At the intersection of research and infrastructure, Luca specialized in training efficiency and distributed communication. Regular contributor to PyTorch.
Wednesday April 8, 2026 11:05 - 11:30 CEST
Master Stage
  Training Systems

11:20 CEST

Lightning Talk: Building AI That Ops Teams Actually Trust - Robert King, Chronosphere / Palo Alto Networks
Wednesday April 8, 2026 11:20 - 11:30 CEST
You've built an AI that identifies root causes of incidents faster than any human could... but there's one problem: no one trusts it.

Ops teams are skeptical by nature. They've been burned by noisy alerts, black-box tools, and "intelligent" systems that weren't.
This talk covers what we learned building AI for incident response across enterprise environments: why technically correct recommendations get ignored, and how to design for skepticism from day one.

I'll share specific patterns that moved the needle:

- Validating agent responses before they reach users, catching hallucinations, weak reasoning, and overconfident outputs
- Explainability that fits the operator's mental model, not the data scientist's
- Feedback loops that improve the AI and build user trust simultaneously
- Rollout strategies that let teams build confidence gradually

Whether you're using LLMs, agents, or traditional ML for operational tasks, the trust problem is the same. Ship something wrong during an incident and you've lost your users for months.

You'll leave with a practical framework for validating AI outputs and building the kind of trust that gets recommendations acted on.
Speakers
Robert King

Senior Sales Engineer, Chronosphere
Robert is Lead Enterprise Solutions Engineer at Chronosphere and an OpenTelemetry contributor. He recently presented on AI Observability with OpenTelemetry at Cloud Native London https://www.youtube.com/live/qF4wz-pha1w?si=PFzjNcGkbD4pFKnA&t=625 and has spoken at AWS Summit, and other... Read More →
Wednesday April 8, 2026 11:20 - 11:30 CEST
Junior Stage
  Inference & Production

11:20 CEST

Lightning Talk: Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving - Maroon Ayoub, IBM Research & Hyunkyun Moon, Moreh
Wednesday April 8, 2026 11:20 - 11:30 CEST
Agentic AI workloads - tree-of-thought exploration, ReAct loops, hierarchical swarms - expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. We are currently evicting high-value state to preserve throwaway tokens.

In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.

We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch, applying differentiated policies to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax" - evolving the PyTorch serving stack from request-centric to workload-aware.
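The differentiated policies can be pictured with a small sketch. This is only an illustration of type-aware eviction, not the runtime presented in the talk: the `Block` fields, the tier ordering, and the `eviction_order` helper are assumptions, and replication of shared tool definitions is not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class BlockKind(Enum):
    SYSTEM_PROMPT = "SystemPrompt"
    TOOL_DEFINITION = "ToolDefinition"
    REASONING_BRANCH = "ReasoningBranch"


@dataclass
class Block:
    block_id: int
    kind: BlockKind
    last_used: int           # logical clock tick of the last access
    completed: bool = False  # only meaningful for reasoning branches


def eviction_order(blocks):
    """Return evictable blocks, most evictable first (hypothetical policy)."""
    def priority(block):
        if block.kind is BlockKind.REASONING_BRANCH and block.completed:
            tier = 0  # finished thoughts are evicted eagerly
        elif block.kind is BlockKind.REASONING_BRANCH:
            tier = 1  # live branches go next
        else:
            tier = 2  # tool definitions are kept as long as possible
        # Within a tier, fall back to least-recently-used ordering.
        return (tier, block.last_used)

    # System prompts are pinned, so they are never eviction candidates.
    candidates = [b for b in blocks if b.kind is not BlockKind.SYSTEM_PROMPT]
    return sorted(candidates, key=priority)


blocks = [
    Block(0, BlockKind.SYSTEM_PROMPT, last_used=1),
    Block(1, BlockKind.TOOL_DEFINITION, last_used=2),
    Block(2, BlockKind.REASONING_BRANCH, last_used=9, completed=True),
    Block(3, BlockKind.REASONING_BRANCH, last_used=5),
]
order = [b.block_id for b in eviction_order(blocks)]
print(order)
```

With these sample blocks, a plain LRU policy would evict block 0 (the system prompt) first because it was touched longest ago, whereas the typed ordering starts with the completed reasoning branch.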
Speakers
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Hyunkyun Moon

MLOps Engineer, Moreh
Hyunkyun Moon is an ML Platform Engineer at Moreh, focusing on building high-performance LLM inference platforms with llm-d. He is an active contributor to open-source projects, including llm-d and vLLM. With a strong background in large-scale Kubernetes-native infrastructure, he... Read More →
Wednesday April 8, 2026 11:20 - 11:30 CEST
Central Room

11:35 CEST

Lightning Talk: Enabling the Audio Modality for Language Models - Eustache Le Bihan, Hugging Face
Wednesday April 8, 2026 11:35 - 11:45 CEST
As the maintainer of everything audio in `transformers` (the library), I will share how audio is being integrated into large language models, grounded in what we observe from the open-source ecosystem.

Beginning with a brief overview of the current landscape of Audio LMs, I'll then highlight emerging trends in how audio is incorporated into pretrained text backbones. In particular, we examine the growing convergence of architectural choices, many inspired by VLMs, as well as newer concepts such as audio tokenization and streaming.

The core of the talk focuses on providing the audience with key technical insights: audio encoders vs audio tokenizers, their respective advantages and limitations. It covers the motivations behind introducing concepts such as audio tokenizers and audio processors into transformers, shows how these design choices are reflected in the library, and explains how PyTorch tooling is leveraged to make audio a standardized modality for the open-source community.
Speakers

Eustache Le Bihan

MLE, Hugging Face
A 2024 MVA graduate, I now work on open-source audio at Hugging Face. My current focus is on standardising audio in the transformers library and strengthening support across models.
Wednesday April 8, 2026 11:35 - 11:45 CEST
Founders Cafe

11:35 CEST

Accelerating Complex-Valued Tensors With torch.compile - Hameer Abbasi, OpenTeams Inc.
Wednesday April 8, 2026 11:35 - 12:00 CEST
torch.compile has been invaluable in accelerating many machine learning and scientific computing workflows. It has become a one-shot way to get free performance for many kinds of programs and models.

However, it comes with its own set of limitations. One of these limitations is that, for a long time, torch.compile didn't accept complex-valued tensors. These tensors have many uses, from quantum mechanics to simplifying the physics for world models. Support for such tensors would accelerate many of these workflows.

In this talk, we will take a journey through the current progress on supporting such tensors in torch.compile, some of the challenges encountered, and what we hope to achieve, including side benefits such as reducing binary size by JIT-compiling kernels on demand.
Speakers

Hameer Abbasi

Senior Software Engineer I, OpenTeams, Inc.
Hameer Abbasi is a Senior Software Developer at OpenTeams, Inc. As part of his day job and also as a hobby, he has contributed to various projects in the scientific computing space, including NumPy, SciPy and PyTorch. He is also the lead maintainer of PyData/Sparse, a library for...
Wednesday April 8, 2026 11:35 - 12:00 CEST
Junior Stage
  Frameworks & Compilers

11:35 CEST

Optimizing Large MoE Inference on NVIDIA Blackwell: NVFP4, ADP, and DualPipe Strategies - Julien Demouth, NVIDIA
Wednesday April 8, 2026 11:35 - 12:00 CEST
Deploying massive Mixture-of-Experts (MoE) architectures like DeepSeek-V3/R1 requires a co-designed approach leveraging NVIDIA Blackwell’s fifth-generation Tensor Cores. This session details the transition to NVFP4 precision for MoE weights to significantly reduce memory load, coupled with FP4/FP8 KV caching to minimize attention layer footprint and enable higher concurrency.
We will analyze the architectural shift to Expert Parallelism (EP) for expert layers to maximize FLOPS, and Attention Data Parallelism (ADP) for attention heads—avoiding redundant KV replication and converting Multi-Head Latent Attention (MLA) into Multi-Query Attention (MQA) via weight absorption. The talk will demonstrate advanced execution strategies, including DualPipe algorithms to overlap dispatch/combine communication with computation, and the integration of DeepGEMM and FlashInfer kernels. Finally, we will cover runtime optimizations using Programmatic Dependent Launch (PDL) and CUDA Graphs to minimize host latency, alongside Multi-Token Prediction (MTP) for accelerated speculative decoding.
Speakers

Julien Demouth

Senior Distinguished Engineer - Eng. Lead for AI Labs & Models, NVIDIA
Wednesday April 8, 2026 11:35 - 12:00 CEST
Central Room

11:35 CEST

Portable High‑Performance LLM Serving: A Triton Backend for vLLM - Burkhard Ringlein, IBM Research & Jan van Lunteren, IBM
Wednesday April 8, 2026 11:35 - 12:00 CEST
Today, vLLM is the de-facto industry standard for serving Large Language Models and is widely adopted in production.

However, for most of its history, vLLM’s state-of-the-art performance depended largely on hand-written CUDA or HIP kernels. These kernels have typically been carefully optimized for a specific GPU platform and may pose a serious obstacle to the portability of vLLM across different hardware.

Leveraging Triton, we introduced a “Triton attention backend” to vLLM that produces highly competitive performance across GPU platforms with a single code base, without involving hand-written CUDA or HIP kernels. The Triton attention backend became the default for AMD GPUs and is used in scenarios where other attention backends have missing features. Additionally, this backend automatically selects appropriate specialized kernels based on model type or request length.

In this talk, we will present our recent advances that consistently deliver high performance on both NVIDIA and AMD GPUs with a single Triton-only code base. We will present the engineering and science behind this Triton-only backend, including system aspects, kernel improvements, and launch grid optimizations.
Speakers

Jan van Lunteren

Senior Research Scientist, IBM Research
Jan van Lunteren is a Senior Research Scientist at IBM Research Zurich holding MSc and PhD degrees in Electrical Engineering. His research has covered a broad range of topics, including high‑speed networking, near‑memory computing, and high‑performance machine‑learning inference...

Burkhard Ringlein

Research Staff Member, IBM Research
Dr. Burkhard Ringlein is a Research Staff Member in the AI Platform team of IBM Research, based in Zurich. He is an accomplished AI systems researcher and designs, builds, debugs, and optimizes practical systems for low-latency, high-throughput machine learning applications. Currently...
Wednesday April 8, 2026 11:35 - 12:00 CEST
Master Stage

12:00 CEST

Attendee Lunch
Wednesday April 8, 2026 12:00 - 13:30 CEST
Menu | Boxed Lunch

Vegan: (Vegetarian)
-Organic green lentils from Beauce, lentil hummus, and red cabbage pickles
-Chocolate cookie

Gluten-Free: (Vegetarian)
-Organic Beauce quinoa with dried fruit, coconut yogurt with herbs
-Yogurt to drink

Classic:
Bulgur wheat and red lentil salad (Vegetarian)
Cereal bread, poached salmon, and vegetables
Or
Pastrami burger with vegetable caviar and tomato sauce
Or
Round baguette with artichoke tapenade, arugula, tomato, and Parmesan cheese (Vegetarian)
Brownie  (Vegetarian)
Wednesday April 8, 2026 12:00 - 13:30 CEST
Open Platform

13:00 CEST

Sponsor Activity - Lobster Trap: OpenClaw in Containers
Wednesday April 8, 2026 13:00 - 13:10 CEST
In this demo, we containerize OpenClaw with Docker/Podman, wire up HashiCorp Vault so secrets work identically on a laptop and in a cluster, and then deploy to K8s. With containers, one teammate's carefully built agent becomes a deployable team standard.

Sponsor: Red Hat
Location: Red Hat within the Community Showcase

In order to facilitate networking and business relationships at the event, you may choose to visit a third party's booth or access sponsored content. You are never required to visit third party booths or to access sponsored content. When visiting a booth or participating in sponsored activities, the third party will receive some of your registration data. This data includes your first name, last name, title, company, address, email, standard demographics questions (i.e. job function, industry), consenting to receipt and use of such data by the third-party recipients, which will be subject to their own privacy policies. 

Wednesday April 8, 2026 13:00 - 13:10 CEST
Open Platform

13:30 CEST

Lightning Talk: From Hugging Face To Handheld: Scaling LLM Deployment With LiteRT Generative API - Cormac Brick & Weiyi Wang, Google
Wednesday April 8, 2026 13:30 - 13:40 CEST
This session will demonstrate the E2E journey of bringing custom PyTorch-based open-source LLMs to cross-platform devices using LiteRT. We will show developers how to take a custom Hugging Face Transformers checkpoint and convert it for on-device execution, including:
-Taking the PyTorch model from conversion to deployment.
-Automated Optimization: How LiteRT performs automated patching of performance-critical components, including architecture-specific rewrites for PyTorch models.
-Seamless Fine-Tuning Integration: How to move from an Unsloth fine-tuning session to a TorchAO-quantized model and LiteRT export without leaving your script.
-The "0-Day" Enablement Strategy: Well-known architectures are supported out-of-the-box. We’ll share how we enabled the QWEN0.6 (or Liquid AI) model in just 20 minutes.
-Interactive Validation: Run inference on the exported model directly in the Terminal or Colab to verify numerical correctness before deploying to device.
This workflow shows a smooth fine-tune-to-deployment story where everything stays within the original PyTorch/Hugging Face ecosystem. Viewers can "vibe code" along using Gemini CLI or other coding agents.
Speakers

Cormac Brick

Principal Engineer, Google AI Edge, Google
Cormac Brick is a Principal Engineer on the Google AI Edge team, where he specializes in frameworks and on-device AI. He has over 10 years of experience in AI software, silicon and systems, with work spanning AI frameworks and ecosystems and compilers down to silicon microarchitecture...

Weiyi Wang

Software Engineer, Google
Weiyi is a lead software engineer on LiteRT/TFLite, focusing on the compiler, NPU and GenAI stack.
Wednesday April 8, 2026 13:30 - 13:40 CEST
Central Room

13:30 CEST

PyTorch on RISC-V: From Cross-Compilation To Native CI - Ludovic Henry, Meta
Wednesday April 8, 2026 13:30 - 13:55 CEST
As RISC-V matures into a viable architecture for AI and data center workloads, bringing first-class PyTorch support to the ecosystem is a critical milestone. This session provides a technical deep dive into the ongoing efforts to port PyTorch natively to RISC-V, moving beyond experimental cross-compilation toward a stable, tested, and optimized environment. We detail the challenges of reconciling native math library dependencies like OpenBLAS and oneDNN with RISC-V Vector (RVV) extensions, alongside the work required to upstream these accelerations to ensure sustainable, long-term performance.

The talk also addresses the critical "last mile" of the Python ecosystem: ensuring that the broader dependency tree—including NumPy, SciPy, and ONNX—is natively available and performant on the architecture. Finally, we examine the primary bottleneck for official support: CI infrastructure. We outline the roadmap for transitioning from tagged cross-compilation to a native testing pool, discussing the logistics of maintaining a reliable hardware fleet to meet the high-volume validation standards required for the PyTorch master branch and pull request workflows.
Speakers

Ludovic Henry

Software Engineering Lead, Rivos
Ludovic works at the intersection of open-source software and emerging hardware. He is a key contributor to the RISC-V ecosystem, focusing on the performance and stability of the AI stack. His recent work involves optimizing native dependencies like OpenBLAS and oneDNN and establishing...
Wednesday April 8, 2026 13:30 - 13:55 CEST
Junior Stage

13:30 CEST

PyTorch Symmetric Memory + NCCL Device APIs: A New Path Towards Multi-GPU Kernels - Ke Wen & Sylvain Jeaugey, NVIDIA
Wednesday April 8, 2026 13:30 - 13:55 CEST
As large models shift toward inference and Mixture-of-Experts (MoE) architectures, small batch sizes and dynamic routing present new scaling challenges. Fused, customized multi-GPU kernels are emerging as the solution, but programming them for high performance remains difficult. This talk introduces a paradigm shift enabled by PyTorch Symmetric Memory and NCCL device APIs.

PyTorch Symmetric Memory provides a unified infrastructure for direct GPU-to-GPU memory access without CPU involvement. By leveraging symmetric tensor allocation and CUDA Graph-compatible signaling, it enables fine-grained, dynamic data exchange while bypassing traditional "send/receive" overhead.

We further demonstrate how NCCL device APIs simplify this model using in-kernel primitives for NVLink and GPU-Initiated Networking (GIN). We will showcase practical examples of compute-communication fusion, such as AllGather-Matmul, and customized patterns like deduplicated expert all-to-all.

These abstractions represent one of the most significant evolutions in the PyTorch and NCCL ecosystems, offering a versatile path to high-performance distributed programming.
Speakers

Ke Wen

Principal Software Architect, NVIDIA
Ke Wen works on distributed features, including Symmetric Memory, multi-GPU kernels, Expert Parallelism, inference, pipelining and graph analysis.

Sylvain Jeaugey

Distinguished Engineer, NVIDIA
Sylvain has been developing the NCCL library since its inception in 2015. He has been working on optimizing communication libraries for large parallel systems for more than 20 years.
Wednesday April 8, 2026 13:30 - 13:55 CEST
Master Stage

13:30 CEST

Optimizing CPU LLM Inference in PyTorch: Lessons From vLLM - Crefeda Rodrigues, Arm Limited & Fadi Arafeh, Arm
Wednesday April 8, 2026 13:30 - 13:55 CEST
vLLM has emerged as a reference inference stack in the PyTorch ecosystem for high-throughput large language model serving. CPUs continue to play an important role in LLM inference, supporting cost-sensitive deployments, hybrid CPU/GPU serving, and batch or off-peak workloads on general-purpose infrastructure.

In this talk, we examine CPU-based LLM inference through the lens of PyTorch internals, using vLLM as a case study. We describe how vLLM interacts with PyTorch’s operator stack, including tensor layout management, backend dispatch, and threading behaviour, and highlight common sources of overhead such as repeated weight repacking and poor threading behaviour.

We present runtime and kernel-level optimizations that reduce overhead, including CPU paged-attention kernel tuning with vectorized softmax, specialized Q–K and P–V GEMM kernels aligned with vLLM’s scheduler, ISA-aware BF16 attention, pre-packed weight layouts for quantized matmul, SIMD vectorization using PyTorch’s at::vec::Vectorized primitives, and NUMA-aware scheduling for scalable parallel inference.

Finally, we conclude with lessons learned from building and upstreaming a high-performance CPU inference engine.
Speakers

Crefeda Rodrigues

Staff Software Engineer, Arm
Crefeda Rodrigues is a Staff Software Engineer at Arm, focusing on performance and scalability driven machine learning software optimization for Arm server CPUs. She previously worked on large-scale climate and weather model optimization as a postdoctoral researcher at the University...

Fadi Arafeh

Senior Machine Learning Engineer, Arm
Fadi is a Senior Machine Learning Engineer at Arm, working on optimizing PyTorch and vLLM for Arm Infrastructure cores. Prior to that, Fadi obtained a BSc in Artificial Intelligence from the University of Manchester.
Wednesday April 8, 2026 13:30 - 13:55 CEST
Founders Cafe
  Inference & Production

13:45 CEST

Lightning Talk: Slash LLM Cold-Start Times by Pre-distributing GPU Caches - Billy McFall & Maryam Tahhan, Red Hat
Wednesday April 8, 2026 13:45 - 13:55 CEST
Are your Large Language Model (LLM) deployments stuck waiting for GPU kernels to compile? If you are running distributed inference at scale, your infrastructure is likely wasting time rebuilding the same GPU Kernel Cache for every single instance. You may not even realize the time and resources that are being consumed for rebuilding. This session is designed for platform engineers and ML practitioners who need to optimize inference scaling and reduce startup latency.

We will demonstrate how to eliminate redundant compilation by pre-distributing GPU kernel caches to all the inference nodes using KServe, a distributed model inference runtime for Kubernetes. Beyond just the "what," we will dive into the technical implementation of signing, verifying, and mounting cache images to ensure supply-chain security across clusters. Attendees will leave with a practical blueprint for reducing cold-start times and securing GPU-heavy workloads in production.
Speakers

Billy McFall

Sr. Principal Software Engineer, Red Hat
Billy McFall has been a software engineer in the Emerging Tech Networking Team within the Office of the CTO at Red Hat for 9+ years. Billy previously worked on Kubernetes/OpenShift networking, including the integration of the NVIDIA DPU into OpenShift. Billy has also been a maintainer of...

Maryam Tahhan

Principal Engineer, Red Hat
Maryam is a Principal Engineer in Red Hat's Office of the CTO, where she focuses on standardising CPU inferencing performance evaluation to help effectively validate and scale ML workloads.
Wednesday April 8, 2026 13:45 - 13:55 CEST
Central Room
  Inference & Production

14:00 CEST

Lightning Talk: Pluggable PyTorch LLM Inference Architecture With vLLM and AWS Neuron Backends - Yahav Biran, Annapurna Labs & Maen Suleiman, Amazon
Wednesday April 8, 2026 14:00 - 14:10 CEST
As PyTorch-based LLM serving matures, the challenge shifts from monolithic inference stacks to integrating diverse hardware accelerators efficiently. This session explores how modular plugin architectures enable PyTorch models to run optimally across backends—demonstrating AWS Trainium integration into vLLM through standardized interfaces.

We'll examine how vLLM's Hardware Plugin architecture uses Python's entry_points for automatic platform detection, allowing hardware vendors to extend PyTorch inference without fragmenting the codebase. This delivers automatic device detection, modular feature development, and seamless integration with PyTorch's model loading patterns.

The technical deep dive covers NeuronWorker and NeuronxDistributedModelRunner extending vLLM base classes, NKI kernels for attention and MoE, and continuous batching with prefill/decode separation. We'll demo Hugging Face models loading through standard vLLM APIs and executing on Trainium without hardware-specific code.

Attendees learn how plugin architectures enable hardware vendors to join PyTorch inference while maintaining standard workflow compatibility.
Speakers

Maen Suleiman

Product Manager, Amazon

Yahav Biran

Principal Architect, Amazon
Yahav Biran is a Principal Architect at AWS, focusing on large-scale AI workloads. He contributes to open-source projects and publishes in AWS blogs and academic journals, including the AWS compute and AI blogs and the Journal of Systems Engineering. He frequently delivers technical...
Wednesday April 8, 2026 14:00 - 14:10 CEST
Junior Stage

14:00 CEST

Lightning Talk: Backpropagation-Free Optimization in PyTorch - Andrii Krutsylo, Polish Academy of Sciences
Wednesday April 8, 2026 14:00 - 14:10 CEST
Backpropagation is not the only mechanism for training deep networks. This talk presents a compact, implementation-driven map of backpropagation-free training methods, organized around representative algorithms that expose key design trade-offs.

We focus on four families: Difference Target Propagation (target-based credit assignment), Direct Feedback Alignment (random feedback without weight transport), local loss / greedy layerwise training (strictly local objectives), and Forward-Forward learning as a forward-only alternative. Each is treated as a minimal working pattern rather than a full system.

For each representative, we answer the same practical questions: what learning signal is propagated, what intermediate state must be stored, how parameters are updated, and what limits scalability on modern accelerators. The emphasis is on PyTorch-level mechanics—explicit update loops, local objectives, and training without autograd—rather than derivations.

The goal is to give practitioners a clear mental model of the backprop-free design space and concrete patterns for experimenting with these methods in real PyTorch training pipelines.
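To make the "explicit update loop without autograd" pattern concrete, here is a minimal sketch of Direct Feedback Alignment, one of the four families named above, in plain Python. It is not the speaker's code: the `DFANet` class, the layer sizes, the tanh hidden layer, the squared-error loss, and the learning rate are all assumptions chosen to keep the example small.

```python
import math
import random


def matvec(m, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DFANet:
    """Two-layer MLP trained with Direct Feedback Alignment."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = init(n_hidden, n_in, rng)
        self.w2 = init(n_out, n_hidden, rng)
        # Fixed random feedback matrix: never updated, used in place of
        # the transposed forward weights (no weight transport).
        self.b = init(n_hidden, n_out, rng)

    def forward(self, x):
        h = [math.tanh(a) for a in matvec(self.w1, x)]
        return h, matvec(self.w2, h)

    def step(self, x, target, lr=0.05):
        h, y = self.forward(x)                      # stored state: x and h
        e = [yi - ti for yi, ti in zip(y, target)]  # output error
        # Learning signal for the hidden layer: the output error projected
        # through b, gated by the tanh derivative (1 - h^2).
        dh = [fb * (1.0 - hi * hi) for fb, hi in zip(matvec(self.b, e), h)]
        for i, row in enumerate(self.w2):
            for j in range(len(row)):
                row[j] -= lr * e[i] * h[j]
        for i, row in enumerate(self.w1):
            for j in range(len(row)):
                row[j] -= lr * dh[i] * x[j]
        return 0.5 * sum(ei * ei for ei in e)


net = DFANet(n_in=2, n_hidden=4, n_out=1)
loss = net.step([1.0, -1.0], [0.5])
```

The sketch answers the four practical questions for this family: the propagated signal is the output error `e`, the stored state is the input and the hidden activation, parameters are updated by local outer products, and the hidden update needs only `b` rather than a backward pass through `w2`.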
Speakers

Andrii Krutsylo

PhD Candidate, Institute of Computer Science, Polish Academy of Sciences
Andrii Krutsylo is a deep learning researcher focusing on continual learning and optimization dynamics. His work studies experience replay, gradient-free and local learning rules, and structured optimization for adaptive, resource-efficient systems.
Wednesday April 8, 2026 14:00 - 14:10 CEST
Central Room

14:00 CEST

Lightning Talk: Debugging the Undebuggable: Introducing torch.distributed.debug - Tristan Rice, Meta, PyTorch
Wednesday April 8, 2026 14:00 - 14:10 CEST
Distributed training in PyTorch enables unprecedented scale, but it also introduces notoriously difficult debugging challenges. When a job with thousands of ranks hangs or slows down, identifying the root cause can feel like searching for a needle in a haystack. This lightning talk introduces the new PyTorch Distributed Debug Server, a powerful, interactive tool designed to bring clarity and control to the chaos of distributed debugging. We will provide a high-level overview of its architecture and core features, demonstrating how it provides a unified interface to inspect stack traces, analyze performance, and diagnose hangs across all workers simultaneously. Attendees will learn how this extensible server can dramatically reduce debugging time and improve the reliability of large-scale training jobs.
Speakers

Tristan Rice

Software Engineer, PyTorch Distributed, Meta
Software engineer working on PyTorch Distributed and large scale training.
Wednesday April 8, 2026 14:00 - 14:10 CEST
Founders Cafe

14:00 CEST

Deploying PyTorch Models To the Browser and Beyond With Transformers.js - Joshua Lochner, Hugging Face
Wednesday April 8, 2026 14:00 - 14:25 CEST
This session presents a comprehensive engineering roadmap for running Hugging Face Transformers entirely locally in your web browser using Transformers.js. We will explore the end-to-end pipeline required to export, optimize, and deploy PyTorch models to the web, leveraging emerging web technologies like WebGPU for efficient, cross-platform inference.

We will dive into the technical nuances of converting PyTorch models to ONNX using torch.export (Dynamo) and applying runtime-specific optimizations via ONNX Runtime GenAI. This workflow enables the production of highly efficient, quantized model artifacts suitable for browser-based execution. Finally, we will demonstrate how to deploy these optimized models using Transformers.js to create performant, interactive, and visually stunning WebAI experiences.
Speakers

Joshua Lochner

Creator of Transformers.js, Hugging Face
Bringing the power of machine learning to the web. Currently working on Transformers.js (@huggingface 🤗)
Wednesday April 8, 2026 14:00 - 14:25 CEST
Master Stage

14:15 CEST

Lightning Talk: Distributed AI Without the Infrastructure Tax - Yahav Biran, Annapurna Labs & Maen Suleiman, Amazon
Wednesday April 8, 2026 14:15 - 14:25 CEST
Running distributed AI workloads in production requires solving three problems: package compatibility, hardware abstraction, and network configuration. AWS Neuron Deep Learning Containers (DLCs) address all three by providing open-source, production-ready images for Trainium and Inferentia.
This lightning talk shows how DLCs eliminate common failure modes. We'll cover three layers. First, how DLCs solve dependency hell by versioning PyTorch, the Neuron SDK, the XLA backend, and the PyTorch PrivateUse1 dispatcher together as a tested contract. Second, how Dynamic Resource Allocation (DRA) in Kubernetes abstracts hardware complexity, enabling Neuron core slicing, multi-tenant workloads, and topology-aware scheduling without manual device mapping. Third, how pre-configured EFA driver settings ensure zero-copy data movement, avoiding silent performance degradation that can cost 10x throughput.
We'll demonstrate scaling from laptop to 32-node cluster using the same container image and simple Kubernetes manifests.
Attendees will learn how to eliminate weeks of setup time, achieve 65-80% cluster utilization, and deploy workloads confidently. We'll share the GitHub repository and extension patterns.
Speakers

Maen Suleiman

Product Manager, Amazon

Yahav Biran

Principal Architect, Amazon
Yahav Biran is a Principal Architect at AWS, focusing on large-scale AI workloads. He contributes to open-source projects and publishes in AWS blogs and academic journals, including the AWS compute and AI blogs and the Journal of Systems Engineering. He frequently delivers technical...
Wednesday April 8, 2026 14:15 - 14:25 CEST
Junior Stage

14:15 CEST

Lightning Talk: Inside vLLM's KV Offloading Connector: Async Memory Transfers for Higher Inference Throughput - Nicolò Lucchesi, Red Hat
Wednesday April 8, 2026 14:15 - 14:25 CEST
Every LLM request produces KV-cache state that is expensive to recompute. However, GPU memory is limited, and when it fills up, entries are discarded from the cache. A natural mitigation is extending the KV cache to CPU DRAM, which is meaningfully larger than GPU memory.
vLLM 0.11.0 introduced the Offloading Connector - an asynchronous, pluggable API for KV-cache offloading which is bundled with a native CPU backend. This new feature executes transfers concurrently with model computation on the GPU cores by using GPU DMA. This solution offers speedy loading of KV data from DRAM and near zero overhead from offloading. Getting here required rethinking vLLM's memory layout. The default per-layer KV fragmentation devastated transfer throughput. A new contiguous block layout, upstreamed in 0.12.0, increased effective block sizes by up to 125× and delivered an order-of-magnitude improvement in offloading performance.
We'll walk through the connector architecture, memory transfer tradeoffs, the memory layout redesign, and practical guidance for enabling CPU offloading in production.
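Why a contiguous layout helps can be seen with some back-of-the-envelope arithmetic. The sketch below is a simplified model, not vLLM's actual layout: it assumes one fragment per layer holding both keys and values, and every number (16 tokens per block, 8 KV heads, head dimension 128, 2-byte elements, 32 layers) is a made-up example rather than a figure from the talk.

```python
def kv_block_bytes(tokens_per_block, num_kv_heads, head_dim, dtype_bytes):
    """Bytes held by one KV block of a single layer (keys plus values)."""
    return 2 * tokens_per_block * num_kv_heads * head_dim * dtype_bytes


def transfer_plan(num_layers, per_layer_bytes, contiguous):
    """Return (number_of_copies, bytes_per_copy) to move one logical block."""
    if contiguous:
        # All layers of the block sit next to each other: one large copy.
        return 1, num_layers * per_layer_bytes
    # Per-layer fragments: one small copy per layer.
    return num_layers, per_layer_bytes


per_layer = kv_block_bytes(tokens_per_block=16, num_kv_heads=8,
                           head_dim=128, dtype_bytes=2)
fragmented = transfer_plan(32, per_layer, contiguous=False)
contiguous = transfer_plan(32, per_layer, contiguous=True)
```

With these assumed numbers, the fragmented layout needs 32 copies of 64 KiB each, while the contiguous layout moves the same 2 MiB in a single copy. Fewer, larger transfers generally make better use of DMA bandwidth.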
Speakers

Nicolò Lucchesi

Senior Machine Learning Engineer, Red Hat
Nicolò is a Senior Machine Learning Engineer at Red Hat with a background in Deep Learning and Computer Vision. He works on Inference Optimization for vLLM, where he is a maintainer.
Wednesday April 8, 2026 14:15 - 14:25 CEST
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

14:15 CEST

Lightning Talk: Scaling Recommendation Systems To 2K GPUs and Beyond - Zain Huda, Meta
Wednesday April 8, 2026 14:15 - 14:25 CEST
TLDR: In this session, we go over one of the key technologies behind Ads model scaling at Meta: 2D sparse parallelism, which scales sparse recommendation embedding tables beyond 1k GPUs to 8k GPUs, enabling the largest Ads model training runs in production at Meta.

Scaling Laws have dominated LLMs and shown the industry we can achieve better model performance through scaling. The same scaling law can be applied to recommendation systems. However, the path to scaling recommender systems is not the same. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models.

In this talk, we will detail the development of 2D sparse parallelism, tracing its path from research to production to address sparse scaling challenges. We will demonstrate how we optimize these systems to push performance boundaries, increasing speed and reducing memory at scale. Participants will walk away with lessons learned from designing 1,000+ GPU scale systems, and a deeper understanding of how to implement these solutions efficiently in production.
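As a rough mental model, a 2D scheme can be pictured as a grid of ranks: embedding tables are sharded across the ranks of one group and replicated across groups. The sketch below only builds such a grid. It is an assumption-laden illustration, not TorchRec's implementation: `grid_2d` is a hypothetical helper and the contiguous rank assignment is chosen for readability.

```python
def grid_2d(world_size, sharding_group_size):
    """Split ranks into sharding groups (rows) and replica groups (columns)."""
    if world_size % sharding_group_size != 0:
        raise ValueError("world_size must be a multiple of sharding_group_size")
    size = sharding_group_size
    num_groups = world_size // size
    # Ranks in one sharding group jointly hold one full copy of the tables.
    sharding_groups = [list(range(g * size, (g + 1) * size))
                       for g in range(num_groups)]
    # Ranks in one replica group hold the same shard and must stay in sync.
    replica_groups = [[g * size + i for g in range(num_groups)]
                      for i in range(size)]
    return sharding_groups, replica_groups


sharding_groups, replica_groups = grid_2d(world_size=8, sharding_group_size=4)
```

With 8 ranks and groups of 4, the sharding groups are `[0, 1, 2, 3]` and `[4, 5, 6, 7]`, and the replica groups are `[0, 4]`, `[1, 5]`, `[2, 6]`, and `[3, 7]`. Sparse communication is then confined to a group of 4 ranks rather than spanning all 8.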
Speakers

Zain Huda

Software Engineer, Meta
Zain works on large scale training systems for recommender systems at Meta. He works on TorchRec, a library for distributed parallelism for sparse recommender models. He is also one of the authors of 2D sparse parallelism.
Wednesday April 8, 2026 14:15 - 14:25 CEST
Founders Cafe

14:30 CEST

Lightning Talk: Torch-Spyre: Compiling To a Multi-core Dataflow Accelerator With Inductor - David Grove & Olivier Tardieu, IBM
Wednesday April 8, 2026 14:30 - 14:40 CEST
Torch-Spyre (https://github.com/torch-spyre/torch-spyre) is an open source project that provides a PyTorch PrivateUse1 device with OpenReg, including an Inductor backend, for the IBM Spyre Accelerator. IBM Spyre is a high-performance energy-efficient AI accelerator featuring 32 AI-optimized compute cores each with on-chip interconnect and compiler-managed scratchpad memory.

Our goal in this session is to describe how we evolved the Spyre software stack to fully leverage Inductor. This enabled the elimination of a significant fraction of our proprietary compiler code base, resulting in improved compilation time and operation coverage without loss of inference performance. We will highlight several technical challenges in compiling for Spyre-like accelerators and describe how we adapted and extended Inductor to tackle them. In particular, we will discuss our extensions to Inductor to support device-specific tiled Tensor memory layouts, and new compiler optimization passes for core-level work division and scratchpad management. We hope to engage the community in evolving the PyTorch ecosystem to more fully support such accelerators.
Speakers

Dave Grove

Distinguished Research Scientist, IBM
David Grove is a Distinguished Research Scientist at IBM T.J. Watson, NY, USA. He has been a software systems researcher at IBM since 1998, specializing in programming language implementation and scalable runtime systems. He has authored more than sixty peer-reviewed publications...

Olivier Tardieu

Principal Research Scientist, Manager, IBM
Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning...
Wednesday April 8, 2026 14:30 - 14:40 CEST
Junior Stage
  Frameworks & Compilers

14:30 CEST

Lightning Talk: Every Millisecond Counts: The Fine-tuning Journey of an Ultra-Efficient PyTorch Model for the Edge - Pavel Macenauer, NXP Semiconductors
Wednesday April 8, 2026 14:30 - 14:40 CEST
From smart cameras that protect privacy by analyzing video on-device, to wearables that interpret voice and motion instantly, to industrial sensors that prevent failures before they happen, edge AI is shaping our everyday routines and transforming our lives.

Eliminating cloud dependency and making connectivity optional is essential for data staying local. Without cloud, our options become severely limited to the constraints of the devices, and efficiency drives innovation. Every millisecond and milliwatt can unlock a new use case — or limit one.

This talk will explore optimization techniques for vision, audio, and language models that allow them to run on tiny, resource-constrained devices, and fine-tune them to the limit of our model’s latency, accuracy, or power efficiency. We will start with an initial rapid simulation, and follow up with silicon-level tuning with real device profiling feedback.
Speakers
avatar for Pavel Macenauer

Pavel Macenauer

AI/ML R&D Software Lead, NXP Semiconductors
A software lead at NXP Semiconductors leading teams developing tools, runtime libraries, and enabling AI on Edge-class devices. Both professionally and out of human curiosity, Pavel developed software visualizing the World around us. Initially through the lens of a camera, then from... Read More →
Wednesday April 8, 2026 14:30 - 14:40 CEST
Central Room
  Inference & Production

14:30 CEST

Seamless Integration: Custom Kernels in the torch.compile Stack Without Graphbreaks - Kshiteej Kalambarkar, Masaki Kozuki & Pawel Gadzinski, NVIDIA
Wednesday April 8, 2026 14:30 - 14:55 CEST
Custom kernels are essential for high-performance PyTorch workflows, but their integration often comes with a hidden cost. While torch.compile promises speedups, calling custom operations typically triggers graph-breaks: fallbacks to Eager mode that introduce overhead and negate your performance gains.

In this session, we provide a practical roadmap for making your extensions "compiler-aware". Using the Transformer Engine project as a case study, we will show how to utilize the custom_op extension point to bridge the gap between high-performance kernels and the torch.compile stack.

What you will learn:
• Identifying the Friction: How to profile and detect graph-breaks caused by custom extensions.
• The Registration Path: A walkthrough of the custom_op registration process for torch.compile.
• Solving the "Hard Parts": Strategies for handling complex Python-side logic that disrupts graph capture.
• Real-World Impact: How these integrations function within the Transformer Engine to maintain peak throughput.

Who should join: This talk is designed for developers building custom PyTorch extensions who want to understand how advanced operations fit into the compiled stack.
Speakers
avatar for Kshiteej Kalambarkar

Kshiteej Kalambarkar

Software Engineer Frameworks, NVIDIA
Kshiteej Kalambarkar is a software engineer at NVIDIA specializing in PyTorch and compiler technologies, with experience in torch.compile and custom kernel integration
avatar for Masaki Kozuki

Masaki Kozuki

Software Engineer, NVIDIA
Masaki Kozuki is working at NVIDIA on PyTorch.
avatar for Pawel Gadzinski

Pawel Gadzinski

Senior Performance Engineer - Deep Learning, NVIDIA
Pawel Gadzinski is a Deep Learning Performance Engineer at NVIDIA, where he works on the Transformer Engine library, enabling state-of-the-art techniques for accelerating transformer models on NVIDIA GPUs, with a focus on low-precision training.
Wednesday April 8, 2026 14:30 - 14:55 CEST
Master Stage

14:30 CEST

From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face
Wednesday April 8, 2026 14:30 - 14:55 CEST
Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.

We highlight how TRL, as a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can leverage simulated environments (e.g., coding, terminals, browsers) such as OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insights into design patterns, rollout handling, trajectory batching, and advantage computation, showing how robust, multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
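As a rough illustration of the advantage computation mentioned above, the sketch below normalizes each trajectory's reward against the other rollouts sampled for the same prompt, which is the core idea behind group-relative advantages in GRPO. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this example; it is not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts in its group.

    rewards: one scalar reward per trajectory sampled for the same prompt.
    eps is a hypothetical stabilizer that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four multi-turn rollouts for one prompt, each scored once at the end.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form the baseline.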
Speakers
avatar for Kashif Rasul

Kashif Rasul

Research Scientist, Hugging Face
Kashif has a PhD in Mathematics from the Freie Universität Berlin. He is passionate about high-performance computing and reinforcement learning, has presented at NVIDIA's GTC in 2009 and at StrangeLoop in 2012, and is also contributing to a number of data science and deep learning... Read More →
avatar for Sergio Paniego Blanco

Sergio Paniego Blanco

Machine Learning Engineer, Hugging Face
Sergio has extensive experience in open source and artificial intelligence, the field in which he also earned his PhD. For more than eight years he has taken part in initiatives such as Google Summer of Code, where he has contributed as a developer and mentor. Currently... Read More →
Wednesday April 8, 2026 14:30 - 14:55 CEST
Founders Cafe
  Training Systems

14:45 CEST

Lightning Talk: Building a PyTorch-native vLLM Plugin for IBM Spyre - Thomas Parnell, IBM Research & Thomas Ortner, IBM Research Europe - Zurich
Wednesday April 8, 2026 14:45 - 14:55 CEST
IBM Spyre is an AI accelerator used across IBM Z and Power systems for agentic inference in production. Today, we serve models on Spyre using upstream vLLM together with an out-of-tree platform plugin. While the current plugin delivers crucial functionality for our business, it re-uses relatively little of upstream vLLM’s capabilities, and also carries a high maintenance cost.

In this talk, we will describe our efforts to redesign the Spyre vLLM plugin in a more PyTorch-native fashion. We will describe the architectural evolution of the project and describe how it leverages torch‑spyre, an open‑source extension that enables Spyre support in PyTorch via the PrivateUse1 device interface. We discuss key challenges—such as implementing a custom vLLM attention backend for Spyre—and share lessons learned while aligning vLLM’s execution model with Spyre’s hardware capabilities.

Finally, we will demonstrate a vLLM model running natively on Spyre through the new plugin and highlight areas where the community can work together to improve vLLM’s plugin interface. This talk will be especially relevant for those looking to extend vLLM to a wider variety of accelerators and use cases.
Speakers
avatar for Thomas Parnell

Thomas Parnell

Principal Research Scientist, IBM Research
Thomas received his B.Sc. and Ph.D. degrees in mathematics from the University of Warwick, U.K., in 2006 and 2011, respectively. He began his career in the field of EDA, working at Arithmatica and Siglead before joining IBM Research in 2013. During his time at IBM, Thomas has worked... Read More →
avatar for Thomas Ortner

Thomas Ortner

Research Scientist, IBM Research Europe - Zurich
Thomas Ortner is a Research Scientist at IBM Research Europe, Switzerland, in the Emerging Computing and Circuits group. He holds a PhD and an MSc in Computer Science, an MSc degree in Technical Physics and an MSc degree in Software Engineering and Management from Graz University... Read More →
Wednesday April 8, 2026 14:45 - 14:55 CEST
Junior Stage

14:45 CEST

Lightning Talk: Full-Stack PyTorch Robotics VLA: From Data To Edge Via ExecuTorch/OpenVINO - Samet Akcay & Dmitriy Pastushenkov, Intel
Wednesday April 8, 2026 14:45 - 14:55 CEST
While research-centric tools have lowered the entry barrier for robotics data collection, transitioning Vision-Language-Action models to production remains challenging due to fragmented edge deployment paths. This session presents a unified, PyTorch-native workflow spanning the full robotics lifecycle, from data capture and curation to optimized edge execution. We introduce a modular Physical AI pipeline designed to resolve the disconnect between research scripts and real-time hardware. The talk details practical patterns for robotics data capture and policy training in a unified PyTorch ecosystem, followed by concrete steps to export models via ExecuTorch. Using an OpenVINO backend, Quantizer, and AOT compilation, we address latency, accuracy, and operator coverage gaps, and demonstrate efficient on-device VLA inference. Using a WidowX pick-and-sort task as a case study, we demonstrate how to validate latency and numerical tolerances under physical constraints. Attendees will leave with a reference architecture and a checklist for monitoring, safety gates, and managing dataset drift, providing a roadmap for moving robotics VLA from research to production-grade edge deployment.
Speakers
avatar for Dmitriy Pastushenkov

Dmitriy Pastushenkov

AI Software Product Manager, Intel
Dmitriy Pastushenkov is a passionate Software Product Manager at Intel with more than 20 years of comprehensive international experience in industrial automation, the industrial Internet of Things (IIoT), real-time operating systems, and AI. Dmitriy has held various roles in... Read More →
avatar for Samet Akcay

Samet Akcay

Principal AI Engineer, Intel
Samet Akcay is a Principal AI Engineer at Intel who leads ML R&D efforts across Open Edge Platform libraries, including Intel Geti, Datumaro, Anomalib, Training Extensions, and Inference libraries. His research specializes in self-supervised learning and multi-modal object detection... Read More →
Wednesday April 8, 2026 14:45 - 14:55 CEST
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

14:55 CEST

Birds of A Feather: NCCL in the Wild: Scaling Communications To Thousands of GPUs - Jeff Hammond, Gabrielle Talavera, Ke Wen & Asma Farjallah, NVIDIA
Wednesday April 8, 2026 14:55 - 15:20 CEST
We will share the latest updates to NCCL and how they can be used in PyTorch. We invite the community to share their feedback on challenges using NCCL at scale and ways to improve integration of NCCL with PyTorch applications.

Some of the important topics for community discussion include:
- Symmetric memory support and GPU-initiated networking.
- Copy-engine collectives and maximizing overlap of communication and computation for better end-to-end performance.
- Profiling, debugging and tuning, as well as resilience (handling failed nodes without a restart).
Speakers
avatar for Asma Farjallah

Asma Farjallah

AI DevTech, NVIDIA
Asma Farjallah is an AI Developer Technology Engineer at NVIDIA. Prior to her role as DevTech, she was part of the Solution Architect team at NVIDIA for 5 years and was part of the global energy team. Before joining NVIDIA, Asma worked for Intel for 4 years as an Application Engineer... Read More →
avatar for Gabrielle Talavera

Gabrielle Talavera

Product Manager, NVIDIA
Gabrielle Talavera is the Product Manager for NCCL at NVIDIA, focused on shaping the product roadmap and improving the experience of teams building on GPU‑accelerated software. She joined NVIDIA in 2021 as a Solutions Architect, helping customers adopt NVIDIA software and debug... Read More →
avatar for Jeff Hammond

Jeff Hammond

Distinguished Engineer, NVIDIA Helsinki Oy
Jeff Hammond is a Distinguished Engineer in the NCCL team at NVIDIA focused on user education and research outreach. His background is in parallel application and algorithm development, open-source software, and supercomputing architecture. Jeff has made significant contributions... Read More →
avatar for Ke Wen

Ke Wen

Principal Software Architect, NVIDIA
Ke Wen works on distributed features, including Symmetric Memory, multi-GPU kernels, Expert Parallelism, inference, pipelining and graph analysis.
Wednesday April 8, 2026 14:55 - 15:20 CEST
Open Platform

14:55 CEST

Coffee Break
Wednesday April 8, 2026 14:55 - 15:25 CEST
Menu:
-Lemon cake
-Caramelized arlette
-Seasonal fruits (GF, Vegan)
-Roasted pumpkin cake
-Dry fruits and dry grapes mix (GF, Vegan)
-Chocolate Cookie (GF, Vegan)
Wednesday April 8, 2026 14:55 - 15:25 CEST
Open Platform

14:55 CEST

Meet the Ray Maintainers
Wednesday April 8, 2026 14:55 - 15:25 CEST
Meet the core maintainers of Ray at this session! Come and discuss use cases, features, and the roadmap with us, or just learn how Ray development happens under the hood.
Speakers
avatar for Artur Niederfahrenhorst

Artur Niederfahrenhorst

Member of Technical Staff, Anyscale
Artur is a member of the technical staff at Anyscale, the company that recently donated Ray to the Linux Foundation. He has been contributing to Ray since early 2022, where his main contributions have been in distributed reinforcement learning. Artur majored in Computer Science at... Read More →
Wednesday April 8, 2026 14:55 - 15:25 CEST
Open Platform
  Meet the Developers
  • Audience Level Any

15:25 CEST

Lightning Talk: Trinity Large - Torchtitan on 2000+ B300s - Matej Sirovatka, Prime Intellect
Wednesday April 8, 2026 15:25 - 15:35 CEST
In this talk, we'll cover how to use torchtitan to scale training of ultra-sparse mixture-of-experts models across over 2,000 GPUs. We'll walk through the pre-training of Trinity Large, a 400B mixture-of-experts model trained entirely using torchtitan, focusing on maximizing throughput and minimizing the impact of hardware-induced failures. Along the way, we'll discuss challenges like fault tolerance, large-scale distributed training, and ensuring determinism, and how we've addressed each of these using torchtitan. Finally, we'll share insights and common pitfalls to avoid in your own large-scale training runs.
Speakers
avatar for Matej Sirovatka

Matej Sirovatka

Research Engineer, Prime Intellect
Research Engineer at Prime Intellect, mainly focusing on distributed training, performance and scaling.
Wednesday April 8, 2026 15:25 - 15:35 CEST
Founders Cafe
  Training Systems

15:25 CEST

Bridging the Hardware Gap With Code Harnesses on the Hugging Face Kernels Hub - Ben Burtenshaw, Hugging Face
Wednesday April 8, 2026 15:25 - 15:50 CEST
What: We share experiments and tooling to standardise kernel writing for agentic coding.

We present an end-to-end experiment benchmarking 6 harnesses across 10 models on CUDA and Metal kernel writing. We compare agent cost, kernel latency, VRAM usage, and end inference performance, and show how the Kernels Hub enables distribution at scale.

We demo two tools:

Kernels Hub: Infrastructure for writing, maintaining, and distributing reproducible kernels in the PyTorch ecosystem.

HF Skills: A library for defining and evaluating agent skills for ML tasks like kernel writing.

Why: Beyond agentic hype, kernel writing is a fundamental problem requiring robust evaluation to scale the community. High-performance kernels demand rare expertise in memory coalescing, warp-level primitives, and hardware-specific optimization. In practice, builders optimize for the highest market-share hardware, leaving a massive matrix of model×hardware combinations unserved, for example: edge inference with ExecuTorch, local LLMs on Metal via vLLM, classic ML at scale on Intel. This talk is technical, intended for kernel writers and PyTorch builders who want to use agents robustly.
Speakers
avatar for Ben Burtenshaw

Ben Burtenshaw

Community, Hugging Face
Ben Burtenshaw is an MLE in the Hugging Face open source community team, specializing in agents, LLMs, and fine-tuning. He leads the development of open-source educational initiatives like the Agents Course, the MCP Course, and the LLM Course, which bridge the gap between complex... Read More →
Wednesday April 8, 2026 15:25 - 15:50 CEST
Master Stage

15:25 CEST

Beyond the Theory: What Actually Breaks When You Scale Your Disaggregated PyTorch Models - Ekin Karabulut & Ron Kahn, NVIDIA
Wednesday April 8, 2026 15:25 - 15:50 CEST
As inference demand explodes, new techniques to optimize these deployments have emerged. One such technique is disaggregated inference, which splits inference into differently optimized workloads (e.g. prefill and decode) on separate workers. The theory is straightforward: better GPU utilization, better inference performance, and tighter control over SLAs. Deployment in production is not.
Scaling happens at multiple connected levels. Adding prefill workers for a traffic spike? Those workers belong to a prefill leader and must scale as a unit. But your prefill-to-decode ratio matters too: scale prefill without matching decode capacity and you've moved the bottleneck. Placement also plays a role: place prefill and decode far apart in your network topology and KV-cache transfers will kill your latency. Standard autoscaling treats these as independent components. They're not.
In this talk, we'll share what we've learned running disaggregated vLLM and SGLang deployments on K8s: what broke, what worked, and how we're improving performance. We'll evaluate approaches from standard deployments to specialized APIs like LWS and Grove, and discuss how these integrate with frameworks like llm-d and Dynamo.
Speakers
avatar for Ekin Karabulut

Ekin Karabulut

AI/ML Developer Advocate, NVIDIA
Ekin is a Developer Advocate at NVIDIA, following the acquisition of Run:ai. Prior to that, she specialized in the privacy implications of federated learning systems with DNNs in distributed environments as a data scientist. Currently, she is exploring the efficient usage of large... Read More →
avatar for Ron Kahn

Ron Kahn

Senior Software Engineer, NVIDIA
Ron Kahn is a Senior Software Engineer in the NVIDIA Run:ai platform team. Ron works on the design and implementation of workload management systems that abstract Kubernetes complexity for AI practitioners. When not simplifying AI training jobs, Ron can be found cooking something... Read More →
Wednesday April 8, 2026 15:25 - 15:50 CEST
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

15:25 CEST

Building Trust for Users and Regulators Alike: A Cost-Efficient PyTorch Path To Compliance-as-Code - Raja Gopal Hari Vijay, Zoho Corporation
Wednesday April 8, 2026 15:25 - 15:50 CEST
Traditional compliance relies on retroactive logs and manually stitched audit trails, while Opacus, CrypTen, and Captum address isolated concerns without providing end-to-end lifecycle traceability. Compliance-as-Code embeds regulatory controls as executable logic within training and inference pipelines, turning compliance into a continuous engineering function and reducing audit costs.

PyTorch’s dynamic execution model enables real-time auditing and compliance gates across the model lifecycle. Features such as the Dispatcher, custom Autograd functions, and the hook system allow logging, constraint checks, and risk controls to be embedded directly into execution. For example, a fairness gate using training hooks can block model export if disparity exceeds thresholds. Dataset initialization can detect imbalance, while dispatcher-level monitoring generates tamper-resistant audit trails linking data, model versions, and outputs. In deployment, metrics and inference hooks track bias drift, accuracy degradation, and human-intervention counts.

The talk presents practical PyTorch patterns for automated documentation, immutable audit trails, and faster certification in regulated AI deployments.
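To make the fairness-gate idea concrete, here is a minimal framework-free sketch. It assumes the disparity measure is the demographic parity gap (the difference in positive-prediction rates between groups) and that the threshold is 0.2; both are illustrative choices, and every name below is hypothetical rather than taken from the talk. In a real pipeline a check like this would be called from a training hook before export.

```python
class FairnessGateError(RuntimeError):
    """Raised when a model fails the (hypothetical) fairness gate."""


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def export_gate(predictions, groups, max_gap=0.2):
    """Return the measured gap, or raise if it exceeds the allowed threshold."""
    gap = demographic_parity_gap(predictions, groups)
    if gap > max_gap:
        raise FairnessGateError(f"parity gap {gap:.2f} exceeds {max_gap:.2f}")
    return gap


# Group "a" receives positive predictions far more often than group "b".
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
try:
    export_gate(preds, groups)
    export_blocked = False
except FairnessGateError:
    export_blocked = True
```

Raising an exception rather than logging a warning is the design choice that turns the check into a gate: the export step cannot proceed unless the caller handles the failure explicitly.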
Speakers
avatar for Raja Gopal Hari Vijay

Raja Gopal Hari Vijay

Member Leadership Staff, Zoho Corporation
At Zoho, Raja builds large-scale Video AI (CCTV analytics, edge inference, privacy-aware deployments) on PyTorch, drives green computing via custom accelerators and FPGAs, and owns a custom Linux distribution for Zoho products and agentic workflows with security reasoning across LSM... Read More →
Wednesday April 8, 2026 15:25 - 15:50 CEST
Junior Stage

15:40 CEST

Lightning Talk: Faster Than SOTA Kernels in torch.compile With Subgraph Fusions and Custom Op Autotuning - Elias Ellison & Paul Zhang, Meta
Wednesday April 8, 2026 15:40 - 15:50 CEST
Unlocking state-of-the-art performance, this talk reveals how subgraph and custom operator autotuning in torch.compile deliver breakthrough speedups—surpassing previous SOTA for matmul and distributed collective ops.

DecomposeK is a novel subgraph optimization in PyTorch, designed to accelerate matrix multiplication when the inner dimension (K) is very large. DecomposeK delivers up to 28% speedup over ATen with activation fusion and 10% over ATen without fusion.

Building on subgraph infrastructure, we introduced Custom Op Autotuning, which benchmarks and selects the fastest kernel implementations for custom ops. This enables epilogue fusion and the first distributed collective op autotuning in PyTorch. We also introduce Range-based dispatch autotuning that enables dynamic selection of optimal implementations based on input shapes, ensuring performance that closely matches the theoretical best for each range. Our demo shows our autotuned kernels outperform Async TP Fused AG+MM by 9% and Async TP Fully Fused kernel by 41% across all input ranges.
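The abstract does not spell out how DecomposeK transforms the matmul. The sketch below assumes it follows the familiar split-K idea: cut the large inner dimension into equal chunks, multiply each chunk independently, and sum the partial products. It uses plain Python lists purely to show that the decomposition is numerically equivalent; the function names and the `k_splits` parameter are hypothetical.

```python
def matmul(a, b):
    """Reference (M x K) @ (K x N) product on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def split_k_matmul(a, b, k_splits):
    """Compute a @ b as a sum of k_splits partial products over slices of K."""
    inner = len(b)
    if inner % k_splits != 0:
        raise ValueError("K must be divisible by k_splits")
    chunk = inner // k_splits
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for s in range(k_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        # Each partial product only touches one slice of the inner dimension,
        # so on real hardware the slices could run in parallel.
        partial = matmul([row[lo:hi] for row in a], b[lo:hi])
        for i in range(rows):
            for j in range(cols):
                out[i][j] += partial[i][j]
    return out


# Small M and N with a comparatively large K, the shape regime described above.
a = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]
b = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, 2], [2, 1], [1, 1]]
result = split_k_matmul(a, b, k_splits=4)
```

The extra reduction over partial products is the cost of the decomposition; it pays off only when K is large enough that a single kernel would otherwise leave parallelism unused.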
Speakers
avatar for Elias Ellison

Elias Ellison

Software Engineer, Meta
Elias has been working on the PyTorch team for four years, most recently on the torch.compile stack
avatar for Paul Zhang

Paul Zhang

Software Engineer, Meta
Paul Zhang is currently a software engineer working on PyTorch and Triton at Meta, ensuring that PyTorch and PT2 best utilize the hardware they run on. Previously, Paul did extensive work on recommendation systems for training and inference, optimizing performance and... Read More →
Wednesday April 8, 2026 15:40 - 15:50 CEST
Founders Cafe

15:55 CEST

Lightning Talk: Why Logging Isn’t Enough: Making PyTorch Training Regressions Visible in Practice - Sahana Venkatesh, Wayve
Wednesday April 8, 2026 15:55 - 16:05 CEST
PyTorch teams often log rich training metrics, yet still discover training regressions late, after significant developer time and GPU budget have already been spent. In this talk, I’ll share a practical pattern we used to turn PyTorch training metrics into an operational guardrail for large-model training.

The approach combines scheduled short and long training runs, standardized performance and stability metrics (throughput, memory, loss, divergence), and simple statistical baselines to automatically surface regressions via alerts without hard gates or complex infrastructure.

I’ll focus on why logging alone is insufficient, how we chose what to monitor, and what tradeoffs we encountered (false positives, alert fatigue, baseline drift). The goal is not a tool demo, but a reusable pattern other PyTorch teams can adapt to catch training regressions earlier and make retraining more predictable.
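One way to picture a "simple statistical baseline" is a z-score check of the latest scheduled run against recent healthy runs. The sketch below is an assumption about what such a check could look like, not the speaker's implementation; the threshold of 3 standard deviations and all names are illustrative.

```python
from statistics import mean, stdev


def is_regression(baseline, latest, z_threshold=3.0, higher_is_better=True):
    """Flag `latest` if it is far on the bad side of the baseline runs.

    baseline: metric values from recent healthy runs (at least two).
    higher_is_better: True for throughput, False for memory or loss.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # No variation in the baseline: any move in the bad direction counts.
        return latest < mu if higher_is_better else latest > mu
    z = (latest - mu) / sigma
    return z < -z_threshold if higher_is_better else z > z_threshold


# Throughput in samples/s and peak memory in GB from five nightly runs.
throughput_baseline = [1010, 990, 1005, 995, 1000]
memory_baseline = [40.0, 40.5, 39.5, 40.2, 39.8]

throughput_alert = is_regression(throughput_baseline, 900)
memory_alert = is_regression(memory_baseline, 46.0, higher_is_better=False)
no_alert = is_regression(throughput_baseline, 995)
```

The tradeoffs the talk mentions show up directly in the parameters: a lower `z_threshold` catches smaller regressions but raises more false positives, and a baseline window that includes regressed runs drifts and stops alerting.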
Speakers
avatar for Sahana Venkatesh

Sahana Venkatesh

Software engineer, Wayve
Wednesday April 8, 2026 15:55 - 16:05 CEST
Central Room
  Training Systems

15:55 CEST

From Gradients To Governance: Making PyTorch Lineage-Aware - Kateryna Romashko & Clodagh Walsh, Red Hat
Wednesday April 8, 2026 15:55 - 16:20 CEST
PyTorch was built to track how models learn, but not whether they should have. As AI systems increasingly operate on regulated, jurisdiction-bound, and sovereign data, lineage and policy can no longer live outside the runtime. This talk explores data sovereignty as a first-class constraint and argues that lineage is the missing primitive in modern ML frameworks. Building on PyTorch’s dynamic graphs and autograd system, we outline how tensors could carry origin, consent, and policy metadata through training and inference. The goal is not compliance tooling, but a lineage-aware PyTorch that enables trustworthy, auditable, and deployable AI across edge, federated, and European AI ecosystems.
Speakers
avatar for Kateryna Romashko

Kateryna Romashko

Associate Software Engineer, Red Hat
Kateryna Romashko is a Software Engineer and a Master’s student in Computer Science, currently working in the Emerging Technology team at Red Hat. Her work focuses on ML systems, data lineage, and event-driven architectures, with hands-on experience across ML platforms, distributed... Read More →
avatar for Clodagh Walsh

Clodagh Walsh

Software Engineer, Red Hat
Clodagh is a software engineer at Red Hat working on the Emerging Technologies team under the office of the CTO. She has experience working with cloud native technologies. She is currently working on a range of AI related projects focused on topics such as MLOps and dLLMs.
Wednesday April 8, 2026 15:55 - 16:20 CEST
Master Stage
  Responsible AI & Compliance

15:55 CEST

DualPipe from Scratch: Implementing DeepSeek's 5D Parallelism in PyTorch - Dev Jadhav, ING Bank
Wednesday April 8, 2026 15:55 - 16:20 CEST
The DeepSeek-V3 paper describes 5D parallelism and DualPipe at a high level, but leaves critical implementation details undocumented. This session presents our open-source PyTorch reference implementation that fills those gaps - verified against the original architecture and designed for learning and extension.

We'll share what we discovered building it from scratch:
• Why K_pe is shared across heads in decoupled RoPE (not explicit in paper)
• The critical timing of bias updates in auxiliary-loss-free load balancing
• How sigmoid routing separates selection scores from gate values
• The warmup formula that makes DualPipe achieve 3% bubble overhead
• Bugs we caught: causal mask position offsets, EMA initialization, capacity dropping priority

What you'll learn:

• 5D Parallelism: How TP, PP, DP, EP, and SP interact at 2,048+ GPU scale
• DualPipe: Building the bidirectional scheduler with 55% throughput gain over GPipe
• Hierarchical All-to-All: Two-level communication reducing MoE dispatch overhead by 4x
• Teachable abstractions: CapacityMetrics, ExpertSpecializationTracker, ScheduleStep enums

Prerequisites: torch.distributed basics.
Code: github.com/DevJadhav/deepseek-from-scratch
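The point about sigmoid routing and auxiliary-loss-free load balancing can be sketched in a few lines. This is an illustrative single-token version, not code from the linked repository: a per-expert bias is added to the sigmoid scores only when choosing the top-k experts, while the gate values are computed from the unbiased scores. The function names, the step size, and the once-per-step update timing are assumptions made for the example.

```python
import math


def route_token(logits, bias, top_k):
    """Pick experts with biased scores, but weight them with unbiased scores."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    # Selection uses the biased scores only.
    selected = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # Gate values use the original scores, normalized over the selected experts.
    norm = sum(scores[e] for e in selected)
    return {e: scores[e] / norm for e in selected}


def update_bias(bias, expert_load, step=0.001):
    """After a training step, nudge overloaded experts down and underloaded ones up."""
    average = sum(expert_load) / len(expert_load)
    updated = []
    for b, load in zip(bias, expert_load):
        if load > average:
            updated.append(b - step)
        elif load < average:
            updated.append(b + step)
        else:
            updated.append(b)
    return updated


# Expert 2 has a lower raw score than expert 1, but its bias gets it selected.
gates = route_token(logits=[2.0, 1.0, 0.0, -1.0], bias=[0.0, 0.0, 0.5, 0.0], top_k=2)
new_bias = update_bias([0.0, 0.0, 0.5, 0.0], expert_load=[10, 2, 6, 6])
```

Because the bias never enters the gate values, it can steer load toward idle experts without distorting the gradient signal that flows through the gates.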
Speakers
avatar for Dev Jadhav

Dev Jadhav

Tech Lead ML Engineer, ING Bank
Dev Jadhav is a production AI/ML engineer with 10+ years building AI systems at scale. He currently leads ML engineering at Major Bank, developing financial-grade AI and large-scale model operations. Dev is the creator of DeepSeek From Scratch, an open-source implementation of DeepSe... Read More →
Wednesday April 8, 2026 15:55 - 16:20 CEST
Founders Cafe
  Training Systems

15:55 CEST

Sponsored Session: Fault-Tolerant Training: How We Build Reliable Clusters for Distributed AI Workloads - Cyril Kondratenko & Maurits de Groot, Nebius
Wednesday April 8, 2026 15:55 - 16:20 CEST
Large-scale distributed AI training is highly sensitive to infrastructure failures, where even a single node disruption can halt progress and waste substantial compute. This talk presents Nebius’s approach to fault-tolerant training, combining reliability metrics such as goodput, MTBF, and MTTR with automated infrastructure practices including health checks, workload isolation, node replacement, state recovery, and observability. Drawing on production cluster results, the presentation shows how these techniques reduce interruptions, accelerate recovery, and improve the stability and efficiency of long-running AI workloads.
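The reliability metrics named above can be related with some simple arithmetic. The sketch below uses common textbook definitions, which may differ from the exact definitions Nebius uses: MTBF as uptime per failure, MTTR as repair time per failure, availability as MTBF / (MTBF + MTTR), and goodput as the share of wall-clock time that produced training progress that was kept. All numbers are made up for illustration.

```python
def mtbf(uptime_hours, failures):
    """Mean time between failures: operating time per failure."""
    return uptime_hours / failures


def mttr(repair_hours, failures):
    """Mean time to recovery: repair time per failure."""
    return repair_hours / failures


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the cluster is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def goodput(productive_hours, wall_clock_hours):
    """Fraction of wall-clock time that produced retained training progress."""
    return productive_hours / wall_clock_hours


# Hypothetical 30-day run: 720 h wall clock, 4 failures, 8 h spent on repair.
wall_clock = 720.0
failures = 4
repair = 8.0
uptime = wall_clock - repair

cluster_mtbf = mtbf(uptime, failures)
cluster_mttr = mttr(repair, failures)
cluster_availability = availability(cluster_mtbf, cluster_mttr)

# Work done since the last checkpoint is lost on each failure (12 h in total here),
# so goodput is lower than availability.
run_goodput = goodput(uptime - 12.0, wall_clock)
```

The gap between availability and goodput is why state recovery matters: frequent checkpoints shrink the work lost per failure even when MTBF and MTTR stay the same.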
Speakers
CK

Cyril Kondratenko

AI/ML Specialist Solutions Architect, Nebius
MD

Maurits de Groot

AI/ML Specialist Solutions Architect, Nebius
Wednesday April 8, 2026 15:55 - 16:20 CEST
Junior Stage

16:10 CEST

Lightning Talk: Ball Tracking and Detection in Soccer Videos - Comparison of VLMs and Traditional Pipelines - Maciej Szymkowski, Future Processing
Wednesday April 8, 2026 16:10 - 16:20 CEST
Vision-Language Models (VLMs) now have many applications, but we cannot assume that they are the most accurate and precise solution for every problem; their capabilities must be compared with other pipelines. In this presentation, we compare on-premise models (Qwen 3 and InternVL-3.5) and cloud-based solutions (Gemini 3 and GPT-5) with a traditional pipeline based on YOLOv11 and image processing techniques. The battlefield will be ball detection and tracking in soccer match recordings downloaded from the SoccerNet database, captured from different angles, in varied lighting (e.g., sunny, night), and in varied weather (e.g., snowy or rainy days). We used both broadcast videos and action and replay images, all annotated manually to build a ground-truth database. The models must not only recognize the ball but also track it through the whole sequence of images. To give each approach an equal chance, we fine-tuned YOLOv11 and provided additional knowledge to the VLMs in the form of a RAG pipeline. The comparison was made with traditional machine learning metrics such as accuracy, precision, and recall.
Speakers
avatar for Maciej Szymkowski

Maciej Szymkowski

AI Researcher and Senior Machine Learning Engineer, Future Processing
Maciej Szymkowski, PhD, is a Senior ML Engineer at Future Processing. Formerly Head of AI at Łukasiewicz PIT, his academic background spans BUT, WUT, and AGH. With 45+ publications, he specializes in Computer Vision (med/transport/sport), VLMs, and LLMs. His industry experience includes... Read More →
Wednesday April 8, 2026 16:10 - 16:20 CEST
Central Room
  Applications & Case Studies

16:25 CEST

Lightning Talk: Bridging the Gap: Engineering Compliant "Glass Box" Medical AI With PyTorch - Muhammad Saqib Hussain, Neurosonic & Mohaddisa Maryam, Neurosonic Academy
Wednesday April 8, 2026 16:25 - 16:35 CEST
While state-of-the-art models like NeuroBOLT demonstrate mathematical excellence in EEG-to-fMRI synthesis, they often remain clinically opaque. With the EU AI Act classifying medical AI as "high-risk," hospitals cannot deploy "black boxes"; they require systems that are transparent, auditable, and legally compliant.
This session presents a "Clinical Auditing System" built within the PyTorch ecosystem, designed to transform opaque deep learning models into transparent "Glass Boxes." I will demonstrate a workflow that backpropagates gradients from high-dimensional 4D fMRI volumes to identify the specific EEG spectral signatures driving those predictions.
Key Technical Takeaways:
1. The Audit Layer: Implementing IntegratedGradients (Captum) to verify model fidelity, ensuring predictions stem from valid neural oscillations rather than noise artifacts.
2. Cross-Modal Reasoning: A technical demonstration of mapping 4D volumetric outputs back to 1D EEG frequency bands, enabling the model to "reason" through neurovascular coupling.
This presentation is designed for developers seeking to wrap PyTorch models in safety layers that satisfy the demands of healthcare regulation.
Speakers
avatar for Mohaddisa Maryam

Mohaddisa Maryam

Miss, Neurosonic Academy
I am a First Year Student of Medicine in Italy.
avatar for Muhammad Saqib Hussain

Muhammad Saqib Hussain

Medical Student, AI Researcher and Neurotech Founder, ClinExplain
Muhammad Saqib is a 4th-year medical student at Comenius University Bratislava and Founder of Neurosonic Academy. His M.D. thesis explores AI for Sleep Medicine. Leveraging PyTorch and Captum, he builds "Glass Box" auditing frameworks to validate generative neuroimaging models against... Read More →
Wednesday April 8, 2026 16:25 - 16:35 CEST
Founders Cafe
  Applications & Case Studies

16:25 CEST

De-mystifying PyTorch for ASICs: When (and Why) To Move Your Development To AI Accelerators - Alpha Romer Coma, Kollab Philippines
Wednesday April 8, 2026 16:25 - 16:50 CEST
GPU availability and cost are squeezing ML teams, making ASICs like Google TPUs and AWS Trainium attractive alternatives. But does the software stack hold up? This session moves beyond the datasheets to provide a practical, code-first reality check on migrating PyTorch workloads to ASICs.

We will de-mystify the underlying compiler stacks, comparing PyTorch/XLA (TPU) and TorchNeuron (Trainium), and analyze the 'Compiler Tax' that often surprises developers. Through side-by-side code diffs and real-world benchmarks on fine-tuning Llama 4, Gemma 3, Qwen 3, and training CNNs and ViTs, we will answer:

1. The Code: How much rewriting is actually required?
2. The Performance: Which model architectures thrive on ASICs, and which ones fail due to dynamic shapes?
3. The Debugging: What happens when you hit an OOM or a compilation hang?

Attendees will leave with a clear 'Migration Decision Matrix' to determine if their specific workload is ready for the ASIC leap.
Speakers
avatar for Alpha Romer Coma

Alpha Romer Coma

Associate Engineer, Cloud Development, Kollab Philippines
Alpha is an Associate Cloud Engineer at Kollab and a CS undergraduate at FEU Tech, Philippines. He specializes in multimodality with text, videos, and audio, and works on Accelerated Computing with Google TPUs and AWS Trainium.

For 5 months, he pushed Google Cloud TPUs v4s to their limit to train vision-language models for use cases like internet brain rot recognition and detection of cognitively overloading content called sludge videos with 92% accuracy... Read More →
Wednesday April 8, 2026 16:25 - 16:50 CEST
Central Room
 