7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Venue: Founders Cafe
Tuesday, April 7
 

11:00 CEST

Lights, Camera, Inference! Video Generation as a Service With vLLM-Omni - Ricardo Noriega, Red Hat & Doug Smith, Red Hat, Inc
Tuesday April 7, 2026 11:00 - 11:25 CEST
LLMs made text generation as a service a reality. What does it take to do the same for video?
We built an experimental Video Generation as a Service stack using vLLM-Omni and the LTX-2 open-weights video model to explore how far an open, multimodal stack can go toward production use. We’ll share what worked, what broke, and what it takes to treat generative video as a first-class workload.
vLLM is known for high-performance autoregressive inference, and vLLM-Omni extends that foundation to multimodal inputs and outputs. We pushed those capabilities further by adding support for LTX-2, extending the OpenAI-compatible API surface, integrating with front ends, and packaging for scalable deployment. We’ll walk you through the key touch points and show exactly how we put all the Legos together with vLLM-Omni.
Finally, we’ll examine the gap between novelty demos and real applications: moving from quirky spaghetti-eating videos to consistent characters, personalized media, customized video game cutscenes, and interactive storytelling. We’ll highlight what’s still missing to make generative video truly production-ready.
Speakers
Doug Smith

Principal Software Engineer, Red Hat
Doug Smith is a Principal MLOps Engineer at Red Hat, where he works on the AI Inference Server team and contributes upstream to the vLLM project through its CI Special Interest Group. Recently, he's also been looking into contributions to vLLM-Omni. He’s spent years bridging telecom... Read More →
Ricardo Noriega

Principal SW Engineer, Red Hat
Ricardo is a Principal Software Engineer in Red Hat's Office of the CTO, in the Emerging Technologies organization. He is currently focused on AI multimodality and researching the benefits of Small Language Models.
He is a former member of the Akraino TSC and PTL of the Kubernetes-Native-Infrastructure blueprint family, and contributor to Kubernetes, OpenStack, OpenDaylight and OPNFV... Read More →
Founders Cafe
  GenAI & Multimodal
  • Audience Level Any
  • Slides Attached Yes

11:30 CEST

Lightning Talk: Coding Agents for Compiler Construction: Beyond the AI Assistant Paradigm - Reza Rahimi, yasp.ai & Stefan Krassin, yasp
Tuesday April 7, 2026 11:30 - 11:40 CEST
Modern ML compilers follow a familiar pattern: a frontend lowers models into an intermediate representation, while a backend applies graph and kernel optimizations before generating code for target accelerators. PyTorch provides strong foundations through nn.Module, FX, and graph capture, but implementing optimized backends remains challenging due to hardware diversity and kernel-level complexity.

Optimizing GPU kernels is hard. Few engineers do it well. Hardware architectures evolve yearly, and with hyperscalers, chip makers, and AI labs building custom silicon, demand for efficient kernel generation keeps growing. This creates a gap between model developers and hardware capabilities.

This talk explores coding agents as engineering tools for compiler construction, not general-purpose assistants. We discuss how agents can generate and refine backend components by analyzing model mathematics and hardware specifications to produce optimized kernels tailored to specific targets.

We present a compiler architecture built as a PyTorch add-on that accepts PyTorch models or FX graphs and produces executable artifacts, demonstrating practical integration with existing PyTorch workflows.
Speakers
Reza Rahimi

CTO, yasp
Reza Rahimi is a seasoned technologist with a strong background in accelerating engineering software and scaling machine learning systems. With experience leading teams across embedded AI, compiler design, and model optimization, he now serves as CTO of yasp, where he is pioneering... Read More →
Stefan Krassin

CEO, yasp.ai
With a background in electrical engineering and a career spanning embedded systems to executive leadership, he combines technical expertise with a vision for scale. After 10+ years of leading companies to outstanding growth, he co-founded yasp in 2023. His mission is to eliminate... Read More →
Founders Cafe
  Agents & Interop

11:45 CEST

Lightning Talk: TorchJD: Jacobian Descent in PyTorch - Pierre Quinton, EPFL & Valérian Rey, Simplex Lab
Tuesday April 7, 2026 11:45 - 11:55 CEST
Jacobian descent (JD) is an extension of gradient descent supporting the optimization of vector-valued functions. This algorithm can be used to train neural networks with multiple loss functions (e.g. multi-task learning). JD iteratively updates the parameters of the model using the Jacobian matrix of the vector of losses (the matrix stacking each individual loss' gradient).

To support and extend our research, we have developed the TorchJD library. With it, it's easy and efficient to compute the Jacobians with respect to the model parameters, and to aggregate them into an update direction that is beneficial to every objective. In contrast, if we had averaged the losses and used gradient descent, the update would have been beneficial to the average loss, but may have actually increased one of the individual losses.

In this session, we will give a quick introduction to the theory behind Jacobian descent, and then show how to use TorchJD on a variety of use-cases, beyond multi-task learning.

Library: https://github.com/TorchJD/torchjd
Paper: https://arxiv.org/abs/2406.16232
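
To make the contrast above concrete, here is a toy plain-Python sketch (not the TorchJD API; the gradient values and the PCGrad-style projection are illustrative assumptions, and TorchJD's own aggregators such as UPGrad differ): when two loss gradients conflict, the averaged update increases one loss to first order, while a conflict-aware aggregation of the Jacobian rows does not.

```python
# Toy illustration: when two loss gradients conflict, the averaged
# update can increase one loss, while a conflict-aware aggregation
# of the Jacobian rows avoids that (to first order).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Jacobian of a 2-loss objective at some point: one gradient per row.
g1 = [1.0, 0.0]
g2 = [-2.0, 1.0]

# Plain averaging: a step along -d_avg *increases* loss 1,
# since dot(d_avg, g1) < 0.
d_avg = [(a + b) / 2 for a, b in zip(g1, g2)]
assert dot(d_avg, g1) < 0  # conflict: the average hurts objective 1

# PCGrad-style aggregation: project each gradient onto the normal
# plane of the other when they conflict, then average.
def project_out_conflict(g, other):
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict, keep as is
    coef = d / dot(other, other)
    return [gi - coef * oi for gi, oi in zip(g, other)]

g1p = project_out_conflict(g1, g2)
g2p = project_out_conflict(g2, g1)
d_jd = [(a + b) / 2 for a, b in zip(g1p, g2p)]

# The aggregated direction is non-conflicting with both gradients.
assert dot(d_jd, g1) > 0 and dot(d_jd, g2) > 0
```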
Speakers
avatar for Pierre Quinton

Pierre Quinton

Teacher, EPFL
PhD in Information Theory and Master in Data Science, specializing in fundamental math and multi-objective optimization (MOO). I am the co-author of TorchJD, a PyTorch library for Jacobian Descent developed with Valerian, currently at ~300 GitHub stars. My work aims to translate complex... Read More →
avatar for Valérian Rey

Valérian Rey

Research Engineer, Simplex Lab
I graduated from EPFL with an MSc in Data Science in 2021. Since then, I have worked as a Data Scientist at Withings and on Jacobian descent, initially as a side project but now as a full-time occupation. I now spend most of my time developing and maintaining TorchJD, and I love... Read More →
Founders Cafe
  Training Systems

12:00 CEST

Lightning Talk: Ethical, Privacy and Sustainability Considerations in PyTorch Systems - Paula Mesa Macias, Pau&Company
Tuesday April 7, 2026 12:00 - 12:10 CEST
PyTorch models are part of larger systems that handle data, logs, APIs and other services. Ethical, privacy, security and environmental considerations appear not only around the AI itself, but across the whole system.
Using the Ethical Software Framework and the Ethical IT Audit, this session explores practical ways to think about these issues in real workflows. It highlights situations where decisions in data handling, model deployment, logging or infrastructure have ethical, compliance or sustainability implications. It also shows considerations for using AI responsibly, such as dataset choices, bias awareness and evaluating risks before deployment.
The goal is to provide a clear, structured lens for identifying risks and trade-offs, making ethical, privacy, security, and sustainability concerns easier to discuss in everyday work.
Speakers
Paula Mesa Macias

Founder and Ethical Technology Consultant, Pau&Company
Founder of Pau&Company (https://pau.company/), which offers Ethical IT Audits (https://pau.company/ethical-it-audit/) based on the Ethical Software Framework (https://pau.company/ethical-software-framework/), Paula focuses on ethical considerations in technology. Through Pau&Company... Read More →
Founders Cafe

13:45 CEST

Lightning Talk: From Pretrained To Personal: Privacy-First Fine-Tuning on AI PCs - Daniel Holanda Noronha & Iswarya Alex, AMD
Tuesday April 7, 2026 13:45 - 13:55 CEST
PyTorch on AI PCs has crossed a threshold: local hardware can now support meaningful model fine-tuning, not just inference. This unlocks a new class of enterprise workflows where sensitive data never leaves the device, yet models can still be personalized and adapted using PyTorch.

In this session, we’ll show how to design on-device fine-tuning pipelines for AI PCs, focusing on enterprise scenarios where privacy is non-negotiable: regulated healthcare data, government and public-sector workloads, financial services, and proprietary enterprise systems. We’ll walk through key decisions such as selecting efficient pre-trained models, and how the right PyTorch optimizations enable effective personalization on large private datasets.

We'll also showcase practical fine-tuning techniques such as supervised fine-tuning (SFT), LoRA, and QLoRA, and show how mixed-precision training and correct use of training vs. evaluation modes make these approaches efficient and practical on AI PCs while preserving privacy. The result is a cloud-free, privacy-first fine-tuning blueprint that turns AI PCs into secure personalization engines for enterprise AI.
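
As background for the techniques listed above, here is a minimal plain-Python sketch of the LoRA idea (toy shapes and names are assumptions, not AMD's pipeline or any specific library's API): a frozen weight is adapted through a low-rank pair, leaving far fewer trainable parameters.

```python
import random

# Toy sketch of LoRA: instead of updating a frozen d x d weight W,
# train a low-rank pair B (d x r, zero-init) and A (r x d, random),
# so the effective weight is W + (alpha / r) * B @ A.

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

random.seed(0)
d, r, alpha = 8, 2, 16
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable, random init
B = [[0.0] * r for _ in range(d)]                               # trainable, zero init

def lora_forward(x):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * dl for b, dl in zip(base, delta)]

x = [1.0] * d
# At initialization B == 0, so LoRA leaves the base model's output unchanged.
assert lora_forward(x) == matvec(W, x)

# Far fewer trainable parameters than full fine-tuning of W:
full_params = d * d          # 64
lora_params = d * r + r * d  # 32
assert lora_params < full_params
```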
Speakers
Daniel Holanda

Solutions Architect & ML Engineer, AMD
Daniel is a Sr. ML Engineer at AMD, specializing in local AI. He leads the development of local fine-tuning workflows for AI PCs and co-leads several open-source projects where he designs production-grade LLM/VLM tooling to accelerate the AI development lifecycle.

Previously, he was a Machine Learning Engineer at Groq and a contributor to Microsoft’s Project Brainwave. Daniel holds a PhD in AI understanding and hardware architecture from UBC... Read More →
Iswarya Alex

ML Engineer, AMD
I am an ML Engineer at AMD focused on enabling high-performance on-device AI experiences. I work on efficiently optimizing and deploying models on AMD's Ryzen AI powered devices with GPUs and NPUs.
Founders Cafe
  Security & Privacy

14:15 CEST

Sponsored Session: TorchTPU: Expanding TPU Programmability to PyTorch - Kat Ko & Claudio Basile, Google; Jana van Greunen, Meta
Tuesday April 7, 2026 14:15 - 14:40 CEST
Google Tensor Processing Units (TPUs) are designed for ML at massive scale, offering significant benefits in performance, energy, and cost. While TPUs have historically been associated with the TensorFlow and JAX ecosystems, we introduce TorchTPU: a new Google effort to expand TPU programmability to PyTorch.

This talk charts TorchTPU’s evolution, from the initial RFC to establishing a native, eager-first PyTorch backend. We will outline the core technical challenges overcome during this transition—particularly the complexities of translating dynamic, eager execution into highly optimized TPU computations.

We’ll highlight current milestones, including native integration with torch.compile, DTensor, and robust support for the latest Ironwood (TPU v7) architecture. These advancements collectively enable multi-billion parameter models to run on TPUs with minimal code changes, while retaining the ability for users to apply model-specific optimizations (e.g., custom kernels, quantization, sharding) to reach peak performance. Finally, we’ll provide a sneak peek at our roadmap for 2026.
Speakers
Jana van Greunen

Director of PyTorch Engineering, Meta
Jana van Greunen is the Director of PyTorch Engineering at Meta, where she leads efforts to ensure PyTorch remains the leading AI/ML framework for researchers and developers worldwide. With deep expertise in distributed systems, large-scale infrastructure, and over 15 years of experience... Read More →
Kat Ko

Senior Eng Manager, Google
Kat Ko is a Senior Engineering Manager at Google and a lead on TorchTPU, where she drives the integration of PyTorch with TPU technology to enable high-performance computing at scale. An EECS graduate of UC Berkeley, she brings over 15 years of experience building large-scale systems... Read More →
Claudio Basile

Software Engineer, Google
Claudio Basile is a Google Software Engineer and the co-founder and technical lead of TorchTPU. During his tenure at Google, he also authored LiteRT, the company’s new on-device ML framework. With a Ph.D. in ECE from UIUC and over 15 years of experience spanning machine learning... Read More →
Founders Cafe
  Frameworks & Compilers
  • Slides Attached Yes

14:45 CEST

Lightning Talk: Implementing Single-Dim Strategies With Sharding Validator - Anshul Sinha, Meta
Tuesday April 7, 2026 14:45 - 14:55 CEST
DTensor sharding propagation is a major bottleneck to full operator coverage: adding or fixing an op strategy is complex, bug‑prone, and gaps often surface as unexpected resharding and extra collectives. A key source of complexity is that today’s rules conflate (1) semantic correctness—valid input/output sharding combinations for an operator—with (2) search‑space pruning to avoid combinatorial blowups on N‑dimensional meshes.

This talk presents a landed prototype that separates these concerns via Single Mesh Dim Strategies: each operator specifies valid placement combinations for one mesh dimension, while infra expands/composes them across the full mesh and selects low‑cost strategies. For contributors, this provides a clear path to refactor existing op_strategies into single‑dim rules that are easier to review and extend. We also introduce a Truth Table‑style sharding validator that systematically tests shapes and sharding specs to check soundness/completeness and to flag unnecessary redistribution/collectives caused by missing cases.

The goal of this presentation is faster, higher‑confidence contributions that improve correctness and expand DTensor operator coverage.
Speakers
Anshul Sinha

Software Engineer, Meta
I graduated from the University of Michigan with a B.S in Computer Science in December 2024. I joined Meta's PyTorch Distributed as a SWE in June 2025.
Founders Cafe
  Frameworks & Compilers

15:00 CEST

Lightning Talk: Jigsaw: Domain and Tensor Parallelism for High-Resolution Input Training - Deifilia Kieckhefen, Karlsruhe Institute of Technology
Tuesday April 7, 2026 15:00 - 15:10 CEST
Distributed neural network training frameworks typically optimize for specific architectures while minimizing communication overhead. Transformer layers can be efficiently parallelized, but other operations such as convolutions often remain inefficient. This creates bottlenecks for complex model architectures.
Moreover, existing tensor parallelism strategies typically replicate input data across all processes, creating redundant I/O that scales poorly with input size. In applications with heavy I/O demands, such as weather forecasting, medical imaging, or video processing, unsharded input data creates additional data-loading bottlenecks that could benefit from parallelization.
Jigsaw is a PyTorch library that shards both model weights and input data across parallel processes. It maintains a PyTorch-like interface while parallelizing activations, convolutions, linear layers, and attention through a distributed matrix multiplication backend. We demonstrate the usability of Jigsaw across a wide range of model architectures, show its performance when scaling multi-billion-parameter models sharded across up to 8 processes, and compare its scalability to DDP, FSDP, and Megatron-LM approaches.
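
A toy sketch of the input-sharding idea (plain Python with made-up shapes, not Jigsaw's API): splitting the input batch into row blocks, computing each block's matmul "locally", and concatenating reproduces the unsharded result.

```python
# Toy sketch of input sharding for a linear layer: split the input batch
# into row blocks (one per simulated "process"), multiply each block
# locally, and concatenate -- the result equals the unsharded matmul.

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

X = [[1, 2], [3, 4], [5, 6], [7, 8]]   # input batch, 4 samples
W = [[2, 1], [0, 3]]                   # layer weight

P = 2                                  # number of simulated processes
shard = len(X) // P
local_outs = [matmul(X[p * shard:(p + 1) * shard], W) for p in range(P)]
gathered = [row for out in local_outs for row in out]

# Sharded computation matches the full, unsharded matmul.
assert gathered == matmul(X, W)
```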
Speakers
Deifilia Kieckhefen

Doctoral Researcher, Karlsruhe Institute of Technology
Deifilia Kieckhefen is a doctoral researcher at the Karlsruhe Institute of Technology. She works on scalable and distributed training of neural network architectures.
Founders Cafe
  Training Systems
  • Audience Level Any
  • Slides Attached Yes

15:40 CEST

Lightning Talk: Cross-Region Model Serving: PyTorch Inference, Observability & LLMOps - Suraj Muraleedharan, Amazon Web Services
Tuesday April 7, 2026 15:40 - 15:50 CEST
As PyTorch models move to production, organizations face a critical challenge: deploying, monitoring, and operating inference at scale across multiple regions. Single-region serving is well understood, but multi-region LLMOps—model distribution, observability, failover, and cost management—remains ad hoc and challenging for many customers.

This session presents production-tested architectures for multi-region PyTorch inference and LLMOps workflows. We cover:

Serving: Multi-region TorchServe/KServe on Kubernetes with latency-based routing, blue-green deployments, model versioning, and automated failover with circuit breakers.

Observability: OpenTelemetry distributed tracing, Prometheus/Grafana dashboards for latency, throughput, GPU utilization, and LLM-specific metrics like time-to-first-token and KV-cache hit rate.

LLMOps: CI/CD pipelines for cross-region model deployment with automated rollback, drift detection, and SLO-based alerting.

Attendees leave with serving architectures, dashboards, and deployment pipelines using open-source tooling.
Speakers
Suraj Muraleedharan

Principal Platform Engineer, Amazon Web Services
Principal Engineer driving technical strategy and building mission-critical foundational platforms for AI, HPC, and distributed systems, bridging the gap between infrastructure, AI research, and product organizations.
Founders Cafe
  Inference & Production

15:55 CEST

Lightning Talk: Running ExecuTorch Applications With Silicon Acceleration, in Ultra-low Power - George Gekov, Arm; Aki Makkonen, Alif Semiconductor
Tuesday April 7, 2026 15:55 - 16:05 CEST
Efficient deployment of ML models on low-power embedded systems has been a significant challenge for a number of years. At the same time, these embedded SoCs are all around us—from everyday appliances to the latest smart glasses.

ExecuTorch is a PyTorch-native framework for deploying neural networks on resource-constrained systems. In this session, we show how to build an end-to-end speech recognition application using PyTorch and ExecuTorch—from training a Transformer-based neural network in PyTorch, through quantization, all the way to deployment on a low-power embedded device.

We will introduce the key ExecuTorch APIs for quantization and explain how models are transformed and lowered into a form that can run efficiently on a device. The application runs on the Alif Ensemble E8 SoC, the first implementation of the leading Arm® Ethos-U85 NPU, which brings native support for Transformer models to the ultra-low-power domain.

Join the experts from Arm and Alif Semiconductor to see how we are bridging the gap between PyTorch and embedded deployment—and how you can bring PyTorch models to silicon-accelerated, ultra-low-power systems.
Speakers
George Gekov

ML Engineer, Arm
George Gekov is a Staff Software Engineer in Arm’s Machine Learning team, where he focuses on machine learning inference on embedded systems. He has extensive experience deploying neural networks on resource-constrained devices with Neural Processing Units (NPUs) to enable hardware-accelerated... Read More →
Aki Makkonen

Senior Staff Application Engineer, Alif Semiconductor
Software engineer with background in telecommunication, medical imaging, robotics and embedded systems.
Founders Cafe

16:10 CEST

On-Device LLM Inference on Android With ExecuTorch and Qualcomm QNN - Shivay Lamba & Kartikey Rawat, Qualcomm
Tuesday April 7, 2026 16:10 - 16:35 CEST
Multimodal models like CLIP are typically deployed in the cloud due to their size and computational demands, limiting their use in latency-sensitive, privacy-preserving, and offline-first applications. This talk demonstrates how one can run fully on-device CLIP inference on Android using ExecuTorch with the Qualcomm QNN backend, enabling real-time vision–language understanding without server dependency.

One can run models like CLIP (ViT-B/32) entirely on edge devices, leveraging QNN for hardware-accelerated inference. A key focus of the talk is a deep dive into ExecuTorch optimizations for QNN, including graph lowering, operator fusion, quantization strategies, memory planning, and backend-specific execution choices that materially impact latency, memory footprint, and power consumption.

The talk will cover architectural insights, model export and compilation workflows, and real-world benchmarks covering latency, memory usage, and power efficiency. This talk highlights how large multimodal PyTorch models can be made production-ready on edge devices, unlocking new classes of private, offline-capable AI applications.
Speakers
Shivay Lamba

Senior ML Engineer, Qualcomm
Shivay Lamba is a software developer specializing in DevOps, Machine Learning and Full Stack Development.

He is an Open Source Enthusiast and has been part of various programs like Google Code In and Google Summer of Code as a Mentor and is currently a MLH Fellow. He has also worked at organizations like Amazon, EY, Genpact. He is a Tensorflow.JS SIG member and community lead from In... Read More →
Kartikey Rawat

Senior Developer Advocate, Qualcomm
Senior Developer Advocate at Qualcomm | Google Developer Expert in AI and Google Cloud
Founders Cafe
  GenAI & Multimodal
  • Audience Level Any

16:40 CEST

Optimizing PyTorch on CPU-GPU Coherent Platforms - Matthias Jouanneaux, Nvidia
Tuesday April 7, 2026 16:40 - 17:05 CEST
In recent years, both Nvidia and AMD have introduced hardware-coherent platforms: GH200, GB200, and MI300A. These coherent platforms offer many new features, but also pose challenges, for PyTorch applications attempting to make the most of the platform.
This talk will focus on Nvidia's GB200 and walk through techniques to utilize the features of the coherent architecture in PyTorch, such as the high CPU-GPU interconnect bandwidth, unified memory, as well as the advantages and caveats of sharing system memory between CPU and GPU.
Speakers
Matthias Jouanneaux

Sr Software Engineer - PyTorch, NVIDIA
After his master’s degree, Matthias Jouanneaux worked at Konica Minolta's European research lab on medical image analysis using deep learning for 2 years.
He then joined Nvidia, focusing on optimizing application performance for Nvidia hardware as a Developer Technology enginee... Read More →
Founders Cafe
  Frameworks & Compilers
 
Wednesday, April 8
 

10:35 CEST

How To Write C++ Extensions in 2026 - Jane Xu, Meta & Mikayla Gawarecki, Meta
Wednesday April 8, 2026 10:35 - 11:00 CEST
Are you writing a C++ custom op extension to PyTorch? It's 2026 and are you still shipping M x N wheels for M CPython versions and N libtorch versions? Did you know you can just ship 1 wheel that works across multiple CPythons and libtorches? If you're curious how, attend this talk to get the deets on py_limited_api, APIs like torch::stable::Tensor & TORCH_TARGET_VERSION, and generally the latest and greatest ways to keep your code and your release matrix simple. Get your custom kernel enrolled in new features, with benefits proven out in FA3, xformers, torchao, torchaudio, and more in progress! We'll also share some of our vision for smoother and faster custom op extensions.
Speakers
Jane Xu

PyTorch SWE, Meta
Hi, I'm Jane! Please don't hesitate to come talk to me about your favorite optimizer, fitting models in GPU memory, how to free C++ extensions from libtorch version constraints, and anything that interests you.
Mikayla Gawarecki

Software Engineer, Meta
Software Engineer on PyTorch
Founders Cafe
  Frameworks & Compilers

11:05 CEST

Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Training on ROCm - Liz Li & Zachary Streeter, AMD
Wednesday April 8, 2026 11:05 - 11:30 CEST
PyTorch Monarch introduces a new distributed programming paradigm that enables developers to orchestrate entire GPU clusters from a single Python program. With its actor-based runtime, process mesh abstraction, and asynchronous execution model, Monarch simplifies large-scale distributed training and enables complex workflows that combine training, evaluation, and reinforcement learning within one unified script.

In this talk, we present our work enabling PyTorch Monarch on AMD Instinct GPUs with ROCm, expanding the single-controller model beyond CUDA environments and bringing this emerging runtime to a broader hardware ecosystem. We describe the engineering effort required to port Monarch’s GPU runtime and distributed communication stack to ROCm, including HIPification of CUDA-specific components, adaptation of memory management and synchronization semantics, and integration with high-performance GPU-to-GPU communication on multi-node clusters through RDMA.

We will share lessons learned from running Monarch workloads on MI300-class clusters, including performance considerations, debugging workflows, and developer experience improvements. Our results demonstrate that Monarch’s architecture can be successfully extended to heterogeneous hardware environments while preserving scalability and ease of use.

This work advances hardware diversity in distributed PyTorch and highlights how portable runtimes can simplify large-scale training while enabling scalable, cluster-wide experimentation across accelerator platforms.
Speakers
Liz Li

Principal AI engineer, AMD
Liz Li is a Principal AI Engineer in the AMD AI group, specializing in enabling and optimizing cutting-edge AI models on AMD Instinct GPUs for both distributed inference and training. With over 10 years of experience in computer, graphics, and AI architecture, she has previously led... Read More →
Zachary Streeter

Senior Member of Technical Staff, AMD
I'm a computational physicist who has worked in the field of AI for the past 5 years. I have a wide range of expertise, from mathematics to performance optimizations and system engineering. Feel free to nerd out with me! Please connect with me on LinkedIn.
Founders Cafe
  Training Systems
  • Audience Level Any

11:35 CEST

Lightning Talk: Enabling the Audio Modality for Language Models - Eustache Le Bihan, Hugging Face
Wednesday April 8, 2026 11:35 - 11:45 CEST
As the maintainer of everything audio in `transformers` (the library), I'll share how audio is being integrated into large language models, grounded in what we observe in the open-source ecosystem.

Beginning with a brief overview of the current landscape of Audio LMs, I'll then highlight emerging trends in how audio is incorporated into pretrained text backbones. In particular, we examine the growing convergence of architectural choices, many inspired by VLMs, as well as newer concepts such as audio tokenization and streaming.

The core of the talk focuses on providing the audience with key technical insights: audio encoders vs audio tokenizers, their respective advantages and limitations. It covers the motivations behind introducing concepts such as audio tokenizers and audio processors into transformers, shows how these design choices are reflected in the library, and explains how PyTorch tooling is leveraged to make audio a standardized modality for the open-source community.
Speakers
Eustache Le Bihan

MLE, Hugging Face
A 2024 MVA graduate, I now work on open-source audio at Hugging Face. My current focus is on standardising audio in the transformers library and strengthening support across models.
Founders Cafe

13:30 CEST

Optimizing CPU LLM Inference in PyTorch: Lessons From VLLM - Crefeda Rodrigues, Arm Limited & Fadi Arafeh, Arm
Wednesday April 8, 2026 13:30 - 13:55 CEST
vLLM has emerged as a reference inference stack in the PyTorch ecosystem for high-throughput large language model serving. CPUs continue to play an important role in LLM inference, supporting cost-sensitive deployments, hybrid CPU/GPU serving, and batch or off-peak workloads on general-purpose infrastructure.

In this talk, we examine CPU-based LLM inference through the lens of PyTorch internals, using vLLM as a case study. We describe how vLLM interacts with PyTorch’s operator stack, including tensor layout management and backend dispatch, and highlight common sources of overhead such as repeated weight repacking and poor threading behaviour.

We present runtime and kernel-level optimizations that reduce overhead including CPU paged-attention kernel tuning with vectorized softmax, specialized Q–K and P–V GEMM kernels aligned with vLLM’s scheduler, an ISA-aware BF16 attention, pre-packed weight layouts for quantized matmul, SIMD vectorization using PyTorch’s at::vec::Vectorized primitives, and NUMA-aware scheduling for scalable parallel inference.

Finally, we conclude with lessons learned from building and upstreaming a high-performance CPU inference engine.
Speakers
Crefeda Rodrigues

Staff Software Engineer, Arm
Crefeda Rodrigues is a Staff Software Engineer at Arm, focusing on performance and scalability driven machine learning software optimization for Arm server CPUs. She previously worked on large-scale climate and weather model optimization as a postdoctoral researcher at the University... Read More →
Fadi Arafeh

Senior Machine Learning Engineer, Arm
Fadi is a Senior Machine Learning Engineer at Arm, working on optimizing PyTorch and vLLM for Arm Infrastructure cores. Prior to that, Fadi obtained a BSc in Artificial Intelligence from the University of Manchester.
Founders Cafe
  Inference & Production

14:00 CEST

Lightning Talk: Debugging the Undebuggable: Introducing torch.distributed.debug - Tristan Rice, Meta, PyTorch
Wednesday April 8, 2026 14:00 - 14:10 CEST
Distributed training in PyTorch enables unprecedented scale, but it also introduces notoriously difficult debugging challenges. When a job with thousands of ranks hangs or slows down, identifying the root cause can feel like searching for a needle in a haystack. This lightning talk introduces the new PyTorch Distributed Debug Server, a powerful, interactive tool designed to bring clarity and control to the chaos of distributed debugging. We will provide a high-level overview of its architecture and core features, demonstrating how it provides a unified interface to inspect stack traces, analyze performance, and diagnose hangs across all workers simultaneously. Attendees will learn how this extensible server can dramatically reduce debugging time and improve the reliability of large-scale training jobs.
Speakers
Tristan Rice

Software Engineer, PyTorch Distributed, Meta
Software engineer working on PyTorch Distributed and large scale training.
Founders Cafe

14:15 CEST

Lightning Talk: Scaling Recommendation Systems To 2K GPUs and Beyond - Zain Huda, Meta
Wednesday April 8, 2026 14:15 - 14:25 CEST
TL;DR: In this session, we go over one of the key technologies behind Ads model scaling at Meta: 2D sparse parallelism, which scales sparse recommendation embedding tables beyond 1K GPUs to 8K GPUs, enabling the largest Ads model training runs in production at Meta.

Scaling Laws have dominated LLMs and shown the industry we can achieve better model performance through scaling. The same scaling law can be applied to recommendation systems. However, the path to scaling recommender systems is not the same. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models.

In this talk, we will detail the development of 2D sparse parallelism, tracing its path from research to production to address sparse scaling challenges. We will demonstrate how we optimize these systems to push performance boundaries, increasing speed and reducing memory at scale. Participants will walk away with lessons learned from designing 1,000+ GPU scale systems, and a deeper understanding of how to implement these solutions efficiently in production.
Speakers
Zain Huda

Software Engineer, Meta
Zain works on large scale training systems for recommender systems at Meta. He works on TorchRec, a library for distributed parallelism for sparse recommender models. He is also one of the authors of 2D sparse parallelism.
Founders Cafe

14:30 CEST

From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face
Wednesday April 8, 2026 14:30 - 14:55 CEST
Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.

We highlight how TRL, as a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can leverage simulated environments (e.g., coding, terminals, browsers) such as OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insights into design patterns, rollout handling, trajectory batching, and advantage computation, showing how robust multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
Speakers
Kashif Rasul

Research Scientist, Hugging Face
Kashif has a PhD in Mathematics from the Freie Universität Berlin. He is passionate about high-performance computing and reinforcement learning, has presented at NVIDIA's GTC in 2009 and at StrangeLoop in 2012, and is also contributing to a number of data science and deep learning... Read More →
Sergio Paniego Blanco

Machine Learning Engineer, Hugging Face
Sergio has an extensive background in open source and artificial intelligence, the field in which he also earned his PhD. For more than eight years he has taken part in initiatives such as Google Summer of Code, contributing as both a developer and a mentor. Currently... Read More →
Founders Cafe
  Training Systems

15:25 CEST

Lightning Talk: Trinity Large - Torchtitan on 2000+ B300s - Matej Sirovatka, Prime Intellect
Wednesday April 8, 2026 15:25 - 15:35 CEST
In this talk, we'll cover how to use torchtitan to scale training of ultra-sparse mixture-of-experts models across over 2,000 GPUs. We'll walk through the pre-training of Trinity Large, a 400B mixture-of-experts model trained entirely using torchtitan, focusing on maximizing throughput and minimizing the impact of hardware-induced failures. Along the way, we'll discuss challenges like fault tolerance, large-scale distributed training, and ensuring determinism, and how we've addressed each of these using torchtitan. Finally, we'll share insights and common pitfalls to avoid in your own large-scale training runs.
Speakers
Matej Sirovatka

Research Engineer, Prime Intellect
Research Engineer at Prime Intellect, mainly focusing on distributed training, performance and scaling.
Founders Cafe
  Training Systems

15:40 CEST

Lightning Talk: Faster Than SOTA Kernels in Torch.compile With Subgraph Fusions and Custom Op Autotuning - Elias Ellison & Paul Zhang, Meta
Wednesday April 8, 2026 15:40 - 15:50 CEST
This talk shows how subgraph and custom operator autotuning in torch.compile deliver significant speedups, surpassing previous state-of-the-art kernels for matmul and distributed collective ops.

DecomposeK is a novel subgraph optimization in PyTorch, designed to accelerate matrix multiplication when the inner dimension (K) is very large. It delivers up to a 28% speedup over ATen with activation fusion and 10% over ATen without fusion.

Building on subgraph infrastructure, we introduce Custom Op Autotuning, which benchmarks and selects the fastest kernel implementations for custom ops. This enables epilogue fusion and the first distributed collective op autotuning in PyTorch. We also introduce range-based dispatch autotuning, which enables dynamic selection of optimal implementations based on input shapes, ensuring performance that closely matches the theoretical best for each range. Our demo shows our autotuned kernels outperforming Async TP Fused AG+MM by 9% and the Async TP Fully Fused kernel by 41% across all input ranges.
Speakers
Elias Ellison

Software Engineer, Meta
Elias has been working on the PyTorch team for four years, most recently on the torch.compile stack
Paul Zhang

Software Engineer, Meta
Paul Zhang is currently a software engineer working on PyTorch and Triton at Meta, ensuring that PyTorch and PT2 best utilizes the hardware it is run on. Previous to this, Paul has done extensive work on recommendation systems for training and inference, optimizing performance and... Read More →
Founders Cafe

15:55 CEST

DualPipe from Scratch: Implementing DeepSeek's 5D Parallelism in PyTorch - Dev Jadhav, ING Bank
Wednesday April 8, 2026 15:55 - 16:20 CEST
The DeepSeek-V3 paper describes 5D parallelism and DualPipe at a high level, but leaves critical implementation details undocumented. This session presents our open-source PyTorch reference implementation that fills those gaps - verified against the original architecture and designed for learning and extension.

We'll share what we discovered building it from scratch:
Why K_pe is shared across heads in decoupled RoPE (not explicit in paper)
The critical timing of bias updates in auxiliary-loss-free load balancing
How sigmoid routing separates selection scores from gate values
The warmup formula that makes DualPipe achieve 3% bubble overhead
Bugs we caught: causal mask position offsets, EMA initialization, capacity dropping priority

What you'll learn:

5D Parallelism: How TP, PP, DP, EP, and SP interact at 2,048+ GPU scale
DualPipe: Building the bidirectional scheduler with 55% throughput gain over GPipe
Hierarchical All-to-All: Two-level communication reducing MoE dispatch overhead by 4x
Teachable abstractions: CapacityMetrics, ExpertSpecializationTracker, ScheduleStep enums

Prerequisites: torch.distributed basics.
Code: github.com/DevJadhav/deepseek-from-scratch
Speakers
Dev Jadhav

Tech Lead ML Engineer, ING Bank
Dev Jadhav is a production AI/ML engineer with 10+ years building AI systems at scale. He currently leads ML engineering at a major bank, developing financial-grade AI and large-scale model operations. Dev is the creator of DeepSeek From Scratch, an open-source implementation of DeepSe... Read More →
Founders Cafe
  Training Systems

16:25 CEST

Lightning Talk: Bridging the Gap: Engineering Compliant "Glass Box" Medical AI With PyTorch - Muhammad Saqib Hussain, Neurosonic & Mohaddisa Maryam, Neurosonic Academy
Wednesday April 8, 2026 16:25 - 16:35 CEST
While state-of-the-art models like NeuroBOLT demonstrate mathematical excellence in EEG-to-fMRI synthesis, they often remain clinically opaque. With the EU AI Act classifying medical AI as "high-risk," hospitals cannot deploy "black boxes"; they require systems that are transparent, auditable, and legally compliant.
This session presents a "Clinical Auditing System" built within the PyTorch ecosystem, designed to transform opaque deep learning models into transparent "Glass Boxes." I will demonstrate a workflow that backpropagates gradients from high-dimensional 4D fMRI volumes to identify the specific EEG spectral signatures driving those predictions.
Key Technical Takeaways:
1. The Audit Layer: Implementing IntegratedGradients (Captum) to verify model fidelity, ensuring predictions stem from valid neural oscillations rather than noise artifacts.
2. Cross-Modal Reasoning: A technical demonstration of mapping 4D volumetric outputs back to 1D EEG frequency bands, enabling the model to "reason" through neurovascular coupling.
This presentation is designed for developers seeking to wrap PyTorch models in safety layers that satisfy the demands of healthcare regulation.
Speakers
Mohaddisa Maryam

Miss, Neurosonic Academy
I am a first-year student of medicine in Italy.
Muhammad Saqib Hussain

Medical Student, AI Researcher and Neurotech Founder, ClinExplain
Muhammad Saqib is a 4th-year medical student at Comenius University Bratislava and Founder of Neurosonic Academy. His M.D. thesis explores AI for Sleep Medicine. Leveraging PyTorch and Captum, he builds "Glass Box" auditing frameworks to validate generative neuroimaging models against... Read More →
Founders Cafe
  Applications & Case Studies
 