7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is displayed in CEST (UTC/GMT +2).
Venue: Central Room
Tuesday, April 7
 

11:00 CEST

Lightning Talk: Why Your Forecasting Transformer Isn’t Working (And How To Fix It in Python) - Rosheen Naeem, Open Climate Fix
Tuesday April 7, 2026 11:00 - 11:10 CEST
Renewable energy is clean — but it’s also inherently variable. Solar PV generation can change dramatically within minutes due to cloud cover and weather conditions, making accurate short-term forecasts essential for grid stability, energy trading, and smart-home optimisation.
Open Climate Fix builds open and high-impact forecasting tools to accelerate the transition to a low-carbon energy system. One of these projects is Open Quartz Solar Forecast: an open-source model that uses public PV generation data, site metadata, and numerical weather prediction variables to forecast solar power for any location.
In this talk, I’ll present a real case study from my Google Summer of Code project where I implemented and trained a Temporal Fusion Transformer for multi-horizon solar forecasting. I’ll cover the practical engineering challenges behind making transformer forecasting work in Python: building continuous training windows, aligning weather forecast steps with observations, separating static vs time-varying features, and stabilising training using PyTorch Forecasting and PyTorch Lightning.
Attendees will leave with reusable patterns for real-world time-series forecasting pipelines.
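One of the steps named above, building continuous training windows, can be sketched in plain Python. This is a minimal illustration, not code from Open Quartz Solar Forecast or PyTorch Forecasting: the function name `contiguous_windows`, the 15-minute sampling interval, and the window lengths are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def contiguous_windows(timestamps, step, encoder_len, horizon):
    """Return (start, end) index pairs of windows with no missing timestamps.

    Each window spans encoder_len history points plus horizon target points;
    end is exclusive.
    """
    size = encoder_len + horizon
    windows = []
    run_start = 0  # index where the current gap-free run begins
    for i in range(1, len(timestamps) + 1):
        # A run ends at the end of the data or where the spacing breaks.
        at_end = i == len(timestamps)
        if at_end or timestamps[i] - timestamps[i - 1] != step:
            for start in range(run_start, i - size + 1):
                windows.append((start, start + size))
            run_start = i
    return windows


step = timedelta(minutes=15)
t0 = datetime(2026, 4, 7, 6, 0)
# Six readings, then a 45-minute jump, then four more readings.
stamps = [t0 + k * step for k in range(6)] + [t0 + (8 + k) * step for k in range(4)]
windows = contiguous_windows(stamps, step, encoder_len=3, horizon=1)
```

No window straddles the missing readings, so a model trained on these index pairs never sees a history that silently skips time.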
Speakers
Rosheen Naeem

Software Engineer, Miro
I am a Software Engineer at Miro and a community member at Open Climate Fix. I completed the Erasmus Mundus Master’s in Software Engineering for the Green Deal (SE4GD), a joint degree program across Vrije Universiteit Amsterdam (Netherlands), LUT University (Finland), and Universit... Read More →
Central Room
  Applications & Case Studies

11:15 CEST

Lightning Talk: Deep Learning in the Wild: Embedded PyTorch for Real-World Conservation Bioacoustics - Taraqur Rahman & Owen O'Donnell, OWL Integrations
Tuesday April 7, 2026 11:15 - 11:25 CEST
Passive acoustic monitoring is a powerful tool for wildlife conservation, but deploying deep learning models in remote rainforest environments introduces strict constraints on power, memory, and compute. In this talk, we present an end-to-end PyTorch-based pipeline for detecting and analyzing the endangered three-wattled bellbird using embedded deep learning systems.

We cover the full lifecycle from audio preprocessing and model training in PyTorch to optimization and deployment on resource-constrained embedded devices. Topics include model architectures for sparse bioacoustic event detection, handling extreme class imbalance, model compression and quantization, and practical trade-offs between accuracy, latency, and power consumption.

The session emphasizes real-world lessons learned deploying machine learning at the edge, where unreliable connectivity, noisy signals, and limited hardware define success more than benchmark metrics. Attendees will gain practical patterns for building and deploying PyTorch models for embedded and edge AI applications with real environmental impact.
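The abstract mentions extreme class imbalance without naming a technique. One common option is inverse-frequency class weighting, sketched below under stated assumptions: the label names and counts are hypothetical, and the talk may well use a different approach.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical clip labels: 2 bellbird calls among 100 audio windows.
labels = ["background"] * 98 + ["bellbird"] * 2
weights = inverse_frequency_weights(labels)
```

Weights like these could then be passed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss`, so that the rare class contributes as much to the gradient as the common one.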
Speakers
Owen O'Donnell

Embedded Systems and Machine Learning Engineer, OWL Integrations
Owen O'Donnell is a Machine Learning and Embedded Systems Engineer at OWL Integrations. He trains ML models for deployment in remote locations on resource-constrained electronics. This introduces challenges such as needing smaller models and having... Read More →
Taraqur Rahman

Chief Data Scientist, OWL Integrations
Taraqur Rahman is Chief Data Scientist and Co-Founder at OWL Integrations and Organizer/Co-Founder of Biased Outliers, where he leads applied machine learning and data science initiatives with real-world impact. He combines deep technical expertise in Python with practical deployment... Read More →
Central Room
  Applications & Case Studies
  • Audience Level Any
  • Slides Attached Yes

11:30 CEST

Lightning Talk: How DeepInverse Is Solving Imaging in Science and Healthcare With PyTorch - Andrew Wang, DeepInverse; Minh Hai Nguyen, Université de Toulouse
Tuesday April 7, 2026 11:30 - 11:40 CEST
Deep learning has revolutionised imaging, a foundation of science and healthcare. DeepInverse is the PyTorch library for solving imaging problems, unifying deep learning methods (e.g. diffusion models), physics (medical, optics) and modern tooling. In this talk, we’ll show how the PyTorch community can get involved in this exciting yet accessible application of open-source AI.

AI methods in imaging must model the imaging physics, leading to interesting engineering problems e.g. efficient differentiable ops, physics-informed losses. We’ll show notebooks on real use-cases: accelerating brain MRI, reducing radiation in CT scans, imaging black holes.

PyTorch enthusiasts at any level/background can contribute - from training infra for scientific data to high-level generative modelling frameworks - their AI engineering skills can directly impact imaging across multiple fields.

DeepInverse is supported by a growing international user community and proudly rooted in Paris. We’ve joined the PyTorch Ecosystem and received the Prix Science Ouverte in 2024. We’re excited to join the PyTorch Conf to celebrate the vibrant French developer community!
Speakers
Andrew Wang

CTO & Co-founder, Blur Labs
Andrew is a lead developer of DeepInverse as well as the CTO & co-founder of Blur Labs, a startup based in Paris building AI models for imaging. Andrew did his PhD at the University of Edinburgh in magnetic resonance image reconstruction.
Minh Hai Nguyen

PhD candidate, Toulouse University
Central Room
  Applications & Case Studies
  • Audience Level Any
  • Slides Attached Yes

11:45 CEST

Lightning Talk: ExecuTorch on Microcontrollers: Deploying PyTorch To the Smallest Edge - RJ Ascani & Matthias Cremon, Meta
Tuesday April 7, 2026 11:45 - 11:55 CEST
ExecuTorch extends PyTorch's reach to the most resource-constrained devices: microcontrollers, DSPs, and specialized neural processing units powering always-on sensors, wearables, and embedded systems. In this talk, we'll share the current state and roadmap for running ExecuTorch on platforms where every kilobyte of memory and milliwatt of power matters.

What you'll learn:
- How ExecuTorch's design enables deployment from ultra-low-power MCUs to DSP and NPU accelerators, all from a single PyTorch workflow
- The state of backend support for Cadence DSPs, ARM Ethos-U and Cortex-M
- Practical considerations for deploying models with sub-megabyte footprints and milliwatt power budgets
- Case studies spanning always-on audio, embedded vision, and TinyML applications
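To make the "sub-megabyte footprint" constraint concrete, here is a back-of-the-envelope sketch. The parameter count is hypothetical, and the calculation covers weight storage only; it says nothing about how ExecuTorch itself lays out or measures memory.

```python
def weight_footprint_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone (no activations or runtime)."""
    return num_params * bits_per_weight // 8


ONE_MIB = 1024 * 1024

# Hypothetical keyword-spotting model with 500,000 parameters.
params = 500_000
fp32_bytes = weight_footprint_bytes(params, 32)
int8_bytes = weight_footprint_bytes(params, 8)
fits_in_flash = int8_bytes < ONE_MIB
```

In this example the float32 weights exceed one mebibyte while the 8-bit version fits, which is why quantization tends to be a prerequisite rather than an optimization on this class of device.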
Speakers
Matthias Cremon

Software Engineering Manager, Meta
Matthias Cremon is a Software Engineering Manager at Meta in the Silicon AI Software Team, working on AI compilers for various edge devices. He focuses on the frontend, graph level optimization side, as well as the integration of low-level, vendor specific implementations to run on... Read More →
RJ Ascani

Software Engineer, Meta
RJ Ascani is an embedded software engineer on Meta’s PyTorch Edge team, focusing on advancing ExecuTorch for microcontroller platforms.
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

12:00 CEST

Write Once, Run Everywhere with PyTorch Transformers - Pedro Cuenca, Hugging Face
Tuesday April 7, 2026 12:00 - 12:25 CEST
The Hugging Face transformers library is built on pure PyTorch and can be succinctly described as a model-definition framework. It provides a unified, familiar, clear and concise interface to multiple machine learning architectures across modalities.

Serving and inference optimizations are not its focus.

However, transformers model definitions become the de facto reference implementations that multiple other projects use. This includes training libraries, fast deployment engines such as vLLM and SGLang, and on-device libraries like MLX and llama.cpp.

This session describes the path towards increasingly simple downstream integration of transformers models into inference and deployment libraries, and how transformers and PyTorch core features enable the ecosystem to use new models as soon as they are released.

We'll go through the journey towards easier modeling, which implies easier downstream porting and adaptation. The end-game is pure interoperability, where no code changes are required! This is now possible with vLLM and SGLang, and we'll show how. We'll end by discussing our ideas on upcoming interop features with MLX and llama.cpp.
Speakers
Pedro Cuenca

ML Engineer, Hugging Face
Pedro Cuenca is a machine learning engineer at Hugging Face, working in developer advocacy and on-device ML. He has 20+ years of software development experience across internet applications and iOS. He worked on the technology behind Camera+, an iPhone app using custom ML for photography... Read More →
Central Room

13:45 CEST

Why WideEP Inference Needs Data-Parallel-Aware Scheduling - Maroon Ayoub, IBM; Tyler Michael Smith, Red Hat
Tuesday April 7, 2026 13:45 - 14:10 CEST
WideEP (wide expert parallelism) fails not because experts are expensive, but because routing ignores where state already lives. In PyTorch LLM serving with vLLM, WideEP fans tokens across many experts while KV caches accumulate unevenly across data-parallel replicas. When routing is unaware of KV placement and per-replica load, requests land on replicas that cannot reuse cache or make progress efficiently, and latency spikes as expert fan-out grows.
The fix is not reshaping expert parallelism, but making routing data-parallel aware using signals vLLM already exposes. In this talk, we show how llm-d extends its router to leverage KV-cache locality and load awareness when routing WideEP flows. Rather than treating replicas as interchangeable, the router prefers replicas with warm KV state and available capacity, aligning routing decisions with vLLM’s execution reality and reducing cache fragmentation.
This session walks through how KV-aware, data-parallel routing changes WideEP inference in practice: which signals matter, how routing behavior evolves, and where the gains come from. Attendees leave with a clear mental model for when KV- and load-aware routing unlocks higher throughput.
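The idea of preferring replicas with warm KV state and available capacity can be illustrated with a toy scorer. This is a sketch under stated assumptions, not llm-d's router: the replica dictionary layout, the scoring formula, and the `load_penalty` value are all hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(request_tokens, replicas, load_penalty=4.0):
    """Choose the replica with the best trade-off of warm prefix and load.

    replicas maps a name to {"cached": [token lists], "queued": int}.
    The scoring formula and load_penalty are illustrative assumptions.
    """
    def score(name):
        info = replicas[name]
        warm = max(
            (shared_prefix_len(request_tokens, c) for c in info["cached"]),
            default=0,
        )
        return warm - load_penalty * info["queued"]

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(replicas), key=score)


replicas = {
    "dp0": {"cached": [[1, 2, 3, 4, 5, 6, 7, 8]], "queued": 1},
    "dp1": {"cached": [], "queued": 0},
    "dp2": {"cached": [[1, 2, 3, 4, 5, 6, 7, 8]], "queued": 5},
}
choice = pick_replica([1, 2, 3, 4, 5, 6, 7, 8, 9], replicas)
cold_choice = pick_replica([42], replicas)
```

A request that shares a long prefix goes to the lightly loaded warm replica, while a request with no cached prefix falls back to the idle one, which is the behaviour a load-only or cache-only policy would each get half wrong.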
Speakers
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Tyler Michael Smith

Chief Architect - Inference Engineering, Red Hat
Tyler received a PhD in Computer Science at The University of Texas at Austin, studying high performance dense linear algebra - microkernels, parallelism, and theoretical lower bounds on data movement. After a postdoc at ETH Zürich, he joined Neural Magic, first working on a graph... Read More →
Central Room

14:15 CEST

The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google
Tuesday April 7, 2026 14:15 - 14:40 CEST
Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.

We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.

In this talk, we present a sidecar implementation for PyTorch-based servers (like vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results, discussing how varying chunk sizes impact priority handling and tail latency. Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean—offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
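The "token slice" analogy maps onto a classic round-robin scheduler. The simulation below is only a sketch of that analogy: it counts tokens rather than running a model, the request names and sizes are hypothetical, and it does not reflect the actual sidecar implementation.

```python
from collections import deque


def run_chunked(requests, chunk_size):
    """Round-robin decode: each request gets chunk_size tokens per turn.

    requests maps a request id to the number of tokens it still has to
    generate. Returns the order in which requests finish. A paused request
    keeps its place in the queue, standing in for a retained KV cache.
    """
    queue = deque(requests.items())
    finished = []
    while queue:
        req_id, remaining = queue.popleft()
        remaining -= min(chunk_size, remaining)  # decode one time slice
        if remaining == 0:
            finished.append(req_id)
        else:
            queue.append((req_id, remaining))  # preempt and requeue
    return finished


workload = {"long": 1000, "short_a": 20, "short_b": 30}
# A long request arrives first; two short ones queue behind it.
order = run_chunked(workload, chunk_size=16)
# A slice as large as the longest request degenerates to run-to-completion.
blocking_order = run_chunked(workload, chunk_size=1000)
```

With a 16-token slice the short requests finish before the long one, whereas run-to-completion makes them wait behind it, which is the head-of-line blocking the abstract describes.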
Speakers
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Kellen Swain

Senior Software Engineer, Google
Kellen is a Senior Engineer at Google, and is a maintainer of both the llm-d and Inference Gateway projects.
Central Room

14:45 CEST

The Science and Practice of Open and Scalable LLM Evaluations - Grzegorz Chlebus, NVIDIA
Tuesday April 7, 2026 14:45 - 15:10 CEST
Rapid advances in AI have expanded the range of capabilities required for successful real-world deployment. Understanding where we are in this multi-dimensional frontier is essential for accelerating innovation through effective quality assurance. Rigorous evaluation is increasingly difficult to scale as development requires testing many checkpoints across numerous benchmarks. Model comparison is further complicated by limited transparency of reported results. This talk explores challenges, best practices, and open-source tools that elevate evaluation to a core component of LLM development, delivering continuous signals across the model lifecycle.
We discuss principles for standardizing evaluation methods and improving consistency through practical patterns and anti-patterns, and examples of integrating the science of evaluation directly into model development. Using Nemo-Evaluator, an open-source scalable evaluation tool, we demonstrate modular architectures that enable transparent, reproducible measurement. Finally, we show how Nemo-Evaluator supports reproducible evaluation for the Nemotron model family, helping enable one of the most open development processes in modern AI.
Speakers
Grzegorz Chlebus

Manager R&D, NVIDIA
Grzegorz Chlebus is a Manager at Frontier Model Evaluation at NVIDIA, where he leads tooling and infrastructure efforts for evaluating frontier AI models. He holds a PhD in Medical Sciences from Radboud University Nijmegen, focused on deep learning-based medical image segmentation... Read More →
Central Room
  GenAI & Multimodal

15:40 CEST

Enabling State-of-the-art Asynchronous Execution in Torch.compile With CUDA Streams - Michael Lazos, Meta
Tuesday April 7, 2026 15:40 - 16:05 CEST
CUDA streams are a widely used method for parallelizing GPU computation on NVIDIA GPUs. They have long been requested by our users and enable multiple key capabilities: overlapping communication and compute kernels, training on multiple batches in parallel, and parallelizing kernels, all of which are needed for achieving SOTA training performance. Another key capability is activation offloading, which can be applied to any model to prevent OOMs by asynchronously storing activations in CPU memory until they are needed by the model.

Before this work, torch.compile would graph break on CUDA stream contexts, which can be costly for models that utilize streams. Although workarounds exist (e.g. wrapping stream manipulation into custom ops), these solutions add complexity and create friction in the user experience. By enabling seamless CUDA stream support in PT2, we allow our users to leverage the familiar eager APIs for stream assignment and synchronization directly within torch.compile. This not only simplifies the workflow but also ensures that models using custom streaming patterns can run efficiently out-of-the-box without manual intervention or code restructuring.
Speakers
Michael Lazos

Software Engineer, Meta
Michael Lazos is a software engineer at Meta where he contributes to torch.compile. His expertise spans both graph extraction with TorchDynamo and generating optimized kernels with the backend compiler TorchInductor. Previously, he was at Microsoft contributing to project Brainwave... Read More →
Central Room
  Frameworks & Compilers

16:10 CEST

Build PyTorch to Understand PyTorch - Vijay Janapa Reddi, Harvard University; Andrea Mattia Garavagno, University of Genoa
Tuesday April 7, 2026 16:10 - 16:35 CEST
PyTorch's success depends on more than users—it needs engineers who understand what's inside. Engineers who can debug framework issues, optimize at the systems level, contribute upstream, and build what comes next. But ML education today produces practitioners who call APIs without understanding them. They train models without knowing why Adam needs 3× the memory of SGD, or what happens when they call loss.backward().

TinyTorch is a 20-module open-source curriculum that closes this gap. Students construct PyTorch's core components—tensors, autograd, optimizers, CNNs, transformers—in pure Python, building a complete framework where every operation is code they wrote. By the final module, they don't just use PyTorch; they understand how to build it.

The curriculum uses progressive disclosure, systems-first profiling from Module 01, and build-to-validate milestones—recreating ML breakthroughs from Perceptron (1958) through Transformers (2017), culminating in MLPerf-style benchmarking.

TinyTorch is how we grow the next generation of PyTorch contributors and the engineers who will build what comes after.

Open source: mlsysbook.ai/tinytorch
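To give a flavour of what "what happens when they call loss.backward()" means in practice, here is a minimal scalar reverse-mode autograd in pure Python. It is an independent sketch, not code from TinyTorch or PyTorch; the class name `Value` and its attributes are chosen for this illustration only.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = w * x + b
loss.backward()
```

After the call, `w.grad` holds the value of `x`, `x.grad` holds the value of `w`, and `b.grad` is 1, which are the partial derivatives of `w * x + b`.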
Speakers
Vijay Janapa Reddi

Professor, Harvard University
Vijay Janapa Reddi is a Professor at Harvard University, where he leads research at the intersection of machine learning and computer systems. He is the author of the open-source Machine Learning Systems textbook (mlsysbook.ai) and co-founder of MLCommons, the organization behind... Read More →
Andrea Mattia Garavagno

Research Fellow, University of Genoa & Scuola Superiore Sant'Anna
I am a Research Fellow holding a joint position at the University of Genoa and Scuola Superiore Sant'Anna. My research is centered on Edge AI, where I am currently working to automate the design of applications through Hardware-Aware Neural Architecture Search (NAS). By running these... Read More →
Central Room
  Frameworks & Compilers
  • Audience Level Any
  • Slides Attached Yes

16:40 CEST

Lightning Talk: TerraKit: Standardising AI-Ready Geospatial Data Preparation for the TorchGeo Ecosystem - Rosie Lickorish & Romeo Kienzler, IBM
Tuesday April 7, 2026 16:40 - 16:50 CEST
With the advent of geospatial foundation models, unexplored use cases are emerging that require well-curated datasets. Currently, no standardised approach exists for creating such AI-ready geospatial datasets. In this session, we introduce TerraKit: a comprehensive open-source Python library for retrieving and processing geospatial data that integrates seamlessly with upstream geospatial model training libraries such as TorchGeo or TerraTorch.

From raster/vector annotations, TerraKit will match, download, process, align and split the requested data source (e.g., EarthData, CDSE, Planetary Computer) based on user specifications provided in a simple configuration file. TerraKit also supports spatial train/val splits and exports datasets in standard formats such as TACO datasets. TerraKit streamlines the pipeline from raw EO data to AI-ready datasets, accelerating the development of custom geospatial applications and ensuring query and processing pipelines are reproducible. By lowering the barrier to entry, a wider community of TorchGeo and TerraTorch users is empowered to leverage foundation models for Earth observation.
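A spatial train/val split keeps neighbouring tiles in the same partition so that validation does not leak through spatial autocorrelation. The sketch below shows one way this could work; it is not TerraKit's API. The function name `spatial_split`, the tile grid, and the hash-based assignment are all assumptions made for illustration.

```python
import hashlib


def spatial_split(tiles, block_size, val_fraction=0.2):
    """Assign whole spatial blocks to train or val so neighbours stay together.

    tiles is a list of (x, y) tile coordinates. Tiles are grouped into
    block_size x block_size blocks, and each block is hashed to a split.
    """
    train, val = [], []
    for x, y in tiles:
        block = (x // block_size, y // block_size)
        digest = hashlib.sha256(repr(block).encode()).digest()
        # First digest byte mapped to [0, 1) gives a stable pseudo-random draw.
        bucket = digest[0] / 256
        (val if bucket < val_fraction else train).append((x, y))
    return train, val


tiles = [(x, y) for x in range(8) for y in range(8)]
train, val = spatial_split(tiles, block_size=4)
```

Because the assignment depends only on the block coordinates, rerunning the split on the same tiles reproduces the same partition, which matches the reproducibility goal stated above. With few blocks, the realised validation share can differ noticeably from `val_fraction`.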
Speakers
Romeo Kienzler

AI Research Engineer, IBM
Romeo is a data scientist working for IBM Research and an advocate for ethical machine learning, transparency and privacy
Rosie Lickorish

Research Software Engineer, IBM
Rosie is a Research Software Engineer at IBM, specializing in the development of next-generation tools and technologies designed to drastically accelerate solutions for today’s most urgent global challenges. Her technical focus involves leveraging geospatial data, AI models... Read More →
Central Room
  GenAI & Multimodal
  • Audience Level Any
  • Slides Attached Yes

16:55 CEST

Lightning Talk: Bayesian Neural Networks With Variational Inference in PyTorch - Lars Heyen, Karlsruhe Institute of Technology, Scientific Computing Center
Tuesday April 7, 2026 16:55 - 17:05 CEST
Uncertainty quantification is becoming more and more important as neural networks are used for increasingly critical tasks. Bayesian neural networks (BNNs) inherently provide a measure of their own uncertainty, but can be either hard to implement or inflexible if one uses common frameworks. In this session I discuss how to efficiently implement BNNs using Variational Inference within PyTorch and present torch_blue, a light-weight open source library that implements these methods with the goal of being easy to pick up, yet flexible enough for research on BNNs.
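Two building blocks of mean-field Gaussian variational inference are sampling a weight with the reparameterisation trick and computing the KL term against the prior. The sketch below shows both for a single scalar weight with a standard normal prior. It does not use or represent the torch_blue API; the function names and the softplus parameterisation of the standard deviation are assumptions for this example.

```python
import math
import random


def sample_weight(mu, rho, rng):
    """Reparameterisation trick: w = mu + sigma * eps with eps ~ N(0, 1)."""
    sigma = math.log1p(math.exp(rho))  # softplus keeps sigma positive
    return mu + sigma * rng.gauss(0.0, 1.0)


def kl_to_standard_normal(mu, rho):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single weight."""
    sigma = math.log1p(math.exp(rho))
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)


rng = random.Random(0)
w = sample_weight(mu=0.5, rho=-3.0, rng=rng)
kl = kl_to_standard_normal(mu=0.5, rho=-3.0)
```

In a training loop the KL term, summed over all weights, would be added to the data-fit loss, and the sampled weight would be used in the forward pass so that gradients reach `mu` and `rho`.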
Speakers
Lars Heyen

PostDoc, Karlsruhe Institute of Technology
I am a postdoctoral researcher working on uncertainty quantification in the research group "Robust and Efficient AI" at the Scientific Computing Center of the Karlsruhe Institute of Technology. I also coauthored the PyTorch-based library torch_blue for implementing Bayesian neural... Read More →
Central Room
  Frameworks & Compilers
  • Audience Level Any
  • Slides Attached Yes
 
Wednesday, April 8
 

10:35 CEST

Lightning Talk: Live Migration of PyTorch GPU Nodes From Azure To European Clouds - Mike Krom, Acf Cyber Solutions
Wednesday April 8, 2026 10:35 - 10:45 CEST
Many European PyTorch teams run their GPU workloads on hyperscalers like Azure, AWS, or GCP—often without realizing that this places their data and models under US jurisdiction.

This lightning talk shows how PyTorch compute nodes can be migrated to European cloud providers while keeping the full ML environment intact. Through a live demo, we migrate a GPU-enabled PyTorch VM—including CUDA drivers and Jupyter notebooks—from Azure to European infrastructure, without retraining models or rebuilding environments.

The focus is on practical challenges: GPU compatibility, reproducibility, and data movement across clouds.

The migration is demonstrated using DigitalNomadSky, an open-source Python platform for cross-cloud VM migration, but the lessons apply broadly to PyTorch teams aiming to reduce jurisdictional risk and vendor lock-in.

Key takeaways
Why PyTorch workloads on hyperscalers raise sovereignty concerns for EU teams
What actually breaks (and what doesn’t) when migrating GPU-based ML nodes
How to regain control over ML infrastructure without rewriting your stack
Speakers
Mike Krom

Partner, ACF Cybersolutions
I am a software architect and lead developer of the open-source project DigitalNomadSky. I have extensive experience with Microsoft Azure from working at Microsoft and supporting large-scale cloud migrations. My work focuses on supporting data science and ML teams with cloud infrastructure... Read More →
Central Room
  Security & Privacy

10:50 CEST

Lightning Talk: Step-Aligned Telemetry for Distributed PyTorch Training (Time & Memory Attribution Across Ranks) - Abhinav Srivastav, TraceOpt
Wednesday April 8, 2026 10:50 - 11:00 CEST
Distributed PyTorch training often looks healthy in system dashboards: GPU utilization is high and memory is stable, yet throughput degrades, steps jitter, or GPUs go idle intermittently. The core issue is misalignment: most telemetry is sampled by time, while training progresses by "steps", and distributed behavior is dominated by the slowest rank rather than averages.

In this talk I will break down common failure modes in DDP training that standard metrics miss (rank stragglers, dataloader stalls, step-time variance, and memory spikes/creep). We will show how step-aligned, rank-aware aggregation changes debugging: per-step worst-rank vs median-rank views, gating to completed steps across ranks, and how to tie time and memory back to training semantics without relying on heavyweight profilers.
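The worst-rank versus median-rank view, gated to steps that every rank has completed, can be sketched in a few lines. This is an illustration only and not TraceML code: the input layout, the function name `step_aligned_summary`, and the timings are hypothetical.

```python
from statistics import median


def step_aligned_summary(step_times):
    """Summarise per-step timings across ranks.

    step_times maps rank -> list of step durations in seconds, where index i
    is training step i. Only steps completed by every rank are reported.
    """
    completed = min(len(times) for times in step_times.values())
    summary = []
    for step in range(completed):
        per_rank = {rank: times[step] for rank, times in step_times.items()}
        worst_rank = max(per_rank, key=per_rank.get)
        summary.append({
            "step": step,
            "median_s": median(per_rank.values()),
            "worst_s": per_rank[worst_rank],
            "worst_rank": worst_rank,
        })
    return summary


# Hypothetical timings: rank 2 straggles on step 1; rank 0 is one step ahead.
timings = {
    0: [0.50, 0.51, 0.50],
    1: [0.52, 0.50],
    2: [0.51, 0.90],
}
report = step_aligned_summary(timings)
```

The median for step 1 looks healthy while the worst-rank time does not, and the third step from rank 0 is held back until the other ranks report it, so partial steps never skew the comparison.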
Speakers
Abhinav Srivastav

ML Scientist, TraceOpt
ML researcher with a PhD in Computer Science. Industry experience at IBM Research, Huawei Research, and Zalando. Currently building TraceML: an open source tool that shows you the step-level breakdown of your PyTorch training run while it's still running. I am partially interested in... Read More →
Central Room
  Training Systems

11:05 CEST

Lightning Talk: KV-Cache Centric Inference: Building a State-Aware Serving Platform With llm-d and vLLM - Maroon Ayoub & Martin Hickey, IBM Research
Wednesday April 8, 2026 11:05 - 11:15 CEST
We’ve spent years optimizing LLM inference around compute - faster kernels, better batching, smarter parallelism. But in production, the bottleneck increasingly isn’t FLOPs. It’s state. Specifically, the KV-cache: the attention state that makes the difference between a 4-second prefill and a sub-second cache hit. Lose it to eviction, isolate it on a single node, or fail to route to it - and you’re paying the full compute cost again for work already done.

KV-cache centric inference flips the design priority. Instead of treating cache as a byproduct, it becomes the organizing principle of the serving platform. This means tiered memory management - offloading KV blocks from GPU to CPU to shared storage so capacity scales beyond any single node. It means cross-replica visibility - so cached state computed on one instance is reusable by any other. And it means cache-aware scheduling - routing requests to where their prefix already lives.

We cover how llm-d and vLLM implement each layer, how they compose into a coherent system, and what it looks like in practice - with benchmarks, deployment patterns, and lessons from building a KV-cache centric platform in the open.
Speakers
Martin Hickey

Senior Technical Staff Member, IBM Research
Martin Hickey is a STSM at IBM Research, focused on Open Source, Cloud Native Computing, and AI. Martin has notable contributions to open source projects like vLLM, LMCache, Kubernetes, Helm, OpenTelemetry and OpenStack. Martin is a core maintainer for LMCache and an emeritus core... Read More →
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Central Room

11:20 CEST

Lightning Talk: Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving - Maroon Ayoub, IBM Research & Hyunkyun Moon, Moreh
Wednesday April 8, 2026 11:20 - 11:30 CEST
Agentic AI workloads - tree-of-thought exploration, ReAct loops, hierarchical swarms - expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. We are currently evicting high-value state to preserve throwaway tokens.

In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.

We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch, applying differentiated policies to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax" - evolving the PyTorch serving stack from request-centric to workload-aware.
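The contrast between blind LRU and type-aware eviction can be shown with a small sketch. The type names follow the abstract, but everything else is an assumption for illustration: the priority ordering, the block dictionary layout, and the function `choose_victim` are hypothetical and do not describe the llm-d or vLLM implementation.

```python
from enum import IntEnum


class BlockType(IntEnum):
    # Lower value = evicted sooner. The ordering is an illustrative assumption.
    REASONING_BRANCH = 0
    TOOL_DEFINITION = 1
    SYSTEM_PROMPT = 2


def choose_victim(blocks):
    """Pick the block to evict: cheapest type first, then least recently used.

    blocks is a list of dicts with "id", "type" (BlockType) and "last_used".
    Pinned system prompts are never evicted; returns None if nothing is
    evictable.
    """
    candidates = [b for b in blocks if b["type"] != BlockType.SYSTEM_PROMPT]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b["type"], b["last_used"]))["id"]


blocks = [
    {"id": "sys", "type": BlockType.SYSTEM_PROMPT, "last_used": 1},
    {"id": "tools", "type": BlockType.TOOL_DEFINITION, "last_used": 2},
    {"id": "thought", "type": BlockType.REASONING_BRANCH, "last_used": 9},
]
victim = choose_victim(blocks)
# Plain LRU ignores block type and picks the oldest block instead.
lru_victim = min(blocks, key=lambda b: b["last_used"])["id"]
```

Here plain LRU would evict the system prompt because it was touched longest ago, while the typed policy evicts the finished reasoning branch even though it was used most recently.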
Speakers
Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.
Hyunkyun Moon

MLOps Engineer, Moreh
Hyunkyun Moon is an ML Platform Engineer at Moreh, focusing on building high-performance LLM inference platforms with llm-d. He is an active contributor to open-source projects, including llm-d and vLLM. With a strong background in large-scale Kubernetes-native infrastructure, he... Read More →
Central Room

11:35 CEST

Optimizing Large MoE Inference on NVIDIA Blackwell: NVFP4, ADP, and DualPipe Strategies - Julien Demouth, NVIDIA
Wednesday April 8, 2026 11:35 - 12:00 CEST
Deploying massive Mixture-of-Experts (MoE) architectures like DeepSeek-V3/R1 requires a co-designed approach leveraging NVIDIA Blackwell’s fifth-generation Tensor Cores. This session details the transition to NVFP4 precision for MoE weights to significantly reduce memory load, coupled with FP4/FP8 KV caching to minimize attention layer footprint and enable higher concurrency.
We will analyze the architectural shift to Expert Parallelism (EP) for expert layers to maximize FLOPS, and Attention Data Parallelism (ADP) for attention heads—avoiding redundant KV replication and converting Multi-Head Latent Attention (MLA) into Multi-Query Attention (MQA) via weight absorption. The talk will demonstrate advanced execution strategies, including DualPipe algorithms to overlap dispatch/combine communication with computation, and the integration of DeepGEMM and FlashInfer kernels. Finally, we will cover runtime optimizations using Programmatic Dependent Launch (PDL) and CUDA Graphs to minimize host latency, alongside Multi-Token Prediction (MTP) for accelerated speculative decoding.
Speakers
Julien Demouth

Senior Distinguished Engineer - Eng. Lead for AI Labs & Models, NVIDIA
Central Room

13:30 CEST

Lightning Talk: From Hugging Face To Handheld: Scaling LLM Deployment With LiteRT Generative API - Cormac Brick & Weiyi Wang, Google
Wednesday April 8, 2026 13:30 - 13:40 CEST
This session will demonstrate the end-to-end journey of bringing custom PyTorch-based open source LLMs to cross-platform devices using LiteRT. We will show developers how to take a custom Hugging Face Transformers checkpoint and convert it for on-device execution, including:
- Taking the PyTorch model from conversion to deployment.
- Automated Optimization: How LiteRT performs automated patching of performance-critical components, including architecture-specific rewrites for PyTorch models.
- Seamless Fine-Tuning Integration: How to move from an Unsloth fine-tuning session to a TorchAO-quantized model and LiteRT export without leaving your script.
- The "0-Day" Enablement Strategy: Well-known architectures are supported out-of-the-box. We’ll share how we enabled the QWEN0.6 (or Liquid AI) model in just 20 minutes.
- Interactive Validation: Run inference on the exported model directly in the Terminal or Colab to verify numerical correctness before deploying to device.
This workflow shows a smooth fine-tune-to-deployment story where everything stays within the original PyTorch/Hugging Face ecosystem. Viewers can "vibe code" along using Gemini CLI or other coding agents.
Speakers
Cormac Brick

Principal Engineer, Google AI Edge, Google
Cormac Brick is a Principal Engineer on the Google AI Edge team, where he specializes in frameworks and on-device AI. He has over 10 years of experience in AI software, silicon and systems, with work spanning AI frameworks and ecosystems and compilers down to silicon microarchitecture...
Weiyi Wang

Software Engineer, Google
Weiyi is a lead software engineer on LiteRT/TFLite, focusing on the compiler, NPU, and GenAI stack.
Central Room

13:45 CEST

Lightning Talk: Slash LLM Cold-Start Times by Pre-distributing GPU Caches - Billy McFall & Maryam Tahhan, Red Hat
Wednesday April 8, 2026 13:45 - 13:55 CEST
Are your Large Language Model (LLM) deployments stuck waiting for GPU kernels to compile? If you are running distributed inference at scale, your infrastructure is likely wasting time rebuilding the same GPU Kernel Cache for every single instance. You may not even realize the time and resources that are being consumed for rebuilding. This session is designed for platform engineers and ML practitioners who need to optimize inference scaling and reduce startup latency.

We will demonstrate how to eliminate redundant compilation by pre-distributing GPU kernel caches to all the inference nodes using KServe, a distributed model inference runtime for Kubernetes. Beyond just the "what," we will dive into the technical implementation of signing, verifying, and mounting cache images to ensure supply-chain security across clusters. Attendees will leave with a practical blueprint for reducing cold-start times and securing GPU-heavy workloads in production.
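The verify-before-mount step can be illustrated with a small sketch. It uses a shared-secret HMAC from the Python standard library purely as a stand-in for whatever image-signing tooling the session covers; the function names, the key, and the byte payload are all hypothetical.

```python
import hashlib
import hmac


def sign_cache(blob: bytes, key: bytes) -> str:
    """Return a hex signature for a cache archive (HMAC-SHA256 stand-in)."""
    return hmac.new(key, blob, hashlib.sha256).hexdigest()


def verify_cache(blob: bytes, key: bytes, signature: str) -> bool:
    """Check the archive against its signature using a constant-time compare."""
    return hmac.compare_digest(sign_cache(blob, key), signature)


key = b"demo-signing-key"                     # hypothetical key, demo only
cache_blob = b"compiled-kernel-cache-bytes"   # placeholder for a real archive

signature = sign_cache(cache_blob, key)

# A node would only mount the cache if verification succeeds.
ok = verify_cache(cache_blob, key, signature)
tampered = verify_cache(cache_blob + b"x", key, signature)
print(ok, tampered)
```

The point of the pattern is the ordering: the signature is produced once where the cache is built, and every node checks it before the cache is made visible to the inference process.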
Speakers
Billy McFall

Sr. Principal Software Engineer, Red Hat
Billy McFall has been a software engineer in the Emerging Tech Networking Team within the Office of the CTO at Red Hat for 9+ years. Billy previously worked on Kubernetes/OpenShift networking, including the integration of the NVIDIA DPU into OpenShift. Billy has also been a maintainer of...
Maryam Tahhan

Principal Engineer, Red Hat
Maryam is a Principal Engineer in Red Hat's Office of the CTO, where she focuses on standardising CPU inferencing performance evaluation to help effectively validate and scale ML workloads.
Central Room
  Inference & Production

14:00 CEST

Lightning Talk: Backpropagation-Free Optimization in PyTorch - Andrii Krutsylo, Polish Academy of Sciences
Wednesday April 8, 2026 14:00 - 14:10 CEST
Backpropagation is not the only mechanism for training deep networks. This talk presents a compact, implementation-driven map of backpropagation-free training methods, organized around representative algorithms that expose key design trade-offs.

We focus on four families: Difference Target Propagation (target-based credit assignment), Direct Feedback Alignment (random feedback without weight transport), local loss / greedy layerwise training (strictly local objectives), and Forward-Forward learning as a forward-only alternative. Each is treated as a minimal working pattern rather than a full system.

For each representative, we answer the same practical questions: what learning signal is propagated, what intermediate state must be stored, how parameters are updated, and what limits scalability on modern accelerators. The emphasis is on PyTorch-level mechanics—explicit update loops, local objectives, and training without autograd—rather than derivations.

The goal is to give practitioners a clear mental model of the backprop-free design space and concrete patterns for experimenting with these methods in real PyTorch training pipelines.
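As a concrete instance of one of the families named above, here is a minimal sketch of Direct Feedback Alignment with an explicit update loop and no autograd. To stay dependency-free it uses plain Python lists in place of tensors; the network size, the toy regression data, and the learning rate are assumptions chosen for illustration, not material from the talk.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


n_in, n_hidden, n_out = 2, 8, 1
W1 = rand_matrix(n_hidden, n_in)    # input -> hidden weights (trained)
W2 = rand_matrix(n_out, n_hidden)   # hidden -> output weights (trained)
B = rand_matrix(n_hidden, n_out)    # fixed random feedback matrix (never trained)

# Toy regression data: the target is x0 - x1 (illustrative only).
data = [([a, b], [a - b]) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]


def forward(x):
    h = [math.tanh(z) for z in matvec(W1, x)]
    return h, matvec(W2, h)


def mean_loss():
    return sum((forward(x)[1][0] - t[0]) ** 2 for x, t in data) / len(data)


initial_loss = mean_loss()
lr = 0.05
for _ in range(300):
    for x, t in data:
        h, y = forward(x)
        e = [yi - ti for yi, ti in zip(y, t)]  # output error
        # DFA: send the output error to the hidden layer through the fixed
        # matrix B, instead of through W2 transposed as backpropagation would.
        feedback = matvec(B, e)
        delta_h = [f * (1.0 - hi * hi) for f, hi in zip(feedback, h)]
        for i in range(n_out):
            for j in range(n_hidden):
                W2[i][j] -= lr * e[i] * h[j]
        for j in range(n_hidden):
            for k in range(n_in):
                W1[j][k] -= lr * delta_h[j] * x[k]

final_loss = mean_loss()
print(f"loss before: {initial_loss:.4f}  after: {final_loss:.4f}")
```

The only stored intermediate state is the hidden activation `h`, and the hidden-layer update never reads `W2`, which is what removes the weight-transport requirement.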
Speakers
Andrii Krutsylo

PhD Candidate, Institute of Computer Science, Polish Academy of Sciences
Andrii Krutsylo is a deep learning researcher focusing on continual learning and optimization dynamics. His work studies experience replay, gradient-free and local learning rules, and structured optimization for adaptive, resource-efficient systems.
Central Room

14:15 CEST

Lightning Talk: Inside vLLM's KV Offloading Connector: Async Memory Transfers for Higher Inference Throughput - Nicolò Lucchesi, Red Hat
Wednesday April 8, 2026 14:15 - 14:25 CEST
Every LLM request produces KV-cache state that is expensive to recompute. However, GPU memory is limited, and when it fills up, entries are discarded from the cache. A natural mitigation is to extend the KV cache into CPU DRAM, which is meaningfully larger than GPU memory.
vLLM 0.11.0 introduced the Offloading Connector, an asynchronous, pluggable API for KV-cache offloading that is bundled with a native CPU backend. This new feature uses GPU DMA to execute transfers concurrently with model computation on the GPU cores. The result is fast loading of KV data from DRAM and near-zero overhead from offloading. Getting here required rethinking vLLM's memory layout: the default per-layer KV fragmentation devastated transfer throughput. A new contiguous block layout, upstreamed in 0.12.0, increased effective block sizes by up to 125× and delivered an order-of-magnitude improvement in offloading performance.
We'll walk through the connector architecture, discuss memory transfer tradeoffs, the memory layout redesign, and practical guidance for enabling CPU offloading in production.
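The basic idea of spilling evicted KV blocks to a larger host tier, rather than discarding them, can be sketched in a few lines. This is a toy model only: it is synchronous, it does not use or imitate vLLM's connector API, and the class name, capacities, and string payloads are hypothetical.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """LRU 'GPU' tier that spills evicted blocks into a larger 'CPU' tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> KV data, ordered oldest to newest
        self.cpu = {}             # offloaded blocks

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_kv = self.gpu.popitem(last=False)
            self.cpu[old_id] = old_kv  # offload instead of discarding

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            kv = self.cpu.pop(block_id)
            self.put(block_id, kv)  # load back into the GPU tier
            return kv
        return None  # true miss: the caller would have to recompute


cache = TwoTierKVCache(gpu_capacity=2)
for i in range(3):
    cache.put(i, f"kv-{i}")   # block 0 is evicted to the CPU tier

restored = cache.get(0)       # served from the CPU tier, no recompute
print(restored, list(cache.gpu), list(cache.cpu))
```

A real implementation would overlap these copies with computation; the sketch only shows the bookkeeping that decides where a block lives.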
Speakers
Nicolò Lucchesi

Senior Machine Learning Engineer, Red Hat
Nicolò is a Senior Machine Learning Engineer at Red Hat with a background in Deep Learning and Computer Vision. He works on Inference Optimization for vLLM, where he is a maintainer.
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

14:30 CEST

Lightning Talk: Every Millisecond Counts: The Fine-tuning Journey of an Ultra-Efficient PyTorch Model for the Edge - Pavel Macenauer, NXP Semiconductors
Wednesday April 8, 2026 14:30 - 14:40 CEST
From smart cameras that protect privacy by analyzing video on-device, to wearables that interpret voice and motion instantly, to industrial sensors that prevent failures before they happen, edge AI is shaping our everyday routines and transforming our lives.

Eliminating cloud dependency and making connectivity optional is essential for keeping data local. Without the cloud, our options are bounded by the constraints of the devices, and efficiency drives innovation. Every millisecond and milliwatt can unlock a new use case, or limit one.

This talk will explore optimization techniques that allow vision, audio, and language models to run on tiny, resource-constrained devices, and show how to fine-tune those models to the limit of latency, accuracy, or power efficiency. We will start with an initial rapid simulation, and follow up with silicon-level tuning using real device profiling feedback.
Speakers
Pavel Macenauer

AI/ML R&D Software Lead, NXP Semiconductors
Pavel is a software lead at NXP Semiconductors, leading teams that develop tools and runtime libraries and enable AI on edge-class devices. Both professionally and out of curiosity, Pavel has developed software visualizing the world around us. Initially through the lens of a camera, then from...
Central Room
  Inference & Production

14:45 CEST

Lightning Talk: Full-Stack PyTorch Robotics VLA: From Data To Edge Via ExecuTorch/OpenVINO - Samet Akcay & Dmitriy Pastushenkov, Intel
Wednesday April 8, 2026 14:45 - 14:55 CEST
While research-centric tools have lowered the entry barrier for robotics data collection, transitioning Vision-Language-Action models to production remains challenging due to fragmented edge deployment paths. This session presents a unified, PyTorch-native workflow spanning the full robotics lifecycle, from data capture and curation to optimized edge execution. We introduce a modular Physical AI pipeline designed to resolve the disconnect between research scripts and real-time hardware. The talk details practical patterns for robotics data capture and policy training in a unified PyTorch ecosystem, followed by concrete steps to export models via ExecuTorch. Using an OpenVINO backend, Quantizer, and AOT compilation, we address latency, accuracy, and operator coverage gaps, and demonstrate efficient on-device VLA inference. Using a WidowX pick-and-sort task as a case study, we demonstrate how to validate latency and numerical tolerances under physical constraints. Attendees will leave with a reference architecture and a checklist for monitoring, safety gates, and managing dataset drift, providing a roadmap for moving robotics VLA from research to production-grade edge deployment.
Speakers
Dmitriy Pastushenkov

AI Software Product Manager, Intel
Dmitriy Pastushenkov is a passionate Software Product Manager at Intel with more than 20 years of comprehensive, international experience in industrial automation, the industrial Internet of Things (IIoT), real-time operating systems, and AI. Dmitriy has held various roles in...
Samet Akcay

Principal AI Engineer, Intel
Samet Akcay is a Principal AI Engineer at Intel who leads ML R&D efforts across Open Edge Platform libraries, including Intel Geti, Datumaro, Anomalib, Training Extensions, and Inference libraries. His research specializes in self-supervised learning and multi-modal object detection...
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

15:25 CEST

Beyond the Theory: What Actually Breaks When You Scale Your Disaggregated PyTorch Models - Ekin Karabulut & Ron Kahn, NVIDIA
Wednesday April 8, 2026 15:25 - 15:50 CEST
As inference demand explodes, new techniques to optimize these deployments have emerged. One such technique is disaggregated inference, which splits inference into differently optimized workloads (e.g. prefill and decode) on separate workers. The theory is straightforward: better GPU utilization, better inference performance, and tighter control over SLAs. The deployment in production is not.
Scaling happens at multiple connected levels. Adding prefill workers for a traffic spike? Those workers belong to a prefill leader and must scale as a unit. But your prefill-to-decode ratio matters too: scale prefill without matching decode capacity and you've moved the bottleneck. Placement also plays a role: place prefill and decode far apart in your network topology and KV-cache transfers will kill your latency. Standard autoscaling treats these as independent components. They're not.
In this talk, we'll share what we've learned running disaggregated vLLM and SGLang deployments on K8s: what broke, what worked, and how we're improving performance. We'll evaluate approaches from standard deployments to specialized APIs like LWS and Grove, and discuss how these integrate with frameworks like llm-d and Dynamo.
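The coupling between prefill groups and decode capacity described above can be made concrete with a small planning function. This is a hedged sketch, not the API of any of the named projects: `plan_replicas`, its parameters, and the example ratio are hypothetical.

```python
import math


def plan_replicas(prefill_groups: int, workers_per_group: int,
                  prefill_per_decode: float) -> dict:
    """Scale prefill in whole leader-plus-workers groups and size decode to match.

    prefill_per_decode is the assumed target ratio of prefill workers to
    decode workers (e.g. 2.0 means two prefill workers per decode worker).
    """
    prefill_workers = prefill_groups * workers_per_group
    # Round up so decode never becomes the new bottleneck.
    decode_workers = math.ceil(prefill_workers / prefill_per_decode)
    return {
        "prefill_leaders": prefill_groups,
        "prefill_workers": prefill_workers,
        "decode_workers": decode_workers,
    }


# Scaling from 2 to 3 prefill groups also changes the decode target.
before = plan_replicas(prefill_groups=2, workers_per_group=4, prefill_per_decode=2.0)
after = plan_replicas(prefill_groups=3, workers_per_group=4, prefill_per_decode=2.0)
print(before)
print(after)
```

An autoscaler that adjusted only `prefill_workers` would leave `decode_workers` at its old value, which is the failure mode the abstract describes.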
Speakers
Ekin Karabulut

AI/ML Developer Advocate, NVIDIA
Ekin is a Developer Advocate at NVIDIA, following the acquisition of Run:ai. Prior to that, she specialized in the privacy implications of federated learning systems with DNNs in distributed environments as a data scientist. Currently, she is exploring the efficient usage of large...
Ron Kahn

Senior Software Engineer, NVIDIA
Ron Kahn is a Senior Software Engineer in the NVIDIA Run:ai platform team. Ron works on the design and implementation of workload management systems that abstract Kubernetes complexity for AI practitioners. When not simplifying AI training jobs, Ron can be found cooking something...
Central Room
  Inference & Production
  • Audience Level Any
  • Slides Attached Yes

15:55 CEST

Lightning Talk: Why Logging Isn’t Enough: Making PyTorch Training Regressions Visible in Practice - Sahana Venkatesh, Wayve
Wednesday April 8, 2026 15:55 - 16:05 CEST
PyTorch teams often log rich training metrics, yet still discover training regressions late, after significant developer time and GPU budget have already been spent. In this talk, I’ll share a practical pattern we used to turn PyTorch training metrics into an operational guardrail for large-model training.

The approach combines scheduled short and long training runs, standardized performance and stability metrics (throughput, memory, loss, divergence), and simple statistical baselines to automatically surface regressions via alerts without hard gates or complex infrastructure.

I’ll focus on why logging alone is insufficient, how we chose what to monitor, and what tradeoffs we encountered (false positives, alert fatigue, baseline drift). The goal is not a tool demo, but a reusable pattern other PyTorch teams can adapt to catch training regressions earlier and make retraining more predictable.
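A simple statistical baseline of the kind mentioned above can be as small as a z-score check against recent runs. The sketch below is an assumption about what such a check might look like, not the speaker's implementation; `is_regression`, the threshold of 3 standard deviations, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def is_regression(baseline, new_value, z_threshold=3.0, higher_is_better=True):
    """Flag new_value if it is worse than the baseline by more than z_threshold sigmas."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: any move in the bad direction counts.
        return new_value < mu if higher_is_better else new_value > mu
    z = (new_value - mu) / sigma
    return z < -z_threshold if higher_is_better else z > z_threshold


# Throughput in samples/s from previous scheduled runs (made-up numbers).
throughput_history = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
print(is_regression(throughput_history, 940.0))   # large drop
print(is_regression(throughput_history, 998.0))   # within normal noise

# Peak memory in GB, where lower is better (made-up numbers).
memory_history = [40.0, 41.0, 39.0, 40.0]
print(is_regression(memory_history, 48.0, higher_is_better=False))
```

The threshold is the main lever for the false-positive and alert-fatigue trade-off, and refreshing `baseline` from recent runs is one way baseline drift enters the picture.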
Speakers
Sahana Venkatesh

Software Engineer, Wayve
Central Room
  Training Systems

16:10 CEST

Lightning Talk: Ball Tracking and Detection in Soccer Videos - Comparison of VLMs and Traditional Pipelines - Maciej Szymkowski, Future Processing
Wednesday April 8, 2026 16:10 - 16:20 CEST
Vision-Language Models (VLMs) now have many applications, but we cannot assume they are the most accurate and precise solution for every problem; their capabilities must be compared against other pipelines. In this presentation, we compare on-premise models (Qwen 3 and InternVL-3.5) and cloud-based solutions (Gemini 3 and GPT-5) with a traditional pipeline based on YOLOv11 and image processing techniques. The battlefield is ball detection and tracking in soccer match recordings downloaded from the SoccerNet database, captured from different angles, in varied lighting (e.g., sunny, night), and in varied weather (e.g., snowy, rainy). We used both broadcast videos and action and replay images, all annotated manually to prepare a ground-truth database. The models must not only recognize the ball but also track it through the whole sequence of images. To give equal chances, we fine-tuned YOLOv11 and provided additional knowledge to the VLMs in the form of a RAG pipeline. The comparison was made with traditional machine learning metrics such as accuracy, precision, and recall.
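For reference, the three metrics named above can be computed from per-frame detection counts as follows. The helper name and the counts are hypothetical and are not results from the talk.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, and recall from frame-level detection counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts: ball correctly found in 80 frames, 10 false alarms,
# 20 missed balls, 90 frames correctly reported as having no ball.
metrics = detection_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision penalizes false alarms and recall penalizes missed balls, so reporting both alongside accuracy matters when frames without a visible ball are common.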
Speakers
Maciej Szymkowski

AI Researcher and Senior Machine Learning Engineer, Future Processing
Maciej Szymkowski, PhD, is a Senior ML Engineer at Future Processing. Formerly Head of AI at Łukasiewicz PIT, his academic background spans BUT, WUT, and AGH. With 45+ publications, he specializes in Computer Vision (med/transport/sport), VLMs, and LLMs. His industry experience includes...
Central Room
  Applications & Case Studies

16:25 CEST

De-mystifying PyTorch for ASICs: When (and Why) To Move Your Development To AI Accelerators - Alpha Romer Coma, Kollab Philippines
Wednesday April 8, 2026 16:25 - 16:50 CEST
GPU availability and cost are squeezing ML teams, making ASICs like Google TPUs and AWS Trainium attractive alternatives. But does the software stack hold up? This session moves beyond the datasheets to provide a practical, code-first reality check on migrating PyTorch workloads to ASICs.

We will de-mystify the underlying compiler stacks, comparing PyTorch/XLA (TPU) and TorchNeuron (Trainium), and analyze the 'Compiler Tax' that often surprises developers. Through side-by-side code diffs and real-world benchmarks on fine-tuning Llama 4, Gemma 3, Qwen 3, and training CNNs and ViTs, we will answer:

1. The Code: How much rewriting is actually required?
2. The Performance: Which model architectures thrive on ASICs, and which ones fail due to dynamic shapes?
3. The Debugging: What happens when you hit an OOM or a compilation hang?

Attendees will leave with a clear 'Migration Decision Matrix' to determine if their specific workload is ready for the ASIC leap.
Speakers
Alpha Romer Coma

Associate Engineer, Cloud Development, Kollab Philippines
Alpha is an Associate Cloud Engineer at Kollab and a CS undergraduate at FEU Tech, Philippines. He specializes in multimodality with text, videos, and audio, and works on Accelerated Computing with Google TPUs and AWS Trainium.

For 5 months, he pushed Google Cloud TPU v4s to their limit to train vision-language models for use cases like internet brain rot recognition and detection of cognitively overloading content called sludge videos with 92% accuracy...
Central Room
 