7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Wednesday, April 8
 

10:05 CEST

Birds of a Feather: Disaggregated Tokenization: Building Toward Tokens-In-Tokens-Out LLM Inference - Maroon Ayoub, IBM Research; Hang Yin & Xi Ning Wang, Alibaba Cloud; Nili Guy, IBM; Hyunkyun Moon, Moreh
Wednesday April 8, 2026 10:05 - 10:35 CEST
LLMs are token-in, token-out - but our serving stacks aren't. Tokenization and preprocessing are still locked inside the inference engine, blocking the cache-aware routing and encode/prefill/decode (E/P/D) disaggregation that production deployments demand. To route smart, you need tokens before you reach the backend - and with multi-modal inputs requiring heavy encode-stage preprocessing, this is an architectural imperative, not just an optimization.

In llm-d, we learned this the hard way: three tokenization approaches, three gaps. We're now converging on disaggregated tokenization via vLLM's Renderer API as a gRPC sidecar, and collaborating with the Gateway API Inference Extension community to define the tokens-in-tokens-out interface. For multi-modal workloads, disaggregating preprocessing unlocks independent scaling of encode, prefill, and decode - each with different compute profiles.

Join us to discuss: How should we standardize tokenization and multi-modal preprocessing outside the engine? How does this shape E/P/D disaggregation? What are your pain points? We'll frame the problem from scheduling, vLLM, and gateway perspectives - then open the floor.
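To make the routing argument concrete, here is a minimal sketch of the tokens-in-tokens-out idea: tokenization happens in a sidecar before routing, so the gateway can hash a token prefix and send requests that share a prefix (and thus reusable KV cache) to the same backend. This is an illustrative toy, not llm-d's or vLLM's actual implementation; the whitespace tokenizer and all names here stand in for the real Renderer-based sidecar.

```python
import hashlib

# Toy stand-in for tokenization moved out of the engine into a sidecar.
# A real deployment would use the model's actual tokenizer (e.g. via
# vLLM's Renderer API behind gRPC); names here are illustrative only.
class TokenizerSidecar:
    def __init__(self):
        self.vocab = {}

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.vocab)
            ids.append(self.vocab[word])
        return ids

def route(token_ids, backends, prefix_len=4):
    # Cache-aware routing: hash the token-ID prefix so requests sharing
    # a prefix (and hence reusable prefill KV cache) hit the same backend.
    prefix = ",".join(map(str, token_ids[:prefix_len])).encode()
    h = int.from_bytes(hashlib.sha256(prefix).digest()[:8], "big")
    return backends[h % len(backends)]

sidecar = TokenizerSidecar()
backends = ["decode-0", "decode-1", "decode-2"]

a = sidecar.encode("you are a helpful assistant summarize this doc")
b = sidecar.encode("you are a helpful assistant translate this doc")
# Same 4-token prefix, so both land on the same backend and the
# prefill KV cache for the shared prefix can be reused.
assert route(a, backends) == route(b, backends)
```

The point of the sketch is the ordering: the gateway only gets to make this decision because token IDs exist before the request reaches any backend.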
Speakers

Xi Ning Wang

Senior Technical Expert, Alibaba Cloud
Xi Ning Wang is a Senior Technical Expert at Alibaba Cloud, focusing on MaaS/LLM, Kubernetes, service mesh, and other advanced cloud native technical strategies. He previously worked at IBM as a technical architect focusing on SOA/Cloud and served as chairman of the Patent Technology Review…

Hang Yin

Senior R&D Engineer, Alibaba Cloud
Hang Yin is a Senior R&D Engineer at Alibaba Cloud, focusing on Kubernetes, service mesh, the Gateway API Inference Extension, and other cloud native fields. He currently serves on the Alibaba Cloud Container Service for Kubernetes (ACK) team, responsible for developing the ACK Gateway with Inference…

Maroon Ayoub

Research Scientist & Architect, IBM Research
Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.

Nili Guy

Research Manager & Senior Technical Staff Member, IBM Research
Nili is a Research Manager and Senior Technical Staff Member at IBM Research, co-creator of llm-d, and an expert in distributed inference and Kubernetes-native AI systems. She has led key open-source and productized inference initiatives across IBM’s AI platforms.

Hyunkyun Moon

MLOps Engineer, Moreh
Hyunkyun Moon is an ML Platform Engineer at Moreh, focusing on building high-performance LLM inference platforms with llm-d. He is an active contributor to open-source projects, including llm-d and vLLM. With a strong background in large-scale Kubernetes-native infrastructure, he…
Open Platform

10:35 CEST

How To Write C++ Extensions in 2026 - Jane Xu, Meta & Mikayla Gawarecki, Meta
Wednesday April 8, 2026 10:35 - 11:00 CEST
Are you writing a C++ custom op extension to PyTorch? It's 2026: are you still shipping M x N wheels for M CPython versions and N libtorch versions? Did you know you can just ship 1 wheel that works across multiple CPythons and libtorches? If you're curious how, attend this talk to get the deets on py_limited_api, APIs like torch::stable::Tensor & TORCH_TARGET_VERSION, and generally the latest and greatest ways to keep your code and your release matrix simple. Get your custom kernel enrolled in new features, with benefits proven out in FA3, xformers, torchao, torchaudio, and more in progress! We'll also share some of our vision for smoother and faster custom op extensions.
Speakers
avatar for Jane Xu

Jane Xu

PyTorch SWE, Meta
Hi, I'm Jane! Please don't hesitate to come talk to me about your favorite optimizer, fitting models in GPU memory, how to free C++ extensions from libtorch version constraints, and anything that interests you.
avatar for Mikayla Gawarecki

Mikayla Gawarecki

Software Engineer, Meta
Software Engineer on PyTorch
Founders Cafe
  Frameworks & Compilers

11:20 CEST

Lightning Talk: Building AI That Ops Teams Actually Trust - Robert King, Chronosphere / Palo Alto Networks
Wednesday April 8, 2026 11:20 - 11:30 CEST
You've built an AI that identifies root causes of incidents faster than any human could... but there's one problem: no one trusts it.

Ops teams are skeptical by nature. They've been burned by noisy alerts, black-box tools, and "intelligent" systems that weren't.
This talk covers what we learned building AI for incident response across enterprise environments: why technically correct recommendations get ignored, and how to design for skepticism from day one.

I'll share specific patterns that moved the needle:

- Validating agent responses before they reach users, catching hallucinations, weak reasoning, and overconfident outputs
- Explainability that fits the operator's mental model, not the data scientist's
- Feedback loops that improve the AI and build user trust simultaneously
- Rollout strategies that let teams build confidence gradually
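The first pattern above, validating agent responses before they reach users, can be sketched roughly as follows. Everything here is an illustrative assumption (field names, thresholds, the keyword list), not the speaker's actual implementation:

```python
from dataclasses import dataclass, field

# Hypothetical shape of an incident-response agent's output.
@dataclass
class AgentResponse:
    root_cause: str
    confidence: float                 # model-reported score in [0, 1]
    evidence: list = field(default_factory=list)  # e.g. log/metric refs

# Wording that tends to read as overconfident to skeptical operators.
OVERCONFIDENT = ("definitely", "certainly", "guaranteed")

def validate(resp: AgentResponse) -> tuple[bool, list]:
    """Gate a response before it is surfaced; return (ok, problems)."""
    problems = []
    if not resp.evidence:
        problems.append("no supporting evidence cited (possible hallucination)")
    if resp.confidence < 0.6:
        problems.append(f"confidence {resp.confidence:.2f} below threshold")
    if any(w in resp.root_cause.lower() for w in OVERCONFIDENT):
        problems.append("overconfident wording; rephrase before surfacing")
    return (not problems, problems)

ok, why = validate(AgentResponse("Definitely a DNS outage", 0.9, evidence=[]))
# ok is False: no evidence cited, and the wording is overconfident.
```

The design choice worth noting is that the gate reports *why* it blocked a response, which feeds the feedback-loop and explainability patterns rather than silently dropping output.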

Whether you're using LLMs, agents, or traditional ML for operational tasks, the trust problem is the same. Ship something wrong during an incident and you've lost your users for months.

You'll leave with a practical framework for validating AI outputs and building the kind of trust that gets recommendations acted on.
Speakers
avatar for Robert King

Robert King

Senior Sales Engineer, Chronosphere
Robert is Lead Enterprise Solutions Engineer at Chronosphere and an OpenTelemetry contributor. He recently presented on AI Observability with OpenTelemetry at Cloud Native London (https://www.youtube.com/live/qF4wz-pha1w?si=PFzjNcGkbD4pFKnA&t=625) and has spoken at AWS Summit and other…
Junior Stage
  Inference & Production
 