Name: Portable High‑Performance LLM Serving: A Triton Backend for vLLM - Burkhard Ringlein, IBM Research & Jan van Lunteren, IBM
Start: 2026-04-08T11:35:00+0200
End: 2026-04-08T12:00:00+0200

7-8 April, 2026
Paris, France
Note: The schedule is subject to change.



Portable High‑Performance LLM Serving: A Triton Backend for vLLM - Burkhard Ringlein, IBM Research & Jan van Lunteren, IBM

Wednesday April 8, 2026 11:35 - 12:00 CEST

Master Stage

Today, vLLM is the de facto industry standard for serving large language models (LLMs) and is widely adopted in production.

However, for most of its history, vLLM’s state-of-the-art performance depended largely on hand-written CUDA or HIP kernels. These kernels are typically optimized carefully for one specific GPU platform, which can be a serious obstacle to the portability of vLLM across different hardware.

Using Triton, we introduced a “Triton attention backend” to vLLM that delivers highly competitive performance across GPU platforms from a single code base, without any hand-written CUDA or HIP kernels. The Triton attention backend became the default for AMD GPUs and is also used in scenarios where other attention backends lack required features. In addition, this backend automatically selects appropriate specialized kernels based on model type or request length.
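
As a rough illustration of what selecting a kernel by request length can mean, the sketch below dispatches on the shape of a request. It is not the vLLM implementation: the `Request` type, the kernel names, and the `long_context_threshold` value are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical summary of one attention call (not a vLLM type)."""
    query_len: int    # number of new tokens to attend from
    context_len: int  # number of tokens already in the KV cache


def select_kernel(req: Request, long_context_threshold: int = 4096) -> str:
    """Pick a kernel variant name from the request shape.

    The three variant names and the threshold are assumptions for this
    sketch; a real backend may use different criteria entirely.
    """
    if req.query_len > 1:
        # Many query tokens: enough parallelism across the query dimension.
        return "prefill"
    if req.context_len >= long_context_threshold:
        # Single query token over a long context: split the KV sequence
        # across more program instances to keep the GPU occupied.
        return "decode_split_kv"
    # Single query token, short context.
    return "decode"
```

For example, `select_kernel(Request(query_len=1, context_len=8192))` returns `"decode_split_kv"` under the assumed threshold, while a 128-token prompt maps to `"prefill"`.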

In this talk, we will present our recent advances that consistently deliver high performance on both NVIDIA and AMD GPUs with a single Triton-only code base. We will cover the engineering and science behind this Triton-only backend, including system aspects, kernel improvements, and launch grid optimizations.

Speakers

Jan van Lunteren

Senior Research Scientist, IBM Research

Jan van Lunteren is a Senior Research Scientist at IBM Research Zurich holding MSc and PhD degrees in Electrical Engineering. His research has covered a broad range of topics, including high‑speed networking, near‑memory computing, and high‑performance machine‑learning inference...

Burkhard Ringlein

Research Staff Member, IBM Research

Dr. Burkhard Ringlein is a Research Staff Member in the AI Platform team of IBM Research, based in Zurich. He is an accomplished AI systems researcher and designs, builds, debugs, and optimizes practical systems for low-latency, high-throughput machine learning applications. Currently...



Inference & Production

Audience Level: Intermediate

PyTorch Conference Europe 2026
