The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
This schedule is automatically displayed in CEST (UTC/GMT +2).
vLLM has emerged as a reference inference stack in the PyTorch ecosystem for high-throughput large language model serving. CPUs continue to play an important role in LLM inference, supporting cost-sensitive deployments, hybrid CPU/GPU serving, and batch or off-peak workloads on general-purpose infrastructure.
In this talk, we examine CPU-based LLM inference through the lens of PyTorch internals, using vLLM as a case study. We describe how vLLM interacts with PyTorch’s operator stack, including tensor layout management, backend dispatch, and threading, and we highlight common sources of overhead such as repeated weight repacking and suboptimal thread scheduling.
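To illustrate the repacking overhead mentioned above: GEMM microkernels prefer a blocked weight layout, and repacking on every forward call wastes time that a one-time pack at load avoids. The following is a hypothetical Python sketch of the idea; the blocked layout, sizes, and class names are illustrative assumptions, not vLLM’s actual implementation.

```python
def pack_blocked(weight, rows, cols, block=2):
    """Repack a row-major weight (flat list) into groups of `block`
    columns, the kind of layout a GEMM microkernel reads contiguously."""
    packed = []
    for c0 in range(0, cols, block):
        for r in range(rows):
            for c in range(c0, min(c0 + block, cols)):
                packed.append(weight[r * cols + c])
    return packed

class PackedLinear:
    """Pack once at load time instead of on every forward call."""
    def __init__(self, weight, rows, cols):
        self._packed = pack_blocked(weight, rows, cols)  # one-time cost

    def packed(self):
        return self._packed  # reused by every subsequent matmul call

# A 2x4 row-major weight gets reordered into column blocks of width 2.
w = [float(i) for i in range(8)]
layer = PackedLinear(w, rows=2, cols=4)
print(layer.packed())  # [0.0, 1.0, 4.0, 5.0, 2.0, 3.0, 6.0, 7.0]
```

Doing the reordering once per weight, rather than once per token batch, is the essence of the pre-packed layouts discussed in the talk.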
We present runtime- and kernel-level optimizations that reduce this overhead, including CPU paged-attention kernel tuning with vectorized softmax, specialized Q–K and P–V GEMM kernels aligned with vLLM’s scheduler, an ISA-aware BF16 attention path, pre-packed weight layouts for quantized matmul, SIMD vectorization using PyTorch’s at::vec::Vectorized primitives, and NUMA-aware scheduling for scalable parallel inference.
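The vectorized-softmax step computes, per attention row, the standard numerically stable softmax. A scalar Python sketch of the math the kernel implements (the actual kernel maps the same three passes onto at::vec::Vectorized SIMD lanes; this sketch is conceptual, not the kernel code):

```python
import math

def stable_softmax(row):
    """Numerically stable softmax: subtract the row max before exp so the
    exponentials cannot overflow in low-precision formats like BF16.
    Vectorized kernels perform the same three passes (max, exp + sum,
    normalize) across SIMD lanes instead of one element at a time."""
    m = max(row)                              # pass 1: row maximum
    exps = [math.exp(x - m) for x in row]     # pass 2: shifted exponentials
    s = sum(exps)                             #         and their sum
    return [e / s for e in exps]              # pass 3: normalize

probs = stable_softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))  # 1.0
```

Without the max subtraction, a row of large logits (e.g., 1000.0) would overflow `exp`; with it, the result is exact even in that case.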
We conclude with lessons learned from building and upstreaming a high-performance CPU inference engine.
Crefeda Rodrigues is a Staff Software Engineer at Arm, focusing on performance and scalability driven machine learning software optimization for Arm server CPUs. She previously worked on large-scale climate and weather model optimization as a postdoctoral researcher at the University...
Fadi is a Senior Machine Learning Engineer at Arm, working on optimizing PyTorch and vLLM for Arm Infrastructure cores. Prior to that, Fadi obtained a BSc in Artificial Intelligence from the University of Manchester.