Name: Optimizing CPU LLM Inference in PyTorch: Lessons From VLLM - Crefeda Rodrigues, Arm Limited & Fadi Arafeh, Arm
Start: 2026-04-08T13:30:00+0200
End: 2026-04-08T13:55:00+0200

7-8 April, 2025
Paris, France
View More Details & Registration
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."

Optimizing CPU LLM Inference in PyTorch: Lessons From VLLM - Crefeda Rodrigues, Arm Limited & Fadi Arafeh, Arm

Wednesday April 8, 2026 13:30 - 13:55 CEST

Founders Cafe

vLLM has emerged as a reference inference stack in the PyTorch ecosystem for high-throughput large language model serving. CPUs continue to play an important role in LLM inference, supporting cost-sensitive deployments, hybrid CPU/GPU serving, and batch or off-peak workloads on general-purpose infrastructure.

In this talk, we examine CPU-based LLM inference through the lens of PyTorch internals, using vLLM as a case study. We describe how vLLM interacts with PyTorch’s operator stack, including tensor layout management, backend dispatch, and threading behaviour, and highlight common sources of overhead such as repeated weight repacking and poor threading behaviour.

We present runtime and kernel-level optimizations that reduce overhead including CPU paged-attention kernel tuning with vectorized softmax, specialized Q–K and P–V GEMM kernels aligned with vLLM’s scheduler, an ISA-aware BF16 attention, pre-packed weight layouts for quantized matmul, SIMD vectorization using PyTorch’s at::vec::Vectorized primitives, and NUMA-aware scheduling for scalable parallel inference.

Finally, we conclude with lessons learned from building and upstreaming a high-performance CPU inference engine.

Speakers

Crefeda Rodrigues

Staff Software Engineer, Arm

Crefeda Rodrigues is a Staff Software Engineer at Arm, focusing on performance and scalability driven machine learning software optimization for Arm server CPUs. She previously worked on large-scale climate and weather model optimization as a postdoctoral researcher at the University... Read More →

Fadi Arafeh

Senior Machine Learning Engineer, Arm

Fadi is a Senior Machine Learning Engineer at Arm, working on optimizing PyTorch and vLLM for Arm Infrastructure cores. Prior to that, Fadi obtained a BSc in Artificial Intelligence from the University of Manchester.

Optimizing LLM Inference on CPU Crefeda Fadi pdf

Wednesday April 8, 2026 13:30 - 13:55 CEST
Founders Cafe

Inference & Production

Audience Level Intermediate
Slides Attached Yes

PyTorch Conference Europe 2026

Crefeda Rodrigues

Fadi Arafeh

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Get help with the event