Deploying massive Mixture-of-Experts (MoE) architectures such as DeepSeek-V3/R1 requires a co-designed approach that leverages NVIDIA Blackwell's fifth-generation Tensor Cores. This session details the transition to NVFP4 precision for MoE weights to significantly reduce memory load, coupled with FP4/FP8 KV caching to shrink the attention layers' memory footprint and enable higher concurrency.

We will analyze the architectural shift to Expert Parallelism (EP) for the expert layers, which maximizes achieved FLOPS, and to Attention Data Parallelism (ADP) for the attention layers, which avoids redundant KV-cache replication and converts Multi-Head Latent Attention (MLA) into Multi-Query Attention (MQA) via weight absorption.

The talk will demonstrate advanced execution strategies, including DualPipe algorithms that overlap dispatch/combine communication with computation, and the integration of DeepGEMM and FlashInfer kernels. Finally, we will cover runtime optimizations using Programmatic Dependent Launch (PDL) and CUDA Graphs to minimize host-side launch latency, alongside Multi-Token Prediction (MTP) for accelerated speculative decoding.
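To make the KV-cache quantization idea concrete, here is a minimal sketch (not the speakers' implementation) of quantizing a bf16 KV tensor to FP8 (e4m3) with one symmetric scale per attention head, using only stock PyTorch. The tensor layout and the helper names quantize_kv_fp8 / dequantize_kv_fp8 are illustrative assumptions; production stacks would instead feed the FP8 cache directly into fused attention kernels.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_kv_fp8(kv: torch.Tensor):
    """Quantize a KV-cache slab [batch, heads, seq, head_dim] to FP8 with per-head scales."""
    # One symmetric scale per head: amax taken over batch, sequence and head_dim.
    amax = kv.abs().amax(dim=(0, 2, 3), keepdim=True).float().clamp_min(1e-12)
    scale = FP8_MAX / amax
    kv_fp8 = (kv.float() * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return kv_fp8, scale  # keep the fp32 scale to undo the quantization before attention

def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    return (kv_fp8.float() / scale).to(dtype)

if __name__ == "__main__":
    kv = torch.randn(2, 8, 1024, 128, dtype=torch.bfloat16)  # toy KV cache
    kv_fp8, scale = quantize_kv_fp8(kv)
    err = (kv.float() - dequantize_kv_fp8(kv_fp8, scale).float()).abs().max().item()
    print(f"FP8 KV element size: {kv_fp8.element_size()} byte, max abs error: {err:.4f}")
```

Likewise, the host-latency point can be illustrated with PyTorch's stock CUDA Graph API: capture one static-shape decode step once, then replay it so each subsequent token costs a single graph launch instead of many kernel launches. The decode_step signature, buffer layout, and warm-up count below are assumptions for the sketch, and it requires a CUDA device and a model with fixed decode shapes.

```python
import torch

@torch.inference_mode()
def decode_step(model, tokens):
    # Hypothetical single-token decode forward pass with static shapes.
    return model(tokens)

def build_decode_graph(model, batch_size, device="cuda"):
    """Capture one decode step into a CUDA graph; replays skip per-kernel launch overhead."""
    static_tokens = torch.zeros(batch_size, 1, dtype=torch.long, device=device)

    # Warm up on a side stream before capture, as the CUDA graph capture rules require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_logits = decode_step(model, static_tokens)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_logits = decode_step(model, static_tokens)

    def run(tokens):
        static_tokens.copy_(tokens)  # write new inputs into the captured input buffer
        graph.replay()               # re-execute the whole captured step with one launch
        return static_logits         # output buffer is refilled in place by the replay
    return run
```

These sketches only show the data flow; in a real serving stack the quantized cache and the captured decode step would be handled by fused kernels and the serving framework's own graph machinery.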