The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
This schedule is automatically displayed in CEST (UTC/GMT +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."
Distributed PyTorch training often looks healthy in system dashboards: GPU utilization is high and memory is stable, yet throughput degrades, steps jitter, or GPUs go idle intermittently. The core issue is misalignment: most telemetry is sampled by time, while training progresses by "steps", and distributed behavior is dominated by the slowest rank rather than by averages.
In this talk I will break down common failure modes in DDP training that standard metrics miss: rank stragglers, dataloader stalls, step-time variance, and memory spikes and creep. We will show how step-aligned, rank-aware aggregation changes debugging: per-step worst-rank vs. median-rank views, gating statistics to steps completed across all ranks, and tying time and memory back to training semantics without relying on heavyweight profilers.
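For illustration only, here is a minimal sketch of what step-aligned, rank-aware aggregation can look like with torch.distributed: each rank contributes its wall-clock time for the current step, and rank 0 reports the worst-rank and median-rank values. This is not the speaker's implementation; the helper name log_step_stats is hypothetical, and it assumes the default process group is already initialized (as in a typical DDP run) and that each rank has set its CUDA device.

```python
# Hypothetical sketch of step-aligned, rank-aware step-time aggregation.
# Assumes torch.distributed is initialized (e.g. torchrun + DDP) and, for the
# NCCL backend, that torch.cuda.set_device(local_rank) has been called.
import time
import torch
import torch.distributed as dist

def log_step_stats(step: int, step_start: float) -> None:
    """Gather this rank's wall-clock step time and report worst vs. median on rank 0."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    step_time = torch.tensor([time.perf_counter() - step_start],
                             dtype=torch.float64, device=device)
    gathered = [torch.zeros_like(step_time) for _ in range(dist.get_world_size())]
    # all_gather also acts as a synchronization point, so statistics are only
    # emitted for steps that every rank has actually completed.
    dist.all_gather(gathered, step_time)
    if dist.get_rank() == 0:
        times = torch.cat(gathered)
        print(f"step {step}: worst-rank {times.max():.3f}s, "
              f"median-rank {times.median():.3f}s, "
              f"straggler rank {int(times.argmax())}")
```

Called at the end of each training iteration, a helper like this keys the telemetry to step boundaries rather than wall-clock sampling intervals, which is what exposes per-step stragglers that rank-averaged, time-sampled dashboards smooth away.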
ML researcher with a PhD in Computer Science. Industry experience at IBM Research, Huawei Research, and Zalando. Currently building TraceML: an open source tool that shows you the step-level breakdown of your PyTorch training run while it's still running. I am particularly interested in...