7-8 April, 2026
Paris, France
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for PyTorch Conference Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is displayed in CEST (UTC/GMT +2).
Wednesday April 8, 2026 15:55 - 16:20 CEST


Large-scale distributed AI training is highly sensitive to infrastructure failures, where even a single node disruption can halt progress and waste substantial compute. This talk presents Nebius’s approach to fault-tolerant training, combining reliability metrics such as goodput, MTBF, and MTTR with automated infrastructure practices including health checks, workload isolation, node replacement, state recovery, and observability. Drawing on production cluster results, the presentation shows how these techniques reduce interruptions, accelerate recovery, and improve the stability and efficiency of long-running AI workloads.
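The reliability metrics named in the abstract can be illustrated with a minimal sketch. It assumes common textbook definitions (MTBF as uptime per failure, MTTR as downtime per failure, goodput as the fraction of wall-clock time that produced retained training progress); the talk may define them differently. The `Incident` type, the `reliability_metrics` function, and all numbers are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One hypothetical training interruption; durations are in hours."""
    downtime_h: float       # time from failure until training resumed
    lost_progress_h: float  # work redone because it postdated the last checkpoint


def reliability_metrics(wall_clock_h: float, incidents: list[Incident]) -> dict[str, float]:
    """Compute MTBF, MTTR and goodput under the assumed textbook definitions."""
    n = len(incidents)
    downtime = sum(i.downtime_h for i in incidents)
    lost = sum(i.lost_progress_h for i in incidents)
    uptime = wall_clock_h - downtime

    mtbf = uptime / n if n else float("inf")  # mean time between failures
    mttr = downtime / n if n else 0.0         # mean time to recovery
    goodput = (uptime - lost) / wall_clock_h  # share of time yielding kept progress
    return {"mtbf_h": mtbf, "mttr_h": mttr, "goodput": goodput}


if __name__ == "__main__":
    # Illustrative numbers only, not results from the talk.
    run = [Incident(downtime_h=0.5, lost_progress_h=1.0),
           Incident(downtime_h=1.5, lost_progress_h=2.0)]
    print(reliability_metrics(100.0, run))
```

Under these assumptions, faster node replacement lowers MTTR, while more frequent checkpointing reduces the lost progress per incident; both raise goodput.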
Speakers
Cyril Kondratenko

AI/ML Specialist Solutions Architect, Nebius
Maurits de Groot

AI/ML Specialist Solutions Architect, Nebius
Junior Stage
