Technical Talks

Nikita Vemuri
Software Engineer | Anyscale

From Scaling to Observability: Solving Key Challenges for Distributed ML with Ray

Video will be populated after the conference

ABOUT THE TALK
  • MLOps & Platforms

As machine learning workloads grow increasingly complex, distributed training across thousands of nodes presents significant challenges. This talk explores how the Ray library ecosystem tackles critical issues in multi-node ML training, focusing on development, orchestration, and comprehensive observability. Attendees will learn about innovative solutions for tracking system data, managing potential failure points, and implementing robust observability workflows that persist critical information.

Nikita Vemuri is a software engineer at Anyscale, where she focuses on developing observability features across Ray and the Anyscale platform to help developers debug and monitor their large-scale AI workloads. She joined Anyscale as one of the early engineers and has contributed to multiple initiatives across the platform stack over the last four years. She earned both her bachelor's and master's degrees in Electrical Engineering and Computer Science at UC Berkeley, where she conducted research at Berkeley’s RISE Lab.