Technical Talks

Ethan Rosenthal
Member of Technical Staff | Runway

Building a Data Foundation for Multimodal Foundation Models

Video will be populated after the conference

ABOUT THE TALK
  • Foundation Models

While it is often easier to simply throw more data at a problem, scale is not all you need when building multimodal foundation models. Data quality continues to be just as important as data quantity, and supporting “data-centric AI” requires lowering the barrier to data curation as much as possible. However, multimodal data curation presents unique requirements compared to conventional machine learning or business intelligence data management systems. The data is heterogeneous, ranging from scalars to embedding arrays to entire compressed videos. While the dataset sizes in terms of number of rows are not quite Big Data™, the number of bytes is massive, with high variance in size across columns. Given the storage size, it’s infeasible to construct and copy new training datasets for each model training job; training jobs must query the core datasets without copying them. Finally, large-scale distributed training jobs require fast random access, which bumps up against the limitations of typical solutions like partitioned Parquet files. In this talk, I will discuss how we built a petabyte-scale, multimodal feature lakehouse. This lakehouse supports analytical querying as well as serving features for large-scale distributed training jobs, such as those used to train Runway’s recent foundation models like Gen-3 Alpha.

Ethan Rosenthal

Member of Technical Staff

Runway

Ethan Rosenthal is a Member of Technical Staff at Runway, an applied AI research company focused on multimedia content creation, where he builds engineering systems to accelerate the work of research scientists. His career spans diverse roles across AI, machine learning, and data science, from training language models at Square to developing recommendation systems at seed-stage ecommerce startups. Before working in tech, Ethan was an actual scientist and got his PhD in experimental physics from Columbia University.