ML Feature Engineering is a two-part problem:
1. Generating Realtime Features for Serving
2. Generating Historically Realtime Features for Training
We will briefly describe how we address the first problem. The primary focus of this talk is the second: generating Historically Realtime Features for training.
One way to think of generating Historically Realtime Features is to:
1. Travel back in time to a particular state of the world, as represented by data in production systems,
2. Snapshot it, and
3. Compute aggregations over the snapshot.
This is a useful way to visualize the problem, but as an implementation it is intractable, especially in terms of compute and storage. We will introduce the algorithm in Zipline that makes it feasible to backfill features with Historically Realtime values. We will borrow a few concepts from Abstract Algebra and Category Theory, but everything will be introduced from first principles.
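To make the cost of the snapshot-based view concrete, here is a minimal Python sketch. It is not Zipline's algorithm, which this abstract does not describe. The event layout `(key, event_time, value)`, the query layout `(key, query_time)`, the choice of a sum aggregation, and the function names `naive_backfill` and `sweep_backfill` are all assumptions made for illustration. The first function follows the three steps literally and rebuilds a snapshot for every query. The second shows, for this simple case only, how an aggregation that can be updated incrementally lets a single time-ordered pass produce the same values.

```python
from collections import defaultdict
from typing import List, Tuple

Event = Tuple[str, int, float]  # hypothetical layout: (key, event_time, value)
Query = Tuple[str, int]         # hypothetical layout: (key, query_time)


def naive_backfill(events: List[Event], queries: List[Query]) -> List[float]:
    """Follow the three steps literally: go back in time, snapshot, aggregate."""
    results = []
    for key, query_time in queries:
        # Steps 1 and 2: rebuild the state of the world just before query_time.
        snapshot = [v for k, t, v in events if k == key and t < query_time]
        # Step 3: aggregate over the snapshot (a sum, chosen for illustration).
        results.append(sum(snapshot))
    return results


def sweep_backfill(events: List[Event], queries: List[Query]) -> List[float]:
    """Single pass over time-ordered events, keeping one running sum per key."""
    results = [0.0] * len(queries)
    running = defaultdict(float)
    ordered_events = sorted(events, key=lambda e: e[1])
    # Remember each query's original position so results keep the input order.
    ordered_queries = sorted(enumerate(queries), key=lambda q: q[1][1])
    i = 0
    for idx, (key, query_time) in ordered_queries:
        # Fold in every event that happened strictly before this query.
        while i < len(ordered_events) and ordered_events[i][1] < query_time:
            k, _, v = ordered_events[i]
            running[k] += v
            i += 1
        results[idx] = running[key]
    return results


events = [("u1", 1, 10.0), ("u1", 5, 20.0), ("u2", 3, 7.0)]
queries = [("u1", 4), ("u1", 6), ("u2", 2), ("u2", 10)]
naive = naive_backfill(events, queries)
swept = sweep_backfill(events, queries)
```

The naive version scans every event once per query, so its work grows with the product of the two input sizes, while the sweep sorts each input once and then touches each event a single time. Both versions treat a feature value as of a query time as the aggregate of events strictly before that time, which is also an assumption of this sketch.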
Nikhil Simha is a Software Engineer on the Machine Learning infrastructure team at Airbnb. He is currently working on Bighead (DSAA '19), an end-to-end machine learning platform. Prior to Airbnb, he built a scheduler (Turbine, ICDE '20) and a stream processing framework (RealTime Data @ FB, SIGMOD '16) at Facebook. He is interested in the intersection of compilers, machine learning, and realtime data processing systems. Nikhil received his Bachelor's degree in Computer Science from the Indian Institute of Technology, Bombay. When not working, he likes to boulder or play capoeira.