It is well known that data quality and quantity are crucial for building Machine Learning models, especially when dealing with Deep Learning and Neural Networks.
But besides the data required to build the model itself, there is another, often overlooked, type of data required to build a production-grade Machine Learning platform: metadata.
Modern Machine Learning platforms contain a number of different components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature stores, and many more. Most of these components have associated metadata including versioned datasets, versioned Jupyter Notebooks, training parameters, test/training accuracy of a trained model, versioned features, and statistics from model serving.
For the DataOps team managing such production platforms, it is critical to have a common view across all this metadata, because we need to answer questions such as: Which Jupyter Notebook was used to build Model XYZ, currently running in production? If new data arrives for a given dataset, which models currently serving in production have to be updated?
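Both questions are lineage queries over a graph of artifacts. As a rough illustration only, here is a minimal in-memory sketch in Python. The `LineageGraph` class, its method names, and the artifact names are hypothetical and invented for this example; they are not the MLMD API, the proposed Metadata API, or an ArangoDB interface.

```python
from collections import defaultdict


class LineageGraph:
    """Toy metadata graph (hypothetical): nodes are artifacts, edges go from input to output."""

    def __init__(self):
        self.kind = {}                      # artifact name -> kind ("dataset", "notebook", "model", ...)
        self.downstream = defaultdict(set)  # input artifact -> artifacts built from it
        self.upstream = defaultdict(set)    # output artifact -> artifacts it was built from

    def add_artifact(self, name, kind):
        self.kind[name] = kind

    def add_edge(self, source, target):
        """Record that `target` was produced using `source`."""
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def _reachable(self, start, edges):
        # Iterative depth-first traversal; returns every node reachable from `start`.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors_of_kind(self, name, kind):
        """Everything of the given kind that `name` was (transitively) built from."""
        return sorted(n for n in self._reachable(name, self.upstream)
                      if self.kind.get(n) == kind)

    def descendants_of_kind(self, name, kind):
        """Everything of the given kind that was (transitively) built from `name`."""
        return sorted(n for n in self._reachable(name, self.downstream)
                      if self.kind.get(n) == kind)


# Hypothetical artifacts, for illustration only.
graph = LineageGraph()
graph.add_artifact("dataset_v2", "dataset")
graph.add_artifact("train.ipynb", "notebook")
graph.add_artifact("other.ipynb", "notebook")
graph.add_artifact("model_xyz", "model")
graph.add_artifact("model_abc", "model")
graph.add_edge("dataset_v2", "model_xyz")
graph.add_edge("train.ipynb", "model_xyz")
graph.add_edge("dataset_v2", "model_abc")
graph.add_edge("other.ipynb", "model_abc")

# Which notebook was used to build model_xyz?
print(graph.ancestors_of_kind("model_xyz", "notebook"))
# Which models depend on dataset_v2 and may need retraining?
print(graph.descendants_of_kind("dataset_v2", "model"))
```

A real platform would also record serving status, versions, and training parameters on each node, and would keep the graph in a shared store rather than in process memory, so that every component sees the same view.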
In this talk, we look at existing implementations, in particular ML Metadata (MLMD), which is part of the TensorFlow ecosystem. Further, we propose a first draft of an MLMD-compatible universal Metadata API. We demo the first implementation of this API using ArangoDB.
Jörg Schad is Head of Engineering and Machine Learning at ArangoDB. In a previous life, he worked on or built machine learning pipelines in healthcare, distributed systems at Mesosphere, and in-memory databases. He received his Ph.D. for research around distributed databases and data analytics. He’s a frequent speaker at meetups, international conferences, and lecture halls.