Machine learning workflows are not linear: experimentation is an iterative back-and-forth process between different components. It often involves trying different data labeling techniques, data cleaning, preprocessing, and feature selection methods during model training, just to arrive at an accurate model.
Quality ML at scale is only possible when we can reproduce a specific iteration of an ML experiment, and this is where data is key: capturing the version of the training data, ML code, and model artifacts at each iteration is mandatory. However, to version ML experiments efficiently without duplicating code, data, and models, data versioning tools are critical. Open source tools like lakeFS make it possible to version all components of ML experiments without keeping multiple copies, and as an added benefit, save you storage costs as well.
In this talk, you will learn how to use a data versioning engine to intuitively and easily version your ML experiments and reproduce any specific iteration of an experiment.
The talk will include a live code demo.
Vino is a developer advocate at lakeFS, an open-source platform that delivers a git-like experience on object store based data lakes. She started as a software engineer at NetApp, working on data management applications for NetApp data centers back when on-prem data centers were still a cool thing. She then hopped into the cloud and big data world and landed on the data teams at Nike and Apple. There she worked mainly on batch processing workloads as a data engineer, built custom NLP models as an ML engineer, and even touched on MLOps for model deployments. Vino enjoys sharing her learnings and industry best practices through blogs, video tutorials, and tech talks. An avid public speaker and an ardent Toastmaster, she has presented at many data conferences and meetups. When she is not wrestling with data, you can find her doing yoga or strolling Golden Gate Park and Ocean Beach.