Pandas has become one of the de facto libraries for manipulating tabular data in the Python ecosystem. In recent years, several projects have emerged, such as Dask, Modin, and Koalas, that aim to reproduce the Pandas API and so ease the learning curve for scaling data processing logic. When these libraries are coupled with ML orchestration tools like Flyte, machine learning practitioners can benefit from reproducibility and data lineage tracking while using the data processing tools they are familiar with.
However, as powerful as dataframes are, their data types and statistical properties can be difficult to reason about as data is reshaped from its raw form into one that’s ready for modeling. In this session, data science and machine learning practitioners will learn how to combine the rich type system and flexible DAG composition syntax of Flyte (an LF AI & Data incubating project) with Pandera’s intuitive schema-declaration API, so they can spend less time worrying about the correctness of their dataframes and more time obtaining insights and training models. This talk will first introduce Pandera, an open source package that provides an expressive data validation API, and then dive into a practical case study to illustrate the benefits of integrating Pandera with Flyte.
Niels is a machine learning engineer and core maintainer of Flyte, an open source ML orchestration tool, and the author and maintainer of Pandera, a data testing tool for dataframes.
He has a Master's in Public Health with a specialization in sociomedical science and public health informatics and, prior to that, a background in developmental biology and immunology.
His research interests include reinforcement learning, AutoML, creative machine learning, and fairness, accountability, and transparency in automated systems. He enjoys developing open source tools to make data science and machine learning practitioners more productive.