A data lake is primarily two things: an object store and the objects stored in it. Even in its most basic setup, a data lake can support BI, Machine Learning, and operational analytics use cases. This versatility speaks to the strength of object stores, particularly how readily they integrate with a diverse set of data processing engines.
As data lakes exploded in adoption, a number of improvements were made to the early architectures. The first and most obvious improvement was to file formats, which led to the development of analytics-optimized formats like Parquet, and eventually Modern Table Formats like Delta Lake. An even newer improvement has been the emergence of Data Source Control tools like lakeFS, which bring new levels of manageability across an entire lake! In this talk, we’ll cover how to incorporate these technologies into your data lake, and how they simplify workflows critical to ML experimentation, deployment of datasets, and more!
Paul is a developer advocate for the lakeFS project, after several years on the analytics team at Equinox Fitness. His goal is to democratize big data analytics by explaining data architectures that are both user-friendly and cost-effective. He's spoken at various conferences and meetups, including the Postgres Conference NYC and AWS re:Invent. When he's not working, you can find him drinking tea and playing golf.