In this talk, we discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. We will introduce Amundsen, an open source data discovery platform from Lyft.
Amundsen is built on 3 key pillars:
1. Augmented Data Graph: Amundsen uses a graph database under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What's unique to Amundsen is that we treat people as a first-class data asset: there is a graph node for each person in the organization, and it connects to other nodes (such as tables and dashboards).
2. Intuitive User Experience: Amundsen runs PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
3. Centralized Metadata: Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Deciding where all this metadata should ultimately be stored is still a work in progress.
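To make the second pillar concrete, here is a minimal sketch of PageRank by power iteration over a graph derived from access logs. It is illustrative only and is not Amundsen's implementation: the log format, the `user:`/`table:` node names, the choice of one weighted user-to-table edge per access, and the helper names `build_graph` and `pagerank` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical access-log records: (who queried, what was queried).
ACCESS_LOG = [
    ("user:alice", "table:rides"),
    ("user:alice", "table:rides"),
    ("user:bob", "table:rides"),
    ("user:bob", "table:drivers"),
    ("user:carol", "table:rides"),
]


def build_graph(access_log):
    """Turn access records into weighted user -> table edges."""
    edges = defaultdict(lambda: defaultdict(float))
    nodes = set()
    for user, table in access_log:
        edges[user][table] += 1.0  # edge weight = number of accesses
        nodes.update((user, table))
    return nodes, edges


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank with uniform teleportation."""
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges (here, tables) spread their
        # rank uniformly so the total rank stays equal to 1.
        dangling = sum(rank[node] for node in nodes if not edges.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in edges.items():
            total = sum(targets.values())
            for dst, weight in targets.items():
                new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return rank


nodes, edges = build_graph(ACCESS_LOG)
ranks = pagerank(nodes, edges)

# Rank only the tables, most "important" first, as a search backend might.
tables = sorted(
    (node for node in nodes if node.startswith("table:")),
    key=lambda node: ranks[node],
    reverse=True,
)
print(tables)
```

In this toy log the more frequently accessed table ends up with the higher score, which is the signal a search ranker would use to order results.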
We will take a deep dive into Amundsen's architecture and discuss how it delivers on these 3 design pillars. We will close with the project's roadmap, the problems that remain unsolved, and how we can work together to solve them.
Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a major contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, focusing on big data and service-oriented architecture.
Tao Feng is a software engineer on the data platform team at Lyft, working on various data products. Tao is a committer and PMC member of Apache Airflow. Previously, Tao worked at LinkedIn and Oracle on data infrastructure, tooling, and performance.