As workflows increase in complexity, companies have come to depend on Airflow to manage inter-DAG dependencies. Airflow has quickly become an important component of the Modern Data Stack, powering analytical reports, business metrics, and dashboards.
But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption were delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures? How can we use data lineage to achieve the data observability we need to answer these questions?
This talk introduces OpenLineage, an open standard for collecting lineage metadata from jobs as they execute, and shows how it works with Airflow. The presentation walks through a practical example using Marquez, the reference implementation of OpenLineage, and explains how OpenLineage can help data teams maintain inter-DAG dependencies within their Airflow instance, capture metadata on historical DAG runs, and minimize data quality issues.
Julien Le Dem is a Principal Engineer at Datadog, serves as an officer of the ASF, and is a member of the LF AI & Data Technical Advisory Council. He co-created the Parquet, Arrow, and OpenLineage open source projects and is involved in several others. His leadership career in data platforms began at Yahoo!, where he received his Hadoop initiation, and continued at Twitter, Dremio, and WeWork. He then co-founded Datakin (acquired by Astronomer) to solve Data Observability. His French accent makes his talks particularly attractive.
Willy Lulciuc is the Founding Engineer of Datakin. He makes datasets discoverable and meaningful with metadata. He co-created Marquez and is now involved in the OpenLineage initiative. Previously, he worked on the Project Marquez team at WeWork. When he’s not reviewing code and creating indirections, he can be found experimenting with analog synthesizers.