Dealing with diverse and constantly evolving data sources is a challenge. Now let’s add security and compliance requirements, as well as different levels of sensitivity in your data model. Finally, mix in a pipeline that must remain easy to iterate on, support backfills of historical data, and keep backups. It can quickly become a nightmare.
Here is how a small team of data engineers tackled these challenges while maintaining a modular, scalable, and observable Internal Analytics platform: a tiered “Transform” step in a pipeline where Luigi is the maestro and Spark the hard worker.
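To make that division of labor concrete, here is a minimal sketch of what one such Transform task could look like. The task name, script path, and bucket are hypothetical; `SparkSubmitTask` and `S3Target` come from Luigi’s contrib packages, and the source does not show the team’s actual task code.

```python
import luigi
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import SparkSubmitTask


class TransformEvents(SparkSubmitTask):
    """One tier of the "Transform" step: Luigi handles scheduling and
    dependency tracking, Spark does the heavy lifting, and the result
    lands as Parquet in S3."""

    date = luigi.DateParameter()

    # spark-submit-able script that performs the actual transform
    # (a placeholder name for illustration).
    app = "transform_events.py"

    @property
    def _out_dir(self):
        # Hypothetical bucket/layout, partitioned by date.
        return "s3://example-bucket/analytics/events/dt={}".format(self.date)

    def app_options(self):
        # Arguments forwarded to the Spark job: which partition to
        # build and where to write it.
        return [str(self.date), self._out_dir]

    def output(self):
        # Spark writes a _SUCCESS marker when the job completes; Luigi
        # skips the task when this target already exists, which keeps
        # reruns and historical backfills idempotent.
        return S3Target(self._out_dir + "/_SUCCESS")
```

Because each run is parameterized by date and gated on its output target, backfilling a historical range amounts to scheduling the same task over many dates and letting Luigi skip the partitions that already exist.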
Jean-Mathieu is a Data Engineer at Datadog. As the first member of the Internal Analytics team, he designed and built data pipelines from scratch using Luigi, Spark, and Parquet files stored in S3. As the team grew, he worked on making the pipelines highly fault-tolerant, reusable, and observable. He also deployed a scalable self-serve analytics platform throughout the organization for easy, accurate, and appropriate access to all internal data.