December 2024 Top Ten (by Dagster Labs)

Written by Data Council | 19/12/24 22:10

Hey Data Council-ers!

I'm Pedram Navid, Chief Dashboard Officer at Dagster Labs, the modern data orchestrator for data engineers building data platforms. I'm excited to share some recent articles I've had my eye on these past few weeks.

📣 Reminder: Data Council 2025 early bird tickets are live, but they won't stay that way for long. Last year's conference sold out in record time, so don't just mark your calendar -- secure your spot today. The future of data & AI is calling!

01 / DUCKDB
Runtime-Extensible SQL Parsers Using PEG
This post by the folks at DuckDB discusses using modern Parsing Expression Grammar (PEG) parsers instead of the ancient YACC-style parsers used in most database systems today. They demonstrate how PEG parsers can be extended at runtime to add new SQL syntax, or even to support entirely different query syntaxes such as a dplyr-like one, and how they can provide better error messages.
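For a rough feel of what "extended at runtime" means, here is a minimal sketch in Python. It is not DuckDB's implementation: the `Grammar` class, the `token` and `seq` helpers, and the `statement` rule name are all hypothetical, and a real PEG parser adds memoization, error reporting, and much more. It only shows the core idea that a rule is an ordered list of alternatives, so new syntax can be registered while the program is running.

```python
import re


class Grammar:
    """Toy PEG-style grammar: each rule is an ordered list of alternatives."""

    def __init__(self):
        self.rules = {}

    def add_alternative(self, name, parser):
        # Rules live in a plain dict, so they can be extended at any time.
        self.rules.setdefault(name, []).append(parser)

    def parse(self, name, text, pos=0):
        # Ordered choice: the first alternative that matches wins.
        for alternative in self.rules.get(name, []):
            result = alternative(self, text, pos)
            if result is not None:
                return result
        return None


def token(pattern):
    """Match one regex token, skipping leading whitespace."""
    regex = re.compile(r"\s*(" + pattern + r")", re.IGNORECASE)

    def parser(grammar, text, pos):
        match = regex.match(text, pos)
        return (match.group(1), match.end()) if match else None

    return parser


def seq(*parts):
    """Match every part in order, or fail as a whole."""

    def parser(grammar, text, pos):
        values = []
        for part in parts:
            result = part(grammar, text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos

    return parser


def parse_all(grammar, rule, text):
    """Parse `text` with `rule`; return None unless the whole input is consumed."""
    result = grammar.parse(rule, text)
    if result is None:
        return None
    value, pos = result
    return value if text[pos:].strip() == "" else None


grammar = Grammar()
grammar.add_alternative(
    "statement",
    seq(token("SELECT"), token(r"\w+"), token("FROM"), token(r"\w+")),
)

# FROM-first syntax is not understood yet.
before = parse_all(grammar, "statement", "FROM t SELECT x")

# Extend the grammar at runtime with a new alternative, no rebuild needed.
grammar.add_alternative(
    "statement",
    seq(token("FROM"), token(r"\w+"), token("SELECT"), token(r"\w+")),
)
after = parse_all(grammar, "statement", "FROM t SELECT x")
```

Here `before` is `None` and `after` is the parsed token list. A generated YACC-style parser would typically need its grammar regenerated and recompiled to accept the new form.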

02 / HEX
How We Renovated Our Data Warehouse Without Interruption
This article describes how the data team at Hex used a blue-green deployment approach to renovate their data warehouse without interrupting production. It highlights the importance of setting achievable goals, working iteratively, and implementing a "least privileges" access strategy to improve data discoverability and usability for stakeholders.

03 / DATA DUEL
Optimizing SQL Queries for Speed with dbt
Christopher Arnold, Software Engineer at Yelp, shares a technical deep dive into how his team uses dbt with Redshift Spectrum to read data from their data lake into Redshift. The approach eliminates the forking of data flows, reduces runtime, resolves data quality issues, and improves developer productivity.

04 / MODAL
Fold Proteins with Chai-1
Who hasn’t wanted to fold a protein? This example by Modal is a detailed tutorial on using the Chai-1 protein structure prediction model on Modal's serverless infrastructure. It covers setting up dependencies, managing model weights, and running the inference in a scalable, cloud-based environment. If you're a bioinformatics enthusiast looking to harness the power of machine learning for protein folding, this is a must-read.

05 / SDF
Testing is Not Enough: Transforming Data Quality with Write, Audit, Publish Using SDF Build
SDF's new "sdf build" command brings the software engineering dream of "write, audit, publish" to data pipelines. Say goodbye to the days of broken dashboards and angry stakeholders: this Rust-powered tool stages new data, audits it, and publishes it to production only once the checks pass.
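If "write, audit, publish" is new to you, the general pattern can be sketched in a few lines of Python with the standard-library `sqlite3` module. This is an illustration of the pattern only, not how `sdf build` works internally; the `orders` and `orders_staging` tables, the `write_audit_publish` function, and the specific checks are all made up for the example.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage rows, audit them, and publish only if every check passes."""
    cur = conn.cursor()

    # Write: load the new data into a staging table, never into production.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: run data quality checks against the staged data.
    total = cur.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    null_ids = cur.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE id IS NULL"
    ).fetchone()[0]
    negatives = cur.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"
    ).fetchone()[0]
    if total == 0 or null_ids or negatives:
        # Audit failed: discard the staged rows, leave production untouched.
        conn.rollback()
        return False

    # Publish: swap the audited table into place.
    cur.execute("DROP TABLE IF EXISTS orders")
    cur.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
good_run = write_audit_publish(conn, [(1, 9.5), (2, 20.0)])
bad_run = write_audit_publish(conn, [(3, -1.0)])
published = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The first run passes the audit and is published. The second contains a negative amount, so it is rejected and the `orders` table still holds the two rows from the good run.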

🔊 New Zero Prime Podcast Episode 26: Balancing Open Source and Business with Spencer Kimball, Co-Founder & CEO of Cockroach Labs. Listen here!

06 / THE AIRBNB TECH BLOG
From Data to Insights: Segmenting Airbnb’s Supply
Airbnb has used a combination of availability rate, streakiness, and seasonality features to segment their host supply using machine learning. By applying this scalable segmentation model, they were able to gain deeper insights into host behavior patterns and tailor their products and operations accordingly.

07 / CANVA
How to Improve Search Without Looking at Queries
Canva's engineering team developed a synthetic dataset and evaluation pipeline to improve their private design search without accessing real user data, which is critical for preserving user privacy. They used large language models to generate realistic but entirely synthetic queries, relevant designs, and non-relevant designs, allowing them to rapidly test search improvements offline before deploying to production.

08 / DAGSTER LABS
The Rise of the Data Platform Engineer
The blog post discusses the rise of the Data Platform Engineer role, which involves building platforms, frameworks, and services to enable data consumers to build pipelines without relying on dedicated Data Engineers. As data tooling has improved, the focus is shifting from building custom ETL pipelines to creating scalable data platforms that empower other data roles.

09 / MENLO VENTURES
The State of Generative AI
The post discusses the state of generative AI adoption in the enterprise in 2024. It highlights key trends in enterprise generative AI spending, application use cases, and the evolution of the modern AI stack, including the rise of retrieval-augmented generation (RAG) and agentic architectures.

10 / AWS BLOG
How Amazon S3 Tables Use Compaction to Improve Query Performance by up to 3 Times
Everyone’s been talking about Amazon’s release of S3 Tables, so we had to share this post from their blog on how Amazon is using S3 Tables to improve query performance for storage-intensive workloads.

Thanks to Dagster Labs for curating this month’s top ten!
Team Data Council
