Technical Talks

The Future of Column-Oriented Data Processing with Arrow and Parquet

Julien Le Dem | Principal Engineer | Datadog

In pursuit of speed and efficiency, big data processing is continuing its logical evolution toward columnar execution. Julien Le Dem offers a glimpse into the future of column-oriented data processing with Arrow and Parquet.

A number of key big data technologies already have, or will soon have, in-memory columnar capabilities; these include Kudu, Ibis, Drill, and many others. Modern CPUs achieve higher throughput by applying SIMD instructions and vectorized execution to Apache Arrow’s columnar in-memory representation. Similarly, Apache Parquet provides storage- and I/O-optimized columnar data access using column statistics and appropriate encodings. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with general-purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. Julien explains why the Arrow and Parquet Apache projects define standard columnar representations that allow interoperability without the usual cost of serialization.
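As a concrete illustration (not part of the talk itself), the sketch below uses the pyarrow library to build a columnar Arrow table, write it out as Parquet, and read back a single column; the file name and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table: data is laid out column by column in memory,
# which is what enables SIMD / vectorized execution over each column.
table = pa.table({
    "user_id": pa.array([1, 2, 3, 4], type=pa.int64()),
    "latency_ms": pa.array([12.5, 8.1, 30.2, 4.7], type=pa.float64()),
})

# Persist it as Parquet: the writer keeps per-column statistics and picks
# encodings (dictionary, run-length, ...) suited to each column.
pq.write_table(table, "events.parquet")

# Reading back only the column we need skips the rest of the file's I/O,
# unlike a row-based format where every record must be decoded in full.
latencies = pq.read_table("events.parquet", columns=["latency_ms"])
print(latencies.column("latency_ms"))
```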

This solid foundation of a shared columnar representation across the big data ecosystem promises great things for the future. Julien discusses the future of columnar data processing and the hardware trends it can take advantage of. Arrow-based interconnection between the various big data tools (SQL engines, UDFs, machine learning, big data frameworks, etc.) will make it possible to use them together seamlessly and efficiently, without overhead. When colocated on the same processing node, read-only shared memory and IPC avoid communication overhead; when remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs; and soon RDMA will make it possible to expose data remotely.
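As a hedged illustration of that zero-serialization interchange, the sketch below round-trips a table through pyarrow’s IPC stream format using an in-memory buffer; in practice the same bytes could be written to a socket or placed in shared memory for another process or language to read. The table contents are illustrative.

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

# Write the record batches in Arrow's IPC stream format. The bytes produced
# use the same columnar layout Arrow holds in memory, so there is no
# per-value serialization step.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# A reader (possibly another process or another language) reconstructs the
# table directly from the buffer without decoding individual values.
with pa.ipc.open_stream(buf) as reader:
    received = reader.read_all()
print(received.equals(table))
```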

Julien Le Dem
Principal Engineer | Datadog

Julien Le Dem is a Principal Engineer at Datadog, serves as an officer of the ASF, and is a member of the LF AI & Data Technical Advisory Council. He co-created the Parquet, Arrow, and OpenLineage open source projects and is involved in several others. His leadership career began in Data Platforms at Yahoo!, where he received his Hadoop initiation, and continued at Twitter, Dremio, and WeWork. He then co-founded Datakin (acquired by Astronomer) to tackle data observability. His French accent makes his talks particularly attractive.
