The main reason people are productive writing software is composability: engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability took a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present Weld, a new research project I'm leading at Stanford to enable much more efficient composition of software on emerging parallel hardware (multicore CPUs, GPUs, etc.).
Matei is an Assistant Professor of Computer Science at Stanford and Chief Technologist at Databricks. During his PhD, he started the Apache Spark computing engine and developed other widely used open source software, including Apache Mesos, Alluxio, and job schedulers for Apache Hadoop. Today, he also works on MLflow (https://mlflow.org), an open source machine learning platform from Databricks.