The cloud has dramatically changed the landscape of data engineering as well as the behavior of data engineers. Specifically, data storage is migrating from the colocated model (e.g., HDFS) to a more cost-effective and more scalable, but often fully disaggregated and remote, data lake model (e.g., AWS S3). This shift has created a strong need for data orchestration in the cloud, similar to what Kubernetes provides for container-based workloads, so that data is presented in the right layout at the right location for data-consuming applications.
Originally developed at UC Berkeley's AMPLab as the research project "Tachyon", Alluxio (www.alluxio.io) is the world’s first open-source data orchestration system for the cloud. Alluxio creates a unified access layer for data-driven applications in big data and ML, enabling frameworks such as Spark, Presto, and TensorFlow to transparently access different external storage systems while actively leveraging an in-memory cache to accelerate data access.
In this talk, the speaker will present:
- New trends and challenges in the data ecosystem in the cloud era;
- Effective data engineering in the cloud world with data orchestration;
- Production use cases built on popular stacks such as Presto/Alluxio/S3.
Bin Fan is the founding engineer of Alluxio, Inc. and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked at Google, building next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University for work on the design and implementation of distributed systems and algorithms.