Building modern open-source data lakes for big data analytics and AI workloads has become increasingly popular. Data platform architecture has evolved rapidly over the past few years, with many open-source communities participating and collaborating in this movement. Much of that effort focuses on better, more cloud-native ways to serve metadata for structured data, but challenges remain in retrieving data efficiently and providing sufficient bandwidth. For example, the scalability and cost-efficiency of cloud-native storage services are driving many organizations to embrace hybrid or multi-cloud architectures. It is important to present data in the lake efficiently to the compute layer, with I/O bandwidth shared fairly across lake users. On the application side, I/O patterns are also evolving quickly. For example, recent machine learning jobs tend to retrieve hundreds of millions of relatively small files or objects during training, which increasingly challenges the scalability, cost-efficiency, and throughput of metadata serving.
In this talk, Hope will share her views, drawn from working with many open-source users. She will present her analysis of these industry trends and challenges, along with success stories from the open-source ecosystem.
Hope Wang is a Developer Advocate at Alluxio with a decade of experience in Data, AI, and Cloud. She contributes to open-source projects such as Alluxio, Trino, and PrestoDB, and is an AWS Certified Solutions Architect – Professional. Her prior roles include positions in venture capital and as a Data Architect. Academically, she holds a BS in Computer Science, a BA in Economics, an MEng in Software Engineering from Peking University, and an MBA from the University of Southern California. Outside of work, she is an independent musician creating easy-listening songs.