With more than 1,300 stars on GitHub, Apache Hudi is a popular open source solution that helps companies with large analytical datasets quickly ingest and manage data on HDFS or cloud storage.
Hudi (pronounced "Hoodie") was originally developed at Uber in 2016 to fulfill its need to "build a transactional data lake that would facilitate quick, reliable data updates at scale." It reached its goal in a matter of months: "by the end of 2017, all raw data tables at Uber leveraged the Hudi format, running one of the largest transactional data lakes on the planet," co-author Nishith Agarwal recalls.
While Uber's scale and nature made the need for an efficient data lake particularly pressing, many large companies were starting to find themselves in a similar position. Realizing that this problem was widespread, Uber first open-sourced Hudi in 2017, and in 2019 donated it to the Apache Software Foundation, where it is now a top-level project.
Beyond its use cases at Uber, Apache Hudi is used in production by other major companies such as Alibaba Cloud, Udemy, and Tencent. But this is only the beginning, Vinoth Chandar explained: "We are only getting started with our deeply technical roadmap. We certainly look forward to a lot more contributions and collaborations from the community to get there. Everyone's invited!"
Update: Here's a recent episode of DC_THURS with co-authors Nishith Agarwal & Vinoth Chandar: