Its origin story is similar to that of data catalogs like Airbnb’s Data Portal or Lyft’s Amundsen: data discovery at hyperscale is a problem that calls for purpose-built tools. However, DataHub also takes things one step further than its predecessor at LinkedIn, WhereHows: in addition to boosting the productivity of data users, it also has its eye on AI/ML and “power[ing] new use cases while preserving fairness, privacy, and transparency.” With this goal in mind, DataHub needed an architecture that could scale with the metadata, which currently looks like this:
Note that the open source version of DataHub is separate and slightly different from the one LinkedIn maintains in-house. Differences include stream processing: “Although our internal version uses a managed stream processing infrastructure, we chose to use embedded (standalone) stream processing for the open source version because it avoids creating yet another infrastructure dependency,” DataHub contributors Kerem Sahin, Mars Lan, and Shirshanka Das explained
in a blog post worth reading.
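To make the “embedded (standalone) stream processing” idea more concrete, here is a minimal sketch of what an in-process consumer of metadata change events could look like, written against the plain Kafka consumer API in Java. The topic name, consumer group, and handling logic are illustrative assumptions rather than DataHub’s actual code; the point is only that events are processed inside the application itself instead of on a separately managed stream-processing cluster.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

/**
 * Illustrative standalone consumer: metadata change events are read from a
 * Kafka topic and handled in-process, so no separate stream-processing
 * infrastructure has to be deployed alongside the application.
 */
public class EmbeddedMetadataConsumer {

    // Hypothetical topic name for illustration only.
    private static final String TOPIC = "MetadataChangeEvent";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "embedded-metadata-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(TOPIC));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real deployment this step would update the metadata store
                    // or search index; here it just logs the event.
                    System.out.printf("Processing metadata event %s from partition %d%n",
                            record.key(), record.partition());
                }
            }
        }
    }
}
```

Embedding the consumer like this trades some horizontal scalability for one less piece of infrastructure to operate, which is exactly the trade-off the contributors describe for the open source release.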