Data Council Blog
by Data Council
Open Source Highlight: DataHub
DataHub is a generalized metadata search & discovery tool. Originally created at LinkedIn, it was open sourced in February of this year and has since been adopted by other companies such as Expedia and Typeform. Its ambition is to help connect employees to data that matters to them.
Its origin story is similar to that of data catalogs like Airbnb’s Data Portal or Lyft’s Amundsen: data discovery at hyperscale is a problem that calls for specific tools. However, DataHub also takes things one step further than its predecessor at LinkedIn, WhereHows: in addition to boosting the productivity of data users, it has its eye on AI/ML and “power[ing] new use cases while preserving fairness, privacy, and transparency.” With this goal in mind, it required an architecture able to scale with the metadata, which currently looks like the following:

Note that the open source version of DataHub is separate and slightly different from the one LinkedIn maintains in-house. Differences include stream processing: “Although our internal version uses a managed stream processing infrastructure, we chose to use embedded (standalone) stream processing for the open source version because it avoids creating yet another infrastructure dependency,” DataHub contributors Kerem Sahin, Mars Lan, and Shirshanka Das explained in a blog post worth reading.
To learn more, you can check out this recent episode of the Data Engineering Podcast or join DataHub’s Slack, which features channels for discussions around search, graph, UI, k8s, and more.