Here's our January 2021 roundup of links from across the web that could be relevant to you:
Dropbox shared insights into Alki, the petabyte-scale metadata store it designed for infrequently accessed metadata (“cold data”). The post details how its one-size-fits-all database, Edgestore, was reaching capacity limits, and why audit logs were a good candidate to move off costly SSDs. After considering off-the-shelf options, the team settled on building its own solution on top of AWS services: Alki, with DynamoDB as the hot store and S3 as the cold store. Like HBase or Cassandra, Alki is based on log-structured merge-trees (LSM trees), but it is better suited to handle hot-then-cold audit logs, as well as future use cases at Dropbox.
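To make the hot/cold pattern concrete, here is a minimal Python sketch of the tiering idea - not Alki's actual implementation: new entries land in a DynamoDB "hot" table, and aged-out entries are demoted to S3. The table name, bucket name, and key layout are all hypothetical.

```python
# A minimal sketch of a hot/cold tiered store in the spirit of Alki.
# Not Dropbox's implementation: the table/bucket names and key layout
# are hypothetical, chosen only to illustrate the pattern.
import json
import time

import boto3

HOT_TABLE = "audit-logs-hot"      # hypothetical DynamoDB table
COLD_BUCKET = "audit-logs-cold"   # hypothetical S3 bucket

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def write_log(user_id: str, entry: dict) -> None:
    """New, frequently accessed entries go to the hot store."""
    dynamodb.Table(HOT_TABLE).put_item(
        Item={"user_id": user_id, "ts": int(time.time()), **entry}
    )

def demote_to_cold(user_id: str, ts: int, entry: dict) -> None:
    """Aged-out entries move to cheap object storage, then leave the hot store."""
    s3.put_object(
        Bucket=COLD_BUCKET,
        Key=f"{user_id}/{ts}.json",
        Body=json.dumps(entry).encode("utf-8"),
    )
    dynamodb.Table(HOT_TABLE).delete_item(Key={"user_id": user_id, "ts": ts})
```

A production system would, among many other things, batch cold writes into larger sorted objects and index them for retrieval, which is where an LSM-tree design earns its keep.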
Snowflake Co-Founder & President of Products Benoit Dageville gave an interesting online keynote at CIDR 2021, this year’s Conference on Innovative Data Systems Research. After a quick overview of the company’s impressive journey, from its founding in 2012 and first launch in 2015 to its IPO last year, the talk dove deep into Snowflake Data Cloud, the company’s integrated platform - including its architecture and lessons learned along the way. Note that more CIDR 2021 videos will soon be available on the CIDR DB YouTube channel - we recommend you check it out.
Hashpath founder Seth Rosen shared reflections inspired by the dbt Coalesce conference last December, which could be summed up this way: “The analytics engineering movement will take back control over data quality and insights.” This is made possible by several trends helping analysts take matters into their own hands, such as the rise of dbt and the fact that knowing SQL now goes a really long way. It also includes the emergence of tools such as dbt tests, Great Expectations, and Looker data validations, which Rosen predicts will continue to grow in popularity and necessity; see the sketch below for a taste of what such tooling looks like.
If you’re interested in data quality, check out our past DC_THURS episodes with Great Expectations co-authors Abe Gong (June 2020) and James Campbell (January 2021).
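For a taste of what this kind of declarative data testing looks like in practice, here is a minimal, hypothetical Great Expectations sketch in Python; the dataset, column names, and bounds are all made up for illustration.

```python
# A toy example of declarative data testing with Great Expectations;
# the CSV file, column names, and bounds are hypothetical.
import great_expectations as ge

# Wrap a pandas DataFrame so it gains expect_* assertion methods.
df = ge.read_csv("orders.csv")

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

# Run all registered expectations and report whether they all passed.
results = df.validate()
print(results.success)
```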
Inspired by a recent thread on Hacker News, developer David Xiang wrote a solid summary of the two schools of thought around using Kafka as a database - their arguments, counter-arguments, and best representatives. The post was praised on Twitter by author Martin Kleppmann, whose arguments lean towards ‘yes’, but it really does justice to both sides. It is well worth reading no matter where you initially stand: you might discover trade-offs in either direction that you hadn’t considered for your specific use cases. After all, “there is never a one-size-fits-all solution in software development.”
Kleiner Perkins partner Bucky Moore shared his thinking on “the future of computing and data infrastructure” (and posted a summary thread on Twitter). As usual with predictions, not everyone will agree and the future might prove him wrong, but his rationale is compelling either way. His main points are the following: we will see more products and services built directly on top of the warehouse; the cloud will go serverless (as in costing zero when not used); protecting endpoints will evolve into protecting data; “more business processes will be written as code, and treated as such”; and once the ML sector consolidates, the likely winners will be the most data-intensive players.
Monte Carlo CEO Barr Moses and AppZen VP of Engineering Debashis Saha argued that “we need to rethink our approach to metadata management and data governance.” In a joint blog post, they made the case for “next-generation catalogs [that] will have the capabilities to learn, understand, and infer the data, enabling users to leverage its insights in a self-service manner.” They also suggested how we can get there, and explained why it matters: “To achieve truly discoverable data, it’s important that your data is not just ‘cataloged,’ but also accurate, clean, and fully observable from ingestion to consumption.”
Paul Hsiao, Grace Isford, and Tobias Macey joined forces to gather advice for any data entrepreneur hoping to land their first 20 data customers. After talking to more than 50 data leaders and practitioners, they came up with thoughts and recommendations on value propositions that solve the main pain points for data leaders, as well as go-to-market tips. For example: “Don’t ever say [your product or service] is plug and play. Customers won’t believe you. You lose credibility fast. Be upfront about the integration required. Good customers expect and understand it. It makes your tool more sticky as well.” Check out the post for the full list.
Pete Warden works on TensorFlow Lite at Google, and was recently a guest on the Software Engineering Daily podcast for a conversation on ML applications and the frameworks that support them. As host Jeff Meyerson noted in his introduction, “TensorFlow Lite is an open source deep learning framework for on-device inference […] designed to improve the viability of machine learning applications on phones, sensors and other IoT devices,” making his guest particularly familiar with resource-constrained environments, points of friction, and ML on the edge.
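For a flavor of what on-device inference looks like from the developer’s side, here is a minimal TensorFlow Lite sketch in Python; the model file and the random input standing in for sensor data are hypothetical.

```python
# Minimal TensorFlow Lite inference sketch; "model.tflite" and the random
# input are hypothetical stand-ins for a real model and real sensor data.
import numpy as np
import tensorflow as tf

# Load the compiled .tflite model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one input matching the model's expected shape and dtype.
input_data = np.random.random_sample(input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```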
Alejandro Saucedo, Chief Scientist at the Institute for Ethical AI & Machine Learning, published a great guide on Towards Data Science. As its subtitle announces, it is “a practical deep dive on production monitoring architectures for machine learning at scale using real-time metrics, outlier detectors, drift detectors, metrics servers and explainers.” Featuring graphs and code snippets, it covers technical aspects while keeping key principles on the radar: compliance, governance, and scalability. All examples are based on open-source frameworks, and you can explore further on your own with the linked Jupyter notebook.
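To illustrate the drift-detection idea in its simplest form (the guide’s own examples use fuller-featured open-source frameworks), here is a toy Python sketch based on a two-sample Kolmogorov-Smirnov test; the synthetic data and the 0.05 significance threshold are hypothetical choices.

```python
# A toy illustration of feature drift detection with a two-sample
# Kolmogorov-Smirnov test; not the guide's architecture. The synthetic
# data and the 0.05 threshold are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # training-time feature
production = rng.normal(loc=0.3, scale=1.0, size=5000)  # shifted live feature

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```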
Daniel Molnar, Data Council’s local community co-organizer in Berlin, was a recent guest on the Data Engineering Podcast. His conversation with Tobias Macey is worth a listen for his perspective on being a data janitor, his take on data engineering careers, his recommendation to focus on the basics rather than the newest, shiniest tools, and his advice for cutting through the noise. This also reflects the focus of his latest endeavor, the Pipeline Data Engineering Academy, a coding bootcamp focused on data engineering that promises to teach “data craftsmanship beyond the AI-hype.”
Have you found or created anything that you’d like to recommend to the data community? Feel free to let us know: community@datacouncil.ai