Data Council Blog

NLP Heroes, Pinot, Data Testing, and More: Top 10 Links From Across the Web

Here's our November 2020 roundup of good reads and podcast episodes that might be relevant for your career in data:

1. Heroes of NLP: Quoc Le (Deeplearning.ai)

NLP researcher Quoc Le was recently Andrew Ng’s guest as part of the ‘Heroes of NLP’ video series. Their discussion covered Le’s impressive journey, from growing up in Vietnam and developing his first basic chatbot in high school to becoming Google Brain’s first intern, and everything that followed. This includes the ‘Google Cat’ experiment, the Meena chatbot project, and work on Seq2Seq models. Check out the conversation here, and consider subscribing to the series to hear from other guests such as Chris Manning, Kathleen McKeown, and Oren Etzioni.

Open Source Highlight: DataHub

DataHub is a generalized metadata search & discovery tool. Originally created at LinkedIn, it was open sourced in February of this year, and has been adopted by other companies such as Expedia and Typeform, with the ambition to help connect employees to data that matters to them.

State of AI, Data Quality, and More: Top 10 Links From Across the Web

Here's our October 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. Multiplayer Editing: a Pragmatic Approach (Hex)

Data collaboration startup Hex published a great long read on its approach to live collaboration. Written by software engineer Mac Lockard, it takes a look at the respective pros and cons of Operational Transforms and Conflict-free Replicated Data Types (CRDTs), before explaining the solution that Hex adopted. Inspired by Figma's hybrid approach, it can also be described as "Atomic Operations (AO), as all edits to application state are broken down to their smallest atomic parts." "If the application you are building can rely on last-writer-wins semantics, Atomic Operations might provide a more pragmatic approach," the post concludes. This is a highly recommended read if you are weighing a similar decision.
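
To make the idea concrete, here is a minimal Python sketch of last-writer-wins atomic operations. It is not Hex's implementation: the names `AtomicOp`, `LastWriterWinsState`, and `split_edit`, as well as the server-assigned `seq` counter, are assumptions made purely for illustration. Each edit is split into single-field operations, and for any given field the operation with the highest sequence number wins.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class AtomicOp:
    """One smallest-possible edit: set a single field of a single object."""
    object_id: str
    field: str
    value: Any
    seq: int  # hypothetical server-assigned order; higher means later


class LastWriterWinsState:
    """Illustrative shared state where the latest write to a field wins."""

    def __init__(self) -> None:
        # Maps (object_id, field) to (seq, value) of the winning write.
        self._cells: Dict[Tuple[str, str], Tuple[int, Any]] = {}

    def apply(self, op: AtomicOp) -> bool:
        """Apply op unless an equal or later write already landed on its field."""
        key = (op.object_id, op.field)
        current = self._cells.get(key)
        if current is not None and current[0] >= op.seq:
            return False  # stale write, ignored
        self._cells[key] = (op.seq, op.value)
        return True

    def get(self, object_id: str, field: str) -> Any:
        cell = self._cells.get((object_id, field))
        return None if cell is None else cell[1]


def split_edit(object_id: str, changes: Dict[str, Any], first_seq: int) -> List[AtomicOp]:
    """Break a multi-field edit into one atomic operation per field."""
    return [
        AtomicOp(object_id, field, value, first_seq + i)
        for i, (field, value) in enumerate(changes.items())
    ]


state = LastWriterWinsState()
for op in split_edit("cell-1", {"title": "Revenue", "code": "SELECT 1"}, first_seq=1):
    state.apply(op)

# A later write to the same field replaces the earlier one...
state.apply(AtomicOp("cell-1", "title", "Sales", seq=5))
# ...while a delayed, older write is discarded.
state.apply(AtomicOp("cell-1", "title", "Stale", seq=3))
```

Because every operation touches exactly one field, two people editing different fields of the same object never conflict, and edits to the same field resolve by order alone. That simplicity is the trade-off the post describes: it only works when losing the earlier of two concurrent writes is acceptable.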

Open Source Highlight: n8n

Created by Berlin-based developer Jan Oberhauser in 2019, n8n presents itself as “a free and open workflow automation tool”. Think of it as a locally hosted Zapier on steroids.

Hot Data Tools pt. 2, End-to-End Data Scientists, and More: Top 10 Links From Across the Web

Here's our September 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. What Data Tools Don't Do (Data Council)

Our founder Pete Soderling co-authored a follow-on piece to his previous post with Great Expectations' core contributor Abe Gong and Partner at Amplify Partners Sarah Catanzaro, for which they had interviewed the makers of some of the hottest data tools. The focus is still the same: rather than what their data tools can do, we hear about what they don't do, as a way to better understand how they fit together. From ApertureData to Xplenty, this new installment covers 21 new tools, and you can read it here.

Large Datasets, Are Dashboards Dead, and More: Top 10 Links From Across the Web

Here's our August 2020 roundup of good reads and great podcast episodes for anyone working with data:

1. Processing Large Datasets with Python

AI engineer and author J.T. Wolohan was recently a guest on Heroku’s Code[ish] podcast to discuss the contents of his book, “Mastering Large Datasets with Python.” Listen to the episode here or read the transcript for some practical advice on using Python to deal with massive datasets, especially in the context of machine learning.

Open Source Highlight: Apache Hudi

With more than 1,300 stars on GitHub, Apache Hudi is a great open source solution for companies with large analytical datasets to quickly ingest data onto HDFS or cloud storage.

Apache Airflow, Beyond Spreadsheets, and More: Top 10 Links From Across the Web

Here's our July 2020 roundup of relevant links for data professionals, from blog posts to podcast episodes:

1. The State of Airflow

Software Engineering Daily recently invited Apache Airflow's creator Maxime Beauchemin and Astronomer engineers Vikram Koka and Ash Berlin-Taylor to discuss the state of Airflow. Listen to the podcast episode or read the transcript to hear their comments on Airflow's use cases, its purpose, the open source ecosystem, and more.

Open Source Highlight: Apache Iceberg

Apache Iceberg is an open table format for very large analytic datasets. You can use it with Presto or Spark to add tables in a high-performance format that is designed to work just like a SQL table.

AGI, Dask, Feature Stores, and More: Top 10 Links From Across the Web

Here's our June 2020 roundup of relevant links for data professionals, from blog posts to podcast episodes:

1. Self-Supervised Learning vs. AGI

"AGI does not exist — there is no such thing as general intelligence. We can talk about rat-level intelligence, cat-level intelligence, dog-level intelligence, or human-level intelligence, but not artificial general intelligence," Yann LeCun declared during an online session of the International Conference on Learning Representations (ICLR) 2020, which VentureBeat wrote about. Together with fellow Turing Award winner Yoshua Bengio, he advocated for pursuing humanlike AI through "self-supervised learning."