Here's our July 2020 roundup of relevant links for data professionals, from blog posts to podcast episodes:
Software Engineering Daily recently invited Apache Airflow's creator Maxime Beauchemin and Astronomer engineers Vikram Koka and Ash Berlin-Taylor to discuss the state of Airflow. Listen to the podcast episode or read the transcript to hear their comments on Airflow's use cases, its purpose, the open source ecosystem, and more.
The Data Science Podcast recently featured an interview with Cesar Mierzejek, Data Scientist at IBM, in which he shared three skills he didn't know he needed to be a successful data scientist. No full spoilers here, but the 18-minute conversation covers communication, prioritization, and learning, with an eye on leadership.
Jason Brownlee's Machine Learning Mastery is known for its tutorials, and the one published on June 22 is particularly relevant for anyone working on data preparation for ML models. It focuses on how to prevent data leakage, reminding readers that data preparation must be fit on the training set only. By the end of the tutorial, they will know how to avoid data leakage with train-test splits and k-fold cross-validation in Python.
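To make the data leakage point concrete, here is a minimal sketch of the idea, using only the Python standard library. It is not the tutorial's own code: the helper names (`train_test_split`, `fit_scaler`, `transform`) and the toy data are our own assumptions, chosen for illustration. The key step is that the scaling statistics are computed from the training rows only, then reused unchanged on the test rows.

```python
import random
import statistics

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_scaler(values):
    """Compute the mean and standard deviation used for standardization."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std

def transform(values, scaler):
    """Standardize values with statistics that were fit elsewhere."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

# Toy feature column (hypothetical data, for illustration only).
data = [float(x) for x in range(1, 21)]

# 1. Split first, so the test rows never influence the preparation step.
train, test = train_test_split(data)

# 2. Fit the scaler on the training rows only.
scaler = fit_scaler(train)

# 3. Apply the same fitted statistics to both sets.
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)
```

Fitting the scaler on the full dataset before splitting would let information about the test rows leak into training. For k-fold cross-validation, the same fit-then-transform sequence would be repeated inside each fold.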
Marc-Olivier Arsenault, Data Science Manager at Shopify, published a post on the company's engineering blog in which he explains the founding principles that guide its data warehousing and analysis. From data consistency and open access to communication guidelines and required readings, this will be an interesting guide for teams operating at a similar scale.
"Long live code": this is the title of the well-written essay that Hex.tech's CEO Barry McCardel published on Medium. Its key point is that "no code" tools may end up limiting their users. "Truly empowering users doesn’t mean getting rid of code, but embracing it," he argues.
Anyscale is the company founded by the creators of open source project Ray, and its co-founder Ion Stoica recently co-authored a blog post with Ben Lorica on "five key features for a machine learning platform." Going through a list of elements that ML platforms should possess, such as ecosystem integration and easy scaling, it makes the case for Ray as the foundation of future ML platforms.
Full-stack data platform Holistics published a free 'Analytics Setup Guidebook' with relevant insights on building scalable analytics stacks. "This book is written for people who need a map to the world of data analytics," the introduction points out. Click here to access the ungated version of the guide (you can also check out the Hacker News discussion one of the authors participated in).
Fiber is a Python-based distributed computing library that Uber built and open-sourced. You can find out more about it on the Uber Engineering blog, with a post that tells the story behind Fiber and the key features that make it useful for modern computer clusters. The repository is also available on GitHub.
Jeff Sternberg from Google Cloud wrote a Forbes column titled "Beyond Spreadsheets", which was also the topic of a talk he recently gave at our virtual NY meetup (video available here). As he explained in a Twitter thread, spreadsheets are incredibly useful, but they have limits, which notebooks can bypass.
Towards Data Science is featuring a multi-part series on Efficient PyTorch, and part 1 focuses on an important aspect: eliminating I/O and CPU bottlenecks. Its author Eugene Khvedchenya is a computer vision & machine learning engineer who previously authored pytorch-toolbelt. Some previous knowledge of PyTorch is required, but the post is still accessible if you are early in your journey with data (thanks to our KL meetup organizers for the recommendation!).
Have you created or enjoyed a post or podcast episode that you'd like to recommend to the data community? Make sure to let us know: community@datacouncil.ai