Here's our June 2020 roundup of relevant links for data professionals, from blog posts to podcast episodes:
"AGI does not exist — there is no such thing as general intelligence. We can talk about rat-level intelligence, cat-level intelligence, dog-level intelligence, or human-level intelligence, but not artificial general intelligence," Yann LeCun declared during an online session of the International Conference on Learning Representation (ICLR) 2020, which VentureBeat wrote about. Together with fellow Turing Award winner Yoshua Bengio, he advocated for pursuing humanlike AI through "self-supervised learning." The online discussion that followed featured interesting insights, including this 'Explain Like I'm 5' definition shared by Professor Thomas G. Dietterich: "Self-supervised learning is a method for attacking unsupervised learning problems by using the mechanisms of supervised learning."
A recent episode of the TWIML AI Podcast with Sam Charrington featured a timely panel discussion on "ways that Data Scientists and ML/AI practitioners can continue to advance their careers despite current challenges." In this online session, panelists Hilary Mason, Caroline Chavier, Ana Maria Echeverri, and Jacqueline Nolis shared their perspectives and tips from both sides of the table on identifying one's skills, considering job offers that might not have "data scientist" in their title, and interviewing online successfully.
Are you familiar with the term "Analytics Engineer"? We have been hearing about it more and more often since it emerged, especially in the context of the dbt ecosystem, so we took a deep look at it. Our post covers a fair bit of ground on this emerging role, from its (recent) history to its future as a career, so we hope you will enjoy it.
Data teams are too often flooded with a constant flow of ad-hoc requests to answer business questions, without any general roadmap, Sebastian Perez Saaibi points out in an interesting blog post. To get out of this constant emergency mode, he suggests having data teams instead focus on building 'data products' that will answer these business questions on a self-service basis.
In this episode of The Data Science Podcast, Will Roberts discusses the concept of Federated Learning with Nathalie Baracaldo, a research staff member at IBM's Almaden Research Center. Beyond the theory, she gives actual examples of scenarios in which federated learning could be particularly useful. For instance, it could help competing banks join forces to fight fraud, which typically represents only a small subset of each bank's dataset. More generally, it could yield 'big data'-level findings in privacy-constrained environments.
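For readers new to the idea, here is a rough sketch of federated averaging, one common approach to federated learning (the episode does not necessarily describe this exact algorithm). It assumes two hypothetical banks, `bank_a` and `bank_b`, that each train a one-parameter model `y = w * x` on their own records and share only the resulting weight, never the records themselves. The data, names, and hyperparameters are made up for illustration.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Combine client weights, weighting each by its number of records."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


# Hypothetical private datasets; in practice these never leave each bank.
clients = {
    "bank_a": [(1.0, 3.0), (2.0, 6.0)],
    "bank_b": [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
}

global_w = 0.0
for _ in range(50):
    # Each client starts from the shared model and trains locally.
    updates = [local_update(global_w, data) for data in clients.values()]
    sizes = [len(data) for data in clients.values()]
    # The server only ever sees the weights, not the underlying records.
    global_w = federated_average(updates, sizes)

print(f"shared model weight: {global_w:.4f}")
```

Production systems add secure aggregation, differential privacy, and handling for clients whose data distributions differ, none of which this sketch attempts.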
Software Engineering Daily recently welcomed Matthew Rocklin to discuss Dask and other options to scale up Python. The conversation covers distributed computing, the Python ecosystem, the differences between Dask and Spark, as well as Rocklin's new Dask-centered company, Coiled Computing. As usual, a transcript is also available [PDF].
The WiDS Podcast recently featured a conversation between host Professor Margot Gerritsen and guest Andrea Gagliano that will appeal to people in our field who are also interested in the arts and creativity. Currently the Head of Data Science, AI, and Machine Learning at Getty Images, Gagliano is leading, among other things, a machine learning project to detect pictures that look too much like stock images rather than naturally posed shots.
Feature stores are a great tool for organizations to make their data science/ML processes more efficient. This is likely why there are now a dozen of them, from pioneer Michelangelo to open-source Hopsworks, and a very useful resource to learn more about them: Featurestore.org. The site includes a comparison table with the characteristics of each store, and also has its own newsletter. Happy exploring!
AWS released a whitepaper called "Machine Learning Lens", which is meant to be used in conjunction with the AWS Well-Architected Framework to review ML workloads against AWS best practices. It covers different scenarios as well as five pillars (operational excellence, security, reliability, performance efficiency, and cost optimization), with a set of questions to consider in each section.
Here's a tutorial on how to add a Google Sheet as a database in Apache Superset. As the author points out, "running queries against Google Sheets is significantly slower than against a database such as PostgreSQL or MySQL so you should only use this approach for smaller datasets" - but it can still be a useful option to test Superset and start exploring. The post was published on the blog of Preset, which is currently building a hosted cloud solution for Superset.
Have you created or enjoyed a post or podcast episode that you'd like to recommend to the data community? Make sure to let us know: community@datacouncil.ai