A pervasive but often overlooked problem in predictive modeling on real-world data is data or label leakage, which occurs when information that is not available at prediction time leaks into the training data. The result is models that look perfect on paper but are useless in practice.
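To make the failure mode concrete, here is a minimal sketch on an invented churn dataset (the column names and numbers are hypothetical, not from the talk): a single leaky column that is only populated after the outcome is known drives offline metrics to near perfection.

```python
# Hypothetical illustration of label leakage: "contract_cancelled_flag" is only
# recorded after the churn outcome is known, so it is unavailable at prediction time.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, n),
    "support_tickets": rng.poisson(2, n),
})
# The true label depends only weakly on the legitimate features.
y = (0.02 * X["support_tickets"] - 0.01 * X["tenure_months"]
     + rng.normal(0, 1, n)) > 0
# Leaky column: it simply mirrors the label.
X["contract_cancelled_flag"] = y.astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

leaky = LogisticRegression().fit(X_tr, y_tr)
clean = LogisticRegression().fit(X_tr.drop(columns="contract_cancelled_flag"), y_tr)

print("AUC with leaky feature:   ",
      roc_auc_score(y_te, leaky.predict_proba(X_te)[:, 1]))      # close to 1.0
print("AUC without leaky feature:",
      roc_auc_score(y_te, clean.predict_proba(
          X_te.drop(columns="contract_cancelled_flag"))[:, 1]))  # much more modest
```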
This talk will first describe different cases of label leakage and how they can distort your predictive performance. Next, I will describe how we tackle label leakage at scale at Salesforce in a multi-tenant environment, contrasting our approach with traditional ones and explaining why those do not suffice in the enterprise setting. We will go through several novel algorithms and methods for catching label leakage automatically, and look at the machine learning process needed to support them.
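As a rough illustration of what automated leakage detection can look like (a simple heuristic sketch, not the specific algorithms presented in the talk), one common first pass is to flag any feature that on its own predicts the label almost perfectly:

```python
# Simple first-pass heuristic (assumed for illustration, not the talk's method):
# flag features whose single-column AUC against the label is suspiciously high.
import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_possible_leaks(X: pd.DataFrame, y: pd.Series, threshold: float = 0.99):
    """Return features whose single-feature AUC against the label exceeds threshold."""
    suspects = {}
    for col in X.columns:
        # Encode non-numeric columns so they can be scored directly.
        scores = pd.factorize(X[col])[0] if X[col].dtype == object else X[col]
        auc = roc_auc_score(y, scores)
        auc = max(auc, 1 - auc)  # the direction of the relationship does not matter
        if auc >= threshold:
            suspects[col] = auc
    return suspects

# With the DataFrame from the previous sketch:
# flag_possible_leaks(X, y)  ->  {"contract_cancelled_flag": 1.0}
```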
Till Bergmann is a Senior Data Scientist at Salesforce Einstein, building platforms that make it easier to integrate machine learning into Salesforce products, with a focus on automating many of the laborious steps in the machine learning pipeline. Before joining Salesforce, he obtained a PhD in Cognitive Science at the University of California, Merced, where he used NLP techniques to study the collaboration patterns of academics.