World-class LLMs are trained on trillions of tokens, and data quality is critical in determining the ultimate performance of these systems.
At Cohere, we train LLMs and retrieval (RAG) systems from scratch using datasets created through complex ingestion, preprocessing, and distillation pipelines. This talk covers what we know about how data drives LLM performance. It also covers the data platform we use to manage hundreds of datasets, from ultra-niche finetuning datasets to petabyte-scale web data, and to automate the measurement and enforcement of data quality at scale.
Here’s what practitioners will walk away with from this talk:
Jonathan is a six-year veteran of data in tech and leads the team responsible for integrating large, high-quality datasets into Cohere's foundational language models. He is passionate about advancing the science of data for LLMs and hopes to share what he has learned with the broader community. Previously, he led a data platform team at the mobile commerce startup Super and worked on data teams at Shopify and Instacart.