Slack is a communication and collaboration platform for teams. Our millions of users spend more than 10 hours connected to the service on a typical working day.
Logs and events are critical components of business applications. Logs help us understand how a system works, debug problems, analyze performance, and improve operational efficiency. They are foundational to building distributed systems.
Logging infrastructure is a critical component for Slack; our logging pipeline drives customer billing, usage-pattern analysis, and performance analysis of business-critical systems.
This talk walks through the first-generation logging infrastructure and some of the problems we encountered with it. We will then cover the design of the second-generation logging infrastructure, how we made reliability a built-in feature of the system, the overall design principles, and some best practices for designing scalable logging infrastructure.
Ananth Packkildurai works as a software engineer at Slack, managing observability infrastructure such as Murron, Kafka, Elasticsearch, and Prometheus. He is passionate about all things related to ethical data management and building distributed systems.
Jackson Argo works as a software engineer at Slack and is the author of the Murron platform. He works alongside Ananth Packkildurai to manage Slack's visibility infrastructure. He is passionate about making music.