Because of the scope of Heroku's challenges, Jeff Chao, a software engineer on Heroku's Data Infrastructure team, experienced a multitude of fascinating Kafka failure scenarios that most of us would likely never see in our typical implementations. For example, Jeff discovered a variety of situations in which brokers can enter cascading failure and eventually render a cluster completely unavailable. He learned that the key to preventing these failures is ensuring that when one broker fails, the remaining brokers have enough headroom to take on the downed broker's partitions. Although this might seem obvious in theory, Jeff found that in practice there are many details that are easily overlooked, which he will cover in depth in his talk.
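To make that capacity argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not from Jeff's talk; the broker names, partition counts, and per-broker limit are all illustrative assumptions. Given a hypothetical cluster state, it checks whether the survivors of any single broker failure could absorb the orphaned partitions without exceeding an assumed limit:

```python
# Hedged sketch: can the surviving brokers absorb a failed broker's
# partitions? All names, counts, and the cap below are hypothetical.

# Current partition count per broker (assumed cluster state).
partitions_per_broker = {"broker-1": 40, "broker-2": 40, "broker-3": 40}

# Assumed limit before a broker degrades; in practice this would be
# derived from disk, network, and heap headroom.
MAX_PARTITIONS_PER_BROKER = 70

def survives_failure(loads: dict, failed: str, cap: int) -> bool:
    """Return True if the remaining brokers can absorb the failed
    broker's partitions without any of them exceeding the cap."""
    orphaned = loads[failed]
    survivors = [n for b, n in loads.items() if b != failed]
    # Assume orphaned partitions are spread evenly across survivors;
    # pessimistically give the extra remainder to the busiest broker.
    extra, remainder = divmod(orphaned, len(survivors))
    worst_case = max(survivors) + extra + (1 if remainder else 0)
    return worst_case <= cap

for broker in partitions_per_broker:
    ok = survives_failure(partitions_per_broker, broker,
                          MAX_PARTITIONS_PER_BROKER)
    print(f"Losing {broker}: {'OK' if ok else 'cascading-failure risk'}")
```

With the numbers above, each survivor goes from 40 to 60 partitions, under the assumed cap of 70; shrink the headroom and the same check flags the cascading-failure risk the talk describes.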
To up your data pipeline game and learn how Jeff and the data team pushed the limits of Kafka at Heroku, check out the full talk at DataEngConf SF '17.