Distributed tracing stitches together the execution path traversed by a request: operations are timed, and the execution context is propagated as different services perform work to handle the request. Because storing all trace data is prohibitively expensive, some traces must be retained and others discarded. Anomalous traces are invaluable to debugging and optimization workflows, but a trace does not announce up front whether it will take an abnormally long time to complete, or whether an operation 35 links away will result in an error. This calls for a tail-based approach, in which the decision to retain a trace is deferred until the trace is complete, at which point its characteristics determine the likelihood that it is retained.
This talk will describe the product and engineering requirements for a robust and scalable tail-based distributed tracing system, and the statistical techniques that arise in meeting these requirements. For example, we will discuss how to prefer abnormally long and/or erroneous traces while maintaining the ability to calculate accurate summary statistics.
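As a rough illustration of that last point, and not necessarily the method presented in the talk, one standard way to favor anomalous traces without biasing summary statistics is inverse-probability weighting: each completed trace is retained with a probability that depends on its characteristics, and each retained trace carries a weight equal to the reciprocal of that probability. The sketch below assumes a hypothetical `Trace` record and a hypothetical policy (`slow_ms` threshold, `base_rate`) chosen only for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical summary of a completed trace.
    duration_ms: float
    has_error: bool


def retention_probability(trace, slow_ms=1000.0, base_rate=0.05):
    """Illustrative policy: always keep slow or erroneous traces,
    keep everything else at a low base rate. Both values are assumptions."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return 1.0
    return base_rate


def tail_sample(traces, rng):
    """Decide retention only once each trace is complete.
    Returns (trace, weight) pairs, where weight = 1 / retention probability."""
    kept = []
    for trace in traces:
        p = retention_probability(trace)
        if rng.random() < p:
            kept.append((trace, 1.0 / p))
    return kept


def estimate_count(kept):
    """Estimate the total number of traces seen: sum of weights."""
    return sum(weight for _, weight in kept)


def estimate_mean_duration(kept):
    """Weighted mean duration: the weights undo the bias toward slow traces."""
    total_weight = sum(weight for _, weight in kept)
    weighted_sum = sum(trace.duration_ms * weight for trace, weight in kept)
    return weighted_sum / total_weight
```

A retained trace with weight 20 stands in for roughly 20 similar traces that were discarded, so counts and averages computed with the weights approximate those of the full population, while every slow or erroneous trace is still available for debugging.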
We will introduce the necessary concepts from distributed tracing. Some comfort with statistical arguments would be helpful.
Joe Ross holds a PhD in mathematics from Columbia University and was a researcher and instructor in pure mathematics, most recently at the University of Southern California. He has given more than 20 talks about his research at conferences and universities throughout the world (Germany, Japan, Turkey, USA). He has also been the primary lecturer for many undergraduate and graduate math courses, and has given countless informal seminars. He has 9 publications in peer-reviewed mathematics journals. Joe has worked as a data scientist at machine learning/analytics startups for over five years; in his current role, he focuses on a variety of time series problems (anomaly detection, forecasting, correlation) and sampling problems that arise in monitoring.