When using monitoring tools to identify latency issues, metrics and logging have their place, but they can be inadequate for offering true visibility across services, particularly with the rise of distributed application architectures. Distributed tracing, however, overcomes those challenges.
The Role of Logging and Metrics
Traditionally, logging and metrics have played critical roles in identifying and tackling latency issues. By creating a record of events in an operating system or other software, logging offers an account of what happened and when; for latency, that includes a record of response times. Metrics also record response times, but as part of a larger set of data aggregated from measured events. Metrics help put response times into context, for example by showing whether a response was fast or slow in comparison to others. Metrics are also useful for identifying patterns and triggering alerts.
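To make the distinction concrete, here is a minimal sketch of how raw response times (what a log would hold) can be aggregated into metrics that put a single response in context. The function names, the sample values, and the choice of the 95th percentile as the "slow" threshold are all illustrative assumptions, not a recommended standard.

```python
import statistics


def summarize(response_times_ms):
    """Aggregate raw response times into a few summary metrics."""
    ordered = sorted(response_times_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100, method="inclusive")[94]
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": p95,
    }


def is_slow(response_time_ms, summary):
    """Assumed rule of thumb: a response is 'slow' if it exceeds the p95."""
    return response_time_ms > summary["p95"]


# Hypothetical response times, in milliseconds, as they might appear in logs.
samples = [120, 95, 110, 105, 980, 101, 99, 130, 115, 108]
summary = summarize(samples)

for value in (105, 980):
    label = "slow" if is_slow(value, summary) else "typical"
    print(f"{value} ms is {label} relative to the other responses")
```

The summary says that a response was slow compared to the others, but not why, which is the gap the rest of this article addresses.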
The Benefits of Distributed Tracing
Logging and metrics, however, can run into challenges when a system reaches a certain size. As functions increase in complexity and companies increasingly move over to distributed architectures, metrics and logging can fail to provide sufficient visibility across services. Distributed tracing is instead increasingly being used for monitoring complex, microservice-based architectures as it records events with causal ordering. As a result, it allows you to ask the question, why is this slow? Distributed tracing can therefore enable you to determine causality in a way that logging and metrics alone cannot, as it can reveal the processes behind the speed of each response.
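As a rough illustration of what causal ordering buys you, the sketch below models a trace as spans that each point at their parent, then works out how much time each span spent on its own work rather than waiting on its children. The `Span` fields, service names, and timings are hypothetical, and the calculation assumes that a span's children run one after another rather than concurrently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span, excluding time covered by its direct children.

    Assumes sibling spans do not overlap (sequential calls).
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_step(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace of one slow homepage request.
trace = [
    Span("a", None, "GET /home", 0, 480),
    Span("b", "a", "auth-service", 5, 45),
    Span("c", "a", "recommendations", 50, 470),
    Span("d", "c", "db: SELECT items", 60, 440),
]

culprit = slowest_step(trace)
print(f"Most time spent in: {culprit.name}")
```

A metric would only report that the request took 480 ms; the parent links show that most of that time sits in a database query two calls down.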
As an application grows to 10+ processes, or begins to see increased concurrency or non-trivial interactions between mobile/web clients and servers, tracing provides visibility into those processes. As an example, think of your homepage. Whenever a user visits it, the web server will make two HTTP calls; each of those calls will branch out to call the database. Debugging this process is fairly straightforward if there are latency issues, as you can assign each request a unique ID and send it downstream via HTTP headers for analysis. If, however, your website experiences a spike in popularity and your application is now spread across multiple machines and services, logs become less useful: the more machines and services there are, the less visibility they provide.
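The request-ID approach described above can be sketched in a few lines. This is only an illustration: the `X-Request-ID` header name is a common convention assumed here, and the helper functions are hypothetical rather than part of any particular framework.

```python
import uuid

# Assumed header name; any agreed-upon header would work.
REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(incoming_headers):
    """Reuse the caller's request ID if present, otherwise mint a new one."""
    return incoming_headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex


def outgoing_headers(request_id, extra=None):
    """Build headers for a downstream HTTP call that carry the request ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers


def log_line(request_id, service, message):
    """Format a log line that can later be searched by request ID."""
    return f"request_id={request_id} service={service} {message}"


# The web server receives a homepage request that has no ID yet.
rid = ensure_request_id({})
print(log_line(rid, "web", "GET /home started"))

# It forwards the ID on each of its downstream HTTP calls.
downstream = outgoing_headers(rid, {"Accept": "application/json"})

# The downstream service picks the same ID back up from the headers.
rid_downstream = ensure_request_id(downstream)
print(log_line(rid_downstream, "profile-api", "querying database"))
```

Searching the logs for one `request_id` then returns every line that request produced. That works well on a single machine, but it becomes laborious once the lines are scattered across many machines and services.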
Use Cases: Uber & The Economist
Uber's engineering team has been a pioneer in distributed tracing, as the complexity of its software architecture has grown with the size of its worldwide business. As of early 2017, Uber had over two thousand microservices, driven by a growing number of business features, spanning both user-facing apps such as UberEATS and internal functions such as maps processing and data mining. The company also migrated away from large monolithic applications to a distributed microservices architecture and accordingly found that it needed a different monitoring system, one that would provide visibility into the system and the complex interactions happening between services. Its engineers have written about the shift in detail, explaining the different phases they moved through as they found their way to their current model.
The Economist has also detailed its transition from a monolithic system to a distributed microservice-based architecture, and the realization that “our logging and monitoring approach wasn’t able to keep up”. They found, for instance, that logs from different apps had no schema or differing schemas, and that HTTP requests were not easy to trace through distributed applications and services. They increasingly lacked transparency into their own systems, and the company’s engineers began to lose confidence in the site’s ability to perform at scale. The company developed a new strategy, leaning towards standardized logging, or distributed tracing, in addition to continuing to deploy logs and metrics where necessary.
Keep Track of Transactions
Distributed tracing allows you to keep track of transactions. When a request comes into a service, a tracer picks up its context, propagates that context to other processes, and attaches it to the transaction data sent to a tracing backend. This not only allows you to monitor your system across process boundaries in which APM agents can’t be installed (an increasingly important capability as the industry shifts to microservices), but also enables the transactions to be stitched together at a later time. When trying to improve latency across machines and services, this type of visibility is crucial. It allows you to ask the question, what happened and how long did it take?
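The propagate-then-stitch flow described above can be sketched as follows. This is a simplified, in-memory illustration and not how any particular tracer is implemented. The header layout is modeled on the W3C Trace Context `traceparent` header, while the `BACKEND` list, the function names, and the service names are hypothetical stand-ins.

```python
import secrets
from collections import defaultdict

# Stand-in for a tracing backend; a real one would be a separate service.
BACKEND = []


def inject(ctx, headers):
    """Write the context into outgoing headers (traceparent-style layout)."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers


def extract(headers):
    """Read the caller's context from incoming headers, if any."""
    value = headers.get("traceparent")
    if value is None:
        return None
    _version, trace_id, parent_span_id, _flags = value.split("-")
    return {"trace_id": trace_id, "span_id": parent_span_id}


def start_span(name, parent_ctx=None):
    """Record a span, joining the parent's trace when a context was propagated."""
    trace_id = parent_ctx["trace_id"] if parent_ctx else secrets.token_hex(16)
    ctx = {"trace_id": trace_id, "span_id": secrets.token_hex(8)}
    BACKEND.append({
        "trace_id": trace_id,
        "span_id": ctx["span_id"],
        "parent_id": parent_ctx["span_id"] if parent_ctx else None,
        "name": name,
    })
    return ctx


def stitch(trace_id):
    """Group a trace's spans by parent, as a backend might do later."""
    children = defaultdict(list)
    for span in BACKEND:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span["name"])
    return dict(children)


# Service A starts the trace and passes its context along in headers.
ctx_a = start_span("web: GET /home")
headers_a = inject(ctx_a, {})

# Service B only sees the headers, yet joins the same trace.
ctx_b = start_span("api: load profile", extract(headers_a))
headers_b = inject(ctx_b, {})

# Service C is called by B.
ctx_c = start_span("db: query", extract(headers_b))

tree = stitch(ctx_a["trace_id"])
print(tree)
```

Each service reports its span independently, and the shared trace ID plus the parent links are what let the backend reassemble the whole transaction afterwards. Adding start and end timestamps to each span would answer the "how long did it take?" half of the question.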
It’s also why Section is making an active shift from a logging to a tracing focus. This will allow us to provide more meaningful insights to our platform users, and offer the most up-to-date and valuable monitoring tools for today’s increasingly complicated computing landscape.