DevOps Checklist for Distributed Tracing

The blog below is a guest blog post written by Epsagon, one of our ChefConf Online sponsors.

What is Distributed Tracing anyway? In the olden days, to debug problems, developers would typically log into the server running the software and inspect the logs, and perhaps some real-time metrics, to investigate the issue. Nowadays, the distributed nature of modern architectures makes this impossible (although it was never the best way to begin with). Indeed, when your software runs in parallel on multiple instances or containers, and when deployments are automated and happen without human intervention, you need to devise new methods of investigation.

Today, we need to centralize logs and metrics to allow developers and testers to investigate issues. Distributed tracing goes one step further in the sense that it helps you understand the logical flow of data among many services, making it an ideal method for analyzing problems and performance issues in microservice-based architectures.

Choosing a Solution

Distributed tracing involves instrumenting the code (either manually or automatically), identifying all microservice calls related to a single request, and sending trace data to a central location. It is made up of a chain of components: code instrumentation, collection of trace data, and, finally, analysis and visualization, all of which we'll discuss below.

But first, you need to pick a solution. Ideally, all stakeholders (especially developers and DevOps engineers) should consider their options and needs carefully and come to an agreement as to the best solution for everyone involved.

In a nutshell, virtually all distributed tracing solutions use an agent that runs on each instance and sends trace data to a global collector. There are a number of options to choose from, although there aren't that many open-source projects in this field; Zipkin and Jaeger are probably the two best known.

Epsagon provides automated data correlation, payloads, and end-to-end observability within microservice environments, allowing Dev and Ops teams to troubleshoot less and fix issues faster. With a lightweight agent SDK, Epsagon provides automated instrumentation and tracing without gaps in coverage, giving you full visibility into containers, VMs, serverless, and more, with no training, manual coding, tagging, or maintenance required.

Unless your requirements are very basic, distributed tracing should be augmented with logs and metrics in order to provide a more complete picture. If the chosen solution is able to link traces with server logs and metric data related to the request being traced, this will usually be extremely helpful to whoever is investigating a given issue.

Code Instrumentation

Code instrumentation is an essential part of distributed tracing. Although developers implement it, the whole team needs to define the strategy beforehand. Make sure to include the DevOps engineers in these discussions, as developers and DevOps engineers need to work together for this endeavor to be successful.

It’s better to stick to widely used industry standards, such as OpenTelemetry, so you can replace elements of the chain in the future if required. Avoiding vendor lock-in is usually an important consideration, although it should always be weighed against the ease and speed of implementation a vendor offers. Also keep in mind that you don’t necessarily get what you pay for: some commercial solutions are quite expensive yet difficult to get up and running and not easy to use at all.

Code instrumentation can be done manually, automatically, or via a combination of the two. In fact, a combination gives you the best of both worlds: automated instrumentation provides a baseline that can be augmented with manual traces whenever the need arises, as sketched below.
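As a rough illustration, here is a minimal sketch using the OpenTelemetry Python SDK, assuming the opentelemetry-sdk and opentelemetry-instrumentation-requests packages are installed; the service name, span name, and URL are made up for the example.

```python
# Minimal sketch: automatic instrumentation of outgoing HTTP calls,
# augmented with a manual span around business logic.
import requests
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up a tracer provider; the service name here is purely illustrative.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Automatic instrumentation: every call made with `requests` now emits a span.
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    # Manual instrumentation: wrap business logic in an explicit span
    # and attach the attributes that matter to your team.
    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.id", order_id)
        # Hypothetical downstream call, traced automatically by the instrumentor.
        requests.get("https://inventory.example.com/check")

place_order("12345")
```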

Each incoming request is assigned a unique identifier, which is passed along to all calls related to that request. All the traces are then centralized in a global location, as described in the next section.
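To make this concrete, here is a minimal sketch of how that identifier travels between services using OpenTelemetry's W3C Trace Context propagator, assuming a tracer provider has been configured as in the previous sketch; the downstream URL is illustrative, and in practice most auto-instrumentation libraries inject these headers for you.

```python
# Minimal sketch: propagating the trace identifier to a downstream service.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("call-downstream"):
    headers = {}
    # inject() adds the W3C `traceparent` header (trace ID + parent span ID)
    # so the downstream service can attach its spans to the same trace.
    inject(headers)
    requests.get("https://payments.example.com/charge", headers=headers)
```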

Code instrumentation is probably the most tedious and time-consuming step, but you should not neglect it. The usefulness of what you get in the end will mostly be based on the quality of the work done and the effort put in by the team during this stage of the project.

Collection of Trace Data

The trace data is usually sent first to a local agent (i.e., running on the same machine) in order to avoid too much overhead in the application. This agent may perform some filtering and discard unwanted traces. For example, you may want to collect only the traces related to a certain type of request that you want to debug, or maybe you’re only concerned with the performance of the system and need to collect data on just 1% of requests, randomly selected.
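For example, head-based sampling of roughly 1% of requests might be expressed like this with the OpenTelemetry Python SDK; this is just one way to configure it (samplers can also be set via environment variables), and the ratio is an assumption for the sketch.

```python
# Minimal sketch: keep roughly 1% of traces, chosen by trace ID,
# and let downstream services follow the parent's sampling decision.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.01))  # ~1% of new traces
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```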

As a DevOps engineer, you want to ensure that the configuration of the local agent is automated and synchronized across all of your instances. The agent will then efficiently package trace data and send it in bulk to a global collector, a centralized service that gathers all such traces.
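On the application side, pointing the SDK at that local agent might look like the sketch below, assuming the opentelemetry-exporter-otlp package and an agent or collector listening on the default OTLP gRPC port (4317) on the same host; the endpoint value is an assumption and should really come from your automated configuration.

```python
# Minimal sketch: batch spans in the application and hand them to a
# local agent/collector, which forwards them to the central backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # The agent address is an assumption; in practice it should be supplied
        # by your provisioning or configuration-management tooling.
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
```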

It’s usually a good idea to use the same product end to end to minimize interoperability problems. That being said, for the storage and visualization side of things, many products offer a range of choices and good interoperability. Beyond these choices, your job as a DevOps engineer is to ensure the agent is always installed and configured properly. This should ideally be done in an automated manner, either via a provisioning mechanism when using Infrastructure-as-Code tools such as CloudFormation or Terraform, or via configuration management tools such as Ansible, Chef, or Puppet.

Note that if you use a serverless architecture (such as Lambda functions on AWS or Cloud Functions on GCP), you can’t have an agent running on the same host your code is running on, so an agentless solution would be required.

Storage, Analysis & Visualization

Some products, such as Jaeger, provide end-to-end solutions. Other products allow feeding the data into alternative backends, such as an ELK stack.

In an ideal world, trace data should be combined with metrics and logs to provide some context and additional useful information. Such specific cross-analysis and presentation can only be offered by specialist software. Generic solutions such as Elasticsearch will struggle to present a coherent platform for detailed and efficient analysis.

In any case, the DevOps team working on this will have to ensure that the backend is highly available and able to scale with increased workload. An alternative to scaling is to have a feedback mechanism, as Jaeger does, that reduces the trace sampling rate (i.e., sends less trace data) when the workload increases. In addition, details such as data retention and lifecycle will need to be defined as well, in collaboration with the development team.

Conclusion

It should be acknowledged from the start that implementing a robust distributed tracing solution requires quite a lot of work, especially when you try to combine tracing data with logs and metrics. Importantly, keep in mind that implementing distributed tracing that is useful and pertinent to your workload will require tight coordination between your developers and DevOps engineers. That’s when a solution like Epsagon should come to mind, enabling you to focus on your core business.

Ran Ribenzaft, CTO & Co-founder of Epsagon