Optimising Visibility: The Convergence of Traces and Logs

Introduction

Everyone who has ever written any code will have written some permutation of “Hello World”: the feeling of unstoppable power as you successfully write, compile, and execute your own code. At least that is how I felt some 25 years ago, before the weight of pointer arithmetic, boolean algebra, and concurrent systems comes in and makes you feel like a kid lost in a busy car park.

I was lucky enough to learn a few programming and scripting languages over the last 25 years for different purposes, but printing to the console has always been something I have done to debug basic code. I’m not saying this is good practice, but it is convenient for single-threaded scripts that will be used once and discarded until the next “once off”.


If you want a primer on OpenTelemetry, I recommend taking a look at my review of the book Practical OpenTelemetry.


Programmers are usually familiar with the paradigms of event logs, syslog, stdout, and stderr, which allow you to create logs of things happening within a piece of code running in the background. The challenge is that logs in these forms often lack context and represent a single point in time, leaving us looking at them in isolation: cycling through the logs to build a timeline of events, perhaps with pen and paper, and wondering how this seemingly impossible error could have occurred. Below is a simple multi-threaded application that dumps a combination of logging-library output and print statements to the console to illustrate the pain of analysing log files.
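
The snippet below is an illustrative sketch of that style of code rather than the exact example from the repository linked later; the job names and timings are made up.

# Several worker threads writing interleaved output via both the logging module and print().
import logging
import random
import threading
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(threadName)s %(message)s")

def worker(job_id):
    logging.info("starting job %s", job_id)
    time.sleep(random.uniform(0.1, 0.5))           # simulate some work
    print(f"job {job_id} touching the database")   # raw print mixed in with the logger
    logging.info("finished job %s", job_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()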

Even if I send this to syslog and look at it in a tool like SigNoz, Grafana Loki, Elastic, or Splunk, it is still difficult to analyse. Looking at each individual line using filters still doesn’t allow me to understand the context easily (see below).

Tracing

Tracing is the process of following a unit of work from beginning to end. Packet traces, APM traces, and eBPF traces are all examples of following a workload with context, allowing us to understand the sequence of events and the details surrounding it.

OpenTelemetry has integrations for more than ten of the most popular languages and frameworks (C++, Java, PHP, Python, .NET, etc.), allowing us to automatically or manually instrument our code. While the OTEL profilers and SDKs may not be as mature as some APM vendors’ capabilities, they do allow us to instrument more applications, unify our MELT (Metrics, Events, Logs, and Traces) data, and switch observability visualisation UIs easily without needing to re-instrument our applications and services.

OpenTelemetry uses the term Spans to represent units of work inside a trace, and from the screenshot below you can see the duration of the workload and its sub-components, 38.95 seconds in total (a sketch of comparable code follows the list):

  • 30.45 seconds updating the database.
  • 8 seconds sleeping.
  • 40 ms reading data from the database.
  • 10 ms in the parent function (modify words).
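
To make the hierarchy concrete, a minimal sketch of producing nested spans like these with the OpenTelemetry Python SDK might look as follows (the span names and helper functions are illustrative, not the repository code):

import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def read_from_database():       # stand-in for the real query
    return ["hello", "world"]

def write_to_database(words):   # stand-in for the real update
    pass

def modify_words():
    with tracer.start_as_current_span("modify_words"):         # parent span
        with tracer.start_as_current_span("read_database"):    # the ~40 ms read above
            words = read_from_database()
        with tracer.start_as_current_span("sleep"):             # the 8 seconds of sleeping
            time.sleep(8)
        with tracer.start_as_current_span("update_database"):   # the ~30 second update above
            write_to_database(words)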

Zooming into the bottom right-hand corner, we can see not just the duration of the request but also the query and the service name, which would have been multiple log lines in a traditional setting.

Selecting the Events tab, we can also see any “Events” that have occurred, such as logs, exceptions, or anything else that might be of interest to someone looking at a unit of work.
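
As a small illustration (the event name and attributes are invented, not taken from the example application), events are attached to the current span like this:

from opentelemetry import trace

# Attach a timestamped event to whichever span is currently active.
span = trace.get_current_span()
span.add_event("word list refreshed", {"row.count": 42})   # illustrative name and attributes

# Recorded exceptions also show up under the Events tab.
try:
    1 / 0
except ZeroDivisionError as exc:
    span.record_exception(exc)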

What is a Unit of Work?

A Unit of Work is something that has a defined start and end. In the world of web applications this might be an individual GET/POST/PUT/PATCH/DELETE request, with the URI acting as the entry point and the 200 OK response being the exit point. For web applications this is fairly easy to understand; however, in the world of fat clients and long-lived applications that don’t have a UI it is significantly harder. We can see this in the types of applications that have supported automatic instrumentation (Flask, Spring Boot, etc.); some languages have more supported frameworks and applications than others.
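
As a rough illustration of the web case, each request to a small Flask app like the one below becomes a single unit of work when run under automatic instrumentation (the route and app are made up for this example):

# Each request to /jobs is captured as one unit of work:
# entry at the URI, exit at the 200 OK response.
from flask import Flask

app = Flask(__name__)

@app.route("/jobs")
def list_jobs():
    return {"jobs": []}, 200

if __name__ == "__main__":
    app.run()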

What Does This Mean?

What I can see happening is an increasing use of tracing technology (notably OpenTelemetry) to enrich the power of logging by creating Spans with context. I doubt it makes sense to retrofit existing applications, but it does make sense for new and actively developed applications to take on a variant of this approach because of the power and flexibility it offers and the detail it provides.

The combination of automatic and manual instrumentation can provide as much (or more) visibility as proprietary APM solutions, because application authors can add additional context to the existing Spans. We are now seeing leading APM vendors taking OpenTelemetry as a data source to provide even more context to their data.
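
As an illustration (the attribute names here are invented), an application author can attach extra context to whatever span the instrumentation has already opened:

from opentelemetry import trace

# Enrich the current span with application-specific context.
span = trace.get_current_span()
span.set_attribute("app.customer_id", "12345")   # illustrative attribute
span.set_attribute("app.queue_depth", 17)        # illustrative attribute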

I doubt we will ever see logs disappear completely, as they are incredibly easy to implement and useful, though they become less useful as applications grow larger and more complex.


Into the code

https://github.com/lfinchHob/scripts/tree/main/OTEL/python/threading

In the example I’ve replicated a fat client application that I ran into about two years ago, which was experiencing significant delays whenever a particular unit of work was triggered. In this case the unit of work was triggered by a notification to contact centre workers that a new job had appeared, and this notification resulted in every client simultaneously trying to update the database. This was less than ideal: the busier it got, the slower it became, and adding more contact centre workers had diminishing returns.

In the case of automatic instrumentation, no traces were created because the code was not using a supported framework such as Flask. Manual instrumentation was extremely successful, immediately adding value over the logging I had used in the initial example.

Automatic Instrumentation

Automatic instrumentation is quite easy to use: you execute the code via the opentelemetry-instrument command.

opentelemetry-instrument \
    --traces_exporter console,otlp \
    --metrics_exporter console \
    --service_name thread.py \
    --exporter_otlp_endpoint 192.168.20.53:4317 \
    python3 thread.py
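
Note that this assumes the OpenTelemetry distro and an OTLP exporter are already installed, roughly along these lines:

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install   # installs instrumentation packages for detected libraries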

Manual Instrumentation

Manual instrumentation requires additions to the code. I chose the decorator (annotation) approach on functions, but other approaches can be used as well. (The imports shown below assume the OTLP over HTTP exporter, to match the 4318 endpoint.)

# Imports assume the HTTP/protobuf OTLP exporter (port 4318), matching the endpoint below.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Identify the service so traces are grouped under "Thread Manual" in the backend
resource = Resource(attributes={
    SERVICE_NAME: "Thread Manual"
})

# Batch spans and export them to the collector over OTLP/HTTP
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://192.168.20.53:4318/v1/traces"))
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)
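
Building on the snippet above, the decorator form looks roughly like this (the span name is borrowed from the earlier trace; the function body is a stand-in):

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("modify_words")
def modify_words():
    # Everything executed here is recorded under the "modify_words" span,
    # and any spans started inside it become its children.
    ...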

If you made it this far, thanks for reading. Check out the Book Recommendations and subscribe to the Observability Round Up Newsletter.



