Unveiling the Power of OpenTelemetry: An Exploration with SigNoz for Superior Observability

In the second article of this series exploring OpenTelemetry, we take a look at SigNoz as an OpenTelemetry Observability solution. Having worked for an Observability vendor with an APM offering, I found the interface intuitive and easy to work through. In this article I’m going to explore the Traces capability within SigNoz, and I plan to follow up by exploring some of the Logging and Metrics capabilities that are included with SigNoz.


Disclaimer: I’m not affiliated with SigNoz in any way and do not make any commissions for this article.


If you are just getting started with OpenTelemetry, I recommend picking up a copy of Practical OpenTelemetry as your starting point, as it explains the philosophy and practical architectures with examples that make it easy to start working with SigNoz and other OpenTelemetry visualisation platforms.


Installation

I used the standalone Docker Compose setup from the SigNoz Git repository, which had me up and running as quickly as my fixed wireless LTE service would allow (maybe 10 minutes). Docker Compose deployed all the required containers for me without me needing to think too hard about it. The deployment includes 10 containers, including the frontend application, ZooKeeper, ClickHouse, and supporting services for queries and logging. I have this running on 2 cores with 4GB of RAM, but I would ramp this up if I were running it in production.
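If you want to follow the same path, the steps below are a minimal sketch of the self-hosted Docker Compose route. The repository layout and compose file locations may have changed since I wrote this, so check the SigNoz install docs for the current commands.

```bash
# Clone the SigNoz repository and start the self-hosted stack.
# Note: the deploy directory layout reflects the docs at the time of
# writing and may have moved in newer releases.
git clone https://github.com/SigNoz/signoz.git
cd signoz/deploy/docker

# Bring up the frontend, ClickHouse, ZooKeeper, and supporting services.
docker compose up -d

# Once the containers are healthy, the UI is served on port 3301:
# http://localhost:3301
```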

The fastest path to deployment would be signing up for the cloud offering, which includes a 30-day free trial. Alternatively, there are Docker Swarm installation guides, and Helm installation guides for Kubernetes deployments.
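For the Kubernetes route, the Helm install looks roughly like the following. The chart repository URL, chart name, and namespace are taken from the SigNoz docs and may differ for your release, so treat this as a sketch rather than a canonical install.

```bash
# Add the SigNoz Helm repository and install the chart into its own
# namespace. Chart names and defaults may change between releases.
helm repo add signoz https://charts.signoz.io
helm repo update
helm install signoz signoz/signoz --namespace platform --create-namespace
```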

Tracing

From a tracing perspective, OpenTelemetry uses a concept called Spans, where each span represents a unit of work being performed. If you come from an APM (Application Performance Management) background, this would be an individual trace with an entry point and exit point. This unit of work could include sub-pieces of work such as a web request, SQL query, API call, or even the duration of a method call in the stack. In the example below I have a trace from a web request made to an application server on the /customer endpoint.
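To make the parent/child relationship concrete, here is a minimal sketch of creating a span with a nested child span using the OpenTelemetry Java API. The tracer name and span names are illustrative, not taken from the sample application.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CustomerHandler {
    // "customer-service" is an illustrative instrumentation scope name.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("customer-service");

    public void handleCustomerRequest() {
        // Parent span: the unit of work for the /customer web request.
        Span parent = tracer.spanBuilder("GET /customer").startSpan();
        try (Scope scope = parent.makeCurrent()) {
            queryCustomers(); // child work happens inside the parent's scope
        } finally {
            parent.end();
        }
    }

    private void queryCustomers() {
        // Child span: the SQL query nested under the web request.
        Span child = tracer.spanBuilder("SELECT customers").startSpan();
        try (Scope scope = child.makeCurrent()) {
            // ... run the query ...
        } finally {
            child.end();
        }
    }
}
```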

In this example I’ve been able to isolate that the first 300ms contained a SQL request that took up 258ms. I can even see the query text and hostname details. This waterfall diagram also allows me to see where my code is spending the most time as it executes.

This is extremely valuable as I can look at data retrospectively and from a user-centred context. Too many times in my career I’ve seen tickets come through to an engineer who closes the ticket with something to the effect of “Works for me” or “Closed: unable to reproduce”. Having useful information in the form of a trace means that the engineer has more to work with, and empirical evidence that the subjective “it’s slow” can be measured, quantified, and acted upon.

With this trace information we are also able to create service maps that connect the different components sending trace data. Out of the box, SigNoz is self-instrumented, which gives you the ability to see the platform in action without needing to instrument anything yourself. Being a technologist who likes to get his hands dirty, I downloaded the Distributed Tracing Java Sample, which included a Docker container and some Spring Java services with a Node.js front end. I did have to make some small modifications to the code to get it to work in my environment (I’ll create a pull request when I get some time).
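For reference, Spring services like the ones in the sample are typically instrumented by attaching the OpenTelemetry Java agent at startup. The sketch below shows the general shape; the service name and endpoint are illustrative defaults for a local SigNoz install.

```bash
# Download the OpenTelemetry Java agent, which auto-instruments
# Spring, JDBC, HTTP clients, and more.
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Attach the agent and point it at the SigNoz OTLP endpoint.
# "order-service" and localhost:4317 are placeholder local values.
java -javaagent:./opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar app.jar
```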

The sample application allowed me to walk through the process of ordering a Mac mini in a faux online store.

The subsequent service map included my whole environment which I could filter using the filter bar at the top of the page.

Exceptions

Without fail, I see people looking at Exceptions as the most important artefact when troubleshooting. It is true they provide detailed insights (see the screenshot below), but we also need to think about what happens before and after an exception to understand the context. SigNoz has a great Exceptions display that enables us to see exactly that: the exception details and the context.
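When instrumenting manually rather than relying on auto-instrumentation, exceptions end up in that display because they are recorded on the active span. Here is a minimal sketch using the OpenTelemetry Java API; the tracer and span names are illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PaymentHandler {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("payment-service"); // illustrative name

    public void chargeCustomer() {
        Span span = tracer.spanBuilder("charge-customer").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // ... business logic that might throw ...
        } catch (RuntimeException e) {
            // Record the stack trace as a span event and flag the span as
            // errored so it surfaces alongside its surrounding context.
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}
```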

What Else Can I Trace?

Trace is a bit of a nebulous term, but it can include virtually anything: databases, packets, method invocations, even logging. Logging is a topic I’ll be taking a closer look at in another article, as I see the convergence of Events, Logging, and Tracing as an important step in the direction of actionable observability. Traces include more information than a printf statement because they carry context and attributes. People read logs and events sequentially, which can be confusing and makes it difficult to isolate problems; because a trace carries context and is visualised as a span with sub-spans, it is a lot easier to work with.
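As an illustration of that extra context, here is a sketch of attaching attributes and a timestamped event to a span with the OpenTelemetry Java API, in place of what might otherwise be a printf. The attribute keys and values are made up for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

public class CartExample {
    public void addToCart(String itemId, long quantity) {
        Span span = GlobalOpenTelemetry.getTracer("store-front") // illustrative
            .spanBuilder("add-to-cart")
            .startSpan();
        try {
            // Attributes are structured key/value pairs that can be queried
            // and filtered, unlike a flat printf string.
            span.setAttribute("cart.item.id", itemId);
            span.setAttribute("cart.item.quantity", quantity);

            // Events are timestamped annotations within the span's lifetime.
            span.addEvent("inventory-checked", Attributes.of(
                AttributeKey.booleanKey("inventory.in_stock"), true));
        } finally {
            span.end();
        }
    }
}
```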

I integrated my Node-RED deployment using a plugin that enables me to create Spans in SigNoz. This is great for understanding where time is being spent in Node-RED workflows.

If you made it this far, thanks for reading. I love writing about technology and if you liked this article, consider buying a book from my book recommendations.

