APM - Observe Ability (https://leighfinch.net)

Practical OpenTelemetry Review
https://leighfinch.net/2023/10/24/practical-opentelemetry-review/
Tue, 24 Oct 2023

OpenTelemetry is something I’ve been watching for a while now and reviewing Practical OpenTelemetry was the perfect excuse to dive deeper. In my lab I’ve been running a project called SigNoz for which I’m writing up a companion article to this one to show some practical examples.

When I first looked into OpenTelemetry about 4 years ago, I was primarily focused on the Tracing aspects, such as whether OpenTelemetry could replace vendor-specific code profilers. I concluded that vendor-specific profilers won't necessarily be replaced by OpenTelemetry, but can be augmented by it. Most existing APM/Observability vendors are looking at integrating with OpenTelemetry to reach languages and systems they previously hadn't invested in, such as those focused on Java and .NET now being able to include data from PHP, Python, and Node.js (to name a few).

OpenTelemetry provides a standard and software to instrument many different things using Traces, Logs, and Metrics (converging Events from the MELT framework into the other telemetry types). Using a standard for instrumenting applications and systems means that you can switch backends relatively easily, so less time is spent on the mechanics of instrumentation and more on the visualisation and analysis of the resulting data.

Practical OpenTelemetry is an easy read designed for developers, DevOps, SRE, and Observability practitioners to get familiar with where OpenTelemetry fits in their ecosystem. The practical aspect includes reference architectures with Prometheus, Elastic, and Kafka, as well as examples using Java which are simple enough that anyone with programming experience should be able to port them to their language of choice.

To achieve that, it starts with an analysis of why OpenTelemetry is needed, focusing on strengths like openness, standards, and Observability outcomes (MTTRes, MTTI, MTTK, MTTF). This is my favourite style of introduction because it explains the need before throwing solutions at the problem.

The second part of the book breaks down what OpenTelemetry is (and isn't), focusing on the telemetry types and how they are ingested (including Traces, Metrics, and Logs), the OTLP protocol, the role of the collector, and schemas and conventions. This is where the concept of Spans becomes important. A Span represents a unit of work, and that unit of work may be composed of multiple sub-Spans, with each Span having a context and attributes. A Span could be the entry point of a Trace, such as a webpage request, or something manually defined and coded within the application itself. See the below image of SigNoz representing a series of Spans.
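To make the idea of nested Spans concrete, here is a minimal sketch using the OpenTelemetry JavaScript/TypeScript API. The service and span names are hypothetical, and it assumes an SDK and exporter (for example one pointed at SigNoz) have already been registered; without that, the calls are harmless no-ops.

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service'); // hypothetical service name

// Parent Span: the unit of work for handling one request.
tracer.startActiveSpan('handle-checkout', (parent) => {
  // Child Span: a sub unit of work with its own attributes.
  tracer.startActiveSpan('query-inventory', (child) => {
    child.setAttribute('db.system', 'postgresql'); // attribute on the child Span
    child.end();
  });
  parent.setAttribute('http.route', '/checkout');
  parent.end();
});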

The book then goes on to describe Tracing instrumentation styles, such as auto vs manual instrumentation, as well as local vs distributed Tracing. Distributed Tracing is the ability to correlate Spans across systems. Distributed tracing is incredibly important as it allows you to create real-time service maps and see how an individual transaction performed over multiple systems.
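As a sketch of how a Span's context crosses process boundaries (assuming an SDK with the W3C trace-context propagator is registered; the downstream URL is hypothetical), the caller injects the active context into outgoing headers and the receiver extracts it, so both sides land in the same distributed Trace:

import { context, propagation } from '@opentelemetry/api';

// Caller side: copy the active Span context into outgoing HTTP headers.
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
// headers now carries a traceparent header, e.g.
// fetch('https://inventory.internal/stock', { headers });

// Receiver side: rebuild the context from incoming headers so new Spans
// become children of the caller's Span.
// const parentCtx = propagation.extract(context.active(), incomingHeaders);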

The following chapters cover Metrics and Logs and how they are implemented in OpenTelemetry. I liked how the author was able to discuss the convergence of logs and tracing and why it won't happen overnight.

Blanco finishes the book with adoption and institutionalisation, covering the challenges of brownfield vs greenfield environments. This is especially valuable, as relevance and compatibility with other systems that have overlapping capabilities is a constant battle for enterprises.

Practical OpenTelemetry on Amazon.

The CrUX Of Web Performance
https://leighfinch.net/2023/10/17/the-crux-of-web-performance/
Tue, 17 Oct 2023

The world of web performance is tougher than most internal applications because you are dealing with third parties that you don't control: the Internet, browsers (with a thousand tabs open), endpoints (with agents running), and poor connectivity in the form of cellular/wireless/consumer-grade connections. This means that we need to be able to understand and manage what is within our control.

This article is tangential to SEO (Search Engine Optimisation); however, it is focused on User Experience rather than SEO aspects.

I like to think about performance in terms of the point of consumption and the point of distribution when looking at a network like a WAN or the Internet serving applications/services/web pages to consumers over some latency. What we usually control is the point of distribution of an application (which could be a load balancer, Application Delivery Controller, or reverse proxy such as Apache, NGINX, or Varnish).

The point of consumption (and everything in between) in an Internet delivered application is something we may have little control over. If we are lucky it might be a corporate managed device where tools like End User Experience Management (EUEM) agents might be installed.

If we go back to the point of distribution for the application, we need to ensure that it is operating as optimally as possible in terms of server response time and transport performance (e.g. TCP and QUIC). But we also need to consider the rendering at the point of consumption which is where Core Web Vitals and CrUX come into play.

Core Web Vitals is an initiative that looks at the performance of a web page from the perspective of the user, to improve User Experience. CWV is made up of the following meaningful metrics:

Largest Contentful Paint

Largest contentful paint looks at the time it takes to draw the largest piece of content on the page, which could be an image, video, or other visible object in the browser. LCP is favoured over earlier metrics such as page load time.
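As a rough sketch (the logging is purely illustrative), LCP candidates can be observed in the browser with the standard PerformanceObserver API:

// Log the latest largest-contentful-paint candidate as the page loads.
const lcpObserver = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const latest = entries[entries.length - 1];
  console.log('LCP candidate (ms):', latest.startTime);
});
lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });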

Cumulative Layout Shift

Cumulative layout shift looks at how often the layout of a page shifts during its lifetime. The more things move around, the poorer the user experience and the worse the score. Think about a page that pops up an advertisement covering the content as you scroll through an article, or buttons and links that move around as the page renders.
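A similar sketch for layout shifts; note this keeps a simplified running total, whereas the official metric groups shifts into session windows:

// Accumulate layout-shift values, ignoring shifts caused by recent user input.
let clsScore = 0;
const clsObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const shift = entry as any; // layout-shift entries expose value/hadRecentInput
    if (!shift.hadRecentInput) {
      clsScore += shift.value; // a higher accumulated value means a worse experience
    }
  }
});
clsObserver.observe({ type: 'layout-shift', buffered: true });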

First Input Delay

First input delay is how long it takes from the user's first interaction with the page until the browser's event handlers are able to start processing it. In practice, this is how long it takes before typing into a form field or clicking a button actually registers.
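A sketch of measuring that delay directly, using the standard first-input performance entry:

// FID is the gap between the user's first interaction and the browser
// being able to start processing the corresponding event handler.
const fidObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const firstInput = entry as any; // PerformanceEventTiming
    console.log('FID (ms):', firstInput.processingStart - firstInput.startTime);
  }
});
fidObserver.observe({ type: 'first-input', buffered: true });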

Interaction to Next Paint

Interaction to next paint is a pending (at the time of writing, October 2023) metric for Core Web Vitals. INP is how long it takes from an interaction such as a click to rendering the next piece of content, such as a drop-down box, login form, or even the next page. This can be considered a measure of how responsive a page is to interaction. Everyone I know has had a poor experience with a web application that is unresponsive to clicks, resulting in them clicking again and again, adding to the queue of interactions to be processed.
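INP is awkward to compute by hand because it samples the slowest interactions over the page's lifetime, so this sketch leans on Google's web-vitals npm package (an assumption on my part, not something prescribed here):

import { onINP } from 'web-vitals';

// onINP reports the slowest meaningful interaction observed so far, in milliseconds.
onINP((metric) => console.log('INP (ms):', metric.value));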

CrUX (Chrome User eXperience) is the service that reports these metrics from Chrome back to Google for the purposes of benchmarking and baselining websites. Arguably it could be used to influence a website's ranking in search results, but this may just be because Google likes to rank websites with better user experience higher. There are caveats in CrUX reporting:

  1. The website has to be popular enough.
  2. CrUX data is only collected by Chrome.
  3. The user has to have opted in to metric collection.
  4. Other geographical and privacy reasons.

So what do you do if there isn't enough data coming in, or you want more detailed analysis? JavaScript snippets like those provided by Blue Triangle and APM vendors can provide detailed user experience data from any browser that interacts with your page. OpenTelemetry can also be used to collect metrics using its JavaScript package.
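As a sketch of the kind of snippet those vendors provide (again assuming the web-vitals package, and a hypothetical /vitals collection endpoint of your own):

import { onLCP, onCLS, onINP } from 'web-vitals';

// Beacon each metric to our own backend so we are not dependent on CrUX sampling.
function report(metric: { name: string; value: number }) {
  const body = JSON.stringify({ name: metric.name, value: metric.value, page: location.pathname });
  navigator.sendBeacon('/vitals', body); // hypothetical endpoint
}

onLCP(report);
onCLS(report);
onINP(report);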

Some Interesting Tools

Here are some tools that can help you with your End User Experience Management:

  1. Google Search Console (https://developers.google.com/search/) for monitoring web page performance over time.
  2. PageSpeed Insights (https://pagespeed.web.dev/analysis/) tests website performance, SEO, and accessibility, with recommendations.
  3. Google Lighthouse, embedded in the Chrome developer tools, provides performance, SEO, and CWV details (see below).

Thanks for reading this quick introduction to web user performance. If you made it this far, please consider buying a book from the Book Recommendations page.

Performance Diagnostics Part 5 - Optimising Worker Threads
https://leighfinch.net/2023/09/29/performance-diagnostics-part-5-optimising-worker-threads/
Fri, 29 Sep 2023

Background

A few (ok, many) years ago I was working with a customer who was launching a new application and expecting a high load at launch, which for whatever reason was at 3pm on a Friday (1). As 3pm hit and the load balancers started directing traffic to the now-production system, there was panic as synthetic tests started to time out. This was not good: users were unable to interact with the web application, or if they could, it was dead slow (minutes to load a page).

Using observability tooling I was quickly able to see that they had run out of worker threads and that requests were now being queued beyond the timeout of the synthetic agent. The fix was simple: increase the number of worker threads so that requests could be handled in parallel rather than waiting for a thread to become free.

The increase from 25 to 100 threads immediately brought the responsiveness of the application back to within the SLA that the application team had promised to the business.

So why did I recommend increasing the number of threads from 25 to 100?

If you’ve ever managed a webserver and seen the max connections or worker threads settings, you might be tempted to think that bigger is better. But there are a number of factors that need to be considered before blindly increasing the number of threads.

When things start to become “slow”, as an Observability and Digital Performance expert I need to consider the type of workload, the utilisation of resources (such as CPU, memory, and storage IO), and any errors/events that might be occurring. I will then leverage APM traces to understand where time is being spent, whether in the code or in the application server itself.

In this case all threads were being consumed, but not all CPU cores were. This led me to start looking at traces, and what I saw was that the actual application response time was quick: when a request actually reached application code, it executed very quickly. The time was being spent in the application server (Tomcat in this case), which was queueing requests but unable to have the thread pool execute them quickly enough.

Queue of statues illustrating a queue for worker threads

So the code executes quickly but is held waiting in a queue. If everything executes quickly yet requests are timing out, we need a way to increase the number of requests being executed simultaneously, with the side effect that each request takes slightly longer to execute. If we have an equal number of workers to CPU cores, a single thread can have effectively uncontended access to a CPU core; however, if we increase the number of threads beyond the number of cores, we have to rely on the operating system scheduler to share the cores between threads.

Additionally, as we increase the number of worker threads, we also increase the likelihood of issues relating to concurrency (locks, race conditions), and each thread will take longer to execute its workload.
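As a rough back-of-the-envelope model (a sketch only, using the uncontended service time measured in Test 1 below), per-request latency for a purely CPU-bound workload grows once concurrency exceeds the core count:

// CPU-bound requests time-share the cores, so each request slows down
// once concurrency exceeds the number of cores.
const serviceTimeMs = 620; // uncontended single request (Test 1 below)
const cores = 2;

for (const concurrency of [1, 2, 4, 8]) {
  const latencyMs = serviceTimeMs * Math.max(1, concurrency / cores);
  console.log(`${concurrency} concurrent -> ~${Math.round(latencyMs)} ms per request`);
}
// Predicts roughly 620, 620, 1240 and 2480 ms, close to the ab results below.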


Using NGINX as an example, its documentation recommends setting the number of workers to the number of cores, or auto if in doubt (2). I'm going to use a benchmarking tool called Apache Benchmark (ab) against a webserver that has two cores and two workers to calculate the first 1000 prime numbers.
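For reference, that recommendation is a single directive in nginx.conf (a sketch of the relevant line only, not a complete configuration):

# Spawn one worker process per CPU core detected at startup.
worker_processes auto;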

Test 1 – 1 Concurrent Request

In this test we have two worker threads and one concurrent request. We see that the mean response time is 620ms. Not bad, with the total time to process the ten requests at 6.197 seconds.

root@client:~# ab -n 10 -c 1 http://192.168.20.17/index.php
Time taken for tests:   6.197 seconds
Requests per second:    1.61 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   606  620  37.7    608     727
Waiting:      605  620  37.7    608     727
Total:        606  620  37.7    608     727

Test 2 – 2 Concurrent Requests

In this test we have two worker threads and two concurrent requests. We see that the mean response time is 624ms. Pretty comparable to the previous test; however, the total test time was reduced to 3.7 seconds.

root@client:~# ab -n 10 -c 2 http://192.168.20.17/index.php
Time taken for tests:   3.748 seconds
Requests per second:    2.67 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   607  624  12.4    624     652
Waiting:      607  624  12.3    624     652
Total:        607  624  12.5    625     652

Test 3 – 4 Concurrent Requests

In this test we still have two worker threads but four concurrent requests. We see that the mean response time increased to 1162ms. This roughly doubles the request duration; however, the total time taken to serve the ten requests was almost the same as test two, at 3.8 seconds.

Doubling the number of concurrent requests to 8 shows that the response time increase is roughly linear.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.821 seconds
Requests per second:    2.62 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   691 1162 319.7   1205    1748
Waiting:      691 1160 317.0   1205    1737
Total:        691 1162 319.6   1205    1748

Test 4 – 4 Concurrent Requests and 4 Workers

This test oversubscribes the number of worker threads to CPU cores by a factor of two, relying on the operating system scheduler to load balance the requests.

The performance was comparable to test three (slightly worse, by ~100ms), relying on the OS scheduler to load balance across the two cores.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.978 seconds
Requests per second:    2.51 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   621 1205 304.1   1280    1483
Waiting:      621 1205 304.2   1280    1483
Total:        621 1205 304.1   1280    1483

Conclusion

Overall the best performance was two workers with two concurrent requests, lining up with the general advice of an equal number of workers to cores; however, this workload (prime number generation) fully utilises a CPU core while it runs. Other workloads will require less CPU time whilst waiting on dependencies (e.g. DB calls), and for those, over-subscribing worker threads will improve results. So, like everything in IT, the correct value is “it depends” and “bigger is not necessarily better“.

If you made it this far, thanks for reading. Check out the book section for interesting books on Observability.

  1. This is the best way to ruin a weekend for your hard-working staff. Read-only Fridays make for happy engineers.
  2. https://nginx.org/en/docs/ngx_core_module.html#worker_processes
