Performance Diagnostics Part 5: Optimising Worker Threads

Background

A few (OK, many) years ago I was working with a customer who was launching a new application and expecting high load at launch, which for whatever reason was scheduled for 3pm on a Friday (1). As 3pm hit and the load balancers started directing traffic to the now-production system, panic set in as synthetic tests started to time out. This was not good: users were unable to interact with the web application, or if they could, it was dead slow (minutes to load a page).

Using observability tooling, I was quickly able to see that they had run out of worker threads and that requests were being queued beyond the timeout of the synthetic agent. The fix was simple: increase the number of worker threads so that requests could be handled in parallel rather than waiting for a thread to become free.

The increase from 25 to 100 threads immediately brought the responsiveness of the application back to within the SLA that the application team had promised the business.

So why did I recommend increasing the number of threads from 25 to 100?

If you’ve ever managed a web server and seen the max connections or worker threads settings, you might be tempted to think that bigger is better. But there are a number of factors to consider before blindly increasing the number of threads.

When things start to become “slow”, as an observability and digital performance expert I need to consider the type of workload, the utilisation of resources (such as CPU, memory, and storage IO), and any errors or events that might be occurring. I then leverage APM traces to understand where time is being spent in the code, or even in the application server.

In this case all threads were being consumed, but not all CPU cores were. This led me to look at traces, and what I saw was that the actual application response time was quick: once a request reached application code, it executed very quickly. The time was being spent in the application server (Tomcat in this case), which was queueing requests but unable to have the thread pool execute them quickly.

[Image: a queue of statues, illustrating requests queueing for worker threads]

So the code executes quickly, but requests are held waiting in a queue. If everything executes quickly yet requests are timing out, we need a way to increase the number of requests being executed simultaneously, with the side effect that each individual request takes slightly longer to execute. If we have an equal number of workers to CPU cores, a single thread has effectively uncontended access to a CPU core; if we increase the number of threads beyond the number of cores, we have to rely on the operating system scheduler to share the cores between the threads.

Additionally, as we increase the number of worker threads, we also increase the likelihood of concurrency issues (locks, race conditions), as each thread will take longer to execute its workload.
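To make the trade-off concrete, here is a back-of-envelope model (a rough sketch, not a real scheduler simulation) for purely CPU-bound requests: once there are more runnable threads than cores, each request's response time scales with the oversubscription ratio.

# Rough model: CPU-bound requests time-sliced across a fixed number of cores.
def expected_response_time(service_time_s, concurrent_requests, cores):
    # With more runnable threads than cores, the scheduler time-slices them,
    # so each request sees roughly service_time * (runnable threads per core).
    oversubscription = max(1.0, concurrent_requests / cores)
    return service_time_s * oversubscription

print(expected_response_time(0.62, 2, 2))  # ~0.62s: two requests, two cores
print(expected_response_time(0.62, 4, 2))  # ~1.24s: four requests contend for two cores

The benchmark results below line up reasonably well with this model.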


Using NGINX as an example, the documentation recommends setting the number of workers to the number of cores, or auto if in doubt (2). I'm going to use a benchmarking tool called Apache Benchmark (ab) against a web server that has two cores and two workers, with each request calculating the first 1000 prime numbers.
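For reference, the relevant directive lives in nginx.conf. A minimal fragment (the explicit count of 2 is simply what matches the two-core test box):

worker_processes 2;    # or "worker_processes auto;" to match the core count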

Test 1 – 1 Concurrent Request

In this test we have two worker threads and one concurrent request. We see that the mean response time is 620ms. Not bad for ten requests, with a total time of 6.197 seconds to process all ten.

root@client:~# ab -n 10 -c 1 http://192.168.20.17/index.php
Time taken for tests:   6.197 seconds
Requests per second:    1.61 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   606  620  37.7    608     727
Waiting:      605  620  37.7    608     727
Total:        606  620  37.7    608     727

Test 2 – 2 Concurrent Requests

In this test we have two worker threads and two concurrent requests. We see that the mean response time is 624ms. Pretty comparable to the previous test; however, the total test time was reduced to 3.7 seconds.

root@client:~# ab -n 10 -c 2 http://192.168.20.17/index.php
Time taken for tests:   3.748 seconds
Requests per second:    2.67 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   607  624  12.4    624     652
Waiting:      607  624  12.3    624     652
Total:        607  624  12.5    625     652

Test 3 – 4 Concurrent Requests

In this test we still have two worker threads but four concurrent requests. We see that the mean response time increased to 1162ms. This roughly doubles the request duration; however, the total time taken to serve the ten requests was almost the same as test two, at 3.8 seconds.

Doubling the number of concurrent requests to 8 shows that the response time increase is roughly linear.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.821 seconds
Requests per second:    2.62 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   691 1162 319.7   1205    1748
Waiting:      691 1160 317.0   1205    1737
Total:        691 1162 319.6   1205    1748

Test 4 – 4 Concurrent Requests and 4 Workers

This test oversubscribes the number of worker threads to CPU cores by a factor of two, relying on the operating system scheduler to load-balance the requests.

The performance was comparable to test three, slightly worse on the mean response time (1205ms versus 1162ms), with the OS scheduler load-balancing across the two cores.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.978 seconds
Requests per second:    2.51 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   621 1205 304.1   1280    1483
Waiting:      621 1205 304.2   1280    1483
Total:        621 1205 304.1   1280    1483

Conclusion

Overall, the best performance came from two workers with two concurrent requests, lining up with the general advice of an equal number of workers to cores. However, this workload (prime number generation) fully utilises a CPU core while it runs. Other workloads will require less CPU time while waiting on dependencies (e.g. DB calls), and for those, over-subscribing worker threads will improve results. So, like everything in IT, the correct value is “it depends”, and bigger is not necessarily better.

If you made it this far, thanks for reading. Check out the book section for interesting books on Observability.

  1. This is the best way to ruin a weekend for your hard-working staff. Read-only Fridays make for happy engineers.
  2. https://nginx.org/en/docs/ngx_core_module.html#worker_processes

Initial Congestion Windows in Linux

As part of my research I've spent a lot of time looking at the performance of TCP variants and options. One of the most common questions I get asked is about the congestion window and how it reacts to changes in the environment.

The congestion window (CWND) is used to control how many segments (the layer 4 protocol data unit) can be outstanding (unacknowledged) at any point in time. For most TCP connections we want to use as much bandwidth as we can without overwhelming the network. In most situations the CWND is constantly changing as we move through the phases of the connection lifecycle.
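A rough rule of thumb shows why this matters: a sender can transmit at most one CWND of data per round trip, so throughput is bounded by approximately CWND / RTT. For example, an initial CWND of 10 segments of 1448 bytes (14480 bytes) on a path with a 20ms round trip caps the first round trip at about 14480 / 0.02 ≈ 724 kB/s, or roughly 5.8 Mbit/s, until the window grows.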

The first phase of a TCP connection is the three-way handshake, where a connection is established between two endpoints (client and server). When the connection is established, both endpoints individually set buffers for sending (CWND) and receiving (RWND) data. These buffers are usually set conservatively for efficiency and security purposes.

The next phase is slow start, where the CWND is set to an integer between 1 and 10 (the maximum currently allowed per RFC 6928). Slow start (despite its name) grows the window exponentially, using a method similar to ABC (Appropriate Byte Counting), where the CWND is increased by the number of bytes acknowledged in each ACK segment. As typical bandwidth has increased, it makes sense to raise that initial CWND to 10, and fortunately newer Linux kernels do just this.
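If you want to experiment with the initial window yourself, iproute2 lets you override it on a per-route basis. A quick sketch (the gateway and interface are placeholders for your own environment):

root@server:~# ip route change default via 192.168.20.1 dev eth0 initcwnd 10 initrwnd 10

Note that ip route show only prints initcwnd once it has been explicitly overridden; otherwise the kernel default applies.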

As the connection matures and exits slow start (the trigger, depending on the flavour of TCP, can be a combination of loss and latency), TCP moves into congestion avoidance, where the CWND is increased only approximately once every round trip. Loss, latency, and ECN (Explicit Congestion Notification) signals may result in a return to slow start.

Unlike the receive window (which we can see in packets), we can't directly observe the congestion window in a packet capture. We can, however, infer it from the number of outstanding bytes at any point in time.

The above IO Graph in Wireshark looks at a ten-second trace and shows the number of bytes in flight, which roughly equates to the congestion window over the lifecycle of the connection.

I can also zoom into the beginning of the trace and count the segments at the very start of the connection: a maximum of 14480 bytes in flight, or 10 segments of 1448 bytes each (plus 12 bytes of TCP options, adding up to the MSS of 1460).

There are two easy ways to confirm the initial congestion window on a Linux host:

1. Look at the CWND of listening sockets using ‘ss -nli | grep cwnd | uniq’.

2. Write a simple application that inspects the socket it creates.

The application below takes two parameters (destination host and port), creates a TCP connection, transfers no data, and prints out the socket parameters. The tcpi_snd_cwnd field in this example shows the expected value: 10.

import socket
import struct
import argparse

class tcp_client_tcp_info(object):
    """Simple TCP client that prints the kernel's TCP_INFO for a new connection."""

    def __init__(self, host, port):
        try:
            client_socket = socket.socket()
            client_socket.connect((host, port))
            # struct tcp_info starts with 8 single bytes (the two wscale
            # values are really 4-bit fields packed into one byte; treating
            # them as two bytes is an approximation) followed by 24 unsigned
            # 32-bit integers, 104 bytes in total.
            fmt = "B" * 8 + "I" * 24
            fmt_keys = ['tcpi_state', 'tcpi_ca_state', 'tcpi_retransmits',
                        'tcpi_probes', 'tcpi_backoff', 'tcpi_options',
                        'tcpi_snd_wscale', 'tcpi_rcv_wscale', 'tcpi_rto',
                        'tcpi_ato', 'tcpi_snd_mss', 'tcpi_rcv_mss',
                        'tcpi_unacked', 'tcpi_sacked', 'tcpi_lost',
                        'tcpi_retrans', 'tcpi_fackets', 'tcpi_last_data_sent',
                        'tcpi_last_ack_sent', 'tcpi_last_data_recv',
                        'tcpi_last_ack_recv', 'tcpi_pmtu', 'tcpi_rcv_ssthresh',
                        'tcpi_rtt', 'tcpi_rttvar', 'tcpi_snd_ssthresh',
                        'tcpi_snd_cwnd', 'tcpi_advmss', 'tcpi_reordering',
                        'tcpi_rcv_rtt', 'tcpi_rcv_space', 'tcpi_total_retrans']
            # Ask the kernel for TCP_INFO on the new socket and map the
            # unpacked values to their field names.
            tcp_info = dict(zip(fmt_keys, struct.unpack(
                fmt, client_socket.getsockopt(socket.IPPROTO_TCP,
                                              socket.TCP_INFO, 104))))
            for k, v in tcp_info.items():
                print(k + " " + str(v))
            client_socket.close()
        except socket.error as e:
            print(e)
            exit(1)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-H", "--hostname", type=str,
                        help="Host to connect to", required=True)
    parser.add_argument("-p", "--port", type=int,
                        help="Port to connect to", required=True)
    args = parser.parse_args()
    tcp_client_tcp_info(args.hostname, args.port)
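
Assuming the script above is saved as tcp_info_client.py (the filename is my own), running it against the web server from earlier looks like this:

root@client:~# python3 tcp_info_client.py -H 192.168.20.17 -p 80

Among the printed fields, tcpi_snd_cwnd should read 10 on a recent kernel.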

Lastly, we can check where this value is set in the kernel (the TCP_INIT_CWND define in include/net/tcp.h) and the commit that raised it to 10.

If you got this far, thanks for reading. Pick up a copy of W. Richard Stevens' TCP/IP Illustrated, Volume 1 to learn more about TCP and the protocols that built the internet.
