Linux - Observe Ability (https://leighfinch.net)

Learning eBPF Review
https://leighfinch.net/2023/09/27/learning-ebpf-review/
Wed, 27 Sep 2023 07:09:30 +0000


What makes Learning eBPF different from BPF Performance Tools (which I wrote about recently) is that it moves beyond the observability and performance lens towards security and modifying behaviour inside the Linux kernel. The author, Liz Rice, is the Chief Open Source Officer at Isovalent and recently presented at the eBPF Virtual Summit in September of 2023. She has a lot of material available online, and I’ll provide some resources towards the bottom of the article.

This book introduces eBPF in a consumable way, discussing its history and how it became a vehicle to inspect and create new kernel capabilities without needing to either create a kernel module (tied to a specific build or API) or have the code agreed upon by the community and adopted by distributions. Additionally, we learn how eBPF code is checked for safety prior to running, reducing the risk of a kernel crash in production.

As a reader I enjoyed the use of C and Python to illustrate practical examples of events being triggered (such as a packet arriving on an interface) and data being read into a program in user space. 

The hardest thing to get your head around is the different components that pull eBPF together. The author makes this easy with examples showing which code runs in user space, and which code is first compiled to bytecode and then JIT-compiled or assembled into machine code for execution in the kernel.
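As a hedged sketch of that split (assuming the BCC Python bindings are installed; the kprobe target `tcp_v4_connect` is purely illustrative), the kernel-side program is embedded as C source and compiled to bytecode at load time, while the Python side stays entirely in user space:

```python
# Kernel-side eBPF program: BCC compiles this C source to eBPF bytecode,
# the kernel verifies it, then JIT-compiles it to machine code.
BPF_SOURCE = r"""
int trace_connect(void *ctx) {
    bpf_trace_printk("tcp_v4_connect called\n");
    return 0;
}
"""

def load_and_trace():
    """User-space side: load the program, attach it, and stream output.

    Requires root privileges and the BCC toolchain; this is a sketch of
    the pattern, not a definitive implementation.
    """
    from bcc import BPF  # BCC Python bindings
    b = BPF(text=BPF_SOURCE)
    b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")
    b.trace_print()  # reads each bpf_trace_printk() line from the trace pipe
```

The boundary is explicit: the string is kernel code, everything else runs as an ordinary Python process reading results the kernel feeds up.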

The eBPF for networking chapters describe newer features such as XDP (eXpress Data Path) and show how we can create routers, firewalls, and load balancers (especially in a Kubernetes context) that bypass typical kernel behaviour. Examples are discussed, including how Cloudflare and Facebook have used this capability in production.

Examples and working code are provided, and you can download them from the resources below. If you’re interested in the next generation of observability and kernel modification, please get yourself a copy of this book.


Resources

BPF Performance Tools Review
https://leighfinch.net/2023/09/21/bpf-performance-tools-review/
Thu, 21 Sep 2023 06:48:29 +0000

BPF Performance Tools is the kind of book an observability specialist picks up and thinks ‘this will make a good reference book for my library’, and then reads the whole thing cover to cover.

Brendan Gregg, formerly of Netflix, has contributed significantly to the world of observability and draws on his experience in troubleshooting and tracing some of the most interesting problems any of us are likely to come across.

So what is BPF? Those of us in the Unix, Linux, and BSD world will likely say Berkeley Packet Filter, and to be fair, this was the case. BPF was originally created to allow users to create filters for tcpdump to monitor selected network traffic and either write it to a PCAP file or display it on the screen. This was useful when troubleshooting what was actually happening on the “wire” as opposed to what people think is going over the wire. I’ve used this to troubleshoot everything from port security and voice over IP issues to performance analysis. The phrase “PCAP, or it didn’t happen” exists for a reason.

BPF has moved away from being just an acronym to the name of a feature, sometimes referred to as eBPF (extended BPF), which now allows us to trace virtually anything that happens inside the Linux kernel. This could be performance related, security related, or even modifying the behaviour of the kernel altogether. Load balancers and firewalls have been created in BPF. I’ve even started building a congestion control algorithm leveraging BPF. The possibilities here are endless; you can now write kernel-safe code to be run in the kernel, with information fed up to user-land through maps.

This book, however, focuses on the performance aspects of BPF using tracing. The difference between tracing and logs is the ability to trace events in real time without relying on pre-existing logs that occur without context. I could, for example, trace every socket accept event from every application and process on my machine, or trace server response times, or the amount of time spent in a particular state.

What I particularly liked was how the author broke down performance into specific domains, including disk I/O, network I/O, and applications, and showed us real examples of BPF in action via BCC (the BPF Compiler Collection) and the BPF APIs. There were one-liners for practically every aspect of Linux performance I would want to query.

Perhaps most importantly the author compared and contrasted traditional tools we would use with the BCC approach in one line! This has completely changed the way I plan to approach performance troubleshooting in Linux.

If you ever thought to yourself, ‘why is that so slow?’ this book is for you. Grab a copy here!

If you made it this far, thanks for reading.

Performance Diagnostics Part 4 - HTTPS Performance
https://leighfinch.net/2023/09/19/performance-diagnostics-part-4-https-performance/
Tue, 19 Sep 2023 05:51:48 +0000

Unlike HTTPS, analysing HTTP traffic with tools like Wireshark is pretty easy because everything is in clear text. Wireshark will even give you the request performance (49ms highlighted below). I can also see that the request was sent in packet 4 (after the three way handshake), and the response came in packet 6. The delta between packet 4 and packet 6 is your server response time.

But what about packet 5? Packet 5 is the acknowledgement of data at the operating system level, rather than at the application layer. Normally, if the response takes more than 50ms (your OS may vary), we will see what’s called a delayed acknowledgement, which the application data may piggyback on. However, this naked acknowledgement (no application payload) came back 3ms later. The reason is that the request was less than a full segment size (see the MSS in the SYN packets), which meant the OS attached the PSH flag, which the receiver must acknowledge straight away.
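To make the arithmetic concrete, here is a toy sketch with made-up timestamps consistent with the 49ms example above (the packet numbers mirror the capture: 4 = request, 5 = bare ACK, 6 = response):

```python
# Hypothetical packet arrival times in seconds, keyed by packet number.
packets = {4: 0.010, 5: 0.013, 6: 0.059}

# Server response time: delta between the request and the first response data.
server_response_time = packets[6] - packets[4]

# OS-level acknowledgement delay: delta between the request and the bare ACK.
ack_delay = packets[5] - packets[4]

print(round(server_response_time * 1000), "ms")  # 49 ms
print(round(ack_delay * 1000), "ms")             # 3 ms
```

The same subtraction works for any request/response pair once you can identify which packet carries the request and which carries the first byte of the response.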

So what happens when we wrap this up in HTTPS? We can use the same logic as measuring the request and response cycles we did with HTTP; it just means that we can’t see the actual payloads. In most cases we can expect that a payload will be sent to the server, and the delta between that payload and the return payload is our server response(1).

The second interesting thing is that we will now be at the mercy of SSL/TLS setup, which involves additional round trips for the connection to establish. The below screenshot demonstrates a simple HTTPS request with connection setup, TLS establishment, HTTP, and session teardown.

If we break this down, it’s actually quite a simple request and response cycle(2).

Events

  1. The first 3 packets are the normal TCP three-way handshake. 
  2. Packet 4, instead of the HTTP request, is the ‘Client Hello’. The Client Hello is a backwards-compatible offer from the client to the server to negotiate specific TLS parameters, including pre-existing TLS session IDs and available cipher suites (eg. TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256).
  3. Packet 5 is an operating-system-level TCP acknowledgement, indicating that the Client Hello has been received.
  4. A ‘Server Hello’ is received in packet 6 with the server’s selection of cipher suite and other important TLS parameters, including the server TLS certificate.
  5. Packet 10 tells the server that all subsequent communications will be encrypted using the agreed parameters.
  6. Packet 12 tells the client that all subsequent communications will be encrypted using the agreed parameters.
  7. Packet 14 is the HTTP request from the client (encrypted using TLS).
  8. Packet 15 (8ms later) is the beginning of the HTTP response, followed by packet 17 (16 is an acknowledgement at the TCP level).
  9. Packets 19 and 21 are the encrypted alert (type 21), which is session teardown. Even though it says alert, this is normal for TLS teardown and does not indicate a problem.
  10. Packets 20 and 22 are normal TCP teardown.
  11. Packets 23 and 24 are resets (RST) towards the server. Resets are now commonly used to tear down some types of TLS communications(3). 

From this we can see that even though the actual server response time for the request was only 8ms, it took 236ms from the start of the connection to the beginning of the server response due to TCP and TLS overhead. 

If this were a high-latency link (eg. satellite), this would have taken even longer (back of the envelope for Starlink would be roughly 500ms, with geostationary satellite taking 2.2 seconds).
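That back-of-the-envelope figure can be sketched as a simple round-trip count (assumptions: one round trip for the TCP handshake, two for a full TLS 1.2 handshake, one for the request/response itself; the RTT values are rough illustrative guesses, not measurements):

```python
def time_to_first_byte(rtt_seconds, server_time=0.0):
    # 1 RTT for the TCP three-way handshake, 2 RTTs for a full TLS 1.2
    # handshake, 1 RTT for the HTTP request/response cycle itself.
    round_trips = 1 + 2 + 1
    return round_trips * rtt_seconds + server_time

# Rough RTT assumptions: ~120ms for Starlink, ~550ms geostationary.
print(round(time_to_first_byte(0.120), 2))  # ~0.48 seconds
print(round(time_to_first_byte(0.550), 2))  # ~2.2 seconds
```

TLS 1.3 cuts the TLS portion to one round trip, which is exactly why high-latency links benefit so much from it.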

If you got this far, thanks for reading! If you want to learn more about this type of protocol analysis, pick up a copy of Wireshark Network Analysis.

  1. The exception to this is async communications like WebSockets. This is a subscribe-type model where you will see a payload go to the server, and then sporadic responses back to the client from the server, or a response every 60 (or 30, 20, 10) seconds.
  2. This session only has one HTTP object fetched. Typically you would see a persistent connection which reuses the same TCP/TLS connection to make further requests reducing overheads.
  3. The reason for RST being used in this way is to do with the behaviour of the TCP stack specified all the way back in RFC793. If a reset is sent, the socket is closed immediately, as opposed to waiting for two times the MSL (Maximum Segment Lifetime), which typically means minutes in TIME_WAIT / CLOSE_WAIT. Check out RFC1337 for some interesting commentary.

Performance Diagnostics Part 2 — Revenge of the OSI Model
https://leighfinch.net/2023/09/13/performance-diagnostics-part-2-revenge-of-the-osi-model/
Wed, 13 Sep 2023 00:12:25 +0000

Continuing on from the previous article, where I discussed an amalgamation of performance diagnostics with fat client applications, I thought it was a good time to go back to computer science 101, where we were introduced to the OSI model and the TCP/IP model. Both are models that architectures and platforms more or less implement, but the former (the OSI model) allows us to break problems down into something easy to consume and definitively identify where performance problems lie.

I’ve found throughout my travels that most technologists’ understanding of the model is murky at best, and I strongly feel that this is a key part of identifying problems, and the ownership of problems, within IT Ops, engineering, SRE, and DevOps. This is further obfuscated by the evolution of compute and network architectures, where we have multiple layers of virtualisation and shared infrastructure.

I had a situation where I was called in to investigate a ‘customer network problem’ as called out by a third-party vendor that hosted a Citrix farm in a cloud provider, used to host a fat client application connecting directly to a database in the same cloud network. I asked for an example of the type of errors they were seeing; to my surprise they sent me a screenshot of the fat client (inside the Citrix session) displaying a database deadlock error… So not only was this not a customer network problem, it was also at the wrong layer of the OSI stack!

A word of caution on mean-time-to-innocence thinking: in this case the ability to resolve the issue was out of the control of the customer that brought me in, but at least we were able to help focus the investigation efforts. Organisations that use terms like mean-time-to-innocence as opposed to mean-time-to-resolution often have larger problems than technology to work through. As I’ve said in other articles, people generally want to fix problems that lie within their domain; however, there is often enough plausible deniability, and enough missing observability data due to retention, tooling, and processes, that sticking their heads in the sand and blaming another team becomes the default.

Layer         Function
Application   Application code is run here.
Presentation  Data encoding, encryption, compression.
Session       Start-up, management, and tear-down of a session. RPC as well as session cookies.
Transport     Transport protocol manages connections between the endpoints for the reliable or unreliable transmission of data.
Network       Manages addressing, or where to find another host.
Data-Link     Manages the connection between two physically or logically connected hosts. Eg. Ethernet.
Physical      Physical cable or medium used to connect two or more hosts.
OSI Functions

For the purposes of this discussion, we’ll keep it simple. There are cases where the lines are blurred; after all, this is a model, not a mandate. Technologies like DPDK (Data Plane Development Kit), raw sockets, and eBPF allow you to bypass some or most of the stack. This area is a lot of fun if you ever want to build your own router, switch, or firewall, or write your own implementation of layers 2 and above of a protocol.

Layer         Typical Component
Application   Application
Presentation  Application, Application Server, standard libraries
Session       Operating System, Application Server, standard libraries
Transport     Kernel
Network       Kernel
Data-Link     Kernel
Physical      Medium. Eg. cable
OSI Responsibility

A typical application will usually consider anything below layer 5 a network problem or timeout, which can lead to interesting discussions with network teams, who usually don’t manage the operating systems at either end. Remember that routers don’t retransmit packets (segments); hosts do. A retransmission could be a reaction to something happening within the network infrastructure, but it is also part of the normal TCP process of probing for its fair share of bandwidth, as seen in the below IO graph.

Just as the application stack (layers 5, 6, and 7) has no control over routing (layer 3), it also has no control over retransmissions, which are solely the domain of the operating system kernel. This also applies to TCP connection establishment (AKA the 3-way handshake): the application cannot slow the establishment down, but it can choose whether to read data from the socket.

An application can set some options that influence behaviours such as delayed ack, Nagle, cork, and congestion control using setsockopt(), but that is a topic for another article. Check out the man page for TCP for more info.
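As a minimal sketch, here is what toggling one of those options looks like from Python: disabling Nagle’s algorithm with TCP_NODELAY (no connection is made; we just set the option and read it back):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm via TCP_NODELAY.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back: non-zero means Nagle is disabled for this socket.
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)
s.close()
```

The same setsockopt()/getsockopt() pattern applies to options like TCP_CORK and TCP_CONGESTION on Linux, though availability varies by platform.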


In this section we’re going to use the OSI model to go through problem isolation and how to monitor each layer.

Layer 1 — Physical

The kinds of problems we typically see at the physical layer are to do with the medium. This could be as simple as duplex mismatches, poor signal strength, or collisions. Errors like this are usually easily observed on the network infrastructure as CRC (Cyclic Redundancy Check) errors, runts (incomplete frames), carrier drops, and network switch buffer exhaustion. Streaming telemetry and SNMP provide excellent insights into these types of performance errors.

Layer 2 — Data-Link

The data link layer is responsible for synchronising, modulating, and multiplexing communications between physical nodes, carrier sensing and collision detection (CSMA/CD), and enforcement of protocol specifications (eg. Ethernet). An example of this may be that a frame checksum has failed (due to an error at layer 1) and the frame must therefore be discarded as part of MAC (Media Access Control). LLC (Logical Link Control) allows for flow control and extensions such as VLAN tags. Popular layer 2 standards include:

  1. ATM
  2. Ethernet
  3. PPP

Streaming telemetry and SNMP provide excellent insights into these types of performance errors.

Layer 3 — Network

The network layer is used to identify and address hosts. In the TCP/IP world this is IP, or Internet Protocol (versions 4 and 6). Hosts use this to uniquely identify a target. Unlike a MAC address, IP addresses (outside RFC1918) should be globally (or at least relatively) unique. There are a few exceptions to this, including multicast, private, anycast, site local, and link local addresses.

Errors at the network layer include reachability problems (Destination Host Unreachable) and IP TTL expiry (the TTL is a field decremented by every router in a transit path).
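As a small illustration, the TTL a host stamps on outgoing packets can be read and changed per socket (a sketch; the default value varies by operating system):

```python
import socket

# A UDP socket is enough to demonstrate the per-socket IP_TTL option.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
default_ttl = s.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)  # often 64 on Linux

# Lower the TTL: packets from this socket now expire after 10 router hops.
s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 10)
new_ttl = s.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)

print(default_ttl, new_ttl)
s.close()
```

This is the mechanism traceroute exploits: it sends probes with incrementally increasing TTLs and maps the path from the resulting TTL-expiry responses.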

Layer 3 is also typically where QoS (Quality of Service) is set. The DSCP (Differentiated Services Code Point) enables devices to prioritise traffic based on the class that has been configured. A router (or even host) operating at layer 3 can then change priority or queueing strategy based on this and other parameters. Again, a story for another day.

While streaming telemetry and SNMP provide some insight into layer 3, NetFlow/IPFIX provide better insight into how traffic traverses an environment. Traceroute, ping, and other ad-hoc tools can provide some level of insight; however, this is subject to network policies.

Layer 4 — Transport

The transport layer is responsible for the reliable (or unreliable) delivery of data. This is where traditional network engineers lose visibility of the ‘network’. Each TCP session is bidirectional between a host pair (which could even be the same host talking to itself), and has randomised (hopefully) sequence and acknowledgement numbers used to keep track of the conversation and identify what data has, and has not, been received.

Routers, switches, and to a lesser extent firewalls do not participate (or interfere) in a TCP conversation between two hosts. As always, there are exceptions, such as devices rewriting values like the MSS (maximum segment size) for a path to avoid packet fragmentation. But to reiterate: a router does not retransmit a packet; the sending host does.

This is also where communications efficiency comes into play. The congestion window on the sender is used to meter data communications across the network, and the receive window advertised by the receiver tells the sender how much space the receiver has to read data.
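The resulting limit can be sketched in a couple of lines (illustrative numbers): the sender may have at most the smaller of the two windows unacknowledged at any time:

```python
def in_flight_limit(cwnd_segments, mss_bytes, receive_window_bytes):
    # The sender is bounded by both its own congestion window and the
    # receiver's advertised window, whichever is smaller.
    return min(cwnd_segments * mss_bytes, receive_window_bytes)

# E.g. an initial CWND of 10 segments of 1448 bytes against a 64KiB
# receive window: the congestion window is the bottleneck here.
print(in_flight_limit(10, 1448, 65535))
```

Throughput per round trip is capped by this limit, which is why both windows matter when diagnosing slow transfers.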

NetFlow and IPFIX provide some insight here, but to really understand how protocols like TCP are performing we need actual packet data. This could be as simple as tcpdump running continuously, or a commercial solution like Riverbed AppResponse for application-aware network performance management. Endpoint and server agents can also play a role in providing metrics, events, and traces. eBPF can also come into play here.

Layer 5 — Session

This is where things start to become a little more vague, and using the model as a model rather than a rule becomes important. The session layer is responsible for the management of individual sessions using the same transport. This is applicable to a lot of the older protocols, including RPC (Remote Procedure Calls). It could be argued that this includes HTTP session management (session cookies).

At this point in the stack we are now working with application servers and standard libraries, and the small percentage of applications that have chosen to implement this from scratch.

This is where application traces, logs, and events come into play. Endpoint and server agents can be used to aggregate and feed this data into EUEM and APM analysis servers.

Layer 6 — Presentation

The presentation layer manages compression, (some) encryption, and encoding. This allows the application to work with clear text (eg. ASCII and Unicode) without having to implement much within the application code. This is typically managed by the application server and standard libraries. An example of this is opening an encrypted socket vs a cleartext socket: the process of reading and writing to the socket is largely the same, with the developer not having to worry about the encryption process.

Metrics, TLS handshakes, events, logs, and traces provide detailed insight into this layer and associated performance problems.

Layer 7 — Application

This is where most developers spend their time and where application logic is performed. Application performance problems at this layer can come from a whole range of sources, including but not limited to:

  1. Head of line blocking.
  2. Number of server turns (request/response cycles), which is amplified by latency.
  3. Poor memory management.
  4. Inefficient algorithms.
  5. Number of worker threads.
  6. Dependencies such as databases and web services.
  7. Locks.
  8. Lack of connection pooling and pipelining.
  9. Authentication.

This is solely the domain of APM and EUEM agents to collect metrics, events, logs, and traces (MELT).

If you got this far, thanks for reading. Check out some of the book recommendations.

Performance Diagnostics Part 1
https://leighfinch.net/2023/09/12/performance-diagnostics-part-1/
Tue, 12 Sep 2023 08:12:31 +0000

Over the last 20 years I’ve been sent in by customers to investigate some of the most intriguing application performance problems; problems that have had customers investing in infrastructure, war-room time, and connectivity to try to resolve an issue that eludes the technical team, or that the technical team is unable to quantify a fix for. Once a change such as adding bandwidth, compute, or storage has been implemented, it is seldom reversed when the root cause of the issue is identified.

While performance diagnostics is a fancy term for troubleshooting (something that technical staff do every day), I love the rush of energy that comes from the investigation and identifying the solution. Performance problems are harder to investigate than an outage because the application is somewhat working, the degradation occurs intermittently, and most customers I work with have limited observability tools in place, relying instead on the depth of the IT support queue as a measure of the health of an application.

I’m going to start with one of the simplest categories of problems: the ‘slow fat client application’. While many applications have moved to using HTTP(S) for delivery and presentation, there are still (far too) many applications that require a fat client executable that, if you are lucky, connects to an application server, or directly to a database, or, if you are really unlucky, to a database file on a file share (oplocks… and I’m sorry…).

In this amalgam of stories (to preserve anonymity), I’m going to talk about a fat client application talking directly to a database that has been a long running issue with process, political, and technology problems.

Background:

  1. The customer has bought a COTS application from an external vendor to reduce IT costs developing the application in-house.
  2. The application is accessed by about 20 users simultaneously.
  3. The quality and detail of the support tickets have degraded over time as users have become frustrated with the lack of resolution and have started to adjust to this experience as normal.
  4. The vendor of the application has said that the application works fine for them and has even tested the dataset on their infrastructure.
  5. The blame game has reached fever pitch, and the executive wants answers.

This makes the first step of performance diagnostics the hardest: quantifying the magnitude of the problem and when it occurs. Users will often complain that the issue happens ‘all the time’ and ‘everyone experiences it’. Along with users, I’ll usually interview various teams to understand their take on the issue. Often these tech teams have been banging their heads against the wall for some time and are sick of talking about it and investigating vague problem reports. I’ve found the best way to get buy-in from reluctant tech teams is to be transparent and provide them access to any telemetry I might collect as part of the investigation.

Once I understand the nature of the problem(s), I start at the point of consumption (the workstation). If the customer has EUEM tools in play I can define transactions (click to render) and understand three aspects of each transaction and their timestamps:

  1. Client activity including process utilisation, overall environmentals, delay between requests, network connectivity (VPN, wired, wireless).
  2. Network activity including number of request and response cycles, bandwidth utilisation, efficiency of the network communications.
  3. Delays on the backend including time to first byte. (I’ll write up how to quantify this in a separate article).

This gives me an understanding of the baseline performance of the application. It’s also important to remember that at this point ‘normal’ does not equal good.

Normal != Good

A PCAP or NPM solution will also give me most of the network and backend delay information if the customer doesn’t have an EUEM solution such as Riverbed Aternity.

In most circumstances I now have enough information to determine the performance limits of the application, whether bandwidth will solve the problem, or if the problem can be fixed without architectural changes to the application.

In this simple example of the fat client talking to a database, there are only so many answers that can be given. I’m going to use the login transaction taking 12.5 seconds as my baseline. In this case the vendor and database team had blamed the network. The network team had checked connectivity and found it to be operating fine, with performance tests achieving 1Gbps, far more than the application needed. Looking into the EUEM statistics I found a significant amount of time being spent on the network; a PCAP told me the same thing.

Network can mean a lot of different things, but we need to remember that the network starts and ends with the two endpoints. Routers don’t retransmit traffic; endpoints do. In this case, the workstation would make a few database queries to complete a login, including authentication and downloading 12,000 rows of customer data. The fat client was performing an unbuffered fetch of data from the database, which in itself is not bad if that level of data integrity is needed (not so in this case).

So what does an unbuffered fetch mean? For every row in the database result, the client will make sequential requests for data, one, after, the, other. At <1ms of latency between the workstation and the database server, and each server turn (request/response cycle) taking 1ms, we saw 12,000 rows taking (you guessed it) 12 seconds. The vendor was not receptive to the idea that it was a problem with the application, and decided to test our finding by changing the login process to do a buffered fetch of the 12,000 rows. This took the login time to about 0.5 seconds. A sigh of relief from the customer’s tech lead was short-lived though; the vendor was not willing to modify the application, and the solution was thrown out as not fit for purpose.
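The arithmetic above can be sketched directly (numbers taken from this example: 12,000 rows and roughly 1ms per server turn):

```python
ROWS = 12_000
TURN_MS = 1  # ~1ms per sequential request/response cycle to the database

# Unbuffered fetch: one server turn per row, strictly one after the other.
unbuffered_seconds = ROWS * TURN_MS / 1000
print(unbuffered_seconds)  # 12.0

# The vendor tested with the client on the database server itself,
# so per-turn latency was effectively zero: 0 times 12,000 is still 0.
print(ROWS * 0 / 1000)  # 0.0
```

The row count multiplies the latency, which is why the same application can feel instant in the vendor's lab and unusable across even a sub-millisecond network.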

It’s not just how much bandwidth the network has, or how fast the server is, but how applications use the available resources. How many request / response cycles do we need to use to go from click to render?

So why did the vendor not see the problem? They were testing the fat client application on the database server negating any limitations of the speed of light or network protocol overhead. 0 times 12,000 is still 0.

If you got this far, thanks for reading and follow me for more.

Initial Congestion Windows in Linux
https://leighfinch.net/2023/09/12/initial-congestion-windows-in-linux/
Tue, 12 Sep 2023 03:45:04 +0000

As part of my research I’ve spent a lot of time looking at the performance of TCP variants and options. One of the most common questions I get asked is about the congestion window and how it reacts to changes in the environment.

The congestion window (CWND) is used to control how many segments (layer 4 protocol data units) can be outstanding (unacknowledged) at any point in time. For most TCP connections we want to be able to use as much bandwidth as we can without overwhelming the network. In most situations the CWND is constantly changing as we move through the phases of the connection lifecycle.

The first phase of a TCP connection is the 3-way handshake, where a connection is established between two endpoints (client and server). When the connection is established, both endpoints individually set buffers for the sending (CWND) and receiving (RWND) of data. These buffers are usually set conservatively for efficiency and security purposes.

The next phase is slow start, where the initial CWND is set to an integer between 1 and 10 segments (the maximum currently allowed per RFC 6928). Slow start (despite its name) grows the window exponentially using a method similar to ABC (Appropriate Byte Counting), where the CWND is increased by the number of bytes acknowledged in each ACK segment. As typical bandwidth has increased, it makes sense to raise the initial CWND to 10, and newer Linux kernels do just this.
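The exponential growth of slow start can be sketched numerically. The snippet below is an illustration of the byte-counting idea, not kernel code: when every in-flight byte is acknowledged, the window roughly doubles each loss-free round trip.

```python
# Sketch of slow-start growth with byte counting (illustrative, not
# kernel code): ABC grows the window by the number of bytes
# acknowledged, which roughly doubles CWND each loss-free round trip.
MSS = 1448          # typical segment payload for a 1500-byte MTU path
INITIAL_CWND = 10   # segments, per RFC 6928

def slow_start_cwnd(round_trips, mss=MSS, initial=INITIAL_CWND):
    """Return CWND (in segments) after loss-free round trips of slow start."""
    cwnd = initial * mss            # track the window in bytes
    for _ in range(round_trips):
        cwnd += cwnd                # ABC: add the bytes just acknowledged
    return cwnd // mss

print(slow_start_cwnd(0))  # 10 -- the initial window
print(slow_start_cwnd(3))  # 80 -- three loss-free round trips later
```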

As the connection matures and exits slow start (the trigger, depending on the flavour of TCP, can be a combination of loss and latency), TCP moves into congestion avoidance, where the CWND is increased only approximately once every round trip. Loss, latency, and ECN (Explicit Congestion Notification) signals may result in a return to slow start.
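The contrast with slow start can be sketched using the per-ACK increment commonly quoted for Reno-style congestion avoidance (cwnd += MSS*MSS/cwnd). Again, this is illustrative rather than kernel code:

```python
# Sketch of Reno-style congestion avoidance (illustrative, not kernel
# code): each ACK grows the window by MSS*MSS/cwnd bytes, which works
# out to roughly one extra segment per round trip.
def congestion_avoidance(cwnd_bytes, acks, mss=1448):
    for _ in range(acks):
        cwnd_bytes += mss * mss // cwnd_bytes
    return cwnd_bytes

# One round trip's worth of ACKs (10 segments in flight) grows the
# window by a little under one MSS:
print(congestion_avoidance(10 * 1448, 10))
```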

Unlike the receive window (which we can see in packets), we can’t directly see the congestion window in a packet capture. We can, however, infer the congestion window from the number of outstanding bytes at any point in time.

The above IO Graph in Wireshark looks at a 10 second trace and shows the number of bytes in flight, which roughly equates to the congestion window over the lifecycle of the connection.

I can also zoom into the beginning of the trace and count the segments at the very start of the connection: I can see a maximum of 14,480 bytes sent, or 10 segments of 1448 bytes (plus 12 bytes of TCP options to add up to the MSS of 1460).
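The bytes-in-flight inference itself is simple arithmetic: subtract the highest byte acknowledged from the highest byte sent. A trivial sketch (the function and parameter names are illustrative, not Wireshark’s):

```python
# Sketch: bytes in flight is the gap between the highest sequence byte
# the sender has transmitted and the highest byte the receiver has
# acknowledged.
def bytes_in_flight(highest_seq_sent, highest_ack_received):
    return highest_seq_sent - highest_ack_received

# The initial burst observed above: 10 segments of 1448 bytes sent,
# none acknowledged yet.
print(bytes_in_flight(10 * 1448, 0))  # 14480
```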

There are two easy ways to confirm the initial CWND on a Linux host:

1. Look at the CWND of a listening socket using ‘ss -nli | grep cwnd | uniq’.

2. Write a simple application that inspects the socket it creates.

The application below takes two parameters (destination host and port); it creates a TCP connection, transfers no data, and prints out the socket parameters. The tcpi_snd_cwnd field in this example shows the expected value: 10.

    import socket
    import struct
    import argparse
    
    class tcp_client_tcp_info(object):
        """Simple TCP client that prints the kernel's TCP_INFO for its socket."""
    
        def __init__(self, host, port):
            try:
                client_socket = socket.socket()
                client_socket.connect((host, port))
                # The first 104 bytes of struct tcp_info: 8 single-byte
                # fields followed by 24 32-bit unsigned integers.
                fmt = "B"*8+"I"*24
                fmt_keys =  ['tcpi_state', 'tcpi_ca_state', 'tcpi_retransmits', 'tcpi_probes', 'tcpi_backoff', 'tcpi_options', 'tcpi_snd_wscale', 'tcpi_rcv_wscale', 'tcpi_rto', 'tcpi_ato', 'tcpi_snd_mss', 'tcpi_rcv_mss', 'tcpi_unacked', 'tcpi_sacked', 'tcpi_lost', 'tcpi_retrans', 'tcpi_fackets', 'tcpi_last_data_sent', 'tcpi_last_ack_sent', 'tcpi_last_data_recv', 'tcpi_last_ack_recv', 'tcpi_pmtu', 'tcpi_rcv_ssthresh', 'tcpi_rtt', 'tcpi_rttvar', 'tcpi_snd_ssthresh', 'tcpi_snd_cwnd', 'tcpi_advmss', 'tcpi_reordering', 'tcpi_rcv_rtt', 'tcpi_rcv_space', 'tcpi_total_retrans']
                # Fetch TCP_INFO from the kernel and unpack it into a dict.
                tcp_info = dict(zip(fmt_keys, struct.unpack(fmt, client_socket.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104))))
                for k, v in tcp_info.items():
                    print(k + " " + str(v))
            except socket.error as e:
                print(e)
                exit(1)
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("-H", "--hostname", type=str,
                        help="Host to connect to", required=True)
        parser.add_argument("-p", "--port", type=int,
                        help="Port to connect to", required=True)
        args = parser.parse_args()
        tcp_client_tcp_info(args.hostname, args.port)

Lastly, we can check where this value is set in tcp.h, and the commit that set it to 10.

If you got this far, thanks for reading. Pick up a copy of W. Richard Stevens’ TCP/IP Illustrated Volume 1 to learn more about TCP and the protocols that built the internet.

The post Initial Congestion Windows in Linux first appeared on Observe Ability.

]]>
https://leighfinch.net/2023/09/12/initial-congestion-windows-in-linux/feed/ 0 27