eBPF Tracepoints: Gaining Access to the TCP State Machine

My current research focus at UTS is the inner workings of TCP Congestion Control, which as you might guess requires some detailed insight into the Linux TCP State Machine. While there have been significant improvements in the TCP Congestion Control Modules used by Linux (Reno -> NewReno -> BIC -> CUBIC¹), the state machine which controls when the functions within those modules are called has changed far less, and it provides a central point of focus when trying to understand the state of a specific TCP flow.

In the past we’ve had to write Kernel Probes, which come with the complexity of coding against specific architectures and creating breakpoints. In contrast, eBPF can attach to a specific Tracepoint, which exposes debugging information that remains stable across versions of the Kernel, with the safety of the eBPF verifier to help you (mostly) avoid accidentally crashing your system.

In a previous article, I wrote about using XDP to collect NetFlow-style statistics at the interface level. The issue with using XDP to collect accounting data is that it sits directly in the traffic path, acting as a gatekeeper. Tracepoints are better suited to reading and accounting purposes with minimal overhead, as they don’t impact the flow of traffic (apart from the usually negligible CPU overhead).

Code Walkthrough

All code can be downloaded from https://github.com/lfinchHob/ebpf/tree/main/ebpf-tcp-example

***This is not production tested and not recommended outside of research purposes***

The tcp_probe Tracepoint was added back in Linux 4.16, migrating the previous Kernel Probe code. To find this tracepoint I used the popular bpftrace tool, which allowed me to search for, and understand, the data structure I would receive when attaching to the tcp_probe tracepoint. Understanding what each Tracepoint exposes requires some knowledge of the Kernel architecture; then again, there isn’t much point in tracing something you don’t understand.

As per previous articles, I will be writing the eBPF in C and the user land component in Python. This is to help clarify the difference between the components, but also to help integrate with my other applications that aren’t seen in this article. For more details on how I’ve laid out my code, check out the previous article on this topic.

Searching for the Tracepoint

bpftrace -l allows us to search for tracepoints of interest using text and wildcards. In this example I knew I wanted a TCP tracepoint, but I wasn’t sure exactly which one, so I decided to list them all.

leigh@ebpf:~/ebpf/ebpf-tcp-example$ sudo bpftrace -l "tracepoint:tcp*"
tracepoint:tcp:tcp_bad_csum
tracepoint:tcp:tcp_cong_state_set
tracepoint:tcp:tcp_destroy_sock
tracepoint:tcp:tcp_probe
tracepoint:tcp:tcp_rcv_space_adjust
tracepoint:tcp:tcp_receive_reset
tracepoint:tcp:tcp_retransmit_skb
tracepoint:tcp:tcp_retransmit_synack
tracepoint:tcp:tcp_send_reset
leigh@ebpf:~/ebpf/ebpf-tcp-example$

Describing the Tracepoint

bpftrace -lv shows us the data structure that will be returned by this tracepoint, which we will use in our eBPF code (written in C) to expose data to our user land code (Python).

leigh@ebpf:~/ebpf/ebpf-tcp-example$ sudo bpftrace -lv "tracepoint:tcp:tcp_probe"
tracepoint:tcp:tcp_probe
__u8 saddr[28]
__u8 daddr[28]
__u16 sport
__u16 dport
__u16 family
__u32 mark
__u16 data_len
__u32 snd_nxt
__u32 snd_una
__u32 snd_cwnd
__u32 ssthresh
__u32 snd_wnd
__u32 srtt
__u32 rcv_wnd
__u64 sock_cookie
leigh@ebpf:~/ebpf/ebpf-tcp-example$

eBPF Code

Again we are writing our eBPF code in C and using Python to orchestrate the compiling, attaching, and detaching of the eBPF code. The eBPF code is split into three sections, which I have separated out for simplicity.

First, the data structure which will be copied to user land for the Python code to consume and display. This struct maps against the data structure identified by sudo bpftrace -lv "tracepoint:tcp:tcp_probe".

struct tcp_probe_t {
    u64 ts;
    u8 saddr[28];
    u8 daddr[28];
    u16 sport;
    u16 dport;
    u16 family;
    u32 mark;
    u16 data_len;
    u32 snd_nxt;
    u32 snd_una;
    u32 snd_cwnd;
    u32 ssthresh;
    u32 snd_wnd;
    u32 srtt;
    u32 rcv_wnd;
    u64 sock_cookie;
};
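When we read the map from Python, BCC automatically generates a ctypes class matching this C struct. Purely for illustration, a hand-written equivalent (a sketch of what BCC derives for us, not code you need to write yourself) would look something like this:

```python
import ctypes

# Hypothetical ctypes mirror of tcp_probe_t; BCC derives an
# equivalent class automatically from the C source.
class TcpProbe(ctypes.Structure):
    _fields_ = [
        ("ts", ctypes.c_uint64),
        ("saddr", ctypes.c_uint8 * 28),
        ("daddr", ctypes.c_uint8 * 28),
        ("sport", ctypes.c_uint16),
        ("dport", ctypes.c_uint16),
        ("family", ctypes.c_uint16),
        ("mark", ctypes.c_uint32),
        ("data_len", ctypes.c_uint16),
        ("snd_nxt", ctypes.c_uint32),
        ("snd_una", ctypes.c_uint32),
        ("snd_cwnd", ctypes.c_uint32),
        ("ssthresh", ctypes.c_uint32),
        ("snd_wnd", ctypes.c_uint32),
        ("srtt", ctypes.c_uint32),
        ("rcv_wnd", ctypes.c_uint32),
        ("sock_cookie", ctypes.c_uint64),
    ]

# Fields are then accessed by name, just like the values BCC hands back.
probe = TcpProbe(family=2, sport=443, snd_cwnd=10)
```

This is why the Python code later in the article can simply read v.family, v.sport, and so on straight off each map value.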

This section defines a BPF Map (in this case a hash table) that determines which data will be made available to user land code. In this example the hash is called tcp_probe_h, the index/key is an unsigned 32-bit integer, and the data/value is the tcp_probe_t struct defined above.

BPF_HASH(tcp_probe_h, u32, struct tcp_probe_t);

This next section defines the tracepoint we will attach to and the code to be executed. The TRACEPOINT_PROBE macro makes it easy to attach our probe. In this example we simply copy the data structure provided by the Tracepoint into our own data structure to be made available to user land. We don’t need to copy everything one for one; this example provides a complete copy to show how different types of data can be copied. This is also where we could perform high-speed filtering on variables such as the destination port.

TRACEPOINT_PROBE(tcp, tcp_probe) {
    struct tcp_probe_t val = {0};
    // Random key so each event gets its own entry in the hash
    u32 key = bpf_get_prandom_u32();
    val.ts = bpf_ktime_get_ns();
    // Addresses are fixed-size byte arrays, so copy them wholesale
    __builtin_memcpy(val.saddr, args->saddr, 28);
    __builtin_memcpy(val.daddr, args->daddr, 28);
    val.sport = args->sport;
    val.dport = args->dport;
    val.family = args->family;
    val.mark = args->mark;
    val.data_len = args->data_len;
    val.snd_nxt = args->snd_nxt;
    val.snd_una = args->snd_una;
    val.snd_cwnd = args->snd_cwnd;
    val.ssthresh = args->ssthresh;
    val.snd_wnd = args->snd_wnd;
    val.srtt = args->srtt;
    val.rcv_wnd = args->rcv_wnd;
    val.sock_cookie = args->sock_cookie;
    tcp_probe_h.update(&key, &val);
    return 0;
}

Python Code

The Python code is broken up into two main components: the monitoring code (start_monitoring) and a simple decoder for turning the IP address into something human readable (decode_in6).

  1. Attach our code by initialising b as type BPF with the src_file argument.
  2. Create a loop that sleeps for 2 seconds.
  3. Retrieve the entries from the BPF HashMap.
  4. Loop through and decode each one.
  5. Display the entry.
  6. Remove the entry from the HashMap.
  7. Exit cleanly on a KeyboardInterrupt.
from time import sleep
from bcc import BPF

def start_monitoring():
    b = BPF(src_file="tcp_c.c")  # compile and attach the eBPF program
    try:
        while True:
            sleep(2)
            for k, v in b["tcp_probe_h"].items():
                src_ip = decode_in6(v.family, v.saddr)
                dst_ip = decode_in6(v.family, v.daddr)
                print("ts: {6}, src: {0}, dst: {1}, sport: {2}, dport: {3}, snd_cwnd: {4}, rcv_wnd: {8}, srtt: {5}, ssthresh: {7}".format(src_ip, dst_ip, v.sport, v.dport, v.snd_cwnd, v.srtt, v.ts, v.ssthresh, v.rcv_wnd))
                del b["tcp_probe_h"][k]
    except KeyboardInterrupt: #7
        print("Exiting")

Decoding the IP address is a little more complex than simply dumping out an int, as the data may be an IPv6 or IPv4 address and we need to treat them differently. So how do we tell which one we have? The tcp_probe struct contains an address family field, which we can compare against an integer: IPv4 maps to 2 and IPv6 maps to 10.
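On Linux those magic numbers match the named constants in Python's socket module, so a quick sanity check (and an alternative to hard-coding 2 and 10) looks like this:

```python
import socket

# On Linux, AF_INET is 2 and AF_INET6 is 10; comparing against the
# named constants keeps the magic numbers out of the decoder.
print(int(socket.AF_INET))   # 2
print(int(socket.AF_INET6))  # 10
```

Note these values are platform specific; the 2/10 pairing holds for Linux, which is all that matters when reading a Linux tracepoint.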

Because the saddr and daddr fields are u8 arrays with a length of 28, we need to decode the binary in those arrays differently depending on which address family is in use. Fortunately, from a structure perspective, the beginnings of the IPv4 and IPv6 socket address structures are the same, so we can unpack the data using the IPv6 structure and then select the field we wish to convert to a human-readable string. For more reading on this topic, check out the Linux header code for IPv4 and IPv6.²

import struct
import ipaddress

def decode_in6(af, addr6):
    # Unpack the 28-byte buffer as if it were a sockaddr_in6; for IPv4
    # the address sits where the IPv6 flowinfo field would be.
    af, tport, flow, a1, a2, a3, a4, scope = struct.unpack('<HHILLLLI', bytearray(addr6))
    ip = "0.0.0.0"
    if af == 10:    # AF_INET6
        ip = str(ipaddress.IPv6Address(struct.pack('<LLLL', a1, a2, a3, a4)))
    elif af == 2:   # AF_INET
        ip = str(ipaddress.IPv4Address(struct.pack('<I', flow)))
    return ip
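To see the unpacking in action, here is a small self-contained check using hand-built 28-byte buffers (synthetic test data, not real tracepoint output) for both address families. The unpack-then-repack round trip preserves the network byte order of the address bytes, which is why the decoded strings come out right:

```python
import struct
import ipaddress

def decode_in6(af, addr6):
    # Same logic as the article's decoder: unpack as a sockaddr_in6
    # and pick the field holding the address for the given family.
    af, tport, flow, a1, a2, a3, a4, scope = struct.unpack('<HHILLLLI', bytearray(addr6))
    ip = "0.0.0.0"
    if af == 10:
        ip = str(ipaddress.IPv6Address(struct.pack('<LLLL', a1, a2, a3, a4)))
    elif af == 2:
        ip = str(ipaddress.IPv4Address(struct.pack('<I', flow)))
    return ip

# Hand-built sockaddr_in: family=2 (host order), port 443 (network
# order), address 10.0.0.1 (network order), zero padding to 28 bytes.
buf4 = struct.pack('<H', 2) + struct.pack('!H', 443) \
     + ipaddress.IPv4Address('10.0.0.1').packed + b'\x00' * 20

# Hand-built sockaddr_in6: family=10, port, 4-byte flowinfo,
# 16-byte address, 4-byte scope id.
buf6 = struct.pack('<H', 10) + struct.pack('!H', 443) + b'\x00' * 4 \
     + ipaddress.IPv6Address('2001:db8::1').packed + b'\x00' * 4

print(decode_in6(2, buf4))   # -> 10.0.0.1
print(decode_in6(10, buf6))  # -> 2001:db8::1
```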

Summary

So in this example we created an eBPF program with kernel-land and user-land components to display every TCP connection’s state whenever the state machine is triggered. I chose this as an example because I was struggling to find examples of it being done well; that said, scraping log files and trace buffers is safer in production.

If you want to learn more about eBPF pick up a copy of Learning eBPF as it provides excellent examples and a complete understanding of eBPF.

Notes

  1. There are other great congestion control algorithms, including BBR (Bottleneck Bandwidth and Round-trip propagation time) and other hybrid algorithms. The chosen list is based on the progression of Linux default algorithms.
  2. Decoding this took way longer than I had anticipated, as I had to work it out from the source code.
