Observe Ability (https://leighfinch.net)

eBPF Tracepoints: Gaining Access to the TCP State Machine
Mon, 05 Feb 2024

My current research focus at UTS is on the inner workings of TCP congestion control, which as you might guess requires some detailed insight into the Linux TCP state machine. While there have been significant improvements in the TCP congestion control modules used by Linux (Reno -> NewReno -> BIC -> CUBIC[1]), the state machine that controls when the functions within those modules are called has changed far less, and it provides a central point to focus on when trying to understand the state of a specific TCP flow.

In the past we had to write kernel probes, which come with the complexity of coding against specific architectures and creating breakpoints. In contrast, eBPF can attach to a specific Tracepoint, which exposes debugging information that remains stable across kernel versions, with the eBPF verifier helping you (mostly) avoid accidentally crashing your system.

In a previous article, I wrote about using XDP at the interface level to collect NetFlow-style statistics. The issue with using XDP to collect accounting data is that it sits directly in the traffic path, acting as a gatekeeper. Tracepoints are better suited to read-only accounting purposes because they don't impact the flow of traffic (apart from a usually negligible CPU overhead).

Code Walkthrough

All code can be downloaded from https://github.com/lfinchHob/ebpf/tree/main/ebpf-tcp-example

***This is not production tested and not recommended outside of research purposes***

The tcp_probe Tracepoint was added back in Linux 4.16, migrating the previous kernel probe code. To find this tracepoint I used the popular bpftrace tool, which allowed me to search for and understand the data structure I would receive when attaching to the tcp_probe tracepoint. Understanding what each Tracepoint exposes requires some knowledge of the kernel architecture; then again, there isn't much point in tracing something you don't understand.

As per previous articles, I will be writing the eBPF in C and the user-land component in Python. This is to help clarify the difference between the components, but also to help integrate with my other applications that aren't seen in this article. For more details on how I've laid out my code, check out the previous article on this topic.

Searching for the Tracepoint

bpftrace -l allows us to search for tracepoints of interest using text and wildcards. In this example I knew I wanted a TCP tracepoint, but I wasn't sure exactly which one, so I decided to list them all.

leigh@ebpf:~/ebpf/ebpf-tcp-example$ sudo bpftrace -l "tracepoint:tcp*"
tracepoint:tcp:tcp_bad_csum
tracepoint:tcp:tcp_cong_state_set
tracepoint:tcp:tcp_destroy_sock
tracepoint:tcp:tcp_probe
tracepoint:tcp:tcp_rcv_space_adjust
tracepoint:tcp:tcp_receive_reset
tracepoint:tcp:tcp_retransmit_skb
tracepoint:tcp:tcp_retransmit_synack
tracepoint:tcp:tcp_send_reset
leigh@ebpf:~/ebpf/ebpf-tcp-example$

Describing the Tracepoint

bpftrace -lv shows us the data structure that will be returned by this tracepoint, which we will use in our eBPF code (written in C) to expose the data to our user-land code (Python).

leigh@ebpf:~/ebpf/ebpf-tcp-example$ sudo bpftrace -lv "tracepoint:tcp:tcp_probe"
tracepoint:tcp:tcp_probe
__u8 saddr[28]
__u8 daddr[28]
__u16 sport
__u16 dport
__u16 family
__u32 mark
__u16 data_len
__u32 snd_nxt
__u32 snd_una
__u32 snd_cwnd
__u32 ssthresh
__u32 snd_wnd
__u32 srtt
__u32 rcv_wnd
__u64 sock_cookie
leigh@ebpf:~/ebpf/ebpf-tcp-example$

eBPF Code

Again we are writing our eBPF code in C and using Python to orchestrate the compiling, attaching, and detaching of the eBPF code. The eBPF code is split into three sections, which I have separated out for simplicity.

The first section is the data structure that will be copied to user land for the Python code to consume and display. This struct maps against the data structure identified by sudo bpftrace -lv "tracepoint:tcp:tcp_probe".

struct tcp_probe_t {
    u64 ts;
    u8 saddr[28];
    u8 daddr[28];
    u16 sport;
    u16 dport;
    u16 family;
    u32 mark;
    u16 data_len;
    u32 snd_nxt;
    u32 snd_una;
    u32 snd_cwnd;
    u32 ssthresh;
    u32 snd_wnd;
    u32 srtt;
    u32 rcv_wnd;
    u64 sock_cookie;
};

This section defines a BPF Map (in this case a hash table), which determines which data will be made available to user-land code. In this example the hash is called tcp_probe_h, the index/key is an unsigned 32-bit integer, and the data/value is the tcp_probe_t struct defined above.

BPF_HASH(tcp_probe_h, u32, struct tcp_probe_t);

This next section defines the tracepoint we attach to and the code to be executed. The TRACEPOINT_PROBE macro makes it easy to attach our probe. In this example we simply copy the data structure provided by the Tracepoint into our own data structure to be made available to user land. We don't need to copy everything one for one; this example provides a complete copy to show how different types of data can be copied. This is also where we could perform high-speed filtering on variables such as the destination port (a sketch of such a filter follows the probe code below).

TRACEPOINT_PROBE(tcp, tcp_probe) {
    struct tcp_probe_t val = {0};
    u32 key = bpf_get_prandom_u32();
    val.ts = bpf_ktime_get_ns();
    __builtin_memcpy(val.saddr, args->saddr, 28);
    __builtin_memcpy(val.daddr, args->daddr, 28);
    val.sport = args->sport;
    val.dport = args->dport;
    val.family = args->family;
    val.mark = args->mark;
    val.data_len = args->data_len;
    val.snd_nxt = args->snd_nxt;
    val.snd_una = args->snd_una;
    val.snd_cwnd = args->snd_cwnd;
    val.ssthresh = args->ssthresh;
    val.snd_wnd = args->snd_wnd;
    val.srtt = args->srtt;
    val.rcv_wnd = args->rcv_wnd;
    val.sock_cookie = args->sock_cookie;
    tcp_probe_h.update(&key, &val);
    return 0;
}
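As an aside, here is a minimal sketch of that destination-port filtering idea. It is not part of the original program: the port value 443 is purely illustrative, and it assumes (as recent kernels do) that the tcp_probe tracepoint reports ports already converted to host byte order.

TRACEPOINT_PROBE(tcp, tcp_probe) {
    // Hypothetical filter: only record flows involving port 443.
    if (args->dport != 443 && args->sport != 443)
        return 0;
    // ... copy fields and update tcp_probe_h as in the full probe above ...
    return 0;
}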

Python Code

The Python code is broken up into two main components: the monitoring code (start_monitoring) and a simple decoder for turning the IP address into something human readable (decode_in6).

  1. Attach our code by initialising b as type BPF with the src_file argument.
  2. Create a loop that sleeps for 2 seconds.
  3. Retrieve entry from the BPF HashMap.
  4. Loop through and decode.
  5. Display entry.
  6. Remove entry from HashMap.
import struct
import ipaddress
from time import sleep
from bcc import BPF

def start_monitoring():
    b = BPF(src_file="tcp_c.c")
    try:
        while True:
            sleep(2)
            for k, v in b["tcp_probe_h"].items():
                src_ip = decode_in6(v.family, v.saddr)
                dst_ip = decode_in6(v.family, v.daddr)
                print("ts: {6}, src: {0}, dst: {1}, sport: {2}, dport: {3}, snd_cwnd: {4}, rcv_wnd: {8}, srtt: {5}, ssthresh: {7}".format(src_ip, dst_ip, v.sport, v.dport, v.snd_cwnd, v.srtt, v.ts, v.ssthresh, v.rcv_wnd))
                del b["tcp_probe_h"][k]
    except KeyboardInterrupt:
        print("Exiting")

Decoding the IP address is a little more complex than simply dumping out an int, as the data may be an IPv6 or IPv4 address and we need to treat them differently. So how do we tell which one we have? The tcp_probe struct contains an address family field we can compare against an integer: IPv4 maps to 2 and IPv6 maps to 10.

Because saddr and daddr are u8 arrays with a length of 28, we need to decode the binary in that structure differently depending on which address family is in use. Fortunately, the beginnings of the IPv4 and IPv6 structures are the same, so we can unpack the data using the IPv6 layout and then select the field we wish to convert to a human-readable string. For more reading on this topic, check out the Linux header code for IPv4 and IPv6.[2]
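Those integers correspond to the kernel's address family constants, which Python exposes through the socket module on Linux:

import socket

print(int(socket.AF_INET))   # 2  (IPv4)
print(int(socket.AF_INET6))  # 10 (IPv6)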

def decode_in6(af, addr6):
    # Unpack using the sockaddr_in6 layout; the leading family and port
    # fields are common to both address families.
    af, tport, flow, a1, a2, a3, a4, scope = struct.unpack('<HHILLLLI', bytearray(addr6))
    ip = "0.0.0.0"
    if af == 10:   # AF_INET6
        ip = str(ipaddress.IPv6Address(struct.pack('<LLLL', a1, a2, a3, a4)))
    elif af == 2:  # AF_INET
        ip = str(ipaddress.IPv4Address(struct.pack('<I', flow)))
    return ip
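A quick way to sanity-check the decoder is to feed it a synthetic 28-byte buffer shaped like the tracepoint's saddr field; the address and port below are made up for the example:

# family = 2 (AF_INET), a port, then the IPv4 address in network byte
# order, padded out to the 28 bytes the tracepoint provides.
buf = struct.pack('<HH', 2, 443) + ipaddress.IPv4Address("192.168.20.12").packed + b'\x00' * 20
print(decode_in6(2, buf))  # 192.168.20.12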

Summary

So in this example we created an eBPF program with kernel-land and user-land components to display every TCP connection's state whenever the state machine is triggered. I chose this as an example because I was struggling to find examples of it being done well. That said, scraping log files and trace buffers is safer in production.

If you want to learn more about eBPF pick up a copy of Learning eBPF as it provides excellent examples and a complete understanding of eBPF.

Notes

  1. There are other great congestion control algorithms, including BBR (Bottleneck Bandwidth and Round-trip propagation time) and other hybrid algorithms. The chosen list is based on the Linux default algorithm progression.
  2. Decoding this took way longer than I had anticipated as I had to work it out from the source code.

XDP and eBPF for Network Observability with Python
Sun, 14 Jan 2024

I've been playing with XDP and eBPF in my lab to see if it might be possible to create NetFlow/IPFIX-style flow logs for network observability purposes. Of course this is possible, but is it something the average Joe can achieve in a few hours?

In my previous article I discussed what eBPF and XDP are and how they can be used for observability, security, and kernel behavioural changes. This article takes things further with some working code. I chose Python for my examples to make it clear which parts of the code are user-land code, and which parts are eBPF (kernel-land) code.

The example in this article expands on Liz Rice's eBPF beginners guide and book Learning eBPF (chapter 8, page 147). In this example we are going to extend the packet counter, which kept a record of how many ICMP, TCP, and UDP packets had arrived on a hard-coded interface.

Enhancements:

  1. Separate Python and C into separate files for clarity.
  2. List interfaces available for selection (as opposed to hardcode eth0).
  3. Detach the XDP/eBPF program from the interface on ctrl-c (keyboard interrupt).
  4. Display a list of source IP addresses that have sent IP traffic.
  5. Create an additional BPF_HASH to store source IP address packet counter.
  6. Create a function to export the source IP address from the packet header.
  7. Convert an unsigned integer to a dotted decimal address.

This code is for learning purposes and there is no guarantee it will not break your system, although safety is one of the premises of eBPF. I've used Debian Bookworm in a VM (KVM) with only a minimal set of packages installed. You will need at least the following additional packages for development purposes:

  1. python3-bpfcc
  2. linux-headers-6.1.0-17-amd64
  3. bpftools
  4. xdp-tools
  5. vim
  6. sudo

Sudo isn't strictly needed, but XDP code must run as a super-user or with the correct capabilities (CAP_SYS_ADMIN, CAP_BPF, etc.). Setting capabilities is beyond the scope of this article, so we will run our code using sudo.

We are going to split our code into three different files (packet.py, packet.c, and packet.h) to make the purpose of each piece of code clear. When packet.py is run, it will automatically compile packet.c/packet.h to eBPF byte-code and ensure it passes validation (the safety check) prior to attaching the code to the selected interface.

Example Run

leigh@ebpf:~/ebpf/ebpf-xdp-example$ sudo python3 packet.py
[sudo] password for leigh:
Select your interface:
lo
ens18
Interface name: ens18
Protocol 1: counter 2,Protocol 17: counter 2,
Sources 192.168.20.97: counter 1,Sources 192.168.20.12: counter 2,Sources 192.168.20.1: counter 1,
Protocol 1: counter 4,Protocol 6: counter 2,Protocol 17: counter 4,
Sources 192.168.20.97: counter 3,Sources 192.168.20.12: counter 6,Sources 192.168.20.1: counter 1,
Protocol 1: counter 6,Protocol 6: counter 4,Protocol 17: counter 5,
Sources 192.168.20.97: counter 3,Sources 192.168.20.12: counter 10,Sources 192.168.20.1: counter 1,Sources 192.168.20.139: counter 1,
Protocol 1: counter 8,Protocol 6: counter 5,Protocol 17: counter 6,
Sources 192.168.20.23: counter 1,Sources 192.168.20.97: counter 3,Sources 192.168.20.12: counter 13,Sources 192.168.20.1: counter 1,Sources 192.168.20.139: counter 1,
Protocol 1: counter 10,Protocol 6: counter 6,Protocol 17: counter 6,
Sources 192.168.20.23: counter 1,Sources 192.168.20.97: counter 3,Sources 192.168.20.12: counter 16,Sources 192.168.20.1: counter 1,Sources 192.168.20.139: counter 1,
Protocol 1: counter 12,Protocol 6: counter 7,Protocol 17: counter 9,
Sources 192.168.20.23: counter 1,Sources 192.168.20.97: counter 3,Sources 192.168.20.37: counter 2,Sources 192.168.20.12: counter 20,Sources 192.168.20.1: counter 1,Sources 192.168.20.139: counter 1,
^CDetaching
Exiting
leigh@ebpf:~/ebpf/ebpf-xdp-example$

packet.py

This file is the orchestrator and user-land component of our code. The get_interfaces function asks the user which interface to use. The start_monitoring function attaches (and detaches) the code to the selected interface and subsequently polls the eBPF maps for the latest data from the kernel-land code.

Source

#!/usr/bin/python
from bcc import BPF
import socket
import struct
from time import sleep

def get_interfaces():
    print("Select your interface:")
    # Return a list of network interface information
    interfaces = socket.if_nameindex()
    for iface in interfaces:
        print(iface[1])
    val = input("Interface name: ")
    # for-else: the else branch runs only when the loop finds no match.
    for iface in interfaces:
        if val == iface[1]:
            return val
    else:
        print("invalid interface name")
        exit()


def start_monitoring(interface):
    b = BPF(src_file="packet.c")
    b.attach_xdp(dev=interface, fn=b.load_func("packet_counter", BPF.XDP))
    try:
        while True:
            sleep(2)
            s = ""
            for k, v in b["packets"].items():
                s += "Protocol {}: counter {},".format(k.value, v.value)
            print(s)
            source = ""
            for k, v in b["sources"].items():
                source += "Sources {}: counter {},".format(socket.inet_ntoa(struct.pack('<L', k.value)), v.value)
            print(source)
    except KeyboardInterrupt:
        print("Detaching")
        b.remove_xdp(interface, 0)
        print("Exiting")

if __name__ == "__main__":
    iface = get_interfaces()
    start_monitoring(iface)

packet.c

This file contains the XDP function that will be compiled to eBPF byte-code and run in kernel space, attached to an interface. For every packet that arrives on this interface, the code checks that it is an IPv4 packet and records the next-protocol byte (TCP, UDP, ICMP; Wireshark display filter ip.proto).

Two BPF hashes are defined; these are the eBPF maps we use to send kernel-land data to the user-land Python code.

Source

#include "packet.h"

BPF_HASH(packets);
BPF_HASH(sources);

int packet_counter(struct xdp_md *ctx) {
    u64 counter = 0;
    u64 source_counter = 0;
    u64 key = 0;
    u64 source = 0;
    u64 *p;
    u64 *s;

    key = lookup_protocol(ctx);
    if (key != 0) {
        p = packets.lookup(&key);
        if (p != 0) {
            counter = *p;
        }
        counter++;
        packets.update(&key, &counter);

        source = lookup_source(ctx);
        s = sources.lookup(&source);
        if (s != 0) {
            source_counter = *s;
        }
        source_counter++;
        sources.update(&source, &source_counter);

    }

    return XDP_PASS;
}

packet.h

packet.h is the C header file, which contains two inline functions to assist with decoding the IP headers. This file is far from efficient, however it demonstrates how to dissect data from an xdp_md struct.

Source

#include <linux/ip.h>
#include <linux/if_ether.h> /* struct ethhdr, ETH_P_IP */

/* Helper carried over from the original book example; unused here. */
#define IP_ADDRESS(x) (unsigned int)(172 + (17 << 8) + (0 << 16) + (x << 24))

// Returns the protocol byte for an IP packet, 0 for anything else
static __always_inline u64 lookup_protocol(struct xdp_md *ctx)
{
    u64 protocol = 0;

    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(struct ethhdr) > data_end)
        return 0;

    // Check that it's an IP packet
    if (bpf_ntohs(eth->h_proto) == ETH_P_IP)
    {
        // Return the protocol of this packet
        // 1 = ICMP
        // 6 = TCP
        // 17 = UDP
        struct iphdr *iph = data + sizeof(struct ethhdr);
        if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) <= data_end)
            protocol = iph->protocol;
    }
    return protocol;
}

static __always_inline u64 lookup_source(struct xdp_md *ctx)
{
    u64 source = 1;

    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(struct ethhdr) > data_end)
        return 0;

    // Check that it's an IP packet
    if (bpf_ntohs(eth->h_proto) == ETH_P_IP)
    {
        struct iphdr *iph = data + sizeof(struct ethhdr);
        if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) <= data_end)
            source = iph->saddr;
    }
    return source;
}

Manually detaching XDP code

Code can be manually detached using xdp-loader.

leigh@ebpf:~/ebpf/ebpf-xdp-example$ sudo xdp-loader status
CURRENT XDP PROGRAM STATUS:

Interface        Prio  Program name      Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
lo                     <No XDP program loaded!>
ens18                  packet_counter    native   81   10540f65ac5626d6

leigh@ebpf:~/ebpf/ebpf-xdp-example$ sudo xdp-loader unload ens18 -a

Summary

In the next XDP article we will move closer to collecting a 5-tuple (src-ip, dest-ip, src-port, dest-port, protocol), which could be used to export to a NetFlow/IPFIX collector; a rough sketch of what that key might look like is below. If you made it this far, thanks for reading.
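As a sketch only (these names are my own guesses, not code from this series), a 5-tuple flow key following the BPF_HASH pattern used in packet.c might look like this:

// Hypothetical flow key for 5-tuple accounting (sketch only).
struct flow_key_t {
    u32 src_ip;
    u32 dst_ip;
    u16 src_port;
    u16 dst_port;
    u8 protocol;
};

// Map from flow key to a packet counter.
BPF_HASH(flows, struct flow_key_t, u64);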

Alert Fatigue: Why Too Many Alerts Can be Disastrous!
Thu, 28 Dec 2023

Alert fatigue is a problem I've encountered many times in IT operations, especially as monitoring sprawl increases the number of tools we use to gain additional insight into our SLOs (Service Level Objectives). Those on the front line who receive the alerts begin to drown in information and overlook the important issues when they arise.

Having been in the hot seat myself, I've fallen victim to alert fatigue, where an important notification (by SMS at the time) showed up on my old-school Nokia in between 25 other notifications. The problem was that we had 25 satellite sites connected via a branch office. We were used to satellite being unreliable due to rain fade, solar events, and carrier maintenance, so a deluge of alerts typically indicated a satellite failure. In this case, however, the branch office uplink supporting the satellite sites had failed. All 80 staff in the office were offline and called the service desk to get the issue resolved. The damage in this case was minimal, but we've all seen much worse, where revenue and reputation were impacted.

Most observability tools now have some level of event management, prioritisation, and filtering built in. Beyond that, the ITSM (IT Service Management) platform usually has an events module which is meant to correlate and reduce the number of alerts/incidents being raised. That said, I've seldom seen an environment where this has been implemented effectively. Usually we see a straight passthrough from an observability tool alert to an ITSM incident. For many of the customers I've worked with, more than 10,000 incidents a day is not uncommon.

What Can we do About it?

Alert Management Solution (AIOps)

There are a number of software solutions for managing alerts, incidents, and problems from vendors including PagerDuty, BigPanda, Splunk, Atlassian, and ServiceNow. Each has its own spin on AIOps and its relation to alert management, and each is certainly worth digging into (if you haven't already).

In the Open Source world there are projects like Keep and Prometheus Alertmanager. Using tools like Fluent Bit as part of your observability pipeline can also help.

One of the other challenges with relying on an AIOps platform to reduce alert fatigue is that the platform itself can become fatigued, drowning in the same noise it is meant to filter.

Separate Reporting From Service Level Indicators

When we build a new monitoring or observability solution we are often after quick wins to reduce time to value and demonstrate return on investment, but we don't want all of this information to become alerts on an ongoing basis. The key is to reduce the load on our alert management solution by ensuring that we are actively monitoring and alerting on our Service Level Indicators as defined by service delivery and development. SRE (Site Reliability Engineering), ITOM (IT Operations Management), and Performance Consultants are well versed in defining what these SLIs are and can help prioritise them.

Examples:

  1. Server Response Time
  2. Time to First Byte
  3. Largest Contentful Paint
  4. DB Query Time

Prioritising SLIs doesn't mean we shouldn't collect and alert on other metrics, but "if everything is important, nothing is." By prioritising which alerts are escalated and how, we can reduce alert fatigue so that when an issue does occur, it gets the right attention.
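As a minimal sketch of that prioritisation idea (the metric names and routing targets are illustrative, not from any particular tool):

# Escalate only alerts tied to a defined SLI; everything else is
# recorded for reporting rather than paging anyone.
SLI_ALERTS = {"server_response_time", "time_to_first_byte", "db_query_time"}

def route(alert):
    if alert["metric"] in SLI_ALERTS:
        return "page-oncall"
    return "report-only"

print(route({"metric": "server_response_time", "value_ms": 240}))  # page-oncall
print(route({"metric": "disk_io_wait", "value_ms": 12}))           # report-only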

Platform Ops Observability

Building an observability capability into your Platform Ops offering can help standardise the observability solution used across the organisation. Beyond the tooling itself, practices and workflows can be standardised on a maturity scale depending on the criticality of the service and its impact on the customer's digital experience.

What Next?

Managing alerts well is hard, and the larger the organisation, the more complicated it becomes. AIOps promises to help reduce alert fatigue, but we need to improve the quality of the data going in to improve its effectiveness.

I'm excited to see what happens in this area in the next few years. If you have a great Open Source solution that improves operations and reduces alert fatigue, I'd be excited to see your reply in the comments below.

Implementing Enterprise Observability for Success Review
Wed, 13 Dec 2023

Implementing Enterprise Observability for Success by Manisha Agrawal and Karun Krishnannair takes a novel approach to implementing observability in the enterprise. Unlike other books that focus on how to get data into a specific system, this book looks at observability from an architectural and IT management perspective and is largely vendor agnostic (which I loved).

(Image: book cover)

I found this book interesting following an Open Door session I ran a few weeks ago with friends and colleagues on LinkedIn, where we discussed a number of ideas and concepts. One of the key challenges we discussed was gaining organisational support for observability when there are so many competing factors. This book answers some of those questions with a systematic approach broken into three easy-to-read parts:

  • Part 1 – Understanding Observability in the Real World
  • Part 2 – Planning and Implementation
  • Part 3 – Case Studies

Part 1 – Understanding Observability in the Real World

This part provides an introduction to observability, discussing the need for it and where it fits in the enterprise. It's easy to say that we care about observability because it empowers a focus on our customers' digital experience; this book breaks that down in a way that is consumable by multiple audiences.

The architecture the authors discuss starts at the infrastructure layer, which is a critical part of building an observability architecture. The book then moves on to the application layer and APM, followed by Digital Experience Management, and lastly the organisational layer, which is where the business data and insights lie.

This approach is contrary to so many other books I've read, which focus on a specific technology or component of observability instead of the end-to-end picture and business outcomes. One of the reasons I like this book so much is that it gives us the why behind observability, which can help us build business cases that get attention.

Part 2 – Planning and Implementation

In Part 2 we go through the planning and implementation of an observability solution. Because this book is aimed at an architectural and IT director audience, the focus is on the authors' maturity model, gaining stakeholder buy-in, and RASCI models. The authors go into quite a bit of detail on the maturity models and how different services will sit at different levels depending on the criticality of the service.

Team structures and building an observability and customer-focused culture are discussed, and the authors recommend starting small and defining success using measurable outcomes.

The observability team discussed in this book could be part of a Platform Operations or SRE team. The book recommends a dedicated team to ensure standards across the organisation, providing observability to service-specific teams and helping them become power users of the toolsets to increase adoption.

Part 3 – Case Studies

The last part goes through four case studies, identifying the customers' challenges and how observability can be implemented and adopted to solve those problems.

Summary

I really enjoyed that this book speaks to the "why" of enterprise observability and enables architects and IT directors to build an observability business case. Being technology agnostic and focused on adoption is something I've practised for many years, and it's great to have a book like this to point to when discussing this with my own customers.

If you made it this far, why not pick up a copy of this book or one of the others in my book recommendations!

Termshark: Command Line Wireshark for the Win!
Wed, 06 Dec 2023

I was recently working on a headless server trying to troubleshoot an issue with Linux bridging and iptables and needed to understand where my packets were getting dropped. Traditionally in this situation I would run tcpdump (with aggressive filters) and either watch the output or take a PCAP and scp the file to my workstation for analysis. In this case I decided to try a cool tool I'd played with in the Cilium labs called Termshark.

Termshark is a command line tool written in Go, leveraging the Gowid libraries for visualisation and tshark for either live captures or PCAP analysis, all within your terminal. In this example I took a short capture from my physical interface and loaded it into Termshark for a different yet familiar view.
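Assuming the tshark-style flags Termshark inherits, the two common ways to start it look like this (the interface and file names are illustrative):

# Analyse an existing capture file
sudo termshark -r capture.pcap

# Live capture from a physical interface
sudo termshark -i eth0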

You can tab between the three main panels (packet list, dissectors, hex view), with the dissector view expanding to show additional details in a hierarchical display.

Display Filters

Like vim/less/more, we can use forward slash (/) to access the filter bar and type Wireshark display filters.

You can do the equivalent of a right click on a dissector field by pressing Enter on the desired field, which will display a context menu.

Packet Search

To find strings or byte patterns in the packets you can use ctrl-f, and use the arrow keys to select the type of search you want to run.

Analysis

Press ESC (the escape key) to access the Analysis and Misc menus at the top.

Quit

Press ctrl-c to quit or use the escape key to access the Misc menu and select quit.

You can find the user guide here!

If you want to amplify your Wireshark skills, check out one of my favourite network analysis books, Wireshark Network Analysis Second Edition by Laura Chappell, or check out other books in the Book Recommendations.

Mastering Python Networking Review
Tue, 28 Nov 2023

I came across Mastering Python Networking by Eric Chou about a month ago on Twitter and immediately purchased it. I was excited to see a book on programming targeted at people with a networking background, as being able to automate becomes critical to scaling networks and reducing toil.

To say I'm a fan of this book is an understatement! I'd been expecting topics like programming with pexpect and using common APIs, but what we got was far more detailed than I could have hoped for, with deep insights into the background of TCP right through to building custom APIs, observability, and automating cloud networking.

Chou ramps the topics up gradually, building on each chapter so that the learning curve for each topic is gentle enough that even someone with no Python experience could be writing basic scripts to automate network configuration changes within the first couple of chapters.

Using Ansible to automate configuration baselining, provisioning, and changes with a scalable methodology is covered in detail, and has been well received not only by me but also raved about on Twitter. If you're not using Ansible (or similar) you will eventually come across it, and this book gives you ready-to-run playbooks that will accelerate your adoption.

With my background in Observability I was pleasantly surprised that 2.5 chapters had been dedicated to the topic from multiple perspectives:

  1. Telemetry configuration pushes
  2. Receiving and decoding the telemetry
  3. Extending existing tools like NTOP and Cacti

Graphing and visualisation are an important part of making data consumable by multiple audiences, and the introduction and practical examples of the popular Matplotlib and PyGraphviz were on point.

Packet decoding and crafting libraries like Scapy are introduced, and again the practical examples make relatively complex concepts, like writing a network scanning tool, easy to digest. I've used Scapy in the past to build custom protocol implementations, and I wish I'd had this book then.

Today most of my research centres around a Python framework called Mininet, and while Mininet is not covered, I would recommend this book to anyone looking to learn modelling and simulation using Python.

The future of networking is not Network Engineers logging into individual devices and running commands. This book is a primer for the network engineering community looking to scale, and conversely for programmers looking to understand how to automate networking tasks.

Topics:

  1. Review of TCP/IP Protocol Suite and Python
  2. Low-Level Network Device Interactions
  3. APIs and Intent-Driven Networking
  4. The Python Automation Framework – Ansible
  5. Docker Containers for Network Engineers
  6. Network Security with Python
  7. Network Monitoring with Python – Part 1
  8. Network Monitoring with Python – Part 2
  9. Building Network Web Services with Python
  10. Introduction to AsyncIO
  11. AWS Cloud Networking
  12. Azure Cloud Networking

If you enjoyed this article, pick up a copy of this book to support us.

SRE: Five Ways to Build a Blameless Culture
Fri, 24 Nov 2023

One of the main pillars of SRE (Site Reliability Engineering) is a blameless culture; however, building this takes more than just words. You can't build a blameless culture by talking about culture, because culture is the result of changes in processes and structures within an organisation.

Here are my top 5 ways to build a blameless culture:

  1. Career Safety
  2. Root cause analysis is not blame
  3. Observability
  4. Customer centricity
  5. Shared story

What does a defensive culture look like?

As a consultant I'm often brought in to resolve problems with applications, networks, and services with complex infrastructure and even more complex cultural problems. These problems have usually existed for more than a few months, to the point that leadership has gone out to market to seek outside help.

When organisations get to this point they typically have rigid structures in place, with ongoing war rooms, overly cautious change control processes, and long-running incidents that seem to have no root cause.

Five ways to build a blameless culture

1: Career safety

People who fear for their job are less likely to be transparent about problems they might be encountering. I've seen organisations so risk averse and bound by process that they will leave a problem in place rather than resolve it, for fear that raising a change request may expose a problem within their domain of operations.

Work with your team to help them understand that finding problems is a good thing and that their role is secure (or even bolstered) by finding existing and potential issues.

2: Root cause analysis is not blame

At the end of an incident, root cause analysis should be performed to understand where in the technology stack the problem occurred. Finding a problem in a specific area, such as a router configuration, is a learning exercise, not a reason to blame a person for not doing their "job".

Root cause analysis can be tough and may uncover some uncomfortable truths, but it is necessary to improve the experience of your users. Finding a problem should therefore be rewarded, empowering people to actively seek out problems that affect their users' experience.

Hindsight is always 20/20, and life is about continuous learning and development. We should encourage our best and brightest to share times when they have made a mistake, to help others feel comfortable discussing their own experiences.

3: Observability

Monitoring has come a long way in the last 30 years; we now have the ability to observe practically every part of a technology stack with metrics, logs, and traces. Without a proper observability solution we are left with blind spots that leave room for doubt. If an engineer is unable to say with certainty that a problem is or is not in their domain, it becomes easy to blame transient issues that may or may not have existed.

No engineer wants to have a problem in their domain, and observability tools allow them to inspect their own domain first and understand why a problem occurred. Arm your engineers with the right observability tools and you will see them start proactively seeking out problems rather than fretting that an unknown problem may exist.

4: Customer centricity

Customer centricity deserves a whole article on its own, but in this context I mean positioning yourself in the shoes of the customer experiencing the service. In a previous life I made a point of travelling out to the field to sit with end-users in remote parts of Australia, working over satellite communications, to share the experience as they had it daily.

A small change to a service may seem insignificant when tested, but can become a challenge when latency is introduced, making the application or service unusable for a portion of your customer base.

A second issue around customer centricity is understanding that a customer may not be able to adequately describe the problem they are experiencing, which can lead to tickets or incidents being dismissed as "not reproducible" or the end user being labelled a noisy complainer.

Dig deeper into what the end user is experiencing by using probing questions such as those you might hear when visiting your GP (General Practitioner):

  1. Tell me more?
  2. When did this start?
  3. What have you tried?
  4. Did that work?
  5. What is the impact?

5: Shared story

A shared vision and story allows us all to move in the same direction. In a defensive culture we see issues or bugs being dismissed as the fault of the way customers use the software or service. If we create a shared story and understanding of the way end-users interact with our services, it's easier for everyone, from end-users through to developers, to understand where and why problems are experienced.

We can create story maps using wireframes or bullet point workflows that show how users interact and navigate through our applications and services. An excellent book on this topic is User Story Mapping by Jeff Patton.

6*: Culture is a trailing indicator

Remember that we can't wake up one day and declare a blameless culture; it takes time and is a trailing indicator of the behaviours and structures discussed in this article.

What other techniques have you used to help develop a blameless culture?

* Bonus

Performance Diagnostics Part 6: 5 SRE Practices to Minimise Toil
Mon, 20 Nov 2023

One of the core tenets of SRE is to minimise toil to increase resiliency and improve digital experience. SRE (Site Reliability Engineering) is a practice that Google created in the 2000s to improve the performance of the "site", the site being Google Search.

Google, being made up of some incredibly smart people, defied the traditional model of development and operations. SRE pivots from the traditional ops model by making the SRE team responsible for the performance of a site or service, with the team including both developers and infrastructure members. This reduces friction and improves the customer experience.

My Top 5 Practices:

  1. Automation
  2. Service Level Objectives
  3. Observability
  4. Incident Management
  5. Blameless Culture

What’s Wrong With The Old Model?

In the traditional ops model, developers would typically build an application or service, and the operations team would deploy and maintain the application and supporting infrastructure.

(Image: breaks-in-production meme)

The friction is more than just a lack of collaboration; over time we all make mistakes, and this leads to an us-vs-them mentality. The development team sees every issue as an ops problem, and the ops team sees every issue as a bug, leading to a build-up of technical debt and unresolved issues.

The famous IT Crowd quote "Have you tried turning it off and on again?" might be cliche, but it's often the first thing operations will do to try to restore service. This mentality means the operations team is less focused on why it failed and more on trying to bandaid the service.

Conversely, developers seeing an exception that says something to the effect of "network timeout" dismiss the service disruption as an ops problem without considering that the issue could be a result of the application design.

How does SRE Minimise Toil?

The SRE team includes people with development and operations experience, meaning developers are closer to the user and the performance of the service. Being closer to the user empowers the team to improve service performance and therefore digital experience.

Toil is the repetitive work we do with low levels of gain. A good example in the operations space would be manually clearing up disk space; while useful, it could easily be automated, freeing up staff to perform higher-value tasks.

Automation

Scripting, APIs, and infrastructure as code have significantly reduced the time engineers spend on repetitive tasks. I don't know many engineers who enjoy manually building a server and deploying a service. CI/CD pipelines have significantly improved the quality of software by automatically building and testing it, freeing up staff to focus on higher-value tasks.

Service Level Objectives

Rather than focusing only on Service Level Agreements, SRE works with three related elements: SLIs, SLOs, and SLAs.

  1. SLI: A Service Level Indicator is a key metric that influences the performance of a service. CPU utilisation is a good example of a single metric that influences service performance.
  2. SLO: A Service Level Objective is the expected performance of the service, the value we should not exceed in BAU operations. For example, a server response time of less than 100ms could be the SLO, which when exceeded triggers an investigation (a toy calculation of this follows the list).
  3. SLA: A Service Level Agreement is the never-breach performance of a service. This is the agreement made with an external (or internal) team; breaching it has consequences that could include service credits.
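As a toy illustration of an SLO check (the sample values and the 100ms threshold are made up, not from any particular tool):

def slo_compliance(samples_ms, slo_ms=100):
    # Fraction of requests meeting a hypothetical 100ms response-time SLO.
    return sum(1 for s in samples_ms if s <= slo_ms) / len(samples_ms)

samples = [45, 80, 120, 60, 95]
print(f"SLO compliance: {slo_compliance(samples):.1%}")  # 80.0%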

Observability

Use observability tooling and alerts to understand and map out services so that key metrics are quantified and recorded. The old saying goes, "you can't improve what you cannot measure". Subjective and qualitative measures can also be useful but shouldn't be the primary measure; I recently worked with a company that measures comments on social media as an indicator of service quality.

ITIM: IT Infrastructure Management looks at the infrastructure performance. Servers, network devices, load balancers etc. This is the traditional monitoring approach that uses SNMP, Synthetics, WMI, APIs, and Streaming telemetry to interrogate devices. This is also where logs and events would be collected and aggregated.

NPM: Network Performance Management looks at the efficiency and characteristics of the protocols traversing a network. This is done using passive techniques such as NetFlow and packet capture.

EUEM: End User Experience Management looks at the performance of an application or service at the point of consumption, typically measuring endpoint metrics such as CPU, memory, disk IO, and click-to-paint metrics. Check out my article on EUEM here!

APM: Application Performance Management looks at the performance of an application or service by instrumenting the application, container, or server. This could include techniques like eBPF, OTEL, and profilers, and typically produces traces. Check out my review of Learning eBPF, and my OTEL article with SigNoz.

If we have an Observability instrumentation plan in place, we can identify what factors influence performance and why. This means that no-one sits in a war room hoping the problem is not theirs.

Incident Management

Incident management and root cause analysis are crucial to minimising toil. I built my business around root cause analysis and diagnostics, teaching technologists how to troubleshoot and manage incidents effectively. Relying on a service means we need to methodically identify the cause of a fault and rectify the underlying problem rather than applying a bandaid.

This is where the SRE mindset adds value to DevOps. SRE is compatible with DevOps and both work effectively together.

Blameless Culture

A blameless culture allows for the creation of a shared vision. If we move beyond empire building and focus on making the service as reliable and performant as possible, everyone wins, and no-one hides hoping the problem is not in their domain.

Wrap Up

These are not the only practices we need to reduce toil and improve digital experience. If you made it this far, thanks for reading, and why not pick up the Site Reliability Engineering book, which describes Google's approach to SRE. The Site Reliability Workbook provides practical examples of how to implement SRE.

Digital Experience in an Opaque World – MilCIS 2023
Mon, 13 Nov 2023

Thanks for attending the Observe Ability talk on Digital Experience in an Opaque World at MilCIS 2023. Observability drives digital experience as we cannot improve what we cannot measure.

Fix our Computers!

Link to Twitter Letter

Link To Defense Business Board Report

Convergence of Logs and Traces

Link to Article

Site Reliability Engineering

Next Generation Observability

NETFLIX primer on Streaming Telemetry

Learning eBPF review

BPF Performance Tools Review

Practical OpenTelemetry Review

Unlocking Cybersecurity Excellence: A Guide to the Australian Cyber Essential Eight
Mon, 06 Nov 2023

The Australian Cyber Essential Eight is a general framework targeted at Australian enterprise and government, and is largely considered one of the better frameworks for hardening systems and limiting the damage from the most common kinds of attacks:

  1. Inside Threat Actor
  2. Ransomware
  3. Advanced Persistent Threat (APT)

Most Australian government entities must implement the Essential Eight strategies, developed by the Australian Cyber Security Centre and the Australian Signals Directorate (Australia's equivalent of the American NSA and the UK's GCHQ), to help reduce the impact of a cyber incident.

The strategy has been implemented as a four-stage maturity model, starting from level zero through to level three.

What are the Essential Eight?

  1. Application Control
  2. Patch Applications
  3. Configure Microsoft Office macro settings
  4. User Application Hardening
  5. Restrict Administrative Privileges
  6. Patch Operating Systems
  7. Multi-Factor Authentication
  8. Regular Backups

Hopefully most of these are common sense for any organisation with information or systems to protect. This list of strategies also shouldn't be the only thing an organisation does to protect its data.

Application Control

Application control is the process of controlling which applications and application types are allowed to run on a workstation or server, and protecting audit trails and logs from unauthorised deletion. This could include applications a user downloads from the Internet or even an application the user has built themselves. Most users use the same set of applications regularly and seldom need to run something different.

This control is largely painful for knowledge workers who use uncommon applications to perform their duties, such as downloading a terminal emulator to configure a firewall or other network appliance. Having an approved software catalog, or processes in place to ensure these workers are able to run the software they need, should help reduce the pain.

Patch Applications

Patching applications includes the discovery and patching of applications installed on workstations and servers. Even in the strictest environments I've seen many different versions of the same application running across an estate, including various builds of Microsoft Office, browsers, and other notorious applications that constantly require updates.

This strategy also recommends using vulnerability scanning tools to search for vulnerable applications.

Configure Microsoft Office macro settings

Microsoft Office macros have been discouraged by everyone, including Microsoft, who in February 2022 announced they would disable macros by default. Macros are used to automate repetitive tasks within the Office suite of applications; however, there has been a push away from them as they can be used to create malicious applications that infect your machines. Microsoft recommends using other tools, like the Power BI suite, instead.

If your organisation does need macros, make sure they are sandboxed and scanned by anti-virus software. Additionally, block execution of macros that have been downloaded from the Internet, and only enable macros for those who need them.

User Application Hardening

User application hardening is the process of following hardening guides provided by the ASD, vendors, or STIGs (Security Technical Implementation Guides). This strategy ensures applications are not given free rein on your machine and have been configured in a secure manner.

Restrict Administrative Privileges

Gone are the days when everyone had administrator access (which is a good thing). Controlling who has administrative privileges, and to what, has become the norm for most organisations. This strategy recommends the use of targeted administrative accounts that have limited access and cannot reach services such as email and the Internet.

The use of jump servers, SCIM (System for Cross-domain Identity Management), and access management portals makes it easier to log and record events performed with administrative privileges.

Patch Operating Systems

Similar to the patch applications strategy, the Essential Eight recommends patching operating systems in the same way, using vulnerability scanners to identify and catalog assets.

Multi-Factor Authentication

Multi-factor authentication means using more than one type of authentication to gain access to a system. This can include the use of OTPs (One-Time Passwords), biometrics, and authenticator applications. This strategy mostly targets Internet-accessible services, but also sensitive data repositories and administrative access.

Regular Backups

Backups target important data, application configuration, and settings so that they can be recovered in the event of a disaster or ransomware event. This strategy also includes requirements to limit access to backups to administrative users.

I would add that backups should be disconnected from production systems so that an APT with administrative privileges cannot delete or otherwise corrupt the backup repository.

How does the Essential Eight compare to the ISM?

The ISM (Information Security Manual) is an ASD document that includes thousands of controls, with specific requirements for different classifications of data. While it mostly applies to government organisations and private organisations that interact with government data, the Essential Eight calls out specific ISM controls that can be used to build out Essential Eight implementations.

Is it enough to only employ the Essential Eight strategies?

In a word, no. The ACSC also calls out many other strategies that can be used in conjunction with the Essential Eight to improve the cyber posture of an organisation. The Essential Eight is a minimal guideline that organisations (both government and non-government) can deploy to reduce risk.

https://www.cyber.gov.au/resources-business-and-government/essential-cyber-security/strategies-mitigate-cyber-security-incidents/strategies-mitigate-cyber-security-incidents

What Next?

There are many other frameworks that can be used to improve an organisation's security posture, including (and definitely not limited to):

  1. MITRE D3FEND Framework
  2. Zero Trust Architectures (NIST 800-207)
  3. NIST SP800
  4. Center for Internet Security (CIS) Benchmarks

Observability is key to Cyber success. You cannot secure what you cannot see.
