Observe Ability - https://leighfinch.net

Mastering Python Networking Review
Tue, 28 Nov 2023 - https://leighfinch.net/2023/11/28/mastering-python-networking-review/

I came across Mastering Python Networking by Eric Chou about a month ago on Twitter and immediately purchased it. I was excited to see a book on programming targeted at people with a networking background, as the ability to automate becomes critical to scaling networks and reducing toil.

To say I’m a fan of this book is an understatement! I’d been expecting topics like programming with pexpect and using common APIs, but what we got was far more detailed than I could have hoped for, with deep insights spanning the background of TCP right through to building custom APIs, observability, and automating cloud networking.

Chou ramps the topics up gradually, building on each chapter, so that the learning curve for each topic is gentle enough that even someone with no Python experience could be writing basic scripts to automate network configuration changes within the first couple of chapters.
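
To give a flavour of those early chapters, here is a minimal sketch in their spirit using pexpect, one of the libraries the book covers. This is my own illustration rather than code from the book: the host, credentials, prompt strings, and command are placeholders to adjust for your own devices.

import pexpect

# Spawn an SSH session to a (hypothetical) router; host, username,
# password, and prompt strings below are placeholders.
child = pexpect.spawn("ssh admin@192.168.1.1", timeout=10)
child.expect("assword:")           # matches 'Password:' or 'password:'
child.sendline("my-secret")        # placeholder credential
child.expect("#")                  # wait for the privileged prompt
child.sendline("show ip interface brief")
child.expect("#")                  # wait for the output to finish
print(child.before.decode())       # everything printed before the prompt
child.sendline("exit")
child.close()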

The use of Ansible to automate configuration baselining, provisioning, and changes with a scalable methodology is covered in detail, and has been well received not only by me but also raved about on Twitter. If you’re not using Ansible (or similar) you will eventually come across it, and this book gives you ready-to-run playbooks that will accelerate your adoption.

With my background in Observability I was pleasantly surprised that 2.5 chapters had been dedicated to the topic from multiple perspectives:

  1. Telemetry configuration pushes
  2. Receiving and decoding the telemetry
  3. Extending existing tools like NTOP and Cacti

Graphing and visualisation are an important part of making data consumable to multiple audiences, and the introduction and practical examples of the popular Matplotlib and PyGraphviz libraries were on point.

Packet decoding and crafting libraries like Scapy are introduced, and again the practical examples make relatively complex concepts, like writing a network scanning tool, easy to digest. I’ve used Scapy in the past to build custom protocol implementations, and I wish I’d had this book then.
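
As a taste of that material, here is a minimal SYN-scanner sketch using Scapy. Again, this is my own illustration rather than the book’s code; the target address and ports are placeholders, and you should only scan hosts you are authorised to test.

from scapy.all import IP, TCP, sr1, send

target = "192.168.0.10"                  # placeholder lab host
for port in (22, 80, 443):
    # Send a single SYN and wait briefly for a response
    reply = sr1(IP(dst=target)/TCP(dport=port, flags="S"),
                timeout=1, verbose=0)
    if reply is None or not reply.haslayer(TCP):
        print(f"{target}:{port} filtered or no response")
        continue
    flags = int(reply[TCP].flags)
    if flags & 0x12 == 0x12:             # SYN+ACK: port is open
        print(f"{target}:{port} open")
        # Tear down the half-open connection politely
        send(IP(dst=target)/TCP(dport=port, flags="R"), verbose=0)
    elif flags & 0x04:                   # RST: port is closed
        print(f"{target}:{port} closed")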

Today most of my research centres around using a Python framework called Mininet, and while Mininet is not covered, I would recommend this book to anyone looking to learn modelling and simulations using Python.

The future of networking is not Network Engineers logging into individual devices and running commands. This book is a primer for the network engineering community looking to scale, and conversely for programmers looking to understand how to automate networking tasks.

Topics:

  1. Review of TCP/IP Protocol Suite and Python
  2. Low-Level Network Device Interactions
  3. APIs and Intent-Driven Networking
  4. The Python Automation Framework – Ansible
  5. Docker Containers for Network Engineers
  6. Network Security with Python
  7. Network Monitoring with Python – Part 1
  8. Network Monitoring with Python – Part 2
  9. Building Network Web Services with Python
  10. Introduction to AsyncIO
  11. AWS Cloud Networking
  12. Azure Cloud Networking

If you enjoyed this article, pick up a copy of this book to support us.

Practical OpenTelemetry Review
Tue, 24 Oct 2023 - https://leighfinch.net/2023/10/24/practical-opentelemetry-review/

OpenTelemetry is something I’ve been watching for a while now and reviewing Practical OpenTelemetry was the perfect excuse to dive deeper. In my lab I’ve been running a project called SigNoz for which I’m writing up a companion article to this one to show some practical examples.

When I first looked into OpenTelemetry about 4 years ago, I was primarily focused on the Tracing aspects, such as how OpenTelemetry could replace vendor-specific code profilers. I concluded that vendor-specific profilers won’t necessarily be replaced by OpenTelemetry, but can be augmented by it. Most existing APM Observability vendors are looking at integrating with OpenTelemetry to reach languages and systems they previously hadn’t invested in, such as those focused on Java and .NET now having the ability to include data from PHP, Python, and NodeJS (to name a few).

OpenTelemetry provides a standard and software to instrument many different things using Traces, Logs, and Metrics (converging the Events of the MELT framework across the other telemetry types). Using a standard for instrumenting applications and systems means that you can switch backends relatively easily, so less time is spent on the mechanics of instrumentation and more on the visualisation and analysis of the instrumented systems.

Practical OpenTelemetry is an easy read designed for developers, DevOps, SRE, and Observability practitioners to get familiar with where OpenTelemetry fits in their ecosystem. The practical aspect includes reference architectures with Prometheus, Elastic, and Kafka, as well as examples in Java which are simple enough that anyone with programming experience should be able to port them to their language of choice.

To achieve that, it starts with an analysis of why OpenTelemetry is needed, focusing on strengths like openness, standards, and Observability outcomes (MTTRes, MTTI, MTTK, MTTF). This is my favourite style of introduction because it explains the need before throwing solutions at the problem.

The second part of the book breaks down what OpenTelemetry is (and isn’t), focusing on the telemetry types and how they are ingested (Traces, Metrics, and Logs), the OTLP protocol, the role of the collector, and schemas and conventions. This is where the concept of Spans becomes important. A Span is a unit of work that is done; that unit of work may be composed of multiple sub-Spans, with each Span having a context and attributes. A Span could be a Trace with a defined entry point such as a webpage, or something manually defined and coded within the application itself. See the below image of SigNoz representing a series of Spans.
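
The book’s examples are in Java, but the same Span model is easy to see in the OpenTelemetry Python SDK. A minimal sketch of my own, exporting spans to the console; swap in an OTLP exporter to send them to a collector or a backend such as SigNoz.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire the SDK to print each finished span to stdout
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A parent Span representing a unit of work, with a nested sub-Span
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.route", "/checkout")   # illustrative attribute
    with tracer.start_as_current_span("query-db"):
        pass  # the sub-unit of work would happen here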

The book then goes on to describe Tracing instrumentation styles, such as auto instrumentation vs manual, as well as local vs distributed Tracing. Distributed Tracing is the ability to correlate Spans across systems, and it is incredibly important as it allows you to create real-time service maps and see how an individual transaction performed over multiple systems.

The chapters on Metrics and Logs cover how each is implemented in OpenTelemetry. I liked how the author was able to discuss the convergence of logs and tracing, and why it won’t happen overnight.

Blanco finishes the book with adoption and institutionalisation, covering the challenges of brownfield vs greenfield environments. This is especially valuable, as relevancy and compatibility with other systems with overlapping capabilities is a constant battle for enterprises.

Practical OpenTelemetry on Amazon.

The CrUX Of Web Performance
Tue, 17 Oct 2023 - https://leighfinch.net/2023/10/17/the-crux-of-web-performance/

The world of web performance is tougher than most internal applications because you are dealing with third parties that you don’t control: the Internet, browsers (with a thousand tabs open), endpoints (with agents running), and poor connectivity in the form of cellular/wireless/consumer-grade connections. This means that we need to be able to understand and manage what is within our control.

This article is tangential to SEO (Search Engine Optimisation); however, it is focused on User Experience rather than SEO aspects.

I like to think about performance using the point of consumption and the point of distribution when looking at a network like a WAN or the Internet serving applications/services/web pages to consumers over some latency. What we usually control is the point of distribution of an application (which could be a load balancer, Application Delivery Controller, or reverse proxy such as Apache, NGINX, or Varnish).

The point of consumption (and everything in between) in an Internet delivered application is something we may have little control over. If we are lucky it might be a corporate managed device where tools like End User Experience Management (EUEM) agents might be installed.

If we go back to the point of distribution for the application, we need to ensure that it is operating as optimally as possible in terms of server response time and transport performance (e.g. TCP and QUIC). But we also need to consider the rendering at the point of consumption which is where Core Web Vitals and CrUX come into play.

Core Web Vitals is an initiative that looks at the performance of a web page from the perspective of the user, to improve User Experience. CWV is made up of the following meaningful metrics:

Largest Contentful Paint

Largest contentful paint looks at the time it takes to draw the largest piece of content on the page, which could be an image, video, or other visible object in the browser. LCP is favoured over earlier metrics such as page load time.

Cumulative Layout Shift

Cumulative layout shift looks at how often the layout of a page shifts during its lifetime. The more things move around, the poorer the user experience and the higher (worse) the CLS score. Think about pages that pop up an advertisement covering the content as you scroll through an article, or buttons and links that move around as the page renders.

First Input Delay

First input delay measures how long it takes from a user’s first interaction with the page to when the browser’s event handlers can begin processing it. This could be how long it takes to be able to type in a form field, or to click a button.

Interaction to Next Paint

Interaction to next paint is a pending (at the time of writing, October 2023) metric for Core Web Vitals. INP is how long it takes from an interaction, such as a click, to the rendering of the next object, such as a drop-down box, login form, or even the next page. This can be considered a measure of how responsive a page is to interaction. Everyone I know has had a poor experience with a web application that is unresponsive to clicks, resulting in them clicking again and again, adding to the queue of interactions to be processed.

CrUX (Chrome User eXperience) is the service for Chrome that reports these metrics back to Google for the purposes of benchmarking and baselining websites. Arguably it could be used to influence a website’s ranking in search results, but this may just be because Google likes to rank websites with better user experience higher. There are caveats in CrUX reporting (a sketch for querying the CrUX API directly follows the list below):

  1. The website has to be popular enough.
  2. CrUX data is only collected by Chrome.
  3. The user has to have opted in to metric collection.
  4. Other Geographical and privacy reasons.
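
If your site does clear those hurdles, the collected field data can be queried programmatically. A minimal sketch, assuming a Google API key with the CrUX API enabled; the origin, metric list, and response handling are illustrative.

import requests

API_KEY = "YOUR_API_KEY"   # placeholder: a Google Cloud key with the CrUX API enabled

resp = requests.post(
    "https://chromeuserexperience.googleapis.com/v1/records:queryRecord",
    params={"key": API_KEY},
    json={
        "origin": "https://example.com",   # the site to look up
        "metrics": ["largest_contentful_paint", "cumulative_layout_shift"],
    },
    timeout=10,
)
resp.raise_for_status()

# p75 is the value Google uses when assessing Core Web Vitals
for name, data in resp.json().get("record", {}).get("metrics", {}).items():
    print(name, data.get("percentiles", {}).get("p75"))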

So what do you do if there isn’t enough data coming in, or you want more detailed analysis? JavaScript snippets like those provided by Blue Triangle and APM vendors can provide detailed user experience data from any browser that interacts with your page. OpenTelemetry can also be used to collect metrics via its JavaScript package.

Some Interesting Tools

Here are some tools that can help you with your End User Experience Management:

  1. Google Search Console (https://developers.google.com/search/) for monitoring web page performance over time.
  2. PageSpeed Insights (https://pagespeed.web.dev/analysis/) tests website performance, SEO, and accessibility, with recommendations.
  3. Google Lighthouse. Embedded in the Chrome developer tools, it provides performance, SEO, and CWV details.

Thanks for reading this quick introduction to Web User Performance. If you made it this far please consider buying a book in the Book Recommendations Page.

XDP: Your eBPF Packet Processing Introduction!
Mon, 09 Oct 2023 - https://leighfinch.net/2023/10/09/xdp-your-ebpf-packet-processing-introduction/

I want to let you in on why I think XDP (eXpress Data Path) and eBPF are so awesome and will change the game when it comes to security, routing, and application delivery.

Around ten years ago a new technology called DPDK (Data Plane Development Kit) was created by Intel to enable people like you and me to create network applications (firewalls, switches, routers, load balancers, etc.) in user land, bypassing the host’s kernel altogether. The benefit of this is that you are not bound by the host’s general-purpose network stack, which is very cool because it allows you to write complex packet workflows in an optimised way.

Programmability of the Linux kernel has been a goal of eBPF (extended Berkeley Packet Filter), removing the need to create tightly coupled kernel modules. Getting a change accepted by the Linux kernel team and then adopted by Linux distributions can take years. eBPF can instead be used to create code that runs in the kernel to observe or modify behaviour in real time.

XDP provides a way for network applications to operate safely within the kernel, prior to packets being processed by the host’s networking stack. In the case of Cilium (an open source eBPF Kubernetes networking, security, and observability platform), we can create a load balancer that bypasses the need for kube-proxy and cloud load balancers.


How is XDP Different to DPDK?

DPDK and XDP have some overlap in function, as both are used for high-performance packet processing in network applications; however, their modes of operation and capabilities are quite different.

Operation

XDP is made up of both kernel-land and user-land components. Using Clang we can compile our C code targeting the BPF format, then load it into the kernel using ip link set commands. This code can then modify the packet contents and return one of five actions (a minimal load-and-attach sketch using BCC follows the list below).

  1. XDP_ABORTED – Error condition and drop the packet.
  2. XDP_DROP – Drop the packet.
  3. XDP_PASS – Allow the packet to continue to the kernel.
  4. XDP_TX – Transmit the packet out the interface it was received.
  5. XDP_REDIRECT – Transmit the packet out another interface or to a user land application leveraging AF_XDP.
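
Here is the minimal load-and-attach sketch referenced above, using BCC’s Python bindings instead of invoking Clang and ip link by hand. It is an illustration rather than production code: it needs root and the bcc package, the interface name is a placeholder, and it simply counts packets before letting them continue with XDP_PASS.

from bcc import BPF
import time

# Kernel-side program in C, compiled by BCC at runtime: count every
# packet seen on the interface, then hand it to the normal stack.
prog = r"""
BPF_ARRAY(pkt_count, u64, 1);

int xdp_counter(struct xdp_md *ctx) {
    u32 key = 0;
    u64 *value = pkt_count.lookup(&key);
    if (value)
        __sync_fetch_and_add(value, 1);
    return XDP_PASS;
}
"""

device = "eth0"                          # placeholder interface name
b = BPF(text=prog)
fn = b.load_func("xdp_counter", BPF.XDP)
b.attach_xdp(device, fn, 0)

try:
    time.sleep(10)                       # let some traffic flow
    for _, value in b["pkt_count"].items():
        print("packets seen:", value.value)
finally:
    b.remove_xdp(device, 0)              # always detach on exit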

In contrast, DPDK requires that the NIC (Network Interface Card) supports DPDK, and punts traffic straight to a user-land application. This means the receiving application must know how to process the packet, rather than relying on simple drop, pass, transmit style actions.

Use Cases

XDP is tightly linked to the operating system kernel (both Linux and Windows) and is generally used in packet load balancing, observability, routing, and security applications (DDoS scrubbing, IPS, firewalls). Additionally, you can pass data to user land for the purposes of observing it.

DPDK is generally used for NFV (Network Function Virtualisation) purposes, such as creating a networking application like a WAN optimiser. The ability to have the entire application running in user land means incredible flexibility.

Further Reading

In the next post on XDP I will create a fully functioning XDP kernel and user land application to observe traffic prior to the kernel processing the packet.

  1. Check out my review of Learning eBPF which includes a chapter on XDP.
  2. Check out my review of BPF Performance Tools.
  3. https://www.kernel.org/doc/html/latest/networking/af_xdp.html
  4. Cilium XDP Documentation

SaaS to on-prem and back again! Interview With a Professional Services Director
Sat, 07 Oct 2023 - https://leighfinch.net/2023/10/07/saas-to-on-prem-and-back-again-interview-with-a-professional-services-director/

Earlier this week I met with a good friend of mine (who we’ll call Martin) who works as a Director of Professional Services at a global enterprise software company. We’d been discussing a recent article I wrote on Cloud Migrations in Reverse, and why he had seen some of his customers migrate from their SaaS offering back to their on-premises products.

For the last ten or so years, traditional enterprise software companies have been working with their customers towards their SaaS offerings. The biggest reason for software vendors to do this is that it is easier to support customers when the vendor also manages the environment. The strategy for Martin has been to retrain his professional services consultants to become (Technical) Customer Success Managers. The nature of SaaS means that customers can churn at any point before the customer lifetime value exceeds the cost to acquire, so CSMs work with customers to achieve and maintain value while the as-a-service offering is in play.

Martin explained to me that his company’s SaaS solution is a win-win for customers and vendors. He also sees a trend of certain types of customers reversing their decision, or indefinitely delaying go-live, for three main reasons:

  1. Poor planning
  2. Changes in user experience
  3. Data sovereignty and legal challenges

Poor Planning

I think we could consider all three of these as poor planning, but what Martin meant by this was that the customer assumes that migrating is an easy task, and neglects to consult across teams, including LOB (Line Of Business) heads, or to bring in professional help to perform the migration. The outcome can be data that is not migrated correctly, and data inconsistencies.

Changes in User Experience

Changes resulting from where the application is hosted, or insufficient user training for newer releases, can leave users with a poor user experience. I’ve talked a lot about how latency can really impact application performance in Performance Diagnostics Part 1.

Data Sovereignty and Legal Challenges

Martin outlined that some of his customers moved to their SaaS offering and found that they ran into problems with Personally Identifiable Information (PII) being stored in localities outside the locations specified by legislation. Not every SaaS provider has a local tenancy or can guarantee that data will stay in a specific location. Sometimes the law, and people’s interpretations of the law, can lag behind technology.

Martin did say that almost all of his customers still see SaaS as their eventual end state, and his Professional Services team is busy advising and working with customers to achieve that.

Thanks for reading. If you enjoyed this article you can buy me a coffee or buy yourself a book.

Performance Diagnostics Part 5 - Optimising Worker Threads
Fri, 29 Sep 2023 - https://leighfinch.net/2023/09/29/performance-diagnostics-part-5-optimising-worker-threads/

Background

A few (ok, many) years ago I was working with a customer who was launching a new application and expecting high load at launch, which for whatever reason was at 3pm on a Friday (1). As 3pm hit and the load balancers started directing traffic to the now-production system, there was a panic as synthetic tests started to time out. This was not good: users were unable to interact with the web application, or if they could, it was dead slow (minutes to load a page).

Using observability tooling, I was quickly able to see that they had run out of worker threads and that requests were being queued beyond the timeout of the synthetic agent. The fix was simple: increase the number of worker threads to enable requests to be handled in parallel rather than waiting for a thread to become free.

The increase from 25 to 100 threads immediately brought the responsiveness of the application back to within the SLA that the application team had promised to the business.

So why did I recommend increasing the number of threads from 25 to 100?

If you’ve ever managed a webserver and seen the max connections or worker threads settings, you might be tempted to think that bigger is better. But there are a number of factors that need to be considered before blindly increasing the number of threads.

When things start to become “slow”, as an Observability and Digital Performance expert I need to consider the type of workload, the utilisation of resources (such as CPU, memory, and storage I/O), and errors/events that might be occurring. I will then leverage APM traces to understand where time is being spent in the code, or even in the application server.

In this case all threads were being consumed, but not all CPU cores were. This led me to start looking at traces, and what I saw was that the actual application response time was quick: when a request actually reached application code, it executed very quickly. The time was being spent in the application server (Tomcat in this case), which was queueing requests but unable to have the thread pool execute them quickly.


So the code executes quickly, but requests are held in a queue and time out. This means we need a way to increase the number of requests being executed simultaneously, with the side effect that each request takes slightly longer to execute. If we have an equal number of workers to CPU cores, a single thread can have effectively uncontended access to a CPU core; if we increase the number of threads beyond the number of cores, we have to rely on the operating system scheduler to share the cores between threads.

Additionally, as we increase the number of worker threads, we also increase the likelihood of issues relating to concurrency (locks, race conditions), as the increased number of threads will also take longer to execute their workload.
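
You can demonstrate this trade-off on a single machine with a toy worker pool. A minimal sketch of my own, using a CPU-bound task similar to the prime-number workload in the tests below; processes stand in for worker threads so the measurement is not distorted by Python’s GIL.

import time
from concurrent.futures import ProcessPoolExecutor

def handle_request(_):
    # Stand-in for one request's work: find the first 1000 primes
    primes, candidate = [], 2
    while len(primes) < 1000:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes[-1]

def run(workers, requests=10):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(handle_request, range(requests)))
    total = time.perf_counter() - start
    print(f"{workers} workers: {total:.2f}s for {requests} requests")

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):   # compare against your physical core count
        run(workers)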


Using NGINX as an example: it recommends setting the number of workers to the number of cores, or auto if in doubt (2). I’m going to use a benchmarking tool called Apache Benchmark (ab) against a webserver that has two cores and two workers, with each request calculating the first 1000 prime numbers.

Test 1 – 1 Concurrent Request

In this test we have two worker threads and one concurrent request. We see that the mean response time is 620ms. Not bad, with the total time to process ten requests at 6.197 seconds.

root@client:~# ab -n 10 -c 1 http://192.168.20.17/index.php
Time taken for tests:   6.197 seconds
Requests per second:    1.61 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   606  620  37.7    608     727
Waiting:      605  620  37.7    608     727
Total:        606  620  37.7    608     727

Test 2 – 2 Concurrent Requests

In this test we have two worker threads and two concurrent requests. We see that the mean response time is 624ms. Pretty comparable to the previous test; however, the total test time was reduced to 3.7 seconds.

root@client:~# ab -n 10 -c 2 http://192.168.20.17/index.php
Time taken for tests:   3.748 seconds
Requests per second:    2.67 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   607  624  12.4    624     652
Waiting:      607  624  12.3    624     652
Total:        607  624  12.5    625     652

Test 3 – 4 Concurrent Requests

In this test we have two worker threads and four concurrent requests. We see that the mean response time increased to 1162ms, roughly double the request duration; however, the total time taken to serve the ten requests was almost the same as test two, at 3.8 seconds.

Doubling the number of concurrent requests to 8 shows that the response time increase is roughly linear.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.821 seconds
Requests per second:    2.62 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   691 1162 319.7   1205    1748
Waiting:      691 1160 317.0   1205    1737
Total:        691 1162 319.6   1205    1748

Test 4 – 4 Concurrent Requests and 4 Workers

This test oversubscribes the number of worker threads to CPU cores by a factor of two, relying on the operating system scheduler to load balance the requests.

The performance was comparable to test three (slightly worse, by ~100ms), relying on the OS scheduler to load balance across the two cores.

root@client:~# ab -n 10 -c 4 http://192.168.20.17/index.php
Time taken for tests:   3.978 seconds
Requests per second:    2.51 [#/sec] (mean)
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   621 1205 304.1   1280    1483
Waiting:      621 1205 304.2   1280    1483
Total:        621 1205 304.1   1280    1483

Conclusion

Overall the best performance was two workers with two concurrent requests, lining up with the general advice of an equal number of workers to cores; however, this workload (prime number generation) fully utilises the CPU core while it runs. Other workloads will require less CPU time whilst waiting on dependencies (e.g. DB calls), which means that over-subscribing worker threads will improve results. So, like everything in IT, the correct value is: “it depends” and “bigger is not necessarily better“.

If you made it this far, thanks for reading. Check out the book section for interesting books on Observability.

  1. This is the best way to ruin a weekend for your hard working staff. Read only Fridays make for happy engineers.
  2. https://nginx.org/en/docs/ngx_core_module.html#worker_processes

Learning eBPF Review
Wed, 27 Sep 2023 - https://leighfinch.net/2023/09/27/learning-ebpf-review/

What makes Learning eBPF different to BPF Performance Tools (which I wrote about recently) is that it moves beyond the Observability and performance lens towards security and modifying behaviour inside the Linux kernel. The author, Liz Rice, is the Chief Open Source Officer at Isovalent and recently presented at the eBPF Virtual Summit in September 2023. She has a lot of material available online, and I’ll provide some resources towards the bottom of the article.

This book introduces eBPF in a consumable way, discussing its history and how it became a vehicle to inspect and create new kernel capabilities without needing to create a kernel module (tied to a specific build or API) or have the code agreed upon by the community and adopted by distributions. Additionally, we learn how eBPF code is checked for safety prior to running, reducing the risk of a kernel crash in production.

As a reader I enjoyed the use of C and Python to illustrate practical examples of events being triggered (such as a packet arriving on an interface) and data being read into a program in user space. 
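
In that spirit, here is a minimal sketch of the pattern the book teaches: a small C program compiled and loaded from Python with BCC, firing on an event and feeding data back to user space. It is my own illustration, needs root and the bcc package, and the probe plumbing may vary across kernel versions.

from bcc import BPF

# Kernel-side C: fires each time the execve syscall is entered and
# writes a line into the shared trace buffer.
prog = r"""
int trace_exec(void *ctx) {
    bpf_trace_printk("new process executed\n");
    return 0;
}
"""

b = BPF(text=prog)
# get_syscall_fnname resolves the per-architecture symbol for execve
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")

print("Tracing execve... Ctrl-C to stop")
while True:
    try:
        # User-space side: read events from the kernel as they arrive
        task, pid, cpu, flags, ts, msg = b.trace_fields()
        print(f"{ts} pid={pid} comm={task.decode()} {msg.decode()}")
    except KeyboardInterrupt:
        break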

The hardest thing to get your head around is the different components that pull eBPF together. The author makes this easy with examples of which code runs in user space, and which code is first compiled to bytecode and then JIT-compiled or assembled into machine code for execution.

The eBPF for networking chapters describe newer features such as XDP (eXpress Data Path), showing how we can create routers, firewalls, and load balancers (especially in a Kubernetes context) that bypass typical kernel behaviour. Examples are discussed, including how CloudFlare and Facebook have used this capability in production.

The examples and working code are provided and you can download them in the resources below. If you’re interested in the next generation of Observability and Kernel modifications, please get yourself a copy of this book.


Resources

Performance Diagnostics Part 3 — Latency beyond Ping
Sun, 24 Sep 2023 - https://leighfinch.net/2023/09/24/performance-diagnostics-part-3-latency-beyond-ping/

I write on my personal time. Feel free to buy me a coffee or buy a copy of TCP/IP Illustrated Volume 1 to learn more about the protocols that run the internet. Check out Wireshark Network Analysis for an awesome hands-on guide to wireshark.

Network teams often use ICMP as a mechanism to determine the latency (propagation delay etc.) and reachability between two endpoints using the trusty Ping utility. Ping appeared in late 1983, created by Mike Muuss while working at the US Ballistic Research Laboratory. What was also interesting about 1983 is that it was the year the US military converged on IP (and TCP), mandating it for any system connected to the ARPANET, making Ping one of the oldest IP applications still in use today.

The name PING (Packet InterNet Groper) is a backronym referencing the sonar process used by submarines and other watercraft (as well as in nature), which makes sense when you are trying to measure latency between nodes.

Ping uses ICMP (Internet Control Message Protocol, echo (type 8)/echo-reply (type 0)) to communicate between nodes and is even mandated for internet-connected hosts in the historic RFC 1122 Requirements for Internet Hosts — Communication Layers (released in 1989). This RFC is well worth a read to understand what was happening with IP and TCP in the aftermath of the congestion collapse events of the mid 1980s.

The problem with using Ping and ICMP as a measure of latency is that it is frequently blocked or placed in scavenger queues, which distorts the latency detected (adding to it or making it appear jittery) and does not reflect the actual latency experienced by applications. The lack of ICMP prioritisation makes sense: we want the actual user traffic coming through and processed at endpoints with a higher priority than our monitoring traffic. Secondly, Ping is usually run at intervals (e.g. every 5 minutes), which means that we won’t be able to spot events between polling intervals.

This may have been good enough when we used IP networks for non-realtime applications (email, web browsing, etc.) where changes in latency and drops are not as important, but in the 2000s we started using IP SLA to inject synthetic traffic between two devices that support IP SLA and report on metrics like jitter and latency for the class of service or QoS markings desired. This was a good step forward, as now we understand how real traffic would perform while the IP SLA runs. It is (usually) run at intervals, which means we still have gaps in our visibility. The good reason for using IP SLA (and other synthetics) is that traffic is generated even when users are generating none. A lot of vendors take this approach with their observability stacks, but it still leaves a gap between intervals and doesn’t necessarily reflect a user’s experience.

We can also monitor latency passively using captured TCP packets between nodes. NPM platforms like Alluvio AppResponse do this at large scale, but we can also do it using Wireshark or tcpdump for ad-hoc reporting. The best bit is that we can now see the latency between any nodes that we can collect traffic between, which has two big benefits:

  1. We have every connection.
  2. It is all passive.

Using Wireshark, we will look at how this is possible. I’ve talked about how TCP operates below the application layer and that an application has only a limited ability to influence socket behaviour. The OS kernel handles traffic acknowledgement, which has a very high priority in the operating system scheduler, so we can essentially treat the TCP stack delay as negligible (unless it is behaving erratically, which is a sign that the endpoint is over-subscribed).

The two TCP behaviours we will use to understand the latency of the connection are the 3-way handshake, and the TCP time to ACK two full segments.

Method 1 – The 3-way handshake

The 3-way handshake (also called connection setup time) is the process used to establish a reliable connection between two nodes. It involves the TCP client sending a specially marked segment called a SYN, the server responding with another specially marked segment called a SYN-ACK, followed by the client sending an ACK (with or without data). The delta between the SYN and the ACK, collected anywhere between the nodes, gives us the latency between the two nodes.

In this case we have a latency of 105ms between the SYN and the ACK. I’ve set the propagation delay of the backbone of this network to 100ms, which, after we add the small overhead of socket creation on the server, puts us very much right on the latency. I deliberately chose a capture that was not on the client or the server to show that this can be applied anywhere on the path.

We can also see this value of 105ms in each subsequent packet stored in the tcp.analysis.initial_rtt variable.
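
Method 1 can also be scripted for ad-hoc reporting. A minimal sketch of my own using Scapy to pull handshake RTTs out of a capture file; the filename is a placeholder, and it assumes the capture point saw both the client’s SYN and its final ACK.

from scapy.all import rdpcap, IP, TCP

syn_times = {}   # (client, cport, server, sport) -> time the SYN was seen

for pkt in rdpcap("capture.pcap"):       # placeholder capture file
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    ip, tcp = pkt[IP], pkt[TCP]
    flags = int(tcp.flags)
    key = (ip.src, tcp.sport, ip.dst, tcp.dport)
    if flags & 0x02 and not flags & 0x10:            # pure SYN: handshake starts
        syn_times[key] = pkt.time
    elif flags & 0x10 and not flags & 0x02 and key in syn_times:
        # First ACK from the client after its SYN: handshake complete
        rtt_ms = float(pkt.time - syn_times.pop(key)) * 1000
        print(f"{ip.src}:{tcp.sport} -> {ip.dst}:{tcp.dport} "
              f"handshake RTT {rtt_ms:.1f} ms")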

Method 2 – Time to ACK

We know from RFC 1122 that we should see an ACK for at least every two full-sized segments without delay, or after a packet marked with PSH. This behaviour is not impacted by the application’s ability to process the data, and is solely the responsibility of the TCP stack in play. This method is best used close to the client (otherwise some additional math is required).

We can even graph this in Wireshark using the Round Trip Time option in the TCP Stream Graphs menu. You will also note some spikes in acknowledgements at and over 200ms; this is a topic that will be discussed in another article.

I like to add it as a column in the packet list when troubleshooting TCP performance.

Using TCP to monitor latency has significant advantages over synthetics.

If you made it this far, thanks for reading. Feel free to buy me a coffee or buy a copy of TCP/IP Illustrated Volume 1 to learn more about the protocols that run the internet. Check out Wireshark Network Analysis for an awesome hands-on guide to wireshark.

Why Are We Seeing Cloud Migrations in Reverse?
Sat, 23 Sep 2023 - https://leighfinch.net/2023/09/23/why-are-we-seeing-cloud-migrations-in-reverse/

I’ve always loved using excess or old computers and network infrastructure to lab things up or run a PoC for an application or service. I still have a Dell R710 and HP ML10v2 I use to run services like Home Assistant and various observability tools for testing. 

So what does this have to do with Cloud migrations? I can’t afford to run my labs in IaaS for long periods of time. But this is the easy answer to a phenomenon that seems to be happening more and more. Let’s take a look.


When cloud services started becoming a thing in the mid-to-late 2000s there were a few options available: storage, compute, and some XaaS offerings like unified communications. The promise of the cloud was redundancy, scalability, reduced reliance on employing specialists, and OpEx spend as opposed to CapEx spend. Why would a legal firm need a large IT team and its own equipment when it can outsource to the experts?

This sounds pretty attractive to technology leaders because they no longer need to own assets, their related service contracts, renewals, and also reduce the number of staff required to manage their technology investment. If the company had a launch or other event where scale was needed, they could simply scale their services temporarily. 

I’d personally leveraged hosted dedicated servers in the US and Australia, primarily for reliability and high bandwidth for my personal projects, and recommended this to my customers at the time as a way to avoid the risk of localised power and connectivity issues for those hosting out of makeshift datacentres or, worse, basements.


As technology evolved through the 2010s, we saw cloud native technology starting to appear, with the promise of autoscaling and automating the build of applications and microservices. Many organisations jumped on this and migrated some of their applications to become cloud native. Others did not; they simply took their on-premises architecture and moved it into cloud service providers like AWS, GCP, and Azure.

Without application transformation, the cost of running these legacy architectures can become extraordinarily expensive over the long term. Investing developer resources into transforming a ‘good enough’ application on a legacy architecture doesn’t make sense, and neither does the cost of running it unchanged in the cloud.

Some organisations simply took their legacy applications and placed them into containers, which then carry the overhead of running the service container. Containerisation has the benefit of abstracting the service from the underlying platform, but there has to be some re-think of how the application works.

Put simply, if you are building something new, build it cloud native. If it is legacy and you have no intent on transforming the application to run on a cloud native architecture, you may not realise the expected savings long term.


When organisations go through the process of troubleshooting a complex application performance issue, they will often go for quick wins:

  1. Increase bandwidth
  2. Add CPU cores
  3. Add Memory
  4. War room?

These changes are never reversed and these costs add up quickly. Nothing is more permanent than a temporary fix.

Additionally, it can be cheaper to prototype solutions internally on excess and legacy hardware; testing in the cloud incurs ongoing costs that would not otherwise be seen. Hands up if you’ve ever left something running in the cloud by accident.

I think we will continue to see hybrid environments. If I was starting a business today, I would certainly still run a hybrid environment where prototyping could be done on-prem, with the majority of anything built would be cloud native.

If you got this far, thanks for reading. Feel free to buy a book from here!

BPF Performance Tools Review
Thu, 21 Sep 2023 - https://leighfinch.net/2023/09/21/bpf-performance-tools-review/

BPF Performance Tools is the kind of book an observability specialist picks up thinking it will make a good reference for the library, and then reads the whole thing cover to cover.

Brendan Gregg, formerly of Netflix, has contributed significantly to the world of observability, and uses his experience in troubleshooting and tracing some of the most interesting problems any of us are likely to come across.

So what is BPF? Those of us in the Unix, Linux, and BSD world will likely say Berkeley Packet Filter, and to be fair, this was the case. BPF was originally created to allow users to create filters for tcpdump to monitor selected network traffic and either send it to a PCAP or display it on screen. This was useful when troubleshooting what was happening on the “wire” as opposed to what people think is going over the wire. I’ve used this to troubleshoot everything from port security and voice over IP issues to performance analysis. The phrase “PCAP, or it didn’t happen” exists for a reason.

BPF has moved away from being just an acronym to the name of a feature, sometimes referred to as eBPF (extended BPF), which now allows us to trace virtually anything that happens inside the Linux kernel. This could be performance related, security related, or even modifying the behaviour of the kernel altogether. Load balancers and firewalls have been created in BPF. I’ve even started building a congestion control algorithm leveraging BPF. The possibilities here are endless: you can now write kernel-safe code to be run in the kernel, with information being fed up to user land through maps.

This book, however, focuses on the performance aspects of BPF using tracing. The difference between tracing and logs is the ability to trace events in real time without relying on pre-existing logs written without context. I could, for example, trace every socket accept event from every application and process on my machine, or trace server response times, or the amount of time spent in a particular state.
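
To make that concrete, here is a minimal sketch of that socket-accept example in the BCC style the book uses. It is my own illustration rather than code from the book: a kretprobe on inet_csk_accept counts accepted TCP connections per process; it needs root and the bcc package, and helper names may differ across bcc versions.

from bcc import BPF
import time

# Kernel-side C: count returns from inet_csk_accept (the function
# behind TCP accept) per process ID.
prog = r"""
BPF_HASH(accepts, u32, u64);

int kretprobe__inet_csk_accept(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count = accepts.lookup_or_try_init(&pid, &zero);
    if (count)
        (*count)++;
    return 0;
}
"""

b = BPF(text=prog)   # the kretprobe__ prefix makes BCC auto-attach the probe
print("Counting socket accepts per PID... Ctrl-C to stop")
try:
    while True:
        time.sleep(5)
        for pid, count in b["accepts"].items():
            print(f"pid {pid.value}: {count.value} accepts")
except KeyboardInterrupt:
    pass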

What I particularly liked was how the author broke down performance into specific domains, including disk I/O, network I/O, and applications, and showed real examples of BPF in action via BCC (BPF Compiler Collection) and the BPF APIs. There were one-liners on practically every aspect of Linux performance I would want to query.

Perhaps most importantly the author compared and contrasted traditional tools we would use with the BCC approach in one line! This has completely changed the way I plan to approach performance troubleshooting in Linux.

If you ever thought to yourself, ‘why is that so slow?’ this book is for you. Grab a copy here!

If you made it this far, thanks for reading.
