Performance Diagnostics Part 2 — Revenge of the OSI Model

Leigh Finch — Wed, 13 Sep 2023 00:12:25 +0000

Continuing on from the previous article, where I discussed an amalgamation of performance diagnostics with fat client applications, I thought it was a good time to go back to computer science 101, where we were introduced to the OSI model and the TCP/IP model. Both are models that some architectures and platforms more or less implement, but the former (the OSI model) allows us to break problems down into something easy to consume and definitively identify where performance problems lie.

Layer 1 — Physical
Layer 2 — Data Link
Layer 3 — Network
Layer 4 — Transport
Layer 5 — Session
Layer 6 — Presentation
Layer 7 — Application

I’ve found throughout my travels that most technologists’ understanding of the model is murky at best, and I strongly feel that understanding it is a key part of identifying problems, and the ownership of those problems, within IT Ops, engineering, SRE, and DevOps. This is further obfuscated by the evolution of compute and network architectures, where we have multiple layers of virtualisation and shared infrastructure.

I had a situation where I was called in to investigate a ‘customer network problem’, as called out by a third-party vendor that hosted a Citrix farm in a cloud provider. The farm was used to host a fat client application connecting directly to a database in the same cloud network. I asked for an example of the type of errors they were seeing. To my surprise, they sent me a screenshot of the fat client (inside the Citrix session) displaying a database deadlock error… So not only was this not a customer network problem, it was also at the wrong layer of the OSI stack?!?

A word of caution on mean-time-to-innocence thinking. In this case the ability to resolve the issue was out of the control of the customer that brought me in, but at least we were able to help focus the investigation efforts. Organisations that use terms like mean-time-to-innocence as opposed to mean-time-to-resolution often have larger problems than technology to work through. As I’ve said in other articles, people generally want to fix problems that lie within their domain. However, there is often enough plausible deniability, and enough missing observability data due to retention, tooling, and processes, that sticking their heads in the sand and blaming another team seems to be the default.

Application	Application code is run here.
Presentation	Data encoding, encryption, compression.
Session	Manages the start-up and tear-down of a session. RPC as well as session cookies.
Transport	Transport protocol manages connections between the endpoints for the reliable or unreliable transmission of data.
Network	Manages the addressing or where to find another host.
Data-Link	Manages the connection between two physically or logically connected hosts. Ethernet.
Physical	Physical cable or medium used to connect two or more hosts.

OSI Functions

For the purposes of this discussion, we’ll keep it simple. There are cases where the lines are blurred; after all, this is a model, not a mandate. Technologies like DPDK (Data Plane Development Kit), raw sockets, and eBPF allow you to bypass some or most of the stack. This area is a lot of fun if you ever want to build your own router, switch, or firewall, or write your own implementation of layers 2 and above of a protocol.

Layer	Typical Component
Application	Application
Presentation	Application, Application Server, standard libraries
Session	Operating System, Application Server, standard libraries
Transport	Kernel
Network	Kernel
Data-Link	Kernel
Physical	Medium, e.g. cable

OSI Responsibility

A typical application will usually consider anything below layer 5 as a network problem or timeout, which can lead to interesting discussions with network teams who usually don’t manage the operating systems at either end. Remember that routers don’t retransmit packets (segments), hosts do. A retransmission could be a reaction to something happening within the network infrastructure, but it is also part of the normal TCP process of probing for its fair share of bandwidth, as seen in the below IO graph.

Just as the application stack (layers 5, 6, and 7) has no control over routing (layer 3), it also has no control over retransmissions, which are solely the domain of the operating system kernel. This also applies to TCP connection establishment (AKA the 3-way handshake): the application cannot slow the establishment down, but it can choose whether to read data from the socket.
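As a minimal sketch of that last point, assuming a loopback interface is available, the snippet below shows a client connecting and sending data to a listener that never calls accept(). The kernel completes the handshake and buffers the data on the application's behalf. The function name is made up for illustration.

```python
import socket


def handshake_without_accept() -> bool:
    """Connect to a listener whose application never calls accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        server.listen(1)
        port = server.getsockname()[1]

        client.settimeout(2.0)
        # The server's kernel answers the SYN; no application code runs.
        client.connect(("127.0.0.1", port))
        # The data is queued in kernel buffers until the application reads it.
        client.sendall(b"hello")
        return True
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print("connected without accept():", handshake_without_accept())
```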

An application can set some options that influence behaviours such as delayed ACK, Nagle, cork, and congestion control using setsockopt(), but that is a topic for another article. Check out the man page for TCP for more info.
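Just to show the shape of the call, here is a small sketch that disables Nagle's algorithm with the TCP_NODELAY option. The helper name is hypothetical, and other options (cork, congestion control) are platform specific, so they are left out.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Ask the kernel to send small segments immediately (Nagle off)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Read the option back to confirm the kernel accepted it.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("Nagle disabled:", disable_nagle(s))
```

The application only expresses a preference here. The kernel still owns segmentation, acknowledgement, and retransmission.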

In this section we’re going to use the OSI model to go through problem isolation and how to monitor each layer.

Layer 1 — Physical

The kinds of problems we typically see with the physical layer are to do with the medium. This could be as simple as a duplex mismatch, poor signal strength, or collisions. Errors like this are usually easily observed on the network infrastructure as CRC (Cyclic Redundancy Check) errors, runts (incomplete frames), carrier drops, and network switch buffer exhaustion. Streaming telemetry and SNMP provide excellent insights into these types of performance errors.
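The same kind of counters are also visible from the host side. The sketch below assumes a Linux host, where /proc/net/dev exposes per-interface error counters, and parses that text format. The sample data is invented for illustration and the function name is hypothetical.

```python
from typing import Dict


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Extract error-related counters per interface from /proc/net/dev text."""
    counters: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        if len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_errs": fields[2],      # receive errors
            "rx_frame": fields[5],     # framing errors
            "tx_errs": fields[10],     # transmit errors
            "tx_colls": fields[13],    # collisions
            "tx_carrier": fields[14],  # carrier losses
        }
    return counters


# Invented sample in the /proc/net/dev layout.
SAMPLE_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0: 5000 50 3 0 0 2 0 0 4000 40 1 0 0 4 5 0
"""

if __name__ == "__main__":
    for iface, stats in parse_net_dev(SAMPLE_NET_DEV).items():
        print(iface, stats)
```

On a real host you would pass in the contents of /proc/net/dev and compare two samples over time, since the counters are cumulative.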

Layer 2 — Data Link

The data link layer is responsible for synchronising, modulating, and multiplexing communications between physical nodes. It also covers carrier access (CSMA/CD) and enforcement of the protocol specification (e.g. Ethernet). An example of this is a frame whose checksum has failed (due to an error at layer 1) and which must therefore be discarded by the MAC (Media Access Control) sublayer. LLC (Logical Link Control) allows for flow control and extensions such as VLAN tags. Popular layer 2 standards include:

ATM
Ethernet
PPP

Streaming telemetry and SNMP provide excellent insights into these types of performance errors.
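To make the frame checksum example concrete, here is a rough sketch of how a receiver decides to discard a frame. It assumes the Ethernet frame check sequence is the same CRC-32 that zlib implements, appended in little-endian byte order. In practice the NIC does this in hardware, so the helper names and the dummy frame are purely illustrative.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a 4-byte CRC-32 frame check sequence to a frame."""
    return frame + struct.pack("<I", zlib.crc32(frame))


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC-32 over the body and compare it with the trailer."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body)) == fcs


if __name__ == "__main__":
    # Dummy frame: 12 bytes of MAC addresses, an EtherType, and a payload.
    good = append_fcs(bytes(12) + b"\x08\x00" + b"payload")
    # Flip one bit, as a layer 1 error on the wire might.
    bad = bytes([good[0] ^ 0x01]) + good[1:]
    print("good frame accepted:", fcs_ok(good))
    print("damaged frame accepted:", fcs_ok(bad))
```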

Layer 3 — Network

The network layer is used to identify and address hosts. In the TCP/IP world this is IP, the Internet Protocol (versions 4 and 6). Hosts use this to uniquely identify a target. Unlike a MAC address, IP addresses (outside RFC 1918) should be globally (or at least relatively) unique. There are a few exceptions to this, including multicast, private, anycast, site-local, and link-local addresses.
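Most of these exceptions can be recognised from the address alone. The sketch below uses Python's standard ipaddress module to label an address. The function name and labels are my own, and anycast is not covered because an anycast address looks like any other unicast address.

```python
import ipaddress


def classify(address: str) -> str:
    """Label an IPv4 or IPv6 address by its scope."""
    ip = ipaddress.ip_address(address)
    if ip.is_multicast:
        return "multicast"
    if ip.is_link_local:
        return "link-local"
    # is_site_local only exists on IPv6 addresses, hence getattr.
    if getattr(ip, "is_site_local", False):
        return "site-local"
    if ip.is_private:
        return "private"
    if ip.is_global:
        return "global"
    return "other"


if __name__ == "__main__":
    for sample in ("10.1.2.3", "224.0.0.1", "169.254.10.1", "fe80::1", "8.8.8.8"):
        print(sample, "->", classify(sample))
```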

Errors at the network layer include reachability failures (Destination Host Unreachable) and IP TTL expiry (the TTL is a field that is decremented by every router in a transit path).

Layer 3 is also typically where QoS (Quality of Service) is set. The DSCP (Differentiated Services Code Point) field in the IP header enables devices to prioritise traffic based on the class that has been configured. A router (or even a host) operating at layer 3 can then change priority or queueing strategy based on this and other parameters. Again, a story for another day.
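As a small illustration, the DSCP value occupies the upper six bits of the old IPv4 ToS byte, so a host can request a class by shifting the value two bits left and setting IP_TOS on a socket. This is a sketch with hypothetical helper names. Whether the marking survives the path depends entirely on network policy, and some platforms ignore the option.

```python
import socket


def dscp_to_tos(dscp: int) -> int:
    """Convert a 6-bit DSCP value into the 8-bit ToS byte (ECN bits zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Request a DSCP marking for packets sent on an IPv4 socket."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


if __name__ == "__main__":
    EXPEDITED_FORWARDING = 46  # the EF class, commonly used for voice
    print("ToS byte for EF:", hex(dscp_to_tos(EXPEDITED_FORWARDING)))
```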

While streaming telemetry and SNMP provide some insight into layer 3, NetFlow/IPFIX provide better insight into how traffic traverses an environment. Traceroute, ping, and other ad-hoc tools can provide some level of insight; however, this is subject to network policies.

Layer 4 — Transport

The transport layer is responsible for the reliable (or unreliable) delivery of data. This is where traditional network engineers lose visibility of the ‘network’. Each TCP session is bidirectional between a host pair (which could even be the same host talking to itself), and has a randomised (hopefully) initial sequence number in each direction. The sequence and acknowledgement numbers are used to keep track of the conversation and identify what data has and has not been received.

Routers, switches, and to a lesser extent firewalls do not participate (or interfere) in a TCP conversation between two hosts. As always, there are exceptions to this, such as devices that adjust values like the MSS (maximum segment size) for a path to avoid packet fragmentation. But to reiterate, a router does not retransmit a packet, the sending host does.

This is also where communications efficiency comes into play. The congestion window on the sender is used to meter data communications across the network, and the receive window advertised by the receiver tells the sender how much buffer space the receiver has available for incoming data.
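A useful back-of-the-envelope consequence is that a single TCP connection cannot move more than one window of data per round trip. The sketch below works that out, taking the effective window as the smaller of the congestion and receive windows. The function name and the example numbers are illustrative only.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput for one connection: window / round-trip time."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    # A 65,535 byte window (no window scaling) across a 50 ms round trip
    # tops out at roughly 10.5 Mbit/s, regardless of link speed.
    bound = max_throughput_bps(65_535, 0.050)
    print(f"{bound / 1_000_000:.1f} Mbit/s")
```

This is why a fast link with high latency can still feel slow to a single connection with a small window.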

NetFlow and IPFIX provide some insight here, but to really understand how protocols like TCP are performing we need actual packet data. This could be as simple as tcpdump running continuously, or using a commercial solution like Riverbed AppResponse for Application Aware Network Performance Management. Endpoint and server agents can also play a role in providing metrics, events, and traces. eBPF can also come into play here.
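As an example of the kind of metric an endpoint agent might collect, the sketch below assumes a Linux host and derives a retransmission ratio from the Tcp counters in /proc/net/snmp. The sample text is invented and the function name is hypothetical.

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from /proc/net/snmp formatted text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    # The first Tcp line holds the counter names, the second the values.
    stats = dict(zip(rows[0][1:], rows[1][1:]))
    sent = int(stats["OutSegs"])
    retransmitted = int(stats["RetransSegs"])
    return retransmitted / sent if sent else 0.0


# Invented sample in the /proc/net/snmp layout.
SAMPLE_SNMP = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 10 5 0 0 3 900 1000 20 0 1 0
"""

if __name__ == "__main__":
    print(f"retransmit ratio: {tcp_retransmit_ratio(SAMPLE_SNMP):.2%}")
```

These counters are cumulative for the whole host, so an agent would normally sample them periodically and work with the deltas. They cannot tell you which conversation is suffering, which is where packet data comes in.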

Layer 5 — Session

This is where things start to become a little more vague, and using the model as a model rather than a rule becomes important. The session layer is responsible for the management of individual sessions using the same transport. This is applicable to a lot of the older protocols, including RPC (Remote Procedure Calls). It could be argued that this includes HTTP session management (session cookies).

At this point in the stack we are now working with application servers and standard libraries, and a small percentage of applications that have chosen to implement this from scratch.

This is where application traces, logs, and events come into play. Endpoint and server agents can be used to aggregate and feed this data into EUEM and APM analysis servers.

Layer 6 — Presentation

The presentation layer manages compression, (some) encryption, and encoding. This allows the application to work with clear text (e.g. ASCII and Unicode) without having to implement much within the application code. This is typically managed by the application server and standard libraries. An example of this is opening an encrypted socket vs a cleartext socket. The process of reading and writing to the socket is largely the same, with the developer not worrying about the encryption process.
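A short sketch of that socket example, using Python's standard socket and ssl modules: the only difference between the two cases is one wrapping step, and the read and write code is shared. The function names are hypothetical, and the host and port are whatever the caller supplies.

```python
import socket
import ssl


def open_channel(host: str, port: int, use_tls: bool) -> socket.socket:
    """Open a TCP connection, optionally wrapped in TLS."""
    raw = socket.create_connection((host, port), timeout=5.0)
    if not use_tls:
        return raw
    # The library handles the handshake, certificate checks, and encryption.
    context = ssl.create_default_context()
    return context.wrap_socket(raw, server_hostname=host)


def exchange(sock: socket.socket, payload: bytes, max_reply: int = 4096) -> bytes:
    """Send a payload and read one reply; identical for TLS and cleartext."""
    sock.sendall(payload)
    return sock.recv(max_reply)
```

The exchange() function never needs to know which kind of socket it was given, which is the point: the encryption lives below the application logic.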

Metrics (such as TLS handshake timings), events, logs, and traces provide detailed insight into this layer and its associated performance problems.

Layer 7 — Application

This is where most developers spend their time and where application logic is performed. Application performance problems at this layer can come from a whole range of sources, including but not limited to:

Head of line blocking.
Number of Server turns (request / response cycles) which is amplified by latency.
Poor memory management.
Inefficient algorithms.
Number of worker threads.
Dependencies such as databases and web services.
Locks.
Lack of connection pooling and pipelining.
Authentication.

This is solely the domain of APM and EUEM agents, which collect metrics, events, logs, and traces (MELT).
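The server turns item in the list above is easy to put into numbers. As a rough sketch that ignores server processing time and assumes the turns happen one after another, the time spent waiting on the network is simply the number of turns multiplied by the round-trip time. The function name and figures are illustrative.

```python
def network_wait_seconds(turns: int, rtt_seconds: float) -> float:
    """Time spent waiting on the network for sequential request/response turns."""
    if turns < 0 or rtt_seconds < 0:
        raise ValueError("turns and RTT must not be negative")
    return turns * rtt_seconds


if __name__ == "__main__":
    # The same 200-turn transaction on a LAN and across a WAN.
    for label, rtt in (("1 ms LAN", 0.001), ("50 ms WAN", 0.050)):
        print(f"{label}: {network_wait_seconds(200, rtt):.1f} s")
```

A chatty fat client that feels fine next to its database can become unusable once the same number of turns has to cross a higher-latency path.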

If you got this far, thanks for reading. Check out some of the book recommendations.

The post Performance Diagnostics Part 2 — Revenge of the OSI Model first appeared on Observe Ability.
