Over the last 20 years I’ve been sent in by customers to investigate some of the most intriguing application performance problems: problems that have had customers investing in infrastructure, war-room time, and connectivity to resolve an issue that eludes the technical team, or that the team can’t quantify a fix for. Once a change such as adding bandwidth, compute, or storage has been implemented, it is seldom reversed, even when the root cause of the issue is identified.
While performance diagnostics is a fancy term for troubleshooting (something that technical staff do every day), I love the rush of energy that comes from the investigation and from identifying the solution. Performance problems are harder to investigate than an outage because the application is somewhat working, the degradation occurs intermittently, and most customers I work with have limited observability tools in place, relying instead on the depth of the IT support queue as a measure of the health of an application.
I’m going to start with one of the simplest categories of problems: the ‘slow fat client application’. While many applications have moved to HTTP(S) for delivery and presentation, there are still (far too) many that require a fat client executable that, if you are lucky, connects to an application server; directly to a database; or, if you are really unlucky, to a database file on a file share (oplocks… and I’m sorry…).
In this amalgam of stories (to preserve anonymity), I’m going to talk about a fat client application talking directly to a database: a long-running issue entangled in process, political, and technology problems.
Background:
- The customer has bought a COTS application from an external vendor to avoid the IT cost of developing the application in-house.
- The application is accessed by about 20 users simultaneously.
- The quality and detail of the support tickets have degraded over time as users have become frustrated with the lack of resolution and have started to accept this experience as normal.
- The vendor of the application has said that the application works fine for them, and has even tested the customer’s dataset on their own infrastructure.
- The blame game has reached fever pitch, and the executive wants answers.
This makes the first step of performance diagnostics the hardest: quantifying the magnitude of the problem and when it occurs. Users will often complain that the issue happens ‘all the time’ and ‘everyone experiences it’. Along with users, I’ll usually interview the various teams to understand their take on the issue. Often these tech teams have been banging their heads against the wall for some time and are sick of talking about it and investigating the vague problem reports. I’ve found the best way to get buy-in from reluctant tech teams is to be transparent and provide them access to any telemetry I collect as part of the investigation.
Once I understand the nature of the problem(s), I start at the point of consumption (the workstation). If the customer has EUEM (end-user experience monitoring) tools in place, I can define transactions (click to render) and break each one into three aspects with timestamps:
- Client activity, including process utilisation, overall environmentals, delay between requests, and network connectivity (VPN, wired, wireless).
- Network activity, including the number of request and response cycles, bandwidth utilisation, and the efficiency of the network communications.
- Delays on the backend, including time to first byte. (I’ll write up how to quantify this in a separate article; a rough sketch of where these timestamps sit follows this list.)
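If there’s no EUEM agent to hand, the same split can be approximated by wrapping timestamps around the socket calls of a test transaction. A minimal sketch in Python; the host, port, and request payload are hypothetical placeholders, not the real application’s protocol:

```python
import socket
import time

# Hypothetical endpoint and request bytes; substitute your application's
# server and protocol. The point is only where the timestamps sit.
HOST, PORT = "db.example.internal", 1433
REQUEST = b"<your protocol request bytes>"

t0 = time.perf_counter()
sock = socket.create_connection((HOST, PORT), timeout=5)
t_connect = time.perf_counter()       # TCP handshake: ~1 network round trip

sock.sendall(REQUEST)
first = sock.recv(4096)               # blocks until the first response byte
t_first_byte = time.perf_counter()    # ~1 round trip + backend think time

body = bytearray(first)
while chunk := sock.recv(65536):      # drain until the server closes
    body.extend(chunk)
t_done = time.perf_counter()
sock.close()

print(f"connect (network RTT proxy):   {(t_connect - t0) * 1000:.1f} ms")
print(f"time to first byte (backend):  {(t_first_byte - t_connect) * 1000:.1f} ms")
print(f"transfer (network throughput): {(t_done - t_first_byte) * 1000:.1f} ms")
```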
Together, these measurements give me an understanding of the baseline performance of the application. It’s also important to remember that at this point ‘normal’ does not equal good.
Normal != Good
If the customer doesn’t have an EUEM solution such as Riverbed Aternity, a packet capture (PCAP) or an NPM (network performance monitoring) solution will give me most of the network and backend delay information.
In most circumstances I now have enough information to determine the performance limits of the application: whether bandwidth will solve the problem, or whether the problem can be fixed without architectural changes to the application.
In this simple example of the fat client talking to a database, there are only so many answers that can be given. I’m going to use the login transaction taking 12.5 seconds as my baseline. In this case the vendor and the database team had blamed the network. The network team had checked connectivity and found it to be operating fine, with performance tests achieving 1Gbps, far more than the application needed. Looking into the EUEM statistics I found a significant amount of time being spent on the network, and a PCAP told me the same thing.
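Quantifying ‘time on the network’ from a capture largely comes down to counting server turns and the delay inside each one. A rough sketch using scapy (assumed installed); the client/server addresses and capture filename are made up for illustration, and the turn detection is an approximation:

```python
from scapy.all import rdpcap, IP, TCP

# Hypothetical addresses and filename; adjust to your own trace.
CLIENT, SERVER = "10.0.0.15", "10.0.0.80"

packets = rdpcap("login.pcap")
turns, last_request_time, turn_delays = 0, None, []

for pkt in packets:
    if IP not in pkt or TCP not in pkt or len(pkt[TCP].payload) == 0:
        continue  # skip non-TCP packets and bare ACKs
    if pkt[IP].src == CLIENT and last_request_time is None:
        last_request_time = float(pkt.time)   # client opens a turn
    elif pkt[IP].src == SERVER and last_request_time is not None:
        turns += 1                            # first server reply closes it
        turn_delays.append(float(pkt.time) - last_request_time)
        last_request_time = None

print(f"server turns: {turns}")
print(f"total time inside turns: {sum(turn_delays):.2f} s")
```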
Network can mean a lot of different things, but we need to remember that the network starts and ends with the two endpoints. Routers don’t retransmit traffic, endpoints do. In this case, the workstation made a few database queries to complete a login, including authentication and downloading 12,000 rows of customer data. The fat client was performing an unbuffered fetch of the data from the database, which in itself is not bad if that level of data integrity is needed (not so in this case).
So what does an unbuffered fetch mean? For every row in the result set, the client makes a sequential request for data, one, after, the, other. At under 1 ms of latency between the workstation and the database server, with each server turn (request/response cycle) taking 1 ms, we saw 12,000 rows taking (you guessed it) 12 seconds. The vendor was not receptive to the idea that it was a problem with the application, but decided to test our finding by changing the login process to do a buffered fetch of the 12,000 rows. This took the login time down to about 0.5 seconds. The sigh of relief from the customer’s tech lead was short-lived, though: the vendor was not willing to modify the application, and the solution was thrown out as not fit for purpose.
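To make the two patterns concrete, here’s a sketch using the standard Python DB-API shape. sqlite3 stands in purely for the API; it is in-process, so there is no real network cost here, but with a networked, unbuffered driver every fetchone() can cost a full server turn:

```python
import sqlite3

# sqlite3 is used only for the DB-API shape (the real client was a fat-client
# executable, not Python). Over a network, an unbuffered cursor can turn each
# fetchone() below into a full request/response cycle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"cust-{i}") for i in range(12_000)])

# Unbuffered pattern: row by row, one (potential) turn per row.
cur = conn.execute("SELECT * FROM customers")
rows = []
while (row := cur.fetchone()) is not None:
    rows.append(row)        # 12,000 turns * ~1 ms each = ~12 s over a LAN

# Buffered pattern: let the driver stream the result set in large batches.
cur = conn.execute("SELECT * FROM customers")
rows = cur.fetchall()       # a handful of turns: ~0.5 s in this story
```

Whether fetchone() really costs a network turn depends on the driver’s cursor buffering; the vendor’s client behaved like the worst case.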
It’s not just how much bandwidth the network has, or how fast the server is, but how the application uses the available resources. How many request/response cycles does it take to go from click to render?
So why did the vendor not see the problem? They were testing the fat client application on the database server itself, negating any limitation from the speed of light or network protocol overhead. 0 ms times 12,000 is still 0.
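The arithmetic behind both results fits in a few lines. A toy model, with the 1 ms per-turn figure taken from the measurements above and server think time folded into the turn time for simplicity:

```python
# Toy model: elapsed time ~= server turns * time per turn.
def transaction_time_s(turns: int, turn_ms: float) -> float:
    return turns * turn_ms / 1000.0

print(transaction_time_s(12_000, turn_ms=1.0))   # customer LAN: 12.0 s
print(transaction_time_s(12_000, turn_ms=0.0))   # client on the DB server: 0.0 s
```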
If you got this far, thanks for reading and follow me for more.