Performance Diagnostics Part 6: 5 SRE Practices to Minimise Toil

One of the core tenets of SRE (Site Reliability Engineering) is to minimise toil in order to increase resiliency and improve digital experience. SRE is a practice that Google created in the 2000s to improve the performance of the “Site”, the site being Google Search.

Google, being made up of some incredibly smart people, defied the traditional model of Development and Operations. SRE pivots from the traditional ops model by making the SRE team, which includes developers and infrastructure team members, responsible for the performance of a site or service. This reduces friction and improves the customer experience.

My Top 5 Practices:

Automation
Service Level Objectives
Observability
Incident Management
Blameless Culture

What’s Wrong With The Old Model?

In the traditional ops model, developers would typically build an application or service, and the operations team would deploy and maintain the application and supporting infrastructure.

The friction is more than just a lack of collaboration. Over time we all make mistakes, and this leads to an us vs them mentality. The development team will see every issue as an ops problem, and the ops team will see every issue as a bug, leading to a build-up of technical debt and unresolved issues.

The famous IT Crowd quote “Have you tried turning it off and on again?” might be a cliché, but it’s often the first thing operations will do to try and restore service. This mentality means that the operations team is less focused on why the service failed and more on applying a band-aid to it.

Conversely, developers seeing an exception that says something to the effect of “network timeout” may dismiss the service disruption as an ops problem without considering that the issue could be a result of the application design.

How does SRE Minimise Toil?

The SRE team includes people with development and operations experience, meaning that developers are closer to the user and to the performance of the service. Being closer to the user empowers the team to improve service performance and therefore digital experience.

Toil is the repetitive work we do for low levels of gain. A good example in the operations space would be manually clearing up disk space. While useful, it could easily be automated, freeing up staff to perform higher value tasks.
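As a rough illustration of automating the disk space example, here is a minimal Python sketch that removes files older than a given age from a single directory. The function name clean_old_files, the seven-day default, and the dry-run flag are all assumptions made for this example rather than a recommended policy.

```python
import time
from pathlib import Path


def clean_old_files(directory, max_age_days=7, dry_run=True):
    """Remove regular files in `directory` older than `max_age_days`.

    Returns the paths that were removed (or would be removed in dry-run mode).
    The name, defaults, and dry-run behaviour are hypothetical choices.
    """
    cutoff = time.time() - max_age_days * 86400  # age threshold in seconds
    removed = []
    for path in Path(directory).iterdir():
        # Only touch regular files; leave subdirectories and symlinks alone.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running it with dry_run=True first lists what would be deleted, which is a cheap safety check before scheduling the script with cron or a similar scheduler.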

Automation

Scripting, APIs, and infrastructure as code have significantly reduced the time engineers spend on repetitive tasks. I don’t know many engineers who enjoy manually building a server and deploying a service. CI/CD pipelines have significantly improved the quality of software by building and testing it automatically, freeing up staff to focus on higher value tasks.

Service Level Objectives

Rather than focusing only on Service Level Agreements, SRE works with three related elements: SLI, SLO, and SLA.

SLI: A Service Level Indicator is a key metric that influences the performance of a service. CPU utilisation is a good example of a single metric that influences the performance of a service.
SLO: A Service Level Objective is the expected performance of the service. This is the value we know we should not exceed in BAU (business as usual) operations. For example, a Server Response Time of less than 100ms could be the SLO, which when exceeded triggers an investigation.
SLA: A Service Level Agreement is the never-breach performance of a service. This is the agreement made with an external (or internal) team. Breaching this value has consequences that could include service credits.
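To make the relationship between the three terms concrete, here is a small Python sketch that treats the average server response time as the SLI and compares it against an SLO and an SLA threshold. The 100ms SLO comes from the example above; the 250ms SLA value, the use of a simple average, and the function name are assumptions for illustration only.

```python
from statistics import mean

SLO_RESPONSE_MS = 100.0  # objective from the example in the text
SLA_RESPONSE_MS = 250.0  # hypothetical contractual limit, chosen for illustration


def evaluate_response_time(samples_ms, slo_ms=SLO_RESPONSE_MS, sla_ms=SLA_RESPONSE_MS):
    """Return (sli, status) for a batch of response-time samples in milliseconds.

    The SLI here is the plain average of the samples, which is an assumption;
    the status is "ok", "slo_exceeded", or "sla_breach".
    """
    if not samples_ms:
        raise ValueError("at least one sample is required")
    sli = mean(samples_ms)
    if sli >= sla_ms:
        status = "sla_breach"      # consequences such as service credits
    elif sli >= slo_ms:
        status = "slo_exceeded"    # triggers an investigation
    else:
        status = "ok"
    return sli, status
```

Because the SLO sits well below the SLA, a status of slo_exceeded gives the team room to investigate before the agreement itself is at risk.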

Observability

Observability means using tooling and alerts to understand and map out services so that key metrics are quantified and recorded. As the old saying goes, “you can’t improve what you cannot measure”. Subjective and qualitative measures can also be useful but shouldn’t be the primary measure. I recently worked with a company that measured comments on social media as an indicator of service.

ITIM: IT Infrastructure Management looks at infrastructure performance: servers, network devices, load balancers, etc. This is the traditional monitoring approach that uses SNMP, synthetics, WMI, APIs, and streaming telemetry to interrogate devices. This is also where logs and events would be collected and aggregated.

NPM: Network Performance Management looks at the efficiency and characteristics of the protocols traversing a network. This is done using passive techniques such as NetFlow and packet capture.

EUEM: End User Experience Management looks at the performance of an application or service at the point of consumption, typically by measuring endpoint metrics such as CPU, memory, disk IO, and click-to-paint times. Check out my article on EUEM here!

APM: Application Performance Management looks at the performance of an application or service by instrumenting the application, container, or server. This could include techniques like eBPF, OTEL (OpenTelemetry), and profilers, and typically produces traces. Check out my review on learning eBPF. Check out my OTEL article with SigNoz.
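To give a feel for what instrumenting an application involves, here is a toy Python sketch that records timed spans around units of work using only the standard library. It is not the OpenTelemetry API; the span helper, the SPANS list, and the handle_request function are hypothetical names, and a real tracer would also record parent and child relationships and export the data to a backend.

```python
import time
from contextlib import contextmanager

SPANS = []  # collected span records; a real tracer would export these


@contextmanager
def span(name):
    """Time the enclosed block and record it as a simple span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        SPANS.append({"name": name, "duration_ms": duration_ms})


def handle_request():
    """Hypothetical request handler with two instrumented steps."""
    with span("handle_request"):
        with span("query_database"):
            time.sleep(0.01)   # stand-in for a database call
        with span("render_response"):
            time.sleep(0.005)  # stand-in for rendering work
```

After calling handle_request(), SPANS holds one record per step, with the inner spans recorded before the outer one because they finish first. Comparing the durations shows where the request spent its time.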

If we have an Observability instrumentation plan in place, we can identify what factors influence performance and why. This means that no-one sits in a war room hoping the problem is not theirs.

Incident Management

Incident Management and root cause analysis are crucial to minimising toil. I built my business around root cause analysis and diagnostics to teach technologists how to troubleshoot and manage incidents effectively. To rely on a service, we need to methodically identify the cause of a fault and rectify the underlying problem rather than applying a band-aid.

This is where the SRE mindset adds value to DevOps. SRE is compatible with DevOps and both work effectively together.

Blameless Culture

Blameless culture allows for the creation of a shared vision. If we move beyond empire building and focus on the service being as reliable and performant as possible, everyone wins, and no-one hides and hopes that the problem is not in their domain.

Wrap Up

These are not the only practices we need to reduce toil and improve digital experience. If you made it this far, thanks for reading. Why not purchase the Site Reliability Engineering book, which describes Google’s approach to SRE? The Site Reliability Workbook provides practical examples of how to implement SRE.
