SRE: Five Ways to Build a Blameless Culture

One of the main pillars of SRE (Site Reliability Engineering) is a blameless culture. However, building one takes more than just words. You can’t build a blameless culture by talking about culture, because culture is the result of changes in processes and structures within an organisation.

Here are my top 5 ways to build a blameless culture:

Career safety
Root cause analysis is not blame
Observability
Customer centricity
Shared story

What does a defensive culture look like?

As a consultant, I’m often brought in to resolve problems with applications, networks, and services that have complex infrastructure and even more complex cultural problems. These problems have usually existed for more than a few months, to the point that leadership has gone to market to seek outside help.

When organisations get to this point, they typically have rigid structures in place: ongoing war rooms, overly cautious change control processes, and long-running incidents that seem to have no root cause.

Five ways to build a blameless culture

1: Career safety

People who fear for their job are less likely to be transparent about the problems they encounter. I’ve seen organisations so risk averse and bound by process that they will leave a problem in place rather than resolve it, for fear that raising a change request may expose a problem within their domain of operations.

Work with your team to help them understand that finding problems is a good thing, and that their role is secure (or even bolstered) by finding existing and potential issues.

2: Root cause analysis is not blame

At the end of an incident, root cause analysis should be performed to understand where in the technology stack the problem occurred. Finding a problem in a specific area, such as a router configuration, is a learning exercise, not an occasion to blame a person for not doing their “job”.

Root cause analysis can be tough and may uncover some uncomfortable truths, but it is necessary to improve the experience of your users. Finding a problem should therefore be rewarded, so that people feel empowered to actively seek out problems.

Hindsight is always 20/20, and life is about continuous learning and development. We should encourage our best and brightest to share times when they have made a mistake, to help others feel comfortable discussing their own experiences.

3: Observability

Monitoring has come a long way in the last 30 years. We now have the ability to observe practically every part of a technology stack with metrics, logs, and traces. Without a proper observability solution, we are left with blind spots that leave room for doubt. If an engineer is unable to say with certainty that the problem is or is not in their domain, it becomes easy to blame transient issues that may or may not have existed.

No engineer wants to have a problem in their domain, and observability tools allow them to inspect their own domain first and understand why a problem occurred. Arm your engineers with the right observability tools and you will see them start proactively seeking out problems, rather than fretting that a problem they don’t know about may exist.

4: Customer centricity

Customer centricity deserves a whole article of its own, but in this context I focus on putting yourself in the shoes of the customer experiencing the service. In a previous life I made a point of travelling out to the field to be with end users in remote parts of Australia, who connected over satellite communications, so that I could share the experience they had every day.

A small change to a service may seem insignificant when tested, but it can become a real problem once latency is introduced, making the application or service unusable for a portion of your customer base.
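As a rough illustration of the latency point, the sketch below estimates how long a user spends waiting on the network when a change adds sequential round trips. All the numbers are illustrative assumptions rather than measurements from any real service: the office round-trip time is hypothetical, and 600 ms is only a commonly quoted ballpark for a geostationary satellite link.

```python
def estimated_wait_seconds(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network for sequential round trips."""
    return round_trips * rtt_ms / 1000.0


# Assumed values for illustration only.
OFFICE_RTT_MS = 20.0      # hypothetical well-connected office link
SATELLITE_RTT_MS = 600.0  # ballpark for a geostationary satellite link

# Hypothetical change: a screen goes from 10 to 25 sequential requests.
ROUND_TRIPS_BEFORE = 10
ROUND_TRIPS_AFTER = 25

if __name__ == "__main__":
    for label, rtt_ms in (("office", OFFICE_RTT_MS), ("satellite", SATELLITE_RTT_MS)):
        before = estimated_wait_seconds(ROUND_TRIPS_BEFORE, rtt_ms)
        after = estimated_wait_seconds(ROUND_TRIPS_AFTER, rtt_ms)
        print(f"{label}: {before:.1f}s -> {after:.1f}s")
```

Under these assumptions the office user goes from 0.2 to 0.5 seconds of waiting, which nobody would notice in testing, while the satellite user goes from 6 to 15 seconds for the same change.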

A second aspect of customer centricity is understanding that a customer may not be able to adequately describe the problem they are experiencing. This can lead to tickets or incidents being dismissed as “not reproducible”, or to the end user being labelled a noisy complainer.

Dig deeper into what the end user is experiencing by asking probing questions, such as those you might hear when visiting your GP (General Practitioner):

Tell me more?
When did this start?
What have you tried?
Did that work?
What is the impact?

5: Shared story

A shared vision and story allows us all to move in the same direction. In a defensive culture, we see issues or bugs being dismissed as the fault of the way customers use the software or service. If we create a shared story and understanding of the way end users interact with our services, it’s easier for everyone, from end users through to developers, to understand where and why problems are experienced.

We can create story maps using wireframes or bullet point workflows that show how users interact and navigate through our applications and services. An excellent book on this topic is User Story Mapping by Jeff Patton.

6*: Culture is a trailing indicator

Remember that we can’t wake up one day and declare that we have a blameless culture. It takes time, and it is a trailing indicator of the behaviours and structures discussed in this article.

What other techniques have you used to help develop a blameless culture?

* Bonus