Alert Fatigue: Why Too Many Alerts Can Be Disastrous!

Alert fatigue is a problem I’ve encountered so many times in IT Operations, especially as monitoring sprawl increases the number of tools we use to gain additional insights into our SLOs (Service Level Objectives). Those on the front line who receive the alerts begin to drown in the information and overlook the important issues when they arise.

Having been in the hot seat myself, I’ve fallen victim to alert fatigue: an important notification (by SMS at the time) showed up on my old-school Nokia in between 25 other notifications. The context was that we had 25 satellite sites connected via a branch office. We were used to satellite being unreliable due to rain fade, solar events, and carrier maintenance, so a deluge of alerts typically indicated a satellite failure. In this case, however, the branch office uplink supporting the satellite sites had failed. All 80 staff in the office were offline and called the service desk to get the issue resolved. The damage in this case was minimal, but we’ve all seen much worse, where revenue and reputation were impacted.

Most observability tools now have some level of event management, prioritisation, and filtering built in. On top of that, the ITSM (IT Service Management) platform usually has an events module that is meant to correlate alerts and reduce the number of incidents being raised. That said, I’ve seldom seen an environment where this has been implemented effectively. Usually we see a straight passthrough from observability tool alert to ITSM incident. For many of the customers I’ve worked with, more than 10,000 incidents a day is not uncommon.
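To make the contrast concrete, here is a minimal Python sketch of what a correlation step does that a straight passthrough does not. The alert records and the correlation key (site plus check type) are hypothetical; real correlation engines use richer rules, but the core idea is simply grouping related alerts into one incident:

```python
from collections import defaultdict

# Hypothetical raw alerts as they might arrive from a monitoring tool:
# 25 satellite sites all losing their uplink, plus one unrelated disk alert.
alerts = [
    {"site": "branch-01", "check": "uplink", "host": f"sat-{i:02d}"}
    for i in range(25)
] + [{"site": "hq", "check": "disk", "host": "db-01"}]

def correlate(alerts, key_fields=("site", "check")):
    """Collapse raw alerts into one incident per correlation key."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = tuple(alert[field] for field in key_fields)
        incidents[key].append(alert)
    return incidents

incidents = correlate(alerts)
print(f"{len(alerts)} alerts -> {len(incidents)} incidents")
# 26 alerts -> 2 incidents
```

A passthrough would have raised 26 incidents here; correlation raises two, each still carrying the underlying alerts for diagnosis.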

What Can We Do About It?

Alert Management Solution (AIOps)

There are a number of software solutions for managing alerts, incidents, and problems from vendors including PagerDuty, BigPanda, Splunk, Atlassian, and ServiceNow. Each has its own spin on AIOps and its relation to alert management, and each is certainly worth digging into (if you haven’t already).

In the Open Source world there are projects like Keep and Prometheus Alertmanager. Using tools like Fluent Bit as part of your observability pipeline can also help.
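Alongside grouping, most of these tools offer some form of flood or storm suppression. The class below is a hypothetical Python sketch of that idea, not any particular tool’s API: once too many alerts land inside a time window, it emits a single summary and swallows the rest until the storm passes.

```python
import time
from collections import deque

class StormSuppressor:
    """Once more than `threshold` alerts arrive within `window` seconds,
    emit one summary notification and suppress the rest."""

    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold
        self.window = window
        self.recent = deque()  # timestamps of recent alerts

    def handle(self, alert, now=None):
        now = time.monotonic() if now is None else now
        self.recent.append(now)
        # Drop timestamps that have aged out of the window.
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) == self.threshold:
            return "SUMMARY: alert storm in progress, suppressing further alerts"
        if len(self.recent) > self.threshold:
            return None  # suppressed
        return f"ALERT: {alert}"
```

The design choice worth noting is the summary notification: suppression without a summary just hides the problem, whereas one clear “storm in progress” event is exactly what a fatigued operator needs to see.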

One of the other challenges with relying on an AIOps platform to reduce alert fatigue is that the platform itself can become fatigued: feed it enough noise and it, too, starts to miss the signal.

Separate Reporting From Service Level Indicators

When we build a new monitoring or observability solution, we are often after quick wins to reduce time to value and demonstrate return on investment. We don’t want all of this information to become alerts on an ongoing basis. The key is to reduce the load on our alert management solution by ensuring that we are actively monitoring and alerting on our Service Level Indicators (SLIs) as defined by Service Delivery and Development teams. SRE (Site Reliability Engineering), ITOM (IT Operations Management), and Performance Consultants are well versed in defining what these SLIs are and can help prioritise them.

Examples:

  1. Server Response Time
  2. Time to First Byte
  3. Largest Contentful Paint
  4. DB Query Time

Prioritising SLIs doesn’t mean we shouldn’t collect and alert on other metrics, but “if everything is important, nothing is.” By prioritising what alerts are escalated and how, we can reduce alert fatigue so that when an issue does occur, it gets the right attention.
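A minimal Python sketch of that “escalate SLIs, dashboard the rest” policy is below. The metric names, thresholds, and channels are hypothetical and would be tuned per service; the point is that only breached SLIs reach a human, while everything else is still collected.

```python
# Hypothetical SLI thresholds and escalation targets; tune per service.
SLI_POLICY = {
    "server_response_time_ms":      {"threshold": 500,  "escalate_to": "pager"},
    "time_to_first_byte_ms":        {"threshold": 200,  "escalate_to": "pager"},
    "largest_contentful_paint_ms":  {"threshold": 2500, "escalate_to": "chat"},
    "db_query_time_ms":             {"threshold": 100,  "escalate_to": "chat"},
}

def route(metric, value):
    """Escalate only breached SLIs; everything else goes to a dashboard."""
    policy = SLI_POLICY.get(metric)
    if policy is None or value <= policy["threshold"]:
        return "dashboard"  # still collected, just not escalated
    return policy["escalate_to"]

print(route("time_to_first_byte_ms", 850))   # pager
print(route("cpu_utilisation_pct", 95))      # dashboard
```

Note that a non-SLI metric like CPU utilisation still lands on a dashboard; nothing is thrown away, it simply doesn’t wake anyone up.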

Platform Ops Observability

Building an Observability capability into your Platform Ops offering can help standardise the observability solution used across the organisation. Beyond the solution itself, the practices and workflows can be standardised on a maturity scale depending on the criticality of the service and the impact to the customer’s digital experience.

What Next?

Managing alerts well is hard, and the larger the organisation, the more complicated it becomes. AIOps promises to help reduce alert fatigue, but we need to improve the quality of the data going in to improve its effectiveness.

I’m excited to see what happens in this area in the next few years. If you have a great Open Source solution to improve operations and reduce alert fatigue, I’d be excited to see your reply in the comments below.

