In the world of site reliability engineering (SRE), engineers lean on alerting and problem detection to help meet customer reliability expectations and avoid churn.
There are two main approaches to problem detection: deterministic and probabilistic. Each has its own advantages and limitations. In this post, we'll break down the options and share insights from 150 engineering teams to help you and your team design a robust monitoring strategy.
Deterministic detection involves predefined rules and patterns that trigger alerts when specific conditions are met. This method relies on clear, fixed criteria derived from known failure modes and operational events.
Some of these rules are simple, triggered when a single event occurs. Others are complex, relying on a combination of conditions or events.
For example:
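As a minimal sketch of the difference between a simple and a complex rule, the Python below evaluates both against a snapshot of service state. The `Snapshot` fields, the `oom_killed` event name, and the numeric thresholds are all hypothetical, chosen only for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of a service (all fields are illustrative)."""
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float      # 99th percentile latency in milliseconds
    deployed_recently: bool    # True if a deploy happened in the lookback window
    events: tuple = ()         # names of discrete events seen in the window


def simple_rule(snapshot: Snapshot) -> bool:
    # Simple rule: fires as soon as one specific event occurs.
    return "oom_killed" in snapshot.events


def complex_rule(snapshot: Snapshot) -> bool:
    # Complex rule: fires only when several conditions hold together,
    # here a recent deploy combined with elevated errors and latency.
    return (
        snapshot.deployed_recently
        and snapshot.error_rate > 0.05      # assumed threshold
        and snapshot.p99_latency_ms > 800   # assumed threshold
    )


healthy = Snapshot(error_rate=0.01, p99_latency_ms=200, deployed_recently=True)
bad_deploy = Snapshot(error_rate=0.12, p99_latency_ms=1500, deployed_recently=True)
crashed = Snapshot(error_rate=0.0, p99_latency_ms=100, deployed_recently=False,
                   events=("oom_killed",))
```

Both rules are deterministic in the sense described above: given the same snapshot they always return the same answer, and the criteria are fixed in advance rather than learned from data.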
Probabilistic detection, on the other hand, uses statistical methods, including machine learning, to identify anomalies based on historical data and patterns.
Some of these models are designed to detect anomalies and others are focused on predicting future failures.
For example:
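One of the simplest statistical approaches is a z-score check, which flags a value that sits unusually far from the historical mean. The sketch below is only an illustration of the idea, not a description of how any specific tool works; the function name `is_anomaly`, the default threshold of 3 standard deviations, and the sample data are assumptions.

```python
import statistics


def is_anomaly(history, value, z_threshold=3.0):
    """Return True if `value` is more than `z_threshold` sample standard
    deviations away from the mean of `history` (needs at least 2 points)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return value != mean
    return abs(value - mean) / stdev > z_threshold


# Hypothetical requests-per-second samples from a quiet period.
history = [100, 102, 98, 101, 99, 100, 103, 97]

normal_reading = is_anomaly(history, 101)
spike_reading = is_anomaly(history, 140)
```

Unlike a fixed rule, the alerting boundary here moves with the data: the same reading can be normal for a noisy metric and anomalous for a stable one, which is both the appeal of the approach and a source of false alarms when the history is not representative.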
To understand the State of Problem Detection, Detect met with over 150 engineering teams.
As discussed, deterministic rules can provide straightforward, actionable alerts for known issues, while probabilistic methods promise deeper insights and early warnings for less obvious problems. Since Gartner coined the AIOps category in 2017, and with the modern buzz around Generative AI, we've seen many vendors emphasize probabilistic methods, but the results have not followed.
Here's what else we found:
1. Most engineering teams rely on threshold-based rules both for alerts that 'wake people up' and for low-priority alerts. However, these alerts are often superficial, requiring teams to wade through traces, logs, and metrics to figure out what is causing the issue.
2. Most engineering teams have tried anomaly-based alerts but now largely ignore them because of the noise, with one organization showing us 26,000 alerts.
3. Complex rules are often the most reliable alerts, but teams face the following adoption challenges:
Detect's mission is to help engineering teams improve their problem detection and troubleshooting capabilities. We encourage you and your team to assess your posture in this area and develop a problem detection roadmap leveraging the free resources on this site.