For Site Reliability Engineers (SREs) and Software Engineers, identifying and addressing system issues is a daily activity. While we constantly monitor and troubleshoot system performance and reliability issues, the formalization of problem detection as a discipline is not as widespread as one might expect.
Let’s dive a bit deeper into the art and science of problem detection.
Problem detection is the structured process of identifying issues that potentially require action.
There are many different ways to approach problem detection, each having its own pros and cons. A common way we find out something is broken involves getting alerts when a service level objective (SLO) is violated. For example, when more than 1% of requests exceed a target latency of 400ms, an alert may fire. Or, we may notice an unusual increase in errors on a Grafana dashboard and open a ticket. Worst case, a customer reports the problem.
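To make the SLO example concrete, here is a minimal sketch of that latency check in Python. The function name, the shape of the input (a list of request latencies for one evaluation window), and the default values are assumptions for illustration, not the API of any particular monitoring tool.

```python
from typing import Sequence


def slo_violated(
    latencies_ms: Sequence[float],
    target_ms: float = 400.0,  # target latency taken from the example above
    budget: float = 0.01,      # at most 1% of requests may exceed the target
) -> bool:
    """Return True when more than `budget` of requests exceed `target_ms`."""
    if not latencies_ms:
        # No traffic in the window: nothing to alert on in this sketch.
        return False
    slow = sum(1 for latency in latencies_ms if latency > target_ms)
    return slow / len(latencies_ms) > budget


if __name__ == "__main__":
    # Hypothetical window: 97 fast requests and 3 slow ones (3% slow).
    window = [120.0] * 97 + [450.0, 600.0, 900.0]
    print(slo_violated(window))
```

A real alerting rule would typically evaluate this over a sliding time window and require the condition to hold for some duration before firing; both are left out here to keep the sketch short.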
In all of these cases, the high-level problem is actually a symptom of something else. So, you dive in and try to find out what’s causing it. You start troubleshooting. Whenever you look for or rule out a specific cause, you are doing problem detection again, this time reactively and one level deeper. In this way, problem detection can be recursive.
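One way to picture that recursion is as a walk over a service dependency map: start at the service showing the symptom and keep descending into unhealthy dependencies until none are left. The sketch below assumes a hypothetical `dependencies` mapping and a hypothetical `is_unhealthy` check; in practice both would come from your own service catalog and health signals.

```python
from typing import Callable, Dict, List, Optional


def find_underlying_problems(
    service: str,
    dependencies: Dict[str, List[str]],   # hypothetical map: service -> its dependencies
    is_unhealthy: Callable[[str], bool],  # hypothetical health check
    memo: Optional[Dict[str, List[str]]] = None,
) -> List[str]:
    """Return the deepest unhealthy services reachable from `service`."""
    if memo is None:
        memo = {}
    if service in memo:
        return memo[service]
    memo[service] = []  # placeholder guards against dependency cycles
    if not is_unhealthy(service):
        return []
    deeper: List[str] = []
    for dep in dependencies.get(service, []):
        for cause in find_underlying_problems(dep, dependencies, is_unhealthy, memo):
            if cause not in deeper:
                deeper.append(cause)
    # If no dependency is unhealthy, this service is the deepest problem found.
    memo[service] = deeper or [service]
    return memo[service]


if __name__ == "__main__":
    deps = {"checkout": ["payments", "cart"], "payments": ["db"], "cart": []}
    unhealthy = {"checkout", "payments", "db"}
    print(find_underlying_problems("checkout", deps, lambda s: s in unhealthy))
```

This only narrows down where to look. Deciding what is actually wrong with the service it returns is the next round of detection.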
In the scenarios outlined above, problem detection starts with a high-level symptom and works its way down the stack or across dependencies to find underlying problems. Ideally, a problem detection system would notify you of the underlying problem before high latency, error spikes, or customer complaints occur. A preemptive approach may involve applying data exploration, heuristics (aka rules), and statistical methods (aka anomaly detection) to hunt for problems in advance.
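As a rough illustration of the last two techniques, the sketch below pairs a simple rule with a basic statistical check. The disk-usage rule, the z-score method, and every threshold are assumptions chosen for the example; real detections would be tuned to the system being watched.

```python
from statistics import mean, stdev
from typing import Sequence


def rule_disk_nearly_full(used_fraction: float, threshold: float = 0.9) -> bool:
    """Heuristic (rule): flag a disk at or above a fixed usage threshold."""
    return used_fraction >= threshold


def is_anomalous(history: Sequence[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Statistical check: flag `latest` if it is far from the historical mean.

    Uses a plain z-score, which assumes the metric is roughly stable;
    seasonal or trending metrics need a more careful baseline.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > z_threshold


if __name__ == "__main__":
    # Hypothetical per-minute error counts for one service.
    errors_per_minute = [10, 12, 11, 9, 10, 11, 10, 12]
    print(is_anomalous(errors_per_minute, 25))
    print(rule_disk_nearly_full(0.95))
```

Rules are cheap to write and easy to explain but only catch what you already know to look for. Statistical checks can surface surprises, at the cost of more tuning to keep false positives down.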
In both directions, problem detection focuses on issue identification to help software remain reliable, available, and high-performing.
While we all find and identify problems as part of our routine, most organizations have not embedded problem detection formally into their practices. The processes and tools used to identify reliability issues tend to be ad hoc, unpredictable, reactive, and disjointed.
It’s time to change this.
A mature problem detection program is product-agnostic and focuses on the development, deployment, and tuning of detections to identify and contextualize problems.
Our peers in cybersecurity have figured out the advantages of formalizing detection practices. They use frameworks like the MITRE ATT&CK matrix to map out their detection capabilities and build a deliberate roadmap. The results in cybersecurity have been:
Unfortunately, reliability has lagged behind in this area, in part due to a misconception that each software failure is unique. Problem detection is about bringing the same rigor to reliability.
Looking around, we couldn’t find a place where engineers convene to improve their problem detection skills and programs. So, we gathered some of the smartest minds in this space and started one.
Detect (Detect.sh) is a global community of engineers focused on improving problem detection and resolution in modern applications.
We exchange best practices and get together for focused events.
Join us.
The Detect.sh Community