What is Problem Detection?




September 3, 2024


For Site Reliability Engineers (SREs) and Software Engineers, identifying and addressing system issues is a daily activity. While we constantly monitor and troubleshoot system performance and reliability issues, the formalization of problem detection as a discipline is not as widespread as one might expect.

Let’s dive a bit deeper into the art and science of problem detection.

The Essence of Problem Detection

Problem detection is the structured process of identifying issues that potentially require action.

There are many different ways to approach problem detection, each having its own pros and cons. A common way we find out something is broken involves getting alerts when a service level objective (SLO) is violated. For example, when more than 1% of requests exceed a target latency of 400ms, an alert may fire. Or, we may notice an unusual increase in errors on a Grafana dashboard and open a ticket. Worst case, a customer reports the problem.
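The SLO example above can be sketched in a few lines. This is a minimal illustration, assuming the latencies for one evaluation window are already collected in a list; the function name `slo_violated` and the constants are hypothetical and do not come from any particular alerting tool.

```python
from typing import Sequence

# Values taken from the example in the text; the names are illustrative.
LATENCY_TARGET_MS = 400.0   # target latency
MAX_SLOW_FRACTION = 0.01    # alert when more than 1% of requests are slower


def slo_violated(
    latencies_ms: Sequence[float],
    target_ms: float = LATENCY_TARGET_MS,
    max_slow_fraction: float = MAX_SLOW_FRACTION,
) -> bool:
    """Return True when the share of slow requests in the window exceeds the budget."""
    if not latencies_ms:
        return False  # no traffic in the window, nothing to evaluate
    slow = sum(1 for ms in latencies_ms if ms > target_ms)
    return slow / len(latencies_ms) > max_slow_fraction


if __name__ == "__main__":
    window = [120.0] * 97 + [450.0, 900.0, 610.0]  # 3 of 100 requests are slow
    print("alert" if slo_violated(window) else "ok")
```

In practice the window length and the way latencies are sampled matter as much as the threshold itself, so treat the numbers here as placeholders.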

In all of these cases, the high-level problem is actually a symptom of something else. So, you dive in and start troubleshooting to find out what's causing it. Whenever you look for or rule out a specific cause, you are doing problem detection again, this time reactively and one level down. In this way, problem detection can be recursive.

In the scenarios outlined above, problem detection starts with a high-level symptom and works its way down the stack or across dependencies to find underlying problems. Ideally, a problem detection system could notify you of the underlying problem before high latency, error spikes, or customer complaints occur. A preemptive approach may involve applying data exploration, heuristics (aka rules), or statistical methods (aka anomaly detection) to hunt for problems in advance.
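To make the difference between a heuristic and a statistical method concrete, here is a rough sketch of each. Both functions, the disk-usage rule, and the thresholds (90% and a z-score of 3) are assumptions chosen for illustration, not a recommended configuration.

```python
from statistics import mean, stdev
from typing import Sequence


def rule_detects(disk_used_fraction: float, threshold: float = 0.9) -> bool:
    """Heuristic (rule): flag a disk that is more than `threshold` full."""
    return disk_used_fraction > threshold


def zscore_detects(history: Sequence[float], latest: float, max_z: float = 3.0) -> bool:
    """Statistical method: flag `latest` if it lies more than `max_z`
    sample standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change counts as unusual
    return abs(latest - mu) / sigma > max_z


if __name__ == "__main__":
    errors_per_minute = [10, 12, 11, 9, 10, 11, 10, 12]
    print(rule_detects(0.95))                    # rule fires on a nearly full disk
    print(zscore_detects(errors_per_minute, 30)) # spike far outside the usual range
```

The rule encodes knowledge about a specific failure mode, while the z-score check knows nothing about the system and only reacts to deviation from recent behavior. The two tend to complement each other.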

In both directions, problem detection focuses on issue identification to help software remain reliable, available, and high-performing.

The State of Problem Detection In 2024

While we all find and identify problems as part of our routine, most organizations have not yet embedded problem detection into formal practice. Processes and tools for identifying reliability issues tend to be ad hoc, unpredictable, reactive, and disjointed.

It’s time to change this.

A mature problem detection program is product-agnostic and focuses on the development, deployment, and tuning of detections that identify and contextualize problems.

Lessons from Other Industries

Our peers in cybersecurity have figured out the advantages of formalizing detection practices. They use frameworks like the MITRE ATT&CK matrix to map out their detection capabilities and build a deliberate roadmap. The results in cybersecurity have been:

More efficient use of data and expertise against the most urgent gaps
Standard ways to map and evaluate vendors
The ability to exchange detections across organizations

Unfortunately, reliability has lagged behind in this area, in part due to a misconception that each software failure is unique. Problem detection is about bringing the same rigor to reliability.

Why Detect.sh?

Looking around, we couldn’t find a place where engineers convene to improve their problem detection skills and programs. So, we gathered some of the smartest minds in this space and started one.

Detect (Detect.sh) is a global community of engineers focused on improving problem detection and resolution in modern applications.

We exchange best practices and get together for focused events.

Join us.

The Detect.sh Community
