“Let’s track our production errors,” they said. “We’ll harvest insights,” they said. And 3 years later, all we have to show for it is an error tracking dashboard so bloated with junk that it makes us sick to look at.
When error tracking is working, engineers engage with it regularly, scrutinizing every new error thrown. This regular engagement is what transmutes raw error data into meaning. If engagement ceases (or never gets started) then, like bull thistle in a sad old garden, noise dominates.
Of course we often don’t realize how noisy the errors have gotten until things are already well out of hand. After all, we’ve got shit to do. Deadlines to hit. By the time we decide to get serious about error management, a huge, impenetrable, meaningless backlog of errors has already accumulated. I call this stuff slag.
Slag is viscous. Try to dig yourself out of the heap by brute force, one error at a time, starting with the most common, and you won’t get very far. After you investigate the top 10 errors and find out that 9 of them are complete non-issues that aren’t worth fixing, the wind will drain from your sails. Investigating errors takes a lot of time, and there are still 340 to go! Wait, I just refreshed the page and there’s 348 now.
Slag engenders hopelessness, and hopelessness drives teams to declare bankruptcy on error tracking.
Slag engenders hopelessness because you'd have to dig through essentially all of it before getting any value. But by excluding behaviors, you can create incremental value as you burn down the error list. This changes the tradeoff, making error remediation into something that's immediately and obviously worth doing.
Suppose you have a list of errors that your system throws in production. Sorting this list by frequency and eyeballing it, you get a rough sense of what kinds of errors it contains and in what proportions.
I would look at this list and say, “Well, deadlocks are never expected or desired, and they’re often contributing factors in larger problems… so let’s exclude deadlocks.” (Someone else, with different constraints and knowledge, might justifiably pick a different behavior to exclude.) Anyway, we pick a behavior, then we exclude it.
Here’s how you exclude a behavior:

1. Fix every instance of the behavior that currently occurs in production.
2. Set up alerting so that, if the behavior ever happens again, you find out about it immediately.
3. When that alert fires, treat the recurrence as a bug and fix it right away.
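To make the alerting step concrete, here's a minimal sketch in Python. The names (`handle_error_event`, `page_oncall`, `enqueue_for_triage`) are hypothetical stand-ins for whatever your error tracker and paging tool actually expose; the point is just that anything in an excluded class bypasses normal triage and pages someone immediately.

```python
# Sketch of the "alert on any recurrence" half of excluding a behavior.
# All names here are placeholders, not any particular vendor's API.

EXCLUDED_BEHAVIORS = {
    "DeadlockDetected",   # excluded: the system is known never to deadlock
    "OutOfMemoryError",   # excluded: the system is known never to OOM
}


def handle_error_event(event: dict) -> None:
    """Route a single error event from the tracker.

    Excluded behaviors page immediately: they are, by definition, never
    expected, so every occurrence is a bug to fix right away. Everything
    else goes through the normal triage queue.
    """
    if event["error_class"] in EXCLUDED_BEHAVIORS:
        page_oncall(
            summary=f"Excluded behavior occurred: {event['error_class']}",
            details=event,
        )
    else:
        enqueue_for_triage(event)


def page_oncall(summary: str, details: dict) -> None:
    # Placeholder: call your paging tool (PagerDuty, Opsgenie, etc.) here.
    print(f"PAGE: {summary}")


def enqueue_for_triage(event: dict) -> None:
    # Placeholder: the ordinary error-tracking workflow.
    print(f"triage: {event['error_class']}")


if __name__ == "__main__":
    # A deadlock pages immediately; a run-of-the-mill error gets triaged.
    handle_error_event({"error_class": "DeadlockDetected", "service": "billing"})
    handle_error_event({"error_class": "ValidationError", "service": "billing"})
```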
When you exclude a behavior, you get immediate incremental value. Where before there was a system that would sometimes deadlock in production, now there’s a system that is known never to deadlock in production.
This guarantee is immensely valuable. By eliminating deadlocks from the system, you block off a whole range of ways that surprising failure modes could creep into your system. This yields a direct increase in reliability.
Excluding a behavior also makes your system easier to troubleshoot! Suppose you’re hunting down a bug that manifests as sudden server process crashes in production. You might wonder if an out-of-memory condition could be to blame for this behavior. And so you might spend half a day scrolling through logs, trying to correlate OOM events with your crashes. Whereas, if you’ve excluded out-of-memory errors, then you can hop right over that whole entire rabbit hole. Haven’t been notified about any OOMs? Then there haven’t been any OOMs.
Here are some classes of behavior that you might choose to exclude: deadlocks, say, or out-of-memory errors.
It shouldn’t be hard to think of more.
Do you really have to eliminate every member of an excluded class? Can’t you make exceptions?
Sure, you can make exceptions. Just be sure to document the reasoning behind every exception you make.
Because another great thing you get out of excluded behaviors is a list of known vulnerabilities to failure. This list is worth its weight in gold as a tool for knowledge transfer activities, such as onboarding, planning, and architecture design.
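That list doesn't need to be fancy. Here's a hypothetical sketch of how it might live as a small version-controlled Python module (a YAML file or a wiki page would serve just as well); every behavior, date, and reason in it is made up for illustration.

```python
# Hypothetical registry of excluded behaviors and their documented exceptions.
# Everything below (behaviors, dates, reasoning) is illustrative placeholder
# data. Kept in version control, a file like this doubles as the "list of
# known vulnerabilities" used for onboarding, planning, and design reviews.

EXCLUDED_BEHAVIORS = [
    {
        "behavior": "database deadlocks",
        "excluded_since": "2024-03-01",  # placeholder date
        "exceptions": [
            {
                "where": "nightly bulk-reconciliation job",  # placeholder
                "reasoning": (
                    "Automatic retries make deadlocks here harmless, and "
                    "removing them would mean rewriting the job. Accepted "
                    "for now; revisit if the job ever runs during the day."
                ),
            },
        ],
    },
    {
        "behavior": "out-of-memory errors",
        "excluded_since": "2024-06-15",  # placeholder date
        "exceptions": [],  # no exceptions: any OOM is a bug, full stop
    },
]

if __name__ == "__main__":
    # Print a quick summary, e.g. for pasting into an onboarding doc.
    for entry in EXCLUDED_BEHAVIORS:
        print(f"{entry['behavior']} (excluded since {entry['excluded_since']})")
        for exc in entry["exceptions"]:
            print(f"  exception: {exc['where']}: {exc['reasoning']}")
```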
After a while, you get kind of addicted to excluding behaviors. Each new exclusion makes your production system that much more boring.
And boring is how we like ’em.
This article originally appeared on Dan Slimmon's Scientific Incident Response blog.