Deterministic vs. Probabilistic Problem Detection

May 28, 2024


Deterministic Detections

Deterministic detection involves predefined rules and patterns that trigger alerts when specific conditions are met. This method relies on clear, fixed criteria derived from known failure modes and operational events.

Some of these rules are simple, triggered when a single event occurs. Others are complex, relying on a combination of conditions or events.

For example:

Simple Rule: Triggering an alert when a program crashes.
Complex Rule: Triggering an alert when a messaging queue has a deadlock and a dependent service is producing 500 errors.
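
The two rule types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Event` schema, the event kinds ("crash", "deadlock", "http_500"), and the service names are all hypothetical, and a real system would evaluate rules over a time window of streamed telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A normalized operational event (hypothetical schema)."""
    source: str  # e.g. "orders-api" or "orders-queue"
    kind: str    # e.g. "crash", "deadlock", "http_500"


def simple_rule(events):
    """Simple rule: fire as soon as any single crash event is present."""
    return any(e.kind == "crash" for e in events)


def complex_rule(events, queue, dependent_service):
    """Complex rule: fire only when the queue reports a deadlock AND the
    dependent service is producing 500 errors in the same window."""
    queue_deadlocked = any(
        e.source == queue and e.kind == "deadlock" for e in events
    )
    service_failing = any(
        e.source == dependent_service and e.kind == "http_500" for e in events
    )
    return queue_deadlocked and service_failing


# One evaluation window containing both correlated conditions.
window = [
    Event("orders-queue", "deadlock"),
    Event("orders-api", "http_500"),
]
print(simple_rule(window))                                 # False: no crash event
print(complex_rule(window, "orders-queue", "orders-api"))  # True: both conditions hold
```

Because each rule is an explicit boolean condition, anyone reading it can see exactly why an alert fired.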

Advantages

Simplicity: Easy to understand. The logic is self-explanatory or 'white box'.
Predictability: Provides clear, reliable alerts that consistently fire when well-defined conditions are met.
Speed: Can quickly identify issues that match predefined patterns.

Limitations

Rigidity: Will miss issues that do not fit the predefined criteria.
False Negatives: Overly strict rules can miss unknown failure states.

Probabilistic Detections

Probabilistic detection, on the other hand, uses statistical methods, including machine learning, to identify anomalies based on historical data and patterns.

Some of these models are designed to detect anomalies and others are focused on predicting future failures.

For example:

Anomaly Detection: If the response time of a web service suddenly spikes beyond the normal range predicted by a machine learning model, an alert is generated.
Predictive Analytics: If trend analysis indicates that disk space usage will exceed 90% within the next 24 hours, an alert is generated.
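
As a rough sketch of both examples, the code below substitutes very simple statistics for a trained model: a z-score check stands in for anomaly detection, and a least-squares trend line stands in for predictive analytics. The function names, the 3-sigma threshold, the hourly sampling interval, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` standard deviations
    from the mean of `history` (needs at least two historical samples)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


def hours_until_threshold(usage_pct, threshold=90.0):
    """Fit a least-squares line to hourly disk usage samples and return the
    hours from the last sample until the line crosses `threshold`.
    Returns None if usage is flat or falling."""
    n = len(usage_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(usage_pct)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, usage_pct)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    crossing_hour = (threshold - intercept) / slope
    return max(0.0, crossing_hour - (n - 1))


# Anomaly detection: response times in ms, then a sudden spike.
response_ms = [120, 118, 125, 122, 119, 121, 124, 120]
print(is_anomalous(response_ms, 480))  # True: far outside the normal range
print(is_anomalous(response_ms, 123))  # False: within normal variation

# Predictive analytics: disk usage growing one point per hour.
disk_pct = [70.0, 71.0, 72.0, 73.0, 74.0]
eta = hours_until_threshold(disk_pct)
print(eta is not None and eta <= 24)   # True: alert, 90% is projected within 24 hours
```

Note that neither function encodes a known failure mode. Both infer what is "normal" from history, which is the source of the flexibility and of the limitations discussed in this section.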

Advantages

Flexibility: Can adapt to new and evolving conditions. Designed to detect issues that may not conform to predefined rules.
Comprehensive: Provides a broader understanding of system behavior, incorporating a variety of factors.
Sensitive: Capable of identifying subtle anomalies that deterministic methods might miss.

Limitations

False Positives: Defining normal in modern systems is nearly impossible. Change and variability are constant, so most alerts end up being noise.
False Negatives: Models can become too tailored to specific historical data (overfitting), so they miss failures that look different from anything seen before.
Training period: Models require weeks or months of training data.
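
A back-of-the-envelope calculation shows why false positives add up so quickly. The sketch below assumes a hypothetical fleet of 1,000 metrics checked once per minute against a 3-sigma threshold, and it assumes each check is an independent draw from a normal distribution. Real telemetry satisfies neither assumption, so treat the result as intuition only, not a measurement.

```python
from statistics import NormalDist


def expected_false_alerts_per_day(num_metrics, checks_per_day, z_threshold=3.0):
    """Expected alerts per day from perfectly healthy metrics, assuming every
    check is an independent sample from a normal distribution."""
    # Two-sided probability that a healthy sample lands beyond the threshold.
    tail = 2 * (1 - NormalDist().cdf(z_threshold))
    return num_metrics * checks_per_day * tail


# 1,000 metrics checked once per minute (1,440 checks per day).
alerts = expected_false_alerts_per_day(1000, 1440)
print(round(alerts))  # roughly 3,900 alerts per day with nothing actually wrong
```

Even a threshold that is wrong only about 0.27% of the time produces thousands of daily alerts at this scale, before any real change or variability in the system is taken into account.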

How are enterprises doing today?

To understand the State of Problem Detection, Detect met with over 150 engineering teams.

As discussed, deterministic rules can provide straightforward, actionable alerts for known issues, while probabilistic methods promise deeper insights and early warnings for less obvious problems. Since Gartner coined the AIOps category in 2017, and with the current buzz around Generative AI, we've seen many vendors emphasize probabilistic methods, but the results have not followed.

Here's what else we found:

1. Most engineering teams rely on threshold-based rules both for alerts that 'wake people up' and for low-priority alerts. However, these alerts are often superficial, requiring teams to wade through traces, logs, and metrics to figure out what is causing the issue.

2. Most engineering teams have tried anomaly-based alerts but now largely ignore them because of the noise. One organization showed us 26,000 alerts.

3. Complex rules are often the most reliable alerts, but teams face the following adoption challenges:

Limited rules for third-party libraries and applications
Limited rules for homegrown applications
No ability to correlate data across disparate data types
Observability cost pressures and technical gaps block access to fine-grained data required for accurate rules

Detect's mission is to help engineering teams improve their problem detection and troubleshooting capabilities. We encourage you and your team to assess your posture in this area and develop a problem detection roadmap leveraging the free resources on this site.
