Detect #12: Outages at Slack, Cloudflare, PlayStation, Go Profiling Tricks, Understanding Kubernetes Evictions

March 4, 2025

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.



Podcasts 🎙️

Dan Slimmon (HashiCorp, Etsy, and Clerk) spends his time "looking for trouble": understanding where applications will break and fixing them before they do.

He recently joined the Prequel podcast to discuss ways to detect problems before they become incidents. Listen here. 👈

Upcoming Events 🗓️

SREcon Americas 2025: 3/25 - 3/27 in San Francisco - The premier SRE event is coming up. Expect technical deep dives into large-scale distributed systems, incident management, and emerging methodologies. (usenix)

Notable Incidents 🔥

Slack Outage – A cascading failure caused widespread problems loading and connecting to Slack. The root cause was traced to database maintenance actions. (slack)
PlayStation Network Down – Users reported authentication failures due to a misconfigured CDN cache that prevented token validation. (the verge)
Cloudflare Outage – Cloudflare experienced an outage caused by a misconfiguration, leading to increased error rates and latency across multiple regions. Detailed post mortem included. (cloudflare)
OpenStreetMap Post-Mortem – A database corruption event required a full rollback to a recent snapshot. Detailed post mortem included. (osm foundation)
Discord Service Disruption – Message queuing congestion resulted in significant delays, compounded by an inefficient retry mechanism. (discord)

Troubleshoot & Debug 📖

Linux Kernel Debugging – Tracing low-level syscalls and memory corruption issues using perf, ftrace, and BPF. (dasl)
Stripe’s ML-Driven Performance Detection – Using machine learning models trained on historical payment latency data, Stripe detects subtle degradations before they impact customers. (stripe)
Debug a Hanging Go Program – An exploration of Go runtime internals, goroutine scheduling issues, and diagnosing deadlocks in production workloads. (michael stapelberg)
Profiling in Production – Using function call traces and eBPF to pinpoint high CPU utilization in performance-sensitive applications. (yosefk)
Head-scratching eBPF Issues – Diagnosing issues in kernel eBPF output. (tanel poder)

Fresh Ideas & Perspectives 🤔

You're Missing Your Near Misses – A case for systematically analyzing near misses in addition to incidents to improve system resilience. (surfing complexity)
Meta-Incident Reviews – Reviewing the effectiveness of incident post-mortems. (will gallego)

Architecting for Reliability 📐👷‍♀️

Every Pod Eviction, Explained – How eviction policies interact with resource overcommitment, node pressure thresholds, and pod disruption budgets. (ahmet alp balkan)
Rust Memory Management – A deep dive into Rust's borrow checker, stack vs. heap allocation, and how it enforces memory safety guarantees at compile time. (infoworld)

Tools 🛠️

eBPF for Windows – Bringing low-overhead kernel instrumentation to Windows environments with eBPF, enabling network traffic analysis, observability, and security enforcement. (scorpio software)

Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.

Follow us on X: @detect_sh

Join our mailing list so you'll be the first to know.

The detect.sh Community


