Detect #12: Outages at Slack, Cloudflare, PlayStation, Go Profiling Tricks, Understanding Kubernetes Evictions
March 4, 2025
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Dan Slimmon (HashiCorp, Etsy, and Clerk) spends his time "looking for trouble": understanding where applications will break and fixing them before they do.
SREcon Americas 2025: 3/25 - 3/27 in San Francisco – The premier SRE event is coming up. Expect technical deep dives into large-scale distributed systems, incident management, and emerging methodologies. (usenix)
Notable Incidents 🔥
Slack Outage – A cascading failure led to widespread issues loading and connecting to Slack. The root cause was traced to database maintenance actions. (slack)
PlayStation Network Down – Users reported authentication failures due to a misconfigured CDN cache that prevented token validation. (the verge)
Cloudflare Outage – Cloudflare experienced an outage caused by a misconfiguration, leading to increased error rates and latency across multiple regions. Detailed post mortem included. (cloudflare)
OpenStreetMap Post-Mortem – A database corruption event required a full rollback to a recent snapshot. Detailed post mortem included. (osm foundation)
Discord Service Disruption – Message queuing congestion resulted in significant delays, compounded by an inefficient retry mechanism. (discord)
Troubleshoot & Debug 📖
Linux Kernel Debugging – Tracing low-level syscalls and memory corruption issues using perf, ftrace, and BPF. (dasl)
Stripe’s ML-Driven Performance Detection – Using machine learning models trained on historical payment latency data, Stripe detects subtle degradations before they impact customers. (stripe)
Debug a Hanging Go Program – An exploration of Go runtime internals, goroutine scheduling issues, and diagnosing deadlocks in production workloads. (michael stapelberg)
Profiling in Production – Using function call traces and eBPF to pinpoint high CPU utilization in performance-sensitive applications. (yosefk)
You're Missing Your Near Misses – A case for systematically analyzing near misses in addition to incidents to improve system resilience. (surfing complexity)
Meta-Incident Reviews – Reviewing the effectiveness of incident post-mortems. (will gallego)
Architecting for Reliability 📐👷‍♀️
Every Pod Eviction, Explained – How eviction policies interact with resource overcommitment, node pressure thresholds, and pod disruption budgets. (ahmet alp balkan)
Rust Memory Management – A deep dive into Rust's borrow checker, stack vs. heap allocation, and how it enforces memory safety guarantees at compile time. (infoworld)
Tools 🛠️
eBPF for Windows – Bringing low-overhead kernel observability to Windows environments with eBPF, enabling network traffic analysis, observability, and security enforcement. (scorpio software)
Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.