Detect #11: Capital One and GitHub outages, new profiling tools, and the incident severity debate
Community
March 4, 2025
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to this month's edition of Detect! We're brought to you by Prequel, the community-driven problem detection platform.
Here's a digest of what happened last month in the world of problem detection.
DeepSeek's R1 model is making waves with its superior cost efficiency, while recent outages at OpenAI raise questions about reliability—one of AI’s most critical features.
ChatGPT Outage Reports (downdetector)
Upcoming Events 🗓️
Kubernetes Problem Detection Workshop (with Kelsey Hightower) - 2/10 in San Francisco. A free hands-on session focused on implementing and maturing problem detection methods. Seats are limited. (lu.ma)
SREcon Americas 2025 - 3/25 in San Francisco. Join some of the brightest minds in Site Reliability Engineering. Topics include post-incident reviews, designing large-scale distributed systems, and much more. (usenix)
Notable Incidents 🔥
OpenAI Service Incident - Users encountered extensive interruptions to ChatGPT and related services. (status.openai)
Capital One Outage - Customers reported difficulties accessing deposits for several days. (nbcnews)
GitHub Service Interruptions - Another major incident. (githubstatus)
Bitbucket Worldwide Outage - Many users found repository access and CI pipelines disrupted. (bleepingcomputer)
Microsoft MFA Outage - Microsoft 365 apps were blocked for users relying on multi-factor authentication. (bleepingcomputer)
Proton Outages - Proton’s mail and VPN services experienced downtime tied to a Kubernetes migration gone awry and the TikTok ban. (status.proton) (bleepingcomputer)
Troubleshooting Stories 📖
Debugging with DistroTube - A video exploring advanced troubleshooting on Linux systems and Docker containers. (youtube)
The Hunt for Error -22 - When cryptic error codes derail your workflow, methodical sleuthing can solve the puzzle. (tweedegolf)
AI-Driven Speed vs. Human Accuracy - A cautionary tale of AI overpromising in code analysis and underdelivering in real debugging. (nsavage)
Debugging Is a Story - Approaching debugging as narrative might reveal hidden clues and patterns. (buttondown)
Architecting for Reliability 📐👷‍♀️
Kubernetes Controller Pitfalls - Common mistakes that can turn your controllers into reliability risks. (ahmet)
Exploring the Kubernetes API Server Proxy - Tips for securing and optimizing API requests through the built-in proxy. (raesene)
Scaling User Restriction Logic at LinkedIn - A behind-the-scenes look at how LinkedIn ensures consistent, performant data checks for billions of users. (blog.bytebytego)
Dropbox’s Evolving Infrastructure Messaging Model - An inside view of how Dropbox moved from monolithic messaging to an asynchronous platform. (dropbox)
Cache Invalidation for Heavy Requests - Techniques to manage caching effectively when requests or responses are large. (punits)
Maturing Your Reliability Program 🌱
Incident Severity Scales - Dan Slimmon argues that rigid severity scales may be more hindrance than help. (danslimmon)
Evolution of Incident Response at Podia - A practical look at how Podia grew their processes around reliable services. (ideasasylum)
Tools 🛠️
Strobelight - A profiling service developed by Meta Production Engineering, built on open-source foundations for in-depth performance insights across large-scale systems. (engineering.fb)
uScope - Gain granular observability into distributed systems and application performance. (calabro)
Perforator - A performance measurement and profiling tool designed for modern cloud-native stacks. (perforator)
Whether you're on call this week, looking to improve system stability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.