Detect #11: Capital One and GitHub outages, new profiling tools, and the incident severity debate

March 4, 2025

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.

Upcoming Events 🗓️

Kubernetes Problem Detection Workshop (with Kelsey Hightower) - 2/10 in San Francisco
A free hands-on session focused on implementing and maturing problem detection methods. Seats are limited. (lu.ma)
SREcon Americas 2025 - 3/25 in San Francisco
Join some of the brightest minds in Site Reliability Engineering. Topics include post-incident reviews, designing large-scale distributed systems, and much more. (usenix)

Notable Incidents 🔥

OpenAI Service Incident - Users encountered extensive interruptions to ChatGPT and related services. (status.openai)
Capital One Outage - Customers reported difficulties accessing deposits for several days. (nbcnews)
GitHub Service Interruptions - Another major incident. (githubstatus)
Bitbucket Worldwide Outage - Many users found repository access and CI pipelines disrupted. (bleepingcomputer)
Microsoft MFA Outage - Microsoft 365 apps were blocked for users relying on multi-factor authentication. (bleepingcomputer)
Proton Outages - Proton’s mail and VPN services experienced downtime tied to a Kubernetes migration gone awry and the TikTok ban. (status.proton) (bleepingcomputer)

Troubleshooting Stories 📖

Debugging with DistroTube - A video exploring advanced troubleshooting on Linux systems and Docker containers. (youtube)
The Hunt for Error -22 - When cryptic error codes derail your workflow, methodical sleuthing can solve the puzzle. (tweedegolf)
AI-Driven Speed vs. Human Accuracy - A cautionary tale of AI overpromising in code analysis and underdelivering in real debugging. (nsavage)
Debugging Is a Story - Approaching debugging as narrative might reveal hidden clues and patterns. (buttondown)

Architecting for Reliability 📐👷‍♀️

Kubernetes Controller Pitfalls - Common mistakes that can turn your controllers into reliability risks. (ahmet)
Exploring the Kubernetes API Server Proxy - Tips for securing and optimizing API requests through the built-in proxy. (raesene)
Scaling User Restriction Logic at LinkedIn - A behind-the-scenes look at how LinkedIn ensures consistent, performant data checks for billions of users. (blog.bytebytego)
Dropbox’s Evolving Infrastructure Messaging Model - An inside view of how Dropbox moved from monolithic messaging to an asynchronous platform. (dropbox)
Cache Invalidation for Heavy Requests - Techniques to manage caching effectively when requests or responses are large. (punits)

Maturing Your Reliability Program 🌱

Incident Severity Scales - Dan Slimmon argues that rigid severity scales may be more hindrance than help. (danslimmon)
Evolution of Incident Response at Podia - A practical look at how Podia grew their processes around reliable services. (ideasasylum)

Tools 🛠️

Strobelight - A profiling service developed by Meta Production Engineering, built on open-source foundations for in-depth performance insights across large-scale systems. (engineering.fb)
uScope - Gain granular observability into distributed systems and application performance. (calabro)
Perforator - A performance measurement and profiling tool designed for modern cloud-native stacks. (perforator)

Whether you're on call this week, looking to improve system stability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.

Follow us on X: @detect_sh

Join our mailing list so you'll be the first to know.

The detect.sh Community