Detect #13: OpenAI Error Rates, Lies Programmers Believe About Memory, SREcon ....




April 2, 2025

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.





Events 🗓️

SREcon Americas 2025: 3/25 - 3/27 in Santa Clara - What a week. The hallway track was buzzing with stories about complex failures and career longevity. This year’s talks explored everything from real-world failovers that didn’t go as planned to performance mysteries buried deep in the Linux block layer. If we didn’t bump into each other, let’s fix that next time. If you missed it, definitely check out the slides (when posted); a few talks are linked below. (usenix)
KubeCon + CloudNativeCon Europe 2025: 4/1 - 4/4 in London - This flagship conference, organized by CNCF, includes deep dives into multi-cluster networking, eBPF-based observability, and securing the software supply chain. SIGs and working groups are showcasing progress on topics like Gateway API adoption, scheduling extensions, and WASM integration with Kubernetes. (cncf)

Notable Incidents 🔥

Cloudflare: March 21 Incident - A change to mitigate abuse of the WARP client triggered elevated CPU and memory usage, leading to degraded performance across multiple edge locations. (cloudflare blog)
Grafana Cloud: Partial Outage – Grafana Cloud experienced a two-and-a-half-hour partial outage after a TLS policy update unintentionally caused Kubernetes load balancers to be destroyed. (grafana)
OpenAI Errors - Multiple availability and latency issues affected the OpenAI UI and APIs. In one case, traffic was unintentionally directed to a compute node with an OS-level fault. This triggered automated rate-limiting, resulting in an elevated number of HTTP 429 responses. At its peak, the system saw a global error rate of approximately 20%. (March 25) (March 28)
X outages - The platform experienced multiple outages affecting users globally. The most significant disruption occurred on March 10, when users were unable to access the platform for several hours. Musk attributed this outage to a "massive cyberattack." (tom's hardware)

Troubleshoot & Debug 📖

The Hardest Bug I Ever Debugged – A debugging deep dive through subtle race conditions and misunderstood stack traces. (client server dev)
Simulating CPU Bottlenecks with LLVM-MCA - A hands-on intro to LLVM-MCA, a performance modeling tool for CPU-bound code paths. (johnny’s lab)
Monitoring Chain-of-Thought Models - Internal debugging of reasoning failures in chain-of-thought models with synthetic signals and live probes. (openai)
SREcon: Tackling Slow Queries: A Practical Approach to Prevention and Correction – Kurni Famili and Brad Feehan from Shopify discuss a two-pronged strategy to address slow database queries. (srecon)
SREcon: The Search for Speed – Scott Laird shares how he tackled poor performance and high costs in a managed OpenSearch service by applying fundamentals: monitoring, modeling, and experimentation. (srecon)
SREcon: Case Study: A Thundering Herd in the Wild – Nicolas Arroyo shares how Bloomberg uncovered a rare variant of the thundering herd problem hidden behind standard libraries and absolute timers. (srecon)

Fresh Ideas & Perspectives 🤔

SREcon: Measuring Availability the Player Focused Way: How Riot Games Changed Its Availability Culture – This talk covers how Riot Games redefined its approach to measuring availability by aligning it with player experience. The session outlines the methods used to shift the organizational culture towards a more player-centric availability model. (srecon)
Teaching a New Way To Prevent Outages - Methods for teaching failure analysis and systems thinking using STPA and narrative techniques. (google)
Learning from Failure in Homelabs - A compelling case for writing postmortems even for small personal failures. Great storytelling. (bash)

Architecting for Reliability 📐👷‍♀️

SREcon: Lies Programmers Believe about Memory – Chris Down, a kernel engineer at Meta, delves into common misconceptions about Linux kernel memory management. (srecon)
SREcon: Maturing Your Data Architecture in a Week: How Bluesky Survived – Jaz Volpert from Bluesky PBC recounts the rapid scaling challenges faced during a sudden user surge in November 2024. (srecon)
Shopify’s Journey to Planet-Scale Observability – Shopify walks through building their own observability stack to support tens of thousands of workloads. Includes lessons on scale, developer ergonomics, and performance trade-offs. (horovits)
Title Launch Observability at Netflix Scale – Netflix built a custom observability layer to track how titles go live across hundreds of devices and regions. The system validates that launches happen as expected, with the right metadata and user availability. (netflix)
Premature Optimization – A refreshingly nuanced take on when early optimization is worthwhile. The post walks through real-world cases where delaying optimization actually made systems harder to maintain. (ewerlöf)

Tools 🛠️

OpenTelemetry Demo v2 - A major revamp of the OpenTelemetry demo stack. It now includes load testing, richer trace scenarios, and automatic dashboards. (otel blog)

Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.

Follow us on X: @detect_sh

Did someone forward you this email? Join our mailing list so you'll be the first to know.
