Detect #6: 🚨 Kafka data loss; 🛠️ Real debugging stories; 🌩️ All things eBPF; 🧯 Incidents at OpenAI, Anthropic, HubSpot, Google...

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.

Welcome to the latest edition of the only newsletter focused on the art and science of problem detection. Detect is brought to you by Prequel (prequel.dev), the team bringing detection engineering to reliability.

Hidden Bug of the Month 🐜 🚨 (Presented by Prequel)

A critical bug identified in Apache Kafka (3.8.0, 3.7.1) can result in data loss: a failure while offloading to remote storage can cause segments to be dropped before the retention limit is met. The Kafka community plans to address the issue in version 3.9.0. If your deployment offloads data to remote storage, you are strongly advised to upgrade once the fix is available. (First reported by Guillaume Mallet.)

See how Prequel helps teams detect a wide range of failures, powered by global reliability intelligence.

Community Articles 📝

Putting a Dent in Your Error Backlog (Dan Slimmon): Explore strategies to tackle the persistent issue of error backlog management. This article offers practical tips to make a meaningful impact. (detect.sh)

Community Events 🗓️

🎉 Detect Webinar (Tomorrow 9/4): Join us for our huge September webinar featuring Lorenzo Fontana, co-author of Linux Observability with BPF (O'Reilly). Hear about Lorenzo's experience with eBPF as the former maintainer of CNCF's Falco and IOVisor's kubectl-trace. We'll discuss the lessons he learned applying eBPF to observability and security, and his broader experience building and troubleshooting applications. Register here. 👈🏼
SREcon EMEA 2024 (29–31 October): Join fellow SREs in Dublin for SREcon24 EMEA, where industry leaders will discuss the latest in reliability engineering and incident management. (usenix.org)

And now, here is a digest of what happened last month in the world of problem detection.

Real Problem Detection & Troubleshooting Stories 📖

Sharpen your technical skills with these deep dives:

The Memory Debugging Journey: A deep dive into the nuances of memory debugging, with practical insights and lessons learned. (memfault.com)
Performance and Data Access Patterns: Discover why 90% of your performance issues might be tied to data access patterns. (swizec.com)
Recursion Gone Wild: A wild debugging story where recursion led to unexpected behavior and how it was resolved. (andreabergia.com)
Debugging CPython Live: An adventurous tale of live debugging in CPython and the challenges encountered. (disconnect3d.pl)
Rust Compiler Segfault on illumos: Explore the intricacies of debugging a Rust compiler segfault on the illumos operating system. (sunshowers.io)
The Good, the Bad, and the Retry: Learn from an incident where retry logic led to unexpected consequences. (medium.com)
Optimizing Global Latency: A detailed journey through optimizing TCP configurations to reduce global message transit latency. (ably.com)
BPF Adventures: An exploration of BPF and its use cases in modern debugging. (rachelbythebay.com)
The One-Hour-Per-Year Bug: A unique debugging story of a bug that only appeared once a year. (tomeraberba.ch)

Maturing Your Reliability Practices ↗️

Tips and tricks for stepping up your game:

Compliance Period: Balancing Risk and Innovation: Learn how to navigate the tension between compliance and rapid innovation in a reliability-focused environment. (alexewerlof.com)
Don't Repeat Incidents: A guide to preventing repeat incidents through better post-mortem practices. (rubick.com)
Adapting to Surprises: How software-reliant businesses can adapt to unexpected changes and improve resilience. (infoq.com)
The Hidden Costs of Chasing Five Nines: Explore the often-overlooked costs associated with achieving extreme levels of availability. (thenewstack.io)
Debug Smarter: Tips and strategies for more effective debugging, reducing incident resolution times, and improving overall system reliability. (rugu.dev)
Traces vs. Metrics: Discover why traces might be more effective than metrics for debugging complex systems. (jaywhy13.hashnode.dev)

Notable Incidents 🔥

Stay informed about the latest incidents and learn from their root cause analyses (RCAs), where available:

OpenAI Downtime: Multiple incidents this month at OpenAI, leading to significant downtime. (status.openai.com), (status.openai.com)
Google Services Disruption: A major outage affected several Google services. Read the post-incident analysis and recovery actions taken. (google.com), (downdetector.com)
HubSpot Outages: HubSpot experienced significant outages, impacting customers' ability to use key services. (status.hubspot.com), (status.hubspot.com)
GitHub’s Recent Outage: GitHub faced a significant outage affecting millions of developers worldwide. (githubstatus.com), (theverge.com)
Anthropic Incident: Anthropic's services were disrupted, causing downtime and impacting users. (status.anthropic.com)
Cloudflare Issues: A series of incidents at Cloudflare caused disruptions for many users. (cloudflarestatus.com), (cloudflarestatus.com)
Fastmail Outage: Fastmail experienced an outage, affecting email services for its users. (fastmailstatus.com)
Reddit Downtime: Reddit faced downtime due to an incident, affecting users' access to the platform. (redditstatus.com)
SendGrid Incident: An incident at SendGrid impacted the delivery of emails for its customers. (status.sendgrid.com)
Heroku Outage: Heroku published the root cause analysis of a recent outage that affected its services. (status.heroku.com)
Honeycomb Incident: Honeycomb's services were disrupted due to an incident, with details on the cause and resolution available. (status.honeycomb.io)
CrowdStrike Root Cause Analysis: In-depth (external) analysis of the root cause behind CrowdStrike’s recent incident. (crowdstrike.com)

Architecting for Reliability 📐👷‍♀️

Explore how teams are architecting their applications to reach new heights:

Dealing with Rejection in Distributed Systems: A deep dive into handling rejections gracefully in distributed systems. (warpstream.com)
When Publicity Gets in the Way of Scalability: A case study on the challenges faced when balancing publicity with the need for scalable infrastructure. (medium.com)
Scaling for Large Customers: Learn how Slack re-architected its infrastructure to handle its largest customers. (slack.engineering)
From Four to Five 9s of Uptime: A case study on migrating to Kubernetes to achieve higher uptime. (workos.com)
Uber’s Migration to a Hybrid Cloud: Insights into Uber’s journey from a traditional data center to a hybrid cloud environment, highlighting the challenges and successes. (infoq.com)
Rust Safe Garbage Collection: Exploring a novel approach to garbage collection in Rust, balancing safety and performance. (kyju.org)
Overcommit in C++: Balancing Memory and Performance: An exploration of memory management techniques in C++ and how overcommitment can impact system performance. (quuxplusone.github.io)

Whether you're on call, looking to optimize performance, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.

Follow our brand-new account on X (fka Twitter): @detect_sh

Did someone forward you this email? Join our mailing list so you'll be the first to know.

The detect.sh Community
