Detect Newsletter #5: 🚨 Hidden Bug of the Month; 🧯 Root Cause of CrowdStrike's $5B outage; 🌩️ Incidents at Cloudflare and GitHub; 🛠️ Memory profiling, Kafka monitoring and more




September 3, 2024

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.





Welcome to the latest edition of the only newsletter focused on the art and science of problem detection. Detect is brought to you by Prequel (prequel.dev), the team bringing detection engineering to reliability.

Hidden Bug of the Month 🐜 🚨 (Presented by Prequel)

Kafka users, be on the lookout for resource leaks caused by connector startup failures. When a connector fails during startup, resources it has already acquired may never be released, which can lead to memory leaks and system instability. Reported by @ashoke-cube. See how Prequel helps teams detect a wide range of failures, powered by global reliability intelligence.
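
To make the failure mode concrete, here is a minimal Python sketch of the general pattern, not the actual Kafka code or its fix. `FakeResource` and `start_connector` are hypothetical names invented for illustration. The idea is to register each cleanup as soon as a resource exists, so a failure halfway through startup releases everything opened so far.

```python
from contextlib import ExitStack


class FakeResource:
    """Hypothetical stand-in for a client, thread pool, or socket."""

    def __init__(self, name, fail=False):
        if fail:
            raise RuntimeError(f"could not open {name}")
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


def start_connector(specs, opened):
    """Open every resource described in specs, or none of them.

    specs is a list of (name, fail) pairs. Each resource that opens is
    appended to opened so the caller can inspect it afterwards.
    Returns an ExitStack whose close() releases everything.
    """
    with ExitStack() as stack:
        for name, fail in specs:
            res = FakeResource(name, fail)
            opened.append(res)
            # Register cleanup immediately, before the next step can fail.
            stack.callback(res.close)
        # Startup succeeded: transfer the cleanups to the caller.
        return stack.pop_all()
```

If the third resource fails to open, the first two are closed before the exception propagates. On success nothing is closed until the caller invokes `close()` on the returned stack.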

Upcoming Events 🗓️

🎉 Join us for our huge September webinar featuring Lorenzo Fontana, co-author of Linux Observability with BPF (O'Reilly). Hear about Lorenzo's experience with eBPF as a former maintainer of CNCF's Falco and IOVisor's kubectl-trace. We'll discuss the lessons he learned applying eBPF to observability and security, as well as his broader experience building and troubleshooting applications. Register here. 👈🏼

And now, here is a digest of what happened last month in the world of problem detection.

Image: CrowdStrike/Microsoft Blue Screen of Death (BSOD)

Notable Incidents 🔥

Stay informed about the latest incidents and learn from their root cause analyses (RCAs), where available:

CrowdStrike - Of course you've been tracking the massive global IT outage caused by a faulty CrowdStrike update. The impact is estimated at $5.4 billion in damages. CrowdStrike published their root cause analysis and Microsoft performed their own analysis. Brendan Gregg shares his perspective in No More Blue Fridays.
Cloudflare - Cloudflare experienced a significant incident involving their 1.1.1.1 public DNS resolution service. Read the detailed incident report on the Cloudflare Blog.
GitHub Service Disruption - GitHub experienced a service disruption that impacted up to 50% of Actions workflow jobs. They provided a thorough post-mortem on their GitHub Status Page.
Hulu Outage - Hulu faced a major outage impacting streaming services due to a backend failure. Learn more about the incident on The Verge. An RCA was not made available.
Substack and iCloud Private Relay Outage - Substack reported a notable drop in newsletter open rates, linking the issue to an iCloud Private Relay outage. This incident highlights the complex dependencies in modern digital services. Read more on 9to5Mac.

Troubleshooting Blogs 📝

Sharpen your troubleshooting skills with these blogs:

Go Memory Profiling at CloudQuery - Engineers at CloudQuery share their journey in optimizing memory usage in Go applications, offering practical tips for developers facing similar challenges. Dive into their story on the CloudQuery Blog.
Kafka Metrics - WarpStream's latest post underscores the importance of measuring time rather than just counting messages in Kafka systems for better performance insights. This is an important read for Kafka users. Learn more on the WarpStream Blog.
Kubernetes Profiling with pprof - A quick exploration of profiling Kubernetes to identify and resolve performance bottlenecks. Check it out on Raesene's Blog.
Incident Analysis for Value Extraction - This blog explores how incident analysis can unlock value in system reliability and performance improvements. Read the full article on Surfing Complexity.
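
The Kafka Metrics item above contrasts counting messages with measuring time. As a rough illustration of that distinction (a sketch written for this digest, not code from the WarpStream post), the Python below computes both for a single partition. `Record`, `offset_lag`, and `time_lag` are hypothetical names, and `committed_offset` is assumed to mean the next offset the consumer will read.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int
    timestamp: float  # seconds since epoch when the record was produced


def offset_lag(records, committed_offset):
    """Count the records the consumer has not read yet."""
    return sum(1 for r in records if r.offset >= committed_offset)


def time_lag(records, committed_offset, now):
    """Age in seconds of the oldest unread record, or 0.0 if caught up."""
    pending = [r.timestamp for r in records if r.offset >= committed_offset]
    if not pending:
        return 0.0
    return max(0.0, now - min(pending))
```

Two partitions can each be two messages behind while one has been waiting five seconds and the other ten minutes. Only the time-based number shows which consumer is actually stuck.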

Architecture 📐👷‍♀️

Explore how teams are architecting for reliability:

Java 21 Virtual Threads - Netflix engineers discuss the impact of Java 21 virtual threads on application performance and thread management. This technical overview is critical for Java developers. Read more on the Netflix Tech Blog.
Unblocking ML Engineers through performance - Meta engineers reveal insights from their AI lab, focusing on the tools and techniques that unblock innovation. This behind-the-scenes look is available on the Meta Engineering Blog.
Re-architecting Slack - Slack's engineering team shares their approach to re-architecting their platform to better serve large customers, ensuring scalability and performance. Learn more on the Slack Engineering Blog.
Kubernetes 1.31 Updates - A sneak peek at the enhancements and new features coming in Kubernetes 1.31. Read the full update on the Kubernetes Blog.
Performance Analysis of Redis and Memcached - A comprehensive comparison of Redis and Memcached, analyzing their performance and scalability in various use cases. Read the in-depth analysis on DZone.

Tools 🛠️

Add to your toolkit:

Profiling with Ctrl+C - Yosef presents an unconventional method of using Ctrl+C for profiling applications, providing fresh insights into performance tuning. Explore this approach on Yosef's Blog.

Whether you're on call, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.

Follow our brand new account on X (formerly Twitter): @detect_sh

Did someone forward you this email? Join our mailing list so you'll be the first to know.

The detect.sh Community
