May 2024: New debugging tools, our first live community event, behind the scenes at Cloudflare, and more 🛠️🚨
July 3, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the latest edition of the only newsletter focused on the art and science of problem detection and troubleshooting. In this issue: our first live community event, Cloudflare's approach to autonomous diagnostics, notable incidents at Braze and Honeycomb, new Kubernetes and Node.js debugging tools, and more.
This month’s newsletter is brought to you by the team at Prequel (prequel.dev), the company bringing detection engineering to reliability. Join their early access program to see how they help find and fix elusive issues.
And now, here is a digest of what happened this April in the world of problem detection and troubleshooting:
Upcoming Events 🗓️
Detect.sh will host our first community meetup on June 4th at 11 ET, featuring Xata CTO Tudor Golubenco, who previously founded PacketBeat (acquired by Elastic). Come join our community of SREs and software engineers for a closed-door discussion. Register here.
Blogs 📝
Goroutine Leaks: A deep dive into diagnosing memory leaks in Go applications caused by goroutine mismanagement, offering insights on tools and methodologies to detect issues earlier. (brainbaking)
Minimizing On-call Burnout Through Alert Observability: Get an inside look at how Cloudflare approaches alert management at scale. (cloudflare)
Garden Path Incidents: An exploration of the cognitive biases that affect incident response and how vocalizing hypotheses can prevent missteps. (danslimmon)
Autonomous Hardware Diagnostics at Cloudflare: Issues managing hardware failure at scale? Take a look at how one hyperscaler tackles hardware diagnostics and recovery. (cloudflare)
Upgrading Kubernetes: An engineering team's journey through a rapid upgrade of Kubernetes from version 1.11 to 1.18, highlighting challenges and approach. (wetransfer)
Notable Incidents 🔥
Honeycomb Service Interruption: High-level postmortem of a recent incident that impacted Honeycomb's services, including the root causes and the measures taken to prevent future occurrences. Incident details. (honeycomb)
Braze Outage Explained: Braze discusses in detail their April 29 outage, explaining the technical failures and operational responses to restore services. (braze)
Podcasts / Videos 🎙️
Logs Told Us It Was Kernel – It Wasn't: Can you trust your logs? In this presentation from SRECon 2024, Valery Sigalov discusses the investigative techniques he used to pinpoint the true source of a performance bottleneck. (usenix/youtube)
Tools 🛠️
Universal Profiling Agent: Elastic announces the release of their open-source Universal Profiling Agent, aimed at providing deeper insights into application performance across various environments. (elastic)
KWatch: A Kubernetes tool that helps you detect and diagnose issues by watching cluster events. (github)
Bun Report: Bun introduces a new crash reporter aimed at providing more comprehensive diagnostics for Node.js applications. (bun.sh)
As always, we’re open to your feedback and suggestions. Whether you're troubleshooting an issue, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.
Follow our brand-new account on X (fka Twitter): @detect_sh