May 2024: New debugging tools, our first live community event, behind the scenes at Cloudflare, and more 🛠️🚨




July 3, 2024

An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.





Upcoming Events 🗓️

Detect.sh will host our first community meetup on June 4th at 11 ET, featuring Xata CTO Tudor Golubenco. Tudor previously founded PacketBeat, which was acquired by Elastic. Come join our community of SREs and software engineers for a closed-door discussion. Register here.



Blogs 📝

Goroutine Leaks: A deep dive into diagnosing memory leaks in Go applications caused by goroutine mismanagement, offering insights on tools and methodologies to detect issues earlier. (brainbaking)
Minimizing On-call Burnout Through Alert Observability: Get an inside look at how Cloudflare approaches alert management at scale. (cloudflare)
Garden Path Incidents: An exploration of the cognitive biases that affect incident response and how vocalizing hypotheses can prevent missteps. (danslimmon)
Autonomous Hardware Diagnostics at Cloudflare: Issues managing hardware failure at scale? Take a look at how one hyperscaler tackles hardware diagnostics and recovery. (cloudflare)
Upgrading Kubernetes: An engineering team's journey through a rapid upgrade of Kubernetes from version 1.11 to 1.18, highlighting challenges and approach. (wetransfer)

Notable Incidents 🔥

Honeycomb Service Interruption: High-level postmortem of a recent incident that impacted Honeycomb's services, including the root causes and the measures taken to prevent future occurrences. Incident details. (honeycomb)
Braze Outage Explained: Braze discusses in detail their April 29 outage, explaining the technical failures and operational responses to restore services. (braze)

Podcasts / Videos 🎙️

Logs Told Us It Was Kernel – It Wasn't: Can you trust your logs? In this presentation from SREcon 2024, Valery Sigalov discusses the investigative techniques he used to pinpoint the true source of a performance bottleneck. (usenix/youtube)

Tools 🛠️

Universal Profiling Agent: Elastic announces the release of their open-source Universal Profiling Agent, aimed at providing deeper insights into application performance across various environments. (elastic)
KWatch: A Kubernetes tool that helps you detect and diagnose issues by watching cluster events. (github)
Bun Report: Bun introduces a new crash reporter aimed at providing more comprehensive diagnostics for crashes in the Bun runtime. (bun.sh)

As always, we’re open to your feedback and suggestions. Whether you're troubleshooting an issue, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.

Follow our brand-new account on X (fka Twitter): @detect_sh

The detect.sh Community
