Detect #9: Reddit Outage, Debugging in Go, Lessons from Kubernetes Problem Detection, and Zero-Downtime Migrations ...
December 19, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to this month's edition of detect.sh! As always, we aim to share the most insightful content. We're brought to you by Prequel, the team bringing reliability intelligence to engineering teams.
And now, here's a digest of what happened last month in the world of problem detection.
Enjoy!
Podcasts 🎙️
Unlocking the Power of Modern CPUs to Build Resilient, High-Performance Applications, featuring Denis Bakhvalov: The author of Performance Analysis and Tuning on Modern CPUs shares insights into optimizing applications for modern hardware. Watch the full episode here and grab the 2nd edition of his book here.
Events 🗓️
Kelsey Hightower and Prequel hosted a Kubernetes Problem Detection Workshop at KubeCon 2024. Did you miss it? Prequel is hosting a hands-on application failure workshop next month in San Francisco. Sign up here.
AWS re:Invent 2024. December 2 - 6, 2024. Details here.
Notable Incidents 🔥
Reddit experiences a major outage: Users worldwide faced service disruptions as Reddit battled a backend infrastructure issue. (tomsguide)
Microsoft 365 downtime impacts millions: A significant service outage left users unable to access key productivity tools. (x)
Cloudflare logs lost during an incident: Cloudflare reported losing logs during a November 14 incident. (cloudflare)
Analysis of Cloudflare’s public outage report: This detailed analysis explores the technical and communication lessons learned from Cloudflare’s November incident. (surfingcomplexity)
Denmark suffers a nationwide telecoms outage: A software update caused a major telecom failure that impacted millions. (telecomstechnews)
Real Problem Detection & Troubleshooting Stories 📖
Mastering Golang debugging in Emacs: A practical guide to supercharging your debugging workflow using Emacs and Golang tools. (dornea)
The one-instruction bug: A deep dive into a subtle, elusive bug that came down to a single instruction. (nsrip)
Debugging third-party goroutine leaks: Learn step-by-step how to isolate and resolve memory and concurrency issues caused by third-party libraries in Golang applications. (leonidasv)
Debugging async mysteries in Axum: Discover techniques for solving asynchronous timing challenges in Axum, a popular Rust framework for building web servers. (baarse)
Too many ways to wait for a child process: An exploration of the complexities of managing subprocesses in distributed systems, with practical strategies for timeout handling. (gaultier)
Architecting for Reliability 📐👷‍♀️
Reliable message ordering in chat systems: Ably’s engineering team shares how they ensured reliable message delivery in distributed chat applications. (ably)
Smart caching for AWS news: How Luc van Donkersgoed built a lightning-fast caching system to optimize performance and minimize costs. (lucvandonkersgoed)
Cloudflare’s zero-downtime DNS migration: Cloudflare engineers detail how they migrated billions of DNS records without disrupting service. (cloudflare)
Designing zero-downtime migrations with data consistency: Mercari shares part one of their journey toward a zero-downtime migration strategy that prioritizes strong data integrity. (engineering.mercari)
The Karpenter effect: Smarter Kubernetes scheduling: Learn how Adevinta improved their Kubernetes operations by leveraging Karpenter for cost-effective resource management. (medium)
Choosing the right instance for cloud workloads: A comprehensive guide to selecting the best instance type for reliability and cost optimization. (reliabilityengineering)
Tools 🛠️
Kubernetes 1.32 upcoming changes: A preview of the updates in Kubernetes 1.32, including probe improvements and other enhancements. (kubernetes)
Whether you're on call, looking to optimize performance, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.