Detect #8: Troubleshooting at Netflix; Incidents at Google and Mailchimp; Debugging Go; Lessons from Early YouTube SRE; Adidas' Platform Engineering Journey...
December 19, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to this edition of detect.sh! As always, we aim to bring you insightful and practical takes on reliability. Below, you'll find stories, tools, and incidents that illuminate the world of problem detection. We're brought to you by Prequel, the team bringing reliability intelligence to engineering teams.
Enjoy!
In this issue: Troubleshooting at Netflix; Incidents at Google and Mailchimp; Debugging Go; Lessons from Early YouTube SRE; Adidas' Platform Engineering Journey...
Upcoming Events 🗓️
Unlocking the Power of Modern CPUs to Build Resilient, High-Performance Applications (November 19): Join our live detect podcast with Denis Bakhvalov, author of Performance Analysis and Tuning on Modern CPUs. (Register)
KubeCon + CloudNativeCon 2024 (November 12–15): Join fellow engineers in Salt Lake City. Several tracks focus on reliability, and Prequel will be hosting a private workshop with Kelsey Hightower. (Register)
And now, here's a digest of what happened last month in the world of problem detection.
Real Problem Detection & Troubleshooting Stories 📖
Sharpen your technical skills with these deep dives.
Netflix's Workbench UI Latency Issue: A deep dive into troubleshooting a latency issue in Netflix's Workbench UI. The investigation offers clear insights into isolating and understanding latency problems in complex systems. (Netflix)
Slack: We're All Just Looking for Connection: Slack shares their investigation into connection issues, providing a detailed breakdown of tools and methodologies used to restore reliability. (Slack)
Repairing Databases on Mobile Devices: An approach to repairing databases on mobile devices that dives into the reliability of applications and self-healing in unreliable environments. (Ashishb)
Notable Incidents 🔥
Stay informed and learn from them.
Cloudflare OVHCloud Outage: On October 30th, OVHCloud faced a major outage impacting several services. Cloudflare shares their perspective and provides key takeaways on handling such disruptions. (Cloudflare)
12-hr Google Cloud Region Outage in Germany: A significant 12-hour outage hit Google Cloud's region in Germany, causing disruptions. (Google) (DataCenterDynamics)
Mailchimp Service Degradation: Mailchimp experienced intermittent availability problems caused by backend issues. (Mailchimp)
Fly.io Unable to Reach Postgres Instances: Users struggled to reach their PostgreSQL instances, shedding light on potential underlying network and configuration challenges. (Fly.io)
Architecting for Reliability 📐👷‍♀️
Explore how other teams are architecting their applications to reach new heights.
Improving Platform Resilience at Cloudflare: How Cloudflare has approached platform resilience, focusing on removing single points of failure. (Cloudflare)
Lessons from Early YouTube SRE: Insights from early YouTube SRE experience, shared by a tech lead, emphasizing architectural improvements relevant to modern SRE challenges. (SRE Path)
Avoiding Outage with Kubernetes IP Exhaustion: Adevinta’s post describes how they avoided an outage caused by running out of IPs in EKS, and the measures they took to prevent it in the future. (Adevinta)
Tools 🛠️
Stay sharp with the latest tech.
Scalene: A Python profiler that aims to make debugging smoother by offering better insight into CPU, memory, and GPU usage. (GitHub)
Debugging Go Core Dumps with Delve: Michael Stapelberg introduces a new way to work with Go core dumps and shares the practicalities of debugging using Delve. (Michael Stapelberg)
Debugging C++ in Positron: A walkthrough on setting up a development environment and debugging C++ applications in Positron, focusing on efficient workflows. (Tyler)
DNS Trace: Monitor DNS Queries: This Reddit thread covers dnstrace—a tool to monitor DNS queries by host processes and users, potentially useful for gaining insights into DNS reliability. (Reddit)
Caveman Debugging with Live Templates: Embrace simplicity in debugging. This post shows how live templates can speed up debugging by generating standard code snippets. (TheApache64)
Whether you're on call, looking to optimize performance, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.
Follow our brand-new account on X (fka Twitter): @detect_sh
Join our mailing list so you'll be the first to know.