April 2024: The First Newsletter Dedicated to Problem Detection & Troubleshooting 🛠️
Community
May 28, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the first edition of the only newsletter focused on the art and science of problem detection in modern software applications. We created this newsletter to make it easier for engineers like us to stay informed. Each month we'll be covering a curated selection of news, insights, incidents, and tools. Excited to have you with us on this journey.
In this issue: a major hacking plot uncovered by performance analysis, an SREcon 2024 Americas recap, notable incidents at Notion and Cloudflare, a Linux tool roundup, and more.
This month’s newsletter is brought to you by the team at Prequel (prequel.dev), the company bringing detection engineering to reliability. Join their early access program to see how they help find and fix elusive issues.
And now, here is a digest of what happened this March in the world of problem detection:
News 📰
Last week, Andres Freund, a Postgres engineer at Microsoft, discovered one of the biggest and most widespread security risks of the last decade. While investigating a 500ms lag in his SSH connections, he uncovered a backdoor in the popular xz compression library. Andres’ tools of choice were GDB and Valgrind. If the backdoor had gone unnoticed, millions of Linux users would have been impacted. Way to go, Andres. This is a great example of why we need to remain curious. (lwn.net) (Openwall.com) (thenewstack)
Engineers at Allegro dive into how they optimized Kafka performance by addressing tail latency issues. Through the use of Kafka protocol sniffing and eBPF technology for dynamic tracing, the team identified and mitigated performance bottlenecks associated with file system writes and lock contention. (blog post)
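For readers newer to the term, tail latency is about the slowest few percent of requests rather than the average. The sketch below is not Allegro's method; it is a minimal illustration, using made-up sample values and a hypothetical helper name, of how a mean can look healthy while the 99th percentile does not.

```python
import statistics


def latency_summary(samples_ms):
    """Return the mean, median (p50), and p99 of latency samples in milliseconds."""
    ordered = sorted(samples_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest sample with at least p% of
        # all samples at or below it. Integer math avoids float rounding.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Hypothetical data: 98 fast requests and 2 very slow ones.
samples = [5.0] * 98 + [900.0] * 2
summary = latency_summary(samples)
print(summary)
```

With these invented numbers the median stays at 5 ms while the p99 lands on one of the slow requests, which is why tail percentiles are usually tracked alongside averages.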
The EnterpriseDB team describes a detailed process for identifying and troubleshooting memory leaks within Postgres C code. The author shares their journey, initially trying traditional tools like Valgrind and gcc/clang sanitizers without success, and then discovering the utility of the memleak program from the bcc tools collection, which helped identify the source of the leak. The post provides insights on challenges and strategies in managing memory in complex systems like Postgres. (EnterpriseDB)
Notable Incidents 🔥
Cloudflare identified a billing system issue lasting over a week that is still affecting customers' ability to manage payment methods, billing addresses, and subscription plans. This also includes challenges in paying invoices and incorrect unpaid invoice warnings. The incident, reported on March 21st, remains unresolved as of the publishing of this newsletter. (Cloudflare Status)
Notion experienced three significant outages recently. On March 25th, users encountered errors editing, saving, and signing in. Another incident on March 22nd lasted almost 2 hours and involved issues with saving pages and edits. For 8 hours between March 20th and March 21st, an incident involving Notion Calendar caused users to experience reduced performance and errors. “Guess I'll take a half-day…” one user noted. Unfortunately, external post mortems were not published for these incidents. (Notion Status) (Notion Status) (Notion Status)
Tailscale suffered an outage on March 7th, lasting approximately 90 minutes, that was caused by an expired TLS certificate. The issue affected access to Tailscale's documentation, install scripts, and marketing material for most users. This comment from the community sums it up: “I've said it before and I'll say it again: expiring certs are the new DNS for outages.” (Tailscale)
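Expiring certificates are one of the easier problems to detect ahead of time. The sketch below is not how Tailscale monitors its certificates; it is a minimal example using only Python's standard library, and the function names and the 30-day threshold are our own assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the 'notAfter' string of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between `now` (default: current UTC time) and a notAfter string."""
    now = now or datetime.now(timezone.utc)
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires_at - now.timestamp()) / 86400


def needs_renewal(not_after, threshold_days=30, now=None):
    """True when the certificate expires within threshold_days (assumed value)."""
    return days_until_expiry(not_after, now) < threshold_days
```

A scheduled job could call needs_renewal(fetch_not_after("example.com")) and alert on True. Note that the default context verifies the certificate, so a certificate that has already expired raises ssl.SSLCertVerificationError during the handshake; this check is meant as an early warning before that point.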
Recent Events 🗓️
SREcon 2024 Americas: In March, a number of us had the opportunity to attend this year's conference in San Francisco, along with 6000 other SREs. Niall Murphy, co-author of the defining site reliability engineering (SRE) book and self-proclaimed "instigator", kicked off with a 20-year look back on SRE. Other keynotes covered the future of observability (2.0), the relationship between security and observability, and engineering burnout. From a problem detection perspective, there were several directly relevant breakout sessions:
“Autopsy of a Cascading Outage from a MySQL Crashing Bug” by Jean-François Gagné, Aiven, and Swetha Narayanaswamy, HubSpot
“Logs Told Us It Was Kernel – It Wasn't” by Valery Sigalov, Bloomberg
“What Is Incident Severity, but a Lie Agreed Upon?” by Em Ruppe, Jeli.io/PagerDuty
“Cross-System Interaction Failures: Don't Fail through the Cracks” by Tianyin Xu and Xudong Sun, University of Illinois Urbana–Champaign
These talks emphasized the complexity of detecting and troubleshooting failures in modern applications. (Full SREcon24 program guide)
Tools 🛠️
Linux Crisis Tools: Brendan Gregg, one of our favorite engineers and authors, covers a suite of tools designed for crisis management and performance analysis in Linux environments. In a two-for-one, he also provides his take on the eBPF documentary and settles the debate about what eBPF stands for, if anything. TLDR: It’s no longer an acronym. (brendangregg.com) (brendangregg.com)
MemLeak.py: A memory leak analysis tool mentioned in the blog post by EnterpriseDB. (github.com)
As always, we’re open to your feedback and suggestions. Whether you're troubleshooting an issue, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.
Follow our brand new account on X (fka Twitter): @detect_sh
Join our mailing list so you'll be the first to know.