June 2024: 🔥 Virtual Meetup with Xata; 🚨 Incidents at Microsoft, Apple, GitHub; 🛠️ Learn How to Survive Massive Traffic Surges... and more
September 3, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the latest edition of the only newsletter focused on the art and science of problem detection and troubleshooting. In this issue: 🔥 Virtual Meetup Tomorrow (6/3); 🚨 Incidents at Microsoft, Apple, GitHub; 🛠️ Learn How to Survive Massive Traffic Surges... and more.
This month’s newsletter is brought to you by the team at Prequel (prequel.dev), the company bringing detection engineering to reliability. Join their early access program to see how they help teams overcome alert overload and manual troubleshooting.
And now, here is a digest of what happened last month in the world of problem detection and troubleshooting.
The hacker news effect in action:
Upcoming Events 🗓️ 🔥
The "Hacker News effect" is real. How do you survive being on the front page of HN repeatedly? Come find out. Detect.sh will be hosting our first community meetup on June 4th at 11 ET, featuring Xata CTO Tudor Golubenco, who previously founded PacketBeat (acquired by Elastic). Come join our community of SREs and software engineers for a closed-door discussion. Register here.
Community Articles 🧠
Don't miss the latest content written by our community:
Deterministic vs. Probabilistic Problem Detection: A comparison of deterministic and probabilistic approaches to problem detection. (detect.sh)
Improve Your Operational Review Meetings: Tips for enhancing the effectiveness of your operational review meetings. (detect.sh)
Blogs 📝
Here are some noteworthy blog posts you should check out:
The Error Term Isn't Pareto Distributed: An analysis of error distribution in complex systems. (surfingcomplexity.blog)
Benchmarking Go Error Handling: "Sentinel errors and errors.Is() slow your code down by 500%". (dolthub)
Engineering for Slow Internet: Insights into engineering practices for slow internet environments. (brr.fyi)
Magic: The Gathering and Incident Response: Lessons applied to incident response. (hross.substack)
Psychological Safety in Incident Response: What can you do if you feel like your team isn’t as psychologically safe as it needs to be to respond to incidents effectively? (pagerduty)
3 PostgreSQL Mistakes That Will Cause Outages: Common mistakes in PostgreSQL that can lead to outages. (stepchange.work)
Memory Leak Issues: An exploration of memory leak issues and how to address them. (stevenharman.net)
Notable Incidents 🔥
It was a very busy month of incidents. Here are some of the ones that caught our attention:
Microsoft Bing Outage: Microsoft Bing experienced a significant outage affecting its search and AI services, including Copilot and integrations with DuckDuckGo and ChatGPT. (theverge)
DNS Glitch Affecting Internet Stability: A DNS glitch that threatened internet stability was resolved, though the root cause remains unclear. (arstechnica)
Google Cloud Outage: Google Cloud accidentally deleted a customer account, resulting in two weeks of downtime. (arstechnica) Official response from Google: 👉 (cloud.google.com)
Apple iMessage Outage: Apple iMessage suffered a disruption, impacting users' ability to send and receive messages. (axios)
Heroku: Heroku faced an incident that affected several customer applications. (status.heroku.com)
GitHub Incident: GitHub experienced an outage affecting code repositories and developer workflows. (githubstatus)
Hugging Face Incident: Hugging Face reported an incident that disrupted their hub services. (huggingface)
jsDelivr Outage: jsDelivr experienced an outage affecting the delivery of web assets. (jsdelivr)
Tools 🛠️
CNCF Project Kepler: Explore the latest in energy-efficient computing with Kepler, a project under the CNCF that aims to improve energy efficiency in Kubernetes clusters. (cncf.io)
As always, we’re open to your feedback and suggestions. Whether you're troubleshooting an issue, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.
Follow our brand-new account on X (formerly Twitter): @detect_sh
Did you find us on the web? Join our mailing list so you'll be the first to know.