The role of Site Reliability Engineer (SRE) has never been more critical as cloud applications increasingly power every aspect of our lives. While the job title may vary across organizations, we work tirelessly to ensure the stability and performance of software systems. However, the methods and models we employ to detect and troubleshoot issues are usually stuck in the heads of our most experienced team members and performed on an ad hoc basis.
Whenever I get an X-ray or MRI, I’m always blown away by what the specialist can decipher from the monochrome, grainy image. Unfortunately, problem detection in our world is not that different: in 2024, we are still humans staring at dashboards to find problems.
It's time for problem detection, a core function we perform daily, to evolve to meet the needs of the modern world. Every engineering team I meet with agrees with this point. However, these same engineers often find themselves stuck, lacking a clear path to assess, benchmark, and enhance their problem detection capabilities.
Enter the Detect Maturity Model (DMM), a framework designed to guide organizations in evolving their problem detection processes.
If you’re not familiar with problem detection, or how it is specifically defined, I highly suggest you read the What is Problem Detection blog post before continuing with this one.
Drawing inspiration from the Capability Maturity Model Integration (CMMI) and other frameworks, the DMM was developed by the Detect.sh community. It is structured to provide a roadmap for organizations seeking to advance their problem detection capabilities. CMMI, renowned for its structured levels of process maturity, offers a solid foundation upon which the DMM builds, but we've tailored its approach to the unique demands of problem detection within software applications.
The DMM categorizes problem detection capabilities into five distinct levels, numbered 0 through 4, each representing a step forward in maturity. (We start counting at zero, of course.)
0. Initial (Ad Hoc): At this stage, problem detection is reactive, and processes are unstructured and chaotic. Organizations operate in firefighting mode, spinning up problem detection activities solely to support troubleshooting of customer-impacting issues.
Evidence: Customer complaints or service level objective (SLO) violations are the primary methods of knowing something is wrong. There is no material proactive problem detection.
1. Managed: Level 1 introduces basic processes for proactive problem detection. Though still largely reactive, engineers begin to run a set of detection approaches in advance of incidents.
Evidence: Your most experienced engineers use tribal knowledge to proactively look for a set of problems by leveraging one or more tools and homegrown scripts. There is no visibility into how or when this is performed.
2. Defined (Standardized): By this stage, organizations have developed standardized problem detection processes. There's a shift towards scaling problem detection, with teams utilizing tuned detection tools and alerts to preemptively identify issues.
Evidence: You have a wide set of documented problem detection processes that are executed on a routine basis to proactively uncover failure.
3. Quantitatively Managed (Measured): The focus here is on metrics and measurement. The output of the detection process itself is measured to manage the health of the function.
Evidence: The majority of detections are automated and run on a periodic basis with appropriate metrics being captured. Stakeholders receive automated reports relevant to their roles.
4. Optimizing (Continuous Improvement): At the highest level of the DMM, organizations continuously refine and improve their problem detection methods.
Evidence: The engineering organization has a problem detection roadmap and is deliberately covering the most relevant use cases over time.
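As a rough illustration of how a team might apply the levels and their evidence in a self-assessment, here is a minimal Python sketch. The names `DMMLevel`, `Evidence`, and `assess` are hypothetical and not part of the DMM itself. The sketch also assumes the levels are cumulative (a team cannot be at level 3 without the evidence for levels 1 and 2), which the model implies but does not state outright.

```python
from dataclasses import dataclass
from enum import IntEnum


class DMMLevel(IntEnum):
    """The five DMM levels, numbered from zero as in the model."""
    INITIAL = 0
    MANAGED = 1
    DEFINED = 2
    QUANTITATIVELY_MANAGED = 3
    OPTIMIZING = 4


@dataclass
class Evidence:
    """Hypothetical yes/no observations, one per level above Initial."""
    proactive_checks_exist: bool = False            # level 1 evidence
    processes_documented_and_routine: bool = False  # level 2 evidence
    detections_automated_with_metrics: bool = False # level 3 evidence
    roadmap_drives_coverage: bool = False           # level 4 evidence


def assess(evidence: Evidence) -> DMMLevel:
    """Return the highest level whose evidence, and all evidence below it, is present.

    Assumption: levels are cumulative, so a gap at a lower level caps the result.
    """
    checks = [
        (DMMLevel.MANAGED, evidence.proactive_checks_exist),
        (DMMLevel.DEFINED, evidence.processes_documented_and_routine),
        (DMMLevel.QUANTITATIVELY_MANAGED, evidence.detections_automated_with_metrics),
        (DMMLevel.OPTIMIZING, evidence.roadmap_drives_coverage),
    ]
    level = DMMLevel.INITIAL
    for candidate, satisfied in checks:
        if not satisfied:
            break  # stop at the first missing piece of evidence
        level = candidate
    return level


if __name__ == "__main__":
    team = Evidence(
        proactive_checks_exist=True,
        processes_documented_and_routine=True,
    )
    result = assess(team)
    print(f"Level {int(result)}: {result.name}")
```

A team with automated, measured detections but no documented processes would still be assessed at level 1 under this sketch, which reflects the cumulative assumption rather than anything the DMM prescribes.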
The Detect.sh problem detection community believes the journey towards maturity is not one to be undertaken alone. The collective wisdom of a community plays a pivotal role in fostering growth and innovation. A community of problem detection engineers can share best practices, tools, and experiences, offering support and inspiration to both newcomers and veterans in the field. By participating in such a community, organizations can accelerate their progress through the DMM levels, overcoming common pitfalls and learning from the success stories of their peers.
Communities like Detect.sh serve as a catalyst for advancement, encouraging experimentation and the sharing of novel solutions. As SREs engage with one another, discussing challenges and brainstorming solutions, the collective knowledge base expands. This collaboration not only propels individual organizations forward but also pushes the entire field of problem detection towards new heights.
The introduction of the Detect Maturity Model marks a significant milestone in the evolution of site reliability engineering. Until now, teams have not had a model to assess their problem detection maturity.
By providing a structured framework for assessing and enhancing problem detection capabilities, DMM offers organizations a defined path to achieving greater reliability and efficiency. However, the journey through the levels of DMM is not just about adopting new processes or technologies; it's about embracing a culture of continuous improvement.
As we look to the future, Detect.sh has an important role to play in advancing the field of problem detection. No other community exists with this mission. Through shared knowledge, experiences, and a commitment to innovation, we are unlocking the full potential of SREs and their organizations. As we embark on this journey together, the DMM supports our community vision of a future where problem detection is not just an activity, but a clearly defined program within site reliability engineering.