Measuring Service Failure, or why not to use CFR and MTTR

Change Failure Rate (CFR) and Mean Time to Recovery (MTTR) are two engineering metrics popularised in Accelerate (2018) and the subsequent DevOps Research and Assessment (DORA) State of DevOps Reports. Since publication, the DORA metrics have been assessed and widely adopted by technology companies across the world.

Experts in the industry have found fault with CFR and MTTR. While both are powerful frames for thinking about service reliability, the data the metrics produce, and any change in that data over time, is statistically meaningless. Štěpán Davidovič, who wrote the Google Cloud report Incident Metrics in SRE, analyses MTTR against Google’s own incident data and concludes that it “does not provide any useful insights into the trends in your incident response practices”. Furthermore:

Improvements in incident management processes or tooling changes cannot have their success or failure evaluated on MTTx. The variance makes it difficult to distinguish any such improvement, and the metric might worsen despite the promised improvement materialising
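Davidovič’s point is easiest to see with a toy simulation. The sketch below is not his methodology; it simply assumes heavy-tailed (lognormal) incident durations and shows that monthly MTTR swings widely even when nothing about incident response has changed.

```python
# Purely illustrative: simulate two years of incidents whose durations follow a
# heavy-tailed lognormal distribution, with no real change in incident response
# between the years, and compare the monthly MTTR figures.
import random
import statistics

random.seed(42)

def simulate_month(n_incidents: int) -> list[float]:
    # Durations in minutes; lognormal gives the long tail typical of incident data.
    return [random.lognormvariate(mu=3.5, sigma=1.2) for _ in range(n_incidents)]

year_one = [statistics.mean(simulate_month(20)) for _ in range(12)]
year_two = [statistics.mean(simulate_month(20)) for _ in range(12)]

print(f"Year 1 monthly MTTR: min={min(year_one):.0f}, max={max(year_one):.0f} minutes")
print(f"Year 2 monthly MTTR: min={min(year_two):.0f}, max={max(year_two):.0f} minutes")
# Despite identical underlying behaviour, monthly MTTR varies widely, so a rise
# or fall in the metric says little about incident response practices.
```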

The VOID Report (2022), a respected publication in the SRE space, corroborates Davidovič’s findings and offers this commentary:

Duration and Severity Aren’t Related: We found that duration and severity are not correlated—companies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between

Change Failure Rate suffers from a similar problem to MTTR: not all change failures are equal. A change failure that takes down a critical API is not comparable to one that disables a button on a screen.

To measure CFR at scale, product delivery teams need to introduce new automation to detect such failures. Continuing the example above, building automation to detect a broken button (or other failures, significant or trivial) might reasonably be categorised as toil. Failure data would need to be collected automatically across every system in the company. Definitions of what constitutes a change, a failure, and the time window in which a failure is attributed to a change must be established, and then every product team that owns a service must implement the checks.
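To make that cost concrete, here is a minimal sketch of the bookkeeping CFR demands, using a hypothetical data model: every service has to emit change and failure events, and the attribution window is a definitional choice that changes the resulting number.

```python
# Hypothetical data model for CFR bookkeeping; not a standard or a real API.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    service: str
    deployed_at: datetime

@dataclass
class Failure:
    service: str
    detected_at: datetime

def change_failure_rate(changes: list[Change], failures: list[Failure],
                        window: timedelta) -> float:
    """Fraction of changes with a failure detected on the same service within
    `window` of deployment. Different window choices give different rates
    for the same underlying events."""
    failed = 0
    for change in changes:
        if any(f.service == change.service
               and change.deployed_at <= f.detected_at <= change.deployed_at + window
               for f in failures):
            failed += 1
    return failed / len(changes) if changes else 0.0
```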

Software companies run complex systems, and change failures are only a subset of the failures those systems experience. When reasoning about reliability we should consider our systems as a whole, i.e. instrument and report metrics for all failures, not just change failures.

Measure Service Failure with SLOs

The duration of a service failure and the time to recover are captured as part of well-defined Service Level Objectives and Service Level Indicators (SLOs and SLIs). SLOs provide a rich framework for defining, reasoning about, and learning from failure and recovery. CFR and MTTR are blunt instruments: even when measured “correctly”, they give only a glimpse into a subset of the operational state during a change failure or recovery, which is insufficient for understanding complex systems. Instead, we should use SLOs to paint a more complete picture of service reliability.
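As a rough illustration, assuming a simple availability SLO (the targets and window below are illustrative, not prescriptive), an error budget frames an outage’s duration in terms of the promise made to customers rather than as a free-floating duration statistic.

```python
# Illustrative error-budget arithmetic for an availability SLO.
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Total allowed downtime for the window, e.g. 99.9% over 28 days."""
    return window_days * 24 * 60 * (1 - slo_target)

def budget_consumed(outage_minutes: float, slo_target: float,
                    window_days: int = 28) -> float:
    """Fraction of the error budget a single outage burns."""
    return outage_minutes / error_budget_minutes(slo_target, window_days)

# A 30-minute outage against a 99.9% SLO burns roughly 74% of a 28-day budget;
# the same 30 minutes against a 99% SLO burns only about 7%. The SLO puts the
# duration in the context of the commitment made to customers.
print(f"{budget_consumed(30, 0.999):.0%}")  # ~74%
print(f"{budget_consumed(30, 0.99):.0%}")   # ~7%
```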

SLOs allow teams to define failure in a way that is appropriate to their use case. A full service outage is an obvious failure, but a team that runs a public API may extend the definition to include a percentage of requests taking longer than n seconds to respond. If the API is unacceptably slow for our customers, we need to count that as a failure (a sketch of this latency-based definition follows the quotes below). Reliability, measured through SLOs, is a DORA metric in its own right and has been evolving since the 2018 DORA Report. The 2021 DORA Report has this to say:

“[SLO metric] represents operational performance and is a measure of modern operational practices. The primary metric for operational performance is reliability, which is the degree to which a team can keep promises and assertions about the software they operate.”

“Historically we have measured availability rather than reliability, but because availability is a specific focus of reliability engineering, we’ve expanded our measure to reliability so that availability, latency, performance, and scalability are more broadly represented. Specifically, we asked respondents to rate their ability to meet or exceed their reliability targets. We found that teams with varying degrees of delivery performance see better outcomes when they also prioritise operational performance.”
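Returning to the public API example above, a latency-based SLI might look something like the following sketch; the threshold and target are assumptions chosen for illustration.

```python
# Illustrative latency SLI: the proportion of requests answered within a
# threshold, compared against an SLO target for that proportion.
def latency_sli(request_latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not request_latencies_ms:
        return 1.0
    fast = sum(1 for latency in request_latencies_ms if latency <= threshold_ms)
    return fast / len(request_latencies_ms)

# Example: a 300 ms threshold with a 99% target. If the measured SLI drops
# below the target, the service counts as failing even though it is "up".
latencies = [120, 180, 250, 800, 90, 2400, 150, 210, 330, 110]
sli = latency_sli(latencies, threshold_ms=300)
print(f"SLI={sli:.0%}, meets SLO: {sli >= 0.99}")  # SLI=70%, meets SLO: False
```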

Fewer, Stronger Signals

DORA frames its metrics as follows: Delivery Lead Time (DLT) and Deployment Frequency (DF) measure the speed and frequency at which teams deliver software, while CFR and MTTR measure quality. In the systems I implement, I measure DLT, DF, and SLOs. Together these three metrics give insight into the speed of development, the frequency of deployment, and the reliability of the systems we provide to our customers.
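For reference, DLT and DF are cheap to compute from deployment records. A minimal sketch, assuming a hypothetical record that carries commit and deploy timestamps:

```python
# Hypothetical deployment record; field names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime   # first commit in the change
    deployed_at: datetime    # when the change landed in production

def delivery_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median commit-to-production time across deployments."""
    return median(d.deployed_at - d.committed_at for d in deploys)

def deployment_frequency(deploys: list[Deployment], period_days: int) -> float:
    """Average deployments per day over the reporting period."""
    return len(deploys) / period_days
```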

These metrics are only as good as the data used to generate them. SLOs and SLIs must be comprehensive: they should span customer journeys across common indicators such as availability, latency, and error rate. This requires expertise in the customer experience of your product, and the definitions should be revisited every few months.
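As a sketch of what “comprehensive” might look like in practice, with hypothetical journeys and targets: each customer journey gets its own availability, latency, and error-rate objectives rather than a single service-wide number.

```python
# Hypothetical per-journey SLO targets; names and numbers are illustrative.
SLOS = {
    "search": {
        "availability": 0.999,    # successful responses / all responses
        "latency_p95_ms": 400,    # 95th percentile response time
        "error_rate_max": 0.001,  # failed responses / all responses
    },
    "checkout": {
        "availability": 0.9995,
        "latency_p95_ms": 800,
        "error_rate_max": 0.0005,
    },
}
```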

MTTR and CFR seem valuable on the face of it, but mature technology organisations have repeatedly found them difficult and costly to measure, and analyses by Google Cloud and others show that an increase or decrease in either metric is not statistically significant and offers little insight or benefit. SLOs, by contrast, measure the customer experience and provide rich information about the reliability of our systems.

Further Reading

DORA Metrics Reference summarises my reading of the following resources:

