A reference for readers familiar with the DORA metrics. This is a mixture of information from the DORA reports, external authors, and my own experience in applying the DORA metrics across a large technology organisation.
If you’re new to the DORA metrics then first read State of the DORA Metrics 2022.
The State of DevOps Reports found that high performing organisations showed strong results in this set of metrics:
- Delivery Lead Time (DLT) - How long it takes for code, once written, to be deployed to production.
- Deployment Frequency (DF) - How frequently is software deployed? Used as a proxy for batch size.
- Reliability - Does software meet availability, latency, performance, and scalability targets set by the team?
- Mean Time to Restore (MTTR) - After a failure in production how quickly is service restored?
- Change Failure Rate (CFR) - What percentage of deployments to production fail?
Of these I believe DLT, DF, and Reliability metrics have value, and MTTR and CFR should be discarded.
The promise of the DORA metrics is that they reveal organisational performance. Indeed they may, but from my experience I propose a different frame: DORA metrics reflect the environment in which teams operate; they are indicators of system health.
Delivery Lead Time
Definition: The time from code being committed to it running in production.
- Use alongside Deployment Frequency as an indicator of the speed of software delivery.
- What does the Delivery Lead Time reveal about Software Delivery?: A combination of Engineering Practices and Project/Technology Capabilities.
- What is a ‘good’ Delivery Lead Time? A median of 24 hours, which indicates that code written today typically gets deployed to production by the end of tomorrow.
- How can teams use Delivery Lead Time to improve Software Delivery?
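As a minimal sketch of the definition above, the median Delivery Lead Time can be computed from deployment records, measuring each deployment from its earliest commit. The record shape, the field names `deployed_at` and `commits`, and the timestamps are all hypothetical; a real pipeline would pull them from version control and the deployment tooling.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records: each deployment lists the timestamps of
# the commits it shipped and the time it reached production.
deployments = [
    {
        "deployed_at": datetime(2023, 1, 10, 15, 0),
        "commits": [datetime(2023, 1, 9, 16, 0), datetime(2023, 1, 10, 9, 0)],
    },
    {
        "deployed_at": datetime(2023, 1, 11, 11, 0),
        "commits": [datetime(2023, 1, 10, 17, 0)],
    },
    {
        "deployed_at": datetime(2023, 1, 13, 10, 0),
        "commits": [datetime(2023, 1, 11, 10, 0), datetime(2023, 1, 12, 14, 0)],
    },
]


def lead_time(deployment: dict) -> timedelta:
    """Time from the earliest commit in a deployment to production."""
    return deployment["deployed_at"] - min(deployment["commits"])


def median_lead_time(records: list[dict]) -> timedelta:
    """Median lead time across all deployments."""
    return median(lead_time(d) for d in records)


print(median_lead_time(deployments))
```

The median is used rather than the mean so that one unusually slow deployment does not dominate the figure.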
Deployment Frequency
Definition: How often a system is deployed to production.
- Introduction to the Deployment Frequency metric.
- How can teams use Deployment Frequency to Improve Software Delivery? From my experience building greenfield systems and modernising legacy systems.
- Time Since Last Deployment: Another way of looking at Deployment Frequency, for large organisations that run thousands of services.
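Both views of the metric can be derived from the same list of production deployment timestamps. The following is a rough sketch under assumed data: the timestamps, the reporting window, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical production deployment timestamps for one service.
deploy_times = [
    datetime(2023, 1, 2, 10, 0),
    datetime(2023, 1, 2, 16, 0),
    datetime(2023, 1, 4, 11, 0),
    datetime(2023, 1, 9, 9, 0),
]


def deployments_per_day(times, window_start, window_end):
    """Average deployments per day inside [window_start, window_end)."""
    in_window = [t for t in times if window_start <= t < window_end]
    days = (window_end - window_start) / timedelta(days=1)
    return len(in_window) / days


def time_since_last_deployment(times, now):
    """Elapsed time between the most recent deployment and `now`."""
    return now - max(times)


now = datetime(2023, 1, 12, 9, 0)
df = deployments_per_day(deploy_times, datetime(2023, 1, 2), datetime(2023, 1, 12))
age = time_since_last_deployment(deploy_times, now)
print(df, age)
```

Time Since Last Deployment needs only the latest timestamp per service, which may make it cheaper to collect across thousands of services.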
Beware: Deployment Frequency is an activity metric. Our goal isn’t to blindly increase the number of deployments (make the metric go up), but for teams to have the ability to deploy their software at will. Consider two systems which both have a DF of three times per day. The first is an API system developed by two teams which can be deployed at will. The second is a monolithic codebase that fifteen teams work in. Deployments are a bottleneck and need to be booked weeks in advance. The DF metric alone does not tell the full story.
I created A Simple Model to think about Deployment Frequency:
Deployment Frequency = Activity * Engineering Practices * Deployment Aspects
Accelerate writes that Deployment Frequency was chosen as a proxy for Batch Size.
However, in software, batch size is hard to measure and communicate across contexts as there is no visible inventory. Therefore, we settled on deployment frequency as a proxy for batch size since it is easy to measure and typically has low variability.
Batch Size may have an effect on Deployment Frequency, but I don’t think Deployment Frequency is a good proxy for it. Deployment Frequency is largely determined by the speed and ease of deployment. Incidentally, Batch Size is also captured in our DLT metric, because we measure from the time of the first commit within a given deployment. Deployment Frequency is easy to measure and I think it’s a good metric when coupled with Delivery Lead Time.
Mean Time To Restore
Definition: How long it generally takes to restore service when a service incident or a defect that impacts users occurs (eg. unplanned outage or service impairment).
Years of use in the industry have revealed that MTTR should not be used as a metric. Use Reliability metrics as an indicator of system quality.
Štěpán Davidovič of Google published the excellent paper Incident Metrics in SRE. In this he explores the metric and demonstrates that due to the high variance in the incidents any statistical average would be misleading. He concludes that MTTR “does not provide any useful insights into the trends in your incident response practices”. From the paper:
The problem is not specific to the metric being an arithmetic mean; I’ve demonstrated the same problem with median and other metrics. It is a consequence of the typically low volume of incidents and high variance of their durations. This distribution has been observed on practical data sets from three anonymous companies, as well as the obfuscated data set from Google.
Furthermore:
Improvements in incident management processes or tooling changes cannot have their success or failure evaluated on MTTx. The variance makes it difficult to distinguish any such improvement, and the metric might worsen despite the promised improvement materializing.
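The variance problem can be illustrated with a small simulation. This is an illustrative sketch only, not the method or data from the paper: it draws synthetic incident durations from a log-normal distribution (an assumption), repeatedly splits them at random into two groups, and compares the MTTR of each group. Nothing changes between the groups, so any gap between them is noise.

```python
import random
from statistics import mean


def simulate_split_differences(durations, trials=1000, seed=0):
    """Randomly split incidents in two and record the relative MTTR gap.

    Both groups come from the same data, so any gap is pure noise.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        shuffled = durations[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        mttr_a = mean(shuffled[:half])
        mttr_b = mean(shuffled[half:])
        # Positive gap: group B looks "better" (lower MTTR) than group A.
        gaps.append((mttr_a - mttr_b) / mttr_a)
    return gaps


# Synthetic incident durations in minutes (an assumption, not real data):
# a log-normal distribution gives many short incidents and a few long ones.
data_rng = random.Random(42)
durations = [data_rng.lognormvariate(4.0, 1.2) for _ in range(60)]

gaps = simulate_split_differences(durations)
# Share of random splits that look like an MTTR improvement of 10% or more.
false_improvements = sum(1 for g in gaps if g >= 0.10) / len(gaps)
print(f"Apparent >=10% improvement by chance: {false_improvements:.0%}")
```

If a meaningful share of random splits shows an apparent improvement, then a similar drop in MTTR after a real process change cannot be told apart from chance.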
The 2022 Void Report shows that:
Duration and Severity Aren’t Related: We found that duration and severity are not correlated—companies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between.
- For service recovery where a code change is required, Delivery Lead Time can be viewed as a Minimum Time to Restore.
- Implementing a system to automatically measure MTTR is difficult. Incident coordination groups (ie. Slack channels spun up after a fault has been detected) have a clear start and stop time, however these times don’t align with the duration of the actual fault. eg. the fault may exist for two days before it’s detected and an incident is started, and the incident channel may be kept open much longer to monitor the situation.
- Without automation, we would rely on manual entry of such incident data. This cannot be entered consistently across large organisations, and the resulting dataset would be both expensive to gather and incomplete.
Change Failure Rate
Definition: The percentage of changes to production or released to users that result in degraded service (eg. lead to service impairment or service outage) and subsequently require remediation (eg. require a hotfix, rollback, fix forward, patch).
I do not recommend using CFR as a metric. Use Reliability metrics as an indicator of system quality.
- I believe CFR has similar issues to MTTR in that there is a huge variation in failures. Some failures may be acceptable while others have large impact.
- Failures include API endpoints returning errors, an unacceptable increase in latency, unusable UX, etc. Some of these cannot practically be detected by automation. Defining failures too narrowly may make CFR measurable, but would reduce the usefulness of the metric.
- CFR excludes non-change related failures, of which there are many in the complex systems we maintain today.
- Some organisations define CFR to mean the percentage of times a given deployment pipeline fails. This may be useful information, but is separate from CFR.
Reliability
Definition: The degree to which a team can keep promises and assertions about the software they operate.
- I interpret these as Service Level Objectives (SLOs). Teams set Service Level Indicators (SLIs) specific to their service.
- SLOs measure customer experience of the system - the customer doesn’t care whether an outage is change related. Behind SLOs are a number of SLIs. SLIs are quantitative, objective measures of availability, latency, and error rate. They are measured automatically - typically through platforms such as New Relic.
- SLOs and SLIs aren’t perfect, but they are current industry best practice.
- DORA frames ‘good’ as teams meeting their goals, not in absolute number terms (eg. a service meeting its SLO of 99.9% availability is treated the same as a different service meeting its goal of 90% availability). It’s valuable for the organisation to review the absolute numbers to ensure a level of quality is aimed for.
- Google SRE Books are the reference for SLOs, SLIs, and other Reliability topics.
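To make the relationship between SLIs and SLOs concrete, here is a minimal sketch of an availability SLO check. The class, the service names, and the request counts are hypothetical; in practice the counts would come from a monitoring platform rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySlo:
    """An availability SLO over a window of requests (hypothetical shape)."""

    service: str
    target: float        # SLO target, e.g. 0.999 for 99.9%
    good_requests: int   # requests served successfully in the window
    total_requests: int  # all valid requests in the window

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of good requests."""
        return self.good_requests / self.total_requests

    @property
    def met(self) -> bool:
        """True when the measured SLI meets or exceeds the target."""
        return self.sli >= self.target


checkout = AvailabilitySlo(
    "checkout", target=0.999, good_requests=999_500, total_requests=1_000_000
)
reports = AvailabilitySlo(
    "reports", target=0.90, good_requests=930_000, total_requests=1_000_000
)

for slo in (checkout, reports):
    print(f"{slo.service}: SLI={slo.sli:.4f} target={slo.target} met={slo.met}")
```

Both services meet their own goal, so both count as ‘good’ in DORA terms, yet their absolute availability differs considerably. That gap is what an organisation-level review of the absolute numbers would surface.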
SLOs & SLIs can be designed to encapsulate aspects of MTTR and CFR and serve similar functions:
- They start the same conversations. If some SLIs are degraded a team will investigate them and see what contributed to it. Contributing factors may include slow recovery times, more change failures than usual, or something else entirely.
- SLOs incentivise the outcome we want. By using SLOs as the metric, teams optimise for service reliability and incident response. With CFR as the measure, the change failure rate may be reduced at the cost of other failures.
The State of DevOps Report 2018 (published at the end of the same year Accelerate was published) added a fifth metric, Availability. In the State of DevOps Report 2021 this was re-framed as Reliability.
References
- Interview and transcript: Abi Noda interviews Nathan Harvey, DORA lead at Google. 2023-01.
- How to Misuse & Abuse DORA Metrics by Bryan Finster. 2022-07.
- Incident Metrics in SRE - Critically Evaluating MTTR and Friends by Štěpán Davidovič. 2021-03.
- 2022 Void Report - Incident metrics.
- MTTR is a Misleading Metric—Now What? by Courtney Nash
- SRECon Talk - Tales from the VOID: The Scary Truth about Incident Metrics by Courtney Nash
- Why you Shouldn’t Count Production Incidents by Rick Branson.
- Google SRE Books - Read more about SLOs, SLIs, and other Reliability and Incident Management topics.