There’s just been a production outage, a new project has been announced, a security vulnerability has been discovered, a team is nearing the end of a retrospective, a post-incident review is being run, a meeting is wrapping up. A call goes around the room for action items that will ensure either 100% certainty that the good thing succeeds, or 100% certainty that the bad thing never happens again.
Most software systems are complex sociotechnical systems. Changing the ‘socio’ is often seen as a more acceptable option than changing the ‘technical’.
Instead of fixing the memory leak, system operators will be trained to detect and remediate the leak. Instead of investing in automated tests and safe deployment methods, managers will be required to approve all changes before they go to production.
Fixing the underlying technical issue gives engineers the opportunity to simplify the system. Changing (or, more likely, attempting to change) the way people operate the system adds complexity and still leaves the underlying causes unfixed.
–
The first image is published by the Defense Acquisition University (DAU). DAU is “a corporate university of the United States Department of Defense offering acquisition, technology, and logistics (AT&L) training to military and Federal civilian staff and Federal contractors” with an annual budget of $220 million USD.
While browsing the DAU website I came across “Life Cycle Logistics—Key Tenets of Back-to-Basics” and, in it, a new-to-me phrase: “… [we shall do our work] at the speed of relevance”.