In the embedded video, we examine the Haddon Matrix, a framework from the injury prevention literature. We then apply it to how a software team prevents and deals with an outage.
The Haddon Matrix
To minimize the risk of bad outcomes, experts in injury prevention rely on a framework called the Haddon Matrix. This framework provides a way to think about accidents by focusing on three periods of time: pre-event, event, and post-event.
The Haddon Matrix is also useful for thinking about preventing, handling, and recovering from other kinds of situations that aren’t necessarily life-threatening. For example, a software team that owns a web service may want to think through the kinds of problems that could take their service offline and what they’d do about them.
Pre-Event
Think about the pre-event. What are some reasons the site could have gone down and what could we have done to prevent them?
- The team may have pushed a bad code change.
- The physical hardware the service was running on may have failed.
- The system may have become overloaded by unanticipated demand.
We could prevent most bad code changes through:
- code review or pairing
- automated tests
- automated deployment (CI/CD)
- monitoring with automatic rollback
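To make the last item concrete, here is a minimal sketch of the decision an automatic rollback could hinge on: compare the new release's error rate with the previous release's and roll back when it is clearly worse. The names (DeployHealth, should_roll_back) and the thresholds are hypothetical, chosen only for illustration; a real system would take its numbers from whatever monitoring tool the team uses.

```python
from dataclasses import dataclass


@dataclass
class DeployHealth:
    """Request and error counts observed for one release (hypothetical shape)."""
    requests: int
    errors: int


def should_roll_back(baseline: DeployHealth, candidate: DeployHealth,
                     max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Return True when the candidate release's error rate is clearly worse.

    max_ratio and min_requests are illustrative thresholds, not recommendations.
    """
    if candidate.requests < min_requests:
        return False  # not enough traffic yet to judge the new release
    baseline_rate = baseline.errors / max(baseline.requests, 1)
    candidate_rate = candidate.errors / candidate.requests
    # Floor the baseline rate so a spotless baseline does not turn a
    # single error into a rollback.
    return candidate_rate > max(baseline_rate, 0.001) * max_ratio
```

With a baseline of 5 errors in 1,000 requests, a new release showing 25 errors in 500 requests would trigger a rollback, while one showing 4 errors in 500 requests would not.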
We could prevent physical hardware failures from affecting customers through redundancy, for example by keeping a second host in a different region with an automatic fast failover mechanism.
We could avoid failing under greater load by buying more hardware resources than we need or, better yet, by using elastic cloud computing resources. We could also load test our software before we release it.
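As a rough illustration of failover, the sketch below picks the first healthy host from a priority-ordered list. The function name pick_host, the host names, and the health check are all assumptions made for the example; in practice this logic usually lives in a load balancer or DNS service rather than in application code.

```python
from typing import Callable, Optional, Sequence


def pick_host(hosts: Sequence[str],
              is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy host in priority order, or None if all are down."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None


# Hypothetical hosts: the primary first, then a standby in another region.
HOSTS = ["primary.us-east.example", "standby.eu-west.example"]

# Simulate the primary being down; a real check would probe each host.
down = {"primary.us-east.example"}
active = pick_host(HOSTS, lambda host: host not in down)
```

Here active ends up as the standby host because the primary is marked as down.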
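The arithmetic behind elastic capacity can be sketched in a few lines: given the current load and what one instance can handle, work out how many instances to run, with some headroom to spare. The function name, the 25% headroom, and the minimum of two instances are illustrative assumptions, not recommendations.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     headroom: float = 0.25,
                     minimum: int = 2) -> int:
    """Instances required to serve the load with spare headroom.

    headroom and minimum are illustrative defaults.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Inflate the load by the headroom, then round up to whole instances.
    target = requests_per_second * (1 + headroom) / capacity_per_instance
    return max(minimum, math.ceil(target))
```

Load testing before release is what gives you a trustworthy value for capacity_per_instance in the first place.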
The Event
Think about the event. What could we do in the moment to minimize the damage? In the case of a site outage due to a bad code change, we could make it easy to do a manual rollback. For hardware failure, as in the prevention case, we could have a second system to manually fail over to in the moment. In the face of demand exceeding capacity, having a way to scale up resources on demand would help.
In all three cases, having monitoring and logging that helps you find the cause of the outage is critical. So is having runbooks or standard operating procedures for dealing with these kinds of outages calmly, without putting things in an even worse state. We would also need a proper on-call rotation with well-trained operations people.
Post-Event
Think about the post-event. This may be about having a mechanism in place to ensure the outage doesn’t repeat, such as holding a post-mortem as part of your process and acting on the lessons that come out of it. Thinking about post-event scenarios may also help focus on the cost to the company of losing customer trust and potentially having to compensate customers for the downtime. An honest assessment of the costs and risks may help with prioritizing some event and pre-event activities, like investing in redundant hardware, monitoring, or a continuous delivery system.
Benefits of The Haddon Matrix
The beauty of the Haddon Matrix is that it helps you think of environmental solutions rather than relying exclusively on carrots and sticks. In this exercise, we created a robust plan against site outages without thinking about the team’s Riders and Elephants. Instead, we simply tweaked the environment to make the negative outcome less likely.
I would love to hear what you thought of the video, so feel free to comment below, on The K Guy Twitter, or on The K Guy Facebook fan page.