On-call duty: prepare your team for the unknown
In today’s world, the stability of IT infrastructure is becoming more and more critical. Companies must be able to respond to failures at any time of day. For developers, this means on-call duty. Of course, to make it work, the developer must have enough skill to find the root cause and resolve the problem. In this article, I’m going to share my advice on building those skills in our teams. It may be a great help, especially for less experienced developers, who often feel stressed during their on-call shifts.
Anatomy of failure
There are two types of failures: the ones that you already know, and the ones that you don’t. It’s quite easy to prepare for the first group. All you need to do is write down some emergency procedures and have people follow them. It’s the second group that scares less experienced developers. Imagine that your service crashes in a way that you have never seen before. How can you know what to do?
Unknown failures resemble testing in some way. In both cases, you face a situation in which the system does something it is not supposed to do, and your goal is to find the root cause and fix it. The difference is that during on-call duty, it is the production system that fails, and you work under much greater time pressure. Thanks to this similarity, people with testing skills are particularly good at solving this kind of puzzle. Fortunately, we can also compensate for the lack of such skills with training. I take my inspiration here from the aviation industry. Pilot candidates spend endless hours in simulators, practising responses to various emergency situations. The training teaches them the response patterns and how the aircraft behaves in various conditions. If an actual emergency happens, they can use this experience to make the right decisions.
Preparing the project for on-call duty
Before you start training the team, make sure that the project itself is prepared for on-call duty. The key thing is knowing the business purpose and business expectations. What does an outage cost the business? What are the side effects? Who depends on us? Which processes are critical and which are not? Based on this, you can define the severity levels, pick the right metrics to monitor them, and define the alert thresholds.
Example: you own a project on a critical path, where any outage means financial loss for the company. Part of the functionality is displaying a small promotional badge in a less exposed place. Imagine two situations:
- the user can’t complete the order due to a failure,
- the user can’t see the badge, but can complete the order.
It’s easy to see that the badge is far less important for the business (especially in the middle of the night, when the traffic is low). Incidents that affect order processing are far more severe. If they happen, you should fix them quickly. On the other hand, you probably don’t need to fix badges in the middle of the night.
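To make this concrete, here is a minimal sketch in Python of how severity levels and alert thresholds could be encoded. All metric names, thresholds, and severity labels below are illustrative assumptions, not taken from any real monitoring stack:

```python
from dataclasses import dataclass

# Hypothetical alert rule: metric names, thresholds, and severity labels
# are invented for illustration only.
@dataclass(frozen=True)
class AlertRule:
    metric: str        # metric name in the monitoring system
    threshold: float   # value above which the alert fires
    severity: str      # "critical" -> page the on-call now; "low" -> next business day

RULES = [
    # Orders are on the critical path: even a small failure rate pages the on-call.
    AlertRule(metric="order_failure_rate", threshold=0.01, severity="critical"),
    # The promotional badge is cosmetic: a ticket for the morning is enough.
    AlertRule(metric="badge_render_error_rate", threshold=0.20, severity="low"),
]

def evaluate(samples: dict) -> list:
    """Return (metric, severity) pairs for every rule whose threshold is exceeded."""
    return [
        (rule.metric, rule.severity)
        for rule in RULES
        if samples.get(rule.metric, 0.0) > rule.threshold
    ]

# A night-time sample: orders are fine, badges are broken.
alerts = evaluate({"order_failure_rate": 0.002, "badge_render_error_rate": 0.35})
print(alerts)  # [('badge_render_error_rate', 'low')] -- no need to wake anyone up
```

The point of the sketch is the mapping itself: once severity is attached to each metric, the alerting system (rather than a sleepy engineer) decides what deserves an immediate response.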
Building skills for on-call duty
The previous step was important for yet another reason. It’s hard to train anyone if you don’t know what to focus on. Do you want your team to fix all sorts of problems with promotional badges, yet fail to resolve a critical outage? Likely not. But once you have it in place, we can look at the advice for preparing your team for on-call duty.
Advice 1: knowledge sharing meetings
Unless you have worked together for a long time, different team members may have varying skill levels. I have observed that this contributes to stress among the less experienced developers. The first step is setting up regular knowledge-sharing meetings. You can use them for the following activities:
- explaining incident severity and business expectations (all the stuff presented earlier),
- analyzing past failures,
- discussing improvements in the alerting/monitoring systems,
- learning key technology skills.
By engaging everyone, the developers build confidence and an understanding of how the different pieces of the project work together. Also, discussing and implementing the metrics makes everyone feel part of it all. Technology training is also important: make sure that everyone understands your database technology, the key infrastructure components, and the concepts related to them.
Advice 2: write post-mortems and discuss failures
Whenever there is an incident during on-call duty, you can use the knowledge-sharing meetings to do the post-mortem analysis. Let the on-duty engineer describe the problem, the steps taken to resolve it, and the root cause. Even if it is a known issue, it’s worth looking at, because you may discover things that you missed before, or refine the severity level (“hmm… maybe we can have a better resolution for it?”).
Post-mortem analysis is a great tool for understanding previously unseen failures. It’s a retrospective where you look at your actions and check whether they guided you in the right direction. A good idea is to write the post-mortem down using a predefined template. It helps you ask the right questions about the incident and makes sure that you haven’t missed anything important. Eventually, the post-mortem should lead to refining your processes and procedures, and to planning improvement actions. For less experienced team members, post-mortem analysis is a great way to see how real incidents are solved. They can take inspiration from it.
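As an illustration, a minimal post-mortem template might look like the sketch below. The section names are only a suggestion; adapt them to your own process:

```
# Post-mortem: <incident title>, <date>

## Impact
Who and what was affected, for how long, and the estimated business cost.

## Timeline
- 02:14 - alert fired
- 02:20 - on-call engineer acknowledged
- ...

## Root cause
What actually broke, and why it was possible.

## What went well / what went badly
Which actions guided us in the right direction, and which did not.

## Action items
- [ ] action, owner, deadline
```

The timeline and the “what went well / badly” sections are the parts that matter most for training: they record the actual decision path, which is exactly what less experienced developers want to study.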
Advice 3: give the developers a real failure
The previous two pieces of advice were about building a theoretical background. However, the absolute best way to elevate skills is to give people actual failures. I mentioned previously that people with testing skills are particularly good at solving incidents, and that’s true. Manual regression testing can be painful, but it shows how the system actually works. It’s like the difference between reading about driving a locomotive and actually driving one.
But you can also do yet another thing. In most companies, teams have some DEV or staging environment. Let an experienced developer intentionally break such an environment. You can use the previous post-mortems for inspiration. Then, arrange the other developers into small groups and give them the system to find the root cause. Such an exercise teaches them how to debug the problem, how to combine data from different sources (metrics, logs), and how to make hypotheses and test them. The controlled environment removes the stress normally present during on-call duty. Instead, they can focus on the problem.
Advice 3A: Failure RPG
Sometimes, producing a convincing failure on a DEV system is not feasible due to external factors. One example is shared infrastructure, where your training can affect other teams. I faced such a situation a couple of times. To deal with it, I invented something I call the “failure RPG”. It takes its inspiration from role-playing games. You need a room, a computer with a big screen, and some free time. One of the developers becomes the “game master”. The game master describes the scenario, and the players walk through the monitoring systems, showing where they would look. The game master tells them what they see (e.g. “OK, on dashboard X you see a peak in the number of requests”). Then, the players try to figure out the next steps, until they find the root cause and/or resolution.
Of course, in the failure RPG the developers have to imagine certain things. However, the game preserves many aspects of training on a real failure. The developers still learn how to use the debugging tools, and the necessary steps and reactions. With the failure RPG, you can also practise incidents that are impossible to reproduce on a DEV/staging environment.
On-call duty requires certain skills, like every activity. It’s natural that less experienced developers may be scared of being left alone with an outage. But as we have seen, it’s possible to prepare for the unknown. The last thing you need to know is a final lesson, again from the airline industry. Very rarely, pilots make wrong decisions precisely because they follow their experience. The “unknown” can be absolutely anything. It can be so counter-intuitive that our procedures become an obstacle or even make things worse. This is something we can’t fully prevent. But we can reduce the risk to a negligible level.