Postmortem reports: How to get the most from failure for massive growth
Shit happens! Learn from your mistakes like a ninja
Check my Data Product Management playlist to read more about building successful data products.
Let’s face it: things will go sideways no matter what you do. Whether it’s a bug in your code, a complication from your sources, or unexpected behaviour from your teammates, you will occasionally have to fix new issues. It is even less enjoyable when something you thought was fixed pops up again. And it’s always during your lunch hour!
Don’t you hate that?
We have to face the challenges of failure and use the lessons we learn to become stronger. We shouldn't be afraid to fail, as it can be the first step in our journey to resilience. We have to use our mistakes to become more experienced and successful.
You’ve been doing that your entire life, haven’t you? What if I told you that you can do it even better?
By implementing a postmortem process in your work – or even in your personal life – you can gain invaluable insights from your mistakes and use them to make positive changes. This process can help you develop a deeper understanding of the situation. You can also use it to inform future decisions and behaviours. Not only can this help you to become a better problem solver, but it can also provide an invaluable opportunity for personal growth.
I’m going to tell you what a postmortem document is and how to write one in three easy steps. I’ll also share some of my favourite reports from over the years. I have also prepared a postmortem template, which you’ll find at the end of this article as a gift to you.
Definition
A postmortem is a document that describes the details of an incident. We usually write it after we resolve an outage. It is intended to provide an analysis of the incident, including its causes, the lessons learned, and potential action points.
You may wonder why you should write a document when you have already fixed the problem. Think about it:
Writing a postmortem demands concentration and focus to consider all necessary details. It offers the opportunity to analyse the incident in more depth and identify any issues that may be present in your processes and tools, enabling you to make your project more robust. By writing a postmortem document, you can ensure you are better prepared for similar events in the future and put in place the necessary changes to prevent them from occurring again.
But wait — there is more!
A postmortem report helps inform stakeholders about a particular incident, offering them a detailed description of the problem as it unfolded. This report provides a clear and concise account of the incident, which helps build trust between you and your stakeholders. That makes them feel confident in your team’s ability to handle similar situations. But be careful because with great trust comes great responsibility. 😉
Now, if you have never heard the term postmortem, it may sound a bit creepy. At least that was the case for me before I learned what it is and how it works. A postmortem is a valuable tool that teams use to document their successes and failures and identify ways to improve in the future. With this knowledge, postmortems can be seen in a completely different light: the light of shared knowledge and lessons learned.
Can I be completely frank with you? I prefer the term incident report, as postmortem still gives me the chills.
Does that make any sense to you? Would you like to learn how to write incident reports? Let’s dive in!
Document structure
Writing a postmortem is straightforward. All you need to do is compile three sections, making sure that each one communicates key elements of the process. Don’t be fooled, though. The process may be simple, but you should take it seriously and approach it with great care. You want to get all the details right and optimise the document for its readers.
Now, here’s the step-by-step process:
Summary
You usually start with an overview of what happened. Include the key facts, such as the affected dashboards, the number of affected records, the duration of the downtime, and any other relevant information. Keep this section concise, so that readers who are not interested in all the technical details can still understand what occurred.
In this section, you should also include a root cause analysis of the initial failure. To identify the incident's root cause, you’d typically use the "Five Whys" technique: you ask why the problem happened, then ask why again about each answer, until you reach a cause you can actually act on. This approach can help you uncover the underlying issue and get to the heart of the matter.
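To make the Five Whys more concrete, here is a minimal Python sketch. The incident, the questions and answers, and the names `WhyStep` and `root_cause` are all made up for illustration; treat it as one possible way to record the chain, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class WhyStep:
    question: str
    answer: str


def root_cause(chain: list[WhyStep]) -> str:
    """Return the answer to the last 'why', which we treat as the root cause."""
    if not chain:
        raise ValueError("ask at least one 'why' before naming a root cause")
    return chain[-1].answer


# A hypothetical incident: a sales dashboard showed stale numbers.
five_whys = [
    WhyStep("Why was the dashboard stale?", "The nightly load job failed."),
    WhyStep("Why did the job fail?", "A source file had a new column."),
    WhyStep("Why did a new column break the job?", "The loader expects a fixed schema."),
    WhyStep("Why did nobody notice the schema change?", "The source team did not announce it."),
    WhyStep("Why was it not announced?", "There is no agreed process for schema changes."),
]

# Print the chain as a numbered list, ready to paste into the summary.
for number, step in enumerate(five_whys, start=1):
    print(f"{number}. {step.question} {step.answer}")

print("Root cause:", root_cause(five_whys))
```

Notice that the chain ends at a missing process rather than at a person, which keeps the analysis focused on what you can change.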
Details
In this section, you provide an overview of the steps taken to diagnose and resolve the downtime. You’d usually use a timeline to structure this section, allowing you to break down the process into distinct stages. This helps to ensure that all the necessary steps are taken to identify and address the issue as quickly as possible. Additionally, employing a timeline aids you in identifying any potential areas of weakness in your troubleshooting process. This allows you to assess your approach and make any necessary changes to help you improve your response in the future.
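If you keep the timeline as structured data, you can also measure where the time went. The sketch below assumes a made-up incident; the timestamps, descriptions, and the helper `gaps_between_events` are hypothetical and only there to illustrate the idea.

```python
from datetime import datetime

# Hypothetical timeline entries: (timestamp, what happened).
timeline = [
    (datetime(2023, 3, 1, 2, 0), "Nightly load job fails"),
    (datetime(2023, 3, 1, 9, 30), "Stakeholder reports stale dashboard"),
    (datetime(2023, 3, 1, 10, 15), "Root cause identified"),
    (datetime(2023, 3, 1, 11, 45), "Fix deployed, data reloaded"),
]


def gaps_between_events(events):
    """Return (minutes since the previous event, description) for each event after the first."""
    ordered = sorted(events, key=lambda event: event[0])
    result = []
    for (previous_time, _), (time, description) in zip(ordered, ordered[1:]):
        minutes = int((time - previous_time).total_seconds() // 60)
        result.append((minutes, description))
    return result


gaps = gaps_between_events(timeline)
for minutes, description in gaps:
    print(f"+{minutes} min: {description}")

# The slowest step is a good candidate for the takeaways section.
slowest = max(gaps, key=lambda gap: gap[0])
print("Longest gap before:", slowest[1])
```

In this invented example, the longest gap sits between the failure and the first report, which would point at missing monitoring rather than slow troubleshooting.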
But wait, let me tell you something significant!
Do I have your attention?
Okay, now:
More often than not, you might be inclined to point a finger at whoever seems to be the source of a mistake or issue. However, you must not do that! It is vital to foster a blameless culture within the team and focus on what can be learned from any given situation rather than blaming one particular individual or group. That way, you’ll allow everyone to learn and grow. Mistakes can be an invaluable opportunity for growth as long as you address them correctly.
Actually, would you like to know a secret?
Usually, no one person is responsible for an incident. Even in the rare cases when just one individual is involved, you should dig a bit deeper and look for the actual reason. Instead of blaming people, you should make your project a space for creative minds to learn in a safe environment. Here are just a few questions you can ask: Should that person have had access to the system that failed? Did anybody review their work? Have you set up any monitoring and notifications?
Takeaways
You can't let your failures define you. You have to let your failures teach you.
― Barack Obama
That certainly holds for all aspects of life, including work. You can do your best to prevent failures from happening, yet they still occur, time and time again. It is not your failures that define you as a person and professional; what genuinely matters is how you use those failures to learn and grow. We should look into the past to build a better future. That is why I believe this section is the most important part of the postmortem document.
Sounds good? Let’s see how to put that into action.
In this section, document what went well and what went poorly. Aim for objectivity and focus on the facts. Analyse the incident and extract any lessons learned. This could include insights about your tools, communication issues, etc. Don't jump to solutions yet; consider what you learned from the incident. It is crucial to be honest and objective when reflecting on the downtime. Ask yourself questions such as: What processes could have been improved? Were there any tools or resources that could have been better utilised? Were there any communication barriers that could have been addressed? Consider the entire outage and make sure to evaluate it from all angles. This will help ensure you extract the most important lessons to improve the process for similar incidents.
Now that you have gained valuable knowledge and insights from the lessons you learned, it’s time to think about how to improve your project and prevent the same issues from occurring in the future. Write down a list of action points. You can do that in a simple text format or as Jira tickets. Just make sure they are linked in the postmortem document. Take your time. It is essential to consider any potential risks arising from the changes and how to mitigate them. Consider all angles, including any other projects you might have. Discuss timeframes with everyone involved and make sure your actions will make your world a better place.
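One way to keep action points honest is to check that each one has an owner, a due date, and a linked ticket before you publish the report. Here is a small Python sketch of that idea. The `ActionPoint` class, the `missing_details` helper, the names, and the ticket key `DATA-101` are all invented for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionPoint:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None  # e.g. a Jira key; the key used below is invented


def missing_details(action: ActionPoint) -> list[str]:
    """List which of owner, due date and ticket are still missing."""
    checks = {"owner": action.owner, "due date": action.due, "ticket": action.ticket}
    return [name for name, value in checks.items() if value is None]


actions = [
    ActionPoint("Alert on failed nightly loads", "Ana", date(2023, 3, 15), "DATA-101"),
    ActionPoint("Agree a schema-change process with the source team", owner="Ben"),
]

# Print a short status line per action point for the postmortem document.
for action in actions:
    gaps = missing_details(action)
    status = "ready" if not gaps else "missing " + ", ".join(gaps)
    print(f"- {action.title}: {status}")
```

An action point that is still missing a due date or a ticket is a prompt for the timeframe discussion, not a reason to drop it.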
That’s it: three sections. You can make them as long or short as you want. Of course, you can add more sections if you wish. There are no strict rules; what matters is that the document works for you, your team, and your stakeholders. Just be careful to get all the details right, avoid blaming people, and focus on learning.
Showcase
Okay, I know what you are thinking: That looks great. Can I see some examples? Yes, you can!
As promised at the beginning, I’m going to share some of the best incident reports I have read over the years:
Monzo deployed a faulty update, preventing customers from doing some basic operations
Cloudflare broke their DNS service because of wrong beliefs about time
GitLab accidentally removed production data from their primary database and was down for about one working day
These examples are not here to shame the companies for their outages. On the contrary, I’d like to praise them for the effort they made to write those documents and be transparent. Those reports are examples of dedication to providing the best possible service, strengthening relationships, and sharing knowledge. I’d strongly encourage you to learn something from them.
Recap
To summarise:
Incident reports provide a comprehensive overview, a root cause analysis, a timeline of the steps taken to resolve the downtime, and a list of action points to prevent similar issues. They are a fantastic way to learn from mistakes and share knowledge with your audience. It is vital to foster a blameless culture and focus on the lessons that can be learned from the situation rather than attempting to place blame on one particular individual or group.
We didn’t just discuss what a postmortem document is. We also discussed how to structure one and what to focus on, and we saw some excellent examples from a few well-known companies.
And that’s it!
Well, not really.
The Resource
As I promised, I have a small gift for you: a postmortem template. It contains some basic examples of how to use it. I am sure the template will be useful whether you are just starting your journey in structured incident reporting or already have some experience with it.
Don’t just take my word for it. If you are looking for a way to improve your learning, consider using postmortem documents.