Wednesday, April 20, 2016

What Just Happened? Let's Talk Post-Mortems

Today one of my favourite "celebrity" Product Managers, Ken Norton, published a quick and dirty article on Bring the Donuts called Blameless Post Mortems. This spurred a thought process for me on the importance of review.

Review Meeting


There are two important times to look back, and they call for different approaches.

  1. When there was a clear service failure, it's important to figure out what happened, and how to prevent it from happening again.
  2. When finishing a long-running (3-5 month) process or project, it's important to extract the learnings and apply them to future projects.
In Ken's Post Mortem document, the former is what is discussed. When something breaks, it's important to conduct a review, in a blameless fashion, and figure out how to prevent it from happening again.

Service Failure Post-mortems

Often when there is a service outage, a security hole or some other notable problem in a service, it's important to review it in hopes of preventing a recurrence. With the right process in place, repeat outages can often be prevented, although in some cases the outage may not be preventable, or not immediately preventable (e.g. it may take weeks or months to re-write part of a service). However, in many cases, simple process or diligence can increase service reliability: for example, code reviews with senior developers, or security audits combined with least-privilege access for operational folks.

Ken points to a template that Google uses for this purpose. This is essentially the same template that is used at Microsoft as well, so it's definitely a good source. The idea is to have a group of experts on the specific issue fill out the template, track the bugs found through to closure (code fixes or process fixes), and share the knowledge with everyone on the team to ensure it doesn't happen again. Then move on, carrying what was learned into the next outage; rinse and repeat.

Shipping Post-mortems

The second type of post mortem usually happens at the end of shipping something substantial or, if the team is continuously shipping, every 3-6 months. It makes sense to connect a post mortem to something substantial: a new multi-sprint feature out the door, a retrospective on trying out a new tool, or the 3-5 month anniversary of a team being together are good examples. Preferably the retrospective is not so far from the start of the project that the early bumps can no longer be remembered, and not so close that the team hasn't hit cadence or shipped something.

This critical post mortem helps discover what's working for a team, or where a product manager and/or leadership should focus energy on change.

Retrospective
Sticky Notes are your Friend
This type of post mortem is run differently than a service-outage post mortem, and will likely take more time to complete. A good rule of thumb is that the moderator, often a product or project manager, should set aside about one and a half to two hours, while the rest of the team needs an hour for the meeting plus some time for brainstorming beforehand. Here is how to run this type of post mortem:

Preparation

Brainstorming takes time: for some it takes a short time, and for others it takes longer. For maximum meeting efficiency, it's best not to do this in the post mortem meeting itself. When the 1-hour post mortem is scheduled, the moderator should ask each person attending to come up with 1-3 things that went well and 1-3 things that could be improved, plus, if possible, a suggestion or two on how to improve each. All of this brainstorming should be completed before the meeting starts.

Execution of the Post Mortem

The execution is where sticky notes are excellent. Most of the meeting will be run by moving sticky notes around, grouping them and positioning them in priority order. In a worst case or a remote case, a whiteboard works as well (e.g. remotely accessible meetings can be run by screen sharing a digital whiteboard or a shared GDoc/Evernote file; it's probably most efficient to have a typist who is separate from the moderator).

Here is the suggested timeline of the meeting:
  • 5 mins - Introduction and adding prepared 1-3 pros and cons to separate sticky notes
  • 10 mins - Team sorting of all the sticky notes on a wall/whiteboard and grouping similar items together, prioritizing the team top 1-3 pros and cons
  • 15 mins - Discuss what went well (the pros), and why we should continue doing those top 1-3 items as agreed by the team
  • 30 mins - Discuss the top 1-3 items that could have gone better, what can be done to address them, assign action items to people to track resolution
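As a quick sanity check on the timeline above, here is a minimal Python sketch that adds up the four blocks and shows where each one should end. The names AGENDA and checkpoints are made up for illustration; only the durations come from the list above.

```python
from itertools import accumulate

# (minutes, agenda item), taken from the suggested timeline
AGENDA = [
    (5, "Introduction and writing up sticky notes"),
    (10, "Sorting, grouping and prioritizing"),
    (15, "What went well"),
    (30, "What could have gone better, plus action items"),
]


def checkpoints(agenda):
    """Return (minute the item should end by, item) for each agenda entry."""
    ends = accumulate(minutes for minutes, _ in agenda)
    return [(end, title) for end, (_, title) in zip(ends, agenda)]


schedule = checkpoints(AGENDA)
total_minutes = schedule[-1][0]

for end, title in schedule:
    print(f"by minute {end:2d}: {title}")
print(f"total: {total_minutes} minutes")
```

The running totals give the moderator a simple clock to watch: if the wall isn't sorted by the 15-minute mark, the discussion time at the end is what gets squeezed.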
Let's break each of those timeline items down:

The first 5 minutes of the meeting is just paperwork. Members write down their top 1-3 areas that went well and top 1-3 areas for improvement, each on its own sticky note. There is no need to write down how to improve these at this time. Each member of the team now has up to 6 sticky notes. In some cases, members come with more or fewer; you can choose to include the extras or not, it just makes for more paper and longer discussions. What has worked well is to really focus on the team top 1-3 items for this post mortem, then at a future post mortem (3-5 months down the road), tackle the next level. This is a reasonable amount of change for the team to handle in that period of time.

The next 10 minutes of the meeting is about review. Each team member reads out their stickies and hands them to the moderator. The moderator is responsible for grouping them together and stack ranking them so that, by the end of the 10 minutes, the group has collectively agreed upon the "team" top 1-3 things that went well and the "team" top 1-3 things that need improvement. 15 minutes have passed and we're now set up for success for the rest of the meeting. This process also leaves a handful of items that are worth noting but aren't considered for active change as an outcome of this meeting.
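For remote teams working from a shared doc instead of a wall, the grouping and stack ranking can be roughed out in a few lines of Python. This is only a sketch under assumptions: the theme labels stand in for the moderator's grouping, and ranking by how many notes landed in each group is one simple way to get to a team top 3, not a replacement for the team's discussion. All names and data below are hypothetical.

```python
from collections import Counter


def rank_notes(notes, limit=3):
    """Split grouped notes into the team's top themes and the parked rest.

    notes is a list of (author, theme) pairs, where theme is the group
    the moderator placed that sticky note in.
    """
    counts = Counter(theme for _, theme in notes)
    ranked = counts.most_common()  # highest count first
    return ranked[:limit], [theme for theme, _ in ranked[limit:]]


# Hypothetical "needs improvement" stickies after grouping
improvement_notes = [
    ("Ana", "bug tracker"),
    ("Ben", "bug tracker"),
    ("Cy", "lost decisions"),
    ("Ana", "lost decisions"),
    ("Dee", "bug tracker"),
    ("Ben", "slow builds"),
    ("Cy", "meeting overload"),
]

team_top, parked = rank_notes(improvement_notes)
print("Team top items:", team_top)
print("Noted, not acted on this round:", parked)
```

The parked list corresponds to the handful of items that get noted but are left for a future post mortem.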

The next 15 minutes are focused on the things that went well, or the successes in the project. While everyone likes to celebrate successes, it's ultimately done to ensure that when we are correcting for what didn't go well, we don't lose sight of the things that actually made this project great! They are captured for historical purposes as well. Was the communication tool the best part of the project? Was a less-than-2-hour code review process beneficial to the speed the team ran at? Was having a spec-let a good thing? Let's do the good things again. This section may even take less time, which is great, because the heart of the meeting is next.

The final 30 minutes are focused on the top issues that need addressing; this is the meat of the discussion. The team collectively looks at the "team" top 1-3 and actively discusses ways to improve during the next project. An item like "The bug tracking software was a bear to work with!" might result in a project manager being assigned to review and recommend new software packages to track bugs. Another example, "Our communication software lost track of the decisions we made, forcing us to re-make decisions a month later", might result in a process to document decisions outside of chat tools once they are made. Lots of good ideas will come from people who didn't even have this as an issue, which is why it's optional for the person suggesting the area for improvement to actually come up with a solution. Coming up with the solution is a team effort.

Lastly, as part of that last 30 minutes, it's likely that most if not all of the key items cannot be finalized in the 1-hour meeting. It's unrealistic to think you can pick new bug tracking software in these 30 minutes, for example. As such, for each change we need to assign a BOL (aka a Butt On the Line) to drive that issue to closure. That person owns the research, the delivery and the communication of status for that item. Depending on the organization, the BOL may have the authority to make a change or may simply make a recommendation; either way, communication is warranted.

When the execution phase is wrapping up, the moderator will want to capture the notes: either a photo of the sticky notes or a good note taker's notes are fine.

Follow-up

After the post mortem is done, the information needs to be shared. A good moderator will take the notes of the top 1-3 items that went well and the top 1-3 items that need improvement, and write up how the improvements are planned to be implemented and who the BOL is for each. As the BOLs get resolution, they can even follow up on the original communication with the changes. The notes should be titled with the project and include the date range covered and who attended the meeting. Then they jump into what went well, then what could have gone better and how the team suggests improving it, with the BOL assigned to each item.
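If it helps to keep the write-up consistent from one post mortem to the next, the format just described can be captured in a small template. The Python below is a hedged sketch: the class names, field names and sample data are all invented for illustration, and the layout is simply one reading of the format above.

```python
from dataclasses import dataclass


@dataclass
class Improvement:
    issue: str
    plan: str
    bol: str  # the person driving this item to closure


@dataclass
class PostMortemNotes:
    project: str
    date_range: str
    attendees: list
    went_well: list
    improvements: list

    def render(self):
        """Produce short plain-text notes ready to paste into a wiki or email."""
        lines = [
            f"Post mortem: {self.project}",
            f"Period covered: {self.date_range}",
            f"Attendees: {', '.join(self.attendees)}",
            "",
            "What went well:",
        ]
        lines += [f"  - {item}" for item in self.went_well]
        lines += ["", "What could have gone better:"]
        lines += [
            f"  - {i.issue}. Plan: {i.plan}. BOL: {i.bol}"
            for i in self.improvements
        ]
        return "\n".join(lines)


# Hypothetical example data
notes = PostMortemNotes(
    project="Checkout redesign",
    date_range="Jan 2016 to Apr 2016",
    attendees=["Ana", "Ben", "Cy"],
    went_well=["Fast code reviews", "Having a spec-let"],
    improvements=[
        Improvement("Bug tracker was hard to use", "Review alternatives", "Ben"),
    ],
)
text = notes.render()
print(text)
```

Updating an item once its BOL reaches resolution and re-sending the rendered notes is one way to handle the follow-up communication.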

Keeping the notes as short and crisp as possible is important, since the moderator should share the learnings outside of the immediate team in places such as a company Wiki, a broader email to a group of leads or in a specific chat tool. The key is to match how information is spread at the company.

---

This post mortem method really helps the team celebrate the wins of working together as a team, and helps the team focus on improving together, where everyone has a voice. It's also a great indicator for the leadership team of the health of the core of their business and where leadership should spend effort. That effort might include replacing core technologies if they aren't working, finding key talent to improve knowledge, or just ensuring the team is staffed appropriately.

As a parting note, while it seems there is a lot of effort that goes into a post mortem, it's important to learn from both mistakes and successes, so that the mistakes are not repeated and the successes are. The hour or two invested here often saves hours in the next project, and can help retain your talent. Frustrated talent looks for less frustrating circumstances.