- Record the incident's impact, the steps taken to mitigate it, what was learned from it and the follow-up tasks identified.
- Make the postmortem a team effort led by the on-call engineer, and compile a timeline of the incident.
- Keep the conversation free of blame.
- Ask how the mistake was able to happen, whether automation could have prevented it and whether incorrect assumptions were behind it.
- Check whether alerts for the problem came fast enough and whether they could be improved or added to.
- Consider whether the initial assessment of an incident’s impact was accurate and whether the impact could have been worse than it was.
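
The checklist above could be captured as a small record type so that no section gets skipped. This is only a minimal sketch: the class names, field names and the `missing_sections` / `time_to_detect` helpers are hypothetical and not taken from the original post.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime
    note: str


@dataclass
class Postmortem:
    # All field names are illustrative assumptions, not a standard schema.
    title: str
    impact: str
    started_at: datetime       # when the problem actually began
    first_alert_at: datetime   # when the first alert fired
    mitigation_steps: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)
    timeline: list[TimelineEvent] = field(default_factory=list)

    def add_event(self, at: datetime, note: str) -> None:
        """Add a timeline entry and keep the timeline in chronological order."""
        self.timeline.append(TimelineEvent(at, note))
        self.timeline.sort(key=lambda e: e.at)

    def time_to_detect(self) -> timedelta:
        """Gap between the start of the problem and the first alert."""
        return self.first_alert_at - self.started_at

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        sections = {
            "impact": self.impact,
            "mitigation_steps": self.mitigation_steps,
            "learnings": self.learnings,
            "follow_ups": self.follow_ups,
            "timeline": self.timeline,
        }
        return [name for name, value in sections.items() if not value]
```

A team could refuse to close a postmortem while `missing_sections()` is non-empty, and use `time_to_detect()` as a starting point for the question of whether alerts came fast enough.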
Full post here, 10-minute read