• Record the incident's impact, the steps taken to mitigate it, the lessons learned and the follow-up tasks identified.
  • Make the postmortem a team effort led by the on-call engineer, and compile a timeline of the incident.
  • Keep the conversation free of blame.
  • Ask how the mistake was allowed to happen, whether automation could have prevented it and whether wrong assumptions were behind it.
  • Check whether alerts for the problem came fast enough and whether they could be improved or added to.
  • Consider whether the initial assessment of an incident’s impact was accurate and whether the impact could have been worse than it was.

Full post here, 10-minute read