“Stability concerns amidst high change frequency is a new reality for us to accept and adapt to.” A suggested way improve production stability without sacrificing speed:

During the normal course

  • Design & build for redundancy
  • Build pipelines to release safely & for rollback
  • Do failover testing to validate system’s ability to move operations to back-up systems during any kind of server failure.

During an incident

  • Quickly review changes to isolate potential suspects
  • Rollback. If you can’t rollback, push a new fix.
  • If you can’t do that either, failover to a healthy copy.

After the incident

  • Do thorough postmortems & create list of follow up actions needed
  • Do a post-incident validation testing for your fixes in a mimicked failure scenario

Full post here, 11 mins read