“Stability concerns amidst high change frequency is a new reality for us to accept and adapt to.” A suggested way improve production stability without sacrificing speed:
During the normal course
- Design & build for redundancy
- Build pipelines to release safely & for rollback
- Do failover testing to validate system’s ability to move operations to back-up systems during any kind of server failure.
During an incident
- Quickly review changes to isolate potential suspects
- Rollback. If you can’t rollback, push a new fix.
- If you can’t do that either, failover to a healthy copy.
After the incident
- Do thorough postmortems & create list of follow up actions needed
- Do a post-incident validation testing for your fixes in a mimicked failure scenario
Full post here, 11 mins read