Lessons learned while working on large-scale server software

  • Always have a plan for worst-case error conditions, and prefer a general solution, such as automatically shutting down all operations and returning an error code that says when to retry or whom to contact.
  • Document your decision making and add idempotence wherever possible.
  • Approach debugging scientifically: first gather data, then form a hypothesis and design an experiment to test it, and only then apply your fix. Use tools that let you dig deep into traces and memory without stopping the system.
  • Impose a strict implementation of Postel’s Law: Be conservative in what you send, be liberal in what you accept.
  • Be wary of a major deployment that seems to go smoothly. Errors are inevitable, and the bigger and quieter an error is, the more dangerous it is. If you are not sure how to handle an error, let the system crash: a crash makes the error easy to catch and correct.
  • Be prepared to restart the entire system from a blank slate under heavy load.
  • Notice technical decisions and components that have global effects, not just global variables.
  • Build channels for persistent communication so that knowledge survives as people join and leave teams. When building systems, do not assume operators will always do things correctly; give them the tools to undo mistakes.
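
As a rough illustration of the first two points (failing safe with a retry hint, and idempotence), here is a minimal Python sketch. Every name in it (PaymentService, Unavailable, request_id, the 30-second retry value, the contact address) is a hypothetical chosen for this example and does not come from the original post.

```python
from dataclasses import dataclass, field


@dataclass
class Unavailable:
    """Error returned while the service refuses new work."""
    code: str
    retry_after_seconds: int
    contact: str


@dataclass
class PaymentService:
    # Hypothetical service, for illustration only.
    healthy: bool = True
    balance: int = 0
    _seen: dict = field(default_factory=dict)  # request_id -> earlier result

    def deposit(self, request_id: str, amount: int):
        # Fail safe: refuse all work and tell the caller when to retry
        # and whom to contact.
        if not self.healthy:
            return Unavailable("SERVICE_UNAVAILABLE", 30, "ops@example.com")
        # Idempotence: a repeated request_id returns the stored result
        # instead of applying the deposit a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance
```

With this shape, a client that times out and resends the same request_id cannot double-apply a deposit, and a client that hits the shut-down state learns how long to wait before retrying. A real system would also need to persist the seen-request table and expire old entries, which this sketch leaves out.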

Full post here, 10 mins read