How to avoid cascading failures in distributed systems

  • Cascading failures in distributed systems typically involve a feedback loop where an event causes a reduction in capacity, an increase in latency, or a spike in errors which then becomes a vicious cycle due to the responses of other parts of the system. You need to design your system thoughtfully to avoid them.
  • Set a limit on incoming requests for each instance of your service, along with load shedding at the load balancer, so that the client receives a fast failure and retry, or an error message early on.
  • Moderate client requests to limit dangerous retry behaviours: impose an exponentially increasing backoff between retries and add a little jitter, making the number of retries and wait times application-specific. User-facing applications should degrade or fail fast, batch or asynchronous processing can take longer. Also, use a circuit breaker design to track failures and successes so that a sequence of failed calls to an external service trips the breaker.
  • Ensure bad input does not become a query of death, crashing the service: write your program to quit only if the internal state seems incorrect. Use fuzz testing to help detect programs that crash from malformed input.
  • Avoid making failover plans based on proximity where a failure of a data center or zone pushes the load into the next closest resource, which will then likely cause a domino effect since this second one is likely to be as busy. Balance the load geographically instead, pushing the load to data centers with the most available capability.
  • Reduce, limit or delay work that your server does in response to a failure, such as data replication, with a token bucket algorithm and wait a while to see if the system can recover.
  • Reduce startup times from reading or caching a lot of data, to begin with; it makes autoscaling difficult and you may not detect the problem by the time you start up, and recovery will equally take longer if you need to restart.

Full post here, 14 mins read