#distributed system

Lessons Learned while Working on Large-Scale Server Software

Working on large-scale software comes with its own unique set of challenges. Here’s a set of tips to remember when faced with a mammoth challenge. While the full article is a longer and better read, here are some distilled points.

  • Plan for the worst. Creating a baseline for the worst thing that can happen is comforting. It also helps us plan for failure (because failure is inevitable).
  • Don’t trust the network. We often take the network for granted, but network latency and flakiness can be a source of immense pain because production behaviour won’t match what you see on localhost.
  • Crash-first software. Big, loud crashes bring a developer’s attention to the problem faster, thereby helping them fix the bug; silent failures fester in systems far longer and crop up at the most inopportune moments (see the sketch after this list).
  • People are the linchpin of any system. Much of the success of large-scale software depends on how people react to failure. The problem is magnified when a senior engineer leaves or new members join the team. Default to using tools and processes to ensure that tribal knowledge is codified and persistent.
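As a rough illustration of the crash-first point above, here is a minimal Python sketch (the function and exception names are hypothetical, not from the post) that raises loudly the moment an invariant breaks instead of logging the problem and carrying on:

```python
class CorruptStateError(Exception):
    """Hypothetical error raised when internal state looks incorrect."""


def apply_transfer(balances: dict, src: str, dst: str, amount: int) -> None:
    # Crash-first: surface the problem immediately rather than failing silently.
    if amount <= 0:
        raise ValueError(f"transfer amount must be positive, got {amount}")
    if balances.get(src, 0) < amount:
        raise CorruptStateError(f"{src} would go negative; refusing to continue")
    balances[src] -= amount
    balances[dst] = balances.get(dst, 0) + amount
```

A loud ValueError or CorruptStateError in production is easy to alert on; a silently skipped transfer is not.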

Full post here, 11 mins read

How to avoid cascading failures in distributed systems

  • Cascading failures in distributed systems typically involve a feedback loop: an event causes a reduction in capacity, an increase in latency, or a spike in errors, which then becomes a vicious cycle because of how other parts of the system respond. You need to design your system thoughtfully to avoid them.
  • Set a limit on incoming requests for each instance of your service, along with load shedding at the load balancer, so that clients receive a fast failure (and can retry) or an early error message.
  • Moderate client requests to limit dangerous retry behaviours: impose an exponentially increasing backoff between retries and add a little jitter, keeping the number of retries and wait times application-specific (a sketch follows this list). User-facing applications should degrade or fail fast, while batch or asynchronous processing can take longer. Also, use a circuit breaker design to track failures and successes so that a sequence of failed calls to an external service trips the breaker.
  • Ensure bad input does not become a query of death, crashing the service: write your program to quit only if the internal state seems incorrect. Use fuzz testing to help detect programs that crash from malformed input.
  • Avoid making failover plans based on proximity, where the failure of a data center or zone pushes the load onto the next-closest resource, which can cause a domino effect since that resource is probably just as busy. Balance the load geographically instead, pushing it to the data centers with the most available capacity.
  • Reduce, limit or delay work that your server does in response to a failure, such as data replication, using a token bucket algorithm (sketched below), and wait a while to see whether the system can recover.
  • Avoid long startup times caused by reading or caching a lot of data upfront: they make autoscaling difficult, mean problems may only surface after startup completes, and make recovery equally slow when you need to restart.
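As a rough sketch of the backoff-with-jitter advice above (the request function, attempt limit, and delays are illustrative assumptions, not from the post):

```python
import random
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry request_fn with exponentially increasing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: fail fast and let the caller decide
            # Exponential backoff with full jitter to avoid synchronised retry storms.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

A circuit breaker would sit in front of request_fn and skip the call entirely once a run of failures trips it.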

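The token bucket mentioned above could look roughly like this (rates, capacity, and names are illustrative); recovery work such as re-replication proceeds only when a token is available, which spreads the work out instead of letting it spike right after a failure:

```python
import time


class TokenBucket:
    """Refill at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should defer the recovery work and try again later
```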
Full post here, 14 mins read

Distributed systems learnings

  • Building a new distributed system is easier than migrating an old system over to it. Migration is more time-consuming and just as challenging as writing a system from scratch. You tend to underestimate the amount of custom monitoring needed to ensure the old and new systems behave the same way, and while the new system is more elegant, you need to decide whether to accommodate or drop edge cases from the legacy system.
  • To improve reliability, start simple, measure, report and repeat: establish simple service-level objectives (SLOs) and a low bar for reliability (say 99.9%), measure it weekly, fix the systemic issues at the root of any failure to hit it, and once confident, move to stricter definitions and targets (a sketch of the weekly measurement follows this list).
  • Treat idempotency, consistency and durability changes as breaking changes, even if technically not, in terms of communication, rollouts, and API versioning.
  • Prioritise the financial and end-user impact of outages over the systems themselves. Talk to the relevant teams, use appropriate metrics, and use these to put a price tag on preventive measures.
  • To determine who owns a service, check who owns the oncall (the operating of the system). The rest - code ownership, understanding of the system - follows from there. This means that shared oncall between multiple teams is not a healthy practice but a band-aid solution.
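A back-of-the-envelope sketch of the "measure it weekly" step, with made-up request counts and a 99.9% availability SLO:

```python
def weekly_slo_report(total_requests: int, failed_requests: int, slo: float = 0.999) -> str:
    """Compare a week's success ratio against a simple availability SLO."""
    success_ratio = 1 - failed_requests / total_requests
    error_budget = (1 - slo) * total_requests      # failures the SLO allows this week
    status = "met" if success_ratio >= slo else "missed"
    return (f"SLO {slo:.3%} {status}: {success_ratio:.4%} success, "
            f"{failed_requests} failures against a budget of {error_budget:.0f}")


# Hypothetical week: 1.2M requests, 950 failures.
print(weekly_slo_report(1_200_000, 950))
```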

Full post here, 6 mins read

Scalability problems: Hidden challenges of growing a system

  • Two main challenges of scaling distributed systems: centralization and synchronization.
  • When one node has too much control (centralization), that node’s capacity limits the entire system in the resources it can handle and the users it can serve.
  • When scaling up, the system can run into computational limitations, storage limitations, and network limitations.
  • Synchronous communication over a WAN is not only slower but also less reliable than over a LAN.
  • Synchronous communication across larger geographies can be an obstacle to scaling.

Full post here, 8 mins read

Three strategies for designing caching in a large-scale distributed system

  • Always design distributed systems to be ‘two mistakes high’ - handle failures at two levels so that there is at least one chance to recover, instead of the system failing outright on a single mistake.
  • Place the web cache container in a sidecar arrangement with each instance of your server/web service container. Modifications to the cache container then do not affect the decoupled service.
  • Place the cache above the service containers (or app replicas) so that all the containers can access the same cache replicas, and the cache can call the service in case of a miss (see the first sketch below).
  • The above two approaches work for stateless services. If state is a significant factor for your app and there are many concurrent connections, sharded caching serves better.
  • Use consistent hashing to distribute the load across multiple cache shards that show up as a single cache proxy to the user (see the second sketch below).
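A minimal read-through sketch of the cache-above-the-service idea (the fetch function stands in for a call to the service and the TTL is arbitrary): the cache answers hits locally and only calls the service on a miss.

```python
import time


class ReadThroughCache:
    """Consult the cache first; call the backing service only on a miss."""

    def __init__(self, fetch_fn, ttl_seconds: float = 60.0):
        self._fetch = fetch_fn        # e.g. an HTTP call to the service replica
        self._ttl = ttl_seconds
        self._store = {}              # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]           # cache hit
        value = self._fetch(key)      # cache miss: ask the service
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value
```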

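And a compact consistent-hashing ring for the sharded-cache point (shard names and the virtual-node count are arbitrary): each key maps to the nearest virtual node clockwise on the ring, so adding or removing a shard only remaps a small fraction of keys.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to cache shards with consistent hashing (virtual nodes)."""

    def __init__(self, shards, vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{shard}-{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.shard_for("user:42"))  # the same key always lands on the same shard
```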
Full post here, 5 mins read