#distributed system
4 posts

Distributed systems learnings

  • Building a new distributed system is easier than migrating the old system over to it. Migrating an old system is more time-consuming and just as challenging as writing one from scratch. It is easy to underestimate the amount of custom monitoring needed to confirm that both systems behave the same way. The new system may be more elegant, but you still need to decide whether to accommodate or drop edge cases from the legacy system.
  • To improve reliability, start simple, measure, report and repeat: establish simple service-level objectives (SLOs) and a low bar for reliability (say 99.9%), measure it weekly, fix systemic issues at the root of the failure to hit it, and once confident, move to stricter definitions and targets.
  • Treat changes to idempotency, consistency and durability as breaking changes in terms of communication, rollouts, and API versioning, even if technically they are not.
  • Prioritize the financial and end-user impact of outages over the impact on the systems themselves. Talk to the relevant teams, agree on appropriate metrics, and use these to put a price tag on preventive measures.
  • To determine who owns a service, check who owns the on-call (the operation of the system). The rest - code ownership, understanding of the system - follows from there. This means that on-call shared between multiple teams is not a healthy practice but a bandage solution.
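
The "start simple, measure, report and repeat" loop from the SLO point above could be sketched roughly as follows. This is only an illustration, not something from the original post: the function names, the event counts, and the idea of tracking an "error budget" are assumptions made here to make the arithmetic concrete.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # weekly window, matching the "measure it weekly" advice


def availability(good_events: int, total_events: int) -> float:
    """Fraction of events that succeeded in the measurement window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def allowed_downtime_minutes(slo_target: float,
                             window_minutes: int = MINUTES_PER_WEEK) -> float:
    """Minutes of full downtime the SLO tolerates in one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * window_minutes


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Share of the error budget left: 1.0 = untouched, 0.0 = spent, < 0 = SLO missed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


# Hypothetical weekly numbers for one service (made up for illustration).
weekly_total = 1_000_000
weekly_good = 999_500
print(f"availability: {availability(weekly_good, weekly_total):.4%}")
print(f"budget left:  {error_budget_remaining(weekly_good, weekly_total, 0.999):.0%}")
```

With a 99.9% target, a weekly window tolerates roughly ten minutes of full downtime. Once the service hits that bar consistently, the same functions can be re-run with a stricter `slo_target`.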

Full post here, 6 mins read

Scalability problems: Hidden challenges of growing a system

  • Two main challenges of scaling distributed systems: centralization and synchronization.
  • Centralization: when one node has too much control, that node's capacity limits the resources the entire system can handle and the number of users it can serve.
  • When scaling up, the system can run into computational limitations, storage limitations, and network limitations.
  • Synchronous communication over a WAN is not only slower but also less reliable than over a LAN.
  • Synchronous communication across larger geographies can be an obstacle to scaling.
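
A back-of-the-envelope calculation shows why synchronous communication over long distances can cap scaling. The round-trip times below are illustrative assumptions, not measurements from the original post, and the helper names are hypothetical.

```python
def sequential_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Total wait when each call must finish before the next one starts."""
    return round_trips * rtt_ms


def max_sync_ops_per_second(rtt_ms: float) -> float:
    """Upper bound on operations per second for one blocking request/response loop."""
    return 1000.0 / rtt_ms


# Assumed round-trip times, for illustration only.
LAN_RTT_MS = 0.5   # same data center
WAN_RTT_MS = 80.0  # between distant regions

for name, rtt in (("LAN", LAN_RTT_MS), ("WAN", WAN_RTT_MS)):
    print(f"{name}: 5 sequential calls take {sequential_latency_ms(5, rtt)} ms, "
          f"at most {max_sync_ops_per_second(rtt)} ops/s per blocking caller")
```

Under these assumed numbers, a request that needs five sequential calls waits 2.5 ms on the LAN but 400 ms across the WAN, and each blocking caller is limited to 12.5 operations per second no matter how fast the machines are.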

Full post here, 8 mins read

Three strategies for designing the caching in large-scale distributed system

  • Always design distributed systems to be ‘two mistakes high’ - handle failures at two levels so that there is at least one chance to recover instead of the system failing right away on a mistake.
  • Place the web cache container in a side-car arrangement with each instance of your server/web service container. Because the two are decoupled, modifying the cache container does not affect the service.
  • Place the cache above the service containers (or app replicas) so that all the containers can access the same cache replicas, and the cache can call the service in case of a miss.
  • The above two approaches work for stateless services. If state is a significant factor for your app and there are many concurrent connections, sharded caching serves better.
  • Use consistent hashing to distribute the load across multiple cache shards that show up as a single cache proxy to the user.
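
The consistent-hashing point can be made concrete with a small hash ring. This is a minimal sketch under stated assumptions, not the implementation from the original post: the `HashRing` class, the shard names, the use of MD5 for spreading keys, and the choice of 100 virtual nodes per shard are all illustrative.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring (MD5 is used for spreading, not security)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps keys to cache shards; adding or removing a shard only moves nearby keys."""

    def __init__(self, shards, vnodes: int = 100):
        self._vnodes = vnodes   # virtual nodes per shard, to even out the load
        self._points = []       # sorted ring positions
        self._owner = {}        # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{shard}#{i}")
            if point in self._owner:
                continue  # keep the existing owner on the (very unlikely) collision
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def remove_shard(self, shard: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != shard]
        self._owner = {p: s for p, s in self._owner.items() if s != shard}

    def shard_for(self, key: str) -> str:
        """Return the shard owning the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring has no shards")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical shard names; a cache proxy would call shard_for() on every request.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.shard_for("user:42"))
```

The proxy in front of the shards only needs `shard_for`, so callers still see a single cache. When a shard is added, the only keys that change owner are the ones that now land on the new shard; everything else stays cached where it was.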

Full post here, 5 mins read

Scalability: growing a system in different directions

  • A scalable distributed system continues to perform effectively as its users and/or resources grow in different directions.
  • Usually, a system grows in terms of more data, more processes, more machines, more users.
  • Scalability can be measured in terms of size, geography & administrative effort.
  • Size scalability is the one most developers think about. It can be in terms of resources and/or users. Adding nodes should not degrade performance or slow the system down, irrespective of the resources available.
  • Geographical scalability implies that nodes are added in a way that accounts for the geographical distance between existing and new nodes. Adding new nodes shouldn’t increase the time it takes for the nodes to communicate with each other.
  • Administrative scalability requires that adding new nodes does not greatly increase overheads on human engineering and management resources or on security concerns.

Full post here, 9 mins read