#scaling software
19 posts

Lesson learned while working on large-scale server software

  • Always have a plan for worst-case error scenarios, and find a general solution, such as automatically shutting down all operations and returning an error code that tells callers when to retry or whom to contact (see the first sketch after this list).
  • Document your decision-making and add idempotence wherever possible (see the second sketch after this list).
  • Approach debugging scientifically: first gather data, form a hypothesis and design an experiment to test it, and only then apply your fix; use tools that let you dig deep into traces and memory without stopping the system.
  • Impose a strict implementation of Postel’s Law: Be conservative in what you send, be liberal in what you accept.
  • Be wary of a major deployment that seems to go smoothly. Errors are inevitable, and the bigger and quieter the error, the more dangerous it is. If you are not sure how to handle an error, let the system crash; that makes errors easy to catch and correct.
  • Be prepared to restart the entire system from a blank slate under heavy load.
  • Notice technical decisions and components that have global effects, not just global variables.
  • Build channels for persistent communication as people join and leave teams. When building systems, do not assume operators will always do things correctly; give them the tools to undo mistakes.
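
As a rough illustration of the first point above, here is a minimal sketch (not from the original post; check_dependencies, RETRY_AFTER_SECONDS and SUPPORT_CONTACT are made-up names) of a service that refuses all work while a dependency is down and tells callers when to retry and whom to contact:

```python
# Minimal sketch: when a dependency is known to be failing, stop doing partial
# work and answer every request with a clear signal about when to retry.
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 30                 # hypothetical back-off hint for clients
SUPPORT_CONTACT = "oncall@example.com"   # hypothetical contact to call

def check_dependencies() -> bool:
    """Placeholder health check; a real service would probe its DB, queues, etc."""
    return False  # pretend the downstream system is down

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not check_dependencies():
            # Fail the whole operation deliberately instead of limping along.
            self.send_response(503)
            self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
            self.send_header("X-Support-Contact", SUPPORT_CONTACT)
            self.end_headers()
            self.wfile.write(b"Service unavailable; retry later or contact support.\n")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```

And a minimal sketch of the idempotence point, assuming each side-effecting request carries an idempotency key; the in-memory set stands in for a durable store and charge is a made-up operation:

```python
# Record an idempotency key with each side-effecting operation so that retries
# and replays do not apply it twice.
processed: set[str] = set()

def charge(account: str, amount: int, idempotency_key: str) -> str:
    if idempotency_key in processed:
        return "already applied"           # safe to call again after a retry
    # ... perform the real side effect here (DB write, payment call, ...) ...
    processed.add(idempotency_key)
    return "applied"

print(charge("acct-1", 100, "req-42"))     # applied
print(charge("acct-1", 100, "req-42"))     # already applied: the retry is harmless
```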

Full post here, 10 mins read

Tips for running scalable workloads on Kubernetes

  • You must set resource requests & limits so the Kubernetes scheduler can spread workloads evenly across nodes (see the first sketch after this list).
  • The scheduler can use configured affinities & anti-affinities as another hint about which node is best suited to your pod.
  • In Kubernetes, a readinessProbe signals that a pod is ready to start receiving requests, and a livenessProbe signals that a pod is still running as expected. Setting both ensures that requests to a service always go to a container that can process them.
  • It is common for nodes in Kubernetes to disappear, so configure a pod disruption budget to ensure you always have a minimum number of ready pods for a deployment (see the second sketch after this list).
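
A rough sketch of the requests/limits and probe settings above, expressed with the official kubernetes Python client; the image name, paths, ports and thresholds are placeholders, not recommendations from the post:

```python
# Build a container spec with resource requests/limits plus readiness and
# liveness probes, then print the same fields you would put in a YAML manifest.
from kubernetes import client

container = client.V1Container(
    name="api",
    image="example/api:1.0",                      # hypothetical image
    resources=client.V1ResourceRequirements(      # lets the scheduler place pods sensibly
        requests={"cpu": "250m", "memory": "256Mi"},
        limits={"cpu": "500m", "memory": "512Mi"},
    ),
    readiness_probe=client.V1Probe(               # "ready to receive requests"
        http_get=client.V1HTTPGetAction(path="/ready", port=8080),
        period_seconds=5,
    ),
    liveness_probe=client.V1Probe(                # "still running as expected"
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        initial_delay_seconds=10,
        period_seconds=10,
    ),
)

print(client.ApiClient().sanitize_for_serialization(container))
```

And a sketch of a pod disruption budget, assuming a client version that ships the policy/v1 models; the name, label selector and min_available value are illustrative:

```python
from kubernetes import client

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="api-pdb"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=2,                          # keep at least 2 ready pods at all times
        selector=client.V1LabelSelector(match_labels={"app": "api"}),
    ),
)

print(client.ApiClient().sanitize_for_serialization(pdb))
```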

Full post here, 13 mins read

Scaling to 100k users

  • When you first build an application, the API, DB and client may all reside on one machine or server. As you scale up, you can split out the DB layer into a managed service.
  • Consider the client as a separate entity from the API as you grow further and build for multiple platforms: web, mobile web, Android, iOS, desktop apps, third-party services, etc.
  • As you grow to about 1000 users, you might add a load balancer in front of the API to allow for horizontal scaling.
  • As serving and uploading resources starts to overload servers, at say 10,000 users, move static content to a CDN (which you can get with a cloud storage service) so the API no longer needs to handle this load.
  • At around 100,000 users, you might scale out the data layer, still built on relational database systems such as PostgreSQL or MySQL.
  • You might also add a cache layer using an in-memory key-value store like Redis or Memcached, so that repeated hits to the DB can be served from cached data. Cache services are also easier to scale out than DBs themselves (see the sketch after this list).
  • Finally, you might split out services to scale them independently, with, say, a load balancer exclusively for the WebSocket service; you might need to partition and shard the DB, depending on your service; and you might also want to install monitoring services.
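
A rough cache-aside sketch for the Redis point above, using the redis Python package; the key scheme, TTL and fetch_user_from_db are illustrative, not from the post:

```python
# Serve repeated reads from Redis and only fall back to the DB on a cache miss.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300   # hypothetical expiry so stale entries age out

def fetch_user_from_db(user_id: int) -> dict:
    """Placeholder for the real (and comparatively slow) DB query."""
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:                      # cache hit: skip the DB entirely
        return json.loads(cached)
    user = fetch_user_from_db(user_id)          # cache miss: query, then populate
    r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
    return user
```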

Full post here, 8 mins read

Migrating functionality between large-scale production systems seamlessly

Lessons from Uber’s migration of its large and complex systems to a new production environment:

  • Incorporate shadowing to forward production traffic to the new system for observation, making sure there are no regressions. This also lets you gather performance stats (see the sketch after this list).
  • Use this opportunity to settle technical debt incurred over the years, so the team can move faster in the future and productivity rises.
  • Carry out validation on a trial-and-error basis. Don’t assume it will be a one-time effort; plan for multiple iterations before you get it right.
  • Have a data analyst on your team to find issues early, especially if your system involves payments.
  • Once confident in your validation metrics, you can roll out to production. Uber chose to start with a test plan with a couple of employees dedicated to testing various success and failure cases, followed by a rollout to all Uber employees, and finally incremental rollout to cohorts of external users.
  • Push for a quick final migration; rollback options are often misused and end up preventing complete migration.
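
A simplified sketch of the shadowing idea, not Uber’s actual setup: serve every request from the current system, forward a copy to the new system in the background, and log any divergence. The URLs and the use of the requests library are assumptions:

```python
# Fire-and-forget shadowing: callers only ever see the legacy response, while a
# copy of each request is replayed against the new system for comparison.
import logging
import threading
import requests

LEGACY_URL = "http://legacy.internal"   # hypothetical endpoints
NEW_URL = "http://new.internal"

def shadow(path: str, params: dict) -> None:
    """Replay the request against the new system and record regressions."""
    try:
        mirrored = requests.get(NEW_URL + path, params=params, timeout=2)
        logging.info("shadow %s -> %s in %.0f ms", path, mirrored.status_code,
                     mirrored.elapsed.total_seconds() * 1000)
    except requests.RequestException as exc:
        logging.warning("shadow %s failed: %s", path, exc)

def handle(path: str, params: dict) -> requests.Response:
    threading.Thread(target=shadow, args=(path, params), daemon=True).start()
    return requests.get(LEGACY_URL + path, params=params, timeout=2)
```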

Full post here, 6 mins read

Our adventures in scaling

  • Handling sudden activity spikes poses different challenges than scaling a rapidly growing user base.
  • Check whether databases are resource-constrained and hence slowing down. During spikes, watch hardware metrics for CPU, disk I/O and memory.
  • If there are no spikes in those metrics, look higher up the infrastructure stack at service resources for increased resource acquisition times. Also, check the garbage collection activity, which indicates whether JVM heap and threads are the bottlenecks.
  • Check network metrics next to look for a constraint in the network between services and databases - for example, if the services’ database connection pools are consistently reaching size limits.
  • To collect more metrics, log the latency of every transaction and flag those above a defined threshold, then analyse them across daily usage to determine whether removing the identified bottleneck would make a significant difference (see the sketch after this list).
  • Some bottlenecks may be code-related - for example, inefficient queries, a resource-starved service or inconsistencies in the database responses themselves - so look for metrics on higher-level functioning, not just low-level system components.
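
A small sketch of the latency-logging idea above: time every transaction and record only those slower than a chosen threshold for later analysis. The threshold value and the decorated function are placeholders:

```python
# Log only the transactions that exceed a defined latency threshold.
import functools
import logging
import time

SLOW_THRESHOLD_MS = 200   # hypothetical cut-off for "worth analysing"

def log_if_slow(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms > SLOW_THRESHOLD_MS:
                logging.warning("slow transaction %s: %.1f ms", func.__name__, elapsed_ms)
    return wrapper

@log_if_slow
def place_order(order_id: int) -> None:
    time.sleep(0.3)   # stand-in for real work that occasionally runs slow

place_order(1)
```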

Full post here, 6 mins read