#scaling software
16 posts

Migrating functionality between large-scale production systems seamlessly

Lessons from Uber’s migration of its large and complex systems to a new production environment:

  • Incorporate shadowing: forward production traffic to the new system for observation so you can confirm there are no regressions. Shadowing also lets you gather performance statistics.
  • Use the migration to pay down technical debt accumulated over the years, so the team can move faster afterwards.
  • Treat validation as iterative trial and error. Don’t assume it will be a one-time effort; plan for multiple iterations before you get it right.
  • Have a data analyst on your team to find issues early, especially if your system involves payments.
  • Once you are confident in your validation metrics, roll out to production. Uber started with a test plan in which a couple of employees were dedicated to testing various success and failure cases, followed by a rollout to all Uber employees, and finally an incremental rollout to cohorts of external users.
  • Push for a quick final migration: a lingering rollback option is often misused and prevents the migration from ever completing.
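
The shadowing idea in the first lesson can be sketched roughly as follows. This is a minimal illustration, not Uber's implementation: `ShadowRouter`, `old_fare` and `new_fare` are hypothetical names, and the shadow call is made inline for brevity, whereas a production setup would more likely mirror traffic asynchronously so the new system cannot add latency.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ShadowRouter:
    """Serve from the old system; mirror each request to the new one and compare."""
    primary: Callable[[dict], Any]   # existing production handler (hypothetical)
    shadow: Callable[[dict], Any]    # new system under observation (hypothetical)
    mismatches: list = field(default_factory=list)
    shadow_errors: list = field(default_factory=list)

    def handle(self, request: dict) -> Any:
        response = self.primary(request)  # callers only ever see this result
        try:
            # Pass a copy so the shadow system cannot mutate the live request.
            shadow_response = self.shadow(dict(request))
            if shadow_response != response:
                self.mismatches.append((request, response, shadow_response))
        except Exception as exc:
            # A failure in the shadow path must never affect production.
            self.shadow_errors.append((request, exc))
        return response


def old_fare(request: dict) -> float:
    return round(request["km"] * 1.5, 2)


def new_fare(request: dict) -> float:
    # Deliberate regression for long trips, so the comparison has something to find.
    if request["km"] >= 100:
        return 0.0
    return round(request["km"] * 1.5, 2)


router = ShadowRouter(primary=old_fare, shadow=new_fare)
results = [router.handle({"km": km}) for km in (3, 10, 250)]
print(results)                 # responses served by the old system
print(len(router.mismatches))  # requests where the new system disagreed
```

The recorded mismatches are the raw material for the validation iterations described above: each one is either a regression in the new system or an old bug that the migration has surfaced.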

Full post here, 6 mins read

Our adventures in scaling

  • Handling sudden activity spikes poses different challenges than scaling a rapidly growing user base.
  • Check whether databases are resource-constrained and slowing down as a result. Review hardware metrics during spikes: CPU, disk I/O and memory.
  • If there are no spikes in those metrics, look higher up the infrastructure stack at service resources for increased resource acquisition times. Also, check the garbage collection activity, which indicates whether JVM heap and threads are the bottlenecks.
  • Check network metrics next to look for a constraint in the network between services and databases - for example, if the services’ database connection pools are consistently reaching size limits.
  • To collect more metrics, log the latency of all transactions and keep those that exceed a defined threshold. Analyse them across daily usage to determine whether removing the identified bottleneck would make a significant difference.
  • Some bottlenecks may be code-related, for example, inefficient queries, a resource-starved service, or inconsistencies in the database response itself. So look at metrics on higher-level functioning and not just low-level system components.
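
The latency-logging step above could look something like this. It is a sketch under stated assumptions: `SlowTransactionLog`, the 0.5-second threshold and the transaction name are illustrative choices, not taken from the original post.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SlowTransactionLog:
    """Time every transaction, but keep only those slower than a threshold."""

    def __init__(self, threshold_seconds, clock=time.perf_counter):
        self.threshold_seconds = threshold_seconds  # assumed cutoff; tune per system
        self.clock = clock                          # injectable so it can be tested
        self.slow = defaultdict(list)               # transaction name -> slow durations

    @contextmanager
    def timed(self, name):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            if elapsed > self.threshold_seconds:
                self.slow[name].append(elapsed)

    def summary(self):
        """Return (name, slow_count, total_slow_seconds) rows, worst total first."""
        rows = [(name, len(d), sum(d)) for name, d in self.slow.items()]
        return sorted(rows, key=lambda row: row[2], reverse=True)


log = SlowTransactionLog(threshold_seconds=0.5)
with log.timed("checkout"):
    pass  # the real transaction would run here
print(log.summary())
```

Sorting by total slow time rather than by count is one way to judge whether fixing a given bottleneck would make a significant difference across a day's usage.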

Full post here, 6 mins read

Seven deadly sins of a software project

“Maintainability is the most valuable virtue of modern software development.”

Do these seven things to make maintainable software.

  1. Learn about elegant coding to avoid anti-patterns allowed by languages that are too flexible.
  2. Ensure all changes are traceable (what was changed, by whom, and why) by always using a ticket to flag any problem, referencing the ticket in the commit, and preserving its history.
  3. Automate testing, packaging and deployment so that every release can be executed from a single command line.
  4. Enforce static analysis rules so no build can pass if any of the rules are violated.
  5. Measure and report test coverage and aim for at least 80% coverage. This coverage metric also lets future developers see if coverage is affected when making changes.
  6. Beware of nonstop development. Always release and version your software so future developers can see your intentions and roadmap from a clear release history (typically in Git tags and release notes), and keep each version available for download.
  7. Ensure user interfaces are carefully documented to let the end-user see what the software does.
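
As one illustration of point 2, a commit message can be checked for a ticket reference before it is accepted. This is a minimal sketch: the ticket format (`PAY-123` style keys) and the function names are assumptions, and the pattern is deliberately loose, so it would need tightening to match a real tracker.

```python
import re
from typing import List

# Assumed ticket format: an uppercase project key, a hyphen, and a number, e.g. "PAY-123".
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def referenced_tickets(commit_message: str) -> List[str]:
    """Return every ticket key mentioned in the commit message."""
    return TICKET_PATTERN.findall(commit_message)


def check_commit_message(commit_message: str) -> bool:
    """A commit is traceable only if it references at least one ticket."""
    return bool(referenced_tickets(commit_message))


print(referenced_tickets("PAY-123: fix rounding in fare calculation"))
print(check_commit_message("fix typo"))
```

A check like this could be wired into a Git `commit-msg` hook, which receives the path of the file holding the proposed message as its argument, or run in CI over the commits in a pull request.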


Full post here, 7 mins read

Scalability problems: Hidden challenges of growing a system

  • Two main challenges of scaling distributed systems: centralization and synchronization.
  • When one node has too much control, that node’s capacity limits the entire system in terms of the resources it can handle or the users it can serve.
  • When scaling up, the system can run into computational limitations, storage limitations, and network limitations.
  • Synchronous communication over a WAN is not only slower but also less reliable than over a LAN.
  • Synchronous communication across larger geographies can be an obstacle to scaling.

Full post here, 8 mins read

What Google taught me about scaling engineering teams

  • Focus on building shared tools & abstractions across engineering teams. Dedicate people to it.
  • Create reusable training material & use it to onboard new engineers.
  • Agree on a set of standard coding conventions as early as possible.
  • Take code reviews seriously and use them to increase code quality.
  • Having plenty of the right data will solve many problems.
  • Automate testing to scale your code. Rigorously focus on tests during code reviews too.

Full post here, 5 mins read