#fundamentals

Lessons Learned while Working on Large-Scale Server Software

Working on large-scale software comes with its own unique set of challenges. Here’s a set of tips to keep in mind when faced with a mammoth challenge. While the entire article is a longer and better read, here are some distilled points.

  • Plan for the worst. Creating a baseline for the worst thing that can happen is comforting. It also helps us plan for failure (because failure is inevitable).
  • Don’t trust the network. We often take the network for granted, but network latency and flakiness can be a source of immense pain: your production system’s behaviour won’t match what you see on localhost.
  • Crash-first software. Big, loud crashes bring a developer’s attention to the problem faster, helping them fix the bug sooner. Silent failures fester in systems far longer and crop up at the most inopportune moments (see the sketch after this list).
  • People are the lynchpin of any system. Much of the success of large-scale software depends on how people react to failure, a problem that is magnified when a senior engineer leaves or new members join the team. Default to tools and processes that codify tribal knowledge and keep it persistent.
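
Not from the article itself: below is a minimal Python sketch of the crash-first idea, validating required configuration at startup and failing loudly instead of limping along. The setting names are invented for illustration.

```python
import os
import sys


def load_config() -> dict:
    """Read required settings at startup and crash loudly if any are missing."""
    required = ["DATABASE_URL", "QUEUE_URL"]  # hypothetical settings
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        # Crash-first: a loud, early failure beats a silent fallback that
        # surfaces as confusing behaviour hours later in production.
        sys.exit(f"fatal: missing configuration: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}


if __name__ == "__main__":
    config = load_config()
    print("starting with", sorted(config))
```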

Full post here, 11 mins read

Production Oriented Development

This is a great article from a battle-tested veteran. Lots of sane advice, some of it controversial and some of it nuanced.

  • Engineers should be on call and operate their own code. They know where the skeletons are hidden, and experiencing the pain of production failures first-hand is the quickest way to get them fixed.
  • Buy beats build. NIH (Not Invented Here) syndrome is very strong amongst tech teams, but it's generally a recipe for disaster. Use SaaS tools, managed-code tools or hosted services to reduce the burden on yourself and your team.
  • Make deploys easy. The easier a process is, the more it's followed. Deploy pipelines should be simple, quick and pain-free. Frequent deploys have a lot of advantages: customers receive value faster, the changes are smaller, and debugging is easier.
  • QA teams slow everyone down. This is a highly controversial statement with a lot more nuance to it. Manual QA cannot keep up with release cycles, so aim to automate as much testing as possible. Also, move QA teams towards testing in production to ensure business continuity rather than acting as roadblocks (see the smoke-check sketch after this list).
  • Be boring! Boring technology is battle-tested and largely bug-free. Always bias towards old, boring technology instead of bleeding-edge frameworks.
  • Things will always break. Planning your failure response is a more scalable strategy than trying to prevent every failure: the former leads to better systems, while the latter still ends in failure. I personally connect with this advice; we can't predict future incidents, but we can plan our response to them.
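
As a rough illustration of "automate testing and verify in production" (not taken from the article), here is a sketch of a post-deploy smoke check a pipeline could run; the health-check URL is hypothetical.

```python
import sys
import urllib.error
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # hypothetical endpoint


def smoke_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the freshly deployed service answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


if __name__ == "__main__":
    if not smoke_check(HEALTH_URL):
        # A failing smoke check should stop the pipeline (and trigger a rollback),
        # which keeps deploys frequent without making them risky.
        sys.exit("smoke check failed: service is not healthy after deploy")
    print("smoke check passed")
```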

Full post here, 11 mins read

Don’t be a jerk: write documentation

  • Documentation can be minimal and yet helpful. Take whatever you have in volatile formats (email, chat logs, shell I/O) and paste it into more durable ones (README files, diagrams, websites, FAQs, wikis).
  • Document based on your audience:
  1. For newcomers, focus on what it does, why you wrote it, who should be using it and for what, how it should be used (how to build, configure, run and get started), and where additional information may be found; they should not need to read the source (see the docstring sketch after this list).
  2. Regular users could do with a reference manual, examples, EDoc/JavaDoc/Doc, wiki or website, and API descriptions.
  3. Contributors should have access to the source repository, as well as project structure and architecture (where to find out about specific functionalities and where new features should go), the project principles, tests, issues, and a roadmap.
  • An easy way to get started is to imagine you are talking to a new user. Then vary the user and add different scenarios (varying business context, experience and knowledge, budget) to expand the documentation until it is comprehensive. Finally, you might split it up into categories for different people, or reimagine and reformat it for different media and platforms.
  • Another approach is to find a problem a user might face and show them how to solve it.
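
To make the "minimal yet helpful" point concrete, here is a sketch (not from the article) of newcomer-facing documentation written as a Python module docstring; the project name and commands are invented.

```python
"""ordersync: keeps the billing database in step with the order queue.

What it does:  consumes order events and upserts them into the billing tables.
Why it exists: billing reports drifted when orders were copied over by hand.
Who it is for: anyone operating the billing pipeline; no need to read the source.

Getting started:
    pip install -r requirements.txt
    cp config.example.toml config.toml   # fill in queue credentials
    python -m ordersync --dry-run        # check connectivity without writing

More information: see the docs/ directory and the team wiki.
"""
```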

Full post here, 11 mins read

Four magic numbers for measuring software delivery

  • Lead time: the time to validate, design, implement and ship a new valuable thing. Consider two types of lead time: (a) feature lead time, the time to move an item from high-level requirements to feature release, and (b) deployment lead time, the time from merging to master to the component running in production. (A toy calculation of all four numbers follows this list.)
  • Deployment frequency, understood as the number of times per developer per day you ship to production. Your deployment frequency may ebb and flow, from 5-10 on a busy day to none for the rest of the week. Establish a baseline over 4 weeks: say, 1 production deployment per day on average.
  • Change failure percentage, i.e. the proportion of red deployments, bugs, alerts, etc. Defining the change failure rate as bugs per production deployment over a given period, aim for about 10% or less for bugs with medium- or high-priority customer impact.
  • Mean time to recovery/resolution. For mean time to resolution, aim for less than a working week.
  • Considering feature lead time as a key performance indicator can help you break features up and deliver faster. This might also decrease lead time per deployment.
  • Convert generic support tickets into bugs with customer-impact implications so they can be tracked better. Making bugs more visible can also make bottlenecks in the resolution process more apparent.
  • Count red production deployments and production alerts as a change failure.
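
A toy sketch (not from the post) of how the four numbers could be computed from a log of deployment records, assuming each record carries a merge time, a deploy time, a failure flag and an optional recovery time:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    merged_at: datetime                   # merged to master
    deployed_at: datetime                 # live in production
    failed: bool                          # counted as a change failure
    recovered_at: datetime | None = None  # when the failure was resolved


def mean_delta(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas) if deltas else timedelta()


def four_magic_numbers(deploys: list[Deployment], days: int, developers: int) -> dict:
    failures = [d for d in deploys if d.failed]
    return {
        # deployment lead time: merge to master -> running in production
        "deployment_lead_time": mean_delta([d.deployed_at - d.merged_at for d in deploys]),
        # deployments per developer per day, averaged over the baseline period
        "deployment_frequency": len(deploys) / days / developers,
        # share of deployments counted as change failures
        "change_failure_pct": 100 * len(failures) / len(deploys) if deploys else 0.0,
        # mean time to recovery for the failures that have been resolved
        "mean_time_to_recovery": mean_delta(
            [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
        ),
    }
```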

Full post here, 9 mins read

Inefficient efficiency

  • Latency (measured in time units) is the time between a stimulus and a response, while throughput (measured in deliveries per time unit) is the rate at which the system meets its goals.
  • Sometimes latency and throughput interfere with each other. Do you field a request and respond before taking the next one (low latency for the first customer but lower overall throughput), or do you accept the second request while processing the first for higher throughput? (The toy numbers after this list illustrate the tradeoff.)
  • We make latency/throughput tradeoffs every day, and most of us are biased towards throughput. For example, you carefully plan all foreseeable architectural improvements instead of initiating the first profitable change you come across.
  • Instead, you should often optimize for latency. For example, if preferences are likely to change between requests (a high rate of change), responding sooner lets you adapt to the change, which is less wasteful.
  • To decide between throughput and latency, consider the cost of delay, whether you might learn something that changes your approach subsequently, and whether external factors might force a new approach.
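
A toy illustration (not from the post) of the tradeoff above, using invented numbers: answering requests one at a time keeps the first caller's latency low, while batching amortises overhead for higher throughput at the cost of making the first caller wait.

```python
# Invented figures: each request needs 1.0 s of work and each dispatch carries
# a fixed 0.5 s overhead (startup, context switch, etc.).
WORK = 1.0
OVERHEAD = 0.5


def one_at_a_time(n: int) -> tuple[float, float]:
    """Respond to each request before accepting the next (optimise latency)."""
    first_response = OVERHEAD + WORK          # the first caller waits 1.5 s
    throughput = n / (n * (OVERHEAD + WORK))  # requests completed per second
    return first_response, throughput


def batched(n: int) -> tuple[float, float]:
    """Accept all n requests and pay the overhead once (optimise throughput)."""
    first_response = OVERHEAD + n * WORK      # the first caller waits for the whole batch
    throughput = n / (OVERHEAD + n * WORK)
    return first_response, throughput


for n in (1, 10):
    print(f"n={n}  one-at-a-time={one_at_a_time(n)}  batched={batched(n)}")
```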

Full post here, 4 mins read