#reliability
6 posts

Simple systems have less downtime

  • Simplicity while building a system leads to less downtime because you don’t need to wait for a specific expert to do or help with anything; anybody on the team can take over troubleshooting without a steep learning curve or training.
  • Troubleshooting therefore takes less time, because learning the system, identifying the problem and resolving it is almost intuitive.
  • When each part of the system has a clear function, it is easier to find several alternative solutions.
  • Follow these principles to build simpler systems:
  1. Features don’t justify the complexity. Choose tools that are easy to operate rather than the most feature-rich option.
  2. Complex ideas lead to complex implementations. Pare down your ideas so they can be explained quickly.
  3. Try modifications before additions. Most people rush to add new layers, steps or integrations for new requirements. Instead, first check whether the core system can be modified to meet them.

Full post here, 6 mins read

How can we apply the principles of chaos engineering to AWS Lambda

  • Identify weaknesses before they manifest in system-wide aberrant behaviours: improper fallback settings when a service is unavailable, retry storms from poorly tuned timeouts, outages when a downstream dependency gets too much traffic, cascading failures, etc.
  • Lambda functions have specific vulnerabilities. There are many more functions than services, so you need to harden boundaries around every function, not just around services. There are also more intermediary services with their own failure modes (Kinesis, SNS, API Gateway) and more configurations to get right (timeouts, IAM permissions).
  1. Apply stricter timeout settings to intermediate services than to those at the edge.
  2. Check for missing error handling that lets exceptions from downstream services escape.
  3. Check for missing fallbacks for when a downstream service is unavailable or experiences an outage (a minimal handler sketch illustrating these checks appears after this list).
  • Monitor metrics carefully, especially client-side metrics, which show how the user experience is affected.
  • Design controlled experiments to probe the limits of your system.
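
To make these checks concrete, here is a minimal sketch of a Python Lambda handler that applies a strict timeout to a downstream call, handles its errors and falls back when the dependency is unavailable. The downstream URL, timeout value and fallback payload are illustrative assumptions, not details from the original post.

```python
import json
import socket
import urllib.error
import urllib.request

# Hypothetical downstream dependency; in practice this would come from
# configuration such as an environment variable.
DOWNSTREAM_URL = "https://internal.example.com/recommendations"

# Stricter timeout than at the edge: an intermediate call should fail fast
# so the function can fall back instead of timing out itself.
DOWNSTREAM_TIMEOUT_SECONDS = 2

FALLBACK_RESPONSE = {"recommendations": [], "source": "fallback"}


def handler(event, context):
    """Call a downstream service with a strict timeout and a fallback."""
    try:
        with urllib.request.urlopen(
            DOWNSTREAM_URL, timeout=DOWNSTREAM_TIMEOUT_SECONDS
        ) as response:
            body = json.loads(response.read())
    except (urllib.error.URLError, socket.timeout, json.JSONDecodeError):
        # Without this handler the exception would escape and surface as a
        # function error; instead, degrade gracefully to a safe default.
        body = FALLBACK_RESPONSE

    return {"statusCode": 200, "body": json.dumps(body)}
```

A controlled chaos experiment would then deliberately inject latency or failures into the downstream call and verify that the fallback path, alarms and client-side metrics behave as expected.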

Full post here, 4 mins read

How to measure the reliability of your software throughout the CI/CD workflow

  • In addition to CI/CD, consider incorporating continuous reliability into the workflow. This may mean more focus on troubleshooting than on writing code.
  • Consider whether every step should be automated, and whether some steps warrant more automation than others.
  • Look beyond log files and testing to determine code quality: set up advanced quality gates that block problematic code from passing to the next stage, and use feedback loops to inform more comprehensive testing (a minimal quality-gate sketch follows this list).
  • In addition to log aggregators and performance monitoring tools, get a more granular understanding of app quality by ensuring you can access the source code, variable state and stack trace at the time of an error. Aggregate this data across the app, library, class, deployment or another boundary for an insight into the functional quality of the code.
  • Based on this data, you can categorise known, reintroduced and unknown errors, classify events, and understand frequency and failure rates. This enables you to write more comprehensive tests for development and pre-production environments alike, driving higher code quality.
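
As an illustration of such a quality gate, here is a hedged sketch: it reads an aggregated error report and fails the pipeline stage when unknown or reintroduced errors appear. The report format, file name and thresholds are assumptions made for the example, not taken from the post.

```python
import json
import sys

# Hypothetical aggregated error report produced for this build by an
# error-tracking or log-aggregation tool, e.g.:
# [{"signature": "NullPointerException@CartService.total",
#   "category": "unknown", "count": 3}, ...]
REPORT_PATH = "error-report.json"

# Example thresholds; real values would be tuned per service.
MAX_UNKNOWN_ERRORS = 0
MAX_REINTRODUCED_ERRORS = 0


def main() -> int:
    with open(REPORT_PATH) as f:
        events = json.load(f)

    unknown = [e for e in events if e["category"] == "unknown"]
    reintroduced = [e for e in events if e["category"] == "reintroduced"]
    known = len(events) - len(unknown) - len(reintroduced)

    print(f"known={known} reintroduced={len(reintroduced)} unknown={len(unknown)}")

    # Block promotion to the next stage when the gate is violated.
    if len(unknown) > MAX_UNKNOWN_ERRORS or len(reintroduced) > MAX_REINTRODUCED_ERRORS:
        print("Quality gate failed: new or reintroduced errors detected.", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A CI job would run this script after the test stage and use its exit code to decide whether the build may proceed.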

Full post here, 6 mins read

How we write an incident postmortem

  • Record the impact of the incident, steps taken to mitigate it, learnings from it and follow-up tasks identified.
  • Make the postmortem a team effort, led by the on-call engineer, and compile a timeline of the incident.
  • Keep the conversation free of blame.
  • Ask how the mistake was allowed to happen, whether automation could have prevented it in the first place, and whether wrong assumptions lay behind it.
  • Check whether alerts for the problem came fast enough and whether they could be improved or added to.
  • Consider whether the initial assessment of an incident’s impact was accurate and whether the impact could have been worse than it was.

Full post here, 10 mins read

How Shopify manages petabyte-scale MySQL backup and restore

A few learnings from the post:

  • Use incremental snapshots of data for backup after one initial full snapshot to reduce both storage and recovery times.
  • Save on storage costs by deleting all but the last two copies retained for recovery purposes (a pruning sketch follows this list).
  • To ensure data integrity, verify these last two backup iterations daily.
  • Compress, encrypt and transfer the backup to offsite storage for further security.
  • Using snapshots this way is more expensive than traditional backup technologies, but it reduces the time and processing power required for both the user and the service.
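
The post describes Shopify's own tooling; purely as an illustration of the "keep only the last two snapshots" retention idea, here is a minimal sketch using AWS EBS snapshots via boto3. The volume ID, and the choice of AWS itself, are assumptions for the example rather than what Shopify uses.

```python
import boto3

KEEP_LAST = 2  # retain only the two most recent snapshots for recovery

# Hypothetical volume holding the MySQL data directory.
VOLUME_ID = "vol-0123456789abcdef0"

ec2 = boto3.client("ec2")


def prune_snapshots(volume_id: str, keep_last: int = KEEP_LAST) -> None:
    """Delete all but the most recent `keep_last` snapshots of a volume."""
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )
    # Newest first; EBS snapshots are incremental after the first full one.
    snapshots = sorted(
        response["Snapshots"], key=lambda s: s["StartTime"], reverse=True
    )
    for snapshot in snapshots[keep_last:]:
        ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])


if __name__ == "__main__":
    prune_snapshots(VOLUME_ID)
```

Run after each successful backup, this keeps storage bounded while the two most recent recovery points remain available for daily verification.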

Full post here, 6 mins read