#reliability

How to measure the reliability of your software throughout the CI/CD workflow

  • In addition to CI/CD, consider incorporating continuous reliability into the workflow. This may mean spending more time on troubleshooting than on writing code.
  • Consider whether every step should be automated, and whether some steps warrant more automation than others.
  • Look beyond log files and testing to determine code quality: set up advanced quality gates that block problematic code from passing to the next stage, and use feedback loops to inform more comprehensive testing.
  • In addition to log aggregators and performance monitoring tools, get a more granular understanding of app quality by ensuring you can access the source code, variable state and stack trace at the time of an error. Aggregate this data across the app, library, class, deployment or another boundary for insight into the functional quality of the code.
  • Based on this data, you can categorise known, reintroduced and unknown errors, classify events, and understand frequency and failure rates. This enables you to write more comprehensive tests for development and pre-production environments alike, driving higher code quality (a minimal quality-gate sketch follows this list).
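
A minimal sketch of the quality-gate idea above, assuming error events have already been aggregated (for example by a log aggregator or error-tracking tool) into records carrying a fingerprint, a category and a count. The categories, thresholds and field names are illustrative assumptions, not taken from the original post:

```python
import sys
from collections import Counter
from dataclasses import dataclass

@dataclass
class ErrorEvent:
    fingerprint: str   # stable hash of stack trace + code location
    category: str      # "known", "reintroduced" or "unknown"
    count: int         # occurrences observed in the test/staging window

def evaluate_gate(events, max_unknown=0, max_reintroduced=0, max_total=50):
    """Return (passed, reasons) for a release candidate."""
    totals = Counter()
    for e in events:
        totals[e.category] += e.count
    reasons = []
    if totals["unknown"] > max_unknown:
        reasons.append(f"{totals['unknown']} unknown errors (limit {max_unknown})")
    if totals["reintroduced"] > max_reintroduced:
        reasons.append(f"{totals['reintroduced']} reintroduced errors (limit {max_reintroduced})")
    if sum(totals.values()) > max_total:
        reasons.append(f"total error count {sum(totals.values())} exceeds {max_total}")
    return not reasons, reasons

if __name__ == "__main__":
    events = [ErrorEvent("a1f3", "known", 12), ErrorEvent("9c2e", "unknown", 1)]
    passed, reasons = evaluate_gate(events)
    if not passed:
        print("Quality gate failed:", "; ".join(reasons))
        sys.exit(1)  # a non-zero exit code blocks the next pipeline stage
    print("Quality gate passed")
```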

Full post here, 6 mins read

How we write an incident postmortem

  • Record the impact of the incident, the steps taken to mitigate it, the learnings from it and the follow-up tasks identified (a minimal template sketch follows this list).
  • Make the postmortem a team effort led by the on-call engineer, and compile a timeline of the incident.
  • Keep the conversation free of blame.
  • Ask how the mistake was allowed to happen, whether automation could have prevented it in the first place and whether wrong assumptions were behind the mistake(s).
  • Check whether alerts for the problem came fast enough and whether they could be improved or added to.
  • Consider whether the initial assessment of an incident’s impact was accurate and whether the impact could have been worse than it was.
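
The sections the bullets describe translate naturally into a small template. A minimal sketch, assuming the postmortem is captured as structured data and rendered to text; the field names, layout and sample incident are illustrative, not the format used in the original post:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimelineEntry:
    time: str    # e.g. "14:02 UTC"
    event: str   # what happened, or what action the on-call engineer took

@dataclass
class Postmortem:
    title: str
    impact: str                                      # who and what was affected, and for how long
    timeline: List[TimelineEntry] = field(default_factory=list)
    mitigation_steps: List[str] = field(default_factory=list)
    learnings: List[str] = field(default_factory=list)
    follow_up_tasks: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        lines = [f"Postmortem: {self.title}", "", "Impact:", self.impact, "", "Timeline:"]
        lines += [f"- {t.time}: {t.event}" for t in self.timeline]
        lines += ["", "Mitigation:"] + [f"- {s}" for s in self.mitigation_steps]
        lines += ["", "Learnings:"] + [f"- {s}" for s in self.learnings]
        lines += ["", "Follow-up tasks:"] + [f"- {s}" for s in self.follow_up_tasks]
        return "\n".join(lines)

if __name__ == "__main__":
    pm = Postmortem(
        title="Example incident",
        impact="Elevated error rates for 23 minutes; a small share of requests timed out.",
        timeline=[TimelineEntry("14:02 UTC", "Alert fired"),
                  TimelineEntry("14:09 UTC", "On-call engineer rolled back the deploy")],
        mitigation_steps=["Rolled back the 14:00 deploy"],
        learnings=["The canary stage did not cover the affected path"],
        follow_up_tasks=["Add the affected path to the canary stage"],
    )
    print(pm.to_text())
```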

Full post here, 10 mins read

How Shopify manages petabyte-scale MySQL backup and restore

A few learnings from the post:

  • Use incremental snapshots of data for backup after one initial full snapshot to reduce both storage and recovery times.
  • Save on storage costs by deleting all but the last two copies for recovery purposes.
  • To ensure data integrity, verify these last two backup iterations daily.
  • Compress, encrypt and transfer the backup to offsite storage for added security.
  • Using snapshots this way is more expensive than traditional backup technologies, but it reduces the time and processing power needed for both the user and the service (a minimal retention sketch follows this list).
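
A minimal, self-contained sketch of the retention flow the bullets describe: one full snapshot followed by incrementals, keep only the two most recent copies, and mark the retained copies for daily verification. The in-memory store stands in for a real snapshot API; none of this is Shopify's actual tooling:

```python
from datetime import datetime, timezone

def take_snapshot(store, volume_id):
    snap = {
        "id": f"{volume_id}-{len(store) + 1}",
        "type": "incremental" if store else "full",  # full only on the very first run
        "created_at": datetime.now(timezone.utc),
        "verified": False,
    }
    store.append(snap)
    return snap

def enforce_retention(store, keep=2):
    """Delete all but the newest `keep` snapshots; return the deleted ones."""
    store.sort(key=lambda s: s["created_at"])
    deleted, store[:] = store[:-keep], store[-keep:]
    return deleted

def verify_retained(store):
    """Placeholder for the daily restore-and-check pass over the retained copies."""
    for snap in store:
        # A real pipeline would restore to a scratch host and run integrity checks here.
        snap["verified"] = True

if __name__ == "__main__":
    store = []
    for _ in range(4):                    # simulate four backup cycles
        take_snapshot(store, "shard-42")
    enforce_retention(store, keep=2)
    verify_retained(store)
    print([(s["id"], s["type"], s["verified"]) for s in store])
    # only the two most recent snapshots remain, both verified
```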

Full post here, 6 mins read

Site reliability engineering best practices for data pipelines

  • Define and measure service-level objectives (SLOs) to ensure data freshness, data correctness and data isolation.
  • Plan for dependency failures by checking for overdependence on products that don’t meet their own SLOs.
  • Create and maintain system diagrams, process documentation and playbook entries that outline recovery from alert conditions.
  • Reduce hot-spotting by balancing out the workload across resources.
  • Utilize autoscaling.
  • Adhere to strict access control for privacy, security and data integrity.
  • Use idempotent and two-phase mutations so that if the pipeline fails in the middle of a process, retries do not duplicate data or store incorrect data (a minimal sketch follows this list).
  • Use checkpointing to record partially completed work so that a failed pipeline can resume where it left off.
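
A minimal sketch of the last two bullets, assuming a simple batch pipeline over ordered records: writes are keyed by a stable record id so retries are idempotent, and a file-based checkpoint records progress so a failed run can resume. The checkpoint file, the dict sink and the record shape are illustrative assumptions; the sketch covers idempotency and checkpointing, not the two-phase commit itself:

```python
from pathlib import Path

CHECKPOINT = Path("pipeline.checkpoint")

def load_checkpoint() -> int:
    return int(CHECKPOINT.read_text()) if CHECKPOINT.exists() else 0

def save_checkpoint(offset: int) -> None:
    CHECKPOINT.write_text(str(offset))

def process(records, sink):
    start = load_checkpoint()            # resume where the last run stopped
    for offset in range(start, len(records)):
        rec = records[offset]
        # Idempotent mutation: keyed by a stable record id, so re-running after a
        # mid-process failure overwrites the same entry instead of duplicating it.
        sink[rec["id"]] = rec["value"]
        save_checkpoint(offset + 1)      # checkpoint after each committed record

if __name__ == "__main__":
    CHECKPOINT.unlink(missing_ok=True)   # start the demo from a clean slate
    sink = {}
    records = [{"id": f"r{i}", "value": i * i} for i in range(5)]
    process(records, sink)
    process(records, sink)               # second run is a no-op thanks to the checkpoint
    print(sink, "resumes at offset", load_checkpoint())
```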

Full post here, 5 mins read