#resiliency
2 posts

Designing resilient systems beyond retries: architecture patterns and engineering chaos

  • Incorporate idempotency: an idempotent endpoint returns the same result for the same parameters, and either has no side effects or executes its side effects only once, which makes retries safer. If an operation has side effects but cannot distinguish unique calls, add an idempotency key parameter that the client must supply for a retry to be safe; without the key, the retry is prevented.
  • Use asynchronous responses for ‘deferrable work’: instead of relying on a successful call to a dependency that might fail, have the service itself return a successful or partial response to the client and complete the deferred work, with retries, in the background. Downstream errors then no longer affect the endpoint, and the request path carries less risk of added latency and resource use.
  • Apply chaos engineering as a best practice for testing resiliency: deliberately introduce latency or simulate outages in parts of the system so that it fails and you can improve it. However, minimize the ‘blast radius’ of chaos experiments in production. In practice, the process should be the opposite of chaotic:
  1. Define a steady state. Your hypothesis is that the steady state will not change during the experiment.
  2. Pick an experiment that mirrors real-world situations: a server shutting down, a lost network connection to a DB, auto-scaling events, a hardware switch.
  3. Pick a control group (which does not change) and an experiment group from the backend servers.
  4. Introduce a failure in an aspect or component of the system and attempt to disprove the hypothesis by comparing metrics from the control and experiment groups.
  5. If the hypothesis is disproved, the affected parts are in need of improvement. After making changes, repeat your experiment until confidence is achieved.
  6. Automate your chaos experiments, including automatically disabling the experiment if it exceeds the acceptable blast radius.
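
The experiment steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_experiment`, the use of error rate as the steady-state metric, and the `tolerance` and `blast_radius_limit` values are all hypothetical and do not come from the original post.

```python
from statistics import mean


def run_experiment(control_error_rates, experiment_error_rates,
                   tolerance=0.01, blast_radius_limit=0.05):
    """Compare a control group with an experiment group, sample by sample.

    Each input is a sequence of error rates (fraction of failed requests).
    The thresholds are illustrative assumptions, not recommended values.
    """
    seen_control, seen_experiment = [], []
    for control_rate, experiment_rate in zip(control_error_rates,
                                             experiment_error_rates):
        seen_control.append(control_rate)
        seen_experiment.append(experiment_rate)
        # Step 6: automatically disable the experiment when the
        # acceptable blast radius is exceeded.
        if experiment_rate > blast_radius_limit:
            return "aborted"
    if not seen_control:
        raise ValueError("no samples collected")
    # Steps 1 and 4: the hypothesis is that the steady state does not change,
    # so the experiment group should stay close to the control group.
    drift = mean(seen_experiment) - mean(seen_control)
    return "hypothesis held" if drift <= tolerance else "hypothesis disproved"
```

A result of "hypothesis disproved" corresponds to step 5: the affected component needs work before the experiment is repeated.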

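The idempotency-key idea from the first bullet can be illustrated with a minimal sketch. Everything named here (`ChargeService`, `charge`, the in-memory dictionary) is a hypothetical stand-in; a real service would likely need a durable, shared store and protection against concurrent requests with the same key.

```python
import uuid


class ChargeService:
    """Hypothetical service whose charge() call has a side effect."""

    def __init__(self):
        self._responses = {}       # idempotency key -> stored response
        self.side_effects_run = 0  # counts how often the side effect executed

    def charge(self, account, amount_cents, idempotency_key):
        # A repeated key means this is a retry: return the stored response
        # instead of running the side effect a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.side_effects_run += 1
        response = {
            "account": account,
            "charged_cents": amount_cents,
            "charge_id": self.side_effects_run,
        }
        self._responses[idempotency_key] = response
        return response


service = ChargeService()
key = str(uuid.uuid4())                     # the client generates the key once
first = service.charge("acct-1", 500, key)
retry = service.charge("acct-1", 500, key)  # safe retry: same key, same result
```

Here the retry returns the same response as the first call and the side effect runs only once; a call with a fresh key is treated as a new operation.
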
Full post here, 6 mins read

Improving resiliency and stability of a large-scale monolithic API service

Lessons from the API layer service used by LinkedIn:

  • They chose a cross-platform design (with all platforms using the same API and same endpoints for the same features) and an all-encompassing design (one API service calls all product verticals), to allow for high code reuse.
  • They reused data-schema definitions and endpoints to make it easier for engineers to collaborate, but extending that sharing to the deployment architecture caused issues at scale. They addressed this with microclustering rather than breaking the monolith into microservices: endpoints of the service were partitioned without splitting the code, and traffic for each partition was routed to a dedicated cluster of servers. Data from monitoring systems was used to identify which verticals had enough traffic to justify a partition.
  • For each vertical, the build system was modified to create an additional deployable named after the vertical, with configuration inherited from the shared service and extended. Traffic from the vertical’s endpoints was examined to estimate the number of servers needed in the new cluster.
  • While deploying, capacity testing was carried out: once there was enough traffic to overload at least three servers, servers were slowly taken down to observe latencies and error rates, revealing how many queries per second each server could process without incident. This information fed capacity planning and helped fine-tune resource allocation.
  • Microclustering limits downstream failures and bugs to a single vertical, and lets each cluster be tuned independently of the others for better capacity planning, monitoring and granular control over deployment.
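
To make the partitioning idea concrete, here is a minimal sketch of routing endpoints to dedicated clusters while the code stays in one service. The path prefixes, cluster names and the `cluster_for` function are invented for illustration; the post does not describe how LinkedIn's routing layer is implemented.

```python
# Hypothetical mapping from an endpoint prefix (one per vertical) to the
# cluster that serves it. Verticals without enough traffic to justify a
# partition stay on the shared cluster.
CLUSTER_BY_PREFIX = {
    "/feed": "api-feed",
    "/messaging": "api-messaging",
}
DEFAULT_CLUSTER = "api-shared"


def cluster_for(path):
    """Return the name of the cluster that should serve this request path."""
    for prefix, cluster in CLUSTER_BY_PREFIX.items():
        # Match the prefix itself or anything below it, but not lookalikes
        # such as "/feedback".
        if path == prefix or path.startswith(prefix + "/"):
            return cluster
    return DEFAULT_CLUSTER
```

Because every cluster runs the same code, adding a partition in this sketch is a configuration change (a new entry in the mapping) rather than a code split.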

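The capacity-planning step can be expressed as simple arithmetic once the capacity test has shown how many queries per second one server handles without incident. The sketch below is an assumption-laden illustration: the function name `servers_needed` and the `headroom` safety margin are not taken from the post.

```python
import math


def servers_needed(peak_qps, safe_qps_per_server, headroom=0.3):
    """Estimate the size of a new cluster.

    peak_qps: peak traffic observed for the vertical's endpoints.
    safe_qps_per_server: per-server rate found during capacity testing.
    headroom: extra fraction of capacity kept in reserve (an assumption).
    """
    if safe_qps_per_server <= 0:
        raise ValueError("safe_qps_per_server must be positive")
    return math.ceil(peak_qps * (1 + headroom) / safe_qps_per_server)
```

For example, with a peak of 12,000 queries per second, 400 queries per second per server and 25% headroom, the estimate is 12,000 × 1.25 / 400 = 37.5, which rounds up to 38 servers.
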
Full post here, 5 mins read