Distributed systems learnings
- Building a new distributed system is easier than migrating the old system over to it. Migrating an old system is more time-consuming and just as challenging as writing one from scratch. You tend to underestimate the amount of custom monitoring needed to ensure they both work the same way and a new system is more elegant, but you need to decide whether to accommodate or drop edge cases from the legacy system.
- To improve reliability, start simple, measure, report and repeat: establish simple service-level objectives (SLOs) and a low bar for reliability (say 99.9%), measure it weekly, fix systemic issues at the root of the failure to hit it, and once confident, move to stricter definitions and targets.
- Treat idempotency, consistency and durability changes as breaking changes, even if technically not, in terms of communication, rollouts, and API versioning.
- Give importance to financial and end-user impacts of outages over the systems. Talk to the relevant teams and use appropriate metrics, and use these to put a price tag on preventive measures.
- To determine who owns a service, check who owns the oncall(the operating of the system). The rest - code ownership, understanding of the system - follow from there. This means that shared oncall between multiple teams is not a healthy practice but a bandage solution.
Full post here, 6 mins read