Improving resiliency and stability of a large-scale monolithic API service

Lessons from the API layer service used by LinkedIn:

  • They chose a cross-platform design (all platforms use the same API and the same endpoints for the same features) and an all-encompassing design (one API service calls all product verticals) to allow for high code reuse.
  • Reusing data-schema definitions and endpoints made it easier for engineers to collaborate, but extending that sharing to the deployment architecture led to issues at scale. They addressed this with microclustering rather than by breaking the monolith into microservices: the service's endpoints were partitioned without splitting the code, and traffic for each partition was routed to a dedicated cluster of servers. Data from monitoring systems was used to identify which verticals had enough traffic to justify a partition.
  • For each vertical, the build system was modified to create an additional deployable named after the vertical, with configuration inherited from the shared service and extended. Traffic from the vertical’s endpoints was examined to estimate the number of servers needed in the new cluster.
  • Capacity testing was carried out during deployment: once there was enough traffic to overload at least three servers, servers were slowly taken down while latencies and error rates were observed, revealing how many queries per second each server could process without incident. This information was used for capacity planning and to fine-tune resource allocation.
  • Microclustering makes it possible to limit downstream failures and bugs to a single vertical and to tune each cluster independently of the others, which improves capacity planning and monitoring and gives more granular control over deployment.
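
To make the routing and capacity-planning ideas above more concrete, here is a minimal Python sketch. It is not LinkedIn's implementation: the endpoint prefixes, cluster names, latency and error thresholds, and the 25% headroom are all hypothetical values chosen for illustration. It shows (1) routing a request path to a per-vertical cluster with a shared fallback, (2) reading the highest safe per-server QPS from measurements taken while servers are removed, and (3) turning that figure into a server count.

```python
import math

# Hypothetical mapping of endpoint prefixes to dedicated clusters.
# Vertical names and paths are invented for illustration only.
PARTITIONS = {
    "/feed": "api-feed",
    "/messaging": "api-messaging",
}
DEFAULT_CLUSTER = "api-shared"  # everything without its own partition


def pick_cluster(path):
    """Return the cluster that should serve the given request path."""
    for prefix, cluster in PARTITIONS.items():
        # Match the prefix itself or anything below it, but not
        # look-alike paths such as "/feedback".
        if path == prefix or path.startswith(prefix + "/"):
            return cluster
    return DEFAULT_CLUSTER


def max_safe_qps(observations, latency_limit_ms, error_rate_limit):
    """Highest per-server QPS seen before either limit was first exceeded.

    observations: iterable of (qps_per_server, latency_ms, error_rate),
    one entry per step of the test as servers are taken down.
    """
    best = None
    for qps, latency_ms, error_rate in sorted(observations):
        if latency_ms > latency_limit_ms or error_rate > error_rate_limit:
            break  # the first violation marks the capacity ceiling
        best = qps
    if best is None:
        raise ValueError("no observation stayed within the limits")
    return best


def servers_needed(peak_qps, per_server_qps, headroom=0.25):
    """Servers required at peak traffic, plus a fractional safety buffer."""
    return math.ceil(peak_qps * (1 + headroom) / per_server_qps)
```

As a usage example, calling `max_safe_qps` on a handful of measurements and passing the result to `servers_needed` along with the vertical's peak QPS yields a first estimate of cluster size, which would then be refined with real traffic.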

Full post here, 5 mins read