How can we apply the principles of chaos engineering to AWS Lambda?

  • Identify weaknesses before they manifest in system-wide aberrant behaviours: improper fallback settings when a service is unavailable, retry storms from poorly tuned timeouts, outages when a downstream dependency gets too much traffic, cascading failures, etc.
  • Lambda functions have specific vulnerabilities. There are many more functions than services, and you need to harden boundaries around every function and not just the services. There are more intermediary services with their own failure modes (Kinesis, SNS, API Gateway) and more configurations to get right (timeout, IAM permissions).
  1. Apply stricter timeout settings for intermediate services than those at the edge.
  2. Check for missing error handling that allows exceptions from downstream services to escape.
  3. Check for missing fallbacks when a downstream service is unavailable or experiences an outage.
  • Monitor metrics carefully, especially client-side metrics, which show how the user experience is affected.
  • Design controlled experiments to probe the limits of your system.
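As a rough illustration of the points above, the following Python sketch injects artificial latency into a downstream call and checks that the function applies a strict timeout and falls back instead of letting the failure escape. Everything named here is hypothetical and chosen only for the example: the CHAOS settings, get_recommendations, call_with_timeout, the timeout value, and the handler's response shape. It uses only the standard library and is not tied to any particular chaos tooling.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical experiment settings. In a real experiment these would
# likely come from external configuration so they can be turned off fast.
CHAOS = {"enabled": True, "rate": 1.0, "delay_s": 0.5}

# Assumed timeout for the downstream call, kept well below whatever
# timeout applies at the edge.
DOWNSTREAM_TIMEOUT_S = 0.1

# Assumed fallback value returned when the downstream call fails.
FALLBACK_RECOMMENDATIONS = []

_pool = ThreadPoolExecutor(max_workers=4)


def inject_latency(func):
    """Delay a fraction of calls to simulate a slow downstream service."""
    def wrapper(*args, **kwargs):
        if CHAOS["enabled"] and random.random() < CHAOS["rate"]:
            time.sleep(CHAOS["delay_s"])
        return func(*args, **kwargs)
    return wrapper


@inject_latency
def get_recommendations(user_id):
    # Stand-in for a request to a downstream service.
    return ["item-1", "item-2"]


def call_with_timeout(func, timeout_s, fallback, *args):
    """Run func with a time limit; return fallback on timeout or error."""
    future = _pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any exception raised downstream,
        # so neither escapes to the caller. The worker thread itself is
        # not cancelled and finishes in the background.
        return fallback


def handler(event, context=None):
    recommendations = call_with_timeout(
        get_recommendations,
        DOWNSTREAM_TIMEOUT_S,
        FALLBACK_RECOMMENDATIONS,
        event["user_id"],
    )
    return {"statusCode": 200, "recommendations": recommendations}
```

With the experiment enabled, handler({"user_id": "u1"}) should return the empty fallback list after roughly the downstream timeout rather than waiting for the delayed call. Setting CHAOS["enabled"] to False restores the normal response. Starting with a small rate and widening it gradually keeps the experiment controlled.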

