How to continuously profile tens of thousands of production servers

Lessons and solutions from the Salesforce team that other engineers may find useful too.

  • Ensure scalability: When write volume or data size exceeds what a single network or storage solution can handle, distribute the load across multiple data centers and coordinate retrieval through a centralized hub, so investigating engineers can specify which clusters of hosts they want data from.
  • Design for fault-tolerance: In a crisis where memory and CPU are overwhelmed or network connectivity is lost, profiling data can be lost too. Build resilience into your buffering and flush data to permanent storage in batches, so samples survive transient failures.
  • Provide language-agnostic runtime support: If users work in different languages, capture and represent profiling and observability data in a way that is independent of the underlying language. Keep the data structures for stack traces and metadata generic enough to support multiple languages and environments, and attach the language as metadata to each profiling data point so users can query by language.
  • Allow debugging engineers to access domain-specific context to drive their investigations to a faster resolution. For example, a deep search that matches traces against a regular expression is particularly useful to developers debugging the issue at hand.
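The fault-tolerance point above can be sketched as a bounded in-memory buffer that flushes in batches and keeps its contents when permanent storage is unreachable. This is a minimal illustration, not Salesforce's implementation; the class, batch size, and storage callable are all assumptions.

```python
import collections

# Illustrative batch size; a real agent would tune this.
BATCH = 3

class ResilientBuffer:
    """Hypothetical buffer: hold profiling samples in memory,
    flush them to permanent storage in batches, and keep them
    buffered if a flush fails so data survives transient outages."""

    def __init__(self, store):
        self.store = store  # callable that persists a batch of samples
        # maxlen bounds memory use when the system is already in crisis
        self.pending = collections.deque(maxlen=1000)

    def add(self, sample):
        self.pending.append(sample)
        if len(self.pending) >= BATCH:
            self.flush()

    def flush(self):
        batch = list(self.pending)
        try:
            self.store(batch)
        except OSError:
            return  # storage unreachable: keep samples for the next attempt
        self.pending.clear()

persisted = []
buf = ResilientBuffer(persisted.extend)
for i in range(7):
    buf.add({"seq": i})
# two full batches flushed; the seventh sample stays buffered
```

The deque's `maxlen` is the pressure valve: when memory is overwhelmed, the oldest unsent samples are dropped rather than the process.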
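The language-agnostic data model above might look like the following sketch: stack frames are plain strings and the source language rides along as metadata instead of shaping the schema. All names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ProfileSample:
    """Hypothetical generic profiling sample: works for any runtime
    because frames are opaque strings and language is just metadata."""
    host: str
    timestamp_ms: int
    stack: list[str]  # innermost frame first, language-neutral
    metadata: dict[str, str] = field(default_factory=dict)

samples = [
    ProfileSample("app-01", 1_700_000_000_000,
                  ["OrderService.place", "Db.insert"],
                  {"language": "java"}),
    ProfileSample("app-02", 1_700_000_000_050,
                  ["handle_request", "render"],
                  {"language": "python"}),
]

# Because language is only metadata, querying by it is an ordinary filter.
java_samples = [s for s in samples if s.metadata.get("language") == "java"]
```

The design choice is that nothing in the schema is specific to the JVM, CPython, or any other runtime, so one storage and query layer serves them all.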
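The deep search mentioned in the last bullet can be sketched in a few lines: compile a regular expression and return every stored trace containing a matching frame. The traces and function name are illustrative, not from the original post.

```python
import re

# Example traces as lists of frame names (hypothetical data)
traces = [
    ["OrderService.place", "PaymentGateway.charge", "HttpClient.post"],
    ["ReportJob.run", "CsvWriter.write"],
]

def deep_search(traces, pattern):
    """Return the traces in which any frame matches the pattern."""
    rx = re.compile(pattern)
    return [t for t in traces if any(rx.search(frame) for frame in t)]

hits = deep_search(traces, r"Payment\w+\.charge")
```

In practice the search would run server-side against the profiling store, but the matching logic is the same.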

Full post here, 9 mins read