#devops
14 posts

Deploys at Slack

Getting a peek into an engineering organization's deploy process is always an interesting exercise. With Slack, it's no different. Their process today is straightforward:

  1. Create a release branch to tag the Git commit and allow developers to push hotfixes (if required).
  2. Deploy to a staging setup, which is a production environment that doesn't accept any public traffic.
  3. Phased roll-outs to canary servers. This is tightly coupled with monitoring to see any spikes in errors.

Some of the core principles for such a process are:

  • Fast deploys. All deployments are pull-based instead of push-based: the build server updates a key in Consul, which in turn pings N servers to pull the latest code.
  • Atomic deploys. During deployment, a "cold" directory is created and the new code is pulled into it. The server is then drained of traffic and a symlink is switched between the "hot" and "cold" directories (a minimal sketch of this switch follows the list).
  • Phased roll-outs lend a lot of confidence in reliability, as they allow teams to catch errors early with less impact.
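
A minimal Python sketch of the pull-based, hot/cold deploy described above. The Consul key, repository URL, directory layout, and drain commands are hypothetical placeholders, and the loop simply polls the key (assuming Consul's KV read endpoint, GET /v1/kv/<key>?raw) rather than being pinged by Consul as in Slack's setup:

```python
import os
import subprocess
import time
import urllib.request

CONSUL_KEY_URL = "http://localhost:8500/v1/kv/deploy/current-release?raw"  # hypothetical key
RELEASES_DIR = "/srv/app/releases"   # hypothetical layout
CURRENT_LINK = "/srv/app/current"    # "hot" symlink the app server serves from


def fetch_target_release() -> str:
    """Read the release identifier the build server wrote to Consul."""
    with urllib.request.urlopen(CONSUL_KEY_URL, timeout=5) as resp:
        return resp.read().decode().strip()


def deploy(release: str) -> None:
    cold_dir = os.path.join(RELEASES_DIR, release)
    # 1. Pull the new code into a "cold" directory (placeholder command).
    subprocess.run(["git", "clone", "--depth", "1", "--branch", release,
                    "https://example.com/app.git", cold_dir], check=True)
    # 2. Drain traffic from this server (hypothetical hook).
    subprocess.run(["/usr/local/bin/drain-traffic"], check=True)
    # 3. Atomically switch the "hot" symlink to the cold directory:
    #    create a temporary symlink, then rename it over the old one.
    tmp_link = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(cold_dir, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)  # atomic rename on POSIX filesystems
    # 4. Re-admit traffic (hypothetical hook).
    subprocess.run(["/usr/local/bin/undrain-traffic"], check=True)


def main() -> None:
    deployed = None
    while True:
        target = fetch_target_release()
        if target and target != deployed:
            deploy(target)
            deployed = target
        time.sleep(10)  # simple polling stands in for Consul notifying the servers


if __name__ == "__main__":
    main()
```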

Full post here, 6 mins read

Growing your tech stack: when to say no

  • Local developer utilities (one-person programs, testing tools) are very low-risk. They run on your own machine, though a whole team can adopt them, boosting productivity. If they are not widely adopted, switching environments becomes harder and you compromise uniformity, but that is often a worthwhile trade-off.
  • Deployment infrastructure tools (monitoring, logging, provisioning, building executables, deploying) are a moderate risk. They automate tasks for the whole team, which means the whole team (and deployment) stops if it breaks. But they reduce risk in production/deployment compared to manual set-up and troubleshooting. They constitute a hot area for development and you risk falling behind your competition without them.
  • A new programming language is also a moderate risk. Each language calls for new build tools, libraries, dependency management, packaging, test frameworks and internal DSLs. More than one person must learn them, or you get code no one but the original developer understands. Getting your team on board fast becomes your responsibility. You can mitigate the risk by carefully isolating the experimental code so that it becomes replaceable. Consider the tooling and documentation available before you select a language, and whether other people have integrated it into a stack like yours (and written about it). The more languages you use, the greater the cognitive overhead when debugging.
  • Adding a new database is a serious risk. A stateful system of record is critical to your business: if it goes down, you cannot just rewrite it; business stops until you can do a migration. In the worst-case scenario, you lose data. You can mitigate this risk with replication (backup and restore) and migration automation, which you should integrate and test before data enters the system. You need a dedicated team to maintain the database (monitoring, updating, re-provisioning) as load increases.

Full post here, 13 mins read

How to continuously profile tens of thousands of production servers

Some lessons & solutions from the Salesforce team that can be useful for other engineers too.

  • Ensure scalability: If writes or data are too voluminous for a single network or storage solution to handle, distribute the load across multiple data centers, coordinating retrieval from a centralized hub for investigating engineers, who can specify which clusters of hosts they may want data from.
  • Design for fault-tolerance: In a crisis where memory and CPU are overwhelmed or network connectivity is lost, profiling data can be lost too. Build resilience into your buffering and pass the data on to permanent storage, allowing it to persist in batches.
  • Provide language-agnostic runtime support: If users might be working in different languages, capture and represent profiling and observability data in a way that works regardless of the underlying language. Attach the language as metadata to profiling data points so that users can query by language and ensure data structures for stack traces and metadata are generic enough to support multiple languages and environments.
  • Allow debugging engineers to access domain-specific context to drive their investigations to a speedier resolution. Support a deep search of traces against a regular expression, which is particularly useful when debugging the issue at hand (see the sketch after this list).
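
A minimal sketch of the language-agnostic idea: a generic profiling sample with the language attached as metadata, plus a regex "deep search" over stack frames. The field names and in-memory store are hypothetical illustrations, not Salesforce's actual schema:

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class ProfileSample:
    host: str
    timestamp_ms: int
    language: str                      # attached as metadata, e.g. "java", "python"
    stack: list[str]                   # generic frame strings, any runtime
    labels: dict[str, str] = field(default_factory=dict)  # extra domain context


def query(samples: list[ProfileSample], language: str | None = None,
          frame_pattern: str | None = None) -> list[ProfileSample]:
    """Filter samples by language and/or a regex matched against any stack frame."""
    regex = re.compile(frame_pattern) if frame_pattern else None
    out = []
    for s in samples:
        if language and s.language != language:
            continue
        if regex and not any(regex.search(frame) for frame in s.stack):
            continue
        out.append(s)
    return out


# Example: find Java samples whose stacks touch a (hypothetical) checkout service.
samples = [
    ProfileSample("host-1", 1700000000000, "java",
                  ["com.example.checkout.CartService.total", "java.util.HashMap.get"]),
    ProfileSample("host-2", 1700000000500, "python",
                  ["app/checkout/cart.py:total", "builtins.sum"]),
]
print(query(samples, language="java", frame_pattern=r"checkout\."))
```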

Full post here, 9 mins read

Migrating functionality between large-scale production systems seamlessly

Lessons from Uber’s migration of its large and complex systems to a new production environment:

  • Incorporate shadowing to forward production traffic to the new system for observation and to make sure there are no regressions. This also lets you gather performance stats (a minimal sketch follows this list).
  • Use this opportunity to settle any technical debt incurred over the years, so the team can move faster and more productively in the future.
  • Carry out validation on a trial-and-error basis. Don’t assume it will be a one-time effort; plan for multiple iterations before you get it right.
  • Have a data analyst in your team to find issues early, especially if your system involves payments.
  • Once confident in your validation metrics, you can roll out to production. Uber chose to start with a test plan with a couple of employees dedicated to testing various success and failure cases, followed by a rollout to all Uber employees, and finally incremental rollout to cohorts of external users.
  • Push for a quick final migration, as rollback options are often misused and end up preventing the migration from ever completing.
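
A minimal sketch of the shadowing idea under stated assumptions: the handlers, the equality comparison, and the mismatch logger below are hypothetical stand-ins, not Uber's implementation. The old system always answers the caller, while a copy of each request is sent to the new system in the background and any differences are recorded:

```python
import threading
from typing import Any, Callable

Handler = Callable[[dict], Any]


def make_shadowing_handler(old: Handler, new: Handler,
                           record_mismatch: Callable[[dict, Any, Any], None]) -> Handler:
    """Return a handler that answers from `old` and shadows traffic to `new`."""
    def handle(request: dict) -> Any:
        response = old(request)          # production answer comes from the old system

        def shadow() -> None:
            try:
                shadow_response = new(request)
                if shadow_response != response:
                    record_mismatch(request, response, shadow_response)
            except Exception as exc:      # the new system must never break production
                record_mismatch(request, response, exc)

        threading.Thread(target=shadow, daemon=True).start()
        return response                  # the caller only ever sees the old system's result

    return handle


# Example wiring with toy handlers.
def old_system(req: dict) -> int:
    return req["amount"] * 100


def new_system(req: dict) -> int:
    return round(req["amount"] * 100)    # candidate implementation under observation


handler = make_shadowing_handler(old_system, new_system,
                                 lambda req, a, b: print("mismatch:", req, a, b))
print(handler({"amount": 12.5}))
```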

Full post here, 6 mins read

Improving incident retrospectives

  • Incident retrospectives are an integral part of any good engineering culture.
  • Often, too much focus is placed on the trigger for the incident. The retrospective should instead review the incident timeline, identify remediation items, and find owners for those items.
  • Retrospectives should be used as an opportunity for deeper analysis into systems (both people and technical) and assumptions that underlie these systems.
  • Finding remediation items should be decoupled from the retrospective process. This frees participants to conduct a deeper investigation, since they are not pushed to settle quickly for shallow explanations.
  • It’s a good practice to lighten up the retrospective template you use, because no template can capture the unique characteristics of every incident. Sticking rigidly to a template also limits the open-ended questions that can be quite useful in evolving your systems in the right direction.

Full post here, 6 mins read