#Issue136
2 posts

Deploys at Slack

Getting a peek into an engineering organization's deploy process is always an interesting exercise. With Slack, it's no different. Their process today is straightforward:

  1. Create a release branch to tag the Git commit and allow developers to push hotfixes (if required).
  2. Deploy to a staging setup, which is a production environment that doesn't accept any public traffic.
  3. Roll out in phases to canary servers. This step is tightly coupled with monitoring so that any spike in errors is caught.
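The phased roll-out in step 3 can be sketched roughly as below. This is an illustration only, not Slack's tooling: the `deploy` and `error_rate` callables, the phase fractions, and the error threshold are all hypothetical stand-ins for a real deploy hook and a real monitoring query.

```python
from typing import Callable, Sequence


def phased_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[], float],
    phases: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed fractions
    max_error_rate: float = 0.02,  # assumed threshold
) -> bool:
    """Deploy to growing fractions of the fleet; stop if errors spike."""
    done = 0  # number of servers already deployed to
    for fraction in phases:
        # Always include at least one server so the first phase is a canary.
        target = max(1, int(len(servers) * fraction))
        for server in servers[done:target]:
            deploy(server)
        done = max(done, target)
        # Check monitoring after each phase before widening the roll-out.
        if error_rate() > max_error_rate:
            return False  # halt here; a real system would also roll back
    return True
```

Checking the error rate between phases is what limits the impact of a bad release: a failure in the first phase touches one server rather than the whole fleet.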

Some of the core principles for such a process are:

  • Fast deploys. All deployments are pull-based instead of push-based. The build server updates a key in Consul, which in turn pings N servers to pull the latest code.
  • Atomic deploys. During deployment, a "cold" directory is created that pulls in the new code. The server is then drained of any traffic and a symlink switches between the "hot" and "cold" directories.
  • Phased roll-outs. These lend a lot of confidence in reliability, as they allow teams to catch errors early and with less impact.
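The symlink switch behind atomic deploys can be sketched as follows. This is a minimal sketch assuming a POSIX filesystem, not Slack's actual implementation; the function name and the `hot`, `cold`, and `current` paths are hypothetical, and draining traffic is not modeled.

```python
import os
import tempfile


def activate_release(release_dir: str, current_link: str) -> None:
    """Atomically repoint the symlink `current_link` at `release_dir`."""
    tmp_link = current_link + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link is atomic on POSIX, so readers see
    # either the old target or the new one, never a missing link.
    os.replace(tmp_link, current_link)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    hot = os.path.join(root, "hot")    # directory currently serving
    cold = os.path.join(root, "cold")  # directory receiving the new code
    os.mkdir(hot)
    os.mkdir(cold)
    current = os.path.join(root, "current")
    activate_release(hot, current)
    activate_release(cold, current)    # the switch
    print(os.readlink(current))
```

Creating the new link under a temporary name and renaming it into place avoids the window in which a delete-then-create sequence would leave no link at all.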

Full post here, 6 mins read

Do not log

Logging is an important aspect of any software system. This post focuses on the perils of logging and proposes better ways of achieving end results. Although the author uses Haskell for examples, the principles are global.

  • Using logs to monitor production systems is a fallacy. Tracking business metrics with dedicated error tracking and monitoring products like Prometheus or Sentry leads to better systems. If the business metric isn't affected, does it matter that an error occurred?
  • Logging is a side effect that can also fail. The juice is not worth the squeeze.
  • Storing and grepping through logs in a centralized location is a sub-system of its own. It's one more failure point in your architecture.
  • Any error log should include the complete business context of the object or activity that failed. Simply logging "Error occurred" is futile. Better error logging lets the engineering team reproduce the error.
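As an illustration of the last point, here is a small sketch of an error that carries its business context with it. The post's own examples are in Haskell; this Python version is an assumption-laden stand-in, and `ContextualError`, `charge`, and the `failure_counts` counter (a placeholder for a real metrics client) are hypothetical names.

```python
import json
from collections import Counter

# Stand-in for a real metrics client: a plain in-process counter.
failure_counts: Counter = Counter()


class ContextualError(Exception):
    """An error that carries the business context of what failed."""

    def __init__(self, message: str, **context: object) -> None:
        super().__init__(message)
        self.context = context

    def to_record(self) -> str:
        """Serialize the message and its context as one JSON line."""
        return json.dumps(
            {"message": str(self), "context": self.context}, sort_keys=True
        )


def charge(order_id: str, amount_cents: int) -> None:
    """Hypothetical operation that rejects non-positive amounts."""
    if amount_cents <= 0:
        # Count the failure as a metric and attach what is needed to reproduce it.
        failure_counts["charge.invalid_amount"] += 1
        raise ContextualError(
            "invalid charge amount", order_id=order_id, amount_cents=amount_cents
        )
```

A caller that catches `ContextualError` can forward `to_record()` to whatever error tracker is in use, so the report names the order and amount involved instead of a bare "Error occurred".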

I think logging is a very nuanced subject that takes years for teams to understand and coalesce around. Good logging principles are not born but carefully farmed within a team through many years of trial and error.

Full post here, 11 mins read