Scaling a Mature Data Pipeline — Managing Overhead

  • Over time, teams end up encoding application structure in the data pipeline. Application logic gets coupled with orchestration logic.
  • Orchestration complexity causes overhead. This complexity scales with the depth of the data pipeline.
  • When you decouple orchestration logic from application logic, you gain tools to fight this overhead without compromising the quality of the application.
  • When trying to reduce the run time of a data pipeline, analyze the whole pipeline’s execution time, not just the obvious factors like map-reduce computation time.
  • Pay close attention to fault tolerance.
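
To make the decoupling point concrete, here is a minimal sketch, not the approach from the original post. It assumes a toy three-stage pipeline: the stage functions (`extract`, `transform`, `load`), the `PIPELINE` list, and the `run` helper are all hypothetical names invented for illustration. The application logic lives in plain functions that know nothing about scheduling, the orchestration is declared separately, and the runner times every stage so that time spent outside the stages shows up as its own number.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

# Application logic: plain functions with no knowledge of scheduling.
# (Hypothetical stages, invented for this sketch.)
def extract() -> List[int]:
    return [1, 2, 3, 4]

def transform(rows: List[int]) -> List[int]:
    return [r * 2 for r in rows]

def load(rows: List[int]) -> int:
    return sum(rows)

# Orchestration logic: stage name, function, and upstream stage,
# declared separately from the functions themselves.
Stage = Tuple[str, Callable[..., Any], Optional[str]]
PIPELINE: List[Stage] = [
    ("extract", extract, None),
    ("transform", transform, "extract"),
    ("load", load, "transform"),
]

def run(pipeline: List[Stage]) -> Tuple[Dict[str, Any], Dict[str, float]]:
    """Run stages in declared order and record per-stage wall time."""
    results: Dict[str, Any] = {}
    timings: Dict[str, float] = {}
    wall_start = time.perf_counter()
    for name, fn, upstream in pipeline:
        start = time.perf_counter()
        results[name] = fn() if upstream is None else fn(results[upstream])
        timings[name] = time.perf_counter() - start
    total = time.perf_counter() - wall_start
    # Whatever is not stage time is attributed to the runner itself.
    timings["orchestration_overhead"] = total - sum(timings.values())
    timings["total"] = total
    return results, timings

if __name__ == "__main__":
    results, timings = run(PIPELINE)
    print(results["load"])
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f}s")
```

Because the stage functions never reference the runner, the `PIPELINE` declaration can be reordered, merged, or handed to a different scheduler without touching the application code. In a real system the overhead figure would also include scheduling delays, container start-up, and I/O between stages, none of which this toy runner models.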

Full post here, 11-minute read