Operational Tools and Habits

This is a (partial) list of operational tools and habits that I have found extremely useful in the past:

  1. One-click deployment: Each component of the system must be deployable in a fully automated fashion. There must be NO manual steps; a checklist is not good enough. If the deployment can also be transparently audited, all the better.
  2. One-click rollback: Your rollback in case of a failure must be fully automated. Again, there must be NO manual steps.
    1. The one-click operation should not require explicit access to the deployment servers.
    2. If multiple components are to be deployed then this should be done by a script.
  3. NEVER give developers write access to the production servers; an accident WILL happen if you do. Read access is fine, and as you have one-click, audited, remote deployment, they don’t really need write access, do they?
  4. Everything is built from tagged source in a source management system, or is a third-party library of a known version (this may seem obvious but is often not the case).
  5. The separate components in a distributed system must be loosely coupled enough that they can be deployed independently. This is not always possible, but it is definitely worth the effort. You want to avoid having to roll back multiple systems supported by different groups because of a single failure.
  6. If you have to deploy a non-backward-compatible change to the communication protocol (it happens), then release the protocol change on its own, separate from any other changes.
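
As a rough illustration of items 1 and 2, here is a minimal Python sketch of a scripted deployment that rolls itself back when any step fails. Everything in it is an assumption made for the example: the `Component` type, the `deploy_all` function, and the `install` hook (standing in for whatever actually pushes a tagged build to a server) are hypothetical names, not part of any real tool. It also assumes every component already has a known deployed tag to return to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    name: str
    tag: str  # version tag from source control (hypothetical example: "v1.1")


def deploy_all(
    plan: List[Component],
    current: Dict[str, str],
    install: Callable[[str, str], None],
    log: List[str],
) -> bool:
    """Deploy every component in `plan`; undo everything if one step fails.

    `current` maps component name to the tag deployed right now and is
    updated in place. `install(name, tag)` is a hypothetical hook that
    raises an exception on failure. `log` collects an audit trail.
    Returns True if the whole plan was deployed, False if it was rolled back.
    """
    missing = [c.name for c in plan if c.name not in current]
    if missing:
        # Without a known previous tag there is nothing to roll back to.
        raise ValueError(f"no known deployed tag for: {missing}")

    previous = dict(current)  # snapshot used for the rollback
    touched: List[str] = []
    try:
        for component in plan:
            log.append(f"deploy {component.name} {component.tag}")
            # Record the component before installing, so that a half-applied
            # install is also restored during rollback.
            touched.append(component.name)
            install(component.name, component.tag)
            current[component.name] = component.tag
        return True
    except Exception as exc:
        log.append(f"failed: {exc}")
        # Restore the previous tags in reverse order, with no manual steps.
        for name in reversed(touched):
            log.append(f"rollback {name} {previous[name]}")
            install(name, previous[name])
            current[name] = previous[name]
        return False
```

The rollback reuses the same `install` hook as the deployment, so both directions go through one automated path, and the `log` list is a stand-in for the audit trail. A real script would also have to decide what to do if the rollback itself fails, which this sketch leaves out.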

The one that generally causes the most fuss is 3, but if your system requires regular access by developers, you have a problem.

2 Responses to “Operational Tools and Habits”

  1. Michael Frzak Says:

    No. 4 is untenable for large projects. Developer prima donnas will take ownership of the production/current tag and move it. I think version control needs to evolve, with some aspects of it treated like a production system. What is the difference, in your mind, between one-click deployment and rollback on the one hand and no write access to production servers on the other?

  2. Tom Says:

    I have not come across the problem you outline with the production tag. It may depend on your definition of “large”. In teams of up to 24, a shared understanding of the need for traceable releases, together with peer pressure, has kept everyone in line.

    The difference between one-click deployment/rollback and developer access is the ad hoc nature of the latter, and the possibility that the production system diverges from the environment (built from tagged source) that you use to find problems.