Tom Limoncelli (of "Time Management for System Administrators" fame) recently posted The Limoncelli Test: 32 Questions for Your Sysadmin Team. It's a great start, but I have a few things I'd add (and his comment form is broken).
- Can the loss of any single team member (e.g., "hit by a bus") be handled with no operational impact (i.e., projects may be delayed, but no services are expected to fail)?
- Think of SPOFs (single points of failure) as per-service, not per-system.
- Are internal requests also in the ticket system? If only external requests are in it, you're not tracking a large amount of work.
- Do you only have *one* ticket system for everything? Most of the better systems (e.g., Atlassian JIRA) can do complex workflows, reducing the need for separate systems.
- Do you keep a repository of the install media for all currently deployed systems? This includes things like firmware upgrades, OS images, etc. Not just the latest version, but *all* currently deployed versions.
- Do your laptops have fully encrypted drives to prevent release of private data? On recent hardware there's *no* performance hit for this with spinning disks, and a minimal one with SSDs, and it solves so many problems. Having a policy that no such data gets on laptops may help, but isn't enough.
- Does your configuration system keep its config in a revision control system (RCS)? Just having central config isn't enough; it needs to be under revision control so you can roll back, and have history to know when something changed.
- For core networks and other critical systems, N+2 redundancy might be needed if a failure during maintenance would immediately cause serious issues (DNS is a prime example in many cases). You may also need to consider having one system run different software, to prevent something like a BIND exploit from taking out everything.
- Don't do the popular thing ("Cargo Cult Systems Administration"). Google does things that make sense for *LARGE* clusters, not a single-server site. Many of the hip new programming things might not be deployable at the needed scale (either down or up; programming techniques have a scale band). Virtualisation makes little to no sense for clusters (depends on the app).
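
To make the per-service SPOF, "hit by a bus", and N+2 points a bit more concrete, here is a minimal Python sketch of how such a check could look. Everything in it is an assumption for illustration: the `Service` record, the `audit` function, the example inventory, and the rule that non-critical services get N+1 while critical ones get N+2 are all hypothetical, not something from the Limoncelli Test itself.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical inventory record, tracked per service rather than per machine."""
    name: str
    operators: set          # people who can run and repair this service
    instances: int          # independent instances currently deployed
    needed: int = 1         # instances required to carry the load (the "N")
    critical: bool = False  # assumption: critical services are held to N+2


def audit(services):
    """Return a list of human-readable findings, one per problem found."""
    findings = []
    for svc in services:
        # Bus factor: fewer than two operators means one departure hurts.
        if len(svc.operators) < 2:
            findings.append(
                f"{svc.name}: only {len(svc.operators)} operator(s)"
            )
        # Redundancy: N+2 for critical services, N+1 otherwise (assumed rule).
        spare = 2 if svc.critical else 1
        if svc.instances < svc.needed + spare:
            findings.append(
                f"{svc.name}: {svc.instances} instance(s), "
                f"want at least {svc.needed + spare} (N+{spare})"
            )
    return findings


# Made-up example inventory.
inventory = [
    Service("dns", {"alice", "bob"}, instances=2, critical=True),
    Service("wiki", {"carol"}, instances=2),
]

report = audit(inventory)
for line in report:
    print(line)
```

With this made-up inventory, DNS is flagged because two instances are not enough for N+2, and the wiki is flagged because only one person can run it. The useful part is less the code than the shape of the data: the record is keyed by service, so a service spread over several machines but understood by one person still shows up as a single point of failure.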