Infrastructure Management

(Work in Progress)

Looking in the CAE Nagios monitor today (April 30, 2007), I see we’re watching 200 hosts, including the following:

  • Linux workstations: 5
  • Linux X-terminals: 8
  • Linux servers and cluster boxes (including Xen instances): 29 (plus other recent Xen instances that need to be added)
  • Linux firewalls: 4
  • Solaris workstations: 5
  • Dual-boot Windows workstations that also function as an after-hours cluster annex: 42

So for me, that makes right at 100 root filesystems, software loads, etc., and they’re getting to be a major pain to keep consistent. It’s easy enough to install Debian systems consistently in the beginning: make one model system, image it with System Rescue CD, RIP, or systemimager, and deploy the image to the rest of the systems.

The hassle comes when you’re trying to maintain them consistently. Make a needed change to one system, then make absolutely certain it ends up in the system image. Write init scripts, especially for the dual-boot systems, that download any updates to the system image on each boot as soon as the network comes up and the filesystems are available. Install cron-apt everywhere you can remember to, so that security updates get installed automatically. Create accounts on the non-dual-boot systems via a looped ssh with useradd, but write scripts for the dual-boot ones that useradd and userdel accounts according to the contents of the big NFS share that holds everyone’s home directories. Remember which systems have Matlab 7.0.1, which have 7.2, which have 6.5.1, and which have more than one version.

It’s worked out fine overall, but there’s a lot of very reliable baling twine holding things together. And even if it’s reliable, it’s still baling twine.
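To give a feel for the account-syncing twine on the dual-boot machines, here is a minimal Python sketch of the idea, not the actual scripts in use. It assumes that every directory name on the home-directory share corresponds to exactly one account name. The function names are hypothetical, and the choice of `useradd -M` (skip creating a home directory, since it already lives on the share) is an assumption for illustration.

```python
def plan_account_changes(home_dirs, local_users):
    """Compare home directories on the share with local accounts.

    Returns (to_add, to_remove) as sorted lists of account names.
    Assumption: each directory name on the share is an account name.
    """
    wanted = set(home_dirs)
    present = set(local_users)
    to_add = sorted(wanted - present)      # on the share, no local account yet
    to_remove = sorted(present - wanted)   # local account, nothing on the share
    return to_add, to_remove


def build_commands(to_add, to_remove):
    """Turn the plan into argument lists suitable for subprocess.run().

    Assumption: -M is used so useradd does not create a home directory;
    userdel is called without -r so nothing on the share is deleted.
    """
    commands = [["useradd", "-M", name] for name in to_add]
    commands += [["userdel", name] for name in to_remove]
    return commands
```

With home directories for alice and bob on the share and local accounts for bob and carol, the plan is to add alice and remove carol. Feeding it real data means listing the directory names under the share and filtering the local passwd entries down to regular user accounts, so that system accounts never land in the removal list; that part is left out here.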

Enter infrastructures.org and friends. I had read through their stuff months ago, but the arrival of 12 new Opteron computational servers and a Xeon server that I can run Xen instances on gives me the incentive, the test-beds, and the spare hardware to do it right.

Servers and roles that go into the management infrastructure, plus links to any posts where they’re explained in more detail:

I have a presentation summarizing these pages, too.