Posts Tagged ‘CompSoc’

CompSoc downtime

Monday, October 15th, 2007

So, for the past few days CompSoc has been squarely offline.

Why?

noisy, the main server, went down following what we believe to be a motherboard failure.

Our temporary solution has been to move the data disks from noisy to the (very) old server, tall. As you can see, things are up and running again, and mail is being delivered (slowly) by bump.

Over the next few weeks we’ll be looking into replacing some hardware and buying new hardware to help prevent this happening again… that said, on a budget it won’t be easy—short of a cluster there really was little we could have done better.

Sheep go “09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0”

Wednesday, May 2nd, 2007

I’m so emo.

CompSoc downtime

Thursday, February 8th, 2007

As a follow-up to the last post about /var filling up, here’s another one that’s equally crazy.

In an effort to fix /var once and for all I scheduled some emergency downtime last night. The aim was to make the /var partition bigger. This got off to a pretty bad start when, in true Solaris fashion, I attempted to drop noisy down to single user by typing init 0. In Solaris this drops a SPARC system to the OBP (like the BIOS), and just reboots x86 machines. In FreeBSD runlevel 0 is equivalent to Solaris runlevel 5… shut the system down and then power it off.
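The run-level mix-up can be written down as a small lookup. This is only a sketch: the function name init_effect is made up for illustration, the FreeBSD entries follow the run-level table in init(8), and the Solaris entries cover no more than the behaviour described above.

```sh
#!/bin/sh
# Hypothetical helper: describe what "init LEVEL" does on a given OS,
# so an old habit is caught before it is typed on a remote box.
init_effect() {
    os=$1
    level=$2
    case "$os:$level" in
        FreeBSD:0)       echo "halt and power off" ;;
        FreeBSD:1)       echo "single-user mode" ;;
        FreeBSD:6)       echo "reboot" ;;
        SunOS:0)         echo "stop at firmware (OBP on SPARC)" ;;
        SunOS:s|SunOS:S) echo "single-user mode" ;;
        SunOS:5)         echo "shut down and power off" ;;
        SunOS:6)         echo "reboot" ;;
        *)               echo "unknown"; return 1 ;;
    esac
}

# Example: check against the local OS before typing "init 0".
# init_effect "$(uname -s)" 0
```

Keeping the table keyed on the output of uname -s means the same check works unchanged on either system.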

So, at about 22:15 last night I switched the primary CompSoc server off. Hardly the fix I was looking for.

After a number of calls to Andy and Inti I had somebody switch it back on (bear in mind that the system is in Manchester and I’m near London)... in an attempt to minimise the downtime, I decided to do the fix as soon as it came back up. This time I got the right runlevel… init 1, but (as I realise now) all sorts of crazy stuff happens to FreeBSD’s serial redirection in single user mode. It appears to knock off serial support and output only to the video console. Again, not a lot of good for me.

Later on in the day somebody else rebooted it and after a number of attempts to get things working, I decided that I’d do the fix in full multi-user mode. This involved disabling logins, stopping almost all services, etc. More lsof was used to determine what was using /var; these were stopped and when there were no open filehandles I umount -f‘d /var.
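Working out which processes to stop before the unmount can be sketched as a filter over lsof output. The helper name holders_of is hypothetical, and it assumes the nine-column lsof layout (COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME); other lsof builds may print extra columns.

```sh
#!/bin/sh
# Hypothetical helper: read lsof output on stdin and print each distinct
# "command pid" pair that has a file open on or under the given mount point.
holders_of() {
    awk -v mnt="$1" '
        $9 == mnt || index($9, mnt "/") == 1 {
            key = $1 " " $2
            if (!(key in seen)) { seen[key] = 1; print key }
        }'
}

# Example (needs lsof installed):
# lsof | holders_of /var
```

Once the list is empty, nothing should be holding the filesystem open and a plain umount has a chance of working without -f.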

I dumped the contents of /var to a different disk and set about updating the disklabel.

# /dev/ad0s1:
8 partitions:
#        size    offset    fstype   [fsize bsize bps/cpg]
  a:   1048576         0    4.2BSD    2048 16384     8
  b:   8388608   1048576      swap
  c: 156296322         0    unused       0     0          # "raw" part, don't edit
  d:   1048576   9437184    4.2BSD    2048 16384     8
  e:  52428800  10485760    4.2BSD    2048 16384 28552
  f:  93381762  62914560    4.2BSD    2048 16384 28552

Above is the disklabel before the change… a is /, d is /var, e is /usr and f is /backup2. What I needed to do was grow d (currently just 512MB) by using some of the space from f, which was an unused backup directory. The obvious problem here is that /usr was in the way. My solution was to grow swap by 512MB, totally remove the d line, shrink f to around 8GB and rename it to d. This sounds a little complicated… it took me a while to get my head around it.
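The sizes in the label are easier to reason about in megabytes. A minimal sketch, assuming the usual 512-byte sectors (which matches the 512MB quoted for d); the function name sectors_to_mib is made up for illustration.

```sh
#!/bin/sh
# Convert a bsdlabel size or offset (512-byte sectors) to MiB.
# 2048 sectors of 512 bytes make one MiB; the result is rounded down.
sectors_to_mib() {
    echo $(( $1 / 2048 ))
}

sectors_to_mib 1048576     # d (/var) before the change
sectors_to_mib 8388608     # b (swap) before the change
sectors_to_mib 93381762    # f (/backup2) before the change
```

The same conversion applied to the planned 16777216-sector d gives the "around 8GB" figure.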

Prior to the change the on-disk layout was something like:

[ a (/) ] [ b (swap) ] [ d (/var) ] [ e (/usr) ] [ f (/backup2) ]

Now that I’ve made the changes the on-disk layout is more like:

[ a (/) ] [ b (bigger swap) ] [ e (/usr) ] [ d (/var) ]

The bsdlabel currently looks like:

# /dev/ad0s1:
8 partitions:
#        size    offset    fstype   [fsize bsize bps/cpg]
  a:   1048576         0    4.2BSD    2048 16384     8
  b:   9437184   1048576      swap
  c: 156296322         0    unused       0     0          # "raw" part, don't edit
  d:  16777216  62914560    4.2BSD    2048 16384 28552
  e:  52428800  10485760    4.2BSD    2048 16384 28552
The beauty of this (as far as I was concerned) was that everything was still contiguous, no holes and no changing of slice letters. Next step was to newfs the new /var, mount it and restore the contents from the file on the other disk I previously mentioned. No major problems here, although I did manage to restore the contents of /var to both / and my personal home directory. Fortunately this mess was easy to clear up.
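The "still contiguous" claim can be checked mechanically before writing a label back. This is a sketch under assumptions: check_contiguous is a made-up name, and it expects "letter: size offset ..." lines as bsdlabel prints them, skipping the raw c partition.

```sh
#!/bin/sh
# Hypothetical check: read partition lines on stdin, order them by offset,
# and report any partition that does not start where the previous one ended.
# Exits non-zero if a gap or overlap is found.
check_contiguous() {
    grep -E '^[[:space:]]*[abd-h]:' | sort -k 3,3n | awk '
        NR > 1 && $3 != end { printf "gap or overlap before %s\n", $1; bad = 1 }
        { end = $3 + $2 }
        END { exit bad }'
}

# Example:
# bsdlabel ad0s1 | check_contiguous && echo "layout is contiguous"
```

It only compares neighbouring partitions, so unused space after the last one is not reported.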

    So, with all of the files back, I rebooted the box. It didn’t come back up.

    After a lot of time talking Inti through the console (which I couldn’t get, because the machine was having none of single-user mode serial) we discovered that the only reason the system wouldn’t boot was because I hadn’t removed the /backup2 entry from /etc/fstab! D’oh! A rookie mistake (but one that I always make).
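A stale fstab entry like this one could be caught before the reboot. A minimal sketch, assuming that the device node for a removed partition no longer exists under /dev; the function name missing_fstab_devices is hypothetical.

```sh
#!/bin/sh
# Hypothetical pre-reboot check: print every fstab entry whose /dev node
# is missing. Takes the fstab path as an argument so it can be tried on a
# copy first; comment lines and non-/dev entries are ignored.
missing_fstab_devices() {
    while read -r dev mnt rest; do
        case "$dev" in
            /dev/*) [ -e "$dev" ] || echo "$dev ($mnt)" ;;
        esac
    done < "${1:-/etc/fstab}"
}

# Example:
# missing_fstab_devices /etc/fstab
```

Any output means there is an entry the boot scripts may fail to mount.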

    Once we got this removed the system shot up. Allow a few more hours to get both bump and noisy up with LDAP working and we once again have a fully running CompSoc.

    It certainly didn’t go as planned, but I believe the end result is a good one:

    # df -h
    Filesystem     Size    Used   Avail Capacity  Mounted on
    /dev/ad0s1a    496M    383M     73M    84%    /
    devfs          1.0K    1.0K      0B   100%    /dev
    /dev/ad0s1e     24G     20G    2.0G    91%    /usr
    /dev/ad0s1d    7.7G    160M    7.0G     2%    /var
    /dev/da0       541G    190G    308G    38%    /data
    linprocfs      4.0K    4.0K      0B   100%    /usr/compat/linux/proc
    procfs         4.0K    4.0K      0B   100%    /proc
    devfs          1.0K    1.0K      0B   100%    /var/named/dev

    We really need to work on getting serial output from FreeBSD working properly, not to mention installing a new network card so that we can use the internal 10/100 interface for IPMI, which will allow us serial-over-LAN and full remote power capabilities.
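One possible starting point for the serial problem, offered as an assumption rather than a tested fix for this machine: FreeBSD's loader reads /boot/loader.conf, and the console and comconsole_speed variables there select a serial port as the system console, which is what single-user mode talks to. The speed below is a guess and should match whatever is on the other end of the cable.

```sh
# /boot/loader.conf (sketch; values are assumptions, not taken from noisy)
console="comconsole"        # use the serial port as the system console
comconsole_speed="9600"     # assumed line speed
```

A getty entry in /etc/ttys only gives a login prompt in multi-user mode, which would explain why serial access vanished in single-user mode.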

    When I got home at 7PM I treated myself to a curry and an episode of Prison Break.

    Apologies to anybody that was affected by the downtime!

    Help! /var is shagged!

    Wednesday, February 7th, 2007

    The last couple of days have involved some seriously weird /var behaviour over at CompSoc.

    I’d narrowed the filling up of /var down to Apache’s error log, and had been removing it more or less twice a day. Today I disabled the error.log file and restarted Apache… a measure designed to last us until I had chance to properly fix the issue.

    This evening I come along and have a prod:

    # df -h /var
    Filesystem     Size    Used   Avail Capacity  Mounted on
    /dev/ad0s1d    496M    488M    -32M   107%    /var

    Yikes! That’s not good, but… what’s this?

    # du -sh /var
    143M    /var

    wtf? This got me pretty confused, but some searching around and I came upon a handy article at www.cyberciti.biz/tips/freebsd-why-command-df-and-du-reports-different-output.html that led me down the path of lsof. Here’s what I did:
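The size of the mismatch can be measured directly. A minimal sketch, assuming a POSIX df and du; the function name df_du_gap_kb is made up for illustration, and the result only means something when the argument is a mount point.

```sh
#!/bin/sh
# Hypothetical helper: print how many KiB df counts as used on a filesystem
# beyond what du can find by walking the directory tree from the same point.
df_du_gap_kb() {
    used=$(df -Pk "$1" | awk 'NR == 2 { print $3 }')
    seen=$(du -skx "$1" 2>/dev/null | awk '{ print $1 }')
    echo $(( used - seen ))
}

# Example:
# df_du_gap_kb /var
```

With the figures above (488M used according to df, 143M according to du) the gap is roughly 345MB, which is space held by something du cannot see.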

    # lsof|grep var | sort -r -k 7
    lsof: WARNING: compiled for FreeBSD release 6.0-RELEASE-p6; this is 6.2-RELEASE.
    httpd     91413          root   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd     10028           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd     10022           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd     10021           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9973           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9881           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9865           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9809           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9807           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      9654           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      7245           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      6896           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    httpd      5909           www   11w    VREG       0,92  361496571    32972 /var (/dev/ad0s1d)
    syslogd    9972          root   19w    VREG       0,85  110849268 58880006 /data/var/log/all.log
    httpd     91413          root   14w    VREG       0,92    4767735    32988 /var/log/httpd/access.log
    httpd     91413          root   13w    VREG       0,92    4767735    32988 /var/log/httpd/access.log
    httpd     10028           www   14w    VREG       0,92    4767735    32988 /var/log/httpd/access.log
    httpd     10028           www   13w    VREG       0,92    4767735    32988 /var/log/httpd/access.log
    [snip]

    The first few are just fine, but what the hell is Apache doing with /var/log/httpd/access.log? It’s clearly not that big…

    Anyway, a quick restart of Apache and /var returned to a far more reasonable size:

    # df -h /var
    Filesystem     Size    Used   Avail Capacity  Mounted on
    /dev/ad0s1d    496M    144M    312M    31%    /var

    Panic over… for now.
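A hedged reading of the lsof listing: the entries whose name is only the mount point all share inode 32972 and a size of 361496571 bytes, about 344MB, which is close to the 345MB gap between df and du. That would be consistent with a removed log file still held open by Apache, although the post does not confirm it. The filter below is a sketch; unnamed_open_files is a made-up name and it assumes the nine-column layout shown in the listing.

```sh
#!/bin/sh
# Hypothetical filter: from lsof output on stdin, print the size in MiB
# (rounded down) of each distinct open file, keyed by device and inode,
# whose name is just the mount point.
unnamed_open_files() {
    awk -v mnt="$1" '
        $9 == mnt {
            key = $6 " " $8
            if (key in seen) next
            seen[key] = 1
            printf "%d MiB inode %s\n", $7 / 1048576, $8
        }'
}

# Example (needs lsof installed):
# lsof | unnamed_open_files /var
```

Restarting the process that holds such a file closes the descriptor and lets the filesystem reclaim the space, which matches the drop from 488M to 144M used.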

    Remote install

    Wednesday, March 8th, 2006

    Well, a bunch of things have happened since my last main post. One of them is something that caused quite a stir at the time, which I will not go into now. Suffice to say that I hope it has all been resolved now.

    Obviously yesterday CompSoc received a surprise delivery of a top-of-the-line Sun Fire T2000 2u server. This beast puts CompSoc’s “noisy” to shame (about ten times over). It has just one CPU, but this CPU has eight cores, each clocked at 1GHz, and each core is capable of running four threads at once (not sure how that works). RAM… yep, it has that too—16GB of it. Dual power supplies, a full remote management interface (that provides telnet and possibly SSH on top of the regular serial interface), etc. It’s a seriously impressive piece of kit. I’ll be sad to see it go at the end of the 90-day trial but in the meantime we’re trying to think of some things to do with it.

    [Photo: 20060307-151702.jpg]

    If you have something capable of testing an 8GHz CPU with 16GB of RAM to its limits then let me know.

    But what I wanted to talk about was how I have just installed Solaris 10 1/06 onto a Sun Blade 100 CompSoc has. Nothing special, sure, but this is the first time I’ve installed a whole machine remotely over the Internet. It’s all very impressive… I just wish that PC manufacturers would stop messing about and hurry up and provide something that works as well.

    Time to get patching the rest of the stuff. The aim is to get the Sun T3 RAID array (9*36GB disks in RAID 5) hooked up and doing something useful. I’ve scheduled downtime for noisy tomorrow to allow me to install a dual FC card. We are considering using the T3 in a production role as a replacement for the two 200GB SATA disks in RAID1.

    photoSoc hosting

    Tuesday, February 28th, 2006

    Rog, chair of photoSoc, asked me to transfer the photoSoc website from the Union server (waldo) to CompSoc.

    No problem, it’s something I’ve wanted to do for a while. The Union provide a great service, but I honestly don’t think that the server they have is quite up to the CompSoc server. Over the past few months we have had near-perfect service: the uptime is currently about 32 days, and the last reboot was a scheduled one to apply a minor security fix released by the FreeBSD team.

    I sent out the necessary emails—one to Rob Clarke at Manchester Computing asking him to switch ns0.compsoc.man.ac.uk to be the primary for the photosoc.man.ac.uk and photosoc.manchester.ac.uk domains, and another to Sam Smith, admin of the union box, explaining what was going on and asking for any info he could give me on mail and this sort of stuff.

    I got a reply from Rob Clarke to confirm the changes earlier on today but quickly spotted that while photosoc.man.ac.uk was responding, photosoc.manchester.ac.uk had disappeared. It turns out I had inadvertently removed some whitespace in the zonefile, which was causing the problem. Also, both of the photosoc domains are secondaried by utserv, curlew and gannet, instead of just dir.mcc.ac.uk, as 95% of our other hosted society domains are. I emailed Rob Clarke to query whether this was correct.

    But right now, all DNS changes seem fine. I’ve got an incorrect cache that will last about 90000s (25 hours), so I’ll not be visiting the photoSoc website for about a day.

    Next up is scheduling downtime for the website, doing a re-copy of the data (I already did one to confirm the website will run without modification) and migrating the MySQL database. This is fairly minor and will take no more than a couple of hours. After that a new mailing list will need to be created and the existing members will need to be transferred over.

    DNS is a pain in the arse most of the time, but with a short timeout (1800s == 30m) it works very well for us. When I’m ready for CompSoc to host the website, I just change the A line and people will seamlessly visit a different website. Same goes for the mail exchanges. Of course, once the transition is done, I’ll change the timeouts to something a little more sane.
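TTL values in seconds are easy to misjudge, so a tiny converter helps. A minimal sketch; the function name ttl_human is made up for illustration, and the two values are the ones quoted in this post.

```sh
#!/bin/sh
# Convert a DNS TTL in seconds to hours and minutes.
ttl_human() {
    echo "$(( $1 / 3600 ))h $(( $1 % 3600 / 60 ))m"
}

ttl_human 1800     # the short timeout used during the migration
ttl_human 90000    # the stale cached record
```

A resolver that cached the old record just before the change keeps serving it for up to the old TTL, which is why the short value has to be in place well before the switch.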

    The Rack

    Friday, December 16th, 2005

    Good news everyone! CompSoc is now the proud owner of a hefty rack.

    Vlad swiped this from MMU as they were getting rid of a whole pile of old rackmounted equipment. After successfully wheeling the 1960s(ish) rack up Oxford Road (much to my amusement and at the cost of everybody’s hearing) we decided to take the escalator up to the Precinct Centre. Bad idea… the escalator died the second we jumped on with the rack, which meant that we had to lug it up by hand. Not a fun job, especially as I ended up dragging it upwards while, thanks to Newton’s apples, it pulled down in an attempt to chop off my fingers.

    But, in the end, we prevailed… and proceeded to abuse the disabled chairlift to shift the thing from the CS building entrance to the lower ground where all of the cool CompSoc stuff is. A few more funny looks (especially from the sandwich seller kid) but also a good number of looks of admiration, including from John Latham… although, I’ve no idea if he was just taking the piss or if he was genuinely interested.

    If you’re really interested you could head over to the CompSoc website and check out the images in the gallery.

    The big DNS day

    Thursday, November 24th, 2005

    Today was the day that Tuesday was not.

    I’m very happy to have finally stopped tall (our legacy server) from handling any critical CompSoc services. These were basically mail and websites; noisy has been handling CompSoc DNS for about nine months now.

    Things did not go without problems though; somehow, due to a miscommunication, all of the Manchester University society domains have mx0.compsoc.man.ac.uk set as their primary mail exchange. In English, this means that every society in Manchester University is currently unable to receive email, as it is being delivered to CompSoc. From our point of view, we quickly fixed the matter by deferring mail for unknown domains; this means that we send it back to Manchester Computing for them to attempt redelivery at a later date.

    So, the end result is that noisy handles DNS, websites and remote logins and bump handles all email and mailing lists. The changeover went surprisingly well and I’m unbelievably happy about that. A day well spent.

    There are still a few things to tie up but with tall relegated to doing… well, not a lot, I’m much happier. Now, if we can get this 4TB StorageTek tape library and 290GB RAID5 array online CompSoc will be looking like the society it really needs to be!

    CompSoc goodies

    Tuesday, November 22nd, 2005

    It’s been almost two weeks now since CompSoc received some new goodies from an associate member and I’m surprised I’ve managed to not post it until now.

    The weekend before last was UA meet 14 which basically involves a whole pile of members coming down to Manchester to spend all of Friday night, Saturday and Sunday in the pub. I headed along to the Bull’s Head on Friday night and then spent a couple of hours in Kro Bar on the Saturday.

    But I suppose the real highlight was the kit that Rauxon dropped off for CompSoc. This included one StorageTek L20 tape library (stores a maximum of 20 tapes, has two tape drives capable of storing 200GB compressed/tape; a total of 4TB compressed data), a Sun StorEdge T3 fibre-channel disk array (fully populated with nine 36GB FC 15K RPM discs in RAID5—about 288GB of extremely fast storage; this is the top line model with 1GB of RAM), a Sun Blade 100 (500MHz UltraSparc II with lots of RAM and an Intel PCI board) and finally a Sun Ultra 5 workstation.
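The capacity figures quoted above follow from simple arithmetic. A minimal sketch, using decimal gigabytes and ignoring filesystem overhead; both function names are made up for illustration.

```sh
#!/bin/sh
# Tape library: number of slots times per-tape capacity (compressed), in GB.
tape_total_gb() {
    echo $(( $1 * $2 ))
}

# RAID 5 spends one disk's worth of space on parity,
# so usable space is (n - 1) disks.
raid5_usable_gb() {
    echo $(( ($1 - 1) * $2 ))
}

tape_total_gb 20 200     # L20: 20 tapes at 200GB compressed each
raid5_usable_gb 9 36     # T3: nine 36GB discs in RAID5
```

The first result is in GB, so 4000 corresponds to the 4TB quoted for the L20.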

    We’ve been pretty busy sorting out a big DNS and mail changeover at CompSoc (finally the legacy tall will be replaced by noisy) but we’ve had a few spare minutes to poke at the new kit. So far we’ve got Solaris 10 installed on the Blade 100 and the companion CD is installing as I write. We’ve also had the T3 hooked up but done nothing more than create a volume (unfortunately one disk seems to be faulty so we’re looking to replace this). As for the L20… that’s been causing the biggest problem, mainly due to the lack of software/drivers we’ve got for it. I spent a bit of time messing with it this evening but without some software that fully supports two tape drives and the other kit, we’re a little stuck.

    With a little luck we’ll get the L20 to back something up before the end of the year ;)