Server Madness

I have been running industrial-strength servers for many years now, as part of a project that has been moderately successful.

The current servers are 1U rack-mount ‘pizza boxes’: mostly quad-core Intel Xeon machines, plus one dual-CPU AMD Opteron. All are Hewlett Packard (HP) ProLiant servers from roughly 2005. All have been running without major incidents, many of them since 2005.

About 2 years ago the dual Opteron died in production. No warning, it just up and died. I got the call in the morning and went and retrieved it from the co-locate service by lunch. Basically I just took one of the quad-core Xeons with all the software ‘ready to go’ and did a box swap.

Once home I didn’t spend much time on it, but a few reboots seemed to confirm a hard drive problem.

Fast forward to just before Christmas last year. I wanted a box to run JupyterHub on Ubuntu, and this box with its dual AMD processors and 16GB of memory seemed a quick prospect.

After various diagnostics, I determined the hard drive was OK, but the SATA system was not. I managed to install Ubuntu from an ISO burned to DVD, but it took a very long time and then didn’t reboot successfully. I shelved the server at that point and went instead with a brand new purpose-built server, which works wonderfully.

In late January I wanted a server to run OPM – an open source reservoir simulator. I’ve run it before on virtual machines, but wanted a ‘real’ server. I pulled out the dual Opteron and tried installing Ubuntu again, this time from a known good USB data key that I’d used on the JupyterHub purpose-built box. This time it installed quickly and rebooted to an operating system.

I installed it in my newly built server rack and after moving everything to the server room it fired up and has been running since.

Late last week Ubuntu updates indicated that all my Ubuntu systems needed a reboot. All except the dual Opteron rebooted without incident. Unfortunately, the dual Opteron would not boot, instead returning to the dreaded “SATA channel X is slow to respond…”. Eventually this fails with “SATA channel is down” and no hard drive.

It’s annoying as the POST clearly shows a SATA HDD as expected. It’s only on OS boot that SATA dies.
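For anyone chasing a similar fault, a small script can pull the SATA complaints out of a saved boot log so they are easier to compare between good and bad boots. This is only a sketch: the patterns are guesses based on the messages quoted above, the exact wording varies between kernel versions, and `find_sata_errors` is a made-up helper name, not part of any tool.

```python
import re

# Assumed patterns, loosely based on the messages quoted in this post.
# Kernel wording differs between versions, so adjust these to your own log.
SATA_PATTERNS = [
    re.compile(r"ata\d+.*slow to respond", re.IGNORECASE),
    re.compile(r"ata\d+.*(reset failed|link is down|link down)", re.IGNORECASE),
    re.compile(r"SATA.*(slow to respond|is down)", re.IGNORECASE),
]


def find_sata_errors(log_text):
    """Return (line_number, line) pairs for lines that look like SATA link trouble."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SATA_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

To use it, save the kernel log from a boot (for example the output of `dmesg`) to a text file, read the file into a string, and pass that string to `find_sata_errors`.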

I decided to diagnose and fix the problem if I could. I started by pulling the cover and then tried a few boots – no good. I then pulled all the cables and eventually pulled the HDD and set it all on top. This time it booted without a hiccup! I verified all was good with a few proper shutdowns and then cold boots, and proceeded to replace all the cables and button up the case. And… the boot failed.

After much checking and rechecking, I had this niggling feeling in my brain that there was ‘something up’. At one point when it was failing to boot (at the “SATA channel X is slow to respond…” message) I pulled the mini-power connector off the optical drive. INSTANTLY the boot process proceeded normally.

I tested this a couple of times to confirm, and indeed the problem is the optical drive (DVD ROM). Specifically, the optical drive is a removable mini-height drive that unclips and slides out of a circuit card. The circuit card stays in the machine, and has a 36-pin IDE header plus the micro-power connector. Pulling the power connector kills the drive electronics and allows the machine to boot cleanly.

It would seem that somehow the single IDE channel for the DVD ROM optical drive is interfering with the SATA channels for the HDD. Either that, or there’s a power drain when the DVD ROM is plugged in that messes with the power to the HDD. Both HDD bays and the optical drive share the same power harness. Power to the DVD splits off the power connector for one drive, but all are on the same set of wires from the power supply. It’s possible there’s a drain that kills one of the power lines. Unfortunately, with the HDD in place you cannot tell whether it’s spinning or not.

However, I don’t really think that’s the issue. Before Christmas one of the problems I experienced was the utter slowness of the DVD ROM to read the known-good disc. I suspect the DVD ROM electronics are bad and somehow influence the SATA controller. For all I know it’s the same chipset on the motherboard.

At any rate, the machine is now back together and running perfectly – just with the DVD ROM power cable disconnected.

It was very difficult to diagnose, but I’m glad I was able to finally solve the mystery.