Being something of a technophile, for years I’ve had a server at my house, providing network security, file sharing, and backup services to the computers on my network. Through my server I’m able to block ads across the whole of our internet connection, limit who has what kind of access to what kind of files (for example, guests have read-only access to my music library), and entertain the many small coding projects that tickle my fancy. Unfortunately, of late my server has been acting up, and the fixes I put in place did not ultimately solve the issue.
Without getting too much into the weeds of technical jargon, my server utilizes several hard drives working together to correct for any failures; if one drive dies unexpectedly the others pick up the slack. This feature is called RAID and is controlled by a small add-on component called—surprisingly enough—a RAID controller. My server also utilizes virtualization technology, which allows me to have several servers running independently on the same physical hardware. The underlying software which facilitates this is called VMware.
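As a rough illustration of how drives can cover for one another, here is a minimal Python sketch of XOR parity, the idea behind parity-based RAID levels such as RAID 5. It is an assumption that parity is the relevant scheme here (mirroring works differently), and the function names are purely illustrative, not anything a real controller exposes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block that would be stored alongside the data."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three pretend "drives", each holding one 8-byte block (made-up contents).
drives = [b"music...", b"photos..", b"backups."]
parity = make_parity(drives)

# Pretend the second drive died; rebuild its block from the other two plus parity.
recovered = rebuild_missing([drives[0], drives[2]], parity)
print(recovered == drives[1])  # True
```

Because XOR undoes itself, combining the parity block with every surviving block leaves exactly the block that went missing, which is why the array can lose any single drive without losing data.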
Less than a year ago I upgraded my server’s hardware to much more expansive drives, which necessitated a new RAID controller. I also took that opportunity to set up VMware to better extend the server’s functionality. Things were working well, until some weeks later when the server seemed to freeze. I was no longer able to access my shared files and by all appearances it seemed to fall off the network. The server itself was still on, with hard drive lights blinking as normal, but the monitor I plugged in didn’t receive a signal and there seemed to be no reaction to anything I typed on the keyboard.
After a reset, the server came back up several minutes later as if there’d never been a problem. Unfortunately, the issue returned a few days later. My first thought was that the RAID controller could be overheating, so I added a dedicated fan to cool it directly. Several days later the system hung again, and it kept hanging every few days after that, often while I was in the middle of streaming audio or video from the server.
I still suspected the problem was connected in some way to the RAID controller, so I moved the VMware installation off of the storage drives and onto a dedicated drive of its own. This way, my thinking went, any controller errors wouldn’t lock out the entire system, and I might be able to get good logging data from VMware’s internal diagnostics.
While attaching the new drive and installing a fresh copy of VMware, I noticed an error on system boot: the RAID controller was complaining that its attached battery seemed to be out of juice. A battery is key to the smooth operation of a RAID controller because, even with a sudden loss of power, it is imperative that all disks be kept in sync. The battery keeps the controller’s write cache alive through a power cut, so any operations that were in flight can be completed rather than lost. Thinking I had found the issue, I removed the battery altogether and ordered a new one from Amazon. RAID controllers have other, slower modes they can fall back on when no battery is available (typically writing straight through to the disks instead of caching), and I thought that with that switch I was back up and running.
Not four hours after I buttoned everything up and slid the server back into its rack, it froze again. Disappointed, but knowing I had prepared for this by putting VMware on its own drive, I tried to pull up VMware’s built-in web interface, to no avail. Whatever was going on with my system wasn’t limited to the RAID controller and the attached drives; the error froze out all of the on-board components, whether attached to the RAID array or not.
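Since the freezes take the whole machine down, including anything that might log from the inside, one option is to timestamp them from a second computer. Below is a minimal Python sketch of that idea, not something I have running: it tries a TCP connection at a fixed interval and records each change between "up" and "down". The function names, host, and port are all placeholders for illustration.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port, checks, interval=30.0, probe=is_reachable):
    """Probe `checks` times; print and return (timestamp, state) on each state change."""
    changes = []
    last_state = None
    for i in range(checks):
        state = "up" if probe(host, port) else "down"
        if state != last_state:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} is {state}")
            changes.append((stamp, state))
            last_state = state
        if i < checks - 1:
            time.sleep(interval)
    return changes
```

Run from another machine, a call like `watch("server.local", 443, checks=2880)` would cover roughly a day at the default 30-second interval. The resulting timestamps could then be lined up against whatever else was happening at the time, such as a streaming session.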
At this point I’m fairly frustrated. Luckily I have a full and complete backup of my server so I’m not worried about losing any data, but I’m running out of ways I can test my server’s components. Honestly I’m at the end of my knowledge, and as such am hoping that I can bend the ear of some local senior engineers for their take on the situation.
Having my server repeatedly freeze after this major upgrade, and then having each hope of a fix dashed against the rocks of continued failure, has added a great deal of stress to my life of late, and I just want it fixed. Understanding where the problem is would be great, but so long as the server does what I ask of it I can’t imagine I’d have much room to complain.
This post was more venting than diagnostic, and certainly not a deep technical dive, but if anyone can see potential flaws or hang-ups in the above situation I’m more than happy to explore them.