Today I had a problem with one of the servers we support: no web access, no SSH, and no usable console, just a stream of messages scrolling past on the terminal too fast to read. The fix was a simple hard reset, after which the system came back online. The cause turned out to be a hard disk failure, but the system booted without trouble because we were using a RAID configuration. One of the disks didn’t show up in the RAID array, and a few tests later I declared the hardware fault the cause of the downtime.
But why did the system go down because of a disk failure if RAID was in place? Simple: the swap was spread across the disks but was not part of a RAID volume, so the swap partitions had no redundancy. When the system needed data that had been swapped out to the failed disk, that data wasn’t available and the system came to a stop.
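A quick way to spot this situation is to look at the active swap areas and see which ones are not on an md device. The following is only a rough sketch: the function name non_raid_swap is made up for illustration, and it assumes swap on software RAID shows up as /dev/md*, so swap on LVM on top of RAID would be flagged as well.

```shell
#!/bin/sh
# List active swap areas that are NOT backed by an md (software RAID) device.
# Reads /proc/swaps by default; another file can be passed in for testing.
# Assumption: redundant swap appears under a /dev/md* name.
non_raid_swap() {
    swaps_file="${1:-/proc/swaps}"
    # Line 1 is the header; field 1 is the swap device or file name.
    awk 'NR > 1 && $1 !~ /^\/dev\/md/ { print $1 }' "$swaps_file"
}
```

Calling non_raid_swap with no arguments on a running system prints every swap area that would be lost along with its disk.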
From now on we’ll create a redundant swap partition on a RAID volume so this doesn’t happen again, as a server should never stop because of a disk problem. Live and learn.
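As a sketch of what that could look like, the steps are roughly: build a small RAID 1 array from one partition on each disk, format it as swap, and enable it. The device names here (/dev/md1, /dev/sda2, /dev/sdb2) are placeholders and not the ones from our server, and the helper names are made up. The script only prints the commands unless RUN=1 is set, because mdadm --create destroys whatever is on the partitions.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the steps to build a mirrored swap area.
# All device names are passed in by the caller; the ones in the usage
# note are placeholders only.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"            # really execute the command
    else
        echo "$*"       # dry run: just show it
    fi
}

raid_swap_steps() {
    md_dev="$1"; part_a="$2"; part_b="$3"
    # 1. Mirror the two partitions (RAID 1).
    run mdadm --create "$md_dev" --level=1 --raid-devices=2 "$part_a" "$part_b"
    # 2. Write a swap signature on the new array.
    run mkswap "$md_dev"
    # 3. Start swapping on it.
    run swapon "$md_dev"
}
```

For example, raid_swap_steps /dev/md1 /dev/sda2 /dev/sdb2 shows the three commands. To keep the swap after a reboot, an /etc/fstab line along the lines of "/dev/md1 none swap sw 0 0" would usually be needed as well, and the old non-redundant swap entries should be removed.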
Cheers,
Pedro M. S. Oliveira
BTW, to reassemble the array I used mdadm. Below is a simple usage example if you want to re-add a disk to a previously built array:
mdadm --manage /dev/md0 --add /dev/sda1
This command adds the partition /dev/sda1 to the RAID array /dev/md0.
If you want to learn more about RAID in Linux, just type man mdadm or mdadm --help
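Before re-adding a disk it helps to know which arrays are actually missing a member. Here is a minimal sketch that reads /proc/mdstat and prints the arrays whose status shows a missing disk (an underscore in a field like [_U]). The function name degraded_arrays is made up, and it assumes array names of the form md0, md1, and so on.

```shell
#!/bin/sh
# Print the names of md arrays that are running degraded.
# Reads /proc/mdstat by default; another file can be passed in for testing.
degraded_arrays() {
    mdstat_file="${1:-/proc/mdstat}"
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A status field like [_U] or [U_] means one member is missing.
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    ' "$mdstat_file"
}
```

Running degraded_arrays with no arguments checks the live system; an empty result means no array reported a missing member.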
It would make sense to have at least one swap partition or file on the RAID array. This would give you a small performance boost.
I usually stripe LVM containers over multiple RAID arrays, and I have a dedicated swap drive or two for large servers.