Terapix (archive) - Data security

Using a RAID system handles the problem of disk failure quite efficiently.

To date, the Terapix project has experienced 2 disk failures (real ones!) in running RAID arrays. Here is what happened, and what was done, on 2 November 1999.

NOTES:
this RAID array has 8 disks, all used as data space; there is an extra disk on a shelf: this is the "cold spare" strategy.
the RAID level is 5: floating parity (the parity blocks are distributed over all the disks).
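
To illustrate why a level 5 array survives the loss of one disk, here is a minimal Python sketch of XOR parity. It is not the firmware of the RAID device described here: the function names and the tiny 4-byte blocks are made up for the example, and a real array also moves the parity block to a different disk from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recompute the one missing block of a stripe from all the others.

    The surviving blocks must include the parity block: XOR-ing every
    remaining block of a stripe yields the block that was lost.
    """
    return xor_blocks(surviving_blocks)


# One hypothetical stripe: 7 data blocks plus 1 parity block on 8 disks.
data = [bytes([i, i + 1, i + 2, i + 3]) for i in range(7)]
parity = xor_blocks(data)

# Simulate losing the disk that held data block 2, then rebuild it.
survivors = data[:2] + data[3:] + [parity]
recovered = rebuild_missing(survivors)
```

The same operation, repeated for every stripe, is what a rebuild onto a replacement disk amounts to.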

First, an audible alarm announced the problem. An old VT100 terminal is connected to the RAID device over a serial link and gives access to the configuration, the event log, and many other operations. The event log was clear and displayed: Disk failure on channel 2 unit 0. This is enough to locate the faulty disk: channel #2 is one of the 3 disk channels inside the RAID device; each disk channel carries 2 or 3 units (disks); disk #0 on that channel is the faulty one.
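
If the event log were captured from the serial link, the channel and unit could be pulled out of such a message automatically. The sketch below is only an assumption about how one might do it: the pattern matches the single message quoted above, and other messages from the device may be worded differently.

```python
import re

# Pattern for the one message quoted in the text; this is an assumption,
# not a documented log format of the RAID device.
FAILURE_RE = re.compile(r"Disk failure on channel (\d+) unit (\d+)", re.IGNORECASE)


def locate_failed_disk(log_line):
    """Return (channel, unit) if the line reports a disk failure, else None."""
    match = FAILURE_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


location = locate_failed_disk("Disk failure on channel 2 unit 0")
```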

Then, here is what was done:
unmount the RAID device from the machine (there is no need to stop the computer). If you have a hard time unmounting the device because of the "device busy" snag, you can use the program "lsof" to list the files on the RAID device that are held open by running processes. These processes must be killed before unmounting.
power off the RAID (since the disks are not hot-swappable)
unscrew...
locate the failing disk and replace it with an identically configured one: check the LED, the ID address, and the jumpers. For instance, disks are usually shipped with "terminator enable" set, so one needs to remove the "terminator enable" jumper
screw everything back...
after powering on, one has to rebuild the RAID, which the floating parity makes possible. It took about 6 hours to rebuild the RAID, which was 94% full (102 GB used) when the disk failure occurred.
mount the RAID device; the volume is secure again.
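
The "device busy" check in the unmount step can be scripted. The following is a minimal sketch, not the procedure actually used at the time: the function names are hypothetical, and it assumes that "lsof" and "umount" are installed and that the argument is the directory on which the RAID volume is mounted.

```python
import subprocess


def parse_pids(lsof_terse_output):
    """Turn the output of `lsof -t` (one PID per line) into a sorted list of ints."""
    return sorted({int(tok) for tok in lsof_terse_output.split() if tok.isdigit()})


def pids_using(mount_point):
    """Return the PIDs of processes holding files open on the file system.

    Relies on `lsof -t`, which prints PIDs only. lsof exits with a
    non-zero status when it finds nothing, so the status is not checked.
    """
    result = subprocess.run(
        ["lsof", "-t", mount_point],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_pids(result.stdout)


def unmount(mount_point):
    """Unmount the file system; raises CalledProcessError if it is still busy."""
    subprocess.run(["umount", mount_point], check=True)
```

A call such as pids_using("/raid"), where "/raid" is a made-up mount point, returns the processes to look at. It is safer to review them by hand before killing anything and calling unmount().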

Almost the same thing happened, and was handled the same way, on 6 July 2000, on an 8 x 50 GB RAID set at level 5. The volume reconstruction also took about 6 hours, on a volume that was 90% full.
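
To put the rebuild time in perspective, the short calculation below estimates the average write rate onto the replacement disk for the July 2000 case. It rests on assumptions not stated above: decimal units (1 GB = 1000 MB), "about 6 hours" taken as exactly 6, and a rebuild that rewrites the whole 50 GB disk whatever the fill level of the volume.

```python
def average_rebuild_rate_mb_s(disk_size_gb, hours):
    """Average write rate, in MB/s, needed to refill one disk in `hours`.

    Assumes decimal units (1 GB = 1000 MB) and that the rebuild rewrites
    the whole replacement disk.
    """
    return disk_size_gb * 1000 / (hours * 3600)


# July 2000 case: 50 GB disks, rebuild of about 6 hours.
rate = average_rebuild_rate_mb_s(50, 6)
print(f"{rate:.1f} MB/s")  # prints "2.3 MB/s"
```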

Some suggestions

if one can purchase hot-swappable disks (in a canister or similar), then there is no offline screwing/unscrewing! Time is saved: no shutdown of the RAID (it only has to be put "offline"). BUT put a label on each canister, to know which channel and unit the disk corresponds to (this can be checked by looking at the SCSI cable connections and routing).

a hot spare seems to be quite a nice feature too, but rather for a large number of disks: with a small number of disks, say 8, dedicating one disk as a hot spare is a noticeable loss of data space.

of course, with the "cold spare" strategy, don't forget to have one disk in stock in advance...
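
The hot spare trade-off mentioned among the suggestions can be put in numbers. This is a rough sketch assuming a single RAID 5 set of identical disks; the function names are hypothetical and the 24-disk case is only an illustrative comparison, not a Terapix configuration.

```python
def usable_disks(total_disks, hot_spares=0):
    """Disks' worth of data space in a single RAID 5 set.

    One disk's worth of space goes to parity; hot spares sit idle.
    """
    active = total_disks - hot_spares
    if active < 3:
        raise ValueError("RAID 5 needs at least 3 active disks")
    return active - 1


def hot_spare_cost(total_disks):
    """Fraction of data space given up by dedicating one disk as hot spare."""
    without = usable_disks(total_disks)
    with_spare = usable_disks(total_disks, hot_spares=1)
    return (without - with_spare) / without


for n in (8, 24):
    print(f"{n} disks: {hot_spare_cost(n):.1%} of the data space lost")
```

With 8 disks the hot spare costs one seventh of the data space (about 14.3%); with 24 disks it costs one twenty-third (about 4.3%).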