by
FMA
- Updated November 28th, 2004
Thu 01 Jul 2004 switch CPUs between mix1 and mix2
Purpose: to determine whether mix1's instability is due to the CPUs or the motherboard.
Fri 09 Jul 2004 RAID card tests on mix1 - Manu - Gilles - Fred
goal: test the 3ware card in mix1, where writing to the RAID crashes the machine
tests:
disabled the card attached to the RAID full of Mireille's data (via the BIOS: disable PCI bus 2)
backed up Mireille's data other than *.fits and *.head to dantel/raid1saved.tgz
reboot: rebuilt as RAID0 + mkfs.xfs, mounted on /raid1
write tests: # cat /proc/kcore > /raid1/kcore
400 GB written so far, without a crash
if no crash shows up, rebuild as RAID5 overnight and rerun the write tests.
update: after 700 GB written, no crash. RAID5 rebuild started at 20:30.
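The write test above can be sketched as a small reusable function. This is only an illustration: the function name write_test, the use of /dev/zero instead of /proc/kcore, and the scratch directory standing in for /raid1 are all assumptions, chosen so that the sketch can be tried without touching a real array.

```shell
#!/bin/bash
# write_test DIR SIZE_MB: write SIZE_MB mebibytes of zeros to DIR/writetest.bin
# and print the number of bytes that actually landed in the file.
write_test() {
    local dir=$1 size_mb=$2
    local out="$dir/writetest.bin"
    # bs is given in bytes (1 MiB) to stay portable across dd implementations.
    dd if=/dev/zero of="$out" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    wc -c < "$out" | tr -d ' '
}

# Hypothetical usage: a temporary directory plays the role of /raid1.
scratch=$(mktemp -d)
written=$(write_test "$scratch" 4)
echo "wrote $written bytes"
rm -rf "$scratch"
```

On a real array the size would be far larger (hundreds of GB in the tests logged here), and small writes like the one in this example may be absorbed by the page cache.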
Fri 09 Jul 2004 Non-degraded RAID5 tests on mix1 - Gilles
Indeed, nothing to report: the RAID stayed OK (i.e. not degraded) until the end, 85%. Reading the indices of the copied files gives a write speed of 33 MB/s, which in fact corresponds to the read speed of the file on the IDE disk holding /home .... So, to test, I used ftp from node 2 and copied at more than 63 MB/s for large files.
For testing, I suggest suspending the multiple-copy script /home/GROS_fichier -> /raid1/ and resuming with a copy over the network ...
Once both RAID cards are installed again, the /raid to /raid copy will need to be measured ...
Regarding the RAID degradation, was it really the disk on port #2 that was removed and then put back? According to the following page: http://mix1.iap.fr:8086 . This page is very slow to load when /raid1 is running flat out ??? and so there is still nothing displayed for http://mix1.iap.fr:8086/alarms.html
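As a worked illustration of how such copy speeds are derived, the sketch below converts a byte count and an elapsed time into MB/s. The function name mb_per_s and the sample figures are assumptions for the example, not measurements from the log, and 1 MB is taken as 10^6 bytes.

```shell
#!/bin/bash
# mb_per_s BYTES SECONDS: print the throughput in MB/s (1 MB = 10^6 bytes).
# Fails (exit status 1) if SECONDS is not positive.
mb_per_s() {
    awk -v b="$1" -v s="$2" \
        'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", b / s / 1000000 }'
}

# Hypothetical example: 6.3 GB copied in 100 seconds.
rate=$(mb_per_s 6300000000 100)
echo "$rate MB/s"
```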
Sat 10 Jul 2004 Degraded RAID5 tests on mix1 - Manu
Gilles' scripts were used once again to test writes to /raid1 with a degraded array (one hard drive was removed to simulate the degradation). As a consequence, copy speed dropped from 70 MB/s to 20 MB/s, and the throughput is very irregular.
Sun 11 Jul 2004 Tests on degraded RAID5 during rebuild, mix1 - Fred
The write tests on a degraded RAID5 passed: more than 650 GB were written without a crash.
Oddities: pdflush (cf. /usr/src/linux-2.6.7-gentoo-r7/mm/pdflush.c: worker threads for writing back filesystem data) was running as 8 instances. I had to unmount /raid1 in order to shut the machine down. Nothing special in the logs.
The disks removed by Manu were put back, and the RAID5 rebuild was started from the BIOS.
A timing test with hdparm -v -tT /dev/sdb (which measures read speed, not write speed) gives:
Timing buffer-cache reads: 2720 MB in 2.00 seconds = 1359.53 MB/sec
Timing buffered disk reads: 38 MB in 3.12 seconds = 12.17 MB/sec
Timing buffer-cache reads: 3268 MB in 2.00 seconds = 1632.62 MB/sec
Timing buffered disk reads: 40 MB in 3.17 seconds = 12.64 MB/sec
Timing buffer-cache reads: 2776 MB in 2.00 seconds = 1386.82 MB/sec
Timing buffered disk reads: 40 MB in 3.08 seconds = 13.00 MB/sec
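To summarise the three runs above, the buffered disk read figures can be averaged with a short awk filter. This is a minimal sketch; the function name avg_disk_reads is an assumption, and the input is simply the hdparm output quoted above.

```shell
#!/bin/bash
# avg_disk_reads: read hdparm -tT output on stdin and print the mean of the
# "buffered disk reads" rates in MB/sec (the field just before "MB/sec").
avg_disk_reads() {
    awk '/Timing buffered disk reads/ { sum += $(NF-1); n++ }
         END { if (n) printf "%.2f\n", sum / n }'
}

avg=$(avg_disk_reads <<'EOF'
Timing buffer-cache reads: 2720 MB in 2.00 seconds = 1359.53 MB/sec
Timing buffered disk reads: 38 MB in 3.12 seconds = 12.17 MB/sec
Timing buffer-cache reads: 3268 MB in 2.00 seconds = 1632.62 MB/sec
Timing buffered disk reads: 40 MB in 3.17 seconds = 12.64 MB/sec
Timing buffer-cache reads: 2776 MB in 2.00 seconds = 1386.82 MB/sec
Timing buffered disk reads: 40 MB in 3.08 seconds = 13.00 MB/sec
EOF
)
echo "mean buffered disk read: $avg MB/sec"
```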
I deleted the files /raid1/b*; 21% of the array remains occupied by /raid1/gimi.
At 2:15, I restarted the write test with:
mix1 raid1 # cat copy.bash
#!/bin/bash
i=1
while true; do
cp /home/ab/tgzs/datas.bis-0.8.4.tar.gz /raid1/toto$i
let i++
done
while the RAID rebuild is at 9%.
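The loop in copy.bash runs forever and keeps going even when cp fails, for instance once the array is full. A bounded variant might look like the sketch below; the function name copy_loop, the max argument and the scratch directory used in the example are assumptions made for illustration, not part of the original script.

```shell
#!/bin/bash
# copy_loop SRC DEST_DIR MAX: copy SRC to DEST_DIR/toto1 .. DEST_DIR/totoMAX.
# Stops at the first cp failure and prints the number of successful copies.
copy_loop() {
    local src=$1 dest=$2 max=$3 i=1
    while [ "$i" -le "$max" ]; do
        if ! cp "$src" "$dest/toto$i"; then
            echo "cp failed at copy $i" >&2
            echo $((i - 1))
            return 1
        fi
        i=$((i + 1))
    done
    echo $((i - 1))
}

# Hypothetical usage with a temporary directory instead of /raid1.
scratch=$(mktemp -d)
printf 'sample data\n' > "$scratch/src.bin"
copies=$(copy_loop "$scratch/src.bin" "$scratch" 3)
count=$(ls "$scratch"/toto* | wc -l | tr -d ' ')
echo "made $copies copies"
rm -rf "$scratch"
```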
Next step: same test on the same RAID, with the second 3ware card enabled.
Sun 11 Jul 2004 Rebuild on mix1 - Manu
The first rebuild failed at 11am this morning:
ERROR: Rebuild/initialization/verify failed due to an error on the source or destination drive of controller ID:2 unit: 0. Please check the drives and perform a forced rebuild if necessary. (0x4)
(Forced) rebuild on /raid1 started again at 4:40pm
Meanwhile, it was found that the slowdown noticed on 3DM pages can be fixed by restarting Apache2... It might just be a coincidence.
(Forced) rebuild ended successfully at 6:40pm
BIOS reconfigured:
IOMMU set to "best fit" in BIOS (was deactivated before that).
Aperture left at 32MB
Second PCI RAID device re-enabled
Rebuild started at 7pm on the second RAID:/raid1 (the new /dev/sdb, the previous /raid1 has become /raid2). Ended successfully at 9pm.
A medium-size, multi-threaded SWarp started (reading from /raid1, writing to /raid2): completed successfully
2 big SWarps started simultaneously (both multithreaded) on /raid1 and /raid2. Crashed (fatal freeze) after about 40 minutes.
Test started again after moving the cards to PCI slots 3 and 4 (was 1 and 2). Fortunately the RAIDs were not degraded by the crash.
Tue 13 Jul 2004 mix2 data recovery? - Henry
In an attempt to recover the lost data on mix2, the following tests were performed with a view to rebuilding the RAID array. The problem probably lies at a more fundamental level than this, but one has to try. The following things were tried:
1. the 3ware cards were moved to different slots (3,4) on the motherboard, and the RAID reconstruction restarted. The reconstruction failed after five minutes or so (error on drive 4)
2. the 3ware card connected to raid1 (which did not contain any data) was removed from mix2, and the reconstruction restarted. It failed after five minutes or so, with the same error as before.
3. the second 3ware card was re-inserted, and deactivated in the BIOS. Drive 4 was replaced, and the rebuild started. The rebuild failed after the same amount of time as before.
Tue 27 Jul 2004 RAID5 rebuild on mix1:/raid2
/dev/sdc (raid2) on mix1 is rebuilding:
identify the proper array by removing one card. Checked the raid1/raid2 association in table on the cluster layout page.
port3 needed to be unplugged/replugged for the card to see it
delete RAID5 / create RAID0
put second card back in
delete RAID0 / create RAID5
Wed 28 Jul 2004 installation of 3dm2
3dm2 is the new 3ware disk manager for series 9000 cards and opterons (3DM 2 Linux 2.00.00.038 for x86_64). User password is empty. Look at the configuration page for installation details.
3dm is still installed, but not started. If you want to get back to the old 3dm:
/etc/init.d/3dm2 stop
rc-update del 3dm2
rc-update add 3dm default
/etc/init.d/3dm start
Thu 29 Jul 2004 Change disk RAID mix2
mix2's port4 disk of SCSI ID 2 (raid1) has been changed. Serial number: Y60SBSGE -> Y60SBTFE.
Tue 03 Aug 2004 Change disk RAID mix1
mix1's port1 disk of raid2 has been changed. Serial number: Y60RW91E -> Y60SBB2E
Thu 05 Aug 2004 installation of smartd
Installation of smartmontools to monitor RAID5 disks
Mon 23 Aug 2004 reconstruction of first RAID5 array on mix2 Fred
/dev/sdb on mix2 was showing only 250GB available, although 3dm showed a 1.7TB array... I erased this array and rebuilt a RAID0 one. This did not solve the problem. RAID5 reconstruction is under way. NB: 3dm2 has functionalities (like deleting/creating arrays) which are not supported by those 8000 3ware cards.
Mystery solved... This 250 GB disk, /dev/sdb, which was taken for a RAID partition, was in fact just the USB disk plugged into mix2... Since the usb_storage module loads before 3w-xxxx, it took the name /dev/sdb. mix2 is therefore back in order.
/dev/sdb 1.6T 528K 1.6T 1% /data/mix2/raid1
/dev/sdc 1.6T 528K 1.6T 1% /data/mix2/raid2
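A quick way to avoid this kind of confusion is to list each sd* device with its size before touching an array. The sketch below reads /sys/block/<dev>/size, which Linux reports in 512-byte sectors; the function names sectors_to_gb and list_sd_sizes are assumptions for the example, and sizes are shown in GB of 10^9 bytes.

```shell
#!/bin/bash
# sectors_to_gb N: size in GB (10^9 bytes) of N sectors of 512 bytes.
sectors_to_gb() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 512 / 1e9 }'
}

# list_sd_sizes: print "name size GB" for every sd* block device found.
# Prints nothing on a machine without such devices.
list_sd_sizes() {
    local dev
    for dev in /sys/block/sd*; do
        [ -r "$dev/size" ] || continue
        echo "${dev##*/} $(sectors_to_gb "$(cat "$dev/size")") GB"
    done
}

list_sd_sizes
```

A 250 GB USB disk and a 1.6 TB array are then easy to tell apart, whatever order the modules were loaded in.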
Wed 25 Aug 2004 mix3 got its name and has been moved under clix's NIS - Fred
Thu 26 Aug 2004 Optimization block size readahead Manu - Fred
Cf. this note.
Mon 30 Aug 2004 Installation of SuperMongo Fred
Cf. a few installation details
Mon 13 Sep 2004 kernel upgrade Fred
kernel upgrade from 2.6.7 to 2.6.8. mix1 and mix2 are running this new kernel, mix3 is awaiting its reboot.
Tue 28 Sep 2004 kernel downgrade on mix3 Fred
back to 2.6.7-gentoo-r14 because 3dm2 was not showing any controller.
Sat 30 Oct 2004 major upgrade on mix3 and mix2 Fred
Major upgrade: switch to kernel 2.6.9-gentoo-r1 and glibc 2.3.4.20040808-r1 . For a complete list of updates:
genlop '*' | less
Mon 01 Nov 2004 change of blockdev on /data/mix3/raid* Manu - Fred
Change value from 16384 to 2048 after tests by Manu.
# blockdev --setra 2048 /dev/sdb    # for raid1
# blockdev --setra 2048 /dev/sdc    # for raid2
# vi /etc/conf.d/local.start
/sbin/blockdev --setra 2048 /dev/sdb
/sbin/blockdev --setra 2048 /dev/sdc
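For reference, blockdev --setra takes its value in 512-byte sectors, so the change above lowers the readahead from 8 MiB to 1 MiB. A minimal sketch of the conversion (the function name ra_to_kib is an assumption for the example):

```shell
#!/bin/bash
# ra_to_kib SECTORS: convert a blockdev --setra value (512-byte sectors) to KiB.
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

old=$(ra_to_kib 16384)
new=$(ra_to_kib 2048)
echo "readahead: ${old} KiB -> ${new} KiB"
```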
Mon 01 Nov 2004 update of module 3w-9xxx on mix[23] (TO BE DONE!) Manu or Fred
Update module 3w-9xxx with the one from 3ware.com. Reason: the kernel default one does not talk to 3dm2 (no visible controller).
free the arrays (check with lsof | grep raid)
# umount /data/mix3/raid1
# umount /data/mix3/raid2
# /etc/init.d/3dm2 stop
# rmmod 3w-9xxx
# cd /root/src/3w-9xxx2.6/driver/
# make
# cp 3w-9xxx.ko /lib/modules/2.6.9-gentoo-r1/kernel/drivers/scsi/3w-9xxx.ko
# modules-update
# modprobe 3w-9xxx
# mount /data/mix3/raid1
# mount /data/mix3/raid2
# /etc/init.d/3dm2 start
connect to http://mix3.iap.fr:8086 to verify that everything works
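The cp step above hard-codes the kernel version, which has to be adjusted on each host and after each kernel upgrade. One possible way to derive the destination path is sketched below; the function name module_dest is an assumption, and the directory layout is simply the one used in the cp command above.

```shell
#!/bin/bash
# module_dest [KERNEL_VERSION]: print where the 3w-9xxx module should be copied.
# Defaults to the running kernel as reported by uname -r.
module_dest() {
    local kver=${1:-$(uname -r)}
    echo "/lib/modules/$kver/kernel/drivers/scsi/3w-9xxx.ko"
}

dest=$(module_dest 2.6.9-gentoo-r1)
echo "$dest"
```

With no argument, the function targets the kernel currently running, which is only right if the module was built against that same kernel.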
Thu 09 Nov 2004 fix Xconnection via ssh Fred
X connections via ssh have no longer been possible since the last pam (Pluggable Authentication Modules) update. Solution: comment out line 57 in /etc/security/pam_env.conf:
#DISPLAY DEFAULT=${REMOTEHOST}:0.0 OVERRIDE=${DISPLAY}
Done on mix2 and mix3, to be done on mix1 after pam upgrade.
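Commenting out a given line can be scripted with sed, which helps when the same fix has to be repeated on several nodes. The sketch below is an illustration only: the function name comment_line is an assumption, it prints the modified text to stdout rather than editing in place, and the example runs on a temporary file, not on /etc/security/pam_env.conf. The line number should be checked on each host, since it may differ between pam versions.

```shell
#!/bin/bash
# comment_line FILE N: print FILE with a '#' prepended to line N.
# The file itself is left untouched.
comment_line() {
    local file=$1 n=$2
    sed "${n}s/^/#/" "$file"
}

# Hypothetical usage on a three-line sample file.
tmp=$(mktemp)
printf 'A\nB\nC\n' > "$tmp"
result=$(comment_line "$tmp" 2)
rm -f "$tmp"
echo "$result"
```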