System evolution on mix[123]
Log of system updates and maintenance on our Opterons
by FMA - Updated November 28th, 2004
Dates and versions are listed here. For installation instructions, consult this article.

Thu 01 Jul 2004 switch CPUs between mix1 and mix2

Done to determine whether mix1's instability is due to its CPUs or to its motherboard.


Fri 09 Jul 2004 RAID card tests on mix1 - Manu - Gilles - Fred

goal: test the 3ware card in mix1, since writing to its RAID crashes the machine

tests:
-  disabled the card attached to the RAID full of Mireille's data (via the BIOS: disable PCI bus 2)
-  backed up Mireille's data other than *.fits and *.head to dantel/raid1saved.tgz
-  reboot: array rebuilt as RAID0 + mkfs.xfs, then mounted on /raid1
-  write tests: # cat /proc/kcore > /raid1/kcore
-  400 GB written so far, no crash
-  if no crash shows up, rebuild as RAID5 overnight and repeat the write tests.
-  update: after 700 GB written, no crash. RAID5 rebuild started at 20:30.
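The write test above streams /proc/kcore onto the array. As a minimal sketch of a bounded variant, the hypothetical function below writes a fixed number of mebibytes and reports how many bytes landed on disk. The function name, the use of /dev/zero as a source, and the size argument are assumptions for illustration, not what was run on mix1.

```shell
# write_test FILE SIZE_MIB (hypothetical helper): write SIZE_MIB mebibytes of
# zeros to FILE, flush them to disk, and print the number of bytes written.
write_test() {
    target=$1
    size_mib=$2
    # bs is given in bytes so the command does not depend on GNU size suffixes
    dd if=/dev/zero of="$target" bs=1048576 count="$size_mib" 2>/dev/null || return 1
    sync
    wc -c < "$target" | tr -d ' '
}
```

For example, `write_test /raid1/zeros 1024` would write 1 GiB and print 1073741824 if nothing went wrong. A fixed size makes successive runs comparable, which an open-ended cat does not.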


Fri 09 Jul 2004 Non-degraded RAID5 tests on mix1 - Gilles

Indeed, no problem to report: the RAID stayed OK (i.e. not degraded) right up to the end, at 85%. Reading the indices of the copied files gives a write speed of 33 MB/s, which in fact corresponds to the read speed of the file on the IDE disk holding /home. So, to test, I used ftp from node 2 and copied large files at more than 63 MB/s.

For testing, I suggest suspending the multiple-copy script from /home/GROS_fichier to /raid1/ and resuming with a copy over the network.

Once both RAID cards are installed again, the copy speed from /raid to /raid will have to be measured.
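The rates quoted above (33 MB/s from the IDE disk, 63 MB/s over the network) are bytes copied divided by elapsed time. A minimal sketch of that measurement follows; both function names are hypothetical, and the one-second resolution of `date +%s` means it is only meaningful for large files.

```shell
# rate_mb_s BYTES SECONDS (hypothetical helper): print the throughput in MB/s
# (1 MB = 10^6 bytes) with one decimal.
rate_mb_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", b / s / 1000000 }'
}

# timed_copy SRC DST (hypothetical helper): copy SRC to DST, flush, and print
# the measured rate. Elapsed time is rounded up to at least one second.
timed_copy() {
    start=$(date +%s)
    cp "$1" "$2" || return 1
    sync
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1
    rate_mb_s "$(wc -c < "$1" | tr -d ' ')" "$elapsed"
}
```

The figure printed is bounded by the slower of the source read and the destination write, which is why a copy from /home only measured the IDE disk.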

Regarding the RAID degradation: was it really the disk on port #2 that was removed and then put back? That is what the following page suggests: http://mix1.iap.fr:8086. This page takes very long to load when /raid1 is running flat out, so there is still nothing displayed for http://mix1.iap.fr:8086/alarms.html


Sat 10 Jul 2004 Degraded RAID5 tests on mix1 - Manu

Gilles' scripts were used once again to test writing on /raid1, this time with a degraded array (a hard drive was removed to simulate the degradation). As a consequence, copy speed dropped from 70 MB/s to 20 MB/s, and the throughput was very irregular.


Sun 11 Jul 2004 Tests on degraded RAID5 during rebuild, mix1 - Fred

The write tests on a degraded RAID5 passed: more than 650 GB were written without a crash.

Odd things: pdflush (cf. /usr/src/linux-2.6.7-gentoo-r7/mm/pdflush.c: worker threads for writing back filesystem data) was running as 8 instances. I had to unmount /raid1 to be able to shut the machine down. Nothing special in the logs.

The disks removed by Manu were put back, and the RAID5 rebuild was started from the BIOS.

A timing test with hdparm -v -tT /dev/sdb (which measures reads, not writes) gives:

Timing buffer-cache reads:   2720 MB in  2.00 seconds = 1359.53 MB/sec
Timing buffered disk reads:   38 MB in  3.12 seconds =  12.17 MB/sec

Timing buffer-cache reads:   3268 MB in  2.00 seconds = 1632.62 MB/sec
Timing buffered disk reads:   40 MB in  3.17 seconds =  12.64 MB/sec

Timing buffer-cache reads:   2776 MB in  2.00 seconds = 1386.82 MB/sec
Timing buffered disk reads:   40 MB in  3.08 seconds =  13.00 MB/sec
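To compare runs, the three "buffered disk reads" figures above can be averaged. The sketch below assumes the hdparm output format shown in this entry, where the rate is the next-to-last field of the line; the function name is hypothetical.

```shell
# avg_disk_reads (hypothetical helper): read hdparm -tT output on stdin and
# print the mean "buffered disk reads" rate in MB/sec, with two decimals.
avg_disk_reads() {
    awk '/buffered disk reads/ { sum += $(NF - 1); n++ }
         END { if (n == 0) exit 1; printf "%.2f\n", sum / n }'
}
```

Fed the three runs above, it prints 12.60, the disk read rate measured while the RAID5 rebuild was in progress.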

I deleted the files /raid1/b*; 21% of the array remains occupied by /raid1/gimi.

At 2:15, I restarted the write test with:

mix1 raid1 # cat copy.bash
#!/bin/bash

i=1

while true; do
 cp /home/ab/tgzs/datas.bis-0.8.4.tar.gz /raid1/toto$i
 let i++
done

while the RAID rebuild is at 9%.

Next step: same test on the same RAID, with the second 3ware card enabled.
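The copy.bash loop above never checks whether cp succeeded, so it keeps spinning once the array is full. A minimal sketch of a bounded variant is given below; the function name and its arguments are hypothetical, and only the toto$i naming is taken from the original script.

```shell
# fill_copies SRC DEST_DIR MAX (hypothetical helper): copy SRC to
# DEST_DIR/toto1, toto2, ... and stop after MAX copies or at the first
# failed cp (for example when the array is full).
# Prints the number of successful copies.
fill_copies() {
    src=$1
    dest_dir=$2
    max=$3
    i=1
    while [ "$i" -le "$max" ]; do
        cp "$src" "$dest_dir/toto$i" || break
        i=$((i + 1))
    done
    echo $((i - 1))
}
```

Something like `fill_copies /home/ab/tgzs/datas.bis-0.8.4.tar.gz /raid1 1000` would then report how many copies were written before the limit or the first error.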

Sun 11 Jul 2004 Rebuild on mix1 - Manu

The first rebuild failed at 11am this morning:

ERROR: Rebuild/initialization/verify failed due to an error on the source or destination drive of controller ID:2 unit: 0. Please check the drives and perform a forced rebuild if necessary. (0x4)

(Forced) rebuild on /raid1 started again at 4:40pm

Meanwhile, it was found that the slowdown noticed on 3DM pages can be fixed by restarting Apache2... It might just be a coincidence.

(Forced) rebuild ended successfully at 6:40pm

BIOS reconfigured:
-  IOMMU set to "best fit" in BIOS (was deactivated before that).
-  Aperture left at 32MB
-  Second PCI RAID device re-enabled

Rebuild started at 7pm on the second RAID:/raid1 (the new /dev/sdb, the previous /raid1 has become /raid2). Ended successfully at 9pm.

A medium-size, multi-threaded SWarp started (reading from /raid1, writing to /raid2): completed successfully

2 big SWarps started simultaneously (both multithreaded) on /raid1 and /raid2. Crashed (fatal freeze) after about 40 minutes.

Test started again after moving the cards to PCI slots 3 and 4 (was 1 and 2). Fortunately the RAIDs were not degraded by the crash.


Tue 13 Jul 2004 mix2 data recovery? Henry

In an attempt to recover the lost data on mix2, the following steps were tried, with a view to rebuilding the RAID array. The problem is probably at a more fundamental level than this, but one has to try:

1. the 3ware cards were moved to different slots (3,4) on the motherboard, and the RAID reconstruction restarted. The reconstruction failed after five minutes or so (error on drive 4)

2. the 3ware card connected to raid1 (which did not contain any data) was removed from mix2, and the reconstruction restarted. It failed after five minutes or so, with the same error as before.

3. the second 3ware card was re-inserted, and deactivated in the BIOS. Drive 4 was replaced, and the rebuild started. The rebuild failed after the same amount of time as before.


Tue 27 Jul 2004 RAID5 rebuild on mix1:/raid2

/dev/sdc (raid2) on mix1 is rebuilding:
-  identify the proper array by removing one card. Checked the raid1/raid2 association in table on the cluster layout page.
-  port3 needed to be unplugged/replugged for the card to see it
-  delete RAID5 / create RAID0
-  put second card back in
-  delete RAID0 / create RAID5


Wed 28 Jul 2004 installation of 3dm2

3dm2 is the new 3ware disk manager for series 9000 cards and opterons (3DM 2 Linux 2.00.00.038 for x86_64). User password is empty. Look at the configuration page for installation details.

3dm is still installed, but not started. If you want to get back to the old 3dm:

/etc/init.d/3dm2 stop
rc-update del 3dm2
rc-update add 3dm default
/etc/init.d/3dm start

Thu 29 Jul 2004 Change disk RAID mix2

mix2's port4 disk of SCSI ID 2 (raid1) has been changed. Serial number: Y60SBSGE -> Y60SBTFE.


Tue 03 Aug 2004 Change disk RAID mix1

mix1's port1 disk of raid2 has been changed. Serial number: Y60RW91E -> Y60SBB2E


Thu 05 Aug 2004 installation of smartd

Installation of smartmontools to monitor RAID5 disks


Mon 23 Aug 2004 reconstruction of first RAID5 array on mix2 Fred

/dev/sdb on mix2 was showing only 250GB available, although 3dm showed a 1.7TB array... I erased this array and rebuilt a RAID0 one. This did not solve the problem. RAID5 reconstruction is under way. NB: 3dm2 has functionalities (like deleting/creating arrays) which are not supported by these 3ware 8000-series cards.

Mystery solved... This 250 GB disk, /dev/sdb, which was believed to be a RAID partition, was in fact just the USB disk plugged into mix2... Since the usb_storage module loads before 3w-xxxx, it took the name /dev/sdb. mix2 is therefore back in order.

/dev/sdb              1.6T  528K  1.6T   1% /data/mix2/raid1
/dev/sdc              1.6T  528K  1.6T   1% /data/mix2/raid2
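Because sd names depend on module load order, it helps to check what is actually behind a name before erasing anything. The sketch below assumes the kernel exposes vendor and model strings under /sys/block/sd*/device/, as sysfs on 2.6 kernels does for SCSI disks; the function name is hypothetical, and the strings a 3ware array or a USB disk reports are not taken from this log.

```shell
# list_sd_vendors [SYSFS_BLOCK_DIR] (hypothetical helper): print
# "name vendor model" for each sd* device found under SYSFS_BLOCK_DIR
# (default: /sys/block). Assumes the device/vendor and device/model files.
list_sd_vendors() {
    root=${1:-/sys/block}
    for dev in "$root"/sd*; do
        [ -r "$dev/device/vendor" ] || continue
        # the kernel pads these strings with trailing spaces; strip them
        vendor=$(sed 's/ *$//' "$dev/device/vendor")
        model=$(sed 's/ *$//' "$dev/device/model" 2>/dev/null)
        echo "$(basename "$dev") $vendor $model"
    done
}
```

Running `list_sd_vendors` before touching /dev/sdb would have shown at a glance whether that name belonged to a RAID unit or to the USB disk.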

Wed 25 Aug 2004 mix3 got its name and was moved under clix's NIS Fred


Thu 26 Aug 2004 Optimization block size readahead Manu - Fred

Cf. this note.


Mon 30 Aug 2004 Installation of SuperMongo Fred

Cf. a few installation details


Mon 13 Sep 2004 kernel upgrade Fred

kernel upgrade from 2.6.7 to 2.6.8. mix1 and mix2 are running this new kernel, mix3 is awaiting its reboot.


Tue 28 Sep 2004 kernel downgrade on mix3 Fred

back to 2.6.7-gentoo-r14 because 3dm2 was not showing any controller.


Sat 30 Oct 2004 major upgrade on mix3 and mix2 Fred

Major upgrade: switch to kernel 2.6.9-gentoo-r1 and glibc 2.3.4.20040808-r1 . For a complete list of updates:

genlop '*' | less

Mon 01 Nov 2004 change of blockdev on /data/mix3/raid* Manu - Fred

Read-ahead value changed from 16384 to 2048 after tests by Manu.

# blockdev --setra 2048 /dev/sdb        for raid1
# blockdev --setra 2048 /dev/sdc        for raid2
# vi /etc/conf.d/local.start
/sbin/blockdev --setra 2048 /dev/sdb
/sbin/blockdev --setra 2048 /dev/sdc
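blockdev --setra takes its value in 512-byte sectors, so 2048 means 1 MiB of read-ahead and the previous 16384 meant 8 MiB. A minimal sketch of the conversion, with a check of the live setting through blockdev --getra, follows; both function names are hypothetical.

```shell
# ra_to_kib SECTORS (hypothetical helper): convert a read-ahead value given
# in 512-byte sectors to KiB.
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# show_ra DEVICE... (hypothetical helper): print the current read-ahead of
# each device in sectors and in KiB. Reading the value usually needs root.
show_ra() {
    for dev in "$@"; do
        sectors=$(blockdev --getra "$dev") || return 1
        echo "$dev: $sectors sectors = $(ra_to_kib "$sectors") KiB"
    done
}
```

After a reboot, `show_ra /dev/sdb /dev/sdc` is a quick way to confirm that the lines added to /etc/conf.d/local.start took effect.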

Mon 01 Nov 2004 update of module 3w-9xxx on mix[23] (TO BE DONE!) Manu or Fred

Update module 3w-9xxx with the one from 3ware.com. Reason: the kernel default one does not talk to 3dm2 (no visible controller).

-  free the arrays (check with lsof | grep raid)

# umount /data/mix3/raid1
# umount /data/mix3/raid2
# /etc/init.d/3dm2 stop
# rmmod 3w-9xxx
# cd /root/src/3w-9xxx2.6/driver/
# make
# cp 3w-9xxx.ko /lib/modules/2.6.9-gentoo-r1/kernel/drivers/scsi/3w-9xxx.ko
# modules-update
# modprobe 3w-9xxx
# mount /data/mix3/raid1
# mount /data/mix3/raid2
# /etc/init.d/3dm2 start

-  connect to http://mix3.iap.fr:8086 to check that everything works
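Since this procedure is still to be done, a dry run can help rehearse the order of operations. The sketch below wraps the unmount, module reload and remount steps listed above in a hypothetical run helper that only prints commands when DRY_RUN=1. It deliberately leaves out the make and cp steps, and both function names are assumptions.

```shell
# run CMD... (hypothetical helper): print the command, and execute it only
# when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] && return 0
    "$@"
}

# reload_3w_module (hypothetical helper): unmount the arrays, reload the
# 3w-9xxx module and bring everything back, stopping at the first failure.
# The driver build and copy are assumed to have been done beforehand.
reload_3w_module() {
    for mp in /data/mix3/raid1 /data/mix3/raid2; do
        run umount "$mp" || return 1
    done
    run /etc/init.d/3dm2 stop || return 1
    run rmmod 3w-9xxx || return 1
    run modprobe 3w-9xxx || return 1
    for mp in /data/mix3/raid1 /data/mix3/raid2; do
        run mount "$mp" || return 1
    done
    run /etc/init.d/3dm2 start
}
```

With `DRY_RUN=1` set, calling `reload_3w_module` only lists the eight commands. Stopping at the first failure avoids removing the module while an array is still mounted.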

Thu 09 Nov 2004 fix X connection via ssh Fred

X connections via ssh have not been possible since the last pam (Pluggable Authentication Modules) update. Solution: comment out line 57 in /etc/security/pam_env.conf:

#DISPLAY                DEFAULT=${REMOTEHOST}:0.0 OVERRIDE=${DISPLAY}

Done on mix2 and mix3, to be done on mix1 after pam upgrade.
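For mix1, the same edit can be scripted. The sketch below matches the DISPLAY entry by pattern rather than by line number, on the assumption that line 57 may move after the pam upgrade; the function name is hypothetical, and the original file is kept as a .bak copy.

```shell
# comment_out_display FILE (hypothetical helper): prefix every uncommented
# line starting with "DISPLAY" followed by whitespace with '#'.
# The original content is kept in FILE.bak.
comment_out_display() {
    cp -p "$1" "$1.bak" || return 1
    sed 's/^DISPLAY[[:space:]]/#&/' "$1.bak" > "$1"
}
```

Running `comment_out_display /etc/security/pam_env.conf` as root applies the fix. Running it a second time changes nothing, since the line no longer starts with DISPLAY.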


© Terapix 2003-2011