Intel's "HyperThreading" (HT) is not useful for compute clusters because floating point or integer operations are usually completed in succession on each processor, rather than a mixture of operations concurrently. Enabling HT on compute nodes could create a situation where a virtual processor (the primary feature of HT) receives a job assignment from the resource manager, resulting in sub-optimal performance.
In setting-up new cluster nodes, the BIOS is configured as follows: Load setup defaults, disable HyperThreading, disable boot logo display.
The initial installation & configuration tests were performed on node25.ribosome.cluster. RHEL4 AS was installed from CDs, using defaults unless otherwise specified.
Disk Druid: created a 4 GB swap partition, followed by an ext3 / (root) partition using the remaining space.
Firewall disabled, and SELinux disabled.
Package Group Selection: Deselected (removed) Text-Based Internet, Server Configuration Tools, Web Server, Windows File Server, Printing Support. Selected (added) Development Tools, Legacy Software Development, rsh-server (under Legacy Network Server), and system-config-kickstart (under Admin Tools).
System booted RHEL AS (2.6.9-42.ELsmp) without errors.
When prompted for an RHN login, selected the "Tell me why..." option, which made the "I cannot complete..." option appear, and selected that in order to proceed without logging in.
Logged in as root and activated the RHEL software (expect a 30-45 second delay):
rhnreg_ks --activationkey=[KEY]
As root,
mkdir /software
Edit /etc/fstab by adding the following lines:
homehost.cluster:/home /home nfs defaults 0 0
apphost.cluster:/software /software nfs defaults 0 0
Verify the NFS configuration:
mount homehost.cluster:/home /home
cd /home
ls
mount apphost.cluster:/software /software
cd /software
ls
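Since the mounts above were issued by hand, the new fstab entries themselves can be re-tested as well; a minimal check, assuming nothing is currently holding the mount points open:
umount /home /software
mount -a
df -h /home /software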
Copied this /etc/hosts file from node1.ribosome.cluster to each new node:
127.0.0.1 localhost.localdomain localhost ribosome
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Copying this file is needed for the DHCP hostname and dnsdomainname assignments to function.
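Once a node has picked up its DHCP lease, the assignment can be spot-checked; hostname and dnsdomainname should report the node's name and the cluster domain, respectively (the exact output depends on how the lease and /etc/hosts resolve the FQDN):
hostname
dnsdomainname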
Added lines in the DNS databases to prepare for the new ribosome nodes.
In fileserver:/etc/bind/db.192.168:
125.40 IN PTR node25.ribosome.cluster.
126.40 IN PTR node26.ribosome.cluster.
127.40 IN PTR node27.ribosome.cluster.
128.40 IN PTR node28.ribosome.cluster.
In fileserver:/etc/bind/db.cluster:
node25.ribosome IN A 192.168.40.125
node26.ribosome IN A 192.168.40.126
node27.ribosome IN A 192.168.40.127
node28.ribosome IN A 192.168.40.128
DHCP and Bind were restarted on fileserver as follows:
/etc/init.d/dhcp3-server restart
/etc/init.d/bind9 restart
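As a quick sanity check of the new records, forward and reverse lookups can be tried from any machine pointed at this DNS server (assuming the host utility from bind-utils/dnsutils is installed):
host node25.ribosome.cluster
host 192.168.40.125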
Then, on head.ribosome.cluster, lines were added for the new nodes.
In head.ribosome.cluster:/etc/dhcp3/dhcpd.conf:
host node25.ribosome.cluster { hardware ethernet 00:15:f2:8a:33:69; fixed-address 192.168.40.125; }
host node26.ribosome.cluster { hardware ethernet 00:15:f2:80:2d:3c; fixed-address 192.168.40.126; }
host node27.ribosome.cluster { hardware ethernet 00:13:d4:99:48:d0; fixed-address 192.168.40.127; }
host node28.ribosome.cluster { hardware ethernet 00:e0:18:00:12:13; fixed-address 192.168.40.128; }
DHCP was then restarted on head.ribosome.cluster as follows:
/etc/init.d/dhcp3-server restart
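A syntax check of the edited configuration can be run before (or after) the restart; this assumes the ISC dhcpd3 binary shipped with the dhcp3-server package:
dhcpd3 -t -cf /etc/dhcp3/dhcpd.conf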
At this point, each new node was rebooted.
As root,
authconfig
Select "Use NIS"
Domain: cluster
Server: 192.168.0.140
No reboot or daemon restart is necessary.
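A quick check that NIS lookups are working (assuming ypbind came up cleanly after authconfig):
ypwhich
ypcat passwd | head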
The rsh-server package from the RHEL4 EM64T AS installation CD (disc #4) must be installed, if it is not already.
As root, under Applications > System Settings > Server Settings > Services, ensure that the rsh, rlogin, rexec services are selected to start on boot.
To start these services immediately, restart the xinetd service.
Copy the /etc/hosts.equiv file from node1.ribosome.cluster to /etc on the new node before rebooting the system or restarting the xinetd service.
Verify that the list of hosts in /etc/hosts.equiv is complete, so that all appropriate hosts are granted access.
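The same services can also be enabled from the command line instead of the GUI; the names below are the standard RHEL4 xinetd entries provided by rsh-server:
chkconfig rsh on
chkconfig rlogin on
chkconfig rexec on
service xinetd restart
Then, from an existing node listed in /etc/hosts.equiv (node1, for example), a password-less login to the new node can be tested with something like:
rsh node25.ribosome.cluster hostname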
The default Ganglia installation installs files in `/usr/local/bin`, `/usr/local/man`, etc.
As root,
cd /root/ganglia-3.0.3
./configure
make
make check
All self-tests passed with `OK`.
make install
No problems reported.
The Ganglia installation folder contains an init.d script for starting/stopping the gmond Ganglia client daemon. In the current installation of RHEL, runlevel 5 is the default.
cp gmond.init /etc/init.d
ln -s /etc/init.d/gmond.init /etc/rc.d/rc5.d/S66gmond.init
Now, when the system is rebooted, the Ganglia client daemon is started during the boot process.
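The client can also be started immediately and verified without a reboot (assuming the shipped init script accepts the usual start argument):
/etc/init.d/gmond.init start
ps -C gmond -o pid,cmd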
As root,
tar -xzvf torque-2.1.6.tar.gz
cd torque-2.1.6
./configure
make
make install
Copied the /sbin/start-stop-daemon from node1.ribosome.cluster to /sbin on the new node (this script was not installed with the installation steps listed above).
mv /var/spool/torque /var/spool/torque-1.2.0p6
Copied the following configuration files from /var/spool/torque-1.2.0p6 of node1.ribosome.cluster to the same locations on the new node:
/var/spool/torque-1.2.0p6/server_name
/var/spool/torque-1.2.0p6/mom_priv/config
Copied the /etc/init.d/torquemom from node1.ribosome.cluster to /etc/init.d of the new node.
chkconfig --add torquemom
Started the Torque client daemon:
/etc/rc.d/rc5.d/S50torquemom start
Starting Torque MOM.
Rebooted the node to verify that the torquemom daemon could start successfully following a reboot.
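After the reboot, a quick check that the MOM daemon (pbs_mom) is actually running:
ps -C pbs_mom -o pid,args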
Added lines to /var/spool/torque-1.2.0p6/server_priv/nodes on fileserver to add the new node to the pool:
node25.ribosome.cluster np=2 ribosome
node26.ribosome.cluster np=2 ribosome
node27.ribosome.cluster np=2 ribosome
node28.ribosome.cluster np=2 ribosome
The Torque server (pbs_server) must be restarted for these changes to take effect.
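Once the server is back up, the new nodes should appear in the pool; from the server host (assuming the Torque client tools are in the PATH there):
pbsnodes -a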
This is a test installation of the "evaluation version" of this compiler.
As root,
cd l_fc_c_9.1.036
./install.sh
Choose `1` from the menu to install the compiler.
Enter serial number: [SERIAL] (the installer will validate the serial online)
Choose `2` for a custom installation.
Choose `1` to install the compiler.
`accept` the license agreement.
Install in /usr/local/intel/fc91
The installer briefly tests the installation when finished (the test was passed successfully).
Set the environment variables to use the compiler (current session only):
PATH=$PATH:/usr/local/intel/fc91/bin;export PATH
LD_LIBRARY_PATH=/usr/local/intel/fc91/lib;export LD_LIBRARY_PATH
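A quick check that the compiler is picked up from the new PATH (the -V flag prints the version banner on the Intel 9.1 compilers):
which ifort
ifort -V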
This is a test installation of the "evaluation version" of this compiler.
As root,
cd l_cc_c_9.1.043
./install.sh
Choose `1` from the menu to install the compiler.
Enter serial number: [SERIAL] (the installer will validate the serial online)
Choose `2` for a custom installation.
Choose `1` to install the compiler.
`accept` the license agreement.
Install in /usr/local/intel/cc91
The installer briefly tests the installation when finished (the test was passed successfully).
Set the environment variables to use the compiler (current session only):
PATH=$PATH:/usr/local/intel/cc91/bin;export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/intel/cc91/lib;export LD_LIBRARY_PATH
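The same quick check for the C/C++ compiler:
which icc
icc -V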
As root, with the PATH and LD_LIBRARY_PATH environment variables set as described in the compiler installation procedures (above),
mv c32b2 /usr/local
cd /usr/local/c32b2
./install.com gnu xxlarge ifort
No errors reported. The CHARMM executable is /usr/local/c32b2/exec/gnu/charmm.
As root,
wget http://download.systemimager.org/pub/sis-install/install
chmod +x install
./install --list
./install --verbose systemimager-client
Problem:
./install --verbose systemimager-client
Using pre-existing package list: /tmp/sis-packages/stable.list
Downloading: http://install.sisuite.org/sourceforge/systemimager/systemimager-client-3.6.3-1.noarch.rpm...done!
Downloading: http://install.sisuite.org/sourceforge/systemimager/systemimager-common-3.6.3-1.noarch.rpm...done!
Downloading: ... (remaining packages; the perl-AppConfig download did not complete)
error: open of perl-AppConfig-1.52-4.noarch.rpm failed: No such file or directory
rpm -Uhv systemimager-client-3.6.3-1.noarch.rpm systemimager-common-3.6.3-1.noarch.rpm systemconfigurator-2.2.2-1.noarch.rpm perl-AppConfig-1.52-4.noarch.rpm
error: open of perl-AppConfig-1.52-4.noarch.rpm failed: No such file or directory
Solution:
Find the failed download.
updatedb
locate perl-AppConfig
/tmp/sis-packages/perl-AppConfig-1.52-4.noarch.rpm?download&failedmirror=belnet.dl.sourceforge.net
Manually download perl-AppConfig-1.52-4.noarch.rpm (e.g. from a working SourceForge mirror) and copy it to the location of the failed download.
As root,
cp perl-AppConfig-1.52-4.noarch.rpm /tmp/sis-packages
./install --verbose systemimager-client
Using pre-existing package list: /tmp/sis-packages/stable.list
Checking integrity of systemimager-client-3.6.3-1.noarch.rpm: md5 OK
Checking integrity of systemimager-common-3.6.3-1.noarch.rpm: md5 OK
Checking integrity of systemconfigurator-2.2.2-1.noarch.rpm: sha1 md5 OK
Checking integrity of perl-AppConfig-1.52-4.noarch.rpm: md5 OK
rpm -Uhv systemimager-client-3.6.3-1.noarch.rpm systemimager-common-3.6.3-1.noarch.rpm systemconfigurator-2.2.2-1.noarch.rpm perl-AppConfig-1.52-4.noarch.rpm
Preparing... ########################################### [100%]
1:perl-AppConfig ########################################### [ 25%]
2:systemconfigurator ########################################### [ 50%]
3:systemimager-common ########################################### [ 75%]
4:systemimager-client ########################################### [100%]
The System Installation Suite packages you've chosen are now installed!
This version of SystemImager (3.6.3) uses a different naming scheme for its scripts than previous versions, so the names provided in the older (3.1) documentation are not always correct. For example, the prepareclient script has been renamed to si_prepareclient.
slocate -u
slocate prepareclient
/usr/sbin/si_prepareclient
/usr/sbin/si_prepareclient --server head.ribosome.cluster
Answered "y" to having /etc/services and /tmp/rsyncd.conf.20761 modified and the /etc/systemimager directory created.
Error:
Using "sfdisk" to gather information about disk:
/dev/sda
Use of uninitialized value in hash element at /usr/lib/systemimager/perl/SystemImager/Common.pm line 1042,line 7.
rsync: link_stat "/usr/share/systemimager/boot/i386/standard/initrd_template/." failed: No such file or directory (2)
rsync error: some files could not be transferred (code 23) at main.c(702)
Couldn't rsync -a /usr/share/systemimager/boot/i386/standard/initrd_template/ /tmp/.systemimager.1/. at /usr/lib/systemimager/perl/SystemImager/UseYourOwnKernel.pm line 58.
As root on head.ribosome.cluster,
/usr/sbin/getimage --quiet --image node25.ribosome.2006.10.20 --golden-client node25.ribosome.cluster
Error:
rsync: failed to connect to node25.ribosome.cluster: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(99)
Failed to retrieve /etc/systemimager/mounted_filesystems from node25.ribosome.cluster.
getimage: Have you run "prepareclient" on node25.ribosome.cluster?
Giving up for now, to revisit this issue later. Proceeding with G4U cloning, instead.
There is no need to open the chassis of any node to do this, but these systems do not have hot-swap hard drive bays, so a node must be shut down before a drive is installed or removed.
The target and original drives must be of the same make, model, and size.
Install the target (blank/expendable) drive in the secondary HD bay (the left-side bay on an Atipa i1002) of the node holding the original image to be cloned.
Boot the original system to the G4U CD-ROM.
Copy the contents of the original drive to the target drive (here, drive order is critical). This can take about 1 hour, depending on drive size and speed.
copydisk wd0 wd1
Reinstall the target drive in its own chassis (the new node) and boot that machine.
On first boot, the cloned configuration still references the original node's MAC addresses, so the Kudzu hardware-detection tool will interrupt the boot process; delete the configurations for the two missing MAC addresses, then configure both "new" MAC addresses for DHCP.
Allow the system to boot normally. No other changes are required.
If there are minor hardware differences between the new node and the original node, use Kudzu to remove the "missing" hardware and configure the "new" hardware.