MPICH2 high-performance computing and Torque job scheduling
System environment: RHEL 6.0 x86-64, with iptables and SELinux disabled
host: server2.example.com (scheduling node)
server3.example.com (compute node)
server4.example.com (compute node)
Note: for pure parallel computing no scheduling node has to be designated, since the compute nodes are peers
yum install gcc gcc-c++ nfs-utils -y
Note: keep the clocks on all nodes roughly in sync
#Establish SSH trust between the nodes: create the user below on every node, then generate an SSH key pair in that user's home directory
useradd -u 544 xili
ssh-keygen #just press Enter at every prompt
Export /home/xili/ via NFS:
vim /etc/exports
/home/xili *(rw,anonuid=544,anongid=544)
showmount -e 192.168.0.2
####Mount the NFS share on each of the other nodes
mount 192.168.0.2:/home/xili /home/xili
vim /etc/fstab
192.168.0.2:/home/xili /home/xili nfs defaults 0 0
####Copy the SSH public key to the other nodes
ssh-copy-id -i .ssh/id_rsa.pub server3.example.com
#End result: any node can SSH to any other without a password; only the first connection still asks for the password
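The key distribution can be done in one loop. A sketch, using the hostnames from this article; the `echo` makes it a dry run that only prints the commands, so remove it to actually copy the keys:

```shell
# Dry-run key distribution to every other node (remove the echo to execute)
cmds=""
for host in server3.example.com server4.example.com; do
    echo ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
    cmds="$cmds $host"
done
```

Each iteration still prompts once for that node's password; after that, logins are passwordless.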
---->Install the software on each compute node
yum install mpich2.x86_64 -y
Create the hidden file .mpd.conf in the user's home directory:
vim .mpd.conf
secretword=hello #"hello" is the mpd secret word; it must be identical on every node
chmod 600 .mpd.conf
vim mpd.hosts
server2.example.com
server3.example.com
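Both mpd files can also be created non-interactively. A sketch; a temp directory stands in for the NFS-shared home directory here so it is safe to run anywhere:

```shell
# Non-interactive creation of .mpd.conf and mpd.hosts
dir=$(mktemp -d)                 # on a real node, use dir=$HOME instead
printf 'secretword=hello\n' > "$dir/.mpd.conf"
chmod 600 "$dir/.mpd.conf"       # mpd refuses to start if the file is readable by others
printf 'server2.example.com\nserver3.example.com\n' > "$dir/mpd.hosts"
stat -c '%a' "$dir/.mpd.conf"    # prints 600
```

Because the home directory is NFS-shared, one copy of each file serves all nodes.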
####Local test
mpd & #start mpich2's mpd daemon
mpdtrace #list the machines in the ring
mpdallexit #shut the ring down
---->Run the multi-node cluster
mpdboot -n 2 -f mpd.hosts
Note: -n 2 gives the number of machines to start; -f mpd.hosts names the host file to use
mpdtrace
station11
station12
mpdallexit
---->Run the MPICH pi test program
The examples directory of the mpich2 source tarball contains icpi.c, a pi-calculation program; compile it into an executable first:
mpicc icpi.c -o icpi-64
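What icpi.c computes is a midpoint-rule sum approximating the integral of 4/(1+x^2) over [0,1], which equals pi; MPI splits the loop across processes. The same sum, serial, can be sketched in awk:

```shell
# Serial midpoint-rule estimate of pi (the sum icpi.c parallelizes)
n=100000
pi=$(awk -v n="$n" 'BEGIN {
    h = 1.0 / n; s = 0
    for (i = 1; i <= n; i++) { x = h * (i - 0.5); s += 4.0 / (1.0 + x * x) }
    printf "%.10f", s * h
}')
echo "pi is approximately $pi"
```

Raising n shrinks the error, which is exactly the "number of intervals" the program prompts for below.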
####Single-machine test
./icpi-64
Enter the number of intervals: (0 quits) 1000000000
pi is approximately 3.1415926535921401, Error is 0.0000000000023470
wall clock time = 46.571311
Enter the number of intervals: (0 quits) 10000
pi is approximately 3.1415926535921401, Error is 0.00000000008333410
wall clock time = 0.000542
Enter the number of intervals: (0 quits) 0 #0 quits
####Cluster test
mpdboot -n 2 -f mpd.hosts
mpiexec -n 2 /home/xili/icpi-64
Enter the number of intervals: (0 quits) 1000000000
pi is approximately 3.1415926535921401, Error is 0.00000000000001830
wall clock time = 15.530082
Enter the number of intervals: (0 quits) 10000
pi is approximately 3.1415926535921401, Error is 0.00000000008333392
wall clock time = 0.006318
Enter the number of intervals:(0 quits) 0
mpdallexit
************************************************************************************************
---->Torque job scheduling system
####Install the software on the server node
tar -zxf torque-3.0.0.tar.gz
cd torque-3.0.0
./configure --with-scp --with-default-server=server2.example.com
make
make install (torque configuration lives under /var/spool/torque/)
make packages
#this generates the torque libraries under /usr/local/lib/
ldconfig -n /usr/local/lib
cp ~/torque-3.0.0/contrib/init.d/pbs_server /etc/init.d/
cp ~/torque-3.0.0/contrib/init.d/pbs_sched /etc/init.d/
scp ~/torque-3.0.0/contrib/init.d/pbs_mom server3.example.com:/etc/init.d/
scp ~/torque-3.0.0/contrib/init.d/pbs_mom server4.example.com:/etc/init.d/
From the extracted source tree, copy these two self-installing packages to the compute nodes server3.example.com and server4.example.com:
torque-package-clients-linux-x86_64.sh
torque-package-mom-linux-x86_64.sh
chmod +x torque-package-*
Run the scripts to install:
./torque-package-clients-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install
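With more than a couple of nodes the copy-and-install steps are easier as a loop. A sketch over this article's hostnames; the `echo` keeps it a dry run, so drop it to execute over ssh for real:

```shell
# Dry-run push-and-install of both node packages on every compute node
count=0
for host in server3.example.com server4.example.com; do
    for pkg in torque-package-clients-linux-x86_64.sh torque-package-mom-linux-x86_64.sh; do
        echo scp "$pkg" "$host:/tmp/"
        echo ssh "$host" "sh /tmp/$pkg --install"
        count=$((count + 2))
    done
done
echo "$count commands queued"
```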
####Set up the management account on the server node
./torque.setup root
cat /var/spool/torque/server_name #shows the scheduling server's hostname (server2.example.com)
vim /var/spool/torque/server_priv/nodes #register the compute nodes:
server3.example.com
server4.example.com
....(add as many as actually needed)
qterm -t quick #stop torque
service pbs_server start #start the torque server
service pbs_sched start #start torque's built-in scheduler
####Operations on the compute nodes
ldconfig -n /usr/local/lib #make the torque libraries installed on the compute node take effect
vim /var/spool/torque/mom_priv/config
$pbsserver server2.example.com
$logevent 255
service pbs_mom start #start the MOM daemon
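The MOM config step can be scripted with a heredoc. A sketch; TORQUE_HOME is /var/spool/torque for a default build, and a temp directory is used as the default here only so the sketch runs without root:

```shell
# Write the MOM config non-interactively
TORQUE_HOME=${TORQUE_HOME:-$(mktemp -d)}   # on a real node: TORQUE_HOME=/var/spool/torque
mkdir -p "$TORQUE_HOME/mom_priv"
cat > "$TORQUE_HOME/mom_priv/config" <<'EOF'
$pbsserver server2.example.com
$logevent 255
EOF
cat "$TORQUE_HOME/mom_priv/config"
```

$pbsserver tells the MOM which host runs pbs_server; $logevent 255 logs all event classes.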
####Run jobs through torque as a non-root user
su - xili
mpdboot -n 2 -f mpd.hosts #-n gives the number of compute nodes to start
mpdtrace
vim job1.pbs (serial job)
#!/bin/sh
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
cd /home/wxh
echo Running on host `hostname`
echo Time is `date`
echo $PBS_NODEFILE
echo This job has allocated 1 node
mpiexec -n 4 /home/xili/prog
vim job2.pbs (parallel job)
#!/bin/sh
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
#PBS -l nodes=2
cd /home/wxh
echo Time is `date`
echo Directory is $PWD
echo This job runs on the following nodes:
cat $PBS_NODEFILE
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
mpiexec -machinefile $PBS_NODEFILE -np $NPROCS /home/xili/prog
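The batch system hands the script its node allocation through $PBS_NODEFILE, one line per allocated node, which is why `wc -l` yields the process count. A sketch with a fabricated nodefile, since outside a real job the variable is unset:

```shell
# Simulate job2.pbs's node-count logic with a hypothetical $PBS_NODEFILE
PBS_NODEFILE=$(mktemp)
printf 'server3.example.com\nserver4.example.com\n' > "$PBS_NODEFILE"
NPROCS=$(wc -l < "$PBS_NODEFILE")   # one line per allocated node
echo "NPROCS=$NPROCS"
```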
vim prog
#!/bin/sh
echo 100000000 | ./icpi #icpi ships with mpi's examples; just copy it over
chmod +x prog
qsub job1.pbs #submit the job
qstat #list jobs
pbsnodes #list nodes
####Test results
[wxh@server2 ~]$ qsub job1.pbs
7.server2.example.com
[wxh@server2 ~]$ qstat
Job id           Name         User    Time Use S Queue
---------------- ------------ ------- -------- - -----
7.server2        job_name     wxh     0        R batch
[wxh@server2 ~]$ pbsnodes
station11
     state = job-exclusive    (only one node is computing)
     np = 1
     ntype = cluster
     jobs = 0/7.server2.example.com
     status = rectime=1299203609,varattr=,jobs=,state=free,netload=305533,gres=,loadave=0.01908,uname=Linux station11.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

station12
     state = free
     np = 1
     ntype = cluster
     status = rectime=1299203603,varattr=,jobs=,state=free,netload=248091,gres=,loadave=0.00,uname=Linux station12.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
[wxh@server2 ~]$ qsub job2.pbs
8.server2.example.com
[wxh@server2 ~]$ qstat
Job id           Name         User    Time Use S Queue
---------------- ------------ ------- -------- - -----
7.server2        job_name     wxh     00:00:31 C batch
8.server2        job_name     wxh     0        R batch
[wxh@server2 ~]$ pbsnodes
station11
     state = job-exclusive    (both nodes are computing)
     np = 1
     ntype = cluster
     jobs = 0/8.server2.example.com
     status = rectime=1299203918,varattr=,jobs=,state=free,netload=422738,gres=,loadave=0.11908,uname=Linux station11.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

station12
     state = job-exclusive
     np = 1
     ntype = cluster
     jobs = 0/8.server2.example.com
     status = rectime=1299203924,varattr=,jobs=5.server2.example.com,state=free,netload=3609,uname=Linux station12.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

If you spot any mistakes while reading, or have any suggestions, you are welcome to mail yungho@yeah.net so we can learn from each other.