Category: LINUX

2012-03-04 11:33:50

MPICH2 high-performance computing and Torque job scheduling

System environment: RHEL 6.0 x86-64, iptables and SELinux off
Hosts: server2.example.com (scheduling node)
       server3.example.com (compute node)
       server4.example.com (compute node)

Note: for the parallel-computing part alone you do not need to designate a scheduling node, because the nodes are peers.

yum install gcc gcc-c++ nfs-utils -y

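Note: keep the clocks on all nodes roughly in sync.
# Set up SSH trust between the nodes: create the following user on every node and generate an SSH key pair in that user's home directory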
useradd -u 544 xili           # the account whose home directory (/home/xili) is used in the rest of this post
ssh-keygen                    # just press Enter at every prompt
Export /home/xili/ over NFS:
vim /etc/exports
/home/xili    *(rw,anonuid=544,anongid=544)

showmount -e 192.168.0.2

#### Mount the NFS share on all the other nodes
mount 192.168.0.2:/home/xili  /home/xili
vim /etc/fstab
192.168.0.2:/home/xili  /home/xili  nfs  defaults  0 0

#### Copy the SSH public key to the other nodes
ssh-copy-id -i .ssh/id_rsa.pub server3.example.com

         # The result: passwordless login from any node, although the very first connection still asks for the password
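
The key has to reach every node, and it helps to confirm that key-based login really works afterwards. A minimal sketch of such a loop, assuming the hostnames used in this post; DRY_RUN and the run helper are hypothetical names introduced here, and by default the commands are only printed:

```shell
#!/bin/sh
# Hostnames taken from this post; adjust them to your cluster.
NODES="server3.example.com server4.example.com"
# DRY_RUN is a hypothetical switch for this sketch: 1 prints, 0 executes.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

distribute_key() {
    for node in $NODES; do
        # Install the public key on the remote node.
        run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$node"
        # BatchMode makes ssh fail instead of prompting, so this checks key login.
        run ssh -o BatchMode=yes "$node" true
    done
}

distribute_key
```

Run it once with the default settings to review the commands, then with DRY_RUN=0 to execute them.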

----> Install the software on every compute node
yum install mpich2.x86_64 -y

Create the hidden file .mpd.conf in the home directory of the user added above
vim .mpd.conf
secretword=hello                 # hello is the password for user xili

chmod 600 .mpd.conf
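
mpd is strict about this file being private to its owner, which is why the chmod 600 step matters. A small sketch that creates the file with owner-only permissions from the start; CONF_DIR is a hypothetical stand-in for the home directory so that a real ~/.mpd.conf is not overwritten:

```shell
#!/bin/sh
# CONF_DIR stands in for the user's home directory (an assumption of this
# sketch); point it at $HOME to create the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/.mpd.conf"

# umask 077 in a subshell: the file is owner-only from the moment it exists.
(
    umask 077
    printf 'secretword=%s\n' "hello" > "$CONF_FILE"
)
chmod 600 "$CONF_FILE"

# Show the resulting mode and owner.
ls -l "$CONF_FILE"
```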

vim mpd.hosts
server2.example.com
server3.example.com

#### Local test
mpd &      # start mpich2 (the mpd daemon)

mpdtrace   # list the machines that have started

mpdallexit # shut everything down

----> Run the multi-node cluster
mpdboot -n 2 -f mpd.hosts
Note: -n 2 sets the number of machines to start; -f mpd.hosts names the file that lists the hosts to start them on
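
The value passed to -n has to stay in step with the host list. A minimal sketch that derives it from the file instead of hard-coding it; HOSTS_FILE is a hypothetical stand-in for mpd.hosts, and mpdboot is only invoked where it is installed:

```shell
#!/bin/sh
# HOSTS_FILE stands in for mpd.hosts (an assumption of this sketch).
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"
cat > "$HOSTS_FILE" <<'EOF'
server2.example.com
server3.example.com
EOF

# Count the non-empty lines: one mpd per listed host.
NHOSTS=$(grep -c . "$HOSTS_FILE")
echo "hosts listed: $NHOSTS"

# Start the ring only where mpdboot exists; otherwise show the command.
if command -v mpdboot >/dev/null 2>&1; then
    mpdboot -n "$NHOSTS" -f "$HOSTS_FILE" && mpdtrace
else
    echo "would run: mpdboot -n $NHOSTS -f $HOSTS_FILE"
fi
```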

mpdtrace
station11
station12
mpdallexit


----> Run the MPICH pi test program
The examples directory of the mpich2 package contains icpi.c, the source for a pi calculation; compile it into an executable first

mpicc icpi.c -o icpi-64

#### Single-machine test
./icpi-64
Enter the number of intervals:(0 quits) 1000000000
pi is approximately 3.1415926535921401,Error is 0.0000000000023470
wall clock time = 46.571311

Enter the number of intervals:(0 quits) 10000
pi is approximately 3.1415926535921401,Error is 0.00000000008333410
wall clock time = 0.000542

Enter the number of intervals:(0 quits) 0    # 0 quits
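
For orientation, the MPICH pi example approximates the integral of 4/(1+x^2) over [0,1] with the midpoint rule and divides the intervals among the processes. The sketch below does the same arithmetic in a single process with awk. It is an illustration only, not the icpi source, and pi_midpoint is a name made up here:

```shell
#!/bin/sh
# Midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0,1].
# $1 is the number of intervals, as typed at the icpi prompt.
pi_midpoint() {
    awk -v n="$1" 'BEGIN {
        h = 1.0 / n
        sum = 0.0
        for (i = 1; i <= n; i++) {
            x = h * (i - 0.5)          # midpoint of the i-th interval
            sum += 4.0 / (1.0 + x * x)
        }
        printf "%.16f\n", h * sum
    }'
}

pi_midpoint 10000
```

More intervals give a closer estimate at the cost of a longer loop, which is the work that MPI spreads across nodes.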

#### Cluster test
mpdboot -n 2 -f mpd.hosts
mpiexec -n 2 /home/xili/icpi-64
Enter the number of intervals:(0 quits) 1000000000
pi is approximately 3.1415926535921401,Error is 0.00000000000001830
wall clock time = 15.530082

Enter the number of intervals:(0 quits) 10000
pi is approximately 3.1415926535921401,Error is 0.00000000008333392
wall clock time = 0.006318

Enter the number of intervals:(0 quits) 0
mpdallexit
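
To compare the two runs, speedup is the single-machine time divided by the cluster time, and efficiency is the speedup divided by the number of processes. A small sketch using the wall-clock figures reported above for 1000000000 intervals:

```shell
#!/bin/sh
# Wall-clock times copied from the two runs above (1000000000 intervals).
T_SINGLE=46.571311
T_CLUSTER=15.530082
NPROCS=2        # value passed to mpiexec -n

SPEEDUP=$(awk -v a="$T_SINGLE" -v b="$T_CLUSTER" 'BEGIN { printf "%.2f", a / b }')
EFFICIENCY=$(awk -v s="$SPEEDUP" -v n="$NPROCS" 'BEGIN { printf "%.2f", s / n }')
echo "speedup: $SPEEDUP, efficiency: $EFFICIENCY"
```

A speedup larger than the process count is unusual and may mean the two runs were not made under identical conditions, so treat the figures as indicative only.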

************************************************************************************************
----> Torque job scheduling system
#### Install the software on the server node
tar -zxf torque-3.0.0.tar.gz
cd torque-3.0.0
./configure --with-scp --with-default-server=server2.example.com
make
make install    # Torque's configuration files live in /var/spool/torque/
make packages
     # this places some Torque modules under /usr/local/lib/
ldconfig -n /usr/local/lib

cp ~/torque-3.0.0/contrib/init.d/pbs_server /etc/init.d/
cp ~/torque-3.0.0/contrib/init.d/pbs_sched /etc/init.d/
scp ~/torque-3.0.0/contrib/init.d/pbs_mom server3.example.com:/etc/init.d/
scp ~/torque-3.0.0/contrib/init.d/pbs_mom server4.example.com:/etc/init.d/

From the extracted source directory, copy these two installer packages to the compute nodes server3.example.com and server4.example.com:
torque-package-clients-linux-x86_64.sh
torque-package-mom-linux-x86_64.sh

chmod +x torque-package-*
Run the scripts to install:
./torque-package-clients-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install

#### Set up the administrator account on the server node
./torque.setup root

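cat /var/spool/torque/server_name         # shows the hostname of the scheduling server node
vim /var/spool/torque/server_priv/nodes   # list the compute nodes here, one per line
server3.example.com
server4.example.com
....(add more as needed)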
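
Torque's server keeps its list of compute nodes in server_priv/nodes under its spool directory, one host per line with an optional np=<slots> attribute. A minimal sketch that generates such a list; NODES_FILE is a hypothetical stand-in path so the sketch can be tried without touching a live server:

```shell
#!/bin/sh
# NODES_FILE stands in for /var/spool/torque/server_priv/nodes
# (an assumption of this sketch).
NODES_FILE="${NODES_FILE:-$(mktemp)}"

: > "$NODES_FILE"
for node in server3.example.com server4.example.com; do
    # np is the number of processor slots Torque may schedule on the node.
    echo "$node np=1" >> "$NODES_FILE"
done

cat "$NODES_FILE"
```

pbs_server reads this file at startup, so restart it after changing the list.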

qterm -t quick               # stop Torque
service pbs_server start     # start Torque
service pbs_sched start      # start Torque's bundled scheduler

#### On the compute nodes
ldconfig -n /usr/local/lib          # make the Torque modules installed on the compute node take effect
vim /var/spool/torque/mom_priv/config
$pbsserver server2.example.com
$logevent 255

service pbs_mom start        # start the MOM daemon

#### Submit Torque jobs as a non-root user
su - xili
mpdboot -n 2 -f mpd.hosts   # -n is the number of compute nodes to start
mpdtrace

vim job1.pbs    (serial job)
#!/bin/sh
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch

cd /home/wxh
echo Running on hosts `hostname`
echo Time is `date`
echo $PBS_NODEFILE
echo This job has allocated 1 node
mpiexec -n 4 /home/xili/prog

vim job2.pbs    (parallel job)
#!/bin/sh
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
#PBS -l nodes=2
cd /home/wxh
echo Time is `date`
echo Directory is $PWD
echo This job runs on the following nodes:
cat $PBS_NODEFILE
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
mpiexec -machinefile $PBS_NODEFILE -np $NPROCS /home/xili/prog
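
Note that wc -l on $PBS_NODEFILE counts processor slots rather than distinct hosts, because Torque writes one line per allocated slot. A small sketch on a made-up node file, as it might look for a hypothetical request of nodes=2:ppn=2, shows the difference:

```shell
#!/bin/sh
# Sample node file; inside a real job Torque provides $PBS_NODEFILE itself.
# The contents below are hypothetical (2 hosts with 2 slots each).
NODEFILE=$(mktemp)
cat > "$NODEFILE" <<'EOF'
server3.example.com
server3.example.com
server4.example.com
server4.example.com
EOF

# One line per slot: this is the value to hand to mpiexec -np.
NPROCS=$(wc -l < "$NODEFILE" | tr -d ' ')
# Distinct hostnames: the number of machines actually involved.
NNODES=$(sort -u "$NODEFILE" | wc -l | tr -d ' ')

echo "process slots: $NPROCS, distinct hosts: $NNODES"
```

With nodes=2 and one slot per node, as in job2.pbs, the two numbers coincide.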


vim prog
#!/bin/sh
echo 100000000 | ./icpi      # icpi ships with MPI; just copy it here

chmod +x prog

qsub job1.pbs             # submit a job (job1.pbs or job2.pbs)
qstat                     # list jobs
pbsnodes                  # list nodes
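
qsub prints the id of the new job, for example 7.server2.example.com, and that id can be used to follow the job. A minimal sketch that submits a job and polls qstat until it finishes; job_number and wait_for_job are helper names made up here, and the state is assumed to be the fifth column of the default qstat listing:

```shell
#!/bin/sh
# Strip the server suffix from an id such as 7.server2.example.com.
job_number() {
    echo "${1%%.*}"
}

# Poll until the job is no longer listed or shows state C (completed).
wait_for_job() {
    while qstat "$1" 2>/dev/null |
        awk -v id="$(job_number "$1")" \
            '$1 ~ "^" id "\\." && $5 != "C" { found = 1 } END { exit !found }'
    do
        sleep 5
    done
}

# Only submit where Torque is installed and the job script exists.
if command -v qsub >/dev/null 2>&1 && [ -f job1.pbs ]; then
    jobid=$(qsub job1.pbs)
    echo "submitted $jobid (job number $(job_number "$jobid"))"
    wait_for_job "$jobid"
fi
```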

#### Test results
[wxh@server2 ~]$ qsub job1.pbs
7.server2.example.com
[wxh@server2 ~]$ qstat
Job id          Name        User    Time Use  S  Queue
--------------------------------------------------
7.server2       job_name    wxh     0         R  batch

[wxh@server2 ~]$ pbsnodes
station11
     state = job-exclusive             (only one node is computing)
     np = 1
     ntype = cluster
     jobs = 0/7.server2.example.com
     status = rectime=1299203609,varattr=,jobs=,state=free,netload=305533,gres=,loadave=0.0
     1908,uname=Linux station11.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
station12
     state = free
     np = 1
     ntype = cluster
     status = rectime=1299203603,varattr=,jobs=,state=free,netload=248091,gres=,loadave=0.00 station12.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
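
For a quick overview it is enough to pull the state of each node out of the pbsnodes listing. A minimal sketch, assuming the usual layout in which node names start in the first column and attributes are indented beneath them; node_states is a helper name made up here:

```shell
#!/bin/sh
# Print "node state" pairs from pbsnodes-style text on stdin.
node_states() {
    awk '
        /^[^ \t]/         { node = $1 }        # unindented line: node name
        /^[ \t]+state = / { print node, $3 }   # indented "state = ..." line
    '
}

# Query the live cluster only where pbsnodes is installed.
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -a | node_states
fi
```

The pattern requires spaces around the equals sign, so the state=free field embedded in the status line is not picked up by mistake.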

[wxh@server2 ~]$ qsub job2.pbs
8.server2.example.com
[wxh@server2 ~]$ qstat
Job id          Name        User    Time Use  S  Queue
----------------------------------------------------
7.server2       job_name    wxh     00:00:31  C  batch
8.server2       job_name    wxh     0         R  batch

[wxh@server2 ~]$ pbsnodes
station11
     state = job-exclusive             (both nodes are computing)
     np = 1
     ntype = cluster
     jobs = 0/8.server1.example.com
     status = rectime=1299203918,varattr=,jobs=,state=free,netload=422738,gres=,loadave=0.1
     1908,uname=Linux station11.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
station12
     state = job-exclusive
     np = 1
     ntype = cluster
     jobs = 0/8.server2.example.com
     status = rectime=1299203924,varattr=,jobs=5.server2.example.com,state=free,netload=3609
     station12.example.com 2.6.18-164.el5xen #1 SMP Tue Aug 18 16:06:30 EDT 2009 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0


If you spot any errors while reading, or have any suggestions,
feel free to email yungho@yeah.net
so we can learn from each other.

