Category: High-Performance Computing

2015-03-26 20:20:20

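Problem:
    When using dosxyz together with egsnrc, after submitting a large number of parallel jobs, the jobs finished and exited very quickly without actually running anything. The log shows an error saying the lock file could not be accessed, and the lock file in question still exists after the jobs have ended.
    The explanation given by the EGS development team was: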

we run into this problem ourselves as our cluster grows. The lock file is a poor man's job control mechanism, and for a large number of concurrent processes it is prone to create a race condition on the poor file! Typically the code will retry for a little while (12 seconds I think) and then give up. One way around this is to add an additional delay between launching the individual processes, to mitigate the bottleneck.
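
The delay described in the reply can be illustrated with a minimal Python sketch. It is only an illustration of staggered launching in general, not how the EGSnrc batch scripts implement it; the function name `launch_staggered` and its parameters are hypothetical.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def launch_staggered(
    commands: Sequence[Sequence[str]],
    delay_seconds: float,
    start: Callable = subprocess.Popen,
    sleep: Callable[[float], None] = time.sleep,
) -> List:
    """Start each command in turn, pausing between consecutive launches.

    Hypothetical helper: `start` and `sleep` are injectable so the
    function can be exercised without spawning real processes.
    """
    handles = []
    for index, command in enumerate(commands):
        # No pause before the first job; every later job waits first,
        # so the processes do not all hit the lock file at once.
        if index > 0:
            sleep(delay_seconds)
        handles.append(start(list(command)))
    return handles
```

Calling `launch_staggered(job_commands, delay_seconds=10)` with a list of command lines would start them ten seconds apart instead of all at the same instant.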

 

Note that if your computers have uniform performance, you can also split your number of histories at the outset and do away with the lock file entirely, but you'll have to make your own script to rename output files and recombine results at the end, etc. We have it on our list to include this alternative method at some point...
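
As a rough sketch of the alternative mentioned above, the total number of histories could be divided evenly up front. The helper `split_histories` is hypothetical and covers only the arithmetic; renaming output files and recombining results would still need a separate script, and each job would also need its own random number seeds so that the runs are statistically independent.

```python
from typing import List


def split_histories(total_histories: int, n_jobs: int) -> List[int]:
    """Divide a history count into n_jobs near-equal shares.

    Hypothetical helper: the shares always sum to total_histories,
    and any remainder is spread over the first few jobs.
    """
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    if total_histories < 0:
        raise ValueError("total_histories must be non-negative")
    base, remainder = divmod(total_histories, n_jobs)
    # The first `remainder` jobs each take one extra history.
    return [base + 1 if i < remainder else base for i in range(n_jobs)]
```

For example, `split_histories(10, 3)` gives the three jobs 4, 3 and 3 histories respectively.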


Fix:

File: HEN_HOUSE/scripts/batch_option.pbs

Change the following line to a higher number (say > 5)

Before: batch_sleep_time=1

After: batch_sleep_time=10

