Category: High-Performance Computing
2015-03-26 20:20:20
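Original post: "Parallel job submission fails when using DOSXYZ with EGSnrc", by myching
Problem:
When DOSXYZ is used together with EGSnrc and a large number of parallel jobs are submitted, the jobs exit almost immediately without actually running. The log reports that the lock file cannot be accessed, yet the lock file is still present after the jobs have exited.
The explanation from the EGS development team: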
We run into this problem ourselves as our cluster grows. The lock file is a poor man's job control mechanism, and for a large number of concurrent processes it is prone to create a race condition on the poor file! Typically the code will retry for a little while (12 seconds I think) and then give up. One way around this is to add an additional delay between launching the individual processes, to mitigate the bottleneck.
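To illustrate the "additional delay between launching the individual processes" idea, here is a minimal Python sketch. It is not the EGSnrc batch script, only a generic launcher; the function name `launch_staggered` and its parameters are made up for this example, and the commands passed in are whatever you would normally run for each job.

```python
import subprocess
import time


def launch_staggered(commands, delay_s=10.0, spawn=subprocess.Popen, sleep=time.sleep):
    """Start each command in turn, pausing delay_s seconds between launches.

    commands : list of argument lists, one per job (supplied by the caller)
    delay_s  : pause between consecutive launches, in seconds
    spawn    : callable that starts a job (defaults to subprocess.Popen)
    sleep    : callable that waits (defaults to time.sleep)
    """
    handles = []
    for index, command in enumerate(commands):
        if index > 0:
            # Space out the launches so the jobs do not all reach
            # the shared lock file at the same moment.
            sleep(delay_s)
        handles.append(spawn(command))
    return handles
```

Calling `launch_staggered(job_commands, delay_s=10)` and then `wait()` on each returned handle would run the jobs with a 10 second gap between starts. The `spawn` and `sleep` parameters exist only so the function can be exercised without starting real processes.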
Note that if your computers have uniform performance, you can also split your number of histories at the outset and do away with the lock file entirely, but you'll have to make your own script to rename output files and recombine results at the end, etc. We have it on our list to include this alternative method at some point...
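The "split your number of histories at the outset" alternative comes down to simple arithmetic, sketched below in Python. The helper name `split_histories` is hypothetical. The sketch covers only the division step, assuming machines of equal speed as the reply says; writing per-job input files, renaming outputs and recombining results are left out. Independent jobs would also each need their own random number seeds, which is not shown here.

```python
def split_histories(total_histories, n_jobs):
    """Divide total_histories into n_jobs near-equal shares.

    The shares differ by at most one history and always sum to
    total_histories, so no histories are lost or duplicated.
    """
    if n_jobs <= 0:
        raise ValueError("n_jobs must be a positive integer")
    if total_histories < 0:
        raise ValueError("total_histories must not be negative")
    base, extra = divmod(total_histories, n_jobs)
    # The first `extra` jobs take one additional history each.
    return [base + 1 if job < extra else base for job in range(n_jobs)]
```

For example, `split_histories(10, 3)` gives `[4, 3, 3]`. Each share would go into the history count of one job's input, and no job would need to consult a shared lock file.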
Workaround: change the following line to a higher value (say, greater than 5):
Before: batch_sleep_time=1
After: batch_sleep_time=10