2011-05-04 23:12:37

Performance of Parallel VASP Jobs

VASP Flags Affecting Parallel Runtime

The primary purpose of the Wolfgang cluster is to run first-principles (ab initio) calculations, primarily using the VASP (Vienna Ab-initio Simulation Package) program. Many choices that affect a calculation's runtime can be made in VASP, so this page attempts to explain them for VASP users. This page only offers a cursory overview of how these options affect performance on Wolfgang, and is not a replacement for the VASP manual, which should be consulted for more in-depth information about these options.

The four main flags which affect the runtime for parallel jobs (and don't change the results) are:

  • IALGO
  • LPLANE
  • NSIM
  • NPAR

Of these, IALGO and NPAR are the most important. The flag LPLANE controls data distribution, and setting LPLANE = .TRUE. generally improves performance. NSIM controls whether vector-matrix or matrix-matrix operations are used when working with the real-space non-local projection operators if IALGO = 48, and gives a slight improvement when NSIM = 4. With 4 k-points, 40,000 plane waves, and 237 bands:

          NSIM=1      NSIM=4
LREAL=F   8978 sec    8241 sec
LREAL=T   5075 sec    3802 sec

Notice that NSIM has little effect with LREAL=F, but makes a substantial difference when real-space projection operators are used (LREAL=T).
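
As a concrete illustration, the faster combination in this table corresponds to INCAR lines like the following (a minimal sketch; all other tags are omitted, and the values should be adapted to your own system):

    LPLANE = .TRUE.    ! distribute plane-wave data by planes
    NSIM   = 4         ! use blocked (matrix-matrix) real-space projection
    LREAL  = .TRUE.    ! evaluate projection operators in real space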

The flag which influences the runtime the most is IALGO, which selects the core diagonalization algorithm. For all but small systems, the method of choice is RMM-DIIS (IALGO = 48), which is a good 10% faster than the blocked Davidson algorithm (IALGO = 38) for medium-sized systems, and even better for large systems.

After selecting the algorithm, the next most important flag is NPAR. NPAR should generally be set to NPAR = 1; the default is NPAR = the number of CPUs. It is particularly important to set NPAR = 1 if you use the Davidson algorithm (IALGO = 38). The Davidson algorithm will work with higher values of NPAR, but the performance becomes increasingly poor (a good 50% worse at NPAR = the number of CPUs). For the RMM-DIIS algorithm, the setting of NPAR is not as critical, but generally better performance (about 5% to 10%) is observed on Wolfgang with NPAR = 2. Note also that the memory requirements increase as NPAR is increased. The table below was made for the initial ionic step (40 electronic steps) of a 64-atom HgTe supercell with 4 k-points, 40,700 plane waves at each point, and 237 bands (IALGO=48, NSIM=4, 8 tasks).

        Process distribution
NPAR    1 per node    2 per node
        Time (sec)    Time (sec)
1       8241          8368
2       7811          7451
4       10872         10255
8       13481         13647
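
Putting the recommendations so far together, a reasonable starting point for a medium or large job on Wolfgang would be INCAR lines like these (a sketch, not a definitive prescription; test on your own system):

    IALGO  = 48        ! RMM-DIIS
    NPAR   = 2         ! about 5-10% faster than NPAR = 1 for RMM-DIIS here
    LPLANE = .TRUE.
    NSIM   = 4

If you use the blocked Davidson algorithm (IALGO = 38) instead, set NPAR = 1.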

Selecting the Number of CPUs

The most important aspect of running VASP on Wolfgang is that the runtime does not scale linearly with the number of CPUs; it scales a little worse than linearly. The exact scaling depends on the particulars of the job, but there is an initial jump of about 20% in the total CPU hours used when going from 2 CPUs to 4 CPUs. Another way to say this is that the wall time for 4 CPUs is 60% of the wall time for 2 CPUs, instead of the linear 50%. Subsequent jumps in the total CPU hours seem to be about 4% for every 2 extra CPUs, so using 8 CPUs would increase the total CPU hours by 20% + 2 * 4% = 28%. In other words, 8 CPUs would give a wall time of 32% of the 2-CPU wall time instead of the linear 25%, which is about 53% of the 4-CPU wall time instead of 50%.
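
To summarize the example numbers above (everything relative to the same job on 2 CPUs; the exact figures depend on the job):

    CPUs    Wall time    Total CPU hours
    2       100%         100%
    4        60%         120%
    8        32%         128%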

This non-linear scaling is primarily the result of using gigabit Ethernet as the interconnect between the dual-CPU nodes. On the 2x2 nodes, there is a performance penalty for running all of the cores at the same time, caused by bottlenecks in memory access and in the CPU internals. The penalty for using larger numbers of compute nodes is lower, because the communications become more distributed and there are performance gains from reducing the problem size on each CPU. But on Wolfgang, you will always consume more CPU hours as you use more CPUs, although your result will arrive in less time.

So what does this non-linear scaling mean when submitting jobs? Say you have two equal jobs to run, each taking 20 hours on two processors. You have two choices: run them concurrently on 2 processors each, or sequentially on 4 processors. Run concurrently, the jobs will complete in 20 hours. Run sequentially, each job will take 12 hours, so both will be complete in 24 hours. If you're pushing a deadline, that's 4 extra hours out of a working day. It also means 16 fewer CPU hours (4 hours on 4 CPUs) for the other people using Wolfgang. Another thing to consider is that when you submit the jobs concurrently, they both start at once; if the jobs are sequential, you may find that somebody else's job starts before your second job does.

All of this does not mean that you shouldn't use a lot of CPUs, just that you should balance time against CPU usage. In other words, think about when you need the results, and choose a number of CPUs per job that will let you finish by that time, with a preference for minimizing the number of CPUs per job. This keeps your total CPU usage down and ensures that other users can run their tasks as well. Another consideration is the memory requirement of your jobs. Each node has only 2GB of memory per CPU core; as you increase the number of CPUs, the memory required per CPU is reduced.

Other Flags Affecting Runtime (and possibly results)

The vast majority of the CPU time consumed by VASP is in the diagonalization of the bands. This time is proportional to the number of bands times the square of the number of plane waves (NBANDS * NPLNW^2). Generally, the number of bands and the number of plane waves grow with the number of electrons, so the runtime grows as the number of electrons cubed (NELECT^3). If you use more bands than are needed, the runtime increases proportionally. The number of plane waves is also proportional to the energy cutoff to the 3/2 power (NPLNW ∝ ENCUT^(3/2)), which means that the runtime is proportional to the cutoff cubed (ENCUT^3). This means that a cutoff of 300 eV will take (300/250)^3 = 1.2^3 ≈ 1.73 times as long to run as a cutoff of 250 eV. You should use a cutoff appropriate for your calculation; the maximum cutoff is not always the best, or necessary to attain reasonable results. For a relaxation, or a slowly converging system, it may be better to do an initial calculation with a low cutoff, and then a polishing calculation with a higher cutoff.
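
For example, the low-then-high cutoff approach amounts to two consecutive runs whose INCAR files differ only in the cutoff (illustrative values; all other tags omitted):

    ENCUT = 250        ! run 1: fast initial relaxation at a low cutoff
    ENCUT = 300        ! run 2: polishing calculation at a higher cutoff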

There is one caveat to this: the RMM-DIIS algorithm needs some extra bands to increase the rate at which it converges. In this case, there is a trade-off between the extra time per minimization step for each extra band and the reduction in the number of minimization steps. The VASP manual explains this in more depth and gives a general rule of thumb for the number of bands.

Another tag to investigate is LREAL. This tag controls how the energy is calculated for the non-local part of the pseudopotential. This calculation can be done in real space or in reciprocal (k) space. In reciprocal space, this calculation scales with the number of plane waves (∝ ENCUT^(3/2) * a1 * a2 * a3). In real space, this calculation is done once per species and is independent of the system size. The crossover in performance occurs around 25 atoms. For large systems, you should use LREAL = A or LREAL = T whenever possible.

The runtime is also linear in the number of [irreducible] k-points. Of course, the primary factor in determining ENCUT, NBANDS, and your k-point set is the accuracy you need in your calculations. The number of electrons is also determined by pseudopotential selection, which again comes down to the accuracy you need. The only way to determine whether you have the required accuracy is testing. Start with the predicted settings and then run a set of test calculations with different settings to see if you can get adequate accuracy with shorter runtimes. Of course, this is only practical if you will use the same basic system for multiple calculations. For a system you will use only a few times, you can frequently get a good idea of the settings to use by testing with a small cell of the bulk material and extrapolating.
