Access
In order to access our HPC cluster, you need an account on our local system (separate from your URZ login). Students, please ask your supervisor to contact the administrator.
Topology and general information
We are currently providing the following partitions:
Partition | #Nodes | #Cores / Node | RAM / Core | Instruction Set
---|---|---|---|---
cuda* | 8 | 24 | ~5 GB | SSE1-4a
chrom | 12 | 24 | ~4 GB | SSE1-4.2, AVX
sulfur | 2 | 16 | ~4 GB | SSE1-4.2, AVX
calcium | 4 | 20 | ~5 GB | SSE1-4.2, AVX, AVX2, FMA3
magnesium | 1 | 12 | ~5 GB | SSE1-4.2, AVX, AVX2, FMA3
*cuda nodes have an additional NVIDIA GeForce GTX 580 GPU with 512 CUDA cores and 1536 MB GDDR5 memory.
For more detailed information you may run one of the following commands:
- sinfo / sinfo --long -N
- listnodes / listnodes -l
Submitting jobs
In order to run a calculation on our cluster, you must be connected to the login node. This is possible exclusively from within the TUBAF IP range, i.e. from a 139.20.xx.xxx address, and with a valid user account on our infrastructure. For allocating HPC resources, you have two options:
- ask for an interactive login with
# salloc -p $PARTITION -n $NCORES srun --pty /bin/bash
where you have to insert a valid partition name and choose a valid number of cores
- submit a job script with
# sbatch jobscript
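For example, an interactive session with 24 cores on the chrom partition (partition name and core count chosen purely for illustration) could be requested with
# salloc -p chrom -n 24 srun --pty /bin/bash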
Job scripts contain information on the nodes you want to allocate, the maximum time you allow your job to run, and settings for the job name and email notifications. A simple MPI-only job script looks like this:
example slurm job file
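A minimal sketch of such a script (partition, node/core counts, email address, and binary name are placeholders and need to be adapted to your job):

```
#!/bin/bash
#SBATCH --job-name=mpi_example        # job name shown in the queue
#SBATCH --partition=chrom             # partition to run on
#SBATCH --nodes=2                     # number of nodes
#SBATCH --ntasks-per-node=24          # one MPI rank per core (MPI-only)
#SBATCH --time=01:00:00               # maximum run time (hh:mm:ss)
#SBATCH --mail-type=END,FAIL          # email notifications
#SBATCH --mail-user=user@example.com  # placeholder address

mpirun ./my_mpi_binary                # Slurm supplies the rank count
```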
For hybrid (MPI+OpenMP) jobs, make sure you do not oversubscribe nodes, i.e. do not use more ranks x threads in total than the number of CPU cores on a node (e.g. on a 24-core node, 2 ranks x 12 threads is fine, while 2 ranks x 16 threads oversubscribes). The number of OpenMP threads has to be set via the --cpus-per-task option.
Note that you must use mpirun, NOT srun, and that you do not need to specify the number of ranks via 'mpirun -np N'; Slurm handles this automatically using the specifications from the --ntasks-per-node and --nodes parameters.
For a more detailed explanation, please consult the ZIH wiki:
- https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/Slurm
Compiling and software access
TODO: introduce lmod system [coming soon...]
Architecture-specific compiler flags / vectorization
In case you want to run your jobs only on a certain partition, e.g. only on chrom, you should use architecture-specific optimization flags for gcc. For this example, we can utilize the AVX instruction set of the AMD Opteron CPUs via -march=bdver1 -mavx, resulting in a performance gain of roughly 17% for Elk TD-DFT calculations. You can find the correct flags using the listnodes -l command.
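As a minimal sketch (source and output file names are placeholders):

```
# optimize for the Bulldozer-based 'chrom' nodes
gcc -O2 -march=bdver1 -mavx -o mycode.out mycode.c
```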
Software recommendations
Elk-LAPW
The current version 5.2.14 cannot be run node-spanning and MPI-only, e.g. with 40 ranks spread over calcium[01-02]. The best scaling that can be reached on our cluster is achieved by using 1 MPI rank per socket and a number of OpenMP threads equal to the number of cores per CPU. This seems to work across multiple nodes as well.
Hybrid example options for 2 'chrom' nodes with 1 rank per socket and 12 cores/CPU (= 24 cores/node):
hybrid job script (excerpt)
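A sketch of the relevant options for this setup (the Elk binary path is a placeholder):

```
#SBATCH --nodes=2                 # two chrom nodes
#SBATCH --ntasks-per-node=2       # one MPI rank per socket
#SBATCH --cpus-per-task=12        # one OpenMP thread per core of each CPU

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./elk                      # rank count is taken from the Slurm settings
```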
CUDA-capable nodes
On the cuda nodes, you will find the nvcc compiler and the nvidia-cuda-toolkit installed. You may use nvcc just like a conventional compiler, e.g. 'nvcc cudasrc.cu -o cuda.out'.
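For instance, compiling and then running on a node of the cuda partition might look like this (file names chosen purely for illustration):
# nvcc cudasrc.cu -o cuda.out
# srun -p cuda -n 1 ./cuda.out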
Titan Black @ Sirius