HPC Cluster Guide

Access

In order to access our HPC cluster, you need an account on our local system (separate from your URZ login). Students, please ask your supervisor to contact the administrator.

Topology and general information

We are currently providing the following Slurm partitions:

Partition  | #Nodes | #Cores / Node | RAM / Core | Instruction Set
fquad      |      4 |             8 | ~2 GB      | SSE1-3
darwin     |      4 |             8 | ~3 GB      | SSE1-4.2
cuda*      |      8 |            24 | ~5 GB      | SSE1-4a
chrom      |     12 |            24 | ~4 GB      | SSE1-4.2, AVX
sulfur     |      2 |            16 | ~4 GB      | SSE1-4.2, AVX
calcium    |      4 |            20 | ~5 GB      | SSE1-4.2, AVX, AVX2, FMA3
magnesium  |      1 |            12 | ~5 GB      | SSE1-4.2, AVX, AVX2, FMA3
ADDE**     |    210 |             8 | ~3 GB      | SSE1-4.2
*CUDA nodes have an additional NVIDIA GeForce GTX 580 GPU with 512 CUDA cores and 1536 MB GDDR5
**currently not available
For more detailed information, you may run one of the following commands (example invocations follow the list):
  • sinfo / sinfo --long -N
  • listnodes / listnodes -l
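For example, the following sinfo calls show one line per partition with node count, CPUs per node and memory per node, and a node-by-node listing of a single partition (the format string and the choice of the chrom partition are only a sketch using standard sinfo options):

sinfo -o "%P %D %c %m"
sinfo --long -N -p chrom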

Submitting jobs

In order to run a calculation on our cluster, you must be connected to the login node. This is possible only from within the TUBAF IP range, i.e. from a 139.20.xx.xxx address, and requires a valid user account on our infrastructure. For allocating HPC resources, you have two options (concrete examples follow the list):

  • ask for an interactive login with
    # salloc -p $PARTITION -n $NCORES srun --pty /bin/bash
    where you insert a valid partition name and choose a valid number of cores
  • submit a job script with
    # sbatch jobscript
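For example, a two-core interactive session on the fquad partition (partition and core count are placeholders, pick values that match the table above) would be requested with

# salloc -p fquad -n 2 srun --pty /bin/bash

The shell you get runs on the allocated node, and exiting it releases the allocation.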

Job scripts contain information on the nodes you want to allocate, the maximum time you allow your job to run, and settings for the job name and email notifications. A simple MPI-only job script looks like this:

example Slurm job file

#!/bin/bash

#SBATCH -J your-jobname
#SBATCH --partition=fquad
#SBATCH --nodes=2
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-task=1
#SBATCH --exclusive               # do not share the allocated nodes with other jobs
#SBATCH --distribution=cyclic     # distribute the tasks round-robin across the allocated nodes
#SBATCH --export=ALL
#SBATCH --time=00:05:00

# one OpenMP thread per CPU allocated to each task (falls back to 1 if --cpus-per-task is not set)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}

mpirun --map-by socket /global/sw/hybrid-test/hybrid-test &> test.out

echo "job finished"

For hybrid (MPI+OpenMP) jobs, make sure you do not oversubscribe nodes, i.e. do not request more ranks times threads than the total number of CPU cores on a node. The number of OpenMP threads has to be set via the --cpus-per-task option; see the sketch below.
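As an illustration with the chrom numbers from the table above (24 cores per node), the product of --ntasks-per-node and --cpus-per-task must not exceed 24:

#SBATCH --partition=chrom
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6      # 4 ranks x 6 threads = 24 cores per node, fully used

A combination such as 4 ranks with 8 threads each would ask for 32 CPUs on a 24-core node and is exactly the oversubscription to avoid.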

Note that we must use mpirun, NOT srun, and that we do not need to specify the number of ranks via 'mpirun -np N'. Slurm handles this automatically using the specifications from the --ntasks-per-node and --nodes parameters.
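If you want to check where the ranks actually end up (this assumes the MPI installation is Slurm-aware, e.g. an Open MPI build with Slurm support), a quick test from inside a job script is

mpirun hostname | sort | uniq -c

which prints how many ranks were started on each node.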

For a more detailed explanation, please consult the ZIH wiki:

  • https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/Slurm 

Compiling and software access

TODO: introduce lmod system [coming soon...]

Architecture-specific compiler flags / vectorization

In case you want to run your jobs only on a certain partition, e.g. only on chrom, you should use architecture-specific optimization flags for gcc. For this example, we can utilize the AVX instruction set of the AMD Opteron CPUs via -march=bdver1 -mavx, resulting in a performance gain of roughly 17% for Elk TD-DFT calculations. You can find the correct flags using the listnodes -l command.
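A corresponding compile line could look like the following sketch (file names are placeholders; -march=bdver1 already enables AVX, the explicit -mavx is only added for emphasis):

gcc -O3 -march=bdver1 -mavx -o mycode.x mycode.c

Alternatively, you can compile directly on a node of the target partition (e.g. inside an interactive salloc session) and let gcc -march=native pick the matching flags.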

Software recommendations

Elk-LAPW

The current version 5.2.14 cannot be run MPI-only across several nodes, e.g. with 40 ranks spanning calcium[01-02]. The best scaling on our cluster is reached by using 1 MPI rank per socket and a number of threads equal to the number of cores per CPU. This also seems to work across multiple nodes.

Hybrid example options for 2 'chrom' nodes with 1 rank per socket and 12 cores/CPU (=24 cores/node): 

hybrid job script (excerpt)

#SBATCH --nodes=2
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task=12
mpirun --map-by socket $HOME/Software/elk-5.2.14/src/elk > ELK.OUT


CUDA capable nodes

On the CUDA-capable nodes, you will find the nvcc compiler and the nvidia-cuda-toolkit installed. You may use nvcc just like a conventional compiler, e.g. 'nvcc cudasrc.cu -o cuda.out'.
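A minimal batch sketch for one of the cuda nodes follows (file names are placeholders, and whether an additional GPU request option such as --gres is needed depends on the local Slurm configuration, which is not documented here):

#!/bin/bash
#SBATCH -J cuda-test
#SBATCH --partition=cuda
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# compile on the compute node and run the resulting binary
nvcc cudasrc.cu -o cuda.out
./cuda.out &> cuda-test.log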

Titan Black @ Sirius

Device 0: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version:          9.1 / 9.1
  CUDA Capability Major/Minor version number:     3.5
  Total amount of global memory:                  6082 MBytes
  (15) Multiprocessors, (192) CUDA Cores/MP:      2880 CUDA Cores
  GPU Max Clock rate:                             1072 MHz
  Memory Clock rate:                              3500 MHz
  Memory Bus Width:                               384-bit
  L2 Cache Size:                                  1572864 bytes
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        49152 bytes
  Total number of registers available per block:  65536
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):      (2147483647, 65535, 65535)
  Maximum memory pitch:                           2147483647 bytes
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes with 1 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device supports Compute Preemption:             No
  Supports Cooperative Kernel Launch:             No
  Supports MultiDevice Co-op Kernel Launch:       No
  Device PCI Domain ID / Bus ID / location ID:    0 / 1 / 0
  Compute Mode:                                   Default*

* multiple host threads can use ::cudaSetDevice() with device simultaneously

GTX 580 @ CUDA1-8

TODO