GPU-server “amp”

This manual is a work in progress; please check regularly for updates.




Hardware


amp is currently under repair; please use amp2.

amp

  • CPU: 2x AMD EPYC 64-core

  • RAM: 1 TB

  • GPUs: 8x Nvidia A100 40 GB

  • OS: Ubuntu

amp2

  • CPU: 2x AMD EPYC 7713 64-core (3rd gen EPYC, Zen 3)

  • RAM: 2 TB

  • GPUs: 8x Nvidia A100 80 GB

  • OS: Ubuntu





Login and $HOME


The server shares the home directory with all cluster nodes. You can get an interactive shell on it using srun:

srun -p gpu -t 1:00:00 --pty bash

GPUs have to be reserved/requested with:

srun -p gpu --gres=gpu:A100:1 -t 1:00:00 --pty bash

You can see which GPUs have been assigned to your job using echo $CUDA_VISIBLE_DEVICES. The CUDA device IDs in your programs always start at 0, no matter which physical GPU was assigned to you by SLURM.
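For example, inside an interactive GPU job started with the srun command above (the printed value is illustrative):

echo $CUDA_VISIBLE_DEVICES    # prints e.g. 3: SLURM reserved physical GPU 3 for this job
nvidia-smi                    # inspect the GPUs and their utilisation on the node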

It is also possible to log in directly and to submit and run jobs locally (only possible on amp, not on amp2):

ssh uni-ID@amp.hpc.taltech.ee

Please don’t abuse the direct login by running jobs that bypass the queueing system. If this repeatedly disturbs other jobs, we will have to disable direct login.

The home directory is the same as on the cluster (/gpfs/mariana/home/<uniID>); additionally, the AI-Lab home directories are mounted under /illukas/home/
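For example (with <uniID> being your university ID, as above):

echo $HOME                      # /gpfs/mariana/home/<uniID>
ls /illukas/home/               # the mounted AI-Lab home directories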




Software and modules


Modules specific to amp

Enable the SPACK software modules (this is a separate SPACK installation from the rest of the cluster; use this one on amp):

module load amp-spack

The Nvidia HPC SDK (which includes CUDA, nvcc and the PGI compilers) is available via the following module directory (see below):

module load amp
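Once the module directory is enabled, you can list and load the modules it provides, for example (module names as in the HPC SDK section below):

module avail                  # lists the newly visible modules, e.g. nvhpc-nompi/21.5
module load nvhpc-nompi/21.5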

Modules from the AI Lab

Enable the modules for AI Lab software from illukas:

module load amp



GPU libraries and tools


The GPUs installed are Nvidia A100 with compute capability 8.0 (sm_80), compatible with CUDA 11. However, when developing your own software, be aware of vendor lock-in: CUDA is only available for Nvidia GPUs and does not work on AMD GPUs. Some new supercomputers (LUMI (CSC), El Capitan (LLNL), Frontier (ORNL)) use AMD GPUs, and some plan to use the Intel “Ponte Vecchio” GPU (Aurora (ANL), SuperMUC-NG (LRZ)). To be future-proof, portable methods like OpenACC/OpenMP are recommended.

Porting to AMD/HIP for LUMI: https://www.lumi-supercomputer.eu/preparing-codes-for-lumi-converting-cuda-applications-to-hip/

Nvidia CUDA 11

Again, beware of vendor lock-in.

To compile CUDA code, use the Nvidia compiler wrapper:

nvcc
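For example, a minimal CUDA compile targeting the A100 (compute capability 8.0) could look like this; the file name saxpy.cu is just a placeholder:

nvcc -arch=sm_80 -O3 -o saxpy saxpy.cu
./saxpy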

Offloading Compilers

  • PGI (Nvidia HPC-SDK) supports OpenACC and OpenMP offloading to Nvidia GPUs

  • GCC-10.3.0 and GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas

  • LLVM-13.0.0 (Clang/Flang) with NVPTX supports GPU-offloading using OpenMP pragmas


See also: https://lumi-supercomputer.eu/offloading-code-with-compiler-directives/

OpenMP offloading

Since version 4.0, OpenMP supports offloading to accelerators. It can be utilized with GCC, LLVM (Clang/Flang) and the Nvidia HPC-SDK (former PGI compilers).

  • GCC-10.3.0 and GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas

  • LLVM-13.0.0 (Clang/Flang) with NVPTX supports GPU-offloading using OpenMP pragmas

  • AOMP

List of compiler support for OpenMP: https://www.openmp.org/resources/openmp-compilers-tools/

Current recommendation: use Clang, GCC or AOMP.

Nvidia HPC SDK

Use the compile option -mp for CPU OpenMP or -mp=gpu for GPU OpenMP offloading.

The table below summarizes useful compiler flags to compile your OpenMP code with offloading.

                    NVC/NVFortran     Clang/Cray/AMD             GCC/GFortran
OpenMP flag         -mp               -fopenmp                   -fopenmp
Offload flag        -mp=gpu           -fopenmp-targets=          -foffload=
Target NVIDIA       default           nvptx64-nvidia-cuda        nvptx-none
Target AMD          n/a               amdgcn-amd-amdhsa          amdgcn-amdhsa
GPU Architecture    -gpu=             -Xopenmp-target -march=    -foffload="-march=
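As an illustration (source and program names are just examples), an OpenMP offload compile for the A100 with each compiler family could look like:

nvc -mp=gpu -gpu=cc80 -Minfo=mp -o prog prog.c
clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o prog prog.c
gcc -fopenmp -foffload=nvptx-none -o prog prog.c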

OpenACC offloading

OpenACC is a portable, compiler-directive based approach to GPU computing. It can be utilized with GCC, (LLVM (Clang/Flang)) and the Nvidia HPC-SDK (former PGI compilers).

Current recommendation: use HPC-SDK

Nvidia HPC SDK

Installed are versions 21.2, 21.5 and 21.9 (2021). These come with modulefiles; to use them, enable the module directory:

module load amp

then load the module you want to use, e.g.

module load nvhpc-nompi/21.5

The HPC SDK also comes with a profiler to identify the regions that would benefit most from GPU acceleration.

OpenACC is based on compiler pragmas, enabling an incremental approach to parallelism (you never break the sequential program). It can be used for CPUs (multicore) and GPUs (tesla).

Compiling an OpenACC program with the Nvidia compiler. First, get accelerator information:

pgaccelinfo

compile for multicore (C and Fortran commands)

pgcc -fast -ta=multicore -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/  -o laplace jacobi.c laplace2d.c
pgfortran -fast -ta=multicore  -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_multicore laplace2d.f90 jacobi.f90

compile for GPU (C and Fortran commands)

pgcc -fast -ta=tesla -Minfo=accel  -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_gpu jacobi.c laplace2d.c
pgfortran -fast -ta=tesla,managed -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_gpu laplace2d.f90 jacobi.f90

Profiling:

nsys profile -t nvtx --stats=true --force-overwrite true -o laplace ./laplace
nsys profile -t openacc --stats=true --force-overwrite true -o laplace_data_clauses ./laplace_data_clauses 1024 1024

Analysing the profile using the CLI:

nsys stats laplace.qdrep

using the GUI:

nsys-ui

then load the .qdrep file.

GCC (upcoming)

GCC-10.3.0 and GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas.

HIP (upcoming)

For porting code to the AMD-Instinct-based LUMI, the AMD HIP SDK will be installed.
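Once the HIP SDK is in place, existing CUDA sources can typically be translated with the hipify tools it ships with; a sketch (file names are only examples):

hipify-perl saxpy.cu > saxpy.hip.cpp    # rewrite CUDA API calls as HIP calls
hipcc -o saxpy saxpy.hip.cpp            # compile with the HIP compiler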




Singularity


The container solution Singularity is available (it can also run Docker containers). Use it with:

module load amp
module load Singularity

pull the docker image you want, here ubuntu:18.04:

singularity pull docker://ubuntu:18.04

write an sbatch file (here called ubuntu.slurm):

#!/bin/bash
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -p gpu
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=4000
singularity exec docker://ubuntu:18.04 cat /etc/issue

submit to the queueing system with

sbatch ubuntu.slurm

and when the resources become available, your job will be executed.
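Note: to access the reserved GPU from inside the container, Singularity's --nv flag is typically needed (it makes the Nvidia driver and libraries available in the container). For example, with the image pulled above (by default stored as ubuntu_18.04.sif):

singularity exec --nv ubuntu_18.04.sif nvidia-smi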