# GPU-server "amp"

***This manual is work in progress, please check regularly for updates***

***Running your jobs on amp currently requires a special procedure: you need to ssh into amp to submit jobs (and to prepare your Python environments).***
The reason for this is the difference in operating systems (amp: Ubuntu, base: CentOS).



## Hardware
---

**amp**

- CPU: 2x AMD EPYC 7742, 64 cores
- RAM: 1 TB
- GPUs: 8x Nvidia A100 40GB
- OS: Ubuntu

**amp2**

- CPU: 2x AMD EPYC 7713, 64 cores (3rd gen EPYC, Zen3)
- RAM: 2 TB
- GPUs: 8x Nvidia A100 80GB
- OS: Ubuntu



## Login and $HOME
---

The server shares the home directory with all cluster nodes. It is possible to log in directly and submit and run jobs locally:

    ssh uni-ID@amp.hpc.taltech.ee

Please don't abuse the direct login by running jobs that bypass the queueing system. If this repeatedly disturbs jobs, we will have to disable direct login.

Jobs for amp/amp2 need to be submitted from amp/amp2 (not from base) to ensure that the environment is set up correctly. In any case, jobs need to be submitted using `srun` or `sbatch`; do not run jobs outside the batch system.

Interactive jobs are started using `srun`:

    srun -p gpu -t 1:00:00 --pty bash

GPUs have to be reserved/requested with:

    srun -p gpu --gres=gpu:A100:1 -t 1:00:00 --pty bash

Both amp and amp2 are in the same partition (`-p gpu`), so jobs that do not have specific requirements can run on either of the two nodes. If you need a specific GPU type, e.g. for testing performance or because of memory requirements, you can request the feature "A100-40" (for the 40GB A100s in amp) or "A100-80" (for the 80GB A100s in amp2) with `--constraint=A100-40`. Another option is to request that the job run on a specific node, using the `-w` switch (e.g. `srun -p gpu -w amp ...`).

You can see which GPUs have been assigned to your job using `echo $CUDA_VISIBLE_DEVICES`; **the CUDA device ID in your programs always starts with "0" (no matter which physical GPU was assigned to you by SLURM)**.

The home directory is the same as on the cluster (`/gpfs/mariana/home/`); additionally, the AI-Lab home directories are mounted under `/illukas/home/`.
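To illustrate the options above, here is a minimal sketch of requesting a specific GPU type interactively and checking what SLURM assigned; the constraint and time limit are just example values, not requirements:

```bash
# Assumed example: request one 80GB A100 (i.e. a GPU on amp2) interactively,
# then check which physical GPU SLURM assigned to the job.
srun -p gpu --gres=gpu:A100:1 --constraint=A100-80 -t 1:00:00 --pty bash

# inside the interactive session:
echo $CUDA_VISIBLE_DEVICES   # physical GPU index assigned by SLURM
nvidia-smi                   # should list only the assigned GPU(s), depending on cluster configuration
```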


## Software and modules
---

### _modules specific on amp_

Enable the SPACK software modules (this is a separate SPACK installation from the rest of the cluster, use *this one* on amp):

    module load amp-spack

The Nvidia HPC SDK (includes CUDA, nvcc and the PGI compilers) is available with the following module directory (see below):

    module load amp

### _from AI lab_

Enable the modules for AI lab software from illukas:

    module load amp
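For orientation, a short assumed shell session showing how these module commands are typically combined; apart from the module names already mentioned in this manual, check `module avail` for what actually exists:

```bash
# Assumed example session on amp; only amp-spack, amp and nvhpc-nompi/21.5
# are taken from this manual, everything else depends on the installation.
module load amp-spack         # SPACK software stack specific to amp
module avail                  # list the modules that are now visible
module load amp               # add the Nvidia HPC SDK module directory
module load nvhpc-nompi/21.5  # e.g. the HPC SDK (see the GPU section below)
module list                   # confirm what is loaded
```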


## GPU libraries and tools
---

The GPUs installed are Nvidia A100 with compute capability 8.0, compatible with CUDA 11. However, when developing your own software, be aware of vendor lock-in: CUDA is only available for Nvidia GPUs and does not work on AMD GPUs. Some new supercomputers (LUMI (CSC), El Capitan (LLNL), Frontier (ORNL)) are using AMD GPUs, and some plan to use the Intel "Ponte Vecchio" GPU (Aurora (ANL), SuperMUC-NG (LRZ)). To be future-proof, portable methods like OpenACC/OpenMP are recommended. For porting to AMD/HIP for LUMI, see the HIP section below.

### _Nvidia CUDA 11_

Again, beware of the vendor lock-in. To compile CUDA code, use the Nvidia compiler wrapper:

    nvcc

### _Offloading Compilers_

- PGI (Nvidia HPC-SDK) supports OpenACC and OpenMP offloading to Nvidia GPUs
- GCC-10.3.0 / GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas
- LLVM-13.0.0 (Clang/Flang) with NVPTX supports GPU-offloading using OpenMP pragmas
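As a minimal sketch (not taken from this manual), the commands below compile a hypothetical CUDA source file `hello.cu` for the A100 architecture and run it on one GPU through SLURM; the file name and job parameters are assumptions:

```bash
# Assumed example: compile a hypothetical hello.cu for the A100 (sm_80, i.e.
# compute capability 8.0) and run the resulting binary on one GPU.
module load amp                       # make the Nvidia HPC SDK module directory visible
module load nvhpc-nompi/21.5          # the HPC SDK includes CUDA and nvcc (see below)
nvcc -arch=sm_80 hello.cu -o hello    # compile for compute capability 8.0
srun -p gpu --gres=gpu:A100:1 -t 0:10:00 ./hello   # run on one A100
```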
See also:

### _OpenMP offloading_

Since version 4.0, OpenMP supports offloading to accelerators. It can be used with GCC, LLVM (Clang/Flang) and the Nvidia HPC-SDK (former PGI compilers). The following compilers support it (a minimal compile sketch follows this list):

- GCC-10.3.0 / GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas
- LLVM-13.0.0 (Clang/Flang) with NVPTX supports GPU-offloading using OpenMP pragmas
- AOMP
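As a minimal sketch (not taken from the manual), assuming a C source file `saxpy.c` that contains an `#pragma omp target` region, the commands below show how it could be compiled for GPU offloading with the compilers listed above; the flags are summarized in the table further below, and the corresponding compiler modules have to be loaded first:

```bash
# Assumed example: compile a hypothetical saxpy.c containing an
# "#pragma omp target teams distribute parallel for" region.

# Nvidia HPC SDK (nvc); -gpu=cc80 targets the A100 (compute capability 8.0):
nvc -mp=gpu -gpu=cc80 saxpy.c -o saxpy_nvc

# Clang with the NVPTX backend:
clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c -o saxpy_clang

# GCC with the NVPTX backend:
gcc -fopenmp -foffload=nvptx-none saxpy.c -o saxpy_gcc
```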
List of compiler support for OpenMP:

Current recommendation: use Clang, GCC or AOMP.

#### _Nvidia HPC SDK_

Compile option `-mp` for CPU-OpenMP or `-mp=gpu` for GPU-OpenMP-offloading. The table below summarizes useful compiler flags to compile your OpenMP code with offloading.

|                  | NVC/NVFortran | Clang/Cray/AMD          | GCC/GFortran         |
|------------------|---------------|-------------------------|----------------------|
| OpenMP flag      | -mp           | -fopenmp                | -fopenmp -foffload=  |
| Offload flag     | -mp=gpu       | -fopenmp-targets=       | -foffload=           |
| Target NVIDIA    | default       | nvptx64-nvidia-cuda     | nvptx-none           |
| Target AMD       | n/a           | amdgcn-amd-amdhsa       | amdgcn-amdhsa        |
| GPU architecture | -gpu=         | -Xopenmp-target -march= | -foffload="-march=   |

### _OpenACC offloading_

OpenACC is a portable, compiler-directive-based approach to GPU computing. It can be used with GCC, (LLVM (Clang/Flang)) and the Nvidia HPC-SDK (former PGI compilers).

Current recommendation: use the HPC-SDK.

#### _Nvidia HPC SDK_

Installed are versions 21.2, 21.5 and 21.9 (2021). These come with modulefiles; to use them, enable the directory:

    module load amp

then load the module you want to use, e.g.

    module load nvhpc-nompi/21.5

The HPC SDK also comes with a profiler, to identify regions that would benefit most from GPU acceleration.

OpenACC is based on compiler pragmas, enabling an incremental approach to parallelism (you never break the sequential program). It can be used for CPUs (multicore) and GPUs (tesla).

Compiling an OpenACC program with the Nvidia compiler:

get accelerator information

    pgaccelinfo

compile for multicore (C and Fortran commands)

    pgcc -fast -ta=multicore -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace jacobi.c laplace2d.c
    pgfortran -fast -ta=multicore -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_multicore laplace2d.f90 jacobi.f90

compile for GPU (C and Fortran commands)

    pgcc -fast -ta=tesla -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_gpu jacobi.c laplace2d.c
    pgfortran -fast -ta=tesla,managed -Minfo=accel -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.5/cuda/11.3/targets/x86_64-linux/include/ -o laplace_gpu laplace2d.f90 jacobi.f90

Profiling:

    nsys profile -t nvtx --stats=true --force-overwrite true -o laplace ./laplace
    nsys profile -t openacc --stats=true --force-overwrite true -o laplace_data_clauses ./laplace_data_clauses 1024 1024

Analysing the profile using the CLI:

    nsys stats laplace.qdrep

using the GUI:

    nsys-ui

then load the `.qdrep` file.

#### _GCC (needs testing)_

- GCC-10.3.0 / GCC-11.2.0 with NVPTX support GPU-offloading using OpenMP and OpenACC pragmas

### _HIP (upcoming)_

For porting code to AMD-Instinct based LUMI, the AMD HIP SDK will be installed.
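Putting the pieces of the OpenACC workflow above together, here is a hedged sketch of a batch job that compiles and profiles the laplace example on a GPU node; only the module names and compile/profile commands come from this manual, the job parameters are illustrative and the include path shown above may still be needed:

```bash
#!/bin/bash
# Assumed sketch: compile and profile the OpenACC laplace example on a GPU node.
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -p gpu
#SBATCH --gres=gpu:A100:1

module load amp
module load nvhpc-nompi/21.5

# compile for the GPU (same command as in the section above, include path omitted)
pgcc -fast -ta=tesla -Minfo=accel -o laplace_gpu jacobi.c laplace2d.c

# run under the Nsight Systems profiler and print a summary
nsys profile -t openacc --stats=true --force-overwrite true -o laplace_gpu ./laplace_gpu
```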


## Example jobs
---

### Example job for meshroom

** under construction **

### Example job for colmap (needs GPU for part of the tasks)

** under construction **

See the AI-lab guide, which needs to be slightly adapted.
### Singularity

The container solution *singularity* is available (it can also run docker containers). Use with

    module load amp
    module load Singularity

pull the docker image you want, here ubuntu:18.04:

    singularity pull docker://ubuntu:18.04

write an sbatch file (here called `ubuntu.slurm`):

    #!/bin/bash
    #SBATCH -t 0-00:30
    #SBATCH -N 1
    #SBATCH -c 1
    #SBATCH -p gpu
    #SBATCH --gres=gpu:A100:1
    #SBATCH --mem-per-cpu=4000

    singularity exec docker://ubuntu:18.04 cat /etc/issue

submit to the queueing system with

    sbatch ubuntu.slurm

and when the resources become available, your job will be executed.
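If the container needs access to the GPUs, singularity's `--nv` flag passes the Nvidia driver and devices into the container. A hedged sketch following the pattern above; the CUDA image tag is an assumption, pick one that matches the driver installed on amp:

```bash
# Assumed example: run nvidia-smi inside a CUDA container on one A100.
module load amp
module load Singularity
singularity pull docker://nvidia/cuda:11.4.2-base-ubuntu20.04
srun -p gpu --gres=gpu:A100:1 -t 0:10:00 \
     singularity exec --nv cuda_11.4.2-base-ubuntu20.04.sif nvidia-smi
```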