This page has moved to https://docs.hpc.taltech.ee and has not yet been updated for the Rocky Linux installation.
Available MPI versions (and comparison)
The cluster has OpenMPI installed.
The recommendation is to use OpenMPI (unless you really know what you are doing)! MPICH does not support InfiniBand, and MVAPICH is not integrated with SLURM, so you need to create the hostfile yourself from the SLURM node list.
On all nodes:
module load mpi/openmpi-x86_64
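To check that the module is loaded and to build the hello-mpi binary used in the examples below (hello-mpi.c is an assumed source file name, not something provided on the cluster), something along these lines should work:
which mpicc                          # should point at the OpenMPI compiler wrapper after the module is loaded
mpicc -O2 -o hello-mpi hello-mpi.c   # build the example binary used below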
OpenMPI will choose the fastest interface. It will first try RDMA over Ethernet (RoCE), which causes “[qelr_create_qp:683]create qp: failed on ibv_cmd_create_qp” messages; these can be ignored, since it will fall back to IB (higher bandwidth anyway) or TCP.
For MPI jobs prefer the green-ib partition (#SBATCH -p green-ib) or stay within a single node (#SBATCH -N 1).
mpirun --mca btl_openib_warn_no_device_params_found 0 ./hello-mpi
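A minimal batch script putting these pieces together could look like the following sketch (the node count and tasks-per-node values are placeholders, adjust them to your job):
#!/bin/bash
#SBATCH -p green-ib            # InfiniBand partition recommended above
#SBATCH -N 2                   # number of nodes (placeholder)
#SBATCH --ntasks-per-node=8    # MPI ranks per node (placeholder)
module load mpi/openmpi-x86_64
# OpenMPI is SLURM-integrated, so mpirun picks up the allocation automatically
mpirun --mca btl_openib_warn_no_device_params_found 0 ./hello-mpi
Submit the script with sbatch as usual.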
Layers in OpenMPI
PML = Point-to-point Management Layer:
UCX
MTL = Matching Transport Layer:
PSM,
PSM2,
OFI
BTL = Byte Transfer Layer:
TCP,
openib
self
sm (OpenMPI 1), vader (OpenMPI 4)
The layers can be confusing: openib, for example, was originally developed for InfiniBand, but is now used for RoCE and is deprecated for IB. However, on some IB cards and configurations it is the only working option. Also, the MVAPICH implementation still uses openib (verbs) instead of UCX.
Layers can be selected with the --mca option of mpirun:
To select TCP transport:
mpirun --mca btl tcp,self,vader
To select RDMA transport (verbs):
mpirun --mca btl openib,self,vader
To select UCX transport:
mpirun --mca pml ucx
Or using environment variables, e.g. export OMPI_MCA_btl=tcp,self,vader.
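Which BTL and PML components are actually available in the installed OpenMPI can be listed with ompi_info (part of OpenMPI); the grep patterns below are just a convenience:
ompi_info | grep "MCA btl"   # list available byte transfer layer components
ompi_info | grep "MCA pml"   # list available point-to-point management layers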
NB! UCX is not supported on QLogic FastLinQ QL41000 Ethernet controllers.
NB! UCX 1.8 on the amp nodes from Ubuntu is broken, use the SPACK version.
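On hardware where UCX does not work, one workaround is to exclude the UCX PML explicitly (the ^ prefix is OpenMPI's standard MCA exclusion syntax) and fall back to the BTL transports, for example:
mpirun --mca pml ^ucx --mca btl tcp,self,vader ./hello-mpi   # avoid UCX, use TCP + shared memory instead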
Different MPI implementations exist:
OpenMPI
MPICH
MVAPICH
IBM Platform MPI (MPICH descendant)
IBM Spectrum MPI (OpenMPI descendant)
(at least one for each network and CPU manufacturer)
OpenMPI
available in any Linux or BSD distribution
combining technologies and resources from several other projects (incl. LAM/MPI)
can use TCP/IP, shared memory, Myrinet, InfiniBand and other low latency interconnects
chooses fastest interconnect automatically (can be manually chosen, too)
well integrated into many schedulers (e.g. SLURM)
highly optimized
FOSS (BSD license)
MPICH
highly optimized
supports TCP/IP and some low latency interconnects
older versions DO NOT support InfiniBand (current versions do support Mellanox IB)
available in many Linux distributions
not integrated into schedulers (?)
used to be a PITA to get working smoothly
FOSS
MVAPICH
highly optimized (maybe slightly faster than OpenMPI)
fork of MPICH to support IB
comes in many flavors to support TCP/IP, InfiniBand and many low latency interconnects, as well as OpenSHMEM/PGAS programming models
need to install several flavors and users need to choose the right one for the interconnect they want to use
generally not available in Linux distributions
not integrated with schedulers (integrated with SLURM only after version 18)
FOSS (BSD license)
Recommendation
default: use OpenMPI on both clusters
if unsatisfied with performance and running on single node or over TCP, try MPICH
if unsatisfied with performance and running on IB try MVAPICH