Available MPI versions (and comparison)

The cluster has OpenMPI installed.

The recommendation is to use OpenMPI (unless you really know what you are doing)! MPICH does not support InfiniBand. MVAPICH is not integrated with SLURM; you need to create the hostfile yourself from the SLURM nodelist.
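For the MVAPICH case, creating the hostfile by hand could look like the sketch below. It assumes it runs inside a SLURM job (so that SLURM_JOB_NODELIST is set) and uses the standard scontrol command to expand the nodelist; the function name make_hostfile and the default file name are made up for illustration.

```shell
# Sketch: build an MPI hostfile from the SLURM allocation.
# make_hostfile is a hypothetical helper name, not part of SLURM or MVAPICH.
make_hostfile() {
    # Default file name is an assumption; pass another path as $1 if needed
    hostfile=${1:-hostfile.$SLURM_JOB_ID}
    # Expand the compact nodelist (e.g. "node[1-3]") to one hostname per line
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$hostfile" || return 1
    # Print the path so the caller can hand it to the MPI launcher
    echo "$hostfile"
}
```

The resulting file can then be passed to the hostfile option of the MPI launcher; the exact option name depends on the launcher in use.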

On all nodes:

module load mpi/openmpi-x86_64

OpenMPI chooses the fastest interface automatically. It first tries RDMA over Ethernet (RoCE), which causes “[qelr_create_qp:683]create qp: failed on ibv_cmd_create_qp” messages. These can be ignored: OpenMPI will fail over to IB (which has higher bandwidth anyway) or TCP.

For MPI jobs prefer the green-ib partition (#SBATCH -p green-ib) or stay within a single node (#SBATCH -N 1).

mpirun --mca btl_openib_warn_no_device_params_found 0 ./hello-mpi
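Putting the module, partition and mpirun lines from this section together, a batch job might be set up as in the following sketch. The file name hello-mpi.slurm, the node count and the ranks per node are assumptions for illustration; only the partition, the module name and the mpirun line come from the text above.

```shell
# Sketch: write a batch script for a multi-node OpenMPI job.
# File name, node count and task count are illustrative assumptions.
cat > hello-mpi.slurm <<'EOF'
#!/bin/bash
# InfiniBand partition, as recommended for MPI jobs
#SBATCH -p green-ib
# Number of nodes (assumed)
#SBATCH -N 2
# MPI ranks per node (assumed)
#SBATCH --ntasks-per-node=4

module load mpi/openmpi-x86_64

# With SLURM integration, mpirun takes hosts and rank count from the allocation
mpirun --mca btl_openib_warn_no_device_params_found 0 ./hello-mpi
EOF

# Submit with: sbatch hello-mpi.slurm
```

For a single-node run, the partition line can be dropped and -N 1 used instead.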



Layers in OpenMPI


  • PML = Point-to-point Management Layer:

    • UCX

  • MTL = Matching Transport Layer:

    • PSM

    • PSM2

    • OFI

  • BTL = Byte Transfer Layer:

    • TCP

    • openib

    • self

    • sm (OpenMPI 1), vader (OpenMPI 4)


The layers can be confusing. For example, openib was originally developed for InfiniBand, but it is now used for RoCE and is deprecated for IB. However, on some IB cards and configurations it is the only working option. Also, the MVAPICH implementation still uses verbs (openib) instead of UCX.
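To see which of these components a given OpenMPI installation actually provides, the ompi_info tool that ships with OpenMPI can be filtered for the three layers. The following is a minimal sketch; the function name list_mpi_layers is hypothetical, and the grep pattern assumes the usual "MCA <framework>: <component>" line format of ompi_info.

```shell
# Sketch: list the PML, MTL and BTL components known to this OpenMPI build.
# list_mpi_layers is a hypothetical helper name; ompi_info ships with OpenMPI.
list_mpi_layers() {
    # Keep only the point-to-point layers discussed above
    ompi_info | grep -E 'MCA (pml|mtl|btl):'
}
```

Run list_mpi_layers after loading the MPI module; if openib or ucx does not appear in the output, that transport cannot be selected.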

Layers can be selected with the --mca option of mpirun:

To select TCP transport:

mpirun --mca btl tcp,self,vader

To select RDMA transport (verbs):

mpirun --mca btl openib,self,vader

To select UCX transport:

mpirun --mca pml ucx 

The same selections can be made with environment variables, e.g. export OMPI_MCA_btl=tcp,self,vader.
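The three selections above can be wrapped in a small helper for job scripts. This is only a sketch: the function name select_transport and the keywords tcp, verbs and ucx are made up for illustration, and the variables follow the OMPI_MCA_<parameter> naming used in the example above.

```shell
# Sketch: choose the OpenMPI transport through environment variables.
# select_transport and its keywords are hypothetical, not part of OpenMPI.
select_transport() {
    case "$1" in
        tcp)
            # Same as: mpirun --mca btl tcp,self,vader
            export OMPI_MCA_btl=tcp,self,vader
            unset OMPI_MCA_pml
            ;;
        verbs)
            # Same as: mpirun --mca btl openib,self,vader
            export OMPI_MCA_btl=openib,self,vader
            unset OMPI_MCA_pml
            ;;
        ucx)
            # Same as: mpirun --mca pml ucx
            export OMPI_MCA_pml=ucx
            unset OMPI_MCA_btl
            ;;
        *)
            echo "unknown transport: $1" >&2
            return 1
            ;;
    esac
}
```

Calling select_transport tcp before mpirun ./hello-mpi then has the same effect as passing the --mca option on the command line. Each branch clears the variable set by the other branches so that an earlier selection does not leak into the next run.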

NB! UCX is not supported on QLogic FastLinQ QL41000 Ethernet controllers.

NB! The UCX 1.8 package from Ubuntu on amps is broken; use the Spack version instead.





Different MPI implementations exist:


  • OpenMPI

  • MPICH

  • MVAPICH

  • IBM Platform MPI (MPICH descendant)

  • IBM Spectrum MPI (OpenMPI descendant)

  • (at least one for each network and CPU manufacturer)


OpenMPI

  • available in any Linux or BSD distribution

  • combining technologies and resources from several other projects (incl. LAM/MPI)

  • can use TCP/IP, shared memory, Myrinet, InfiniBand and other low-latency interconnects

  • chooses the fastest interconnect automatically (can be chosen manually, too)

  • well integrated into many schedulers (e.g. SLURM)

  • highly optimized

  • FOSS (BSD license)


MPICH

  • highly optimized

  • supports TCP/IP and some low-latency interconnects

  • older versions do NOT support InfiniBand (however, Mellanox IB is supported)

  • available in many Linux distributions

  • possibly not integrated into schedulers (unverified)

  • used to be a PITA to get working smoothly

  • FOSS


MVAPICH

  • highly optimized (maybe slightly faster than OpenMPI)

  • fork of MPICH to support IB

  • comes in many flavors to support TCP/IP, InfiniBand and many low-latency interconnects, as well as OpenSHMEM and other PGAS programming models

  • several flavors need to be installed, and users need to choose the right one for the interconnect they want to use

  • generally not available in Linux distributions

  • not integrated with schedulers (integrated with SLURM only after version 18)

  • FOSS (BSD license)


Recommendation

  • default: use OpenMPI on both clusters

  • if unsatisfied with performance and running on a single node or over TCP, try MPICH

  • if unsatisfied with performance and running on IB, try MVAPICH