.bashrc
SBATCH_NO_REQUEUE=1
SBATCH_OPEN_MODE=append
EOT
More information about Slurm options can be found in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/batch-job/#common-slurm-options).
Slurm script examples provided by LUMI:
- [GPU jobs](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/lumig-job/)
- [CPU jobs](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/lumic-job/)
- [Job array](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/throughput/)
## Multi-Node Multi-GPU PyTorch Training
---
This PyTorch script simulates training a ResNet model across multiple GPUs and nodes (a sketch of what such a script typically looks like is given at the end of this section).
### Quick Guide
1. Download:
- environment setup script - [env.sh](env.sh)
- Singularity setup script - [setup.sh](setup.sh)
- PyTorch script - [min_dist.py](min_dist.py)
- Slurm script - [dist_run.slurm](dist_run.slurm)
2. Set up the environment with:
. env.sh project_XXX
where `XXX` is a project number.
3. Set up Singularity:
./setup.sh
4. Submit the PyTorch script:
sbatch -N 2 dist_run.slurm min_dist.py
5. You should get an output file `slurm-<job number>.out` with content like the following (8 processes = 2 nodes × 4 GPUs per node):
8 GPU processes in total
Batch size = 128
Dataset size = 50000
Epochs = 5
Epoch 0 done in 232.64820963301463s
Epoch 1 done in 31.191600811027456s
Epoch 2 done in 31.244039460027125s
Epoch 3 done in 31.384101407951675s
Epoch 4 done in 31.143528194981627s
### Detailed Guide
Download:
- environment setup script - [env.sh](env.sh)
- Singularity setup script - [setup.sh](setup.sh)
- PyTorch script - [min_dist.py](min_dist.py)
- Slurm script - [dist_run.slurm](dist_run.slurm)
#### Setup
These commands set up the environment and Singularity:
. env.sh project_XXX
./setup.sh
where `XXX` should be replaced with your project number.
#### Running
The job can be submitted to the queue with:
sbatch -N 2 dist_run.slurm min_dist.py
where `dist_run.slurm` is the Slurm batch script shown below, `min_dist.py` is the PyTorch training script passed to it, and `-N` is the number of nodes to use.
```bash
#!/bin/bash
#SBATCH --job-name=DIST
#SBATCH --time=10:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small-g
#SBATCH --gpus-per-node=4
# RCCL / libfabric settings for the Slingshot interconnect (explained below)
export NCCL_SOCKET_IFNAME=hsn
export NCCL_NET_GDR_LEVEL=3
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_DISABLE_CQ_HUGETLB=1
# Keep the MIOpen kernel cache on node-local /tmp (explained below)
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
# Use the first node in the allocation as the rendezvous host
export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
# 32 CPU cores per task / 4 GPU processes per node = 8 threads per process
export OMP_NUM_THREADS=8
# Launch one torchrun per node inside the container; torchrun then spawns
# one worker process per GPU and coordinates the rendezvous between nodes.
srun singularity exec --rocm \
    $SCRATCH/pytorch_latest.sif \
    torchrun --nnodes=$SLURM_NNODES \
        --nproc_per_node=$SLURM_GPUS_PER_NODE \
        --rdzv_id=$SLURM_JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR "$@"
```
The environment variables containing `NCCL` and `CXI` are used by RCCL for communication over the Slingshot interconnect.
The ones containing `MIOPEN` tell [MIOpen](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/index.html) to create its caches in `/tmp`, which is local to each node and resides in memory. If these variables are not set, MIOpen creates its cache in the user's home directory (the default), which is a problem because each node needs its own cache.
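The training script itself is not reproduced here. As a rough orientation, the sketch below shows the general shape of a multi-node DDP training script of the kind `torchrun` launches this way. It is a hypothetical minimal example, not the actual `min_dist.py`: the batch size of 128, dataset size of 50000, and 5 epochs are taken from the example output above, while the synthetic data and all names are purely illustrative.
```python
# Hypothetical sketch of a minimal multi-node DDP training script
# (illustrative only -- not the actual min_dist.py).
import os
import time

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from torchvision.models import resnet18


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns;
    # init_process_group() reads them from the environment.
    dist.init_process_group(backend="nccl")  # the NCCL backend maps to RCCL on AMD GPUs
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # one GPU per process

    if rank == 0:
        print(f"{dist.get_world_size()} GPU processes in total")

    # Synthetic CIFAR-sized data to simulate training; the real script may differ.
    dataset = TensorDataset(torch.randn(50000, 3, 32, 32),
                            torch.randint(0, 10, (50000,)))
    # DistributedSampler gives each process a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=128, sampler=sampler, num_workers=4)

    model = resnet18(num_classes=10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are averaged across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        start = time.perf_counter()
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if rank == 0:
            print(f"Epoch {epoch} done in {time.perf_counter() - start}s")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
Note that such a script contains no hostnames, ranks, or world size of its own: `torchrun` passes all of that through environment variables, which is why `dist_run.slurm` only needs to point every node at the same rendezvous endpoint.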