NCCL and Apptainer on Olivia
Achieving good performance with NCCL executed from inside a container requires using some host-side libraries:
- libfabric compiled with Slingshot support
- NCCL compiled with recent CUDA support
- aws-ofi-nccl plugin, which implements Slingshot support for NCCL
- OpenMPI with libfabric support
The following Apptainer definition file demonstrates how to build a base container that compiles nccl-tests, a set of benchmark programs designed to test NCCL performance:
Bootstrap: docker
From: ubuntu:25.04
%setup
mkdir -p $APPTAINER_ROOTFS/lib64
mkdir -p $APPTAINER_ROOTFS/usr/lib64
mkdir -p $APPTAINER_ROOTFS/usr/local/cuda
mkdir -p $APPTAINER_ROOTFS/cluster
mkdir -p $APPTAINER_ROOTFS/opt/openmpi
mkdir -p $APPTAINER_ROOTFS/opt/nccl
mkdir -p $APPTAINER_ROOTFS/opt/libfabric
mkdir -p $APPTAINER_ROOTFS/opt/aws-ofi-nccl
mkdir -p $APPTAINER_ROOTFS/opt/nccl-tests
%post
export DEBIAN_FRONTEND=noninteractive
ln -fs /usr/share/zoneinfo/Europe/Oslo /etc/localtime
echo "Europe/Oslo" > /etc/timezone
apt-get update -y
apt-get install -y wget bzip2 gcc g++ make python3
# Ubuntu has a newer version of libreadline - pretend it's 7, which we need on Olivia.
# Otherwise it can be provided with --bind /lib64/libreadline.so.7
ln -s /usr/lib/aarch64-linux-gnu/libreadline.so.8 /lib64/libreadline.so.7
# install vanilla NCCL
cd /tmp
wget https://github.com/NVIDIA/nccl/archive/refs/tags/v2.30.4-1.tar.gz
tar xaf v2.30.4-1.tar.gz
cd nccl-2.30.4-1
make -j NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90" src.build
mv build/* /opt/nccl/
cd /tmp
rm -rf nccl-2.30.4-1
# install vanilla OpenMPI
cd /tmp
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.10.tar.bz2
tar xaf openmpi-5.0.10.tar.bz2
cd openmpi-5.0.10
./configure --prefix=/opt/openmpi
make -j install
cd /tmp
rm -rf openmpi-5.0.10
# install nccl-tests
cd /tmp
wget https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.18.3.tar.gz
tar xaf v2.18.3.tar.gz
cd nccl-tests-2.18.3/
make -j MPI=1 NCCL_HOME=/opt/nccl NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90" src.build CC=/opt/openmpi/bin/mpicc CXX=/opt/openmpi/bin/mpicxx
mv build/* /opt/nccl-tests
cd /tmp
rm -rf nccl-tests-2.18.3
%environment
export PATH=/opt/nccl-tests:/usr/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:/opt/libfabric/lib/:/opt/nccl/lib:/opt/aws-ofi-nccl/lib:/usr/lib64:/lib64:/usr/local/cuda/lib/
Note that the above script installs a basic version of NCCL and OpenMPI only so that the application (in this case nccl-tests) can be compiled. Those libraries will not actually be used at runtime. Instead, we will bind the optimized host-side libraries over them.
The container can be built on any Olivia GPU node:
ml load NRIS/GPU
ml load CUDA/13.0.0
apptainer build --nv --fakeroot --bind /cluster --bind $EBROOTCUDA:/usr/local/cuda nccl-tests.sif nccl-tests.def
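Both the build command and the job script rely on EBROOT... variables that the loaded modules are expected to export. If one of them is empty, a bind argument such as $EBROOTCUDA:/usr/local/cuda degenerates into :/usr/local/cuda. A minimal sketch of a guard against that, assuming a helper named require_dirs (a hypothetical name, not part of Apptainer or the NRIS modules):

```shell
# require_dirs VAR...: succeed only if every named variable is set and
# points to an existing directory. Problems are reported on stderr.
# The body runs in a subshell, so its helper variables do not leak.
require_dirs() (
    status=0
    for name in "$@"; do
        # Indirect lookup: copy the value of the variable named by $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set (is the module loaded?)" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "error: $name=$value is not a directory" >&2
            status=1
        fi
    done
    return "$status"
)
```

For example, require_dirs EBROOTCUDA || exit 1 could precede the apptainer build command. A job script could check EBROOTCUDA, EBROOTOPENMPI, EBROOTNCCL, EBROOTLIBFABRIC and EBROOTAWSMINOFIMINNCCL the same way before calling srun.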
To run nccl-tests, submit the following SLURM script:
#!/bin/bash
#SBATCH --job-name=nccl-tests
#SBATCH --partition=accel --gpus-per-node=4 --ntasks-per-node=4
#SBATCH --mem=700G
#SBATCH --time=1:00:00
module load NRIS/GPU
module load OpenMPI/5.0.10-GCC-14.3.0
module load NCCL/2.30.4-GCCcore-14.3.0-CUDA-13.0.0
module list
srun apptainer exec --nv \
    --bind /cluster \
    --bind $EBROOTCUDA:/usr/local/cuda \
    --bind $EBROOTOPENMPI:/opt/openmpi \
    --bind $EBROOTNCCL:/opt/nccl \
    --bind $EBROOTLIBFABRIC:/opt/libfabric \
    --bind $EBROOTAWSMINOFIMINNCCL:/opt/aws-ofi-nccl \
    --bind /usr/lib64 \
    nccl-tests.sif /opt/nccl-tests/all_reduce_perf -d int8 -b 1 -e 128M -f 2 -g 1
Because the job script loads the corresponding NRIS modules, the environment variables that configure OpenMPI and NCCL are set correctly. These variables are propagated into the container, which ensures both good performance and correctness (see this article for an in-depth explanation).
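To see which settings actually reach the container, the relevant variables can be listed on the host and inside the container and then compared. The sketch below assumes that the interesting variables use the NCCL_, FI_ (libfabric) and OMPI_ (OpenMPI) prefixes. The exact set exported by the NRIS modules is not listed here, and filter_comm_env is a hypothetical helper name.

```shell
# filter_comm_env: read "NAME=value" lines on stdin and keep only those whose
# name starts with NCCL_, FI_ or OMPI_, sorted by name.
filter_comm_env() {
    grep -E '^(NCCL|FI|OMPI)_[A-Za-z0-9_]*=' | LC_ALL=C sort
}

# Possible use inside the job script, after the module load lines
# (host.env and container.env are arbitrary file names):
#   env | filter_comm_env > host.env
#   apptainer exec nccl-tests.sif env | filter_comm_env > container.env
#   diff host.env container.env
```

This relies on Apptainer's default behaviour of passing the host environment into the container, that is, running without --cleanenv or --containall. An empty diff would suggest that the module settings arrive unchanged.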
To run the script on 2 GPU nodes, for example:
sbatch --nodes=2 --partition=accel --account=... ./nccl-tests.job
The job produces the following output:
# nccl-tests version 2.18.3 nccl-headers=23004 nccl-library=23004
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 1 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
# Rank 0 Group 0 Pid 221423 on gpu-1-1 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 221421 on gpu-1-1 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 221424 on gpu-1-1 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 221422 on gpu-1-1 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 201122 on gpu-1-2 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 201121 on gpu-1-2 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 201119 on gpu-1-2 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 201120 on gpu-1-2 device 3 [0039:01:00] NVIDIA GH200 120GB
#
#                                                             out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           1             1      int8     sum      -1    26.95    0.00    0.00      0    22.40    0.00    0.00      0
           2             2      int8     sum      -1    97.16    0.00    0.00      0    22.58    0.00    0.00      0
           4             4      int8     sum      -1    21.81    0.00    0.00      0    21.97    0.00    0.00      0
           8             8      int8     sum      -1    22.80    0.00    0.00      0    22.35    0.00    0.00      0
          16            16      int8     sum      -1    22.02    0.00    0.00      0    21.81    0.00    0.00      0
          32            32      int8     sum      -1    22.36    0.00    0.00      0    22.24    0.00    0.00      0
          64            64      int8     sum      -1    24.82    0.00    0.00      0    23.89    0.00    0.00      0
         128           128      int8     sum      -1    28.81    0.00    0.01      0    28.88    0.00    0.01      0
         256           256      int8     sum      -1    32.95    0.01    0.01      0    29.16    0.01    0.02      0
         512           512      int8     sum      -1    28.76    0.02    0.03      0    28.54    0.02    0.03      0
        1024          1024      int8     sum      -1    32.35    0.03    0.06      0    31.42    0.03    0.06      0
        2048          2048      int8     sum      -1    31.32    0.07    0.11      0    30.50    0.07    0.12      0
        4096          4096      int8     sum      -1    32.19    0.13    0.22      0    32.62    0.13    0.22      0
        8192          8192      int8     sum      -1    36.09    0.23    0.40      0    33.83    0.24    0.42      0
       16384         16384      int8     sum      -1    36.59    0.45    0.78      0    35.31    0.46    0.81      0
       32768         32768      int8     sum      -1    40.00    0.82    1.43      0   123.30    0.27    0.47      0
       65536         65536      int8     sum      -1    61.21    1.07    1.87      0    59.30    1.11    1.93      0
      131072        131072      int8     sum      -1    67.95    1.93    3.38      0    88.03    1.49    2.61      0
      262144        262144      int8     sum      -1   196.97    1.33    2.33      0   179.73    1.46    2.55      0
      524288        524288      int8     sum      -1    93.05    5.63    9.86      0    92.01    5.70    9.97      0
     1048576       1048576      int8     sum      -1    90.58   11.58   20.26      0    89.09   11.77   20.60      0
     2097152       2097152      int8     sum      -1    98.79   21.23   37.15      0    98.49   21.29   37.26      0
     4194304       4194304      int8     sum      -1   150.75   27.82   48.69      0   150.59   27.85   48.74      0
     8388608       8388608      int8     sum      -1   340.35   24.65   43.13      0   215.52   38.92   68.12      0
    16777216      16777216      int8     sum      -1   383.60   43.74   76.54      0   647.17   25.92   45.37      0
    33554432      33554432      int8     sum      -1   837.94   40.04   70.08      0   931.77   36.01   63.02      0
    67108864      67108864      int8     sum      -1  1333.15   50.34   88.09      0  1395.72   48.08   84.14      0
   134217728     134217728      int8     sum      -1  2309.49   58.12  101.70      0  2182.20   61.51  107.63      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 17.8617
#
# Collective test concluded: all_reduce_perf
#
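The headline numbers can be pulled out of the job output with a small awk filter. This is a minimal sketch, assuming the 13-column result layout shown above, where busbw is column 8 for out-of-place and column 12 for in-place. summarize_nccl is a hypothetical helper name.

```shell
# summarize_nccl [FILE...]: report the peak bus bandwidth and the average
# bus bandwidth from nccl-tests output (read from FILE or from stdin).
summarize_nccl() {
    awk '
        # Comment lines: remember the reported average, skip everything else.
        /^#/ {
            if ($0 ~ /Avg bus bandwidth/) avg = $NF
            next
        }
        # Result rows: 13 whitespace-separated fields, first one a byte count.
        NF == 13 && $1 ~ /^[0-9]+$/ {
            if ($8  + 0 > oop) { oop = $8  + 0; oop_size = $1 }
            if ($12 + 0 > ip)  { ip  = $12 + 0; ip_size  = $1 }
        }
        END {
            printf "peak out-of-place busbw: %.2f GB/s at %s bytes\n", oop, oop_size
            printf "peak in-place busbw: %.2f GB/s at %s bytes\n", ip, ip_size
            if (avg != "") printf "average busbw: %s GB/s\n", avg
        }
    ' "$@"
}
```

Applied to the output above, this reports a peak bus bandwidth of 101.70 GB/s (out-of-place) and 107.63 GB/s (in-place), both at 134217728 bytes, and the average of 17.8617 GB/s. For all-reduce, nccl-tests derives busbw from algbw with the factor 2(n-1)/n. With the 8 ranks used here that factor is 1.75, which is consistent with the last row: 58.12 GB/s x 1.75 is approximately 101.7 GB/s.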