NCCL on Olivia
The NRIS/GPU software environment provides a number of NCCL modules compiled for the system, for different CUDA and GCC versions:
ml avail nccl
NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0
NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0
NCCL/2.27.7-GCCcore-14.3.0-CUDA-12.9.1
NCCL/2.28.3-GCCcore-14.2.0-CUDA-12.8.0
NCCL/2.29.2-GCCcore-14.3.0-CUDA-12.9.1
NCCL/2.30.4-GCCcore-14.3.0-CUDA-13.0.0
When running NCCL applications that span multiple compute nodes, the off-node communication is implemented through the aws-ofi-nccl network plugin, which uses Slingshot and libfabric to transfer data. The above NCCL modules automatically load the correct plugin version.
For best performance users, who use containers should bind the libraries provided by these modules (NCCL, aws-ofi-plugin, libfabric) and make them available inside the containers (see example NCCL container on Olivia.)
NCCL runtime configuration
At this moment (May 2026), all recent NCCL versions suffer from a data corruption issue when using GPUDirect communication in the LL128 protocol (https://github.com/NVIDIA/nccl/issues/2001). To mitigate this problem it is crucial that on a Cray Slingshot systems with GH200, like Olivia, the correct environment variables are used with the NCCL library. The current settings recommended by HPE (https://github.com/HewlettPackard/shs-ccl-docs/blob/main/ccl_env.sh) are automatically set when loading the NCCL modules on Olivia:
export HSA_FORCE_FINE_GRAIN_PCIE=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_RDZV_PROTO=alt_read
export FI_CXI_RDZV_EAGER_SIZE=0
export FI_CXI_RDZV_THRESHOLD=0
export FI_CXI_RDZV_GET_MIN=0
export FI_CXI_DEFAULT_TX_SIZE=2048
export NCCL_CROSS_NIC=1
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export FI_CXI_RX_MATCH_MODE=hybrid
Care must be taken when using a custom build of NCCL, or when using containers. If these variables are not set, data transfers can be corrupted, which is reflected in the nccl-tests results, e.g.
# nccl-tests version 2.17.9 nccl-headers=22902 nccl-library=22902
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 1 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 206828 on gpu-1-43 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 206829 on gpu-1-43 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 206830 on gpu-1-43 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 206831 on gpu-1-43 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 215832 on gpu-1-47 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 215833 on gpu-1-47 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 215834 on gpu-1-47 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 215835 on gpu-1-47 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 8 Group 0 Pid 228184 on gpu-1-49 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 9 Group 0 Pid 228185 on gpu-1-49 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 10 Group 0 Pid 228186 on gpu-1-49 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 11 Group 0 Pid 228187 on gpu-1-49 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 12 Group 0 Pid 93770 on gpu-1-51 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 13 Group 0 Pid 93771 on gpu-1-51 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 14 Group 0 Pid 93772 on gpu-1-51 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 15 Group 0 Pid 93773 on gpu-1-51 device 3 [0039:01:00] NVIDIA GH200 120GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1 1 int8 sum -1 24.41 0.00 0.00 0 21.26 0.00 0.00 0
2 2 int8 sum -1 28.02 0.00 0.00 0 21.18 0.00 0.00 0
4 4 int8 sum -1 20.95 0.00 0.00 0 21.05 0.00 0.00 0
8 8 int8 sum -1 21.01 0.00 0.00 0 22.22 0.00 0.00 0
16 16 int8 sum -1 21.01 0.00 0.00 0 20.80 0.00 0.00 0
32 32 int8 sum -1 21.37 0.00 0.00 0 55.32 0.00 0.00 0
64 64 int8 sum -1 29.23 0.00 0.00 0 24.23 0.00 0.00 0
128 128 int8 sum -1 35.74 0.00 0.01 0 35.51 0.00 0.01 0
256 256 int8 sum -1 36.74 0.01 0.01 0 105.19 0.00 0.00 0
512 512 int8 sum -1 38.00 0.01 0.03 0 37.45 0.01 0.03 0
1024 1024 int8 sum -1 39.38 0.03 0.05 0 38.84 0.03 0.05 0
2048 2048 int8 sum -1 42.59 0.05 0.09 0 42.34 0.05 0.09 0
4096 4096 int8 sum -1 47.04 0.09 0.16 0 46.63 0.09 0.16 0
8192 8192 int8 sum -1 53.29 0.15 0.29 0 52.20 0.16 0.29 0
16384 16384 int8 sum -1 60.40 0.27 0.51 0 53.04 0.31 0.58 0
32768 32768 int8 sum -1 55.10 0.59 1.12 0 54.44 0.60 1.13 0
65536 65536 int8 sum -1 75.68 0.87 1.62 0 63.65 1.03 1.93 0
131072 131072 int8 sum -1 597.83 0.22 0.41 0 452.35 0.29 0.54 0
262144 262144 int8 sum -1 76.61 3.42 6.42 0 76.21 3.44 6.45 0
524288 524288 int8 sum -1 92.47 5.67 10.63 0 92.14 5.69 10.67 0
1048576 1048576 int8 sum -1 130.34 8.05 15.08 0 130.86 8.01 15.02 0
2097152 2097152 int8 sum -1 6249.20 0.34 0.63 18134 6142.32 0.34 0.64 9609
4194304 4194304 int8 sum -1 9489.31 0.44 0.83 96817 10542.7 0.40 0.75 124342
8388608 8388608 int8 sum -1 9733.76 0.86 1.62 763876 9510.45 0.88 1.65 530181
16777216 16777216 int8 sum -1 4981.64 3.37 6.31 404221 5090.43 3.30 6.18 103664
33554432 33554432 int8 sum -1 9994.01 3.36 6.30 951005 8284.95 4.05 7.59 820604
67108864 67108864 int8 sum -1 34125.2 1.97 3.69 113492 43656.8 1.54 2.88 0
134217728 134217728 int8 sum -1 2800.93 47.92 89.85 0 3023.61 44.39 83.23 0
Performance considerations
At this moment it seems that the LL128 protocol improves performance for message sizes below ~8MB. For larger buffers it might be beneficial to turn it off using the following environment variable:
export NCCL_PROTO=^LL128
Since the previously described problem with data corruption occurs in LL128, disabling it is also a viable solution.