Translating GPU-accelerated applications

We present different tools to translate CUDA and OpenACC applications to target various GPU (Graphics Processing Unit) architectures (e.g. AMD and Intel GPUs). A special focus is given to hipify, SYCLomatic and Clacc. These tools have been tested on the LUMI-G supercomputer, whose GPU partition is equipped with AMD MI250X GPUs.

The aim of this tutorial is to guide users through a straightforward procedure for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. By the end of this tutorial, users should know:

  • How to use the hipify-perl and hipify-clang tools to translate CUDA sources to HIP sources.

  • How to use the syclomatic tool to convert CUDA source to SYCL.

  • How to use the clacc tool to convert an OpenACC application to OpenMP offloading.

  • How to compile the generated HIP, SYCL and OpenMP applications.

Translating CUDA to HIP with Hipify

In this section, we cover the use of hipify-perl and hipify-clang tools to translate a CUDA application to HIP.

Hipify-perl

The hipify-perl tool is a Perl-based script that translates CUDA syntax into HIP syntax (see e.g. here). As an example, in a CUDA code that calls the CUDA functions cudaMalloc and cudaDeviceSynchronize, the tool will replace cudaMalloc with the HIP function hipMalloc, and cudaDeviceSynchronize with hipDeviceSynchronize. We list below the basic steps to run hipify-perl:

  • Step 1: loading modules

On LUMI-G, the following modules need to be loaded:

$module load CrayEnv
$module load rocm
  • Step 2: generating hipify-perl script

$hipify-clang --perl
  • Step 3: running hipify-perl

$perl hipify-perl program.cu > program.cu.hip
  • Step 4: compiling with hipcc the generated HIP code

$hipcc --offload-arch=gfx90a -o exec_hip program.cu.hip
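
To make the substitutions concrete, here is a minimal, hypothetical CUDA fragment; the trailing comments show what hipify-perl emits for each line (only the runtime header and the API names change):

```cuda
// Hypothetical example: a CUDA fragment before translation.
// The comments on the right show the corresponding lines that
// hipify-perl produces in program.cu.hip.
#include <cuda_runtime.h>        // -> #include <hip/hip_runtime.h>

int main(void) {
    float *d_a;
    size_t bytes = 1024 * sizeof(float);

    cudaMalloc(&d_a, bytes);     // -> hipMalloc(&d_a, bytes);
    // ... kernel launches keep the same <<<grid, block>>> syntax ...
    cudaDeviceSynchronize();     // -> hipDeviceSynchronize();
    cudaFree(d_a);               // -> hipFree(d_a);
    return 0;
}
```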

Despite its simplicity, hipify-perl might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it replaces cuda with hip). In addition, hipify-perl is unable to distinguish between device and host function calls. The alternative is hipify-clang, which we describe in the next section.

Hipify-clang

As described here, the hipify-clang tool is based on Clang and translates CUDA sources into HIP sources. It is more robust than hipify-perl, and it also emits diagnostics that assist in analysing the code during translation.

In short, hipify-clang requires LLVM+Clang and CUDA. Details about building hipify-clang can be found here. Note that hipify-clang is available on LUMI-G; the remaining issue is the CUDA toolkit, which is not installed there. To avoid any installation issues, we opt for a CUDA Singularity container. Here we present a step-by-step guide for running hipify-clang:

  • Step 1: pulling a CUDA singularity container e.g.

$singularity pull docker://nvcr.io/nvidia/cuda:11.4.0-devel-ubuntu20.04
  • Step 2: loading a ROCM module before launching the container.

$ml rocm

During our testing, we used ROCm version 5.0.2.

  • Step 3: launching the container

$singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif

where the current directory $PWD on the host is mounted to that of the container, and the directory /opt on the host is mounted to the corresponding directory inside the container.

  • Step 4: setting the environment variable $PATH. In order to run hipify-clang from inside the container, one can set the environment variable $PATH, which defines the path to look for the hipify-clang binary:

$export PATH=/opt/rocm-5.0.2/bin:$PATH
  • Step 5: running hipify-clang

$hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include

Here the CUDA path and the path to the CUDA include files should be specified. The CUDA source code and the generated output code are program.cu and hip_program.cu.hip, respectively.

  • Step 6: the syntax for compiling the generated HIP code is similar to the one described in the previous section (see the hipify-perl section).

Translating CUDA to SYCL with Syclomatic

SYCLomatic is another conversion tool; instead of converting CUDA code to HIP syntax, it converts the code to SYCL/DPC++. SYCLomatic requires the CUDA libraries, which can either be installed directly in an environment or extracted from a CUDA container. As in the previous section, we use a Singularity container. Here is a step-by-step guide for using SYCLomatic:

Step 1 Downloading SYCLomatic, e.g. the latest release from here:

$wget https://github.com/oneapi-src/SYCLomatic/releases/download/20230208/linux_release.tgz

Step 2 Decompressing the tarball into a desired location:

$tar -xvzf linux_release.tgz -C [desired install location]

Step 3 Adding the executable c2s, which is located in [install location]/bin, to your path, either by setting the environment variable $PATH

$export PATH=[install location]/bin:$PATH

Or by creating a symbolic link in a local bin folder (c2s is simply an alias for the dpct executable):

$ln -s [install location]/bin/dpct /usr/bin/c2s

Step 4 Launching SYCLomatic. This is done by running c2s from inside a CUDA container, similarly to steps 1, 3 and 5 in the previous section.

$c2s [file to be converted]

This will create a folder in the current directory called dpct_output, in which the converted file is generated.

Step 5 Compiling the generated SYCL code

step 5.1 Look for errors in the converted file

In some cases, SYCLomatic might not be able to convert parts of the code. In such cases, it comments on the parts it is unsure about. For example, these comments might look something like this:

/*
    DPCT1003:1: Migrated API does not return error code. (*, 0) is inserted. You
    may need to rewrite this code.
*/

Before compiling, these sections will need to be manually checked for errors.

step 5.2 Once you have a valid file, you may compile it with the SYCL compiler of your choice. There are many such compilers, which vary based on the devices you are compiling for. Please consult the Intel SYCL documentation if you are unsure which compiler to use.

PS: SYCLomatic generates Data Parallel C++ (DPC++) code instead of pure SYCL code. This means that you either need to manually convert the DPC++ code to SYCL if you want to use a pure SYCL compiler, or you need to use the Intel oneAPI toolkit to compile the DPC++ code directly.
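
For orientation, the following hand-written sketch (not actual SYCLomatic output) shows the typical SYCL counterparts of common CUDA runtime calls; it assumes a SYCL 2020 compiler such as DPC++ or hipSYCL:

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;                                  // plays the role of the implicit CUDA context/stream
    const size_t n = 1024;

    // Counterpart of cudaMalloc: a USM device allocation
    float *d_a = sycl::malloc_device<float>(n, q);

    // Counterpart of a CUDA kernel launch
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        d_a[i] = 2.0f * static_cast<float>(i[0]);
    });
    q.wait();                                       // counterpart of cudaDeviceSynchronize

    sycl::free(d_a, q);
    return 0;
}
```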

Compiling pure SYCL code To compile the SYCL code on our clusters you need access to a SYCL compiler. On SAGA and BETZY this is straightforward and is discussed in this tutorial: What is SYCL. At the time of writing, LUMI does not have a global installation of hipSYCL, so we must use EasyBuild to get access to it. The guideline for installing hipSYCL on LUMI can be found here. We assume that this is done in the path /project/project_xxxxxxx/EasyBuild. The following modules can then be loaded:

$export EBU_USER_PREFIX=/project/project_xxxxxxx/EasyBuild
$module load LUMI/22.08
$module load partition/G
$module load rocm
$module load hipSYCL/0.9.3-cpeCray-22.08

To test hipSYCL, the tutorial mentioned above can be followed.

Launching SYCLomatic through a singularity container

An alternative to the steps mentioned above is to create a Singularity .def file (see an example here). This can be done as follows:

First, build a container image:

OBS: On most systems, you need sudo privileges to build the container. Since you do not have these privileges on our clusters, you should consider building the container locally and then copying it over to the cluster using scp or a similar tool.

$singularity build syclomatic.sif syclomatic.def

Then execute the SYCLomatic tool from inside the container:

$singularity exec syclomatic.sif c2s [file to be converted]

This will create the same dpct_output folder as mentioned in step 4.

Translate OpenACC to OpenMP with Clacc

Clacc is a tool to translate OpenACC to OpenMP offloading within the Clang/LLVM compiler environment. As indicated in the GitHub repository, the Clacc compiler is the clang executable located in the bin subdirectory of the install directory, as described below.

In the following we present a step-by-step guide for building and using Clacc:

Step 1.1 Load the following modules to be able to build Clacc (For LUMI-G):

module load CrayEnv
module load rocm

Step 1.2 Build and install Clacc. The build process takes about 5 hours:

$ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git
$ cd llvm-project
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../install     \
        -DCMAKE_BUILD_TYPE=Release            \
        -DLLVM_ENABLE_PROJECTS="clang;lld"    \
        -DLLVM_ENABLE_RUNTIMES=openmp         \
        -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \
        -DCMAKE_C_COMPILER=gcc                \
        -DCMAKE_CXX_COMPILER=g++              \
        ../llvm
$ make
$ make install

Step 1.3 Set up the environment variables to be able to work from the install directory, which is the simplest way. For more advanced usage, which includes for instance modifying Clacc, we refer readers to “Usage from Build directory”:

$ export PATH=`pwd`/../install/bin:$PATH
$ export LD_LIBRARY_PATH=`pwd`/../install/lib:$LD_LIBRARY_PATH

Step 2 To compile the ported OpenMP code, one needs first to load these modules:

module load CrayEnv
module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm

Step 2.1 Compile & run an OpenACC code on a CPU-host:

$ clang -fopenacc openACC_code.c && ./executable

Step 2.2 Compile & run an OpenACC code on AMD GPU:

$ clang -fopenacc -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a openACC_code.c && ./executable

Step 2.3 Source-to-source mode, with the OpenMP port printed to the console:

$ clang -fopenacc-print=omp OpenACC_code.c

Step 3 Compile the generated OpenMP code with the cc compiler wrapper:

cc -fopenmp -o executable OpenMP_code.c
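
As a concrete input for the steps above, here is a minimal, hypothetical OpenACC kernel (the function name vec_add is our own choice) of the kind Clacc accepts. The comment sketches the OpenMP construct that Clacc roughly produces for the directive:

```c
#include <stddef.h>

/* Hypothetical OpenACC example: Clacc translates the "acc" directive below
 * into an OpenMP offloading construct, roughly
 *   #pragma omp target teams distribute parallel for map(...)
 * A compiler without OpenACC support ignores the unknown pragma and the
 * loop simply runs sequentially on the host. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Saving this as OpenACC_code.c, the command from Step 2.3 would print its OpenMP translation to the console.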

Conclusion

We have presented an overview of available tools for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. For large applications, the translation process typically covers about 80% of the source code, so manual modifications are required to complete the porting process. It is also worth noting that the accuracy of the translation relies on the application being written in correct CUDA or OpenACC syntax.