Translating GPU-accelerated applications
We present different tools for translating CUDA and OpenACC applications to target various GPU (Graphics Processing Unit) architectures (e.g. AMD and Intel GPUs). A special focus is given to hipify, syclomatic and clacc. These tools have been tested on the supercomputer LUMI-G, whose GPU partition is equipped with AMD MI250X GPUs.
The aim of this tutorial is to guide users through a straightforward procedure for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. By the end of this tutorial, we expect users to learn about:
- How to use the hipify-perl and hipify-clang tools to translate CUDA sources to HIP sources.
- How to use the syclomatic tool to convert CUDA sources to SYCL.
- How to use the clacc tool to convert OpenACC applications to OpenMP offloading.
- How to compile the generated HIP, SYCL and OpenMP applications.
Translating CUDA to HIP with Hipify
In this section, we cover the use of the hipify-perl and hipify-clang tools to translate a CUDA application to HIP.
Hipify-perl
The hipify-perl tool is a perl-based script that translates CUDA syntax into HIP syntax (see e.g. here). As an example, in a CUDA code that makes use of the CUDA functions cudaMalloc and cudaDeviceSynchronize, the tool will replace cudaMalloc with the HIP function hipMalloc. Similarly, the CUDA function cudaDeviceSynchronize will be replaced with hipDeviceSynchronize.
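As a minimal illustration (a hypothetical fragment, not taken from a real application), these two substitutions look as follows:

// CUDA source (before translation)
const size_t N = 1024;
double *d_x;
cudaMalloc((void**)&d_x, N * sizeof(double));
cudaDeviceSynchronize();

// HIP source (after running hipify-perl)
const size_t N = 1024;
double *d_x;
hipMalloc((void**)&d_x, N * sizeof(double));
hipDeviceSynchronize();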
We list below the basic steps to run hipify-perl:
Step 1: loading modules
On LUMI-G, the following modules need to be loaded:
$ module load CrayEnv
$ module load rocm
Step 2: generating the hipify-perl script
$ hipify-clang --perl
Step 3: running hipify-perl
$ perl hipify-perl program.cu > program.cu.hip
Step 4: compiling the generated HIP code with hipcc
$ hipcc --offload-arch=gfx90a -o exec_hip program.cu.hip
Despite the simplicity of using hipify-perl, the tool might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it replaces cuda with hip). In addition, hipify-perl lacks the ability to distinguish device function calls from host function calls. The alternative here is to use hipify-clang, as we shall describe in the next section.
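Because the replacement is textual, hipify-perl may also rewrite CUDA names that appear in comments or string literals. A hypothetical fragment illustrating the kind of unwanted change to check for:

// before: a CUDA name inside an error message, not an API call
printf("cudaMalloc failed\n");
// after hipify-perl, the message itself is rewritten
printf("hipMalloc failed\n");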
Hipify-clang
As described here, the hipify-clang tool is based on clang and translates CUDA sources into HIP sources. The tool is more robust for translating CUDA codes than hipify-perl. Furthermore, it facilitates the analysis of the code by printing warnings and statistics during the translation.
In short, hipify-clang requires LLVM+CLANG and CUDA. Details about building hipify-clang can be found here. Note that hipify-clang is available on LUMI-G; the potential issue is rather the installation of the CUDA toolkit, which is not provided there. To avoid any issues with the installation procedure, we opt for a CUDA singularity container. Here we present a step-by-step guide for running hipify-clang:
Step 1: pulling a CUDA singularity container, e.g.
$ singularity pull docker://nvcr.io/nvidia/cuda:11.4.0-devel-ubuntu20.04
Step 2: loading a ROCm module before launching the container
$ ml rocm
During our testing, we used the ROCm version rocm-5.0.2.
Step 3: launching the container
$ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif
where the current directory $PWD on the host is mounted to that of the container, and the directory /opt on the host is mounted to /opt inside the container.
Step 4: setting the environment variable $PATH
In order to run hipify-clang from inside the container, one can set the environment variable $PATH to include the location of the hipify-clang binary:
$ export PATH=/opt/rocm-5.0.2/bin:$PATH
Step 5: running hipify-clang
$ hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include
Here the CUDA path and the path to the include files and definitions must be specified. The CUDA source code and the generated output code are program.cu and hip_program.cu.hip, respectively.
Step 6: compiling the generated HIP code. The syntax is similar to the one described in the previous section (see the hipify-perl section).
Translating CUDA to SYCL with Syclomatic
SYCLomatic is another conversion tool. However, instead of converting CUDA code to HIP syntax, SYCLomatic converts the code to SYCL/DPC++. The use of SYCLomatic requires the CUDA libraries, which can either be installed directly in an environment or be extracted from a CUDA container. Similarly to the previous section, we use a singularity container. Here is a step-by-step guide for using SYCLomatic:
Step 1: downloading SYCLomatic, e.g. the latest release from here
$ wget https://github.com/oneapi-src/SYCLomatic/releases/download/20230208/linux_release.tgz
Step 2: decompressing the tarball into a desired location
$ tar -xvzf linux_release.tgz -C [desired install location]
Step 3: adding the executable c2s, which is located in [install location]/bin, to your path, either by setting the environment variable $PATH
$ export PATH=[install location]/bin:$PATH
or by creating a symbolic link into a local bin folder:
$ ln -s [install location]/bin/dpct /usr/bin/c2s
Step 4: launching SYCLomatic. This is done by running c2s from inside a CUDA container, similarly to steps 1, 3 and 5 in the previous section.
$ c2s [file to be converted]
This will create a folder called dpct_output in the current directory, in which the converted file is generated.
Step 5: compiling the generated SYCL code
Step 5.1: look for errors in the converted file.
In some cases, SYCLomatic might not be able to convert parts of the code. In such cases, SYCLomatic will comment on the parts it is unsure about. For example, these comments might look something like this:
/*
DPCT1003:1: Migrated API does not return error code. (*, 0) is inserted. You
may need to rewrite this code.
*/
Before compiling, these sections will need to be manually checked for errors.
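As a hypothetical sketch of such a manual fix (the exact output depends on the SYCLomatic version and the code being converted), a call that previously returned a CUDA error code may be wrapped in an (expression, 0) pair, as the warning above indicates; once the surrounding error handling has been adapted, the wrapper can be removed:

// as generated (sketch): the (expression, 0) pair replaces the CUDA error code
err = (q_ct1.memcpy(d_x, h_x, N * sizeof(double)).wait(), 0);
// possible manual rewrite after removing the error-code check
q_ct1.memcpy(d_x, h_x, N * sizeof(double)).wait();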
Step 5.2: once you have a valid file, you may compile it with the SYCL compiler of your choice. There are many such compilers, which vary based on the devices you are compiling for. Please consult the Intel SYCL documentation if you are unsure which compiler to use.
PS: SYCLomatic generates Data Parallel C++ (DPC++) code instead of pure SYCL code. This means that you either need to convert the DPC++ code to SYCL manually if you want to use a pure SYCL compiler, or you need to use the Intel oneAPI toolkit to compile the DPC++ code directly.
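For orientation, below is a minimal sketch of pure SYCL 2020 code (a hypothetical fragment, independent of SYCLomatic's actual output) of the kind a converted kernel is eventually rewritten into; note that older SYCL implementations may expect the header CL/sycl.hpp instead:

#include <sycl/sycl.hpp>

int main() {
  const size_t N = 1024;
  sycl::queue q;                                  // default device selection
  double *x = sycl::malloc_device<double>(N, q);  // device allocation (USM)
  // fill the array on the device
  q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
    x[i] = 2.0 * static_cast<double>(i[0]);
  }).wait();
  sycl::free(x, q);
  return 0;
}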
Compiling pure SYCL code
To compile SYCL code on our clusters, you need access to a SYCL compiler. On SAGA and BETZY this is straightforward and is discussed in this tutorial: What is SYCL. At the time of writing, LUMI does not have a global installation of hipSYCL. We must therefore use EasyBuild to get access to it. The guideline for installing hipSYCL on LUMI can be found here. We assume that this is done in the path /project/project_xxxxxxx/EasyBuild. The following modules can then be loaded:
$ export EBU_USER_PREFIX=/project/project_xxxxxxx/EasyBuild
$ module load LUMI/22.08
$ module load partition/G
$ module load rocm
$ module load hipSYCL/0.9.3-cpeCray-22.08
To test hipSYCL, the tutorial mentioned above can be used.
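Assuming the module provides hipSYCL's syclcc compiler wrapper (an assumption based on hipSYCL 0.9.x; verify with, e.g., which syclcc), compiling a SYCL source for the MI250X GPUs might look like:
$ syclcc -O2 --hipsycl-targets=hip:gfx90a -o exec_sycl sycl_program.cpp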
Launching SYCLomatic through a singularity container
An alternative to the steps mentioned above is to create a singularity .def file (see an example here). This can be done as follows:
First, build a container image:
OBS: on most systems, you need sudo privileges to build the container. Since you do not have these privileges on our clusters, you should consider building the container locally and then copying it to the cluster using scp or a similar tool.
$ singularity build syclomatic.sif syclomatic.def
Then execute the SYCLomatic tool from inside the container:
$ singularity exec syclomatic.sif c2s [file to be converted]
This will create the same dpct_output folder as mentioned in step 4.
Translating OpenACC to OpenMP with Clacc
Clacc is a tool to translate OpenACC to OpenMP offloading within the Clang/LLVM compiler environment. As indicated in the GitHub repository, the Clacc compiler is the Clang executable in the subdirectory /bin of the /install directory, as described below.
In the following, we present a step-by-step guide for building and using Clacc:
Step 1.1: load the following modules to be able to build Clacc (on LUMI-G):
$ module load CrayEnv
$ module load rocm
Step 1.2: build and install Clacc.
The building process takes about 5 hours.
$ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git
$ cd llvm-project
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_ENABLE_RUNTIMES=openmp \
-DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \
-DCMAKE_C_COMPILER=gcc \
-DCMAKE_CXX_COMPILER=g++ \
../llvm
$ make
$ make install
Step 1.3: set up environment variables to be able to work from the /install directory, which is the simplest way. For more advanced usage, which includes for instance modifying Clacc, we refer readers to “Usage from Build directory”.
$ export PATH=`pwd`/../install/bin:$PATH
$ export LD_LIBRARY_PATH=`pwd`/../install/lib:$LD_LIBRARY_PATH
Step 2: to compile the ported OpenMP code, one first needs to load these modules:
$ module load CrayEnv
$ module load PrgEnv-cray
$ module load craype-accel-amd-gfx90a
$ module load rocm
Step 2.1: compile & run an OpenACC code on the CPU host:
$ clang -fopenacc openACC_code.c && ./executable
Step 2.2: compile & run an OpenACC code on an AMD GPU:
$ clang -fopenacc -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a openACC_code.c && ./executable
Step 2.3: source-to-source mode, with the OpenMP port printed to the console:
$ clang -fopenacc-print=omp OpenACC_code.c
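As a hypothetical illustration of what this translation produces (the exact directives emitted depend on the Clacc version), a simple OpenACC loop and a plausible OpenMP offloading counterpart could look as follows:

// OpenACC (input)
#pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
for (int i = 0; i < n; i++)
  y[i] = 2.0 * x[i];

// OpenMP offloading (a plausible translation)
#pragma omp target teams distribute parallel for map(to: x[0:n]) map(from: y[0:n])
for (int i = 0; i < n; i++)
  y[i] = 2.0 * x[i];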
Step 3: compile the code with the cc compiler wrapper:
$ cc -fopenmp -o executable OpenMP_code.c
Conclusion
We have presented an overview of the available tools for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. In general, the translation process for a large application might cover about 80% of the source code, and the remaining part requires manual modification to complete the porting process. It is worth noting, however, that an accurate translation requires the application to be written correctly according to the CUDA or OpenACC syntax.