Translating GPU-accelerated applications
We present different tools to translate CUDA and OpenACC applications to target various GPU (Graphics Processing Unit) architectures (e.g. AMD and Intel GPUs). A special focus will be given to `clacc`. These tools have been tested on the supercomputer LUMI-G, whose GPU partition consists of AMD MI250X GPUs.
The aim of this tutorial is to guide users through a straightforward procedure for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. By the end of this tutorial, we expect users to learn about:
- How to use the `hipify-perl` and `hipify-clang` tools to translate CUDA sources to HIP sources.
- How to use the `syclomatic` tool to convert CUDA sources to SYCL.
- How to use the `clacc` tool to convert OpenACC applications to OpenMP offloading.
- How to compile the generated HIP, SYCL and OpenMP applications.
Translating CUDA to HIP with Hipify
In this section, we cover the use of the `hipify-perl` and `hipify-clang` tools to translate a CUDA application to HIP.
The `hipify-perl` tool is a Perl-based script that translates CUDA syntax into HIP syntax (see e.g. here). As an example, in a CUDA code that makes use of the CUDA function `cudaMalloc`, the tool will replace `cudaMalloc` with the HIP function `hipMalloc`. Similarly, the CUDA function `cudaDeviceSynchronize` will be replaced by `hipDeviceSynchronize`. We list below the basic steps to run `hipify-perl`.
Step 1: loading modules
On LUMI-G, the following modules need to be loaded:
$module load CrayEnv
$module load rocm
Step 2: generating the `hipify-perl` script
Step 3: running `hipify-perl`
$perl hipify-perl program.cu > program.cu.hip
Step 4: compiling the generated HIP code with `hipcc`
$hipcc --offload-arch=gfx90a -o exec_hip program.cu.hip
Despite its simplicity, `hipify-perl` might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it replaces *cuda* with *hip*). In addition, `hipify-perl` lacks the ability to distinguish device from host function calls. The alternative is to use `hipify-clang`, as we shall describe in the next section.
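The string-substitution behaviour described above can be sketched with a simple `sed` one-liner. This is illustrative only, not the real tool (which applies a much larger set of patterns), but it shows both the basic mechanism and its main pitfall:

```shell
# Textual substitution in the spirit of hipify-perl (illustrative sed sketch):
echo 'cudaMalloc((void **)&d_x, size);' | sed 's/cuda/hip/g'

# Pitfall of pure string replacement: unrelated identifiers that happen to
# contain "cuda" are rewritten as well.
echo 'int my_cuda_flag = 0;' | sed 's/cuda/hip/g'
```

The second command turns `my_cuda_flag` into `my_hip_flag`, even though it is not a CUDA API call; a clang-based tool that actually parses the source avoids this class of error.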
As described here, the `hipify-clang` tool is based on clang and translates CUDA sources into HIP sources. It is more robust than `hipify-perl` for translating CUDA code, and it facilitates the analysis of the code by providing assistance. It does, however, require a working installation of CUDA. Details about building `hipify-clang` can be found here. Note that `hipify-clang` is available on LUMI-G; the remaining issue is the installation of the CUDA toolkit. To avoid any potential issues with the installation procedure, we opt for a CUDA singularity container. Here we present a step-by-step guide for running `hipify-clang`.
Step 1: pulling a CUDA singularity container e.g.
$singularity pull docker://nvcr.io/nvidia/cuda:11.4.0-devel-ubuntu20.04
Step 2: loading a ROCm module before launching the container. During our testing, we used the ROCm version provided by the `rocm` module.
Step 3: launching the container
$singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif
where the current directory `$PWD` on the host is mounted to that of the container, and the directory `/opt` on the host is mounted to the corresponding one inside the container.
Step 4: setting the environment variable `$PATH`. In order to run `hipify-clang` from inside the container, one can set the environment variable `$PATH`, which defines the path to look for the binary `hipify-clang`.
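For instance, assuming `hipify-clang` was installed under `/opt/rocm/bin` (the exact ROCm path on your system may differ), the variable can be set as follows:

```shell
# Prepend the (assumed) ROCm binary directory to PATH so that the
# hipify-clang binary can be found from inside the container.
export PATH=/opt/rocm/bin:$PATH
```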
Step 5: running
$hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include
Here the CUDA path and the path to the include and define files should be specified. The CUDA source code and the generated output code are `program.cu` and `hip_program.cu.hip`, respectively.
Step 6: compiling the generated HIP code; the syntax is similar to the one described in the previous section (see the `hipify-perl` section).
Translating CUDA to SYCL with Syclomatic
SYCLomatic is another conversion tool. However, instead of converting CUDA code to HIP syntax, SYCLomatic converts the code to SYCL/DPC++. The use of SYCLomatic requires the CUDA libraries, which can either be installed directly in an environment or extracted from a CUDA container. Similarly to the previous section, we use a singularity container. Here is a step-by-step guide for using `SYCLomatic`.
Step 1 Downloading `SYCLomatic`, e.g. the latest release from here.
Step 2 Decompressing the tarball into a desired location:
$tar -xvzf linux_release.tgz -C [desired install location]
Step 3 Adding the executable `c2s`, which is located in `[install location]/bin`, to your path, either by setting the environment variable `$PATH`:
$export PATH=[install location]/bin:$PATH
Or by creating a symbolic link in a directory that is already in your `$PATH`:
$ln -s [install location]/bin/dpct /usr/bin/c2s
Step 4 Launching `SYCLomatic`. This is done by running `c2s` from inside a CUDA container, similarly to steps 1, 3 and 5 in the previous section.
$c2s [file to be converted]
This will create a folder in the current directory called
dpct_output, in which the converted file is generated.
Step 5 Compiling the generated SYCL code
Step 5.1 Looking for errors in the converted file
In some cases, `SYCLomatic` might not be able to convert parts of the code. In such cases, `SYCLomatic` will comment on the parts it is unsure about. For example, these comments might look something like this:
/* DPCT1003:1: Migrated API does not return error code. (*, 0) is inserted. You may need to rewrite this code. */
Before compiling, these sections will need to be manually checked for errors.
Step 5.2 Once you have a valid file, you may compile it with the SYCL compiler of your choice. There are many such compilers, which vary based on the devices you are compiling for. Please consult the Intel SYCL documentation if you are unsure which compiler to use.
PS: SYCLomatic generates Data Parallel C++ (DPC++) code instead of pure SYCL code. This means that you either need to manually convert the DPC++ code to SYCL if you want to use a pure SYCL compiler, or you need to use the Intel oneAPI kit to compile the DPC++ code directly.
Compiling pure SYCL code
To compile the SYCL code on our clusters you need access to a SYCL compiler. On SAGA and BETZY this is straightforward and is discussed in this tutorial: What is SYCL. At the time of writing, LUMI does not have a global installation of `hipSYCL`; we must therefore use EasyBuild to get access to it. The guideline for installing `hipSYCL` on LUMI can be found here. We assume that this is done in the path `/project/project_xxxxxxx/EasyBuild`. The following modules can then be loaded:
$export EBU_USER_PREFIX=/project/project_xxxxxxx/EasyBuild
$module load LUMI/22.08
$module load partition/G
$module load rocm
$module load hipSYCL/0.9.3-cpeCray-22.08
For further details about compiling with `hipSYCL`, the tutorial mentioned above can be consulted.
Launching SYCLomatic through a singularity container
An alternative to the steps mentioned above is to create a singularity .def file (see an example here). This can be done as follows.
First, build a container image:
OBS: On most systems, you need sudo privileges to build the container. You do not have these on our clusters; you should therefore consider building the container locally and then copying it over to the cluster using scp or something similar.
$singularity build syclomatic.sif syclomatic.def
Then execute the `SYCLomatic` tool from inside the container:
$singularity exec syclomatic.sif c2s [file to be converted]
This will create the same `dpct_output` folder as mentioned in Step 4.
Translating OpenACC to OpenMP with Clacc
Clacc is a tool to translate OpenACC applications to OpenMP offloading within the Clang/LLVM compiler environment. As indicated in the GitHub repository, the `Clacc` compiler is Clang's executable in the subdirectory `bin` of the `install` directory, as described below.
In the following, we present a step-by-step guide for building and using `Clacc`.
Load the following modules to be able to build `Clacc` (for LUMI-G):
module load CrayEnv
module load rocm
Build and install `Clacc`; note that the build process takes about 5 hours:
$ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git
$ cd llvm-project
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../install \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_RUNTIMES=openmp \
    -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    ../llvm
$ make
$ make install
Set up the environment variables to be able to work from the `install` directory, which is the simplest way. For more advanced usage, which includes for instance modifying `Clacc`, we refer readers to "Usage from Build directory":
$ export PATH=`pwd`/../install/bin:$PATH
$ export LD_LIBRARY_PATH=`pwd`/../install/lib:$LD_LIBRARY_PATH
To compile the ported OpenMP code, one first needs to load these modules:
module load CrayEnv
module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm
Compile & run an OpenACC code on the CPU host:
$ clang -fopenacc openACC_code.c -o executable && ./executable
Compile & run an OpenACC code on an AMD GPU:
$ clang -fopenacc -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a openACC_code.c -o executable && ./executable
Source-to-source mode, with the OpenMP port printed out to the console:
$ clang -fopenacc-print=omp OpenACC_code.c
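As a quick check of this mode, one can write a minimal OpenACC kernel and inspect the OpenMP directives that `clang -fopenacc-print=omp` emits for it. The file below is a hypothetical example; the OpenMP directive quoted in the comment is the typical mapping for such a loop, while the actual output is produced by Clacc itself:

```shell
# Create a minimal OpenACC SAXPY-style loop to feed to Clacc's
# source-to-source mode. For a loop like this, the emitted OpenMP port is
# typically based on a directive such as:
#   #pragma omp target teams distribute parallel for
cat > saxpy_acc.c << 'EOF'
#include <stdio.h>

int main(void) {
  float x[8], y[8];
  for (int i = 0; i < 8; ++i) { x[i] = (float)i; y[i] = 1.0f; }

  /* OpenACC parallel loop; Clacc translates this to OpenMP offloading */
  #pragma acc parallel loop
  for (int i = 0; i < 8; ++i)
    y[i] = 2.0f * x[i] + y[i];

  printf("y[7] = %.1f\n", y[7]);
  return 0;
}
EOF
# Print the OpenMP port to the console (requires the Clacc build above):
# clang -fopenacc-print=omp saxpy_acc.c
```

The same file can also be compiled unchanged by a regular C compiler, which simply ignores the unknown pragma and runs the loop serially; this is a convenient way to sanity-check the numerical result before offloading.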
Compile the generated OpenMP code with the `cc` compiler wrapper:
$ cc -fopenmp -o executable OpenMP_code.c
We have presented an overview of the available tools for converting CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. In general, for large applications the translation process might cover about 80% of the source code, and thus manual modification is required to complete the porting process. It is also worth noting that the accuracy of the translation requires the applications to be written correctly according to the CUDA and OpenACC syntax.