Unified Shared Memory with SYCL

This example demonstrates:

  1. how to allocate USM pointers in SYCL

  2. how to submit work tasks to a SYCL device queue

  3. how to write a parallel kernel function in SYCL

  4. how to perform memcpy operations locally in device memory

  5. how to perform a reduction operation in a SYCL kernel function

In this tutorial we will SYCL-ify a somewhat more realistic example, which is taken from the OpenACC tutorial. The serial version of the Jacobi iteration program has here been slightly modified for C++ (and in anticipation of what is to come):

/**
 * Serial implementation of the Jacobi iteration
 */

#include <iostream>
#include <cstring>

// Number of rows and columns in our matrix
static const int NUM_ELEMENTS = 2000;
// Total number of elements in our matrix
static const int TOT_ELEMENTS = NUM_ELEMENTS * NUM_ELEMENTS;
// Maximum number of iterations before quitting
static const int MAX_ITER = 10000;
// Error tolerance for iteration
static const float MAX_ERROR = 0.01;
// Seed for random number generator
static const int SEED = 12345;

int main (int argc, char** argv) {
  // Initialize random number generator
  srand (SEED);

  // Create array to calculate on
  float arr[TOT_ELEMENTS];

  // Fill array with data
  for (int i = 0; i < TOT_ELEMENTS; i++) {
    // The following will create random values between [0, 1]
    arr[i] = (float) rand () / (float) RAND_MAX;
  }

  // Before starting calculation we will define a few helper variables
  float tmp[TOT_ELEMENTS];
  float err = __FLT_MAX__;

  // We copy here to get the boundary elements, which will be copied back and forth unchanged
  std::memcpy(tmp, arr, TOT_ELEMENTS*sizeof(float));

  int iterations = 0;
  // Perform Jacobi iterations until we either have low enough error or too many iterations
  while (err > MAX_ERROR && iterations < MAX_ITER) {
    err = 0.;
    // For each element take the average of the surrounding elements
    for (int i = 1; i < NUM_ELEMENTS - 1; i++) {
      for (int j = 1; j < NUM_ELEMENTS - 1; j++) {
        tmp[i * NUM_ELEMENTS + j] = 0.25 * (arr[i * NUM_ELEMENTS + j+1] +
                                            arr[i * NUM_ELEMENTS + j-1] +
                                            arr[(i-1) * NUM_ELEMENTS + j] +
                                            arr[(i+1) * NUM_ELEMENTS + j]);
        err = std::max(err, std::abs(tmp[i*NUM_ELEMENTS + j] - arr[i*NUM_ELEMENTS + j]));
      }
    }

    // Transfer new array to old (including boundary, which was untouched in the loop)
    std::memcpy(arr, tmp, TOT_ELEMENTS*sizeof(float));
    
    iterations++;
  }

  std::cout << "Iterations : " << iterations << " | Error : " << err << std::endl;

  return EXIT_SUCCESS;
}

jacobi_serial.cpp

Compile and run reference serial code

We can compile and run the reference serial version of the code on Saga. First we load a recent version of the GNU C++ compiler, and compile a jacobi_serial target with -Ofast optimization:

[me@login-1.SAGA ~]$ module load GCC/10.2.0
[me@login-1.SAGA ~]$ g++ -Ofast -o jacobi_serial jacobi_serial.cpp

Hopefully no errors occurred in this step, and we are ready to run a reference benchmark on a compute node:

[me@login-1.SAGA ~]$ srun --account=<my-account> --time=0:10:00 --ntasks=1 --cpus-per-task=1 --mem=1G time ./jacobi_serial
srun: job 3661704 queued and waiting for resources
srun: job 3661704 has been allocated resources
Iterations : 7214 | Error : 0.00999874
37.91user 0.00system 0:38.08elapsed 99%CPU (0avgtext+0avgdata 32880maxresident)k
3844inputs+0outputs (17major+1097minor)pagefaults 0swaps

The execution should take around 40 seconds on a single core (38.08 s in this case, the elapsed value in the output). Notice also the printed output from our program, which states that it ran a total of 7214 iterations before reaching an error below 0.01; this is the number of times we enter the main while loop.

Introducing SYCL and Unified Shared Memory

In contrast to directive-based approaches to GPU programming like OpenMP and OpenACC, which can often be applied by strategically placing compiler directives into the existing code, porting to SYCL might require somewhat larger changes to the structure and algorithms of the program. SYCL supports fully asynchronous execution of tasks using C++ concepts like futures and events, and provides two main approaches for data management: Unified Shared Memory (USM) or Buffers. USM uses familiar C/C++-like memory pointers in a unified virtual address space, which basically means that you can use the same pointer address on the host and on the device. This approach will likely be more familiar to the traditional C/C++ programmer, but it requires explicit management of all data dependencies and synchronization, which can be achieved by adding wait statements or by capturing an event from one task and passing it on as an explicit dependency of other tasks. Buffers, on the other hand, can only be accessed through special accessor objects, which are used by the runtime to automatically construct a dependency graph for all the tasks, and thus make sure that they are executed in the correct order.
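As a minimal sketch of the event-based synchronization mentioned above (not part of the Jacobi example; the array a and size N are made up for illustration, and the SYCL 2020 queue shortcut functions are assumed to be available, as they are in hipSYCL):

#include <SYCL/sycl.hpp>

int main () {
  constexpr size_t N = 1024;
  sycl::queue Q;
  float *a = sycl::malloc_shared<float>(N, Q);

  // First task: fill the array on the device and capture the returned event
  sycl::event e1 = Q.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
    a[idx[0]] = 1.0f;
  });

  // Second task: declare an explicit dependency on the first task before it may run
  Q.submit(
    [&](sycl::handler &h) {
      h.depends_on(e1);
      h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
        a[idx[0]] *= 2.0f;
      });
    }
  ).wait(); // Wait for both tasks to complete

  sycl::free(a, Q);
  return 0;
}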

In this tutorial we will limit ourselves to the USM approach, and we will for simplicity attach explicit wait statements to all the tasks, which effectively deactivates any asynchronous execution.

Step 1: Create a SYCL queue

Back to the Jacobi source code, we start by creating a sycl::queue. This object is “attached” to a particular device and is used to submit tasks for execution, in general asynchronously (out-of-order). As in the Hello World SYCL example, we will print out the name of the device to make sure that we pick up the correct hardware:

#include <iostream>
#include <cstring>

#include <SYCL/sycl.hpp>

// Number of rows and columns in our matrix
static const int NUM_ELEMENTS = 2000;
// Total number of elements in our matrix
static const int TOT_ELEMENTS = NUM_ELEMENTS * NUM_ELEMENTS;
// Maximum number of iterations before quitting
static const int MAX_ITER = 10000;
// Error tolerance for iteration
static const float MAX_ERROR = 0.01;
// Seed for random number generator
static const int SEED = 12345;

int main (int argc, char** argv) {
  // Create default SYCL queue and print name of device
  auto Q = sycl::queue{sycl::default_selector{}};
  std::cout << "Chosen device: "
            << Q.get_device().get_info<sycl::info::device::name>()
            << std::endl;

  // Initialize random number generator
  srand (SEED);

Step 2: Allocate USM memory

USM pointers can be allocated in three ways: malloc_host, malloc_device or malloc_shared. In this example we will only use shared memory pointers, which are pointer addresses that can be accessed by both the host and the device. The physical location of such shared data can actually change during program execution: the runtime will move the data back and forth as the access pattern changes from host to device and vice versa. This makes the data accessible from local memory on both the host and the device, and it allows for fast access on the device as long as the data is allowed to reside in device-local memory throughout the execution of a kernel function, i.e. as long as no data accesses from the host occur in the meantime, which would trigger a costly data migration.
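As an aside, SYCL 2020 also provides a way to hint that shared data will soon be needed on the device, so that the migration can be started ahead of time. A minimal sketch, using a hypothetical shared pointer data_s (this mechanism is not used in the rest of the tutorial, and implementation support may vary):

  // data_s is assumed to be a pointer from sycl::malloc_shared (hypothetical here)
  // Hint that the data will soon be accessed on the device, so migration can start early
  Q.prefetch(data_s, TOT_ELEMENTS * sizeof(float)).wait();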

The change we need to make to our example code in order to use shared memory is to replace the stack-allocated arrays (arr and tmp) with sycl::malloc_shared:

  // Initialize random number generator
  srand (SEED);

  // Create *SHARED* array to store the input/output
  float *arr_s = sycl::malloc_shared<float>(TOT_ELEMENTS, Q);

  // Fill *SHARED* array with data
  for (int i = 0; i < TOT_ELEMENTS; i++) {
    // The following will create random values between [0, 1]
    arr_s[i] = (float) rand () / (float) RAND_MAX;
  }

  // Create *SHARED* array to calculate on
  float *tmp_s = sycl::malloc_shared<float>(TOT_ELEMENTS, Q);
  float err = __FLT_MAX__;

  // We copy here to get the boundary elements, which will be copied back and forth unchanged
  std::memcpy(tmp_s, arr_s, TOT_ELEMENTS*sizeof(float));

We have added a _s suffix to the variable names just to indicate that they are shared pointers. Note that we pass our sycl::queue (Q) to this memory allocation, since it carries the information about which device the memory should be shared with (there could be several queues attached to different devices, as sketched below). We also see that the shared data arrays can be filled and std::memcpy'd by the host in exactly the same way as before, so there is no change to how the host interacts with this data.
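A rough sketch of that multi-queue situation (assuming the machine exposes both a CPU and a GPU device; the pointer names are made up for illustration):

  // Two queues bound to different devices
  sycl::queue Q_cpu{sycl::cpu_selector{}};
  sycl::queue Q_gpu{sycl::gpu_selector{}};

  // Each allocation is shared between the host and *that particular* device
  float *data_cpu = sycl::malloc_shared<float>(TOT_ELEMENTS, Q_cpu);
  float *data_gpu = sycl::malloc_shared<float>(TOT_ELEMENTS, Q_gpu);

  // ... submit work on each queue using its own pointer ...

  sycl::free(data_cpu, Q_cpu);
  sycl::free(data_gpu, Q_gpu);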

Note

Memory allocated with sycl::malloc_host will also be “accessible” from the device, but it will always be fetched from host memory and passed to the device over a memory bus, which is much slower than reading directly from local memory on the device. The fast alternative to shared memory is therefore to use sycl::malloc_host and sycl::malloc_device together, and manually transfer the data between the host and the device. This is a bit less convenient, but it gives the programmer more fine-grained control.
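A minimal sketch of that explicit alternative (not used in the rest of this tutorial; the pointer names are hypothetical):

  // Host-accessible memory (typically pinned) and device-only memory
  float *arr_h = sycl::malloc_host<float>(TOT_ELEMENTS, Q);
  float *arr_d = sycl::malloc_device<float>(TOT_ELEMENTS, Q);

  // ... fill arr_h on the host ...

  // Manually copy host -> device before launching kernels
  Q.memcpy(arr_d, arr_h, TOT_ELEMENTS * sizeof(float)).wait();

  // ... run kernels that read/write arr_d ...

  // Manually copy results device -> host when the host needs them
  Q.memcpy(arr_h, arr_d, TOT_ELEMENTS * sizeof(float)).wait();

  sycl::free(arr_h, Q);
  sycl::free(arr_d, Q);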

Step 3: Implement the parallel kernel

We now come to the main work sharing construct in our example (beware, this is a mouthful):

  int iterations = 0;
  // Perform Jacobi iterations until we either have low enough error or too many iterations
  while (err > MAX_ERROR && iterations < MAX_ITER) {
    err = 0.;
    // Submit work item to the SYCL queue
    Q.submit(
      [&](sycl::handler &h) {
        // Define work kernel as single loop
        h.parallel_for(
          sycl::range{(NUM_ELEMENTS - 2) * (NUM_ELEMENTS - 2)},
          [=](sycl::id<1> idx) {
            // Recover the array indices (i,j) of the interior point from the single loop index
            int i = (idx[0] / (NUM_ELEMENTS - 2)) + 1;
            int j = (idx[0] % (NUM_ELEMENTS - 2)) + 1;
            // For each element take the average of the surrounding elements
            tmp_s[i * NUM_ELEMENTS + j] = 0.25 * (arr_s[i * NUM_ELEMENTS + j+1] +
                                                  arr_s[i * NUM_ELEMENTS + j-1] +
                                                  arr_s[(i-1) * NUM_ELEMENTS + j] +
                                                  arr_s[(i+1) * NUM_ELEMENTS + j]);
          }
        );
      }
    ).wait(); // Wait for completion before moving on
    
    // Find maximum error (cannot be done in the loop kernel above)
    for (int i = 0; i < TOT_ELEMENTS; i++) {
      err = std::max(err, std::abs(tmp_s[i] - arr_s[i]));
    }

    // Transfer new array to old (including boundary, which was untouched in the loop)
    std::memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float));

    iterations++;
  }

We will not discuss in detail everything that is going on here; please refer to standard SYCL literature for more in-depth explanations, e.g. the free e-book on Data Parallel C++. The take-home message is that we submit to the queue a kernel function, representing a single iteration of a parallel_for loop, for execution on the device. A little extra logic is needed to recover the two array indices i,j from the single loop index, but otherwise the body of the kernel is the same as the nested loop we had in the serial version, except that the computation of the maximum error has been taken out of this main loop. The reason for this is that the kernel code will be executed in arbitrary order by many different threads on the device, and no single thread will be able to compute the true maximum locally.
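The index arithmetic can be avoided altogether by launching a two-dimensional kernel over the interior points; a functionally equivalent sketch (not benchmarked in this tutorial):

    // Same Jacobi update expressed as a 2D kernel; no manual index recovery needed
    Q.submit(
      [&](sycl::handler &h) {
        h.parallel_for(
          sycl::range<2>{NUM_ELEMENTS - 2, NUM_ELEMENTS - 2},
          [=](sycl::id<2> idx) {
            int i = idx[0] + 1;
            int j = idx[1] + 1;
            tmp_s[i * NUM_ELEMENTS + j] = 0.25 * (arr_s[i * NUM_ELEMENTS + j+1] +
                                                  arr_s[i * NUM_ELEMENTS + j-1] +
                                                  arr_s[(i-1) * NUM_ELEMENTS + j] +
                                                  arr_s[(i+1) * NUM_ELEMENTS + j]);
          }
        );
      }
    ).wait();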

Since the memory was allocated as malloc_shared between the host and the device, the reduction operation to find the maximum error, as well as the std::memcpy operation between tmp_s and arr_s, can be performed by the host. Keep in mind, though, that this will require a migration of the shared data back and forth between the device and the host at every iteration of the while loop (more than 7000 iterations), and we will see the effect of this in the timings below.

A critical point in the code snippet above is the wait() statement on the tail of the Q.submit() call. This tells the host to hold off further execution until all the work in the parallel kernel has completed, which effectively deactivates asynchronous execution of the device tasks.

Tip

Q.submit(...).wait(); is a concatenation of the slightly more expressive Q.submit(...); Q.wait();, which emphasizes that it’s the entire queue that is drained by the wait, not just the task loop that was just submitted. This means that you can submit several independent tasks to the queue for asynchronous execution, and then drain them all in Q.wait() at a later stage.
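A small sketch of that pattern, assuming a_s and b_s are two separate USM allocations with no dependency between them (the SYCL 2020 queue shortcut for parallel_for is used for brevity):

  // Submit two independent kernels; neither call blocks the host
  Q.parallel_for(sycl::range{TOT_ELEMENTS}, [=](sycl::id<1> idx) {
    a_s[idx[0]] *= 2.0f;   // works on a_s only
  });
  Q.parallel_for(sycl::range{TOT_ELEMENTS}, [=](sycl::id<1> idx) {
    b_s[idx[0]] += 1.0f;   // works on b_s only
  });

  // Drain the whole queue: block the host until both kernels have finished
  Q.wait();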

Step 4: Free USM memory

Finally, as always when allocating raw pointers in C++, one has to manually free the memory:

  std::cout << "Iterations : " << iterations << " | Error : " << err << std::endl;

  // Free *SHARED* memory
  sycl::free(arr_s, Q);
  sycl::free(tmp_s, Q);

  return EXIT_SUCCESS;
}

Compiling for CPU

With the adjustments discussed above we end up with the following source code:

/**
 * SYCL accelerated implementation of the Jacobi iteration
 */

#include <iostream>
#include <cstring>

#include <SYCL/sycl.hpp>

// Number of rows and columns in our matrix
static const int NUM_ELEMENTS = 2000;
// Total number of elements in our matrix
static const int TOT_ELEMENTS = NUM_ELEMENTS * NUM_ELEMENTS;
// Maximum number of iterations before quitting
static const int MAX_ITER = 10000;
// Error tolerance for iteration
static const float MAX_ERROR = 0.01;
// Seed for random number generator
static const int SEED = 12345;

int main (int argc, char** argv) {
  // Create default SYCL queue and print name of device
  auto Q = sycl::queue{sycl::default_selector{}};
  std::cout << "Chosen device: "
            << Q.get_device().get_info<sycl::info::device::name>()
            << std::endl;

  // Initialize random number generator
  srand (SEED);

  // Create *SHARED* array to store the input/output
  float *arr_s = sycl::malloc_shared<float>(TOT_ELEMENTS, Q);

  // Fill *SHARED* array with data
  for (int i = 0; i < TOT_ELEMENTS; i++) {
    // The following will create random values between [0, 1]
    arr_s[i] = (float) rand () / (float) RAND_MAX;
  }

  // Create *SHARED* array to calculate on
  float *tmp_s = sycl::malloc_shared<float>(TOT_ELEMENTS, Q);
  float err = __FLT_MAX__;

  // We copy here to get the boundary elements, which will be copied back and forth unchanged
  std::memcpy(tmp_s, arr_s, TOT_ELEMENTS*sizeof(float));

  int iterations = 0;
  // Perform Jacobi iterations until we either have low enough error or too many iterations
  while (err > MAX_ERROR && iterations < MAX_ITER) {
    err = 0.;
    // Submit work item to the SYCL queue
    Q.submit(
      [&](sycl::handler &h) {
        // Define work kernel as single loop
        h.parallel_for(
          sycl::range{(NUM_ELEMENTS - 2) * (NUM_ELEMENTS - 2)},
          [=](sycl::id<1> idx) {
            // Recover the array indices (i,j) of the interior point from the single loop index
            int i = (idx[0] / (NUM_ELEMENTS - 2)) + 1;
            int j = (idx[0] % (NUM_ELEMENTS - 2)) + 1;
            // For each element take the average of the surrounding elements
            tmp_s[i * NUM_ELEMENTS + j] = 0.25 * (arr_s[i * NUM_ELEMENTS + j+1] +
                                                  arr_s[i * NUM_ELEMENTS + j-1] +
                                                  arr_s[(i-1) * NUM_ELEMENTS + j] +
                                                  arr_s[(i+1) * NUM_ELEMENTS + j]);
          }
        );
      }
    ).wait(); // Wait for completion before moving on
    
    // Find maximum error (cannot be done in the loop kernel above)
    for (int i = 0; i < TOT_ELEMENTS; i++) {
      err = std::max(err, std::abs(tmp_s[i] - arr_s[i]));
    }

    // Transfer new array to old (including boundary, which was untouched in the loop)
    std::memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float));

    iterations++;
  }

  std::cout << "Iterations : " << iterations << " | Error : " << err << std::endl;

  // Free *SHARED* memory
  sycl::free(arr_s, Q);
  sycl::free(tmp_s, Q);

  return EXIT_SUCCESS;
}

jacobi_shared.cpp

We can compile an omp target of this code on Saga using the syclcc compiler wrapper from the hipSYCL module (feel free to ignore the warning):

[me@login-1.SAGA ~]$ module load hipSYCL/0.9.1-gcccuda-2020b
[me@login-1.SAGA ~]$ syclcc --hipsycl-targets=omp -Ofast -o jacobi_shared_cpu jacobi_shared.cpp
clang-11: warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11010. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]

And we can run it on a single compute core (please also ignore the hipSYCL warning, which appears when you run on compute nodes without GPU resources):

[me@login-1.SAGA ~]$ srun --account=<my-account> --time=0:10:00 --ntasks=1 --cpus-per-task=1 --mem=1G time ./jacobi_shared_cpu
srun: job 3671849 queued and waiting for resources
srun: job 3671849 has been allocated resources
[hipSYCL Warning] backend_loader: Could not load backend plugin: /cluster/software/hipSYCL/0.9.1-gcccuda-2020b/bin/../lib/hipSYCL/librt-backend-cuda.so
[hipSYCL Warning] libcuda.so.1: cannot open shared object file: No such file or directory
Chosen device: hipSYCL OpenMP host device
Iterations : 7229 | Error : 0.00999993
65.29user 0.37system 1:05.89elapsed 99%CPU (0avgtext+0avgdata 34300maxresident)k
10337inputs+0outputs (47major+2099minor)pagefaults 0swaps

We see from the “Chosen device” output of our program that the sycl::queue was bound to the “hipSYCL OpenMP host device”, which means that it is using the host CPU as a “device”. This run took about a minute, roughly 70% slower than the reference serial run we did above. However, one of the benefits of SYCL is that it can use the available CPU threads of the host as “device” for offloading. Let's try to run the same code on 20 CPU cores:

[me@login-1.SAGA ~]$ srun --account=<my-account> --time=0:10:00 --ntasks=1 --cpus-per-task=20 --mem=1G time ./jacobi_shared_cpu
srun: job 3671925 queued and waiting for resources
srun: job 3671925 has been allocated resources
[hipSYCL Warning] backend_loader: Could not load backend plugin: /cluster/software/hipSYCL/0.9.1-gcccuda-2020b/bin/../lib/hipSYCL/librt-backend-cuda.so
[hipSYCL Warning] libcuda.so.1: cannot open shared object file: No such file or directory
Chosen device: hipSYCL OpenMP host device
Iterations : 7229 | Error : 0.00999993
594.42user 16.34system 0:30.84elapsed 1980%CPU (0avgtext+0avgdata 45092maxresident)k
10337inputs+0outputs (47major+2267minor)pagefaults 0swaps

Alright, we're down to ~30s, which is somewhat faster than the serial reference (still not overly impressive given that we use 20 times the resources). Let's see if we can do better on the GPU.

Compiling for Nvidia GPUs

When compiling for the P100 Nvidia GPUs on Saga we simply have to change the hipsycl-targets from omp to cuda:sm_60, and then submit a job with GPU resources:

[me@login-1.SAGA ~]$ syclcc --hipsycl-targets=cuda:sm_60 -Ofast -o jacobi_shared_gpu jacobi_shared.cpp
[me@login-1.SAGA ~]$ srun --account=<my-account> --time=0:10:00 --ntasks=1 --gpus-per-task=1 --mem=1G --partition=accel time ./jacobi_shared_gpu
srun: job 3672238 queued and waiting for resources
srun: job 3672238 has been allocated resources
Chosen device: Tesla P100-PCIE-16GB
Iterations : 7230 | Error : 0.00999916
77.14user 54.72system 2:12.42elapsed 99%CPU (0avgtext+0avgdata 156600maxresident)k
11393inputs+0outputs (694130major+7440minor)pagefaults 0swaps

Good news first: the chosen device is now Tesla P100-PCIE-16GB, which is the name of the graphics card on the Saga GPU nodes. Our application was actually able to pick up the correct device. The bad news is of course the elapsed time of 2m12s, which is significantly slower than both the serial and OpenMP versions above. We already hinted at the reason for this poor performance, so let’s see if we can fix it.

Optimizing for GPU performance

Step 5: Move data between USM pointers on the device

In this example we have two std::memcpy operations performed by the host on the USM shared pointers. The first is a single call before we enter the main while loop, while the other is performed at the end of every loop iteration. Since these operations are performed by the host CPU, they implicitly trigger a data migration whenever the data happens to be located in device memory at the time of the call. Since we are copying data between two USM pointers, we can actually perform this memcpy directly on the device, and thus avoid the costly data migration.

The memcpy that we do before the main work loop could in principle be left unchanged; this single function call should have no noticeable impact on performance since the data is still located on the host right after the initialization. We will nevertheless submit this memcpy operation to the sycl::queue for execution on the device, since it then serves as a preparatory step that migrates the data to device memory ahead of the upcoming kernel execution.

  // We copy here to get the boundary elements, which will be copied back and forth unchanged
  Q.memcpy(tmp_s, arr_s, TOT_ELEMENTS*sizeof(float)).wait();

  int iterations = 0;
  // Perform Jacobi iterations until we either have low enough error or too many iterations
  while (err > MAX_ERROR && iterations < MAX_ITER) {
    err = 0.;
    // Submit work item to the SYCL queue
    Q.submit(
      [&](sycl::handler &h) {
        // Define work kernel as single loop
        h.parallel_for(
          sycl::range{(NUM_ELEMENTS - 2) * (NUM_ELEMENTS - 2)},
          [=](sycl::id<1> idx) {
            // Recover the array indices (i,j) of the interior point from the single loop index
            int i = (idx[0] / (NUM_ELEMENTS - 2)) + 1;
            int j = (idx[0] % (NUM_ELEMENTS - 2)) + 1;
            // For each element take the average of the surrounding elements
            tmp_s[i * NUM_ELEMENTS + j] = 0.25 * (arr_s[i * NUM_ELEMENTS + j+1] +
                                                  arr_s[i * NUM_ELEMENTS + j-1] +
                                                  arr_s[(i-1) * NUM_ELEMENTS + j] +
                                                  arr_s[(i+1) * NUM_ELEMENTS + j]);
          }
        );
      }
    ).wait(); // Wait for completion before moving on
    
    // Find maximum error (cannot be done in the loop kernel above)
    for (int i = 0; i < TOT_ELEMENTS; i++) {
      err = std::max(err, std::abs(tmp_s[i] - arr_s[i]));
    }

    // Transfer new array to old (including boundary, which was untouched in the loop)
    Q.memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float)).wait();

    iterations++;
  }

jacobi_memcpy.cpp

As we can see from the code snippet above, there are two changes to the memcpy function calls: (1) std:: is replaced by Q. and (2) we have put a .wait() on the tail of the function call. (1) will offload the work to be performed by the device rather than the host, while (2) will hold back the host from further execution until the Q is empty (for now the queue holds only a single memcpy task).

In contrast to the first memcpy, the one in the loop is critical for performance. If this operation is performed as std::memcpy by the host, it will require an implicit data migration from device to host (and back) in every iteration of the while loop. Making this a Q.memcpy instead will allow the copy to be executed locally in device memory without ever involving the host.

Tip

The Q.memcpy(...) syntax is actually shorthand for the somewhat more cumbersome Q.submit([&](sycl::handler &h) { h.memcpy(...); }), which is more in line with the syntax of the kernel submission above.
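Spelled out for the copy inside the loop, the two forms look like this (they should behave identically):

    // Shorthand
    Q.memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float)).wait();

    // Equivalent explicit submission through a command group handler
    Q.submit(
      [&](sycl::handler &h) {
        h.memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float));
      }
    ).wait();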

Step 6: Add reduction object to compute maximum error

There’s still one more operation inside the while loop that needs to be considered, and that is the computation of the maximum error in each iteration. This could not be straightforwardly included in the kernel function, so we left it as a separate loop to be executed by the host after the kernel has completed. However, just as for the memcpy that we discussed above, this will also imply a costly data migration back to the host at every iteration. The way around this problem is to attach a sycl::reduction operation to this error variable, which will allow us to include the maximum reduction back into the main kernel function. The syntax to achieve this is somewhat involved:

  // Create *SHARED* array to calculate on
  float *tmp_s = sycl::malloc_shared<float>(TOT_ELEMENTS, Q);
  float *err_s = sycl::malloc_shared<float>(1, Q);
  *err_s = __FLT_MAX__;

  // We copy here to get the boundary elements, which will be copied back and forth unchanged
  Q.memcpy(tmp_s, arr_s, TOT_ELEMENTS*sizeof(float)).wait();

  int iterations = 0;
  // Perform Jacobi iterations until we either have low enough error or too many iterations
  while (*err_s > MAX_ERROR && iterations < MAX_ITER) {
    *err_s = 0.;
    // Submit work item to the SYCL queue
    Q.submit(
      [&](sycl::handler &h) {
        // Attach a reduction operation to the err_s shared variable, to be used in the parallel_for
        auto max_err = sycl::reduction(err_s, sycl::maximum<float>());

        // Define work kernel as single loop
        h.parallel_for(
          sycl::range{(NUM_ELEMENTS - 2) * (NUM_ELEMENTS - 2)}, max_err,
          [=](sycl::id<1> idx, auto &max) {
            // Recover the array indices (i,j) of the interior point from the single loop index
            int i = (idx[0] / (NUM_ELEMENTS - 2)) + 1;
            int j = (idx[0] % (NUM_ELEMENTS - 2)) + 1;
            // For each element take the average of the surrounding elements
            tmp_s[i * NUM_ELEMENTS + j] = 0.25 * (arr_s[i * NUM_ELEMENTS + j+1] +
                                                  arr_s[i * NUM_ELEMENTS + j-1] +
                                                  arr_s[(i-1) * NUM_ELEMENTS + j] +
                                                  arr_s[(i+1) * NUM_ELEMENTS + j]);
            max.combine(std::abs(tmp_s[i * NUM_ELEMENTS + j] - arr_s[i * NUM_ELEMENTS + j]));
          }
        );
      }
    ).wait(); // Wait for completion before moving on
    
    // Transfer new array to old (including boundary, which was untouched in the loop)
    Q.memcpy(arr_s, tmp_s, TOT_ELEMENTS*sizeof(float)).wait();

    iterations++;
  }

  std::cout << "Iterations : " << iterations << " | Error : " << *err_s << std::endl;

  // Free *SHARED* memory
  sycl::free(arr_s, Q);
  sycl::free(tmp_s, Q);
  sycl::free(err_s, Q);

  return EXIT_SUCCESS;
}

jacobi_reduction.cpp

First of all, we need to allocate the variable that collects the error as a USM pointer so that it is accessible on the device. We do this by allocating a single float with sycl::malloc_shared. Then we wrap this USM pointer in a sycl::reduction operation and pass it as an extra argument to the parallel_for kernel. Notice that the max_err reduction object is passed into the kernel as the max argument of the lambda function; inside the kernel we call the combine() function on this reducer, which applies the sycl::maximum<float> operation to the data and thus computes the true maximum over all entries in a thread-safe manner. Finally, since the err_s pointer is shared between device and host, the host still has access to the final error and can print it out at the end.
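The same pattern works for other combination operations. As an illustration only (not part of the Jacobi program), a sum over all elements of arr_s could be computed with sycl::plus<float> and a hypothetical shared scalar sum_s:

  // Shared scalar to hold the result of the reduction
  float *sum_s = sycl::malloc_shared<float>(1, Q);
  *sum_s = 0.0f;

  Q.submit(
    [&](sycl::handler &h) {
      // Same construction as above, but with a plus operation instead of maximum
      auto sum_red = sycl::reduction(sum_s, sycl::plus<float>());
      h.parallel_for(
        sycl::range{TOT_ELEMENTS}, sum_red,
        [=](sycl::id<1> idx, auto &sum) {
          sum.combine(arr_s[idx[0]]);
        }
      );
    }
  ).wait();

  std::cout << "Sum of all elements: " << *sum_s << std::endl;
  sycl::free(sum_s, Q);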

Compiling and running optimized code

We now compile a sm_60 target of the final version, and run on a GPU node:

[me@login-1.SAGA ~]$ syclcc --hipsycl-targets=cuda:sm_60 -Ofast -o jacobi_reduction_gpu jacobi_reduction.cpp
[me@login-1.SAGA ~]$ srun --account=<my-account> --time=0:10:00 --ntasks=1 --gpus-per-task=1 --mem=1G --partition=accel time ./jacobi_reduction_gpu
srun: job 3808343 queued and waiting for resources
srun: job 3808343 has been allocated resources
Chosen device: Tesla P100-PCIE-16GB
Iterations : 7230 | Error : 0.00999916
2.03user 3.83system 0:06.49elapsed 90%CPU (0avgtext+0avgdata 156604maxresident)k
11457inputs+0outputs (1030major+6413minor)pagefaults 0swaps

We see that by making sure the data remains in device-local memory throughout the execution of the kernel, we have reduced the overall run time to about six seconds. Notice also that most of this time is spent in system calls setting up the program, and only about two seconds are spent actually running the computation. This system overhead should (hopefully) remain at a few seconds also for larger applications where the total runtime is much longer.

Summary

In this guide we have transitioned a serial C++ code into a small GPU application using the SYCL framework. We have taken several steps from the initial serial implementation to the final accelerated version, using concepts like Unified Shared Memory and a SYCL reduction operation. We have seen that the path to actually accelerated code is not necessarily straightforward, as several of the intermediate steps show execution times significantly slower than the original serial code. The steps can be summarized as follows:

Version          | CPUs | GPUs | Run time  | Relative
jacobi_serial    |    1 |    0 |  38.1 sec |     100%
jacobi_shared    |    1 |    0 |  65.9 sec |     173%
jacobi_shared    |   20 |    0 |  30.8 sec |      81%
jacobi_shared    |    1 |    1 | 132.4 sec |     348%
jacobi_memcpy    |    1 |    0 | 110.2 sec |     289%
jacobi_memcpy    |   20 |    0 |  33.9 sec |      89%
jacobi_memcpy    |    1 |    1 |  93.8 sec |     246%
jacobi_reduction |    1 |    0 | 115.1 sec |     302%
jacobi_reduction |   20 |    0 |  21.6 sec |      56%
jacobi_reduction |    1 |    1 |   6.5 sec |      17%
With this example we have shown in some detail how to compile and run a SYCL code on Saga, and how to make use of the available GPU resources there. We have highlighted some basic SYCL syntax, but we have not gone into much detail on what goes on under the hood, or how to write good and efficient SYCL code. This simple example only scratches the surface of what's possible within the framework, and we encourage the reader to check out other, more complete resources, like the Data Parallel C++ e-book, before venturing into a real-world porting project using SYCL.