Async and Multi-GPU OpenACC

In this guide we will go over a few advance topics regarding OpenACC. The guide will cover asynchronous operations, which can overlap memory transfers and compute for higher throughput, and how to utilize multiple GPUs.

Tip

If you are not sure what OpenACC is we have an introductory guide which explains the basics.

Introduction

Asynchronous programming in OpenACC is a way to schedule work so that the GPU can work concurrently with the tasks given. Note that this does not mean that the GPU will necessarily run multiple kernels simultaneously. Often, asynchronous programming with OpenACC will allow us to overlap memory transfer with kernel execution. This can improve efficiency since the GPU does not sit idle while transferring memory back-and-forth, resulting in improved throughput. If you are just beginning to translate an existing code base to OpenACC, asynchronous operations should be some of the last optimizations to apply and can be tricky when the problem is not embarrassingly parallel.

After reading this guide you should be familiar with the following topics and ideas.

Understand how asynchronous programming with OpenACC works.
- How memory and kernels can be overlapped.
- How different OpenACC blocks can be made dependent on each other.
- Which problems are suitable for asynchronous OpenACC.
Know the basics of utilizing multiple GPUs.

To get started we will need some piece of code that we would like to accelerate. This time we have chosen to accelerate the visualization of the Mandelbrot set.

Since this code requires a bit more setup than before we have created a meson project that can be used to build the code. Below we have attached the full project, and we will offer zip archives of the full project as we make changes, but will focus on the main code in mandelbrot.c.

Serial version of project as 'zip' archive

Below we have attached the full version of mandelbrot.c and highlighted the two main areas of computation where we will focus our efforts.

/**
 * Mandelbrot implementation for accelerators (e.g. GPUs)
 */

#include "utils/lodepng.h"
#include "utils/palette.h"
#include <omp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Default width and height for image if not given
static const int WIDTH = 1280;
static const int HEIGHT = 720;
// Default output name if not given
static const char* OUTPUT_NAME = "mandelbrot.png";
// Maximum iteration count before exiting mandelbrot function
static const uint32_t MAX_ITER = 1000;

// Helper function to scale 'num' to the range '[min, max]'
float scale(float num, const float min, const float max) {
  const float scale = max - min;
  return num * scale + min;
}

/**
 * Mandelbrot function, calculates the value of the mandelbrot set at pixel 'px/py'
 */
uint32_t mandelbrot(const int px, const int py, const int width, const int height,
                    const int max_iter) {
  const float x0 = scale((float) px / (float) width, -2.5, 1.);
  const float y0 = scale((float) py / (float) height, -1., 1.);
  float x = 0.;
  float y = 0.;
  float x2 = 0.;
  float y2 = 0.;
  int iters = 0;
  while (x2 + y2 < 4. && iters < max_iter) {
    y = 2. * x * y + y0;
    x = x2 - y2 + x0;
    x2 = x * x;
    y2 = y * y;
    iters += 1;
  }
  return (uint32_t) iters;
}

int main (int argc, char** argv) {
  int width = WIDTH;
  int height = HEIGHT;
  char output_name[128];
  int max_iter = MAX_ITER;
  strncpy (output_name, OUTPUT_NAME, strnlen (OUTPUT_NAME, 127) + 1);
  // Assume the first argument is the width and height of the image
  if (argc > 1) {
    if (strncmp (argv[1], "-h", 2) == 0 || strncmp (argv[1], "--help", 6) == 0) {
      printf("Usage: %s <width>x<height> <max iterations> <output filename>\n", argv[0]);
      printf("\tImage size can also be one of {8k, 4k, 3k, 1080p, 720p}\n");
      return EXIT_SUCCESS;
    }
    // First we check image size is one of the predefined sizes
    if (strncmp (argv[1], "8k", 2) == 0) {
      width = 7680;
      height = 4320;
    } else if (strncmp (argv[1], "4k", 2) == 0) {
      width = 3840;
      height = 2160;
    } else if (strncmp (argv[1], "3k", 2) == 0) {
      width = 3000;
      height = 2000;
    } else if (strncmp (argv[1], "1080p", 5) == 0) {
      width = 1920;
      height = 1080;
    } else if (strncmp (argv[1], "720p", 4) == 0) {
      width = 1280;
      height = 720;
    } else {
      // Assume user has supplied <width>x<height>
      // Try to find 'x' in argument
      char* token;
      token = strtok (argv[1], "x");
      if (token != NULL) {
        width = atoi (token);
      } else {
        printf("\033[0;31mInvalid width/height definition:\033[0m '%s'\n", argv[1]);
        printf("\tShould be '<width>x<height>'\n");
        return EXIT_FAILURE;
      }
      token = strtok (NULL, "x");
      if (token != NULL) {
        height = atoi (token);
      } else {
        printf("\033[0;31mInvalid width/height definition:\033[0m '%s'\n", argv[1]);
        printf("\tShould be '<width>x<height>'\n");
        return EXIT_FAILURE;
      }
    }
  }
  // Second argument is the maximum iteration count
  if (argc > 2) {
    max_iter = atoi (argv[2]);
  }
  // Third argument is the output filename to write PNG file to
  if (argc > 3) {
    if (strlen (argv[3]) > 127) {
      printf("\033[0;31mOutput filename to large!\033[0m");
      return EXIT_FAILURE;
    }
    strncpy (output_name, argv[3], strnlen (argv[3], 127) + 1);
  }
  // Allocate storage for image
  uint32_t* image = calloc (width * height, sizeof (uint32_t));
  if (image == NULL) {
    printf("\033[0;31mCould not allocate memory for image!\033[0m\n");
    return EXIT_FAILURE;
  }
  printf("Generating \033[0;35m%dx%d\033[0m image with max \033[0;35m%d\033[0m iterations\n",
         width, height,
         max_iter);
  /****************************************************************************/
  /***************************   Main computation   ***************************/
  /****************************************************************************/
  const double start_time = omp_get_wtime ();
  // For each pixel of our image calculate the value of the mandelbrot set
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      const uint32_t iters = mandelbrot (x, y, width, height, max_iter);
      image[y * width + x] = palette[iters % palette_size];
    }
  }
  const double end_time = omp_get_wtime ();
  printf("Used \033[0;35m%.3f\033[0m ms for computation\n",
         (end_time - start_time) * 1000.0);
  /****************************************************************************/
  // Write image to file
  const unsigned char png_error = lodepng_encode32_file(output_name,
                                                        (const unsigned char*) image,
                                                        width, height);
  // Free image storage
  free (image);
  if (png_error) {
    printf("\033[0;31mAn error occurred while writing to PNG:\033[0m %s\n",
           lodepng_error_text (png_error));
    return EXIT_FAILURE;
  }
  printf("Wrote Mandelbrot result to \033[0;35m%s\033[0m\n", output_name);
  return EXIT_SUCCESS;
}

To build the project on Saga we first need to load a few modules before using meson to build the project.

$ module load Python/3.8.2-GCCcore-9.3.0
$ pip3 install --user meson
# To download the project directly
# wget https://documentation.sigma2.no/_downloads/bdfbca90a90a8d1b824fc6b1154ceee7/serial.zip
$ unzip serial.zip
$ cd AccelBrot-master

Note

We need to manually install meson above since we require version 0.56.0 which only exist as a pip package at the time of writing. Check with module --show_hidden avail Meson to see if a sufficiently new version is installed.

Then to build the project load NVHPC + Ninja and run the following meson commands

$ module load NVHPC/20.7 Ninja/1.10.0-GCCcore-9.3.0
$ CC=nvc meson setup builddir --buildtype=debugoptimized
$ meson compile -C builddir

Afterwards, as long as NVHPC and Ninja is loaded, only the last command meson compile -C builddir is required when making changes.

To run this on Saga (without GPU) the following srun command can be used

$ srun --account=<your project number> --time=02:00 --mem-per-cpu=1G ./builddir/src/mandelbrot

Tip

Try different image size or iteration parameters to see how much time the CPU vs GPU will take, ./builddir/src/mandelbrot 4k.

Tip

Use this opportunity to try to optimize the above code with OpenACC directives without focusing on asynchronous programming or multi-GPU setups.

Initial translation to OpenACC

To run the above code on a GPU using OpenACC we first need to introduce the #pragma acc routine directive. This directive tells OpenACC that we would like the function following the directive to be translated into GPU code. When looking at the code above we can see that the mandelbrot() function is used to separate the calculation of the set and iteration over the image. To be able to optimize the loop we therefore need to translate the mandelbrot() function.

// Helper function to scale 'num' to the range '[min, max]'
#pragma acc routine seq
float scale(float num, const float min, const float max) {
  const float scale = max - min;
  return num * scale + min;
}

/**
 * Mandelbrot function, calculates the value of the mandelbrot set at pixel 'px/py'
 */
#pragma acc routine seq
uint32_t mandelbrot(const int px, const int py, const int width, const int height,
                    const int max_iter) {
  const float x0 = scale((float) px / (float) width, -2.5, 1.);
  const float y0 = scale((float) py / (float) height, -1., 1.);
  float x = 0.;
  float y = 0.;
  float x2 = 0.;
  float y2 = 0.;
  int iters = 0;
  while (x2 + y2 < 4. && iters < max_iter) {
    y = 2. * x * y + y0;
    x = x2 - y2 + x0;
    x2 = x * x;
    y2 = y * y;
    iters += 1;
  }
  return (uint32_t) iters;
}

Note

In the above code we added seq to the end of the directive. This tells OpenACC that the routine must run sequentially on the GPU and no additional parallelization is possible. Adding seq is not necessary, but can be a good way to ensure that your thinking is correct as the compiler would complain if it is not correct. See the quick reference for further explanation of the possible additions to #pragma acc routine.

After this we can add the #pragma acc parallel loop directives around the image computation.

  // For each pixel of our image calculate the value of the mandelbrot set
  #pragma acc parallel loop \
    copyout(image[:width * height]) \
    copyin(palette[:palette_size]) \
    collapse(2)
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      const uint32_t iters = mandelbrot (x, y, width, height, max_iter);
      image[y * width + x] = palette[iters % palette_size];
    }
  }

Initial translation of 'mandelbrot.c' to OpenACC

This initial translation is already quite a lot better than the serial CPU version (clocking in at around 10x improvement when generating a 4k image). Let’s see what insight we can gain from running with Nsight.

To run with Nsight use the following invocation of srun

$ srun --account=<your project number> --time=02:00 --mem-per-cpu=1G --partition=accel --gpus=1 nsys profile -t cuda,openacc,osrt -o initial ./builddir/src/mandelbrot 4k

Nsight profile

Timeline of initial OpenACCtranslation

As we can see, most of the timeline is taken with doing other work and not the actual compute. We will therefore zoom into the desired range (which we can identify by following the CUDA API row until a cluster of yellow boxes appear).

Selection of what to zoom into on initialtimeline Result of zoom on initial timeline

From the above screenshot we can see that we are running on the GPU, but computation and memory copies are sequential. For large amounts of data this is less than optimal and we can try to improve the situation with asynchronous scheduling of compute and memory transfer.

Async OpenACC

Translating a piece of OpenACC code to run asynchronously requires us to split our work into smaller tasks that we know could run concurrently. Looking at our main loop we can see that every computation is independent of any other iteration or computation. This make the Mandelbrot example quite simple to translate, but that does not mean asynchronous operations is only for embarrassingly parallel problems.

Tip

One way to quickly utilize asynchronous OpenACC is to identify blocks of code that are run sequentially (e.g. one loop that does something to A and another loop that does something to B) and does not involve data from one another. Loops like that can be run asynchronously which can increase throughput by overlapping computation and memory transfer.

In our case we can split the computation on rows and process the image in blocks. This will allow us to use the #pragma acc update directive to copy data from the GPU per image block instead of doing this all at the end.

Async OpenACC project as 'zip'

To split the image into blocks we will create a new command line parameter and add an additional loop around our computation.

  const int block_size = height / num_blocks;
  #pragma acc data create(image[:width * height]) copyin(palette[:palette_size])
  for (int block = 0; block < num_blocks; block++) {
    const int start = block * block_size;
    const int end = (block != num_blocks - 1) ? start + block_size : height;
    const int num_elements = end - start;
    // For each pixel of our image calculate the value of the mandelbrot set
    #pragma acc parallel loop collapse(2) async(block)
    for (int y = start; y < end; y++) {
      for (int x = 0; x < width; x++) {
        const uint32_t iters = mandelbrot (x, y, width, height, max_iter);
        image[y * width + x] = palette[iters % palette_size];
      }
    }
    #pragma acc update self(image[start * width:num_elements * width]) async(block)
  }
  #pragma acc wait

In the above code the num_blocks value divides our image into a given number of blocks. Then we create and copy the necessary data before beginning our actual computation. Notice in particular the async directive added on line 8. This directive tells OpenACC that it should launch the kernel and immediately continue working. The parameter given to async is the queue number, kernels submitted to the same queue must wait for previous work in that queue to finish before being launched. Notice therefore that we, on line 15, ensure that we have the same variable block which means that we do not update the data before the computation is complete. Lastly, on line 17 we wait for all previously launched asynchronous work to finish before continuing.

We will run this again with Nsight to see if we were able to perform the work asynchronously. Use the following command on Saga (don’t forget to compile with meson compile -C builddir)

$ srun --account=<your project number> --time=02:00 --mem-per-cpu=1G --partition=accel --gpus=1 nsys profile -t cuda,openacc,osrt -o async ./builddir/src/mandelbrot 4k

Nsight async profile

This new code runs about 1.25x faster than the initial translation, which shows the value in overlapping memory and compute. Below we have attached the zoomed in view of the Nsight timeline to show how asynchronous OpenACC looks.

View of Nsight's timeline after the asynchronoustransition

Utilizing multiple GPUs

To utilize multiple GPUs on Saga we will have to dip into the OpenACC runtime calls to query and set which GPU we want to run on. We will use the acc_get_num_devices and acc_set_device_num methods to assign work to GPUs.

Tip

To see all the directives and runtime methods of OpenACC consult the quick reference.

Tip

For most real world code it can be very difficult to find places to split the work over multiple GPUs and so a different technique might be called for.

Instead of trying to split loops into work for multiple GPUs, try to see already existing possibilities to split the work. One natural way is to give individual MPI ranks their own GPU to utilize.

This can be easily accomplished through Slurm with the --gpus-per-task flag which will allocate a number of GPUs appropriate for the number of Slurm tasks.

Multi-GPU OpenACC project as 'zip'

  const int num_gpus = acc_get_num_devices (acc_device_nvidia);
  printf("Found \033[0;35m%d\033[0m devices to split compute over\n", num_gpus);
  const double start_time = omp_get_wtime ();
  const int block_size = height / num_blocks;
  for (int g = 0; g < num_gpus; g++) {
    acc_set_device_num(g, acc_device_nvidia);
    #pragma acc enter data create(image[:width * height]) copyin(palette[:palette_size])
  }
  for (int block = 0; block < num_blocks; block++) {
    const int start = block * block_size;
    const int end = (block != num_blocks - 1) ? start + block_size : height;
    const int num_elements = end - start;
    acc_set_device_num(block % num_gpus, acc_device_nvidia);
    // For each pixel of our image calculate the value of the mandelbrot set
    #pragma acc parallel loop collapse(2) async(block)
    for (int y = start; y < end; y++) {
      for (int x = 0; x < width; x++) {
        const uint32_t iters = mandelbrot (x, y, width, height, max_iter);
        image[y * width + x] = palette[iters % palette_size];
      }
    }
    #pragma acc update self(image[start * width:num_elements * width]) async(block)
  }
  for (int g = 0; g < num_gpus; g++) {
    acc_set_device_num(g, acc_device_nvidia);
    #pragma acc wait
    #pragma acc exit data delete(image[:width * height]) delete(palette[:palette_size])
  }

Notice that we copy the data to each GPU and then assign a device after the current block number. We also have to take special care when exiting all loops that we wait for all GPUs to finish and remove the data since the #pragma acc enter data directive keeps the data in GPU memory until otherwise stated (in contrast to #pragma acc data which keeps the data only for the extent of the following block of code).

This last iterations is about 1.5x faster using --gpus=2 with diminishing, or even negative, returns for additional GPUs.

Summary

We have shown how to translate the Mandelbrot calculation into OpenACC code, how such code can be made asynchronous to overlap computation and memory transfer, and how to utilize multiple GPUs on Saga.

Below is the summary of speedup where the improvement is shown relative to the previous entry in the table (take the measured times with a grain of salt, they are more an illustration of possible speedup, not guaranteed speedup).

Version	Time in milliseconds	Speedup
Serial	`10757`	N/A
OpenMP `--cpus-per-task=6`*	`3313`	`3.24x`
Initial OpenACC	`1020`	`3.25x`
Async	`811`	`1.25x`
Multi-GPU `--gpus=2`	`547`	`1.48x`
Multi-GPU `--gpus=4`	`2932`	`0.18x`
* To keep the comparison as fair as possible we compare the CPU resources
that would be the equivalent to [the billing resources of 1 GPU on
Saga](../../jobs/projects_accounting.md).