First R calculation

Our goal on this page is to get an R calculation to run on a compute node, both as a serial and as a parallel calculation.

Simple example to get started

We will start with a very simple R script (simple.R):

print("hello from the R script!")

We can launch it on Saga with the following job script (simple.sh). Before submitting, adjust at least the line with --account to match your allocation:

#!/bin/bash

#SBATCH --account=nn9997k
#SBATCH --job-name=example
#SBATCH --partition=normal
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

# it is good to have the following lines in any bash script
set -o errexit  # make bash exit on any error
set -o nounset  # treat unset variables as errors

module restore
module load R/4.2.1-foss-2022a

Rscript simple.R > simple.Rout

Submit the example job script with:

$ sbatch simple.sh
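
You can follow the state of the job with squeue -u $USER. Once it has finished, the file simple.Rout should contain the greeting, while any error messages end up in the Slurm output file (slurm-<jobid>.out by default).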

Longer example

Here is a longer example that takes about 25 seconds to run (sequential.R):

library(foreach)


# this function approximates pi by throwing random points into a square
# it is used here to demonstrate a function that takes a bit of time
approximate_pi <- function() {
  # number of points to use
  n <- 2000000

  # generate n random points in the square
  x <- runif(n, -1.0, 1.0)
  y <- runif(n, -1.0, 1.0)

  # count the number of points that are inside the circle
  n_in <- sum(x^2 + y^2 < 1.0)

  4 * n_in / n
}


foreach (i=1:100, .combine=c) %do% {
  approximate_pi()
}
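
This works because the fraction of random points that lands inside the unit circle approaches the area ratio pi/4, so 4 * n_in / n approaches pi. The foreach loop collects 100 such estimates into a vector (.combine=c). If you want a single, more precise value, you can average them; here is a minimal sketch, assuming the approximate_pi function from sequential.R is already defined in the session:

library(foreach)

# collect the 100 estimates and average them;
# the mean should be close to 3.14159
estimates <- foreach (i=1:100, .combine=c) %do% {
  approximate_pi()
}
print(mean(estimates))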

Here is the corresponding job script (sequential.sh). Before submitting, adjust at least the line with --account to match your allocation:

#!/bin/bash

#SBATCH --account=nn9997k
#SBATCH --job-name=example
#SBATCH --partition=normal
#SBATCH --mem=2G
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

# it is good to have the following lines in any bash script
set -o errexit  # make bash exit on any error
set -o nounset  # treat unset variables as errors

module restore
module load R/4.2.1-foss-2022a

Rscript sequential.R > sequential.Rout
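
Once the job has finished, sequential.Rout should contain the 100 estimates printed as one numeric vector, all close to 3.14. To see how long the job actually took, you can ask the Slurm accounting database, with <jobid> replaced by the ID that sbatch printed:

$ sacct -j <jobid> --format=JobID,JobName,Elapsed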

Parallel job script example

Warning

We have tested this example and it works, but the scaling/speed-up is poor and not worth it for this example. If you know the reason, please suggest a change.

When running jobs in parallel, always verify that the job actually scales and that the run time goes down as you use more cores.

When we tested this example on a desktop machine, the speed-up was much better.

Often, a good alternative to running R code in parallel is to launch many sequential R jobs at the same time, each doing its own thing (see the sketch below).
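
One common way to do this is a Slurm job array, where the same job script is submitted many times and each copy receives its own index. As a minimal, hypothetical sketch (the file name and the --array range below are our choices, not part of the example files), the R script can read that index from the SLURM_ARRAY_TASK_ID environment variable which Slurm sets, e.g. when submitted with sbatch --array=1-10:

# array_task.R (hypothetical file name):
# each array task reads the index that Slurm sets and
# works on its own piece of the problem
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
cat("this task would now process work item number", task_id, "\n")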

Let’s start with the job script (parallel.sh), where we ask for 20 cores:

#!/bin/bash

#SBATCH --account=nn9997k
#SBATCH --job-name=example
#SBATCH --partition=normal
#SBATCH --mem=2G
#SBATCH --ntasks=20
#SBATCH --time=00:02:00

# it is good to have the following lines in any bash script
set -o errexit  # make bash exit on any error
set -o nounset  # treat unset variables as errors

module restore
module load R/4.2.1-foss-2022a

Rscript parallel.R > parallel.Rout

Notice how, in the R script (parallel.R), we register these 20 cores with registerDoParallel and how we changed %do% to %dopar%:

library(parallel)
library(foreach)
library(doParallel)


# this function approximates pi by throwing random points into a square
# it is used here to demonstrate a function that takes a bit of time
approximate_pi <- function() {
  # number of points to use
  n <- 2000000

  # generate n random points in the square
  x <- runif(n, -1.0, 1.0)
  y <- runif(n, -1.0, 1.0)

  # count the number of points that are inside the circle
  n_in <- sum(x^2 + y^2 < 1.0)

  4 * n_in / n
}


registerDoParallel(20)

foreach (i=1:100, .combine=c) %dopar% {
  approximate_pi()
}
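
Note that the number 20 is hardcoded both in parallel.sh and in parallel.R, which is easy to get out of sync. As an alternative sketch (the fallback value of 1 is our assumption for runs outside of a job), the R script can instead pick the core count up from the SLURM_NTASKS environment variable which Slurm sets inside the job:

library(doParallel)

# use the core count that Slurm allocated to this job;
# outside of a job, SLURM_NTASKS is unset and we fall back to 1
n_cores <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "1"))
registerDoParallel(n_cores)

To check whether the parallelization actually helps, compare the elapsed times of the sequential and the parallel job, for example with sacct as shown above.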

Which of the many R modules to load?

The short answer is to first get an overview of the available modules:

$ module spider R
$ module spider bioconductor

We have more information here: Selecting the module to load

Installing R libraries

We have a separate page about Installing R libraries.