Data transfer
This page covers the common ways to move data to, from, and between Sigma2 systems: POSIX-based file transfer (rsync, rclone, scp) and object storage via the S3 protocol (NIRD S3, AWS, etc.).
Jump to: POSIX file transfer | S3 object storage transfer
Summary: choosing the right tool
| Dataset size | Scenario | Recommended tool |
|---|---|---|
| Small | Laptop to HPC cluster (few files) | scp or rsync |
| Small | Laptop to HPC cluster (many small files) | rsync |
| Small | Between HPC clusters / NIRD (POSIX) | rsync |
| Large | Between HPC clusters / NIRD (POSIX) | rclone |
| Small | NIRD to HPC cluster (S3) | aws CLI |
| Small | Public Cloud to HPC cluster (S3) | aws CLI |
| Large | NIRD to HPC cluster (S3) | rclone or s5cmd |
| Any | Programmatic access | boto3 (Python) |
Data can be moved to, from, and between NRIS systems using two broad approaches:
- POSIX file transfer (rsync, rclone, scp) — transfers data over SSH using the regular file system.
- S3 object storage transfer (aws CLI, rclone, s5cmd, boto3) — transfers data using the S3 protocol, available for the NIRD Data Lake and external cloud storage such as AWS S3.
POSIX file transfer
Summary: use rsync for file transfer
For file transfer to/from and between compute and storage systems (Betzy, Fram,
Saga, NIRD), we recommend rsync. It is often faster than scp (especially for
many small files, since it does not copy files that are already present at the
destination) and potentially also safer against accidental file overwrites.
For more details, see Advantages over scp and similar tools.
When using rsync, there is no need to zip/tar files first.
On Windows, many other tools exist (WinSCP,
FileZilla,
MobaXterm, and others), but we recommend using rsync through the
Windows Subsystem for Linux (WSL).
For large data transfers rclone is an option which offers better performance through
multiple parallel transfers, see rclone as a faster alternative.
For transfers to and from NIRD, the S3 protocol is also an option. It can reach very high performance as all object transfers are independent of each other, see S3 object storage transfer.
NB: Since the introduction of 2FA (two-factor authentication) on NRIS clusters,
you might experience issues with FileZilla where you are never asked to
provide the 2FA code before the password. To fix this, open the FileZilla
Site Manager and change the Logon Type to Interactive.
Transferring files between your computer and a compute cluster or storage resource
The following commands are a good starting point; below we explain what the components and options mean:
$ rsync --info=progress2 -a file-name username@cluster:receiving-directory
$ rsync --info=progress2 -a directory-name/ username@cluster:receiving-directory/directory-name
- --info=progress2: Shows overall progress (how many percent, how much time is left). You can leave it out if you do not need to know how far the copying has come. There is also a --progress option, but that one shows progress for each file individually, and often you rather want to know the overall progress.
- -a: Preserves ownership and time stamps and includes the -r option, which copies folders recursively.
- file-name or directory-name: The file or directory on your computer that you want to transfer to the receiving server.
- username: Your username on the remote cluster. If your usernames on your local computer and on the remote resource are the same, you can leave out the username@ part.
- cluster: The remote server, for example saga.sigma2.no.
- receiving-directory: The directory on the remote server which will receive the file(s) and/or directories.
If you want to make sure that rsync does not overwrite files that are newer
on the receiving end, add the --update option.
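For example, with the same placeholders as above (files that are already newer in the destination are left untouched):
$ rsync --info=progress2 -a --update directory-name/ username@cluster:receiving-directory/directory-name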
If you want to rsync between two computers that both offer an SSH connection, note that then
you can use rsync both ways: from cluster A to cluster B, but also the reverse.
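For example, to pull a directory from the cluster back to the current directory on your computer (same placeholders as above):
$ rsync --info=progress2 -a username@cluster:receiving-directory/directory-name .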
rsync directory
Please note the trailing slash (/) at the end of the first argument in the
second command when rsyncing directories, i.e.:
$ rsync --info=progress2 -a directory-name/ username@cluster:receiving-directory/directory-name
This trailing slash (/) signifies the contents of the directory directory-name.
The outcome would create a hierarchy like the following on your cluster:
~/receiving-directory/directory-name/contents-of-the-dir
Without the trailing slash, the directory directory-name itself, not just its contents, would be placed within your receiving directory.
The outcome would be the following on the cluster:
~/receiving-directory/directory-name/directory-name/contents-of-the-dir
rsync using compression
If you have a strong CPU at both ends of the line, and you’re on a slow
network, you can save bandwidth by compressing the data with the -z flag:
$ rsync --info=progress2 -az file-name username@cluster:receiving-directory
$ rsync --info=progress2 -az directory-name username@cluster:receiving-directory/directory-name
Problem with many small files
Transferring many small files is often slow (rsync does not seem to mind,
but for scp it can make a big difference; see below). Many
tiny files are often also a problem for parallel file systems. If you develop
programs for high-performance computing, avoid creating very many tiny files.
Advantages over scp and similar tools
- rsync will not transfer files if they already exist and do not differ.
- With rsync --update you can avoid accidentally overwriting newer files in the destination directory.
- You can use compression for file transfer.
- It resumes interrupted transfers.
- It offers more flexibility and better cross-platform support.
Typically people recommend scp for file transfer, and we have also done this
in the past. But let us compare scp with rsync here. In this example we
tried to transfer a 100 MB file from a home computer (not on the fast
university network) to a cluster, either as one large file or split into 5000
smaller files.
For one or few files it does not matter:
$ scp file.txt username@cluster:directory
# 81 sec
$ rsync --info=progress2 -a file.txt username@cluster:directory
# 79 sec
$ rsync --info=progress2 -az file.txt username@cluster:directory
# 61 sec
However, it can matter a lot if you want to transfer many small files.
Notice how the transfer takes 10 times longer with scp:
$ scp -r many-files username@cluster:directory
# 833 sec
$ rsync --info=progress2 -a many-files username@cluster:directory/many-files
# 81 sec
$ rsync --info=progress2 -az many-files username@cluster:directory/many-files
# 62 sec
In the above example, scp struggles with many small files but rsync does
not seem to mind. For scp we would have to first tar/zip the small files
to one large file but for rsync we don’t have to.
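That workaround looks like this (a sketch using the placeholders from above; the archive is streamed over SSH, so no intermediate file is written on either side):
$ tar czf - many-files | ssh username@cluster 'tar xzf - -C directory'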
How was the test data created?
Just in case anybody wants to try the above example on their own, we used this script to generate the example data:
#!/usr/bin/env bash
# create a file that is 100 MB large
base64 /dev/urandom | head -c 100000000 > file.txt
# split into 5000 smaller files
mkdir -p many-files
cd many-files
split -n 5000 ../file.txt
Transferring files between Betzy/Olivia/Saga and NIRD
The easiest way to transfer files between clusters is to set up an SSH key pair. Once (following the guide) a private/public key pair is in place on the sending/receiving side, you will not need to authenticate with password + OTP. Note that the guide above currently only covers cross-cluster connections/transfers.
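A minimal sketch of the key setup (key type and file names are the ssh-keygen defaults; follow the guide above for the authoritative steps):
$ # on the sending cluster: generate a key pair
$ ssh-keygen -t ed25519
$ # then append the contents of ~/.ssh/id_ed25519.pub to
$ # ~/.ssh/authorized_keys on the receiving cluster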
Since NIRD is mounted on the login nodes of Betzy, Olivia, and Saga,
one can use regular cp or mv commands on the cluster login nodes to copy or
move files into or out of the NIRD project areas.
| System | Mount point |
|---|---|
| NIRD Data Peak | /nird/projects |
| NIRD Data Lake | /nird/datalake |
Note
On Saga and Betzy, NIRD is mounted on login nodes only (not compute nodes). On Olivia, it is mounted on SVC nodes (read-write) and compute nodes (read-only). Olivia also supports automatic data staging via Slurm – see Staging In/Out Files from/to NIRD using Slurm.
For more information, please check out the page about Storage areas on HPC clusters.
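For example, copying a file from a NIRD Data Peak project area into a cluster project area on a login node (using the Data Peak mount point from the table above; project numbers and file names are placeholders):
$ cp /nird/projects/NSxxxxK/some-file /cluster/projects/nnXXXXk/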
What to do if rsync is not fast enough?
Disk speed, meta-data performance, network speed, and firewall speed may limit the transfer bandwidth.
If you have access to a network with a large bandwidth and you are sure that
you are limited by the one rsync process and not by something else, you can
start multiple rsync processes by piping a list of paths to xargs or
parallel, which launch multiple rsync instances in parallel. But please
mind that this way you can saturate the network bandwidth for other users and
also saturate the login node with rsync processes or overwhelm the file
system. If you have to transfer large amounts of data and one rsync process is
not enough, we recommend that you talk to us first: Getting help.
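A minimal sketch of this approach (directory names are placeholders; -P 4 runs four rsync processes at a time, and --relative recreates the source paths under the receiving directory):
$ ls -d data/*/ | xargs -P 4 -I {} rsync -a --relative {} username@cluster:receiving-directory/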
Please also plan ahead: if you need to transfer large amounts of data, don't start on the last day of your project. Data transfer may take hours or even days.
rclone as a faster alternative
While rsync does a good job, it unfortunately only uses a single transfer
stream. rclone, in contrast, can run many transfers in parallel (one per
file, and it can also split a large file into chunks). An example copying the
same 5000-file dataset used above from Saga to Olivia:
$ rclone copy SAGA:/cluster/work/users/user/many-files . -P --transfers=30 --ignore-checksum
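Here, SAGA: is an rclone remote pointing at the Saga login nodes over SFTP. A minimal sketch of such a remote in ~/.config/rclone/rclone.conf (the remote name, user, and key path are assumptions; rclone config can set this up interactively):
[SAGA]
type = sftp
host = saga.sigma2.no
user = username
key_file = ~/.ssh/id_ed25519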
The following table illustrates the performance:

| Transfer application | Options / threads | Wall time [seconds] |
|---|---|---|
| scp | -r | 833 |
| rsync | -a | 81 |
| rsync | -az | 62 |
| rclone | | 25 |
| rclone | | 15 |
| rclone | | 11 |
| rclone | | 10 |
| rclone | | 9 |
rclone can keep a large number of operations in flight simultaneously, which
is why it is so efficient.
A command like:
$ rclone copy SAGA:/cluster/projects/nnXXXXk/user/ . -P --transfers=20
will copy 20 files in parallel. A larger example:
$ rclone copy SAGA:/cluster/projects/nnxxxxk/ . -P --transfers=60 --ignore-checksum
Transferred: 200 GiB / 200 GiB, 100%, 2.038 GiB/s, ETA 0s
Checks: 0 / 0, -, Listed 200
Transferred: 200 / 200, 100%
Elapsed time: 1m33.7s
This shows close to 2 GBytes/s – about 7 TBytes per hour, or 150 TBytes/day.
Troubleshooting: “Broken pipe” error during transfer
The organization that provides the network to the clusters may perform daily housekeeping of their DNS, and then connections from outside to the NRIS services can drop. This can cause a “broken pipe” error during file transfer from outside.
One way to avoid this, especially while copying large datasets, is to use IP addresses instead of domain names.
One way to get the IP of one of the login nodes (example: Saga):
$ nslookup saga.sigma2.no
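The returned address can then be used in place of the host name (placeholders as in the earlier examples):
$ rsync --info=progress2 -a file-name username@<ip-address>:receiving-directory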
S3 object storage transfer
The S3 protocol is available for the NIRD Data Lake (including the Central Data Library) and for external cloud storage such as AWS S3. S3 transfers are independent per object, enabling very high parallel throughput – NIRD S3 is capable of up to 27 GB/s.
S3 is the right choice when:
- Your data lives in the NIRD Data Lake or another S3-compatible store.
- You need to fetch data from AWS or another cloud provider.
- You are building automated or scripted pipelines.
- You are integrating with AI/ML frameworks that expect object storage.
Proxy setup (required on all NRIS systems)
All outbound HTTPS traffic from Sigma2 HPC systems goes through a proxy. Set
these variables before using any S3 tool, or add them to your ~/.bashrc.
Olivia:
export http_proxy=http://10.63.2.48:3128/
export https_proxy=http://10.63.2.48:3128/
Saga (and other Sigma2 systems):
export http_proxy=http://proxy.saga:3128/
export https_proxy=http://proxy.saga:3128/
Verify connectivity before starting a transfer:
$ curl -I https://s3.nird.sigma2.no
A successful response returns HTTP/1.1 200 OK.
Credentials
NIRD S3
Your S3 credentials are provided in a file <username>-<project>-s3creds.txt
in your home directory after S3 access is activated. Configure ~/.aws/credentials:
[default]
aws_access_key_id = <your-access-key>
aws_secret_access_key = <your-secret-key>
[s3test]
aws_access_key_id = <Access Key from the credentials file>
aws_secret_access_key = <Secret Key from the credentials file>
And ~/.aws/config:
[default]
region = us-east-1
[profile s3test]
region = us-east-1
s3 =
    multipart_chunksize = 5GB
    multipart_threshold = 2GB
    max_concurrent_requests = 100
AWS S3
$ aws configure
Enter your AWS Access Key ID, Secret Access Key, default region (e.g. eu-west-1),
and output format. Or export credentials directly:
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
Using the AWS CLI
The AWS CLI works for both NIRD S3 (using --endpoint-url) and standard AWS S3.
$ module load awscli
NIRD S3:
# List your buckets
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no s3 ls
# List objects in a bucket
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3 ls s3://username-nsxxxxk-bucketname/
# Download a single file
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3 cp s3://username-nsxxxxk-bucketname/path/to/file.nc ./file.nc
# Download a directory recursively
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3 cp s3://username-nsxxxxk-bucketname/dataset/ ./dataset/ --recursive
# Sync a prefix to local storage
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3 sync s3://username-nsxxxxk-bucketname/input/ ./input/
# Fetch object metadata
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3api head-object --bucket username-nsxxxxk-bucketname --key somefile
# Upload a file
aws --profile s3test --endpoint-url https://s3.nird.sigma2.no \
s3 cp local-file.nc s3://username-nsxxxxk-bucketname/
AWS S3 (no --endpoint-url needed):
# List buckets
aws s3 ls
# Download a single file
aws s3 cp s3://your-bucket-name/path/to/file.nc ./file.nc
# Download a directory recursively
aws s3 cp s3://your-bucket-name/dataset/ ./dataset/ --recursive
# Sync a bucket prefix to local
aws s3 sync s3://your-bucket-name/path/ ./local-path/
# Use a named profile
aws s3 ls --profile my-project-profile
rclone with S3 backend
rclone also connects to S3 endpoints and offers the same parallel transfer
advantage as in the POSIX case. Measured throughput on NIRD S3 reaches ~10 GB/s
with sufficient parallelism.
Add the following to ~/.config/rclone/rclone.conf:
[S3]
type = s3
provider = Ceph
env_auth = false
access_key_id = <your-access-key>
secret_access_key = <your-secret-key>
endpoint = https://s3.nird.sigma2.no
# List objects
rclone ls S3:<user>-ns<project>k-<user>/
# Download a single file (with progress)
rclone copy S3:user-nsXXXXk-user/tmp.medium . -P
# Download multiple files in parallel (wildcards don't work -- use --include)
rclone copy S3:user-nsXXXXk-user/ . --include "*.nc" -P --transfers=20
Example output:
Transferred: 245.027 GiB / 245.027 GiB, 100%, 19.587 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 4m56.9s
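Uploads work the same way in the other direction (bucket and directory names are placeholders):
# Upload a local directory in parallel
rclone copy ./results S3:user-nsXXXXk-user/results -P --transfers=20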
s5cmd – maximum throughput for many objects
s5cmd is significantly faster than the AWS CLI
for large numbers of files or objects, running operations fully in parallel.
# Install if not available as a module
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xzf s5cmd_2.2.2_Linux-64bit.tar.gz
# Download from NIRD S3
./s5cmd --endpoint-url https://s3.nird.sigma2.no \
cp 's3://username-nsxxxxk-bucketname/path/*' ./local-path/
# Download from AWS S3
./s5cmd cp 's3://your-bucket-name/path/*' ./local-path/
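s5cmd also provides a sync subcommand for incremental transfers, which skips objects that already exist locally (a sketch; bucket and paths are placeholders):
# Only fetch objects that are new or changed
./s5cmd --endpoint-url https://s3.nird.sigma2.no \
    sync 's3://username-nsxxxxk-bucketname/path/*' ./local-path/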
Python (boto3)
For scripted or automated workflows, use boto3:
pip install boto3 --user
NIRD S3:
import boto3
s3 = boto3.client(
"s3",
endpoint_url="https://s3.nird.sigma2.no",
aws_access_key_id="your-access-key",
aws_secret_access_key="your-secret-key",
)
# List objects
response = s3.list_objects_v2(Bucket="username-nsxxxxk-bucketname")
for obj in response.get("Contents", []):
print(obj["Key"])
# Download a file
s3.download_file("username-nsxxxxk-bucketname", "path/to/file.nc", "local-file.nc")
AWS S3:
import boto3
s3 = boto3.client("s3", region_name="eu-west-1")
s3.download_file("your-bucket-name", "path/to/file.nc", "local-file.nc")
Note
When running Python scripts on HPC, make sure the proxy environment variables are set in your job script or shell session.
Example Slurm job script
The following fetches input data from NIRD S3, runs an analysis, and pushes results back – all within a single Slurm job on Saga:
#!/bin/bash
#SBATCH --job-name=fetch-and-run
#SBATCH --account=nnXXXXk
#SBATCH --time=02:00:00
#SBATCH --mem=8G
#SBATCH --ntasks=1
# Proxy (required on Saga)
export http_proxy=http://proxy.saga:3128/
export https_proxy=http://proxy.saga:3128/
module load awscli
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
# Fetch input data
aws --endpoint-url https://s3.nird.sigma2.no s3 sync \
s3://username-nsxxxxk-bucketname/input-data/ $SCRATCH/input-data/
# Run analysis
python my_analysis.py --input $SCRATCH/input-data/
# Push results back
aws --endpoint-url https://s3.nird.sigma2.no s3 sync \
$SCRATCH/results/ s3://username-nsxxxxk-bucketname/results/
Troubleshooting S3 transfers
Symptom |
Likely cause |
Fix |
|---|---|---|
|
Proxy not set |
Export |
|
Proxy TLS interception |
Try |
|
Wrong credentials or no bucket permission |
Check access key and bucket permissions |
|
Bucket name typo |
Run |
Slow transfer speed |
Single-threaded tool |
Switch to |
Transfer interrupted |
Network drop / DNS housekeeping |
Re-run; use |