Summary: use rsync for file transfer
For file transfer to/from and between compute and storage systems (Betzy, Fram,
Saga, NIRD), we recommend rsync. This tool is often faster than scp (for
many small files, and because it does not copy files that are already there) and
potentially also safer against accidental file overwrites.
For more details, see Advantages over scp and similar tools.
When using rsync, there is no need to zip/tar files first.
Transferring files between your computer and a compute cluster or storage resource
This is a good starting point but below we will explain what these components and options mean:
$ rsync --info=progress2 -a file-name username@cluster:receiving-directory
$ rsync --info=progress2 -a directory-name/ username@cluster:receiving-directory/directory-name
--info=progress2: Shows the overall progress (percentage done and estimated time left). You can leave it out if you don’t need to follow the progress. There is also a --progress option, but it shows progress for each file individually, and often you would rather know the overall progress.
file-name or directory-name: These are on your computer and you want to transfer them to the receiving server.
-a: Preserves ownership and time stamps and includes the -r option, which copies folders recursively.
username: Your username on the remote cluster. If your usernames on your local computer and on the remote resource are the same, you can leave out the username@ part.
cluster: The remote server. For example: saga.sigma2.no.
receiving-directory: The directory on the remote server which will receive the file(s) and/or directories.
If you want to make sure that rsync does not overwrite files that are newer
on the receiving end, add the --update option.
If you want to rsync between two computers that both offer an SSH connection, note that
you can use rsync both ways: from cluster A to cluster B, but also the reverse.
Please note that there is a trailing slash (/) at the end of the first argument in the
second command above, which is the one used when rsyncing directories, i.e.:
rsync --info=progress2 -a directory-name/ username@cluster:receiving-directory/directory-name
This trailing slash (/) signifies the contents of the directory directory-name.
The outcome is a hierarchy like the following on your cluster:
receiving-directory/directory-name/contents-of-directory-name
Without the trailing slash, directory-name, including the directory itself, would be placed within your receiving directory.
The outcome without the trailing slash (/) on the first argument of the above command, i.e.:
rsync --info=progress2 -a directory-name username@cluster:receiving-directory/directory-name
would create a hierarchy like the following on your cluster:
receiving-directory/directory-name/directory-name/contents-of-directory-name
rsync using compression
If you have a strong CPU at both ends of the line, and you’re on a slow
network, you can save bandwidth by compressing the data with the -z option:
$ rsync --info=progress2 -az file-name username@cluster:receiving-directory
$ rsync --info=progress2 -az directory-name/ username@cluster:receiving-directory/directory-name
Problem with many small files
Many small files are often not great for the transfer (although rsync does
not seem to mind, for scp this can make a big difference, see below). Many
tiny files are often also a problem for parallel file systems. If you develop
programs for high-performance computing, avoid using very many tiny files.
Advantages over scp and similar tools
rsync will not transfer files if they already exist and do not differ.
With rsync --update you can avoid accidentally overwriting newer files in the destination directory.
You can use compression for file transfer.
Resumes interrupted transfers.
More flexibility and better cross-platform support.
Typically people recommend scp for file transfer and we have also done this
in the past. But let us here compare scp with rsync. In this example I
tried to transfer a 100 MB file from my home computer (not on the fast
university network) to a cluster, either as one large file or split into 5000
smaller files.
For one or few files it does not matter:
$ scp file.txt username@cluster:directory # 81 sec
$ rsync --info=progress2 -a file.txt username@cluster:directory # 79 sec
$ rsync --info=progress2 -az file.txt username@cluster:directory # 61 sec
However, it can matter a lot if you want to transfer many small files.
Notice how the transfer takes 10 times longer with scp:
$ scp -r many-files username@cluster:directory # 833 sec
$ rsync --info=progress2 -a many-files username@cluster:directory/many-files # 81 sec
$ rsync --info=progress2 -az many-files username@cluster:directory/many-files # 62 sec
In the above example, scp struggles with many small files but rsync does
not seem to mind. For scp we would have to first zip the small files
into one large file, but for rsync we don’t have to.
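For comparison, the extra packing step that scp would need can be sketched as follows. This is only an illustration, assuming tar with gzip support is available; the temporary directory, the three small files, and the directory name received (a local stand-in for the receiving side) are made up, and the actual scp transfer of the archive is left out.

```shell
#!/usr/bin/env bash
# Pack many small files into one archive and unpack it again.
# All paths are hypothetical throwaway directories.
set -eu
demo=$(mktemp -d)
mkdir -p "$demo/many-files" "$demo/received"
for i in 1 2 3; do
    echo "part $i" > "$demo/many-files/part-$i.txt"
done

# Pack the small files into one archive before the transfer ...
tar -czf "$demo/many-files.tar.gz" -C "$demo" many-files

# ... and unpack it again on the receiving side (here: a local directory).
tar -xzf "$demo/many-files.tar.gz" -C "$demo/received"

ls "$demo/received/many-files"
```

With rsync, neither the packing nor the unpacking step is needed.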
How was the test data created?
Just in case anybody wants to try the above example on their own, we used this script to generate the example data:
#!/usr/bin/env bash

# create a file that is 100 MB large
base64 /dev/urandom | head -c 100000000 > file.txt

# split into 5000 smaller files
mkdir -p many-files
cd many-files
split -n 5000 ../file.txt
Transferring files between Betzy/Fram/Saga and NIRD
Since NIRD is mounted on the login nodes of Betzy, Fram, and Saga,
one can use regular cp or mv commands on the cluster login nodes to copy or
move files into or out of the NIRD project areas.
For more information, please check out the page about Storage areas on HPC clusters.
What to do if rsync is not fast enough?
Disk speed, meta-data performance, network speed, and firewall speed may limit the transfer bandwidth.
If you have access to a network with a large bandwidth and you are sure that
you are limited by the single rsync process and not by something else, you can
start multiple rsync processes by piping a list of paths to
parallel, which launches multiple rsync instances in parallel. But please
mind that this way you can saturate the network bandwidth for other users,
saturate the login node with rsync processes, or overwhelm the file
system. If you have to transfer a large amount of data and one rsync process is
not enough, we recommend that you talk to us first: Getting help.
Please also plan for it: If you need to transfer a large amount of data, don’t start on the last day of your project. Data transfer may take hours or even days.
Troubleshooting: “Broken pipe” error during transfer
The organization which provides the network to the clusters may perform daily housekeeping of their DNS, and the connection from outside to the NRIS services can then drop. This can cause a “broken pipe” error during file transfer from outside.
One way to avoid this, especially while copying large datasets, is to use IP addresses instead of domain names.
One way to get the IP of one of the login nodes (example: Saga):
$ nslookup saga.sigma2.no