NIRD S3
What is object storage?
Object storage is specifically designed to manage large volumes of unstructured data. It stores data as discrete units, known as objects, each accompanied by metadata and a unique identifier. Each object is accessed via its unique key, which ensures that all data blocks belonging to a file are kept together as a single entity, complete with its relevant metadata and unique ID.
In the context of object storage, traditional file storage directories (POSIX) are analogous to ‘buckets.’
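The bucket/key model can be pictured as a flat key-value store: each key maps to one object (its data plus its metadata), and any `/` in a key is simply part of the name rather than a real directory. A minimal illustrative sketch in Python (the function names here are invented for illustration and are not part of any S3 API):

```python
# A bucket modelled as a flat mapping from object key to (data, metadata).
# Keys like "results/run1/output.nc" look hierarchical, but the "/" is just
# part of the key string -- there are no real directories inside a bucket.
bucket = {}

def put_object(key, data, **metadata):
    """Store data together with its metadata under a single unique key."""
    bucket[key] = {"data": data, "metadata": metadata}

def get_object(key):
    """Retrieve the whole object (data + metadata) by its key."""
    return bucket[key]

put_object("results/run1/output.nc", b"...", content_type="application/x-netcdf")
obj = get_object("results/run1/output.nc")
print(obj["metadata"]["content_type"])  # application/x-netcdf
```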
About NIRD S3
On NIRD, object storage is available as a service component within the NIRD Data Lake, supporting the efficient consumption and ingestion of unstructured data. This service also facilitates data streaming, data sharing with external collaborators, and more.
The NIRD Data Lake provides unified access to data, allowing the same data to be accessed simultaneously through multiple protocols, such as POSIX, NFS, and S3.
The object storage on the NIRD Data Lake facilitates seamless collaboration without the need for data staging between locations. It also allows integration with data processing frameworks and AI/ML platforms, providing fast and easy access to large datasets for training models and running simulations.
NIRD S3 comes with enhanced load balancing and high I/O throughput, with a measured throughput of 27 GB/s.
NIRD S3 supports single objects of up to 10 TB in size and up to 100 million objects per bucket.
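Objects of this size are normally uploaded in parts (see the multipart_chunksize setting in the AWS configuration further below). Assuming the common S3 limit of 10,000 parts per multipart upload, a quick back-of-the-envelope check shows that a 10 TB object fits comfortably when 5 GB parts are used:

```python
import math

TB = 1000 ** 4
GB = 1000 ** 3

object_size = 10 * TB   # maximum single-object size on NIRD S3
chunk_size = 5 * GB     # multipart chunk size from the AWS config example below

parts = math.ceil(object_size / chunk_size)
print(parts)            # 2000 parts, well under the usual 10,000-part limit
```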
Working with NIRD S3
This guide provides a brief overview of how to use NIRD S3, though it does not cover all possible use cases and functionalities.
Obtaining S3 Access
To gain access to NIRD S3, you must be a member of an active project on NIRD, have an allocation on the NIRD Data Lake, and own the directory for which you’re requesting S3 access.
Use the RequestS3Access command to submit a request. This request is processed periodically by the system, which will then register an S3 account and create a bucket linked to the specified directory.
nird-login $ RequestS3Access /nird/datalake/NSxxxxK/my-directory
Upon successful request, further information is sent to your email address.
Please read the information in the email carefully and pay particular attention to the NAME of your bucket in the message. Some characters are not permitted in bucket names, so the name of the directory you created for the bucket may have been modified (e.g. underscores replaced).
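The exact renaming rules are not documented here, but the effect can be illustrated with a hypothetical sanitizer. Standard S3 bucket names must be lowercase and may only contain letters, digits, hyphens, and dots; the sketch below assumes NIRD applies similar rules and is for illustration only, not NIRD's actual implementation:

```python
import re

def sanitize_bucket_name(name):
    # Hypothetical illustration: S3 bucket names must be lowercase and may
    # contain only letters, digits, hyphens and dots, so characters such as
    # underscores are typically replaced with an allowed character.
    name = name.lower()
    return re.sub(r"[^a-z0-9.-]", "-", name)

print(sanitize_bucket_name("My_Project_Data"))  # my-project-data
```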
S3 client configuration
Several CLI clients are available, such as the AWS CLI, s3cmd, MinIO Client, and rclone. As a graphical client, Cyberduck may be used. Please consult each tool's documentation for detailed help.
Below are two examples: rclone and the AWS CLI.
rclone
The examples below show how to set up rclone and use its command-line interface. The rclone application can do a range of file transfers (cp-, scp-, and rsync-style operations), but only S3 is shown here.
Create a configuration for rclone, either by running
rclone config
or by editing the config file directly, which is located at:
$HOME/.config/rclone/rclone.conf
Editing the config file directly is simpler than running the interactive configuration, as only a few lines are needed to set up rclone with S3.
An example config could look like this; fill in your credentials from the file on NIRD:
[S3]
type = s3
provider = Other
access_key_id = xxxxxxxxxxx
secret_access_key = yyyyyyyyyyyyy
endpoint = https://s3.nird.sigma2.no
acl = private
After the config file is updated, rclone is ready for use against the NIRD Data Lake.
rclone ls S3:<user>-ns<project>k-<user>/
1099511627776 tmp.dta
102400000000 tmp.m1
263095463936 tmp.medium
1073741824 tmp.small
Once we have listed the object names, we can copy the file we want to use.
rclone copy S3:<user>-ns<project>k-<user>/tmp.medium .
To monitor progress and transfer speed use the option -P.
rclone copy S3:user-nsXXXXk-user/tmp.medium . -P
Transferred: 245.027 GiB / 245.027 GiB, 100%, 19.587 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 4m56.9s
Copying in the other direction:
rclone copy tmp.m1 S3:user-nsXXXXk-user/ -P --ignore-checksum
Since the receiver has to read the whole file to calculate the checksum, which takes a considerable amount of time, checksumming is turned off here so that the measurement reflects the transfer speed only.
It’s possible to run multiple simultaneous transfers when copying many files (shell wildcards don’t work with rclone, hence the --include option):
rclone copy S3:user-nsXXXXk-user/ . --include "tmp.m?"\
--transfers=10 -P --ignore-checksum
Transferred: 953.674 GiB / 953.674 GiB, 100%, 5.225 GiB/s, ETA 0s
Checks: 0 / 0, -, Listed 19
Transferred: 10 / 10, 100%
Elapsed time: 2m22.0s
The combined transfer bandwidth can be quite good. On the other hand, computing checksums at both ends takes time and impedes the transfer, causing it to take considerably longer.
When transferring large amounts of data, quite high transfer speeds can be obtained, as measured from the User Access Nodes (uan, or login nodes). The example below transfers a set of 1000 smaller files of 512 MB each with a varying number of simultaneous transfers (threads).
| Transfers | Transfer speed [GB/s] |
|---|---|
| 10 | 7.376 |
| 20 | 10.26 |
| 30 | 11.15 |
| 40 | 10.20 |
| 50 | 11.33 |
| 60 | 10.56 |
At 10 GB/s it is possible to transfer roughly 36 TB per hour, which is quite good.
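The hourly figure follows directly from the sustained rate (a quick sanity check, using decimal units with 1 TB = 1000 GB):

```python
rate_gb_per_s = 10                 # sustained transfer rate in GB/s
seconds_per_hour = 3600

# Convert GB/hour to TB/hour (1 TB = 1000 GB).
tb_per_hour = rate_gb_per_s * seconds_per_hour / 1000
print(tb_per_hour)                 # 36.0 TB per hour
```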
For more details on what rclone can do, see the rclone documentation.
Amazon Web Services client
Examples using the AWS CLI, with example configuration files listed below.
.aws/config
The [profile s3test] section name in .aws/config must match the [s3test] section name in .aws/credentials.
[default]
region = us-east-1
[profile s3test]
region = us-east-1
s3 =
multipart_chunksize = 5GB
multipart_threshold = 2GB
max_concurrent_requests = 100
.aws/credentials
The values of aws_access_key_id and aws_secret_access_key in .aws/credentials must be taken from the <username>-<project>-s3creds.txt file, as shown below. This file is created automatically after the S3 access request described above and is located in your home directory.
[default]
aws_access_key_id =
aws_secret_access_key =
[s3test]
aws_access_key_id = <Access Key in the file>
aws_secret_access_key = <Secret Key in the file>
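Both .aws/config and .aws/credentials use INI syntax, so the profile values can also be read programmatically with Python's standard configparser. The sketch below parses an inline example; in practice you would read the real file from your home directory:

```python
import configparser

# Example credentials in the INI format used by .aws/credentials. For the real
# file, use cp.read(os.path.expanduser("~/.aws/credentials")) instead.
credentials_text = """
[s3test]
aws_access_key_id = xxxxxxxxxxx
aws_secret_access_key = yyyyyyyyyyyyy
"""

cp = configparser.ConfigParser()
cp.read_string(credentials_text)

print(cp["s3test"]["aws_access_key_id"])  # xxxxxxxxxxx
```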
List bucket
aws --profile s3test --endpoint https://s3.nird.sigma2.no s3 ls s3://username-nsxxxxk-bucketname/
Put object
aws --profile s3test --endpoint https://s3.nird.sigma2.no s3 cp local-file.nc s3://username-nsxxxxk-bucketname/
Get object
aws --profile s3test --endpoint https://s3.nird.sigma2.no s3 cp s3://username-nsxxxxk-bucketname/somefile .
Use S3 API to fetch file metadata
aws --profile s3test --endpoint https://s3.nird.sigma2.no s3api head-object --bucket username-nsxxxxk-bucketname --key somefile