Data Management and Transfer
This section covers the storage and file systems available on the system; the different ways that you can transfer data to and from Cirrus; and how to access data archived prior to the March 2022 Cirrus upgrade.
In all cases of data transfer, users should use the Cirrus login nodes.
Cirrus file systems and storage
The Cirrus service, like many HPC systems, has a complex structure. There are a number of different data storage types available to users:
- Home file system
- Work file systems
- Solid state storage
Each type of storage has different characteristics and policies, and is suitable for different types of use.
There are also two different types of node available to users:
- Login nodes
- Compute nodes
Each type of node sees a different combination of the storage types. The following table shows which storage options are available on different node types:
Storage | Login nodes | Compute nodes | Notes |
---|---|---|---|
Home | yes | no | No backup |
Work | yes | yes | No backup |
Solid state | yes | yes | No backup |
Home file system
Every project has an allocation on the home file system and your project's space can always be accessed via the path `/home/[project-code]`. The home file system is approximately 1.5 PB in size and is implemented using the Ceph technology. This means that this storage is not particularly high performance but is well suited to standard operations like compilation and file editing. This file system is visible from the Cirrus login nodes.
There are currently no backups of any data on the home file system.
Quotas on home file system
All projects are assigned a quota on the home file system. The project PI or manager can split this quota up between groups of users if they wish.
You can view any home file system quotas that apply to your account by logging into SAFE and navigating to the page for your Cirrus login account.
- Log into SAFE
- Use the "Login accounts" menu and select your Cirrus login account
- The "Login account details" table lists any user or group quotas that are linked with your account. (If there is no quota shown for a row then you have an unlimited quota for that item, but you may still may be limited by another quota.)
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the system itself.
From the command line
Some useful information on the current contents of directories on the `/home` file system is available from the command line by using the Ceph command `getfattr`. This is to be preferred over standard Unix commands such as `du` for reasons of efficiency.
For example, the number of entries (files plus directories) in a home directory can be queried via:

```
$ cd
$ getfattr -n ceph.dir.entries .
# file: .
ceph.dir.entries="33"
```
The corresponding attribute `rentries` gives the recursive total in all subdirectories, that is, the total number of files and directories:

```
$ getfattr -n ceph.dir.rentries .
# file: .
ceph.dir.rentries="1619179"
```
Other useful attributes (all prefixed with `ceph.dir.`) include `files`, which is the number of ordinary files; `subdirs`, the number of subdirectories; and `bytes`, the total number of bytes used. All these have a corresponding recursive version, respectively: `rfiles`, `rsubdirs`, and `rbytes`.
A full path name can be specified if required.
Work file system
Every project has an allocation on the work file system and your project's space can always be accessed via the path `/work/[project-code]`. The work file system is approximately 1 PB in size and is implemented using the Lustre parallel file system technology. It is designed to support data in large files; performance for data stored in large numbers of small files will be lower.
There are currently no backups of any data on the work file system.
Ideally, the work file system should only contain data that is:
- actively in use;
- recently generated and in the process of being saved elsewhere; or
- being made ready for up-coming work.
In practice it may be convenient to keep copies of datasets on the work file system that you know will be needed at a later date. However, make sure that important data is always backed up elsewhere and that your work would not be significantly impacted if the data on the work file system was lost.
If you have data on the work file system that you are not going to need in the future please delete it.
Quotas on the work file system
As for the home file system, all projects are assigned a quota on the work file system. The project PI or manager can split this quota up between groups of users if they wish.
You can view any work file system quotas that apply to your account by logging into SAFE and navigating to the page for your Cirrus login account.
- Log into SAFE
- Use the "Login accounts" menu and select your Cirrus login account
- The "Login account details" table lists any user or group quotas that are linked with your account. (If there is no quota shown for a row then you have an unlimited quota for that item, but you may still may be limited by another quota.)
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the system itself.
You can also examine up-to-date quotas and usage on the Cirrus system itself using the `lfs quota` command. To do this, change directory to the work directory where you want to check the quota. For example, to check the quota for user `auser` in project `t01`:

```
[auser@cirrus-login1 auser]$ cd /work/t01/t01/auser
[auser@cirrus-login1 auser]$ lfs quota -hu auser .
Disk quotas for usr auser (uid 68826):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
              .  5.915G      0k      0k       -   51652       0       0       -
uid 68826 is using default block quota setting
uid 68826 is using default file quota setting
```

The quota and limit of `0k` here indicate that no user quota is set for this user.
To check your project (group) quota, you would use the command:

```
[auser@cirrus-login1 auser]$ lfs quota -hg t01 .
Disk quotas for grp t01 (gid 37733):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
              .  958.3G      0k  13.57T       - 1427052       0       0       -
gid 37733 is using default file quota setting
```

The limit of `13.57T` indicates the quota for the group.
Solid state storage
More information on using the solid state storage can be found in the /user-guide/solidstate section of the user guide.
The solid state storage is not backed up.
Accessing Cirrus data from before March 2022
Prior to the March 2022 Cirrus upgrade, all user data on the `/lustre/sw` file system was archived. Users can access their archived data from the Cirrus login nodes in the `/home-archive` directory. Assuming you are user `auser` from project `x01`, your pre-rebuild archived data can be found in:

```
/home-archive/x01/auser
```
The data in the `/home-archive` file system is read-only, meaning that you will not be able to create, edit, or copy new information to this file system.
To make archived data visible from the compute nodes, you will need to copy the data from the `/home-archive` file system to the `/home` file system. Assuming again that you are user `auser` from project `x01` and that you want to copy data from `/home-archive/x01/auser/directory_to_copy` to `/home/x01/x01/auser/destination_directory`, you would do this by running:

```
cp -r /home-archive/x01/auser/directory_to_copy \
   /home/x01/x01/auser/destination_directory
```
Note that the project code appears once in the path for the old home archive and twice in the path on the new /home file system.
Note
The capacity of the home file system is much larger than that of the work file system, so you should move data to home rather than work.
Archiving
If you have related data that consists of a large number of small files
it is strongly recommended to pack the files into a larger "archive"
file for ease of transfer and manipulation. A single large file makes
more efficient use of the file system and is easier to move and copy and
transfer because significantly fewer meta-data operations are required.
Archive files can be created using tools like `tar` and `zip`.
tar
The `tar` command packs files into a "tape archive" format. The command has the general form:

```
tar [options] [file(s)]
```
Common options include:

- `-c` create a new archive
- `-v` verbosely list files processed
- `-W` verify the archive after writing
- `-l` confirm all file hard links are included in the archive
- `-f` use an archive file (for historical reasons, tar writes its output to stdout by default rather than a file)

Putting these together:

```
tar -cvWlf mydata.tar mydata
```

will create and verify an archive.
To extract files from a tar file, the option `-x` is used. For example:

```
tar -xf mydata.tar
```

will recover the contents of `mydata.tar` to the current working directory.
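If you want to extract an archive somewhere other than the current directory, the `-C` option changes to the target directory first. A minimal sketch (the file and directory names here are illustrative, not part of the guide):

```shell
# Create some sample data to archive (names are illustrative)
mkdir -p mydata
echo "hello" > mydata/file.txt

# Pack the directory into a tar archive
tar -cf mydata.tar mydata

# Extract into a separate directory using -C (the directory must already exist)
mkdir -p restore
tar -xf mydata.tar -C restore

# The contents are now under restore/mydata
ls restore/mydata
```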
To verify an existing tar file against a set of data, the `-d` (diff) option can be used. By default, no output is given if verification succeeds; an example of a failed verification follows:

```
$> tar -df mydata.tar mydata/*
mydata/damaged_file: Mod time differs
mydata/damaged_file: Size differs
```
Note
tar files do not store checksums with their data, requiring the original data to be present during verification.
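Because tar stores no checksums, a common workaround is to record a checksum file alongside the archive at creation time and verify it after transfer. A minimal sketch (file names are illustrative):

```shell
# Create sample data and archive it (names are illustrative)
mkdir -p mydata
echo "important" > mydata/results.txt
tar -cf mydata.tar mydata

# Record a SHA-256 checksum alongside the archive
sha256sum mydata.tar > mydata.tar.sha256

# After transferring both files, verify the archive is intact
sha256sum -c mydata.tar.sha256
```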
Tip
Further information on using `tar` can be found in the `tar` manual (accessed via `man tar`).
zip
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
```
zip [options] mydata.zip [file(s)]
```
Common options are:

- `-r` used to zip up a directory
- `-#` where "#" represents a digit ranging from 0 to 9 to specify the compression level, 0 being the least and 9 the most. The default compression is `-6`, but we recommend using `-0` to speed up the archiving process.

Together:

```
zip -0r mydata.zip mydata
```

will create an archive.
Note
Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.
The corresponding `unzip` command is used to extract data from the archive. The simplest use case is:

```
unzip mydata.zip
```

which recovers the contents of the archive to the current working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. `unzip` provides options for verifying this checksum against the stored files. The relevant flag is `-t` and is used as follows:

```
$> unzip -t mydata.zip
Archive:  mydata.zip
    testing: mydata/                  OK
    testing: mydata/file              OK
No errors detected in compressed data of mydata.zip.
```
Tip
Further information on using `zip` can be found in the `zip` manual (accessed via `man zip`).
Data transfer
Before you start
Read Harry Mangalam's guide on How to transfer large amounts of data via network. This tells you all you want to know about transferring data.
Data Transfer via SSH
The easiest way of transferring data to/from Cirrus is to use one of the standard programs based on the SSH protocol, such as `scp`, `sftp` or `rsync`. These all use the same underlying mechanism (SSH) as you normally use to log in to Cirrus. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine.
To avoid having to type in your password multiple times you can set up an SSH key as documented in the connecting section of the User Guide.
SSH Transfer Performance Considerations
The ssh protocol encrypts all traffic it sends. This means that file-transfer using ssh consumes a relatively large amount of CPU time at both ends of the transfer. The encryption algorithm used is negotiated between the ssh-client and the ssh-server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The arcfour algorithm is usually quite fast assuming both hosts support it.
A single ssh based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
In addition, you should consider the following when transferring data.
- Only transfer those files that are required. Consider which data you really need to keep.
- Combine lots of small files into a single tar archive, to reduce the overheads associated in initiating many separate data transfers (over SSH each file counts as an individual transfer).
- Compress data before sending it, e.g. using gzip.
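The last two points can be combined using `tar`'s `-z` flag, which gzip-compresses the archive as it is created. A minimal sketch (the directory name and destination path are illustrative):

```shell
# Pack and gzip-compress a directory of small files in one step
mkdir -p mydata
echo "sample" > mydata/file.txt
tar -czf mydata.tar.gz mydata

# List the archive contents to confirm what was packed
tar -tzf mydata.tar.gz

# Then transfer the single compressed file, e.g.
# scp mydata.tar.gz user@login.cirrus.ac.uk:/work/t01/t01/user/
```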
scp command
The `scp` command creates a copy of a file or, if given the `-r` flag, a directory on a remote machine.
For example, to transfer files to Cirrus:

```
scp [options] source user@login.cirrus.ac.uk:[destination]
```

(Remember to replace `user` with your Cirrus username in the example above.)
In the above example, the `[destination]` is optional; when it is left out, `scp` will simply copy the source into the user's home directory. Also, the `source` should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.
Tip
If your local version of OpenSSH (the software suite providing `scp`) is very new, you may see errors transferring data to Cirrus using `scp` where the version of OpenSSH is older. The errors typically look like `scp: upload "mydata": path canonicalization failed`. You can get around this issue by adding the `-O` option to `scp`.
If you want to request a different encryption algorithm, add the `-c [algorithm-name]` flag to the `scp` options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:

```
scp [options] -c aes128-ctr source user@login.cirrus.ac.uk:[destination]
```

(Remember to replace `user` with your Cirrus username in the example above.)
rsync command
The `rsync` command can also transfer data between hosts using an SSH connection. It creates a copy of a file or, if given the `-r` flag, a directory at the given destination, similar to `scp` above.
Given the `-a` option, rsync can also make exact copies (including permissions); this is referred to as mirroring. In this case the `rsync` command is executed with ssh to create the copy on a remote machine.
To transfer files to Cirrus using `rsync` the command should have the form:

```
rsync [options] -e ssh source user@login.cirrus.ac.uk:[destination]
```

(Remember to replace `user` with your Cirrus username in the example above.)
In the above example, the `[destination]` is optional; when it is left out, `rsync` will simply copy the source into the user's home directory. Also, the `source` should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying `ssh` command by using a quoted string as the argument of the `-e` flag, e.g.:

```
rsync [options] -e "ssh -c aes128-ctr" source user@login.cirrus.ac.uk:[destination]
```

(Remember to replace `user` with your Cirrus username in the example above.)
Data transfer using rclone
Rclone is a command-line program to manage files on cloud storage. You can transfer files directly to/from cloud storage services, such as MS OneDrive and Dropbox. The program preserves timestamps and verifies checksums at all times.
First of all, you must download and unzip rclone on Cirrus:

```
wget https://downloads.rclone.org/v1.62.2/rclone-v1.62.2-linux-amd64.zip
unzip rclone-v1.62.2-linux-amd64.zip
cd rclone-v1.62.2-linux-amd64/
```

The previous code snippet uses rclone v1.62.2, which was the latest version when these instructions were written.
Configure rclone using `./rclone config`. This will guide you through an interactive setup process where you can make a new remote (called `remote`). The rclone documentation provides detailed instructions for each supported cloud storage provider.
Please note that a token is required to connect from Cirrus to the cloud service. You need a web browser to get the token. The recommendation is to run rclone on your laptop using `rclone authorize`, get the token, and then copy the token from your laptop to Cirrus. The rclone website contains further instructions on configuring rclone on a remote machine without a web browser.
Once all the above is done, you're ready to go. If you want to copy a directory, please use:

```
rclone copy <cirrus_directory> remote:<cloud_directory>
```

Please note that "remote" is the name that you chose when running `rclone config`. To copy files, please use:

```
rclone copyto <cirrus_file> remote:<cloud_file>
```
Note
If the session times out while the data transfer takes place, adding the `-vv` flag to an rclone transfer forces rclone to output to the terminal and therefore avoids triggering the timeout process.
Data transfer using Globus
The Cirrus `/work` file system, which is hosted on the e1000 fileserver, has a Globus Collection (formerly known as an endpoint) with the name "e1000-fs1 directories".
A full step-by-step guide for using Globus to transfer files to/from Cirrus `/work` is available.