Data Storage¶
Clusters often provide multiple file systems with different characteristics. For data-intensive jobs, it is particularly important to be familiar with each file system's specifications, since each makes a different trade-off between read/write rate and storage capacity. Some file systems optimize parallel I/O efficiency, some are designed specifically for large file storage and data integrity, and some are designed for file exchange over a heterogeneous network.
In general, the /home mount point is meant to store executables, scripts and user settings. It is typically mounted from a virtual device with low capacity and moderate read/write rates.
For example, on the Ant cluster, the /home
mount point
has 4 TB of storage with a 1.6 Gbps read rate,
a 530 Mbps serial write rate on compute nodes (NFS) and
an 850 Mbps serial write rate on the head node (ZFS).
Concurrent write operations from multiple nodes decrease the write rate.
Data storage generally uses parallel file systems like Lustre or
BeeGFS. These are designed to allow fast write operations in parallel,
i.e. the bandwidth is additive.
For example, on the Ant cluster, the /work
mount point
has 204 TB of storage with
a \(n\times\)2.7 Gbps parallel read rate and
a \(n\times\)2.1 Gbps parallel write rate,
with \(n\) the number of nodes that read or write concurrently.
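As a back-of-the-envelope illustration (assuming the per-node rate quoted above and ideal, purely additive scaling):
# hypothetical aggregate write rate on /work when 8 nodes write concurrently,
# assuming ideal (additive) scaling of the 2.1 Gbps per-node rate quoted above
n=8
echo "scale=1; $n * 2.1" | bc   # -> 16.8 (Gbps)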
Metadata (file system tree, file permissions, user id, group id, timestamps,
size, etc.) and data are usually stored on separate servers so that metadata
I/O is less affected by network congestion.
Some clusters offer a way to request temporary access to a large amount of data storage, typically for one week, for data-intensive jobs. All data is lost when the storage expires; users need to back up the data locally beforehand.
To show which file systems are available on a node, run:
df -h -T
Here is the output on the Ant head node:
Filesystem                 Type    Size  Used  Avail  Use%  Mounted on
tmpfs                      tmpfs   189G   11G   179G    6%  /dev/shm
/dev/mapper/vgroot-lvroot  xfs      98G   32G    67G   32%  /
beegfs_nodev               beegfs  204T   36T   168T   18%  /work
pool_data/home             zfs     4.0T  1.9T   2.2T   48%  /home
cvmfs2                     fuse    9.8G  2.7G   7.1G   28%  /cvmfs/cvmfs-config.cern.ch
cvmfs2                     fuse    9.8G  2.7G   7.1G   28%  /cvmfs/software.eessi.io
Here is the output on Ant compute nodes:
Filesystem                           Type    Size  Used  Avail  Use%  Mounted on
tmpfs                                tmpfs   189G  266M   189G    1%  /dev/shm
/dev/sda4                            xfs     438G   19G   420G    5%  /
beegfs_nodev                         beegfs  204T   37T   168T   18%  /work
10.1.1.254:/home                     nfs4    4.0T  1.9T   2.2T   47%  /home
10.1.1.254:/cvmfs/software.eessi.io  nfs     9.8G   37M   9.8G    1%  /cvmfs/software.eessi.io
10.1.1.254:/opt/software             nfs4    2.0T   29G   2.0T    2%  /opt/software
Note how the /home mount point is a ZFS on the head node but is exposed as an NFS on compute nodes, which explains the difference in write rates between the head node and the compute nodes.
The /work
mount point is a BeeGFS on all nodes,
which makes it both fast and scalable.
The /dev/shm
workspace is a tmpfs, which makes it extremely
fast (6.4 Gbps read rate), but it is volatile and local to the node.
The EESSI software stack is injected via CernVM-FS on the head node and is exposed as an NFS on compute nodes.
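To check which file system backs a particular path yourself, standard Linux tools are enough (a minimal sketch; the paths are the Ant mount points shown above):
# print the file system type behind each path of interest
for p in /home /work /dev/shm; do
    printf '%-10s %s\n' "$p" "$(stat -f -c '%T' "$p")"
done
# show the source device and mount options of a single mount point
findmnt -T /home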
When writing a large number of small files (in the millions), the number of inodes becomes relevant. Some file systems are optimized to store a small number of large files, and may only have a few million inodes to spare. To show inode usage, run:
df -h -i
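To estimate how many inodes a given directory tree consumes (a sketch assuming GNU coreutils; the path is only an example):
# number of files and directories (roughly the number of inodes) under a directory
du --inodes -s /work/${USER}
# portable alternative
find /work/${USER} | wc -l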
Data integrity¶
Some file systems can leverage RAID for data redundancy via distributed parity blocks; in the event of a drive failure, the lost data is reconstructed on-the-fly by XOR-ing the parity block with the surviving data blocks of the same stripe, which incurs extra read operations and slows down I/O until the failed drive is replaced and rebuilt.
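As a toy illustration of how parity reconstruction works (plain bash arithmetic on single bytes, not an actual RAID implementation):
# two data blocks (single bytes here) and their parity, stored on three different drives
a=0xA5; b=0x3C
p=$(( a ^ b ))
# if the drive holding "a" fails, its content is recovered from the surviving block and the parity
printf 'recovered a = 0x%X\n' $(( b ^ p ))   # -> 0xA5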
Some file systems can be configured to automatically duplicate all data. Each block is written twice to distinct physical drives or servers. I/O operations in the event of a drive failure don’t incur any overhead. The main downside is a halving of the available storage space.
Volatile file systems and time-limited workspaces can be used to store non-permanent data with exceptionally high read/write rates. This is ideal to store checkpoint files, restart files, and temporary files in data analysis workflows.
Types of file systems¶
NFS¶
The Network File System (NFS) is a distributed file system. It provides the same API as a local file system, but read and write operations involve network communication under the hood.
This file system can suffer from latency issues: if the network bandwidth is saturated, it is possible to write to a file and not be able to read it back for a few milliseconds. Files might also carry a timestamp in the future when workstations have slightly different clocks, which causes issues in software that relies on accurate timestamps, such as CMake. File transfer is a serial process: the entire file is sent from the server to the client as a single chunk, which is inadequate in an HPC environment.
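To spot files whose timestamps lie in the future on an NFS mount, GNU find can compare modification times against the current clock (a sketch; the path is only an example):
# list files modified "after now", i.e. with a timestamp ahead of this node's clock
find /home/${USER} -newermt now -print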
BeeGFS¶
BeeGFS is a parallel file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Large files are split into smaller chunks that are stored on different drives in parallel to maximize read and write bandwidth (striping). Read data is cached in the servers' RAM, so frequent read operations on the same chunks are almost free after the first read. Buddy Mirroring, if enabled, stores each stripe twice on different drives. Metadata and data are stored on separate servers. BeeGFS data can be accessed through NFS.
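Assuming the BeeGFS client tools are installed, the stripe pattern, chunk size and storage targets of a path on /work can be inspected with beegfs-ctl (a sketch; the file name is an example):
# show striping details of a file or directory on the BeeGFS mount
beegfs-ctl --getentryinfo /work/${USER}/large_output.dat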
Lustre¶
Lustre is a parallel file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Data is not stored in files, but in multiple blobs (objects), which are scattered over multiple drives in parallel to maximize read and write bandwidth (striping). One object may contain parts of multiple files. Metadata and data are stored on separate servers; to fully benefit from this design, Linux tools need to be replaced by Lustre-specific tools, such as lfs find instead of find and star instead of tar, or be given extra command line options, such as ls --color=never instead of ls, to avoid expensive calls to stat().
Lustre doesn’t scale well in jobs that handle a large number of small files.
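On Lustre, striping is configured per file or per directory with the lfs tool (a sketch, assuming a Lustre mount at /lustre, which does not exist on Ant):
# stripe new files created in this directory over 4 object storage targets
lfs setstripe -c 4 /lustre/${USER}/output
# show the stripe layout of an existing file
lfs getstripe /lustre/${USER}/output/result.dat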
GPFS¶
GPFS is a clustered file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Large files are split into 1 MB chunks that are stored on different drives in parallel to maximize read and write bandwidth (striping). GPFS scales well in jobs that handle a large number of small files. Metadata and data are stored on separate servers. GPFS data can be accessed through NFS.
CephFS¶
The Ceph File System (CephFS) combines multiple drives and exposes them to the operating system as a single virtual drive. It is primarily designed for storage of large files and doesn't offer the same parallel I/O efficiency as Lustre or BeeGFS, but comes with a built-in data replication mechanism.
CernVM-FS¶
The CernVM File System (CernVM-FS) is a distributed file system. It uses a FUSE module and delivers files over the internet. It is used by EESSI to dynamically fetch software when users invoke module load.
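Because files are fetched lazily, simply listing the repository is enough to trigger the network transfer of the required catalogs (the repository name is taken from the df output above):
# the first access may take a moment while CernVM-FS downloads the catalog
ls /cvmfs/software.eessi.io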
ZFS¶
The Zettabyte File System (ZFS) combines multiple drives and exposes them to the operating system as a single virtual drive. Data written to disk is compressed on-the-fly to speed up read and write operations of low-entropy data. Deduplication can also be enabled; it requires extra CPU power and large amounts of RAM, which can negatively impact SLURM jobs that allocate the entire RAM of the node.
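Where you have shell access to the file server, the relevant dataset properties can be queried with the zfs tool (a sketch; the dataset name is taken from the df output above):
# show whether compression and deduplication are enabled on the /home dataset
zfs get compression,dedup pool_data/home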
tmpfs¶
The temporary file system (tmpfs) stores data directly in RAM, allowing for fast read and write operations, but is limited by the amount of available RAM. Data is local to the node and is lost on reboot or power loss.
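A common pattern is to create a private scratch directory under /dev/shm and remove it at the end of the job, since the space counts against the node's RAM (a minimal sketch):
# per-user scratch directory in RAM; delete it when done to free the memory
SCRATCH=$(mktemp -d -p /dev/shm ${USER}-XXXXXX)
# ... run the I/O-heavy part of the job inside ${SCRATCH} ...
rm -rf ${SCRATCH}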
Linux PAM memory¶
Clusters may mount a tmpfs shared memory folder under ${XDG_RUNTIME_DIR}
upon user login and unmount it upon user logout (this environment variable
typically expands to /run/user/${UID}
). Its maximum size is fixed, but memory is only allocated when files are actually written to it (dynamic allocation).
Even though SLURM jobs run under the same user id as the user who submitted them, SLURM does not log the user in, and thus pam_systemd won't mount this folder when the job runs!
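A defensive pattern in job scripts is therefore to check whether ${XDG_RUNTIME_DIR} is usable and fall back to another writable location (a sketch; the fallback path is only an example):
# inside a SLURM job script: ${XDG_RUNTIME_DIR} may be unset or not mounted
if [ -z "${XDG_RUNTIME_DIR}" ] || [ ! -d "${XDG_RUNTIME_DIR}" ]; then
    export XDG_RUNTIME_DIR=/dev/shm/${USER}
    mkdir -p "${XDG_RUNTIME_DIR}"
fi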
Benchmarking file systems¶
To benchmark read/write rates with dd:
# create a large file on RAM filled with random data
mkdir -p /dev/shm/${USER}
dd if=/dev/urandom of=/dev/shm/${USER}/urandom.bin bs=1G count=10 status=none
# measure write rate (conv=fdatasync flushes the data to the file system so the page cache does not inflate the result)
dd if=/dev/shm/${USER}/urandom.bin of=${HOSTNAME}-urandom.bin bs=1G count=10 conv=fdatasync
# measure read rate (on the node that just wrote the file, the data may still sit in the page cache and inflate the result)
dd if=${HOSTNAME}-urandom.bin of=/dev/null bs=1G count=10
# free the RAM
rm /dev/shm/${USER}/urandom.bin
# free the hard drive
rm *.bin
We use a source file in RAM to avoid skewing the results.
Reading data directly from /dev/urandom
would introduce
too much overhead from the random number generator, and reading
from /dev/zero
could inflate write rates if the file system
uses on-the-fly data compression (e.g. ZFS).
The write operation can be executed simultaneously on multiple nodes (hence the ${HOSTNAME} in the output file name), in which case the per-node write rate might decrease slightly; the total write rate, however, is the sum of the individual rates. The situation is different on serial file systems, where the write rate decreases significantly under concurrent writes, roughly by a factor of \(1/n\).
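For instance, with SLURM the write benchmark can be launched on several nodes at once (a sketch, assuming a multi-node allocation and that the source file in /dev/shm was created on every node beforehand):
# one dd per node; single quotes so ${HOSTNAME} expands on each compute node
srun --nodes=4 --ntasks-per-node=1 bash -c \
    'dd if=/dev/shm/${USER}/urandom.bin of=${HOSTNAME}-urandom.bin bs=1G count=10 conv=fdatasync'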
To benchmark read access from a local network:
scp ${USER}@ant:urandom.bin /dev/null
This command can only measure the read rate once. Subsequent calls may return inflated read rates, because the server keeps the file content cached in otherwise unused RAM to speed up future reads (this corresponds to the buff/cache figure in the output of top).
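If you have root access on the file server, the page cache can be dropped between runs so that each transfer measures the actual drive read rate (root only; avoid doing this on a node that is running other users' jobs):
# on the server, as root: flush dirty pages, then drop the page cache
sync
echo 3 > /proc/sys/vm/drop_caches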
Archiving¶
Research data should follow the FAIR principles [Wilkinson et al., 2016]. The University of Stuttgart offers the DaRUS platform to archive research datasets. Data can be uploaded via the web browser or an API (user guide). The file size limit is 100 GB (BigData user guide).
Larger files can be stored on magnetic tape drives, which have 18 TB of storage capacity and are suitable for long-term archival.
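To stay below the 100 GB per-file limit, a large dataset can be packed and split into fixed-size chunks before upload (a sketch; file names are examples):
# pack the dataset and split it into 95 GB chunks (dataset.tar.gz.part_aa, _ab, ...)
tar -czf - dataset/ | split -b 95G - dataset.tar.gz.part_
# to restore later
cat dataset.tar.gz.part_* | tar -xzf -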