Data Storage

Clusters often provide multiple file systems with different characteristics. For data-intensive jobs, it is particularly important to be familiar with the specifications of each file system, since each one trades off read/write rate against storage capacity. Some file systems optimize parallel I/O efficiency, some are designed specifically for large file storage and data integrity, and some are designed for file exchange over a heterogeneous network.

In general, the /home mount point is meant to store executables, scripts and user settings. It is typically mounted from a virtual device with low capacity and moderate read/write rates. For example, on the Ant cluster, the /home mount point has 4 TB of storage with a 1.6 Gbps read rate, a 530 Mbps serial write rate on compute nodes (NFS) and an 850 Mbps serial write rate on the head node (ZFS). Concurrent write operations from multiple nodes decrease the write rate.

Data storage generally uses parallel file systems like Lustre or BeeGFS. These are designed to allow fast write operations in parallel, i.e. the bandwidth is additive. For example, on the Ant cluster, the /work mount point has 204 TB of storage with a \(n\times\)2.7 Gbps parallel read rate and a \(n\times\)2.1 Gbps parallel write rate, with \(n\) the number of nodes that read or write concurrently. Metadata (file system tree, file permissions, user id, group id, timestamps, size, etc.) and data are usually stored on separate servers so that metadata I/O is less affected by network congestion.

Some clusters offer a way to request temporary access to a large amount of data storage, typically for one week, for data-intensive jobs. All data is lost when the storage expires; users need to back up the data elsewhere beforehand.
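
As an illustration, on clusters that deploy the hpc-workspace tools (an assumption; the exact mechanism and command names vary between sites), requesting, listing and releasing such a workspace could look like this:

# request a workspace named "bigrun" for 7 days (name and duration are placeholders)
ws_allocate bigrun 7
# list active workspaces and their expiration dates
ws_list
# release the workspace once the data has been backed up
ws_release bigrun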

To show which file systems are available on a node, run:

df -h -T

Here is the output on the Ant head node:

Filesystem                   Type      Size  Used Avail Use% Mounted on
tmpfs                        tmpfs     189G   11G  179G   6% /dev/shm
/dev/mapper/vgroot-lvroot    xfs        98G   32G   67G  32% /
beegfs_nodev                 beegfs    204T   36T  168T  18% /work
pool_data/home               zfs       4.0T  1.9T  2.2T  48% /home
cvmfs2                       fuse      9.8G  2.7G  7.1G  28% /cvmfs/cvmfs-config.cern.ch
cvmfs2                       fuse      9.8G  2.7G  7.1G  28% /cvmfs/software.eessi.io

Here is the output on Ant compute nodes:

Filesystem                          Type      Size  Used Avail Use% Mounted on
tmpfs                               tmpfs     189G  266M  189G   1% /dev/shm
/dev/sda4                           xfs       438G   19G  420G   5% /
beegfs_nodev                        beegfs    204T   37T  168T  18% /work
10.1.1.254:/home                    nfs4      4.0T  1.9T  2.2T  47% /home
10.1.1.254:/cvmfs/software.eessi.io nfs       9.8G   37M  9.8G   1% /cvmfs/software.eessi.io
10.1.1.254:/opt/software            nfs4      2.0T   29G  2.0T   2% /opt/software

Note how the /home mount point is a ZFS on the head node but is exposed as an NFS share on compute nodes, which explains the difference in write rate between the head node and the compute nodes. The /work mount point is a BeeGFS on all nodes, which makes it both fast and scalable. The /dev/shm workspace is a tmpfs, which makes it extremely fast (6.4 Gbps read rate), but it is volatile and local to the node. The EESSI software stack is injected via CernVM-FS on the head node and exposed as an NFS share on compute nodes.

When writing a large number of small files (in the millions), the number of inodes becomes relevant. Some file systems are optimized to store a small number of large files, and may only have a few million inodes to spare. To show inode usage, run:

df -h -i

Data integrity

Some file systems can leverage RAID for data redundancy via distributed parity blocks; in the event of a drive failure, the lost data can be reconstructed on the fly by combining the parity block with the surviving blocks of its stripe (a boolean XOR), at the cost of extra read operations.
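
As a minimal illustration with three data blocks \(D_1\), \(D_2\), \(D_3\) and a parity block \(P = D_1 \oplus D_2 \oplus D_3\): if the drive holding \(D_2\) fails, the block is recovered as \(D_2 = P \oplus D_1 \oplus D_3\), which requires reading all surviving blocks of the stripe.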

Some file systems can be configured to automatically duplicate all data. Each block is written twice to distinct physical drives or servers. I/O operations in the event of a drive failure don’t incur any overhead. The main downside is a halving of the available storage space.

Volatile file systems and time-limited workspaces can be used to store non-permanent data with exceptionally high read/write rates. This is ideal for storing checkpoint files, restart files, and temporary files in data analysis workflows.

Types of file systems

NFS

The Network File System (NFS) is a distributed file system. It provides the same API as a local file system; under the hood, however, read and write operations involve network communication.

This file system can suffer from latency issues: when the network bandwidth is saturated, it is possible to write to a file and not be able to read it back for a few milliseconds, and files may receive a timestamp in the future when different workstations have slightly different clocks, causing issues in software that relies on accurate timestamps, such as CMake. File transfer is a serial process: the entire file is sent from the server to the client as a single chunk, which is inadequate in an HPC environment.
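
To check which NFS version and transfer sizes are in effect on a node, the following commands can be used (assuming the nfs-utils package is installed):

# list mounted NFS file systems
mount -t nfs,nfs4
# show per-mount options such as vers, rsize and wsize
nfsstat -m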

BeeGFS

BeeGFS is a parallel file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Large files are split into smaller chunks that are stored on different drives in parallel to maximize read and write bandwidth (striping). Read data is cached in the server's RAM, so that frequent read operations on the same data chunks are almost free after the first read operation. Buddy Mirroring, if enabled, stores each stripe twice on different drives. Metadata and data are stored on separate servers. BeeGFS data can be accessed through NFS.
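
On systems where the BeeGFS command line tools are available, the stripe settings and the usage of the individual storage targets can be inspected as follows (the /work/${USER} path is only an example):

# show the stripe pattern (chunk size, number of storage targets) of a directory
beegfs-ctl --getentryinfo /work/${USER}
# show capacity and usage of each metadata and storage target
beegfs-df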

Lustre

Lustre is a parallel file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Data is not stored in files, but in multiple blobs (objects), which are scattered over multiple drives in parallel to maximize read and write bandwidth (striping). One object may contain parts of multiple files. Metadata and data are stored on separate servers; to fully benefit from this design, Linux tools need to be replaced by Lustre-specific tools, such as lfs find instead of find and star instead of tar, or be given extra command line options, such as ls --color=never instead of ls to avoid expensive calls to stat(). Lustre doesn’t scale well in jobs that handle a large number of small files.
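
On a Lustre file system (not the case on the Ant cluster, which uses BeeGFS), striping can be inspected and tuned per file or per directory; a minimal sketch, with placeholder paths:

# show capacity and usage of each metadata and object storage target
lfs df -h
# show the stripe layout of an existing file
lfs getstripe /path/to/file
# stripe new files created in this directory over 4 OSTs with a 4 MB stripe size
lfs setstripe -c 4 -S 4M /path/to/directory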

GPFS

GPFS is a clustered file system that combines multiple drives and exposes them to the operating system as a single virtual drive. Large files are split into 1 MB chunks that are stored on different drives in parallel to maximize read and write bandwidth (striping). GPFS scales well in jobs that handle a large number of small files. Metadata and data are stored on separate servers. GPFS data can be accessed through NFS.

CephFS

The Ceph File System (CephFS) combines multiple drives and exposes them to the operating system as a single virtual drive. It is primarily designed for the storage of large files and doesn’t offer the same parallel I/O efficiency as Lustre or BeeGFS, but comes with a built-in data replication mechanism.

CernVM-FS

The CernVM File System (CernVM-FS) is a distributed file system. It uses a FUSE module and delivers files over the internet. It is used by EESSI to dynamically fetch software when users invoke module load.
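
On a node with the CernVM-FS client installed (on Ant, the head node), a repository is mounted lazily on first access and its cache status can be queried:

# the first access triggers the FUSE mount and fetches the file catalog over HTTP
ls /cvmfs/software.eessi.io
# show cache usage and connection statistics for the repository
cvmfs_config stat software.eessi.io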

ZFS

The Zettabyte File System (ZFS) combines multiple drives and exposes them to the operating system as a single virtual drive. Data written to disk is compressed on-the-fly to speed up read and write operations on low-entropy data. Data can also be deduplicated (when this feature is enabled), which requires extra CPU power and large amounts of RAM and can negatively impact Slurm jobs that allocate the entire RAM of the node.
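
Whether compression and deduplication are enabled, and the compression ratio achieved so far, can be queried with the zfs utility (on Ant this would be run on the head node, where pool_data/home is a ZFS dataset; the command may require administrator privileges):

# show compression and deduplication settings and the achieved compression ratio
zfs get compression,compressratio,dedup pool_data/home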

tmpfs

The temporary file system (tmpfs) stores data directly in the RAM, allowing for fast read and write operations, but is limited by the amount of available RAM. Data is local to the node.

All data is lost during a reboot or when power goes out.

Linux shared memory

Clusters may mount a tmpfs shared memory folder under /dev/shm. By default, its maximum size is half the available RAM, but memory is only allocated when files are written to it (dynamic allocation). Therefore, if 90% of the RAM is used by running processes, the space effectively available in /dev/shm shrinks to 10% of the total RAM, unless /dev/shm already holds more data than that.
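
Both effects can be observed directly: df reports the nominal size of the mount, while free shows how much RAM is actually consumed by the files stored in it:

# nominal size and current usage of the shared memory mount
df -h /dev/shm
# RAM consumed by tmpfs files appears in the "shared" and "buff/cache" columns
free -h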

If shared memory is full and more files are written to it, or the RAM is full and more memory is allocated, the operating system will start using swap memory, which is several orders of magnitude slower than RAM. If no swap file exists, or swap memory is exhausted, processes will start experiencing memory errors. When all physical memory is completely exhausted, the Linux kernel invokes the out-of-memory (OOM) killer to stop running processes based on their memory footprint and privilege (root vs. user).

Clusters can be configured to automatically clear files in shared memory when a user logs out (variable RemoveIPC in /etc/systemd/logind.conf, by default yes), making this mount point sensitive to network perturbations that terminate the login session. In addition, shared memory can be configured not to use swap, in which case write errors occur when the partition is full, and to enforce user quotas.
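
The configured behavior can be checked on a given node (the default applies if the line is commented out):

# check whether shared memory segments and files are removed at logout
grep -i RemoveIPC /etc/systemd/logind.conf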

Linux PAM memory

Clusters may mount a tmpfs shared memory folder under ${XDG_RUNTIME_DIR} upon user login and unmount it upon user logout (this environment variable typically expands to /run/user/${UID}). Its maximum storage is fixed, but it will only allocate memory when files are written to it (dynamic allocation).

Even though SLURM jobs have the same user id as the user who submits the job, SLURM cannot log the user in, and thus PAM systemd won’t mount this folder when the job runs!
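
This can be verified by comparing an interactive login with a job step; a sketch assuming Slurm is the batch system and an allocation is available:

# on an interactive SSH login, PAM systemd has mounted the per-user tmpfs
ls -ld /run/user/$(id -u)
# inside a Slurm job step, the same directory is typically missing
srun ls -ld /run/user/$(id -u)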

Benchmarking file systems

To benchmark read/write rates with dd:

# create a large file in RAM filled with random data
mkdir -p /dev/shm/${USER}
dd if=/dev/urandom of=/dev/shm/${USER}/urandom.bin bs=1G count=10 status=none
# measure write rate
dd if=/dev/shm/${USER}/urandom.bin of=${HOSTNAME}-urandom.bin bs=1G count=10
# measure read rate
dd if=${HOSTNAME}-urandom.bin of=/dev/null bs=1G count=10
# free the RAM
rm /dev/shm/${USER}/urandom.bin
# free the hard drive
rm *.bin

We use a source file in RAM to avoid skewing the results. Reading directly from /dev/urandom would introduce too much overhead from the random number generator, and reading from /dev/zero could inflate write rates if the file system uses on-the-fly data compression (e.g. ZFS).

The write operation can be executed simultaneously on multiple nodes (hence the ${HOSTNAME} in the output file name). The per-node write rate may drop slightly, but the aggregate write rate is the sum of the per-node rates. The situation is different on serial file systems, where the write rate decreases significantly when writing concurrently, usually by a factor of \(1/n\).
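
A sketch of such a concurrent measurement, assuming Slurm is the batch system and an allocation of four nodes is available (adjust the node count and file sizes to the cluster at hand):

# create the source file in RAM on every node, then write to the parallel file system concurrently
srun --nodes=4 --ntasks-per-node=1 bash -c '
    mkdir -p /dev/shm/${USER}
    dd if=/dev/urandom of=/dev/shm/${USER}/urandom.bin bs=1G count=10 status=none
    dd if=/dev/shm/${USER}/urandom.bin of=${HOSTNAME}-urandom.bin bs=1G count=10
    rm /dev/shm/${USER}/urandom.bin'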

To benchmark read access from a local network:

scp ${USER}@ant:urandom.bin /dev/null

This command can only measure the read rate once. Subsequent calls may report inflated read rates, because Linux on the file server keeps the transferred file in otherwise unused RAM (the buff/cache figure in the output of top), so later transfers are served from the cache instead of the drive.
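
The effect can be observed in the buff/cache figure reported by free; repeating a cold-cache measurement requires dropping the page cache on the file server, which is an administrator operation:

# watch the page cache grow after the first transfer
free -h
# drop the page cache to repeat a cold-cache measurement (requires root on the file server)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches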

Archiving

Research data should follow the FAIR principles [Wilkinson et al., 2016]. The University of Stuttgart offers the DaRUS platform to archive research datasets. Data can be uploaded via the web browser or an API (user guide). The file size limit is 100 GB (BigData user guide).
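
As a rough sketch of an upload via the API (DaRUS is built on Dataverse; the server URL, API token and dataset DOI below are placeholders, see the linked user guide for the authoritative workflow):

# add a file to an existing dataset via the Dataverse native API
curl -H "X-Dataverse-key: ${API_TOKEN}" \
     -F "file=@results.tar.gz" \
     "https://darus.uni-stuttgart.de/api/datasets/:persistentId/add?persistentId=${DATASET_DOI}"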

Larger files can be stored on magnetic tape drives, whose cartridges hold 18 TB each and are suitable for long-term archival.