.. _Submitting Jobs:

Submitting Jobs
===============

This section explains how to write and submit a job script using the slurm
scheduler. This is perhaps the most important part of the documentation:
whether you understand all of the theory or not, you can always jump here and
start running things immediately. The `SLURM Quick Start User Guide
<https://slurm.schedmd.com/quickstart.html>`__ is a good place to get familiar
with the basic commands for starting, stopping and monitoring jobs. Further
information can be found in the `documentation
<https://slurm.schedmd.com/documentation.html>`_.

Job script
----------

In order to run a job on the cluster, you will need to write a submission
script. This script tells the cluster what kind of environment you need, what
modules must be loaded, and of course, which file needs to be run. An important
point to stress here is that, whether you loaded modules on the head node or
not (the node you ssh into first), all modules must be added to your submit
script. This is because the cluster runs your job on a completely different
node, which has no memory of the modules you loaded manually when you first
accessed the cluster. SLURM jobs inherit the user name (``${USER}``) and user
id (``${UID}``) of the user who submits the job.

A job script has the following general structure:

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH --time=00:10:00
   #SBATCH --output=%j.stdout
   #SBATCH --error=%j.stderr
   module load spack/default gcc/12.3.0 cuda/12.3.0 openmpi/4.1.6 \
       fftw/3.3.10 boost/1.83.0 python/3.12.1
   source espresso-4.3/venv/bin/activate
   srun --cpu-bind=cores python3 espresso-4.3/testsuite/python/particle.py
   deactivate
   module purge

Let's break it down line by line:

* L1: shebang, which is needed to select the shell interpreter
* L2-4: slurm batch options; these can also be provided on the command line
* L5-6: load modules; you can use Spack, EasyBuild, EESSI, or custom modules
* L7: enter a Python environment, e.g. venv, virtualenv or conda; only relevant to Python users
* L8: slurm launcher (depends on the cluster)
* L9: leave the Python environment
* L10: clear modules

This job script is specific to Ant. On other clusters, you will need to adapt
the launcher and the ``module load`` command. Use ``module spider`` to find out
how package names are spelled (lowercase, titlecase, with extra suffixes) and
which versions are available. Often, a package version will only be available
for a specific C compiler or CUDA release; modulefiles typically do a good job
of informing you of incompatible package versions. Some clusters require the
``srun`` launcher, while others allow you to use the ``mpirun`` launcher. You
can find examples of slurm job scripts for all clusters the ICP has access to
in this user guide.

Submit jobs
-----------

Submit jobs with `sbatch <https://slurm.schedmd.com/sbatch.html>`__:

.. code-block:: bash

   sbatch --job-name="test" --nodes=1 --ntasks=4 --mem-per-cpu=2GB job.sh

All command line options override the corresponding ``#SBATCH`` variables in
the job script. A "task" is usually understood as a CPU core. There are many
options (`complete list <https://slurm.schedmd.com/sbatch.html>`__), but here
are the most important ones (a combined example follows the list):

* ``--time=<time>``: wall clock time, typically provided as colon-separated
  integers, e.g. "minutes", "minutes:seconds", "hours:minutes:seconds"
* ``--output=<filename>``: output file for stdout, by default "slurm-%j.out",
  where the "%j" is replaced by the job ID
* ``--error=<filename>``: output file for stderr, by default "slurm-%j.out",
  where the "%j" is replaced by the job ID
* ``--job-name=<name>``: job name; long names are allowed, but most slurm
  tools will only show the first 8 characters
* ``--chdir="${HOME}/hydrogels"``: set the working directory of the job script
* ``--gres=<name>[[:type]:count]``: request generic resources, e.g.
  ``--gres=gpu:l4:2`` to request 2x NVIDIA L4
* ``--licenses=<license>[@db][:count][,license[@db][:count]...]``:
  comma-separated list of licenses for commercial software, e.g.
  ``--licenses=nastran@slurmdb:12,matlab`` to request 12 NASTRAN licenses from
  slurmdb and 1 MATLAB license
* ``--partition=<name>``: name of the partition (only on clusters that use
  partitions)
* ``--exclude=<name>[,name...]``: prevent the job from landing on a
  comma-separated list of nodes
* ``--exclusive``: prevent any other job from landing on the node(s) on which
  this job is running (only relevant for benchmark jobs!)
* ``--test-only``: dry run: gives an estimate of the job start date (but
  doesn't schedule it!)
* amount of RAM: only one of the following:

  * ``--mem=<size>[K|M|G|T]``: total job memory, default is in megabytes (``M``)
  * ``--mem-per-cpu=<size>[K|M|G|T]``: job memory per CPU, default is in megabytes (``M``)
  * ``--mem-per-gpu=<size>[K|M|G|T]``: job memory per GPU, default is in megabytes (``M``)

* number of tasks: any valid combination of the following:

  * ``--nodes=<count>``: number of nodes
  * ``--ntasks=<count>``: number of tasks
  * ``--ntasks-per-core=<count>``: number of tasks per core, usually this should be 1
  * ``--ntasks-per-node=<count>``: number of tasks per node
  * ``--ntasks-per-gpu=<count>``: number of tasks per GPU

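As an illustration, several of these options can be combined in a single
submission; the job name, time limit, resource amounts and GPU count below are
placeholders rather than recommended values:

.. code-block:: bash

   # hypothetical submission: 1 node, 8 tasks, 2 GB per core, 1 NVIDIA L4, 6 h limit
   sbatch --job-name="hydrogel_sweep" --time=06:00:00 --nodes=1 --ntasks=8 \
       --mem-per-cpu=2GB --gres=gpu:l4:1 --output=%j.stdout --error=%j.stderr job.sh
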
Core pinning
~~~~~~~~~~~~

Some clusters have hyperthreading enabled, in which case you will need to check
their online documentation for how to set the MPI binding policy to use logical
core #0 in each physical core (i.e. skip logical core #1). Some clusters
provide the hwloc tool to help you troubleshoot the binding policy. For
example, on HPC Vega, one must use ``sbatch --hint=nomultithread ...`` and
``srun --cpu-bind=cores ...`` to bind per core; a batch-script sketch is shown
at the end of this subsection.

Starting an interactive job with:

.. code-block:: bash

   srun --partition=dev --time=0:05:00 --job-name=interactive --hint=nomultithread \
       --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB --pty /usr/bin/bash

and executing hwloc reporting:

.. code-block:: bash

   module load Boost/1.72.0-gompi-2020a hwloc
   mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings /bin/true

outputs:

.. code-block:: none

   rank 0 bound to socket 1[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..]
   rank 1 bound to socket 1[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../..]
   rank 2 bound to socket 1[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../..]
   rank 3 bound to socket 1[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../..]
   rank 4 bound to socket 1[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../..]
   rank 5 bound to socket 1[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../..]
   rank 6 bound to socket 1[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../..]
   rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../..]
   rank 8 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../..]
   rank 9 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../..]
   rank 10 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../..]
   rank 11 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../..]
   rank 12 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../..]
   rank 13 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../..]
   rank 14 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/..]
   rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB]

Each CPU has 64 physical cores and 2 hyperthreads per core, thus 128 logical
cores. The ASCII diagram separates physical cores with a forward slash
character. Each hyperthread is represented as a dot; bound hyperthreads are
represented as a "B". This output shows that the CPU uses 2 hyperthreads per
core, and that each MPI rank is bound to both logical cores of the same
physical core.

For more details, check these external resources:

* `Processor Affinity `__
* `pinning with psslurm `__: slides with a graphical representation of pinning
  from ``srun`` on a 2-socket mainboard with hyperthreading enabled
* `LLview pinning app `__: GUI to show exactly how the ``srun`` pinning options
  get mapped on the hardware (currently only implemented for hardware at Jülich
  and partners)

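To apply the same binding policy in a batch job rather than an interactive
session, the two Vega-specific flags can be combined as in the following
sketch; the resource amounts, time limit and executable name are placeholders:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --hint=nomultithread   # use only one logical core per physical core
   #SBATCH --ntasks=32
   #SBATCH --ntasks-per-node=32
   #SBATCH --mem-per-cpu=400MB
   #SBATCH --time=02:00:00

   # bind each MPI rank to a physical core; ./my_simulation is a placeholder
   srun --cpu-bind=cores ./my_simulation
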
Cancel jobs
-----------

Cancel jobs with `scancel <https://slurm.schedmd.com/scancel.html>`__:

.. code-block:: bash

   # cancel a specific job
   scancel 854836

   # cancel all my jobs
   scancel -u ${USER}

   # cancel all my jobs that are still pending
   scancel -t PENDING -u ${USER}

Monitor jobs
------------

List queued jobs with `squeue <https://slurm.schedmd.com/squeue.html>`__:

.. code-block:: none

   $ squeue --me
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   853911       ant hydrogel    jgrad PD       0:00      1 (Priority)
   853912       ant  solvent    jgrad  R    8:32:40      2 compute[08,12]
   853913       ant     test    jgrad  R      01:17      1 compute11

The status column (``ST``) can take various values (`complete list
<https://slurm.schedmd.com/squeue.html>`__), but here are the most common:

* ``PD``: pending
* ``R``: running
* ``S``: suspended

The nodelist/reason column (``NODELIST(REASON)``) states the node(s) where the
job is running, or, if it is not running, the reason why. There are many
reasons (`complete list <https://slurm.schedmd.com/squeue.html>`__), but here
are the most common:

* ``Resources``: the queue is full and the job is waiting on resources to become available
* ``Priority``: there are pending jobs with a higher priority
* ``Dependency``: the job depends on another job that hasn't completed yet
* ``Licenses``: the job is waiting for a license to become available

Check job status with `sacct <https://slurm.schedmd.com/sacct.html>`__:

.. code-block:: bash

   $ sacct -j 854579  # pending job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   854579        dft_water        ant       root         12    PENDING      0:0

   $ sacct -j 853980  # running job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853980       WL2057-0-+        ant       root          5    RUNNING      0:0
   853980.batch      batch                  root          5    RUNNING      0:0
   853980.0      lmp_lmpwl                  root          5    RUNNING      0:0

   $ sacct -j 853080  # completed job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853080       capillary+        ant       root          1  COMPLETED      0:0
   853080.batch      batch                  root          1  COMPLETED      0:0

   $ sacct -j 853180  # cancelled job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853180       2flowfiel+        ant       root          1 CANCELLED+      0:0

   $ sacct  # show most recent jobs
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   854836          solvent        ant       root          4 CANCELLED+      0:0
   854837         mpi_test        ant       root          2     FAILED      2:0
   854837.0       mpi_test                  root          2  COMPLETED      0:0

Check job properties with `scontrol <https://slurm.schedmd.com/scontrol.html>`__:

.. code-block:: none

   $ scontrol show job 853977
   JobId=853977 JobName=WL2051-0-40
      UserId=rkajouri(1216) GroupId=rkajouri(1216) MCS_label=N/A
      Priority=372 Nice=0 Account=root QOS=normal
      JobState=RUNNING Reason=None Dependency=(null)
      RunTime=1-02:24:05 TimeLimit=2-00:00:00 TimeMin=N/A
      Partition=ant AllocNode:Sid=ant:515436
      NodeList=compute11
      NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
      WorkDir=/work/rkajouri/simulations/confinement/bulk_water

Cluster information
-------------------

List partitions and hardware information with `sinfo
<https://slurm.schedmd.com/sinfo.html>`__:

.. code-block:: bash

   $ sinfo  # node list
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   ant*         up 2-00:00:00      1  down* compute02
   ant*         up 2-00:00:00      2    mix compute[08,12]
   ant*         up 2-00:00:00     14  alloc compute[01,03-07,09-11,13-15,17-18]
   ant*         up 2-00:00:00      1   down compute16

   $ sinfo -s  # partition summary
   PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
   ant*         up 2-00:00:00        16/0/2/18 compute[01-18]

   $ sinfo -o "%20N %10c %10m %20G "  # node accelerators
   NODELIST             CPUS       MEMORY     GRES
   compute[01-18]       64         386000     gpu:l4:2

   $ sinfo -R  # show down nodes
   REASON               USER      TIMESTAMP           NODELIST
   Not responding       slurm     2024-07-21T07:37:47 compute02
   Node unexpectedly re slurm     2024-07-23T16:47:22 compute16

   $ sinfo -o "%10R %24N %20H %10U %30E "  # show down nodes with more details
   PARTITION  NODELIST                 TIMESTAMP            USER       REASON
   ant        compute02                2024-07-21T07:37:47  slurm(202) Not responding
   ant        compute16                2024-07-23T16:47:22  slurm(202) Node unexpectedly rebooted
   ant        compute[01,03-15,17-18]  Unknown              Unknown    none

The most common node states are:

* ``alloc``: node is fully used and cannot accept new jobs
* ``mix``: node is partially used and can accept small jobs
* ``idle``: node isn't used and is awaiting jobs
* ``plnd``: node isn't used, but a large job is planned and will be allocated
  as soon as other nodes become idle to satisfy its allocation requirements
* ``down``: node is down and awaiting maintenance

Show node properties with `scontrol <https://slurm.schedmd.com/scontrol.html>`__:

.. code-block:: none

   $ scontrol show node=compute11
   NodeName=compute11 Arch=x86_64 CoresPerSocket=32
      Gres=gpu:l4:2
      OS=Linux 5.14.0-362.24.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 22:31:43 UTC 2024
      RealMemory=386000 AllocMem=146880 FreeMem=376226 Sockets=2 Boards=1
      State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
      Partitions=ant

   $ scontrol show nodes
   NodeName=compute01 Arch=x86_64 CoresPerSocket=32
   [...]
   NodeName=compute18 Arch=x86_64 CoresPerSocket=32

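Finally, to quickly list the nodes that are currently free and could accept a
new job right away, a node-oriented query restricted to the ``idle`` state
should work:

.. code-block:: bash

   # show one line per node that is currently in the idle state
   sinfo --Node --states=idle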