.. _Submitting Jobs:

Submitting Jobs
===============

This section explains how to write and submit a job script using the slurm
scheduler. This is perhaps the most important part of the documentation:
whether you understand all of the theory or not, you can always jump here and
start running things immediately. The `SLURM Quick Start User Guide
<https://slurm.schedmd.com/quickstart.html>`__ is a good place to get familiar
with the basic commands for starting, stopping and monitoring jobs. Further
information can be found in the `documentation
<https://slurm.schedmd.com/documentation.html>`_.

Job script
----------

In order to run a job on the cluster, you will need to write a submission
script. This script tells the cluster what kind of environment you need, what
modules must be loaded, and of course, which file needs to be run. An important
point to stress here is that, whether you loaded modules on the head node or
not (the node you ssh into first), all modules must be added to your submit
script. This is because the cluster runs your job on a completely different
node, which has no memory of the modules you loaded manually when you first
accessed the cluster. SLURM jobs inherit the user name (``${USER}``) and user
id (``${UID}``) of the user who submits the job.

A job script has the following general structure:

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH --time=00:10:00
   #SBATCH --output=%j.stdout
   #SBATCH --error=%j.stderr
   module load spack/default gcc/12.3.0 cuda/12.3.0 openmpi/4.1.6 \
       fftw/3.3.10 boost/1.83.0 python/3.12.1
   source espresso-4.3/venv/bin/activate
   srun --cpu-bind=cores python3 espresso-4.3/testsuite/python/particle.py
   deactivate
   module purge

Let's break it down line by line:

* L1: shebang, which is needed to select the shell interpreter
* L2-4: slurm batch options; these can also be provided on the command line
* L5-6: load modules; you can use Spack, EasyBuild, EESSI, or custom modules
* L7: enter a Python environment, e.g. venv, virtualenv or conda; only relevant to Python users
* L8: slurm launcher (depends on the cluster)
* L9: leave the Python environment
* L10: clear modules

This job script is specific to Ant. On other clusters, you will need to adapt
the launcher and the ``module load`` command. Use ``module spider`` to find out
how package names are spelled (lowercase, titlecase, with extra suffixes) and
which versions are available. Often, a package version will only be available
for a specific C compiler or CUDA release; modulefiles typically do a good job
of informing you of incompatible package versions. Some clusters require the
``srun`` launcher, while others allow you to use the ``mpirun`` launcher. You
can find examples of slurm job scripts for all clusters the ICP has access to
in this user guide.

Submit jobs
-----------

Submit jobs with `sbatch <https://slurm.schedmd.com/sbatch.html>`__:

.. code-block:: bash

   sbatch --job-name="test" --nodes=1 --ntasks=4 --mem-per-cpu=2GB job.sh

All command line options override the corresponding ``#SBATCH`` variables in
the job script. A "task" is usually understood as a CPU core. There are many
options (`complete list <https://slurm.schedmd.com/sbatch.html>`__), but here
are the most important ones (a combined example follows the list):

* ``--time=<time>``: wall clock time, typically provided as colon-separated
  integers, e.g. "minutes", "minutes:seconds", "hours:minutes:seconds"
* ``--output=<filename>``: output file for stdout, by default "slurm-%j.out",
  where the "%j" is replaced by the job ID
* ``--error=<filename>``: output file for stderr, by default "slurm-%j.out",
  where the "%j" is replaced by the job ID
* ``--job-name=<name>``: job name; long names are allowed, but most slurm
  tools will only show the first 8 characters
* ``--chdir="${HOME}/hydrogels"``: set the working directory of the job script
* ``--gres=<name>[[:type]:count]``: request generic resources, e.g.
  ``--gres=gpu:l4:2`` to request 2x NVIDIA L4
* ``--licenses=<license>[@db][:count][,license[@db][:count]...]``:
  comma-separated list of licenses for commercial software, e.g.
  ``--licenses=nastran@slurmdb:12,matlab`` to request 12 NASTRAN licenses from
  slurmdb and 1 MATLAB license
* ``--partition=<name>``: name of the partition (only on clusters that use
  partitions)
* ``--exclude=<name>[,name...]``: prevent the job from landing on a
  comma-separated list of nodes
* ``--exclusive``: prevent any other job from landing on the node(s) on which
  this job is running (only relevant for benchmark jobs!)
* ``--test-only``: dry run: gives an estimate of the job start date (but
  doesn't schedule it!)
* amount of RAM: only one of the following:

  * ``--mem=<size>[K|M|G|T]``: total job memory, default is in megabytes (``M``)
  * ``--mem-per-cpu=<size>[K|M|G|T]``: job memory per CPU, default is in megabytes (``M``)
  * ``--mem-per-gpu=<size>[K|M|G|T]``: job memory per GPU, default is in megabytes (``M``)

* number of tasks: any valid combination of the following:

  * ``--nodes=<count>``: number of nodes
  * ``--ntasks=<count>``: number of tasks
  * ``--ntasks-per-core=<count>``: number of tasks per core, usually this should be 1
  * ``--ntasks-per-node=<count>``: number of tasks per node
  * ``--ntasks-per-gpu=<count>``: number of tasks per GPU

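As an illustration, several of these options can be combined in a single
submission; the job name, time limit, resource amounts and GPU count below are
placeholders rather than recommended values:

.. code-block:: bash

   # hypothetical submission: 1 node, 8 tasks, 2 GB per core, 1 NVIDIA L4, 6 h limit
   sbatch --job-name="hydrogel_sweep" --time=06:00:00 --nodes=1 --ntasks=8 \
       --mem-per-cpu=2GB --gres=gpu:l4:1 --output=%j.stdout --error=%j.stderr job.sh
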
Core pinning
~~~~~~~~~~~~

Some clusters have hyperthreading enabled, in which case you will need to check
their online documentation for how to set the MPI binding policy to use logical
core #0 in each physical core (i.e. skip logical core #1). Some clusters
provide the hwloc tool to help you troubleshoot the binding policy. For
example, on HPC Vega, one must use ``sbatch --hint=nomultithread ...`` and
``srun --cpu-bind=cores ...`` to bind per core; a batch-script sketch is shown
at the end of this subsection.

Starting an interactive job with:

.. code-block:: bash

   srun --partition=dev --time=0:05:00 --job-name=interactive --hint=nomultithread \
       --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB --pty /usr/bin/bash

and executing hwloc reporting:

.. code-block:: bash

   module load Boost/1.72.0-gompi-2020a hwloc
   mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings /bin/true

outputs:

.. code-block:: none

   rank 0 bound to socket 1[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..]
   rank 1 bound to socket 1[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../..]
   rank 2 bound to socket 1[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../..]
   rank 3 bound to socket 1[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../..]
   rank 4 bound to socket 1[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../..]
   rank 5 bound to socket 1[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../..]
   rank 6 bound to socket 1[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../..]
   rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../..]
   rank 8 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../..]
   rank 9 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../..]
   rank 10 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../..]
   rank 11 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../..]
   rank 12 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../..]
   rank 13 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../..]
   rank 14 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/..]
   rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB]

Each CPU has 64 physical cores and 2 hyperthreads per core, thus 128 logical
cores. The ASCII diagram separates physical cores with a forward slash
character. Each hyperthread is represented as a dot; bound hyperthreads are
represented as a "B". This output shows that the CPU uses 2 hyperthreads per
core, and that each MPI rank is bound to both logical cores of the same
physical core.

For more details, check these external resources:

* `Processor Affinity `__
* `pinning with psslurm `__: slides with a graphical representation of pinning
  from ``srun`` on a 2-socket mainboard with hyperthreading enabled
* `LLview pinning app `__: GUI to show exactly how the ``srun`` pinning options
  get mapped on the hardware (currently only implemented for hardware at Jülich
  and partners)

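To apply the same binding policy in a batch job rather than an interactive
session, the two Vega-specific flags can be combined as in the following
sketch; the resource amounts, time limit and executable name are placeholders:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --hint=nomultithread   # use only one logical core per physical core
   #SBATCH --ntasks=32
   #SBATCH --ntasks-per-node=32
   #SBATCH --mem-per-cpu=400MB
   #SBATCH --time=02:00:00

   # bind each MPI rank to a physical core; ./my_simulation is a placeholder
   srun --cpu-bind=cores ./my_simulation
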
Cancel jobs
-----------

Cancel jobs with `scancel <https://slurm.schedmd.com/scancel.html>`__:

.. code-block:: bash

   # cancel a specific job
   scancel 854836

   # cancel all my jobs
   scancel -u ${USER}

   # cancel all my jobs that are still pending
   scancel -t PENDING -u ${USER}

Monitor jobs
------------

List queued jobs with `squeue <https://slurm.schedmd.com/squeue.html>`__:

.. code-block:: none

   $ squeue --me
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   853911       ant hydrogel    jgrad PD       0:00      1 (Priority)
   853912       ant  solvent    jgrad  R    8:32:40      2 compute[08,12]
   853913       ant     test    jgrad  R      01:17      1 compute11

The status column (``ST``) can take various values (`complete list
<https://slurm.schedmd.com/squeue.html>`__), but here are the most common:

* ``PD``: pending
* ``R``: running
* ``S``: suspended

The nodelist/reason column (``NODELIST(REASON)``) states the node(s) where the
job is running, or, if it is not running, the reason why. There are many
reasons (`complete list <https://slurm.schedmd.com/squeue.html>`__), but here
are the most common:

* ``Resources``: the queue is full and the job is waiting on resources to become available
* ``Priority``: there are pending jobs with a higher priority
* ``Dependency``: the job depends on another job that hasn't completed yet
* ``Licenses``: the job is waiting for a license to become available

Check job status with `sacct <https://slurm.schedmd.com/sacct.html>`__:

.. code-block:: bash

   $ sacct -j 854579  # pending job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   854579        dft_water        ant       root         12    PENDING      0:0

   $ sacct -j 853980  # running job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853980       WL2057-0-+        ant       root          5    RUNNING      0:0
   853980.batch      batch                  root          5    RUNNING      0:0
   853980.0      lmp_lmpwl                  root          5    RUNNING      0:0

   $ sacct -j 853080  # completed job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853080       capillary+        ant       root          1  COMPLETED      0:0
   853080.batch      batch                  root          1  COMPLETED      0:0

   $ sacct -j 853180  # cancelled job
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   853180       2flowfiel+        ant       root          1 CANCELLED+      0:0

   $ sacct  # show most recent jobs
   JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------ ---------- ---------- ---------- ---------- ---------- --------
   854836          solvent        ant       root          4 CANCELLED+      0:0
   854837         mpi_test        ant       root          2     FAILED      2:0
   854837.0       mpi_test                  root          2  COMPLETED      0:0

Check job properties with `scontrol <https://slurm.schedmd.com/scontrol.html>`__:

.. code-block:: none

   $ scontrol show job 853977
   JobId=853977 JobName=WL2051-0-40
      UserId=rkajouri(1216) GroupId=rkajouri(1216) MCS_label=N/A
      Priority=372 Nice=0 Account=root QOS=normal
      JobState=RUNNING Reason=None Dependency=(null)
      RunTime=1-02:24:05 TimeLimit=2-00:00:00 TimeMin=N/A
      Partition=ant AllocNode:Sid=ant:515436
      NodeList=compute11
      NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
      WorkDir=/work/rkajouri/simulations/confinement/bulk_water

Cluster information
-------------------

List partitions and hardware information with `sinfo
<https://slurm.schedmd.com/sinfo.html>`__:

.. code-block:: bash

   $ sinfo  # node list
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   ant*         up 2-00:00:00      1  down* compute02
   ant*         up 2-00:00:00      2    mix compute[08,12]
   ant*         up 2-00:00:00     14  alloc compute[01,03-07,09-11,13-15,17-18]
   ant*         up 2-00:00:00      1   down compute16

   $ sinfo -s  # partition summary
   PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
   ant*         up 2-00:00:00        16/0/2/18 compute[01-18]

   $ sinfo -o "%20N %10c %10m %20G "  # node accelerators
   NODELIST             CPUS       MEMORY     GRES
   compute[01-18]       64         386000     gpu:l4:2

   $ sinfo -R  # show down nodes
   REASON               USER      TIMESTAMP           NODELIST
   Not responding       slurm     2024-07-21T07:37:47 compute02
   Node unexpectedly re slurm     2024-07-23T16:47:22 compute16

   $ sinfo -o "%10R %24N %20H %10U %30E "  # show down nodes with more details
   PARTITION  NODELIST                 TIMESTAMP            USER       REASON
   ant        compute02                2024-07-21T07:37:47  slurm(202) Not responding
   ant        compute16                2024-07-23T16:47:22  slurm(202) Node unexpectedly rebooted
   ant        compute[01,03-15,17-18]  Unknown              Unknown    none

The most common node states are:

* ``alloc``: node is fully used and cannot accept new jobs
* ``mix``: node is partially used and can accept small jobs
* ``idle``: node isn't used and is awaiting jobs
* ``plnd``: node isn't used, but a large job is planned and will be allocated
  as soon as other nodes become idle to satisfy its allocation requirements
* ``down``: node is down and awaiting maintenance

Show node properties with `scontrol <https://slurm.schedmd.com/scontrol.html>`__:

.. code-block:: none

   $ scontrol show node=compute11
   NodeName=compute11 Arch=x86_64 CoresPerSocket=32
      Gres=gpu:l4:2
      OS=Linux 5.14.0-362.24.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 22:31:43 UTC 2024
      RealMemory=386000 AllocMem=146880 FreeMem=376226 Sockets=2 Boards=1
      State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
      Partitions=ant

   $ scontrol show nodes
   NodeName=compute01 Arch=x86_64 CoresPerSocket=32
   [...]
   NodeName=compute18 Arch=x86_64 CoresPerSocket=32

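Finally, to quickly list the nodes that are currently free and could accept a
new job right away, a node-oriented query restricted to the ``idle`` state
should work:

.. code-block:: bash

   # show one line per node that is currently in the idle state
   sinfo --Node --states=idle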