Slurm

Slurm Basics

On the cluster, all computation jobs are scheduled by Slurm, an "open source, fault-tolerant, and highly scalable cluster management and job scheduling system" (see the Slurm Quickstart guide).

Slurm schedules a job and allocates the requested resources on one of the compute nodes. The cluster provides three types of resources: CPUs, GPUs, and memory. In the study partition, the GPUs come in two types: Tesla (P100) and GTX (1080Ti). You may request a single GPU via the option --gres=gpu:1; the Slurm scheduler then reserves one GPU exclusively for your job. The job is started on a compute node with enough free resources. CPU cores are requested with -c or --cpus-per-task, e.g. --cpus-per-task=6. For further information, please look into the man pages of srun (man srun) and sbatch (man sbatch). Reading the Slurm documentation is also highly recommended.
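For example, a single GPU and six CPU cores can be requested interactively with srun, or in batch mode by submitting a job script with sbatch (the script name below is just a placeholder):

srun --gres=gpu:1 --cpus-per-task=6 nvidia-smi
sbatch my_job_script.sh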

The commands sinfo and squeue provide detailed information about the cluster’s state and the jobs currently running.
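For example, you can list the nodes of a partition and your own jobs (the partition name here is only an illustration):

sinfo -p gpu
squeue -u $USER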

Fair Use Policy

There are no limitations such as a maximum number of parallel jobs per user. Instead, we rely on fair use by all users, meaning one user should not occupy all nodes or block unused resources. For example, this happens when one job requests all CPU cores of a node and no CPU cores are left for the GPUs (see next section).

CPU management and CPU-only jobs

Although the cluster is called a GPU cluster, it is also appropriate for CPU-only computing, as it provides not only 10 + 40 GPUs but also 200 + 600 CPU cores. Effective utilization of the CPU resources can be tricky, so you should be familiar with CPU management. If you start a job using many CPU cores, make sure that your jobs do not use all the CPUs of a node. There should be at least 2 spare CPU cores per GPU on the node, e.g. use a maximum of 36 CPU cores on the study partition.
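As a sketch (the script and job names are placeholders, and the memory value is only an assumption), a CPU-only job on the study partition that leaves the GPU-reserved cores free could look like this:

#!/bin/bash
#SBATCH --job-name=my-cpu-job
#SBATCH --partition=study
#SBATCH --cpus-per-task=36
#SBATCH --mem=50G

python my_cpu_only_script.py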

Mail Notification

You can enable mail notifications for your Slurm jobs. Depending on the setup, you will get an email to your TechFak mail account when your job starts, fails, or ends. You only need to add the option #SBATCH --mail-type=BEGIN,FAIL,END to your sbatch script. In case you are not using your TechFak mail address, you can forward those emails to your @uni-bielefeld.de address using our interface (see mail forwarding).
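For example, with juser replaced by your own login, the relevant lines in the sbatch script are:

#SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --mail-user=juser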

SBATCH Options

The following table shows some further options for an sbatch script. You can find a full description on the man page (man sbatch).

Option | Default Value | Values, Example | Description
--nodes=<N> | 1 | int | Number of compute nodes
--ntasks=<N> | 1 | int | Number of tasks to run
--cpus-per-task=<N> | 1 | int | Number of CPU cores per task
--cpus-per-gpu=<N> | 2 | int | Number of CPU cores per GPU
--ntasks-per-node=<N> | 1 | int | Number of tasks per node
--ntasks-per-core=<N> | 1 | int | Number of tasks per CPU core
--mem=<mem> | 25000 | 25G, 100G | Memory per node (plain numbers in MB)
--gres=gpu:<type>:<N> | - | gtx:1, tesla:1, a40:2 | Request nodes with GPUs
--time=<time> | 1:00:00 | 1-00:00:00 | Walltime limit for the job
--partition=<name> | gpu | study | Partition to run the job
--job-name=<name> | job script's name | my-slurm-job | Name of the job
--output=<path> | slurm-%j.out | my-slurm-job.out | Standard output file
--error=<path> | slurm-%j.err | my-slurm-job.err | Standard error file
--mail-user=<mail> | - | juser | Always your login
--mail-type=<mode> | - | BEGIN, END, FAIL, REQUEUE | Event types for notifications
--tmp=<MB> | 0 | 10, 100, 1G | Size in MB of temporary storage on the node, located in /local/slurmjobs/$SLURM_JOB_ID
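As a sketch combining several of these options (all names, values, and paths are placeholders), a complete sbatch header might look like this:

#!/bin/bash
#SBATCH --job-name=my-slurm-job
#SBATCH --partition=study
#SBATCH --gres=gpu:tesla:1
#SBATCH --cpus-per-task=6
#SBATCH --mem=25G
#SBATCH --time=1-00:00:00
#SBATCH --output=my-slurm-job.out
#SBATCH --error=my-slurm-job.err
#SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --mail-user=juser
#SBATCH --tmp=1G

python my_python_script.py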

Storage

By design, data has to be stored on network storage. Only this setup provides an easy way to make the files available on all nodes within the cluster, i.e. the login node as well as all compute nodes. Of course, loading and working with a lot of data takes longer than on a local PC, as all the data has to be transmitted over the network. Furthermore, if multiple users produce I/O, the latency increases. This leads to some lag on the login node, e.g. commands like cd or ls take longer.

You can reduce these delays for your job by using local storage on the compute nodes. If you set --tmp in your sbatch script, the directory /local/slurmjobs/$SLURM_JOB_ID is created, and the environment variable $SLURM_JOB_TMP points to it. This folder is on fast SSDs and exists only on the node where your job runs. After your job has finished, the folder is deleted.

You can use this directory for all temporary files your job needs, e.g. if you download a model, unpack your training data, or save intermediate results. One example is given here. Of course, you need to copy those files to your home directory or somewhere else if you want to keep them. You can simply add some commands at the end of your sbatch script:

#SBATCH --tmp=1G

python my_python_script.py

# copy the results from the node-local scratch directory back to your home directory
mkdir /homes/juser/experiments/result_${SLURM_JOB_ID}
cp -R ${SLURM_JOB_TMP}/result /homes/juser/experiments/result_${SLURM_JOB_ID}

Within your Python script, you can access this environment variable via:

import os

tmpdir = os.environ['SLURM_JOB_TMP']  # node-local scratch directory for this job

Monitoring a running job

Besides squeue, you can get more information about your job using scontrol show job JobID -d. It is also possible to get the output of, e.g., nvidia-smi via an srun command like srun --jobid=JobID nvidia-smi. This starts the program on the same node and shows the utilization of the GPUs you reserved for your job.
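For example, with 123456 standing in for your actual job ID:

scontrol show job 123456 -d
srun --jobid=123456 nvidia-smi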