The GPU cluster consists of two partitions, `study` and `gpu`, with 15 nodes in total. `servant-[1:5]` and `worker-[1:2]` belong to the `study` partition and all `worker-*` nodes to the `gpu` partition. The configuration is listed in the following table.
Hostname | Partition | CPU cores | RAM | GPU | VRAM per GPU | TMP |
---|---|---|---|---|---|---|
servant-1 | study | 2 x 20 | 216 GB | 2 x Tesla P100 | 16 GB | 800 GB |
servant-2 | study | 2 x 20 | 216 GB | 2 x Tesla P100 | 16 GB | 800 GB |
servant-3 | study | 2 x 20 | 108 GB | 2 x GTX 1080 Ti | 11 GB | 400 GB |
servant-4 | study | 2 x 20 | 108 GB | 2 x GTX 1080 Ti | 11 GB | 400 GB |
servant-5 | study | 2 x 20 | 108 GB | 2 x GTX 1080 Ti | 11 GB | 400 GB |
worker-[1:2] | study | 2 x 60 | 443 GB | 4 x NVIDIA A40 | 46 GB | 1000 GB |
worker-[1:11] | gpu | 2 x 60 | 443 GB | 4 x NVIDIA A40 | 46 GB | 1000 GB |
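For example, a job can be directed to one of the two partitions in its `sbatch` script. The sketch below is a minimal, hypothetical job header; the GRES name `gpu` and the requested resources are assumptions, so verify them against the `scontrol` output described below:

```bash
#!/bin/bash
#SBATCH --partition=gpu        # or: study
#SBATCH --gres=gpu:1           # GRES name "gpu" is an assumption; verify with "scontrol show nodes"
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Print the GPU(s) assigned to this job.
nvidia-smi
```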
We may temporarily reconfigure the nodes. You can see the running configuration using the following commands:
> scontrol show partitions
> scontrol show nodes
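You can also restrict the output to a single partition or node, for example (names taken from the table above):
> scontrol show partition gpu
> scontrol show node servant-1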
We have to shut down or reboot some nodes from time to time. Typically, we drain only a few compute nodes at a time, so the cluster can be used without interruption. In that case, we block those nodes and create a reservation.
If you wonder why some nodes are idle (see `sinfo -N`) and your job is not scheduled to one of them, you can see all reservations using `scontrol show reservations`.
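For a quick overview of node states and the reasons recorded for drained or down nodes, the following standard `sinfo` options can help (generic Slurm usage, not specific to this cluster):
> sinfo -N -l
> sinfo -R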
All compute nodes have non-persistent local storage. Using `--tmp` in your `sbatch` script creates an empty folder in `/local/slurmjobs/$SLURM_JOB_ID` as well as a new environment variable `$SLURM_JOB_TMP` containing exactly this path. This folder can be used as temporary storage for the running job, e.g. for loaded models, checkpoints, datasets, etc.
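A minimal sketch of an `sbatch` script that uses this local storage; the partition choice, requested size, dataset path, and training script are placeholders:

```bash
#!/bin/bash
#SBATCH --partition=study
#SBATCH --tmp=100G             # request at least 100 GB of local scratch on the node

# $SLURM_JOB_TMP points to /local/slurmjobs/$SLURM_JOB_ID and is removed after the job ends.
echo "Local scratch: $SLURM_JOB_TMP"

# Hypothetical workflow: stage the data locally, run the job, copy the results back.
cp -r /path/to/dataset "$SLURM_JOB_TMP/dataset"
python train.py --data "$SLURM_JOB_TMP/dataset" --checkpoint-dir "$SLURM_JOB_TMP/checkpoints"
cp -r "$SLURM_JOB_TMP/checkpoints" "$HOME/results-$SLURM_JOB_ID"
```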