The GPU cluster consists of three partitions (cpu, study, and gpu) with 23 nodes in total. The nodes assistant-[1:4] belong to the cpu partition, servant-[1:5] and worker-[1:3] to the study partition, and all worker-* nodes to the gpu partition. The configuration is listed in the following table.
| Hostname | Partition | CPU-Cores | RAM | GPU | VRAM per GPU | TMP |
|---|---|---|---|---|---|---|
| assistant-[1:4] | cpu | 64 | 443 GB | - | - | - |
| servant-1 | study | 2 x 20 | 216 GB | 2 x Tesla P100 | 16 GB | 800 GB |
| servant-2 | study | 2 x 20 | 216 GB | 2 x Tesla P100 | 16 GB | 800 GB |
| servant-4 | study | 2 x 20 | 108 GB | 2 x GTX 1080 Ti | 11 GB | 400 GB |
| servant-5 | study | 2 x 20 | 108 GB | 2 x GTX 1080 Ti | 11 GB | 400 GB |
| worker-[1:3] | study | 2 x 60 | 443 GB | 4 x NVIDIA A40 | 46 GB | 1000 GB |
| worker-[1:11] | gpu | 120 | 443 GB | 4 x NVIDIA A40 | 46 GB | 1000 GB |
| worker-[12:14] | gpu | 180 | 672 GB | 2 x NVIDIA H200 | 140 GB | 1000 GB |
| worker-15 | gpu | 180 | 672 GB | 3 x NVIDIA H200 | 140 GB | 1000 GB |
We may reconfigure the nodes temporarily. You can inspect the running configuration using the following commands:
> scontrol show partitions
> scontrol show nodes
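To compare the table above with the live configuration at a glance, the per-node view of `sinfo` can be narrowed to the relevant columns. The format string below is one possible choice, not a site-specific requirement:

```shell
# Show hostname, partition, CPUs, memory (MB), and GRES (GPUs) per node
sinfo -N -o "%N %P %c %m %G"

# Restrict the view to a single partition, e.g. gpu
sinfo -N -p gpu -o "%N %P %c %m %G"
```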
From time to time we have to shut down or reboot some nodes. Typically, we drain only a few compute nodes at a time, so that the cluster can be used without interruption. In that case, we block those nodes and create a reservation.
If you wonder why some nodes are idle (see sinfo -N) while your job is not scheduled to one of them, you can list all reservations using
> scontrol show reservations
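If you have been granted access to a reservation, you can submit jobs into it explicitly with the standard Slurm `--reservation` flag. The reservation name below is a placeholder, not a real reservation on this cluster:

```shell
# Submit a job into a specific reservation (name is hypothetical)
sbatch --reservation=maintenance_window my_job.sh

# The same option also works as a directive inside the job script:
#SBATCH --reservation=maintenance_window
```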
All compute nodes have non-persistent local storage. Requesting --tmp in your sbatch script creates an empty folder at /local/slurmjobs/$SLURM_JOB_ID and sets a new environment variable $SLURM_JOB_TMP to exactly this path. This folder can be used as temporary storage for the running job, e.g. for downloaded models, checkpoints, datasets, etc.
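A minimal sbatch script using the local temporary storage might look as follows. The job name, partition choice, and all data paths are placeholders, not real paths on this cluster:

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --partition=gpu           # one of: cpu, study, gpu
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --tmp=100G                # request 100 GB of local scratch space

# Slurm creates /local/slurmjobs/$SLURM_JOB_ID and points
# $SLURM_JOB_TMP at it for the lifetime of the job.
echo "Local scratch: $SLURM_JOB_TMP"

# Stage data into the fast local storage (source path is a placeholder)
cp /shared/datasets/my_dataset.tar "$SLURM_JOB_TMP/"
tar -xf "$SLURM_JOB_TMP/my_dataset.tar" -C "$SLURM_JOB_TMP"

# ... run training, write checkpoints to $SLURM_JOB_TMP ...

# Copy results back to persistent storage before the job ends,
# because $SLURM_JOB_TMP is removed afterwards (target is a placeholder).
cp "$SLURM_JOB_TMP"/checkpoint-*.pt /shared/results/
```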