Cluster Nodes

The GPU cluster consists of two partitions, study and gpu, with 15 nodes in total. servant-[1:5] and worker-[1:2] belong to the study partition, and all worker-* nodes belong to the gpu partition. The configuration is listed in the following table.

Hostname       Partition  CPU-Cores  RAM     GPU              VRAM per GPU  TMP
servant-1      study      2 x 20     216 GB  2 x Tesla P100   16 GB         800 GB
servant-2      study      2 x 20     216 GB  2 x Tesla P100   16 GB         800 GB
servant-3      study      2 x 20     108 GB  2 x GTX 1080 Ti  11 GB         400 GB
servant-4      study      2 x 20     108 GB  2 x GTX 1080 Ti  11 GB         400 GB
servant-5      study      2 x 20     108 GB  2 x GTX 1080 Ti  11 GB         400 GB
worker-[1:2]   study      2 x 60     443 GB  4 x NVIDIA A40   46 GB         1000 GB
worker-[1:11]  gpu        2 x 60     443 GB  4 x NVIDIA A40   46 GB         1000 GB
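
As a minimal sketch of how to target one of these partitions: the job name, resource amounts, and the program below are placeholders, and whether a GPU type has to be named in --gres depends on the cluster's actual GRES configuration.

#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --partition=gpu           # or: study
#SBATCH --gres=gpu:1              # request one GPU on the node
#SBATCH --cpus-per-task=8         # adjust to your workload
#SBATCH --mem=32G                 # adjust to your workload

srun python train.py              # placeholder for your own program

Submit the script with

> sbatch my_job.sbatch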

Current configuration

We may reconfigure nodes temporarily from time to time. You can see the currently running configuration using the following commands:

> scontrol show partitions

> scontrol show nodes
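
For example, to inspect a single node (worker-1 here is just an example) or to get a compact per-node overview:

> scontrol show node worker-1

> sinfo -N -O "NodeList,Partition,CPUs,Memory,Gres"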

Maintenance

We have to shut down or reboot some nodes from time to time. Typically, we drain only a few compute nodes at a time so that the cluster can still be used without interruption. In that case, we block those nodes and create a reservation. If you wonder why some nodes are idle (see sinfo -N) but your job is not scheduled onto them, you can see all reservations using

> scontrol show reservations
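
If a reservation was created for your group and your account is allowed to use it, you can submit into it by name; the reservation and script names below are only placeholders.

> sbatch --reservation=maintenance my_job.sbatch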

Storage on compute nodes

All compute nodes have non-persistent local storage. Using --tmp in your sbatch script creates an empty folder at /local/slurmjobs/$SLURM_JOB_ID and sets a new environment variable, $SLURM_JOB_TMP, to exactly this path. This folder can be used as temporary storage for the running job, e.g. for loaded models, checkpoints, datasets, etc.
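
A sketch of an sbatch script using this local scratch space: the partition, the requested size, all paths, and the program are placeholders, and it assumes the program writes its checkpoints to $SLURM_JOB_TMP/checkpoints.

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --tmp=100G                          # request at least 100 GB of local scratch

# $SLURM_JOB_TMP points to /local/slurmjobs/$SLURM_JOB_ID (see above)
echo "Local scratch: $SLURM_JOB_TMP"

# Stage the dataset into node-local storage (placeholder source path)
cp -r /home/$USER/datasets/my_dataset "$SLURM_JOB_TMP"/

# Run the job against the local copy (placeholder program)
srun python train.py --data "$SLURM_JOB_TMP"/my_dataset

# The scratch folder is non-persistent: copy results back before the job ends
cp -r "$SLURM_JOB_TMP"/checkpoints /home/$USER/results_$SLURM_JOB_ID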