First Steps

Access the cluster

The GPU cluster is only accessible from within the network of the Faculty of Technology. You can either connect from the CITEC WiFi or wired network inside the CITEC building, via the public proxy server shell.techfak.de, or via VPN.
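
If you connect through the proxy server, OpenSSH can hop via shell.techfak.de in a single command using its -J (jump host) option, assuming your TechFak login is valid on both hosts:

ssh -J TECHFAK-LOGIN@shell.techfak.de TECHFAK-LOGIN@login-1.gpu.cit-ec.net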

For performance reasons, the cluster has a separate file server that stores all home directories as well as data sets (see volumes). We do not create a home directory for every user of the faculty; instead, homes are created and permissions are granted on demand. The access permission is managed via UNIX groups. You can check your groups on one of our netboot computers, e.g. compute, via

> id
groups=10020(gpuv2)

If you see the group gpuv2 in the list, you have been granted access.

Inside the network, the login node is reachable via SSH:

ssh TECHFAK-LOGIN@login-1.gpu.cit-ec.net

The login password is the same as for all other servers, see passwords.

Note that this node is only meant for controlling your compute jobs and should not be used for any calculations.

Install Miniconda

We highly recommend installing Miniconda into your home directory. With Miniconda, you have full control over the Python version you are using and over the versions of all your installed Python packages.

You can download and install the latest Miniconda from the command line via:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

After installation, you need to initialize Miniconda using:

~/miniconda3/bin/conda init bash

If you do not want the base environment to be activated automatically on every login, you can disable this with:

conda config --set auto_activate_base false
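
After that, you can create isolated environments with exactly the Python version you need. A minimal sketch (the environment name myenv and the Python version are placeholders; pick whatever your project requires):

conda create -n myenv python=3.11
conda activate myenv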

Getting Started with Slurm

The login node (login-1.gpu.cit-ec.net) is not meant for any extensive computations. Instead, you use this node to copy data, start jobs, and control running or pending jobs. To run your first job, you need a script that defines the required resources for your job. An example script is:

#!/bin/bash
#SBATCH --partition=study
#SBATCH --gres=gpu:0
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

echo "Process $SLURM_PROCID of Job $SLURM_JOBID with the local id $SLURM_LOCALID runs on $(hostname)"

sleep 30
echo "done"

For example, you can use nano job.sbatch to save the commands into a file. To start a Slurm job, you just need to submit the script using sbatch:

sbatch job.sbatch

Using squeue -l, you can see all running and pending jobs. You will find a line like:

> squeue -l
     JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
     10791     study job.sbat    juser  RUNNING       0:25   1:00:00      1 servant-3

Congrats, your first Slurm job is running on the cluster. A new file named slurm-10791.out is automatically created with the console output of the running job. For this small script, it looks like this:

> cat slurm-10791.out 
Process 0 of Job 10791 with the local id 0 runs on servant-3.GPU.CIT-EC.NET
done

To start your own program, you only need to edit job.sbatch, e.g. change the sleep 30 command to python myscript.py. You can also add a conda activate line to your script.
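
A sketch of such a modified job.sbatch (the environment name myenv and the script myscript.py are placeholders; the source line makes conda available inside the non-interactive batch shell, assuming Miniconda was installed to ~/miniconda3 as above):

#!/bin/bash
#SBATCH --partition=study
#SBATCH --gres=gpu:0
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# Make conda available in the batch shell and activate the environment.
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

python myscript.py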

If you want to cancel a running or scheduled job, you can use scancel JOBID. The job will be killed and will not be listed by squeue.

Starting several jobs

If you need to start several jobs, it is handy to submit them as an array. Using arrays, you can easily limit the number of parallel jobs and start your program with different settings or setups. You only need to add --array=0-6%4 as an option in your sbatch script. This option schedules 7 jobs and limits the number of jobs running in parallel to 4.

#!/bin/bash
#SBATCH --partition=study
#SBATCH --gres=gpu:0
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --array=0-6%4


echo "Process $SLURM_PROCID of Job $SLURM_JOBID with the local id $SLURM_LOCALID runs on $(hostname)"

echo "Stating job with ID $SLURM_ARRAY_TASK_ID"
sleep 30
echo "done"

The environment variable SLURM_ARRAY_TASK_ID can be used as a parameter that selects the setup for your program, e.g. python myscript.py $SLURM_ARRAY_TASK_ID (a slightly larger sketch follows the listing below). You can monitor your jobs using squeue -l and will see at most four of them running at a time; all others are pending because no resources are free or because of the array limit:

         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
       10792_0     study job.sbat    juser  RUNNING       0:09   1:00:00      1 servant-3
       10792_1     study job.sbat    juser  RUNNING       0:09   1:00:00      1 servant-3
       10792_2     study job.sbat    juser  RUNNING       0:09   1:00:00      1 servant-3
       10792_3     study job.sbat    juser  RUNNING       0:09   1:00:00      1 servant-3
 10792_[4-6%4]     study job.sbat    juser  PENDING       0:00   1:00:00      1 (JobArrayTaskLimit)
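
A common pattern is to map the task ID onto a list of parameter values inside the script, so each array task runs a different configuration. A minimal bash sketch (the parameter, its values, and myscript.py are placeholders):

# One value per array task ID 0-6 (placeholder values):
LEARNING_RATES=(0.1 0.03 0.01 0.003 0.001 0.0003 0.0001)
python myscript.py --lr "${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}"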

Copy Files/Folders to the Cluster

For performance reasons, the cluster has a separate network storage; no further volumes from the TechFak are mounted. To copy data or files to the cluster, you can use files.techfak.de as in

rsync -a files.techfak.de:/homes/$USER/MY/FOLDER /homes/$USER/MY/FOLDER

to copy files from your TechFak home to the cluster. You can also copy files from your PC to the cluster using

rsync -a /PATH/TO/MY/FILE login-1.gpu.cit-ec.net:/homes/$USER/MY/FOLDER/

You can also use any SFTP-capable file transfer program to copy files from your workstation to the cluster and vice versa.

Backups and Snapshots

There are no backups of the home directories or of the volumes mounted in /vol/. As the cluster is designed for data-hungry and computation-intensive jobs, any backup would slow down running jobs. Furthermore, training data can often be restored or regenerated easily, which makes a backup unnecessary. Thus, make sure that you save your results on a different computer and/or synchronize them into your TechFak home or network volumes.
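
For example, you could regularly sync a results folder from the cluster back into your TechFak home via files.techfak.de (run on the login node; the folder name results is a placeholder):

rsync -a ~/results/ files.techfak.de:/homes/$USER/results/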

In case you mess something up in your home, there are snapshots from the last two days. To restore any accidentally deleted or overwritten file, you can copy an older version from the snapshot folders in ~/.zfs/snapshot/.
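
For example, assuming a file was deleted from your home (the snapshot and file names below are placeholders; check the available snapshot names first):

ls ~/.zfs/snapshot/
cp ~/.zfs/snapshot/SNAPSHOT_NAME/path/to/file ~/path/to/file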

Further Instructions

You find further instructions regarding Slurm on the next page: Slurm.