Job Scripts

This page documents how to write job scripts to submit jobs on the Abel (HPC for UiO) and Colossus (HPC for TSD) clusters.
 

Running a job on the cluster involves creating a shell script called a job script. The job script is a plain-text file containing any number of commands, including your main computational task; i.e., it may copy or rename files, cd into the proper directory, etc., before doing the "real" job. The lines in the script file are the commands to be executed, in the given order. Lines starting with "#" are ignored as comments, except lines starting with "#SBATCH", which are not executed but contain special instructions to the queue system.

If you are not familiar with shell scripts, they are simply a set of commands that you could have typed at the command line. You can find more information about shell scripts here: Introduction to Bash shell scripts.

A job script consists of three parts:

  • Instructions to the queue system
  • Commands to set up the execution environment
  • The actual commands you want to be run

Instruction parameters to the queue system may be specified on the sbatch command line and/or in #SBATCH lines in the job script. There can be as many #SBATCH lines as you need, and you can combine several parameters on the same line. If a parameter is specified both on the command line and in the job script, the value given on the command line takes precedence (see the small illustration after the list below). The #SBATCH lines should precede any commands in the script. A few parameters are compulsory. If they are not present, the job will not run:

  • --account: Specifies the project the job will run in.
  • --time: Specifies the maximal wall clock time of the job. Do not specify the limit too low, because the job will be killed if it has not finished within this time. On the other hand, shorter jobs will be started sooner, so do not specify longer than you need. The default maximum allowed --time specification is 1 week; see details here.
  • --mem-per-cpu: Specifies how much RAM each task of the job needs (the default is 1 task; see Parallel Jobs).
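
As a small illustration (the script name and time values are just placeholders), a value given on the sbatch command line overrides the same parameter in the job script:

# In the job script myjob.sh:
#SBATCH --time=01:00:00

# On the command line:
sbatch --time=02:00:00 myjob.sh   # the job gets a 2 hour limit, not 1 hour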

The commands to set up the environment should include

source /cluster/bin/jobsetup

and will most likely include one or more

module load SomeProgram/SomeVersion

jobsetup will set up needed environment variables and shell functions, and must be the first command in the script. module will set up environment variables to get access to the specified program.  It is recommended to specify the explicit version in the module load command.  We also encourage you to use

module purge

prior to the module load statements, to avoid inheriting unknown environment variable settings from the shell you use to submit the job. We also advise using

set -o errexit

early in the script. This makes the script exit immediately if any command below it fails, instead of simply going on to the next command, which makes it much easier to find out whether anything went wrong in the job, and if so, where it happened.
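
As a small illustration (using the same placeholders as the script below), with errexit a failing copy stops the job instead of letting later commands run on missing input:

set -o errexit
cp MyInputFile $SCRATCH   # if this copy fails, the script stops here...
cd $SCRATCH
YourCommands              # ...so the real work is never attempted without the input file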

A Simple Serial Job

This is a script for running a simple, serial job:

#!/bin/bash

# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=YourProject
#
# Wall clock limit:
#SBATCH --time=hh:mm:ss
#
# Max memory usage:
#SBATCH --mem-per-cpu=Size

## Set up job environment:
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

## Copy input files to the work directory:
cp MyInputFile $SCRATCH

## Make sure the results are copied back to the submit directory (see Work Directory below):
chkfile MyResultFile

## Do some work:
cd $SCRATCH
YourCommands

Substitute real values for YourJobname, YourProject, hh:mm:ss, Size (with 'M' for megabytes or 'G' for gigabytes; e.g. 200M or 1G), MyInputFile, MyResultFile and YourCommands.

Work Directory

Each job has access to a separate scratch space directory on the shared file system /work. The name of the directory is stored in the environment variable $SCRATCH. As a general rule, all jobs should use the scratch directory ($SCRATCH) as their work directory. This is especially important if the job uses a lot of files, or does much random access on the files (repeatedly reading and writing to different parts of the files).

There are several reasons for using $SCRATCH:

  • $SCRATCH is on a faster file system than user home directories.
  • There is less risk of interfering with running jobs by accidentally modifying or deleting the jobs' input or output files.
  • Temporary files are automatically cleaned up, because the scratch directory is removed when the job finishes.
  • It avoids taking unneeded backups of temporary and partial files, because $SCRATCH is not backed up.

If you need to access the $SCRATCH area from outside a job (for instance for monitoring the job), the directory is (currently) /work/jobs/jobid.d, where jobid is the job id.
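
For example (the job id is of course just an illustration), you can list your running jobs with squeue to find the job id, and then look at the corresponding scratch directory:

squeue -u $USER           # list your jobs and their job ids
ls /work/jobs/1234567.d   # inspect the scratch directory of job 1234567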

The directory where you ran sbatch is stored in the environment variable $SUBMITDIR. If you want automatic copying of files or directories back to $SUBMITDIR when the job is terminated, mark them with the command chkfile in the job script:

chkfile OneFile AnotherFile SomeDirectory
  • The chkfile command should be placed early in the script, before the main computational commands: the files will be copied back even if the script crashes (even when you use set -o errexit), but not if the script is terminated before it reaches the chkfile command.
  • If you want to use shell metacharacters, they should be quoted, e.g., use chkfile "Results*" instead of chkfile Results*.
  • Do not use absolute paths or things like ".*" or .. in the chkfile command; use paths relative to $SCRATCH.
  • The files/directories are copied to $SUBMITDIR with rsync, which means that if you use chkfile ResultDir, the contents of ResultDir will be copied into $SUBMITDIR/ResultDir, but if you use chkfile ResultDir/ (i.e., with a trailing /), the contents of ResultDir will be copied directly into $SUBMITDIR.
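
For instance (directory and file names are just placeholders), if ResultDir contains out1.txt and out2.txt:

chkfile ResultDir    # out1.txt and out2.txt end up in $SUBMITDIR/ResultDir/
chkfile ResultDir/   # out1.txt and out2.txt end up directly in $SUBMITDIR/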

We recommend using chkfile (or cleanup; see below) instead of explicitly copying the result files with cp.

For instance:

#!/bin/bash
#SBATCH --job-name=YourJobname --account=YourProject
#SBATCH --time=hh:mm:ss --mem-per-cpu=Size
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

## Copy files to work directory:
cp $SUBMITDIR/YourDatafile $SCRATCH

## Mark outfiles for automatic copying to $SUBMITDIR:
chkfile YourOutputfile

## Run command
cd $SCRATCH
YourProgram YourDatafile > YourOutputfile

The $SCRATCH directory is removed upon job exit (after copying back chkfiled files).

If you want more flexibility than what chkfile gives, you can use the command cleanup instead. It is used to specify commands to run when your job exits (before the $SCRATCH directory is removed). Just like chkfile, the cleanup commands are run even if your script crashes (even when you use set -o errexit), but not if it crashes before reaching the cleanup command, so place the command early in the script.

For instance:

cleanup "cp $SCRATCH/outputfile /some/other/directory/newName"

Note: do not use single quotes (') if the commands contain variables like $SCRATCH and $SUBMITDIR.

/work/users/username

If you need to share files between jobs, or use a file several times, it can be copied to /work/users/username (where username is your user name). Files in this directory are automatically deleted after a certain time. Currently, they are deleted after 45 days, but that can change in the future. There is no backup of files in /work/users/username. Note: /work/users/username does not exist on Colossus.
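
For instance (the file name is just a placeholder, and $USER is assumed to hold your user name), to stage a shared input file once for later jobs:

cp LargeReferenceFile /work/users/$USER/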

Splitting a Job into Tasks (Array Jobs)

To run many instances of the same job, use the arrayrun command. This is useful if you have a lot of data-sets which you want to process in the same way with the same job-script. arrayrun works very similarly to mpirun:

arrayrun [-r] from-to [sbatch switches] YourCommand

Typically, YourCommand is a compiled program or a job script. If it is a compiled program, the arrayrun command line must include the sbatch switches necessary to submit the program. If it is a script, it can contain #SBATCH lines in addition to, or instead of, switches on the arrayrun command line. from and to are the first and last task numbers. Each instance of YourCommand can use the environment variable $TASK_ID for selecting which data set to use, etc. For instance:

arrayrun 1-100 MyScript

will run 100 copies of MyScript, setting the environment variable $TASK_ID to 1, 2, ..., 100 in turn.

It is possible to specify the TASK_IDs in other ways than from-to: it can be a single number, a range (from-to), a range with a step size (from-to:step), or a comma separated list of these. A couple of examples:

Specification   Resulting TASK_IDs
1,4,42          1, 4, 42
1-5             1, 2, 3, 4, 5
0-10:2          0, 2, 4, 6, 8, 10
32,56,100-200   32, 56, 100, 101, 102, ..., 200

Note: spaces, decimal numbers or negative numbers are not allowed.

The instances of an array job are independent; they have their own $SCRATCH and are treated like separate jobs. You may also specify a parallel environment in array jobs, so that each instance gets e.g. 8 processors. If the job script running arrayrun is cancelled, all instances of the array job will also be cancelled.

If you specify the -r switch (before the TASK_IDs), arrayrun will restart a task that has failed. To avoid endless loops, a task is only restarted once, and a maximum of 5 tasks will be restarted.

See arrayrun --help for details.

An extended example:

$ cat workerScript
#!/bin/bash
#SBATCH --account=YourProject
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=1G
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

DATASET=dataset.$TASK_ID
OUTFILE=result.$TASK_ID

cp $DATASET $SCRATCH
cd $SCRATCH
chkfile $OUTFILE
YourProgram $DATASET > $OUTFILE
$ cat submitScript
#!/bin/sh
#SBATCH --account=YourProject
#SBATCH --time=50:0:0
#SBATCH --mem-per-cpu=200M
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

arrayrun 1-200 workerScript
$ sbatch submitScript

This job will process the datasets dataset.1, dataset.2, ... dataset.200 and leave the results in result.1, result.2, ... result.200.

Note that the arrayrun command should be executed in the directory that contains the workerScript, so the submitScript should not cd to $SCRATCH. (The workerScript should, though.)

On Abel, the worker scripts of array jobs are good candidates for running in the lowpri QoS. But do not run the submit script itself in the lowpri QoS: if the submit script is restarted, all worker scripts will be rerun.  (There is no lowpri on Colossus.)

Note: the arrayrun command is a locally developed command, and it is quite possible that it will be further developed and changed. Suggestions can be sent to hpc(at)usit.uio.no.

Parallel Jobs

Since the clusters consist of multi-CPU and multi-core compute nodes, there are several ways of parallelising code and running jobs in parallel. One can use MPI, OpenMP and threading. (For SIMD jobs and other jobs with very independent parallel tasks, array jobs are a good alternative.) In this section we will explain how to run in parallel in each of these ways.

Note: A parallel job will get one, shared scratch directory ($SCRATCH), not separate directories for each node! This means that if more than one process or thread write output to disk, they must use different file names, or put the files in different subdirectories.

OpenMP or Threading

In order to run in parallel on one node using either threading or OpenMP, the only thing you need to remember is to tell the queue system that your task needs more than one core, all on the same node. This is done with the --cpus-per-task switch.  It will reserve the needed cores on the node and set the environment variable $OMP_NUM_THREADS. For example:

#!/bin/bash
# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=YourProject
#
# Wall clock limit:
#SBATCH --time=hh:mm:ss
#
# Max memory usage per core (MB):
#SBATCH --mem-per-cpu=MegaBytes
#
# Number of cores:
#SBATCH --cpus-per-task=NumCores

## Set up job environment:
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

## Copy files to work directory:
cp $SUBMITDIR/YourDatafile $SCRATCH

## Mark outfiles for automatic copying to $SUBMITDIR:
chkfile YourOutputfile

## Run command
cd $SCRATCH
## (For non-OpenMP-programs, you must control the number of threads manually, using $OMP_NUM_THREADS.)
YourCommand YourDatafile > YourOutputfile

The --cpus-per-task switch ensures that all cores are allocated on a single node. If you ask for more cores than are available on any node, you will not be able to submit the job. Most nodes on Abel have 16 cores, and on Colossus 20 cores.

MPI

To run MPI jobs, you must specify how many tasks to run (i.e., cores to use), and set up the desired MPI environment. Abel supports OpenMPI, which can be used by setting up the environment with modules:

module load openmpi.compiler/version

where compiler is the compiler you used when compiling your program. See

module avail openmpi

for available versions. Symbol generation differs between the Fortran compilers, hence there are separate versions of the MPI Fortran interface for GNU, Intel, Portland and Open64. All OpenMPI libraries are built using gcc.

You need to use module load both in order to compile your code and to run it. You should also compile and run in the same MPI environment; although some MPI versions may be compatible, they usually are not.
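
For example (the module version is just a placeholder; check module avail openmpi for what is actually installed), to compile and later run a C program with the GNU flavour of OpenMPI:

module load openmpi.gnu/1.8          # load the same module when compiling and when running
mpicc -O2 -o mympiprog mympiprog.c   # compile with the matching MPI wrapper compiler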

If you need to compile C MPI code with icc please see OpenMPI's documentation about environment variables to set in order to force mpicc to use icc.

The simplest way to specify the number of tasks (cores) to use in an MPI job is to use the sbatch switch --ntasks. For instance

#SBATCH --ntasks 10

would give you 10 tasks. The queue system allocates the tasks to nodes depending on available cores, memory, etc. A simple MPI job script can then look like this:

#!/bin/bash
# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=YourProject
#
# Wall clock limit:
#SBATCH --time=hh:mm:ss
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=Size
#
# Number of tasks (cores):
#SBATCH --ntasks=NumTasks

## Set up job environment:
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

module load openmpi.intel
## Set up input and output files:
cp InputFile $SCRATCH
chkfile OutputFile

cd $SCRATCH
mpirun YourCommand

Queue System Options

Using the queue option --ntasks in the previous example, we have assumed that it doesn't matter how your tasks are allocated to nodes. The queue system will run your tasks on nodes as it sees fit (however, it will try to allocate as many tasks as possible on each node). For small jobs (but see "Jobs needing 16 cores or more on Abel" below!), that is usually OK. Sometimes, however, you might need more control. Then you can use the switches --nodes and --ntasks-per-node instead of --ntasks:

  • --nodes: How many nodes to use
  • --ntasks-per-node: How many tasks to run on each node.

For instance, to get 1 task on each of 4 nodes, you can use

#SBATCH --nodes=4 --ntasks-per-node=1

Or, to use 4 tasks on each of 2 nodes:

#SBATCH --nodes=2 --ntasks-per-node=4

There are more advanced options for selecting cores and nodes, as well. See man sbatch for the gory details.

TCP/IP over InfiniBand for MPI

If you have MPI jobs that are hard-wired to use TCP/IP, we have some tricks for using InfiniBand even for these. It is possible to run TCP/IP over InfiniBand with far better performance than over Ethernet. However, this only applies to communication between the compute nodes. Please contact us if you have such an application or want to use TCP/IP over InfiniBand. All nodes have two IP addresses, one for Ethernet and one for InfiniBand.

Useful Commands in MPI Scripts

If you need to execute a command once on each node of a job, you can use

srun --ntasks=$SLURM_JOB_NUM_NODES command
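
For instance (the directory path is just an illustration), to create a temporary directory on every node in the job:

srun --ntasks=$SLURM_JOB_NUM_NODES mkdir -p /tmp/$SLURM_JOB_ID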

Combining MPI with OpenMP or threading

You can combine MPI with OpenMP or threading, such that MPI is used for launching multi-threaded processes on each node. The best way to do this is:

#!/bin/bash
# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=YourProject
#
# Wall clock limit:
#SBATCH --time=hh:mm:ss
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=Size
#
# Number of tasks (MPI ranks):
#SBATCH --ntasks=NumTasks
#
# Number of threads per task:
#SBATCH --cpus-per-task=NumThreads

## Set up job environment:
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors
module load openmpi.intel

## Set up input/output files:
cp InputFile $SCRATCH
chkfile OutputFile

## Run command
## (If YourCommand is not OpenMP, use $OMP_NUM_THREADS to control the number of threads manually.)
mpirun YourCommand

This makes mpirun start NumTasks ranks (processes), each of which has NumThreads threads. If you are not using OpenMP, your program must make sure not to start more than NumThreads threads.

Just as with single-threaded MPI jobs, you can get more than one rank (MPI process) on each node. If you need more control over how many ranks are started on each node, use --ntasks-per-node and --nodes as above. For instance:

#SBATCH --nodes=3 --ntasks-per-node=2 --cpus-per-task=4

will start 2 MPI ranks on each of 3 machines, and each process is allowed to use 4 threads.

Jobs needing 16 cores or more on Abel

Note: this section is only about jobs on Abel.

If your job needs 16 CPU cores or more, it is highly recommended to ask for whole nodes. Also, if your job needs (almost) all the memory on a node, it should ask for a whole node. This applies to all normal jobs (i.e., everything except hugemem, GPU, long, or lowpri jobs). Doing so will likely make your job run faster (especially if its processes communicate a lot) and start sooner. It will also put less strain on the queue system; heavy strain sometimes leads to jobs not starting at all.

This can be done in two ways: Either ask for 16 cores per node (with --ntasks-per-node and/or --cpus-per-task) or make sure the job asks for between 61 and 61.5 GiB RAM per node. The principle is: make sure --ntasks-per-node * --cpus-per-task is 16, or --ntasks-per-node * --cpus-per-task * --mem-per-cpu is between 61 and 61.5 GiB (the default for --cpus-per-task is 1).

For instance:

  • MPI jobs with single threaded tasks (ranks) that need no more than 3936 MiB RAM per task can use --ntasks-per-node=16 --mem-per-cpu=3936 in combination with either --nodes=N or --ntasks=M (where M is a multiple of 16).
  • MPI jobs with single threaded tasks that need more than 3936 MiB per task, should ask for so many tasks per node that the total memory requirement is between 61 and 61.5 GiB per node. For instance, if the job needs at least 5 GiB per task, it can specify --ntasks-per-node=12 --mem-per-cpu=5248, (resulting in 12 * 5248 / 1024 = 61.5 GiB per node), in combination with --nodes or --ntasks as above.
  • MPI jobs with multi threaded tasks can ask for --ntasks-per-node and --cpus-per-task such that --ntasks-per-node * --cpus-per-task is 16 if --mem-per-cpu is no more than 3936.  For instance --ntasks-per-node=2 --cpus-per-task=8 --mem-per-cpu=3936 for MPI tasks with 8 threads. If the required memory per core is higher, the number of tasks * threads must be reduced as above.
  • Single threaded jobs needing more than 61 GiB RAM can simply specify --mem-per-cpu=61G or --mem-per-cpu=62976 (which is 61.5 GiB).
  • Multi-threaded jobs can specify --cpus-per-task=16 if they don't need more than 3936 MiB per core.  If they need more, they should specify --cpus-per-task and --mem-per-cpu such that --cpus-per-task * --mem-per-cpu is between 61 and 61.5 GiB.  For instance, if the job needs 6 GiB per core, it could use --cpus-per-task=10 --mem-per-cpu=6297 (resulting in 10 * 6297 / 1024 = 61.49 GiB).
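
Putting this together, a sketch of the job header for a single-threaded MPI job asking for two whole nodes (using the values from the first point above) could be:

#SBATCH --nodes=2 --ntasks-per-node=16
#SBATCH --mem-per-cpu=3936M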

Please do not use --exclusive! It has strange side effects that most likely will lead to your job being accounted for more CPUs than it is using, sometimes many times more!

Job Workflows (Pipelines)

You can submit a workflow of jobs via SDAG (Slurm Directed Acyclic Graph). It is installed on Abel as a module, and can be used as follows:

module load sdag
sdag <sdag-script>

Structure

The workflow description file includes two types of statements:

  • Job description: JOB job-name job-script-file-path. This must be provided for each job in the workflow. The keyword JOB is case sensitive, and is separated from the job name and the job script file path using space or tab.
  • Workflow: PARENT parent-jobs-list CHILD child-jobs-list. Upon submission, each child job will be submitted with a dependency on all parent jobs. A child job won't start before all parent jobs have completed, i.e., ended with exit code 0. If any of the parent jobs fails, all child jobs will be cancelled. This means that there will be no orphan jobs. A parent/child job list must be either space or tab separated.
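
As a small sketch (job names and file names are just placeholders), a workflow description where one job must finish before another starts could look like:

JOB prepare prepare.sbatch
JOB analyse analyse.sbatch
PARENT prepare CHILD analyse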

Guidelines

You need to follow these guidelines when using sdag:

  • You must provide a valid workflow description file. If not, sdag will return:
    Error: You must enter a valid DAG description file
  • You must provide a valid job description file path in each job description statement. If not, sdag will return:
    Error in line [XX]: XX.sbatch is not a file.
  • In case of a wrong syntax in a job description statement, sdag will return:
    Error in line [XX]: A job definition statement must be written as:
    JOB <job_name> <job_submission_file>
  • In case of a wrong syntax in a workflow statement, sdag will return:
    Error in line [XX]: A workflow Statement must be written as:
    PARENT <parent_jobs> CHILD <children_jobs>
  • In a workflow statement, if one of the child jobs is already defined in a previous workflow statement as a parent for one of the parent jobs, or one of their ancestors, sdag will return:
    Error in line [XX]: Job YY Cannot be a parent for job ZZ. Job ZZ is an ancestor of job YY

Example

To submit a workflow with four jobs (A, B, C, and D) with the following structure:

           B
          / \
start--> A   D -->end
          \ /
           C

Four job scripts are included: jobA.sbatch, jobB.sbatch, jobC.sbatch, and jobD.sbatch. The workflow description file is dagtest.sdag with the following contents:

JOB A jobA.sbatch
JOB B jobB.sbatch
JOB C jobC.sbatch
JOB D jobD.sbatch
PARENT A CHILD B C
PARENT B C CHILD D

Each job calls a Python script, test.py, which takes the following arguments:

  • A comma separated list of input files.
  • An output file name.
  • An integer indicating the runtime in seconds.
  • A job name.

As an example, suppose we have an input file input.txt with the following contents:

This is a test text input file...

Then running:

python test.py input.txt outA.txt 30 A

The output outA.txt will contain the following:

Job[A] Run on machine: compute-10-16.local      Slept for 30 seconds
---------------FILE input.txt STARTED HERE---------------------
This is a test text input file...
---------------FILE input.txt ENDED HERE---------------------

To submit the workflow, you need first to create a job script for each job (A, B, C, and D).

jobA.sbatch:

#!/bin/bash
# Job name:
#SBATCH --job-name=jobA
#
# Project:
#SBATCH --account=<your-project>
#
# Wall clock limit:
#SBATCH --time=00:01:00
#
# Max memory usage:
#SBATCH --mem-per-cpu=40

## Set up job environment
source /cluster/bin/jobsetup

## Copy input files to the work directory:
cp test.py $SCRATCH
cp input.txt $SCRATCH

## Make sure the results are copied back to the submit directory (see Work Directory below):
chkfile outA.txt

## Do some work:
cd $SCRATCH
python test.py input.txt outA.txt 30 A

jobB.sbatch:

.......
## Copy input files to the work directory:
cp test.py $SCRATCH
cp outA.txt $SCRATCH

## Make sure the results are copied back to the submit directory (see Work Directory below):
chkfile outB.txt

## Do some work:
cd $SCRATCH
python test.py outA.txt outB.txt 20 B

jobC.sbatch:

.......
## Copy input files to the work directory:
cp test.py $SCRATCH
cp outA.txt $SCRATCH

## Make sure the results are copied back to the submit directory (see Work Directory below):
chkfile outC.txt

## Do some work:
cd $SCRATCH
python test.py outA.txt outC.txt 20 C

jobD.sbatch:

.......
## Copy input files to the work directory:
cp test.py $SCRATCH
cp outB.txt $SCRATCH
cp outC.txt $SCRATCH

## Make sure the results are copied back to the submit directory (see Work Directory below):
chkfile outD.txt

## Do some work:
cd $SCRATCH
python test.py outB.txt,outC.txt outD.txt 20 D

Now submit the workflow:

module load sdag
sdag dagtest.sdag

You will get an output like the following:

Job A   ID: 11008698
Parents:
Children: C,B

Job B   ID: 11008700
Parents: A
Children: D

Job C   ID: 11008699
Parents: A
Children: D

Job D   ID: 11008701
Parents: C,B
Children:

For each job, four values are displayed:

  • Job name: given in the workflow description.
  • Job ID: the real job identifier after submitting the job to Slurm. It can be used to check the status of the job.
  • Job parents: a comma separated list of jobs on which the job is dependent.
  • Job children: a comma separated list of jobs which are dependent on this job.

The output of job A, outA.txt:

Job[A] Run on machine: compute-16-13.local    Slept for 30 seconds
---------------FILE input.txt STARTED HERE---------------------
This is a test text input file...
---------------FILE input.txt ENDED HERE---------------------

The output of job B (child of job A):

Job[B] Run on machine: compute-16-21.local    Slept for 20 seconds
---------------FILE outA.txt STARTED HERE---------------------
Job[A] Run on machine: compute-16-13.local    Slept for 30 seconds
---------------FILE input.txt STARTED HERE---------------------
This is a test text input file...
---------------FILE input.txt ENDED HERE---------------------

---------------FILE outA.txt ENDED HERE---------------------

The output of job D (child of jobs B and C):

Job[D] Run on machine: compute-16-23.local    Slept for 20 seconds
---------------FILE outB.txt STARTED HERE---------------------
Job[B] Run on machine: compute-16-21.local    Slept for 20 seconds
---------------FILE outA.txt STARTED HERE---------------------
Job[A] Run on machine: compute-16-13.local    Slept for 30 seconds
---------------FILE input.txt STARTED HERE---------------------
This is a test text input file...
---------------FILE input.txt ENDED HERE---------------------

---------------FILE outA.txt ENDED HERE---------------------

---------------FILE outB.txt ENDED HERE---------------------

Job[D] Run on machine: compute-16-23.local    Slept for 20 seconds
---------------FILE outC.txt STARTED HERE---------------------
Job[C] Run on machine: compute-16-21.local    Slept for 20 seconds
---------------FILE outA.txt STARTED HERE---------------------
Job[A] Run on machine: compute-16-13.local    Slept for 30 seconds
---------------FILE input.txt STARTED HERE---------------------
This is a test text input file...
---------------FILE input.txt ENDED HERE---------------------

---------------FILE outA.txt ENDED HERE---------------------

---------------FILE outC.txt ENDED HERE---------------------

As shown in the above outputs, each job in the workflow depends on its parent(s). You can construct your own workflows in the same way.

SDAG is developed and maintained by the Abel team. The source code is on GitHub; comments and contributions are welcome.

Debugging tools

TotalView is installed on Abel. The corresponding module is totalview. For more information, please visit the TotalView web site at http://www.roguewave.com/products/totalview.aspx. (TotalView is not available on Colossus.)

Large Memory Jobs

Most nodes are equipped with 16 cores (Abel) or 20 cores (Colossus) and 64 GiB of RAM, of which 61.5 GiB can be used for jobs. (The rest is used by the operating system.) There are also a couple of special hugemem nodes with 32 cores and 1 TiB memory, of which about 1006 GiB can be used for jobs. To see how many nodes there currently are in the cluster, together with their number of cores and MB RAM, use

sinfo -e -o '%D %c %m'

You specify how much memory per core your job should be allowed to use by setting the --mem-per-cpu parameter. Technically, this limit specifies the amount of resident memory + swap the job can use. If the job tries to use more than this setting, it will be killed. For instance

#SBATCH --mem-per-cpu=2000M

If the job tries to use more than 2000 MiB resident memory, it will be killed. Note that --mem-per-cpu is specified per requested core.

Also, if you need more than 61.5 GiB RAM on a single node, you must specify --partition=hugemem to get access to the nodes with more RAM. For instance

#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=10G --partition=hugemem

Note: There is no need to specify how much RAM a node must have; the queue system will not allocate jobs on nodes with too little free RAM. Therefore, one should not use the --mem specification.

Accounting of Large Memory Jobs

To ensure maximal utilisation of the cluster, memory usage is accounted as well as cpu usage. Memory specifications are converted to "Processor Equivalents" (PE) using a conversion factor of approximately 4 GiB / core(*). If a job specifies more than 4 GiB RAM per task, i.e., --mem-per-cpu=M, where M > 4G, each task will count as M / 4G cores instead of 1 core. For instance, a job with --ntasks=2 --mem-per-cpu=8G will be counted as using 4 cores instead of 2.

The reason for this is that large memory jobs make the "unused" cores inaccessible to other jobs. For instance, a job on a 64 GiB node using --ntasks=1 --mem-per-cpu=61G will in practice use all cores on the node, and should be accounted as such.

Note that only jobs which specify more than 4 GiB per core will be affected by this; all other jobs will be accounted with the number of tasks specified.

(*) The exact value of the factor depends on the total amount of RAM per core in the cluster, and is currently about 4.5 GiB / core on Abel and 4.3 GiB / core on Colossus.

GPU Jobs

Abel has a few nodes with GPU accelerators.  (There are no GPUs on Colossus.) Each node has two GPU cards. To get access to these nodes, specify

#SBATCH --partition=accel --gres=gpu:1

to use one card, or

#SBATCH --partition=accel --gres=gpu:2

to use both.

In addition, you must specify how many CPUs you need with --ntasks, --ntasks-per-node and/or --cpus-per-task as usual, as well as the amount of RAM the job needs.

In the job script, the environment variable CUDA_VISIBLE_DEVICES will show which GPU device(s) to use. It will have values '0', '1' or '0,1' corresponding to /dev/nvidia0, /dev/nvidia1 or both, respectively.

Here is an example script, running MrBayes on 1 CPU (the default) and 1 GPU:

#!/bin/bash
#
#SBATCH --job-name=YourJobName --account=YourProject
#SBATCH --time=TimeLimit
#
## Ask for 1 GPU
#SBATCH --partition=accel --gres=gpu:1
#
## and 3000 MiB RAM
#SBATCH --mem-per-cpu=3000

## Set up job environment:
source /cluster/bin/jobsetup
module purge   # clear any inherited modules
set -o errexit # exit on errors

## Get access to the GPU enabled MrBayes:
module load mrbayes_gpu

mb YourInputFile

The Lowpri QoS

On Abel, it is possible to run your program on somebody else's cores if they are not using them. To do this, you specify that you want to use the lowpri QoS (Quality of Service). This is done by specifying:

#SBATCH --qos=lowpri

in your script.

If the owner of the cores you are running on wants to use them again before your job is finished, your job will be stopped and put back on the queue. Therefore, if your job is longer than a couple of hours, you should make sure it checkpoints, i.e., saves intermediate results so that it can start from where it left off if it is restarted.

Note that it is not possible to run hugemem jobs, GPU jobs or Notur jobs in the lowpri QoS. There is no lowpri QoS on Colossus.

Checkpointing

Checkpointing a job means that the job can be stopped and started somewhere else, and continues where it left off.

Long-running jobs should implement some form of checkpointing, by saving intermediate results at intervals and being able to start with the latest intermediate results if restarted. This is especially important for lowpri-jobs which can be stopped and requeued, but also for regular jobs in order to guard against node failures etc.
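
A minimal sketch of how this can look in a job script (file names are placeholders; how intermediate results are saved and resumed from is entirely up to your application):

## Copy the latest checkpoint into the work directory, if one exists
## (|| true keeps set -o errexit from aborting the very first run):
cp $SUBMITDIR/checkpoint.dat $SCRATCH || true
## Copy the checkpoint and the results back when the job ends:
chkfile checkpoint.dat MyResultFile
cd $SCRATCH
## YourProgram is assumed to write checkpoint.dat at intervals and to resume from it if present:
YourProgram > MyResultFile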

Automatic checkpointing is very hard to implement, and the current trend is to leave this to the application. Using a parallel file system and parallel I/O, the complete data structure can be saved to disk within a reasonable amount of time. The MTBF for very large clusters (larger than Abel; PRACE tier-0 size clusters) is low, so for very large jobs checkpoint/restart needs to be in place.

Useful sbatch/qlogin/srun parameters

Parameter Description
--account=project Specify the project to run under. This parameter is required.
--begin=time Start the job at a given time
--constraint=feature Request nodes with a certain feature. Currently supported features include intel, ib, rackN. If you need more than one feature, they must be combined with & in the same --constraint specification, e.g. --constraint=ib&rack21. Note: If you try to use more than one --constraint specification, the last one will override the earlier ones.
--cpus-per-task=cores Specify the number of cpus (actually: cores) to allocate for each task in the job.  See --ntasks and --ntasks-per-node, or man sbatch.
--dependency=dependency list Defer the start of this job until the specified dependencies have been satisfied. See man sbatch for details.
--error=file Send 'stderr' to the specified file. Default is to send it to the same file as 'stdout'. (Note: $HOME or ~ cannot be used. Use absolute or relative paths instead.)
--input=file Read 'stdin' from the specified file. (Note: $HOME or ~ cannot be used. Use absolute or relative paths instead.)
--job-name=jobname Specify job name
--mem-per-cpu=size Specify the memory required per allocated core. This is the normal way of specifying memory requirements (see Large Memory Jobs). size should be an integer followed by 'M' or 'G'.
--partition=hugemem Run on a hugemem node (see Large Memory Jobs).
--nodes=nodes Specify the number of nodes to allocate. nodes can be an integer or a range (min-max). This is often combined with --ntasks-per-node.
--ntasks=tasks Specify the number of tasks (usually cores, but see --cpus-per-task) to allocate. This is the usual way to specify cores in MPI jobs.
--ntasks-per-node=tasks Specify the number of tasks (usually cores, but see --cpus-per-task) to allocate within each allocated node. Often combined with --nodes.
--output=file Send 'stdout' (and 'stderr' if not redirected with --error) to the specified file instead of slurm-%j.out. Notes: (1) $HOME or ~ cannot be used; use absolute or relative paths instead. (2) The job id should always be included in the filename in order to produce a separate file for each job (e.g. --output=custom-name%j.out); this is necessary for error tracing.
--qos=lowpri Run a job in the lowpri qos (not available on Colossus)
--time=time Specify a (wall clock) time limit for the job. time can be hh:mm:ss or dd-hh:mm:ss. This parameter is required.
--mail-type=event Notify user by email when certain event types occur. Valid events are BEGIN, END, FAIL, REQUEUE, and ALL. (Mail notifications are not available on Colossus.)

sbatch policy settings

A set of extra tests and policy settings has been added to the sbatch command, to make sure jobs will start and run properly. The additional tests ensure:

  • Jobs that will run on only one node are not allowed to specify --constraint=ib
  • All account names are lower cased
  • That all job comments and constraints are passed to SLURM (by default, only the last specification is passed on when multiple specifications exist)