To run a job on the cluster involves creating a shell script called a job script. The job script is a plain-text file containing any number of commands, including your main computational task, i.e., it may copy or rename files, cd into the proper directory, etc., all before doing the "real" job. The lines in the script file are the commands to be executed, in the given order. Lines starting with a "#" are ignored as comments, except lines that start with a "#SBATCH" which are not executed, but contain special instructions to the queue system.
If you are not familiar with shell scripts, they are simply a set of commands that you could have typed at the command line. You can find more information about shell scripts here: Introduction to Bash shell scripts.
A job script consists of a couple of parts:
- Instructions to the queue system
- Commands to set up the execution environment
- The actual commands you want to be run
Instruction parameters to the queue system may be specified on the sbatch command line and/or in #SBATCH lines in the job script. There can be as many #SBATCH lines as you need, and you can combine several parameters on the same line. If a parameter is specified both on the command line and in the jobscript, the parameter specified on the command line takes precedence. The #SBATCH lines should precede any commands in the script. A couple of parameters are compulsory. If they are not present, the job will not run:
- --account: Specifies the project the job will run in.
- --time: Specifies the maximal wall clock time of the job. Do not specify the limit too low, because the job will be killed if it has not finished within this time. On the other hand, shorter jobs will be started sooner, so do not specify longer than you need. The default maximum allowed --time specification is 1 week; see details here.
- --mem-per-cpu: Specifies how much RAM each task (default is 1 task; see Parallell Jobs) of the job needs.
The commands to set up the environment should include
and will most likely include one or more
module load SomeProgram/SomeVersion
jobsetup will set up needed environment variables and shell functions, and must be the first command in the script. module will set up environment variables to get access to the specified program. It is recommended to specify the explicit version in the module load command. We also encourage you to use
prior to the module load statements, to avoid inheriting unknown environment variable settings from the shell you use to submit the job. We also advice using
set -o errexit
early in the script. This makes the script exit immediately if any command below it fails, instead of simply going on to the next command. This makes it much easier to find out whether anything went wrong in the job, and if so, where it happened.
A Simple Serial Job
This is a script for running a simple, serial job
#!/bin/bash # Job name: #SBATCH --job-name=YourJobname # # Project: #SBATCH --account=YourProject # # Wall clock limit: #SBATCH --time=hh:mm:ss # # Max memory usage: #SBATCH --mem-per-cpu=Size ## Set up job environment: source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors ## Copy input files to the work directory: cp MyInputFile $SCRATCH ## Make sure the results are copied back to the submit directory (see Work Directory below): chkfile MyResultFile ## Do some work: cd $SCRATCH YourCommands
Substitute real values for YourJobname, YourProject, hh:mm:ss, Size (with 'M' for megabytes or 'G' for gigabytes; e.g. 200M or 1G), MyInputFile, MyResultFile and YourCommands.
Each job has access to a separate scratch space directory on the shared file system /work. The name of the directory is stored in the environment variable $SCRATCH. As a general rule, all jobs should use the scratch directory ($SCRATCH) as its work directory. This is especially important if the job uses a lot of files, or does much random access on the files (repeatedly read and write to different places of the files).
There are several reasons for using $SCRATCH:
- $SCRATCH is on a faster file system than user home directories.
- There is less risk of interfering with running jobs by accidentally modifying or deleting the jobs' input or output files.
- Temporary files are automatically cleaned up, because the scratch directory is removed when the job finishes.
- It avoids taking unneeded backups of temporary and partial files, because $SCRATCH is not backed up.
If you need to access the $SCRATCH area from outside a job (for instance for monitoring the job), the directory is (currently) /work/jobs/jobid.d, where jobid is the job id.
The directory where you ran sbatch is stored in the environment variable $SUBMITDIR. If you want automatic copying of files or directories back to $SUBMITDIR when the job is terminated, mark them with the command chkfile in the job script:
chkfile OneFile AnotherFile SomeDirectory
- The chkfile command should be placed early in the script, before the main computational commands: The files will be copied back even if the script crashes (even when you use set -o errexit), but not if it is terminated before it got to the chkfile command.
- If you want to use shell-metacharacters, they should be quoted. I.e., use chkfile "Results*" instead of chkfile Results*.
- Do not use absolute paths or things like ".*" or .. in the chkfile command; use paths relative to $SCRATCH.
- The files/directories are copied to $SUBMITDIR with rsync, which means that if you use chkfile ResultDir, the contents of ResultDir will be copied into $SUBMITDIR/ResultDir, but if you use chkfile ResultDir/ (i.e., with a trailing /), the contents of ResultDir will be copied directly into $SUBMITDIR.
We recommend using chkfile (or cleanup; see below) instead of explicitly copying the result files with cp.
#!/bin/bash #SBATCH --job-name=YourJobname --account=YourProject #SBATCH --time=hh:mm:ss --mem-per-cpu=Size source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors ## Copy files to work directory: cp $SUBMITDIR/YourDatafile $SCRATCH ## Mark outfiles for automatic copying to $SUBMITDIR: chkfile YourOutputfile ## Run command cd $SCRATCH YourProgram YourDatafile > YourOutputfile
The $SCRATCH directory is removed upon job exit (after copying back chkfiled files).
If you want more flexibility than what chkfile gives, you can use the command cleanup instead. It is used to specify commands to run when your job exits (before the $SCRATCH directory is removed). Just like chkfile, the cleanup commands are run even if your script crashes (even when you use set -o errexit), but not if it crashes before reaching the cleanup command, so place the command early in the script.
cleanup "cp $SCRATCH/outputfile /some/other/directory/newName"
Note: do not use single quotes (') if the commands contain variables like $SCRATCH and $SUBMITDIR.
If you need to share files between jobs, or use a file several times, it can be copied to /work/users/username (where username is your user name). Files in this directory are automatically deleted after a certain time. Currently, they are deleted after 45 days, but that can change in the future. There is no backup of files in /work/users/username. Note: /work/users/username does not exist on Colossus.
Splitting a Job into Tasks (Array Jobs)
To run many instances of the same job, use the arrayrun command. This is useful if you have a lot of data-sets which you want to process in the same way with the same job-script. arrayrun works very similarly to mpirun:
arrayrun [-r] from-to [sbatch switches] YourCommand
Typically, YourCommand is a compiled program or a job script. If it is a compiled program, the arrayrun command line must include the sbatch switches neccessary to submit the program. If it is a script, it can contain #SBATCH lines in addition to or instead of switches on the arrayrun command line. from and to are the first and last task number. Each instance of YourCommand can use the environment variable $TASK_ID for selecting which data set to use, etc. For instance:
arrayrun 1-100 MyScript
will run 100 copies of MyScript, setting the environment variable $TASK_ID to 1, 2, ..., 100 in turn.
It is possible to specify the TASK_IDs in other ways than from-to: it can be a single number, a range (from-to), a range with a step size (from-to:step), or a comma separated list of these. A couple of examples:
Specification Resulting TASK_IDs 1,4,42 # 1, 4, 42 1-5 # 1, 2, 3, 4, 5 0-10:2 # 0, 2, 4, 6, 8, 10 32,56,100-200 # 32, 56, 100, 101, 102, ..., 200
Note: spaces, decimal numbers or negative numbers are not allowed.
The instances of an array job are independent, they have their own $SCRATCH and are treated like separate jobs. You may also specify a parallel environment in array-jobs, so that each instance gets e.g. 8 processors. If the job script running arrayrun is cancelled, all instances of the array job will also be cancelled.
If you specify the -r switch (before the TASK_IDs), arrayrun will restart a task that has failed. To avoid endless loops, a task is only restarted once, and a maximum of 5 tasks will be restarted.
See arrayrun --help for details.
An extended example:
$ cat workerScript #!/bin/bash #SBATCH --account=YourProject #SBATCH --time=1:0:0 #SBATCH --mem-per-cpu=1G source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors DATASET=dataset.$TASK_ID OUTFILE=result.$TASK_ID cp $DATASET $SCRATCH cd $SCRATCH chkfile $OUTFILE YourProgram $DATASET > $OUTFILE
$ cat submitScript #!/bin/sh #SBATCH --account=YourProject #SBATCH --time=50:0:0 #SBATCH --mem-per-cpu=200M source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors arrayrun 1-200 workerScript
$ sbatch submitScript
This job will process the datasets dataset.1, dataset.2, ... dataset.200 and leave the results in result.1, result.2, ... result.200.
Note that the arrayrun command should be executed in the directory that contains the workerScript, so the submitScript should not cd to $SCRATCH. (The workerScript should, though.)
On Abel, the worker scripts of array jobs are good candidates for running in the lowpri QoS. But do not run the submit script itself in the lowpri QoS: if the submit script is restarted, all worker scripts will be rerun. (There is no lowpri on Colossus.)
Note: the arrayrun command is a locally developed command, and it is quite possible that it will be further developed and changed. Suggestions can be sent to hpc(at)usit.uio.no.
Since the clusters consist of multi-cpu and multi-core compute nodes, there are several ways of parallelising code and running jobs in parallel. One can use MPI, OpenMP and threading. (For SIMD jobs and other jobs with very independent parallel tasks, array jobs is a good alternative.) In this section we will explain how to run in parallel in either of these ways.
Note: A parallel job will get one, shared scratch directory ($SCRATCH), not separate directories for each node! This means that if more than one process or thread write output to disk, they must use different file names, or put the files in different subdirectories.
OpenMP or Threading
In order to run in parallel on one node using either threading or OpenMP, the only thing you need to remember is to tell the queue system that your task needs more than one core, all on the same node. This is done with the --cpus-per-task switch. It will reserve the needed cores on the node and set the environment variable $OMP_NUM_THREADS. For example:
#!/bin/bash # Job name: #SBATCH --job-name=YourJobname # # Project: #SBATCH --account=YourProject # # Wall clock limit: #SBATCH --time=hh:mm:ss # # Max memory usage per core (MB): #SBATCH --mem-per-cpu=MegaBytes # # Number of cores: #SBATCH --cpus-per-task=NumCores ## Set up job environment: source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors ## Copy files to work directory: cp $SUBMITDIR/YourDatafile $SCRATCH ## Mark outfiles for automatic copying to $SUBMITDIR: chkfile YourOutputfile ## Run command cd $SCRATCH ## (For non-OpenMP-programs, you must control the number of threads manually, using $OMP_NUM_THREADS.) YourCommand YourDatafile > YourOutputfile
The --cpus-per-task will ensure all cores are allocated on a single node. If you ask for more cores than is available on any node, you will not be able to submiut the job. Most nodes on Abel have 16 cores, and on Colossus 20 cores.
To run MPI jobs, you must specify how many tasks to run (i.e., cores to use), and set up the desired MPI environment. Abel supports OpenMPI, which can be used by setting up the environment with modules:
module load openmpi.compiler/version
where compiler is the compiler you used when compiling your program. See
module avail openmpi
for available versions. The symbol generation for different fortran compilers differ, hence versions of the MPI fortran interface for GNU, Intel, Portland and Open64. All OpenMPI libraries are built using gcc.
You need to use module load both in order to compile your code and to run it. You also should compile and run in the same MPI-environment. Although some MPI-versions may be compatible, they usually are not.
If you need to compile C MPI code with icc please see OpenMPI's documentation about environment variables to set in order to force mpicc to use icc.
The simplest way to specify the number of tasks (cores) to use in an MPI job, is to use the sbatch switch --ntasks. For instance
#SBATCH --ntasks 10
would give you 10 tasks. The queue system allocates the tasks to nodes depending on available cores and memory, etc. A simple MPI jobscript can then be like:
#!/bin/bash # Job name: #SBATCH --job-name=YourJobname # # Project: #SBATCH --account=YourProject # # Wall clock limit: #SBATCH --time=hh:mm:ss # # Max memory usage per task: #SBATCH --mem-per-cpu=Size # # Number of tasks (cores): #SBATCH --ntasks=NumTasks ## Set up job environment: source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors module load openmpi.intel
## Set up input and output files: cp InputFile $SCRATCH chkfile OutputFile cd $SCRATCH mpirun YourCommand
Queue System Options
Using the queue option --ntasks in the previous example, we have assumed that it doesn't matter how your tasks are allocated to nodes. The queue system will run your tasks on nodes as it sees fit (however, it will try to allocate as many tasks as possible on each node). For small jobs (but see "Jobs needing 16 cores or more on Abel" below!), that is usually OK. Sometimes, however, you might need more control. Then you can use the switches --nodes and --ntasks-per-node instead of --ntasks:
- --nodes: How many nodes to use
- --ntasks-per-node: How many tasks to run on each node.
For instance, to get 1 task on each of 4 nodes, you can use
#SBATCH --nodes=4 --ntasks-per-node=1
Or, to use 4 task on each of 2 nodes:
#SBATCH --nodes=2 --ntasks-per-node=4
There are more advanced options for selecting cores and nodes, as well. See man sbatch for the gory details.
TCP/IP over InfiniBand for MPI
If you have MPI jobs hard linked to use TCP/IP we have some tricks to use InfiniBand even for these. It is possible to run the TCP/IP over InfiniBand with far better performance than over Ethernet. However this only apply to communications between the compute nodes. Please contact us if you have such an application or want to use TCP/IP over InfiniBand. All nodes have two IP numbers one for the Ethernetnet and one for InfiniBand.
Useful Commands in MPI Scripts
If you need to execute a command once on each node of a job, you can use
srun --ntasks=$SLURM_JOB_NUM_NODES command
Combining MPI with OpenMP or threading
You can combine MPI with OpenMP or threading, such that MPI is used for launching multi-threaded processes on each node. The best way to do this is:
#!/bin/bash # Job name: #SBATCH --job-name=YourJobname # # Project: #SBATCH --account=YourProject # # Wall clock limit: #SBATCH --time=hh:mm:ss # # Max memory usage per task: #SBATCH --mem-per-cpu=Size # # Number of tasks (MPI ranks): #SBATCH --ntasks=NumTasks # # Number of threads per task: #SBATCH --cpus-per-task=NumThreads ## Set up job environment: source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors module load openmpi.intel ## Set up input/output files: cp InputFile $SCRATCH chkfile OutputFile ## Run command ## (If YourCommand is not OpenMP, use $OMP_NUM_THREADS to control the number of threads manually.) mpirun YourCommand
This makes mpirun start NumTasks ranks (processes), each of which having NumThreads threads. If you are not using OpenMP, your program must make sure not to start more than NumThreads threads.
Just as with single-threaded MPI jobs, you can get more than one rank (MPI process) on each node. If you need more control over how many ranks are started on each node, use --ntasks-per-node and --nodes as above. For instance:
#SBATCH --nodes=3 --ntasks-per-node=2 --cpus-per-task=4
will start 2 MPI ranks on each of 3 machines, and each process is allowed to use 4 threads.
Jobs needing 16 cores or more on Abel
Note: this section is only about jobs on Abel.
If your job needs 16 CPU cores or more, it is highly recommended to ask for whole nodes. Also if your job needs (almost) all memory on a node, it should ask for a whole node. This applies to all normal jobs (i.e., everything except hugemem, GPU, long, or lowpri jobs). Doing so will likely make your job run faster (especially if its processes communicate a lot) and start sooner. It will also put less strain on the queue system, which sometimes leads to jobs not starting at all.
This can be done in two ways: Either ask for 16 cores per node (with --ntasks-per-node and/or --cpus-per-task) or make sure the job asks for between 61 and 61.5 GiB RAM per node. The principle is: make sure --ntasks-per-node * --cpus-per-task is 16, or --ntasks-per-node * --cpus-per-task * --mem-per-cpu is between 61 and 61.5 GiB (the default for --cpus-per-task is 1).
- MPI jobs with single threaded tasks (ranks) that need no more then 3936 MiB RAM per task, can use --ntasks-per-node=16 --mem-per-cpu=3936 in combination with either --nodes=N or --ntasks=M (where M is a multiple of 16).
- MPI jobs with single threaded tasks that need more than 3936 MiB per task, should ask for so many tasks per node that the total memory requirement is between 61 and 61.5 GiB per node. For instance, if the job needs at least 5 GiB per task, it can specify --ntasks-per-node=12 --mem-per-cpu=5248, (resulting in 12 * 5248 / 1024 = 61.5 GiB per node), in combination with --nodes or --ntasks as above.
- MPI jobs with multi threaded tasks can ask for --ntasks-per-node and --cpus-per-task such that --ntasks-per-node * --cpus-per-task is 16 if --mem-per-cpu is no more than 3936. For instance --ntasks-per-node=2 --cpus-per-task=8 --mem-per-cpu=3936 for MPI tasks with 8 threads. If the required memory per core is higher, the number of tasks * threads must be reduced as above.
- Single threaded jobs needing more than 61 GiB RAM can simply specify --mem-per-cpu=61G or --mem-per-cpu=62976 (which is 61.5 GiB).
- Multi-threaded jobs can specify --cpus-per-task=16 if they don't need more than 3936 MiB per core. If they need more, they should specify --cpus-per-task and --mem-per-cpu such that --cpus-per-task * --mem-per-cpu is between 61 and 61.5 GiB. For instance, if the job needs 6 GiB per core, it could use --cpus-per-task=10 --mem-per-cpu=6297 (resulting in 10 * 6297 / 1024 = 61.49 GiB).
Please do not use --exclusive! It has strange side effects that most likely will lead to your job being accounted for more CPUs than it is using, sometimes many times more!
Job Workflows (Pipelines)
You can submit a workflow of jobs via SDAG (Slurm Direct Acyclic Graph). It is installed on Abel as module, and can be used:
module load sdag
The workflow description file includes two types of statements:
- Job description:
JOBjob-name job-script-file-path. This must be provided for each job in the workflow. The keyword
JOBis case sensitive, and is separated from the job script file path using space or tab.
CHILDchild-jobs-list. Upon the submission, each child job will be submitted with dependency on all parent jobs. A child job won't start before all parent jobs are completed, i.e. ended with exit code 0. If any of the parent jobs fails, all child jobs will be cancelled. This means that there will be no orphan jobs. A parent/child job list must be either space or tab separated.
You need to follow these guidelines when using sdag:
- You must provide a valid workflow description file. If not, sdag will return:
Error: You must enter a valid DAG description file
- You must provide a valid job description file path in each job description statement. If not, sdag will return:
Error in line [XX]: XX.sbatch is not a file.
- In case of a wrong syntax in a job description statement, sdag will return:
Error in line [XX]: A job definition statement must be written as: JOB <job_name> <job_submission_file>
- In case of a wrong syntax in a workflow statement, sdag will return:
Error in line [XX]: A workflow Statement must be written as: PARENT <parent_jobs> CHILD <children_jobs>
- In a workflow statement, if one of the child jobs is already defined in a previous workflow statement as a parent for one of the parent jobs, or one of their ancestors, sdag will return:
Error in line [XX]: Job YY Cannot be a parent for job ZZ. Job ZZ is an ancestor of job YY
To submit a workflow with four jobs (A, B, C, and D) with the following structure:
B / \ start--> A D -->end \ / C
Four jobs scripts are included:
jobD.sbatch. The workflow description file is
dagtest.sdag with the following contents:
JOB A jobA.sbatch JOB B jobB.sbatch JOB C jobC.sbatch JOB D jobD.sbatch PARENT A CHILD B C PARENT B C CHILD D
Each job is calling a python script
test.py, which takes the following arguments:
- A comma separated list of input files.
- An output file name.
- An integer indicating the runtime in seconds.
- A job name.
An example, having an input file
input.txt with the following contents:
This is a test text input file...
python test.py input.txt outA.txt 30 A
outA.txt will contain the following:
Job[A] Run on machine: compute-10-16.local Slept for 30 seconds ---------------FILE input.txt STARTED HERE--------------------- This is a test text input file... ---------------FILE input.txt ENDED HERE---------------------
To submit the workflow, you need first to create a job script for each job (A, B, C, and D).
#!/bin/bash # Job name: #SBATCH --job-name=jobA # # Project: #SBATCH --account=<your-project> # # Wall clock limit: #SBATCH --time=00:01:00 # # Max memory usage: #SBATCH --mem-per-cpu=40 ## Set up job environment source /cluster/bin/jobsetup ## Copy input files to the work directory: cp test.py $SCRATCH cp input.txt $SCRATCH ## Make sure the results are copied back to the submit directory (see Work Directory below): chkfile outA.txt ## Do some work: cd $SCRATCH python test.py input.txt outA.txt 30 A
....... ## Copy input files to the work directory: cp test.py $SCRATCH cp outA.txt $SCRATCH ## Make sure the results are copied back to the submit directory (see Work Directory below): chkfile outB.txt ## Do some work: cd $SCRATCH python test.py outA.txt outB.txt 20 B
....... ## Copy input files to the work directory: cp test.py $SCRATCH cp outA.txt $SCRATCH ## Make sure the results are copied back to the submit directory (see Work Directory below): chkfile outC.txt ## Do some work: cd $SCRATCH python test.py outA.txt outC.txt 20 C
....... ## Copy input files to the work directory: cp test.py $SCRATCH cp outB.txt $SCRATCH cp outC.txt $SCRATCH ## Make sure the results are copied back to the submit directory (see Work Directory below): chkfile outD.txt ## Do some work: cd $SCRATCH python test.py outB.txt,outC.txt outD.txt 20 D
Now submit the workflow:
module load sdag sdag dagtest.sdag
You will get an output like the following:
Job A ID: 11008698 Parents: Children: C,B Job B ID: 11008700 Parents: A Children: D Job C ID: 11008699 Parents: A Children: D Job D ID: 11008701 Parents: C,B Children:
For each job, four values are displayed:
- Job name: given in the workflow description.
- Job ID: the real job indentifier after submitting the job to Slurm. It can be used to check the status of the job.
- Job parents: a comma separated list of jobs on which the job is dependent.
- Job children: a comma separated list of jobs which are dependent on this job.
The output of job A, outA.txt:
Job[A] Run on machine: compute-16-13.local Slept for 30 seconds ---------------FILE input.txt STARTED HERE--------------------- This is a test text input file... ---------------FILE input.txt ENDED HERE---------------------
The output of job B (child of job A):
Job[B] Run on machine: compute-16-21.local Slept for 20 seconds ---------------FILE outA.txt STARTED HERE--------------------- Job[A] Run on machine: compute-16-13.local Slept for 30 seconds ---------------FILE input.txt STARTED HERE--------------------- This is a test text input file... ---------------FILE input.txt ENDED HERE--------------------- ---------------FILE outA.txt ENDED HERE---------------------
The output of job D (child of jobs B and C):
Job[D] Run on machine: compute-16-23.local Slept for 20 seconds ---------------FILE outB.txt STARTED HERE--------------------- Job[B] Run on machine: compute-16-21.local Slept for 20 seconds ---------------FILE outA.txt STARTED HERE--------------------- Job[A] Run on machine: compute-16-13.local Slept for 30 seconds ---------------FILE input.txt STARTED HERE--------------------- This is a test text input file... ---------------FILE input.txt ENDED HERE--------------------- ---------------FILE outA.txt ENDED HERE--------------------- ---------------FILE outB.txt ENDED HERE--------------------- Job[D] Run on machine: compute-16-23.local Slept for 20 seconds ---------------FILE outC.txt STARTED HERE--------------------- Job[C] Run on machine: compute-16-21.local Slept for 20 seconds ---------------FILE outA.txt STARTED HERE--------------------- Job[A] Run on machine: compute-16-13.local Slept for 30 seconds ---------------FILE input.txt STARTED HERE--------------------- This is a test text input file... ---------------FILE input.txt ENDED HERE--------------------- ---------------FILE outA.txt ENDED HERE--------------------- ---------------FILE outC.txt ENDED HERE---------------------
As shown in the above outputs, each job in the workflow is dependent on its parent. The same way you may construct your own workflow.
SDAG is developed and maintained by the Abel team. The source code is on github, comments and contributions are welcome
TotalView is installed on Abel. The corresponding module is totalview. For more information please visit the totalview web site at http://www.roguewave.com/products/totalview.aspx (TotalView is not available on Colossus.)
Large Memory Jobs
Most nodes are equipped with 16 cores (Abel) or 20 cores (Colossus) and 64 GiB of RAM, of which 61.5 GiB can be used for jobs. (The rest is used by the operating system.) There are also a couple of special hugemem nodes with 32 cores and 1 TiB memory, of which about 1006 GiB can be used for jobs. To see how many nodes there currently are in the cluster, together with their number of cores and MB RAM, use
sinfo -e -o '%D %c %m'
You specify how much memory per core your job should be allowed to use by setting the --mem-per-cpu parameter. Technically, this limit specifies the amount of resident memory + swap the job can use. If the job tries to use more than this setting, it will be killed. For instance
If the job tries to use more than 2000 MiB resident memory, it will be killed. Note that --mem-per-cpu is specified per requested core.
Also, if you need more than 61.5 GiB RAM on a single node, you must specify --partition=hugemem to get access to the nodes with more RAM. For instance
#SBATCH --ntasks-per-node=8 #SBATCH --mem-per-cpu=10G --partition=hugemem
Note: There is no need to specify how much RAM a node must have; the queue system will not allocate jobs on nodes with too little free RAM. Therefore, one should not use the --mem specification.
Accounting of Large Memory Jobs
To ensure maximal utilisation of the cluster, memory usage is accounted as well as cpu usage. Memory specifications are converted to "Processor Equivalents" (PE) using a conversion factor of approximately 4 GiB / core(*). If a job specifies more than 4 GiB RAM per task, i.e., --mem-per-cpu=M, where M > 4G, each task will count as M / 4G cores instead of 1 core. For instance, a job with --ntasks=2 --mem-per-cpu=8G will be counted as using 4 cores instead of 2.
The reason for this is that large memory jobs make the "unused" cores inaccessible to other jobs. For instance, a job on a 64 GiB node using --ntasks=1 --mem-per-cpu=61G will in practice use all cores on the node, and should be accounted as such.
Note that only jobs which specify more than 4 GiB per core will be affected by this; all other jobs will be accounted with the number of tasks specified.
(*) The exact value of the factor depends on the total amount of RAM per core in the cluster, and is currently about 4.5 GiB / core on Abel and 4.3 GiB / core on Colossus.
Abel has a few nodes with GPU accelerators. (There are no GPUs on Colossus.) Each node has two GPU cards. To get access to these nodes, specify
#SBATCH --partition=accel --gres=gpu:1
to use one card, or
#SBATCH --partition=accel --gres=gpu:2
to use both.
In addition, you must specify how many CPUs you need with --ntasks, --ntasks-per-node and/or --cpus-per-task as usual, as well as the amount of RAM the job needs.
In the job script, the environment variable CUDA_VISIBLE_DEVICES will show which GPU device(s) to use. It will have values '0', '1' or '0,1' corresponding to /dev/nvidia0, /dev/nvidia1 or both, respectively.
Here is an example script, running MrBayes on 1 CPU (the default) and 1 GPU:
#!/bin/bash # #SBATCH --job-name=YourJobName --account=YourProject #SBATCH --time=TimeLimit # ## Ask for 1 GPU #SBATCH --partition=accel --gres=gpu:1 # ## and 3000 MiB RAM #SBATCH --mem-per-cpu=3000 ## Set up job environtment: source /cluster/bin/jobsetup module purge # clear any inherited modules set -o errexit # exit on errors ## Get access to the GPU enabled MrBayes: module load mrbayes_gpu mb YourInputFile
The Lowpri QoS
On Abel, it is possible to run your program om somebody else's cores if they are not using them. To do this you specify that you want to use the lowpri qos (Quality of Service). This is done by specifying:
in your script.
If the owner of the cores you are running on wants to use them again before your job is finished, your job will be stopped and put back on the queue. Therefore, if your job is longer than a couple of hours, you should make sure it checkpoints, i.e. saves intermediate results so that it can start from where it left if it is restarted.
Note that it is not possible to run hugemem jobs, GPU jobs or Notur jobs in the lowpri QoS. There is no lowpri QoS on Colossus.
Checkpointing a job means that the job can be stopped and started somewhere else, and continues where it left off.
Long-running jobs should implement some form of checkpointing, by saving intermediate results at intervals and being able to start with the latest intermediate results if restarted. This is especially important for lowpri-jobs which can be stopped and requeued, but also for regular jobs in order to guard against node failures etc.
Automatic checkpointing is very hard to implement and the current trend is leave this to the application. Using a parallel file system and parallel IO the complete data structure can be saved to disk within a reasonable amount of time. The MTBF for very large clusters (larger than Abel, PRACE tier-0 size clusters) are low and for very large jobs checkpoint / restart need to be in place.
Useful sbatch/qlogin/srun parametres
|--account=project||Specify the project to run under. This parameter is required.|
|--begin=time||Start the job at a given time|
|--constraint=feature||Request nodes with a certain feature. Currently supported features include intel, ib, rackN. If you need more than one feature, they must be combined with & in the same --constraint specification, e.g. --constraint=ib&rack21. Note: If you try to use more than one --constraint specification, the last one will override the earlier.|
|--cpus-per-task=cores||Specify the number of cpus (actually: cores) to allocate for each task in the job. See --ntasks and --ntasks-per-node, or man sbatch.|
|--dependency=dependency list||Defer the start of this job until the specified dependencies have been satisfied. See man sbatch for details.|
|--error=file||Send 'stderr' to the specified file. Default is to send it to the same file as 'stdout'. (Note: $HOME or ~ cannot be used. Use absolute or relative paths instead.)|
|--input=file||Read 'stdin' from the specified file. (Note: $HOME or ~ cannot be used. Use absolute or relative paths instead.)|
|--job-name=jobname||Specify job name|
|--mem-per-cpu=size||Specify the memory required per allocated core. This is the normal way of specifying memory requirements (see Large Memory Jobs). size should be an integer followed by 'M' or 'G'.|
|--partition=hugemem||Run on a hugemem node (see Large Memory Jobs).|
|--nodes=nodes||Specify the number of nodes to allocate. nodes can be an integer or a range (min-max). This is often combined with --ntasks-per-node.|
|--ntasks=tasks||Specify the number of tasks (usually cores, but see --cpus-per-task) to allocate. This is the usual way to specify cores in MPI jobs.|
|--ntasks-per-node=tasks||Specify the number of tasks (usually cores, but see --cpus-per-task) to allocate within each allocated node. Often combined with --nodes.|
Send 'stdout' (and 'stderr' if not redirected with --error) to the specified file instead of slurm-%j.out.
|--qos=lowpri||Run a job in the lowpri qos (not available on Colossus)|
|--time=time||Specify a (wall clock) time limit for the job. time can be hh:mm:ss or dd-hh:mm:ss. This parameter is required.|
|--mail-type=event||Notify user by email when certain event types occur. Valid events are BEGIN, END, FAIL, REQUEUE, and ALL. (Mail notifications are not available on Colossus.)|
sbatch policy settings
A set of extra tests and policy settings has been added to the sbatch command, to make sure jobs will start and run properly. The additional tests ensure:
- Jobs that will run on only one node are not allowed to specify --constraint=ib
- All account names are lower cased
- That all job comments and constraints are passed to SLURM (per default, only the last specification will be passed on when multiple specifications exists)