Long-running jobs & job stringing

Most partitions are limited to a wall-clock time of 5 days, which means that jobs that need longer than 5 days cannot be submitted. If you need to run a job for longer than that, there are a few options:

use more CPU cores, which may make your job run faster and finish in less time;
job stringing (described here); or
contact the system administrators to get special permission.

Job stringing means splitting a long-running job (> 5 days) into multiple smaller jobs (subjobs, < 5 days each) that are strung together, i.e. run one after another. Each subjob performs a part of the calculation.

To allow for job stringing, the software you're using must support the following:

It must support checkpointing or snapshotting. This allows one subjob to pick up where the previous left off.
It must run for a fixed number of iterations (or for a fixed amount of time), or run until it is killed.

Guide

The idea is to submit an array job, where each element of the array is a subjob. It must be submitted so that:

The number of elements is equal to or larger than the expected number of subjobs; and
At most one element (subjob) can run at a time.

Then, if the calculation is completely finished, i.e. the remaining/unused subjobs are not needed anymore, the job script cancels the array job. (If the number of array elements equals the expected number of subjobs, this is not needed, because the array job's end and the calculation's end will coincide.)

A rough sketch of the job script is as follows:

Run the calculation.
Check if the calculation is completely finished.
If it is, cancel the array job.

Again, steps #2 and #3 are not needed if the number of array elements equals the number of subjobs.
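The three steps above can be sketched roughly as follows. This is only an illustration: run_calculation, calculation_finished and the marker file job-stringing.done are hypothetical placeholders, not part of any real software. How you detect that the calculation is finished depends on your software.

```shell
#!/bin/bash
# Hypothetical marker file that the calculation is assumed to create
# once it is completely finished.
done_marker=${done_marker:-job-stringing.done}

run_calculation() {
    # Placeholder: start (or resume from a checkpoint) the real
    # calculation here.
    :
}

calculation_finished() {
    # Placeholder check: succeeds only if the marker file exists.
    [ -e "$done_marker" ]
}

run_subjob() {
    # Step 1: run the calculation.
    run_calculation

    # Steps 2 and 3: if the calculation is completely finished, cancel
    # the array job, which removes the remaining (unused) subjobs.
    if calculation_finished; then
        scancel "$SLURM_ARRAY_JOB_ID"
    fi
}

run_subjob
```

The two cases below fill in these placeholders in different ways.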

Case #1: software with limited no. of iterations (or limited running time)

Suppose that:

we have a wall-clock time limit of 1 minute;
we have a calculation of which each iteration takes 25 seconds;
the total number of iterations to be done is 8; and
our software can run a limited number of iterations per subjob (or for a limited amount of time).

Since 2 iterations take 50 seconds, which is slightly less than 1 minute, we can fit 2 iterations in one subjob and still have some slack. It is important to have some slack, for example for I/O and for starting and stopping the job. We decide to run our calculation as follows:

2 iterations per subjob; and
4 subjobs (thus 4 array elements).
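The arithmetic behind these numbers can be written out as a small shell sketch. The variable names are made up for this illustration, and the values are the ones assumed in this example.

```shell
time_limit=60        # wall-clock limit per subjob, in seconds
iteration_time=25    # duration of one iteration, in seconds
total_iterations=8   # number of iterations needed in total

# Integer division rounds down, so these iterations fit in the limit.
iterations_per_subjob=$(( time_limit / iteration_time ))

# Ceiling division: round up so that a last, partial subjob is counted.
subjobs=$(( (total_iterations + iterations_per_subjob - 1) / iterations_per_subjob ))

# Time left over per subjob for I/O and for starting and stopping the job.
slack=$(( time_limit - iterations_per_subjob * iteration_time ))

echo "iterations per subjob: $iterations_per_subjob"
echo "subjobs (array elements): $subjobs"
echo "slack per subjob: $slack s"
```

If the slack comes out too small for your own calculation, use one iteration fewer per subjob and recompute the number of subjobs.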

The job script would then be:

job-stringing-limited.job

#!/bin/bash

# Each subjob runs for at most 1 minute.  In 1 minute, we will run
# exactly 2 iterations of our calculation.
#SBATCH --time=1

# Each subjob typically does 2 iterations, and we want 8 iterations, so
# we need 4 subjobs.  The array here has 4 elements (1 to 4 inclusive),
# of which at most 1 can run at a time (indicated by "%1").
#SBATCH --array=1-4%1

# Put the job ID in the checkpoint file name.
checkpoint_file=job-stringing-$SLURM_ARRAY_JOB_ID.ckp

# -N 8 -- number of iterations total
# -n 2 -- number of iterations in this subjob
# -c   -- path to checkpoint file
bash job-stringing-calculation.sh -N 8 -n 2 -c "$checkpoint_file"

# The job script always reaches this point.  At this point, we can be
# sure that:
#
# * we've done another 2 iterations of our calculation; and
#
# * maybe we've reached iteration #8, and also the end of the
#   array job.

Case #2: software with unlimited no. of iterations

Suppose that:

we have a wall-clock time limit of 1 minute;
we have a calculation of which each iteration takes 25 seconds;
the total number of iterations to be done is 8; and
our software cannot run a limited number of iterations; it runs until completely finished or until it gets killed.

Unlike in case #1, our software doesn't stop after 2 iterations. Instead it will start a 3rd iteration, which may finish if Slurm doesn't kill the subjob in time. (Slurm is not exact down to the second.) As a result, we cannot really predict how many subjobs we'll need, so we'll need to check if our calculation is completely finished.

Note that if our calculation is not finished, Slurm will kill the subjob half-way, i.e. it will kill it at the bash job-stringing-calculation.sh line. The rest of the job script will not be executed. However, if the calculation is finished, bash job-stringing-calculation.sh will be done, and the job script will proceed to the next line. The next line can then tell Slurm to cancel all remaining subjobs.

The job script:

job-stringing-unlimited.job

#!/bin/bash

# Each subjob runs for at most 1 minute.  In 1 minute, we will run
# at minimum 2 iterations of our calculation.
#SBATCH --time=1

# Each subjob does at minimum 2 iterations, and we want 8 iterations, so
# we need at most 4 subjobs.  The array here has 4 elements (1 to 4
# inclusive), of which at most 1 can run at a time (indicated by "%1").
#SBATCH --array=1-4%1

# Put the job ID in the checkpoint file name.
checkpoint_file=job-stringing-$SLURM_ARRAY_JOB_ID.ckp

# -N 8 -- number of iterations total
# -c   -- path to checkpoint file
bash job-stringing-calculation.sh -N 8 -c "$checkpoint_file"

# The job script only reaches this point if the calculation is
# completely finished.  If that happens, we can cancel the remaining
# subjobs.
scancel "$SLURM_ARRAY_JOB_ID"

Alternatives

The problem that cases #1 and #2 above try to solve is determining whether or not the calculation is finished. In case #1 we could calculate exactly how many subjobs we need, so we didn't need to check. In case #2 we couldn't, so instead we relied on detecting whether the calculation was finished and, if so, cancelled the remaining subjobs. In particular, we relied on whether or not the subjob was killed for exceeding its time limit.

There are other ways to detect if the calculation is finished:

It may be possible to read the checkpoint file, e.g. to see if a calculation converged.
It may be possible to read a log file or so, to see if the calculation finished.
Maybe the software exits with a different exit code if it is finished.

These are more advanced methods that are not detailed here.
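Still, as a rough illustration of the first option, the sketch below reads a checkpoint file to decide whether the array job can be cancelled. It assumes the checkpoint file holds a single integer, the number of the last completed iteration. That format is an assumption made for this sketch, and your software's checkpoint files will most likely look different. The function names checkpoint_finished and cancel_if_finished are hypothetical.

```shell
# Assumption: the checkpoint file contains one integer, the number of
# the last completed iteration.

checkpoint_finished() {
    # $1: path to the checkpoint file
    # $2: total number of iterations
    [ -s "$1" ] || return 1            # no (or empty) checkpoint yet
    last_iteration=$(cat "$1")
    # A non-numeric value makes the test fail, which we treat as
    # "not finished".
    [ "$last_iteration" -ge "$2" ] 2>/dev/null
}

cancel_if_finished() {
    # Cancel the array job (and thus all remaining subjobs) if the
    # checkpoint says that all iterations are done.
    if checkpoint_finished "$1" "$2"; then
        scancel "$SLURM_ARRAY_JOB_ID"
    fi
}

# Usage inside a job script, with the values from the examples above:
#   cancel_if_finished "$checkpoint_file" 8
```

A check like this could be placed after the calculation, or at the start of each subjob before the calculation is resumed, so that it also works when the previous subjob was killed by Slurm.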

Example calculation

The above examples rely on a calculation job-stringing-calculation.sh, which can be downloaded here. It isn't really a calculation: in each iteration it just sleeps for 25 seconds, and does so for at most max_iterations iterations.

The code does the following:

At the start it reads the last iteration_counter from the checkpoint file.
It makes sure the iteration_counter is less than max_iterations, and increments it by 1.
It makes sure the sub_iteration number is not 0, and decrements it by 1.
It sleeps 25 seconds. (This is our "work".)
It saves the current iteration_counter to the checkpoint file.
It goes back to step #2.

Steps #1 and #5 are for checkpointing. Steps #2 and #6 are really just a for loop. The for loop is written as a while-true loop, so that it can start at any given index, which is needed for checkpointing.
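The steps above could be sketched as the shell function below. This is not the downloadable job-stringing-calculation.sh, only a simplified reconstruction from the description. The function name run_iterations, its positional arguments, the sleep_seconds variable, and the convention that a negative sub_iteration means "no limit" are all assumptions made for this sketch.

```shell
# Duration of one iteration of "work", in seconds.
sleep_seconds=${sleep_seconds:-25}

run_iterations() {
    max_iterations=$1    # total number of iterations (cf. -N)
    sub_iteration=$2     # iterations left in this subjob (cf. -n);
                         # negative means no limit (assumption)
    checkpoint_file=$3   # path to the checkpoint file (cf. -c)

    # Step 1: read the last iteration_counter, or start from 0.
    iteration_counter=0
    if [ -s "$checkpoint_file" ]; then
        iteration_counter=$(cat "$checkpoint_file")
    fi

    while true; do
        # Step 2: stop if all iterations are done, else go to the next.
        [ "$iteration_counter" -lt "$max_iterations" ] || break
        iteration_counter=$(( iteration_counter + 1 ))

        # Step 3: stop if this subjob has used up its iterations.
        [ "$sub_iteration" -ne 0 ] || break
        sub_iteration=$(( sub_iteration - 1 ))

        # Step 4: the "work".
        sleep "$sleep_seconds"

        # Step 5: save the progress to the checkpoint file.
        echo "$iteration_counter" > "$checkpoint_file"
    done   # Step 6: go back to step 2.
}

# Usage, with the values from case #1:
#   run_iterations 8 2 job-stringing.ckp
```

Because the counter is only written to the checkpoint file after the work of an iteration is done, an iteration that is interrupted by Slurm is simply repeated by the next subjob.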