How to Create Job Scripts with R, Python, and Bash
In this tutorial we will write a job submission script for SLURM. We assume that you have an account on SOMHPC or another campus HPC system running SLURM, and that you understand the job submission procedure introduced in the SLURM User Guide.
Scenario
An HPC cluster with a job manager such as SLURM is a great way to scale up data analysis and other large computation work! In this tutorial, we will walk through a very simple way to do this:
Write an executable script in R / Python
Organize your inputs, output location, and scripts.
Loop over some set of variables and submit a SLURM job to use your executable to process each one.
We will cover each of these steps in detail.
Write an Executable Script
You likely start with some script in R or Python that reads in data, processes it, and creates a result. You will need to turn this script into an executable, meaning that it accepts variable arguments.
Using R
A simple approach is to retrieve your script inputs with just a few lines:
args = commandArgs(TRUE)
input1 = args[1]
input2 = args[2]
input3 = args[3]
# Your code here!

If I saved this in a script called "run.R", I could then execute:
$ Rscript run.R bird1 bird2 bird3

and input1 would be assigned "bird1", with "bird2" and "bird3" going to input2 and input3, respectively.
Using Python
In Python, we use the sys module. The same would look like this:
import sys

input1 = sys.argv[1]
input2 = sys.argv[2]
input3 = sys.argv[3]
# Your code here!

Calling would then look like:
$ python run.py bird1 bird2 bird3

sys.argv is actually just a list containing the name of the calling script followed by the input arguments. Python starts indexing at 0, and we are skipping over the value at sys.argv[0], which corresponds to the name of the script itself.
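The sys.argv approach above is all this tutorial needs. If you later want named arguments, default values, and automatic help text, Python's standard argparse module is one alternative; here is a minimal sketch, with the argument names simply mirroring the example above:

import argparse

# Hypothetical variant of run.py that uses argparse instead of raw sys.argv
parser = argparse.ArgumentParser(description="Process three inputs")
parser.add_argument("input1", help="first input, e.g. bird1")
parser.add_argument("input2", help="second input, e.g. bird2")
parser.add_argument("input3", help="third input, e.g. bird3")
args = parser.parse_args()

# Your code here!
print(args.input1, args.input2, args.input3)

It is called the same way ($ python run.py bird1 bird2 bird3), and you get $ python run.py --help for free.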
A Little About Executables
When you write your executable, it's good practice not to hard-code any variables. For example, if my script is going to process an input file, I could take in just a subject identifier and then define the full path inside the script, but if that location ever changes, the script breaks. Instead, assign this path duty to the calling script. This means that your executable should expect a general input_path:
## R Example: DO THIS
input_path = args[3]
if (!file.exists(input_path)) {
    cat('Cannot find', input_path, 'exiting!\n')
    stop()
}

## Python Example: DO THIS
import os
import sys

input_path = sys.argv[3]
if not os.path.exists(input_path):
    print("Ruhroh, %s doesn't exist!" % input_path)
    sys.exit(1)

Notice that for both, as a sanity check, we verify that input_path exists.
Loop Submission Using Your Executable
You then want to loop over some set of input variables (for example, CSV files with data). You can imagine doing this on your own computer, where each of the inputs would be processed in serial. That can take many hours if you have a few hundred inputs that each take 10 minutes, and it's completely infeasible if you have thousands of simulations, each of which might need 30 minutes to an hour.
Strategy 1: Submit a Job File
A submission script looks like this:
#!/bin/bash
#SBATCH --job-name=MyHPC.job
#SBATCH --output=.out/MyHPC.out
#SBATCH --error=.out/MyHPC.err
#SBATCH --time=2-00:00
#SBATCH --mem=12000
#SBATCH --qos=debug
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@vcu.edu

Rscript $HOME/myHPCJOBS/run.R bird1 bird2 bird3

Importantly, notice the last line! It's just a single line that calls our script to run our job. In fact, look at the entire file, and the interpreter at the top (#!/bin/bash): it's just a bash script! The only thing that makes it a little different is all of the #SBATCH lines. What's that about? This is the syntax that SLURM understands as configuration arguments for your job, and it corresponds to the options you would otherwise pass on the command line when submitting with sbatch. You are free to write whatever lines you need after the #SBATCH lines. You can expect that the node running the job will have the same environment that you have on a login node. This means it will source your bash profile and know the locations of your $HOME and $SCRATCH. It also means that you can run commands like module load if you need special software for the job.
What do all the different variables mean?
Some of the above are obvious: --time in the format above means 2 days, and --output and --error are file paths to write standard output and error to. Note that --mem is specified in megabytes, so 12000 here is roughly 12 GB. For full descriptions of all the options, the best source is the man pages (the Linux manual), which you can read via:
$ man sbatch

If you just had the one job file above, let's say it were called MyHPC.job, you would submit it to SLURM like this:
sbatch MyHPC.job

If it falls within the node limits and accounting, you will see that the job is submitted. If you need a helper tool to generate a single template, check out the Job Maker that I put together a few years ago.
Strategy 2: Submit Directly to sbatch
What if you had the same Rscript command and didn't want to create a job file? You can do the exact same submission using sbatch directly:
sbatch --job-name=MyHPC.job \
       --output=.out/MyHPC.out \
       --error=.out/MyHPC.err \
       --time=2-00:00 \
       --mem=12000 \
       --qos=debug \
       --wrap="Rscript $HOME/myHPCJOBS/run.R bird1 bird2 bird3"

Here --wrap tells sbatch to wrap the command string in a simple batch script for you. And then, of course, you would need to reproduce that entire command to run the job again.
Write a Loop Submission Script
Here I will show you a very basic example, in bash with a Python variant sketched afterward, of looping through a set of input variables (the subject identifier used to derive an input path) to generate job files, and then submitting them. We can assume the following:
the number of jobs to submit is within our accounting limits, so we will submit them all at once (meaning they get added to the queue).
at the start of the script, you check for the existence of directories. Usually you will need to create a top-level or subject-level directory somewhere in the loop if it doesn't already exist.
you have permission to write where you need to write. This not only means that you have write permission, but also that, if you are writing to a shared space, others will be able to write there too.
Bash Submission
Bash scripting is the most general solution, and we will start with it. Here is a basic template to generate the SBATCH script above. Let’s call this script run_jobs.sh
#!/bin/bash
# We assume running this from the script directory
job_directory=$PWD/.job
data_dir="${SCRATCH}/project/birdsfly"

birds=("bird1" "bird2")

for bird in ${birds[@]}; do

    job_file="${job_directory}/${bird}.job"

    echo "#!/bin/bash
#SBATCH --job-name=${bird}.job
#SBATCH --output=.out/${bird}.out
#SBATCH --error=.out/${bird}.err
#SBATCH --time=2-00:00
#SBATCH --mem=12000
#SBATCH --qos=debug
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@vcu.edu
Rscript $HOME/birdfly/run.R ${bird} A B" > $job_file

    sbatch $job_file

done

Notice that we are echoing the job into $job_file, and that we launch the newly created job with the last line in the loop, sbatch $job_file.
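If you prefer to drive the loop from Python instead of bash, the same idea looks like the sketch below. This is a minimal, hypothetical variant: the paths (the .job and .out directories, $HOME/birdfly/run.R), the bird list, and the Rscript arguments simply mirror the bash template above, and subprocess is used to call sbatch.

#!/usr/bin/env python
# submit_jobs.py: a minimal sketch of the bash loop above, written in Python.
# Paths, the bird list, and the Rscript arguments are placeholders to adapt.
import os
import subprocess

job_directory = os.path.join(os.getcwd(), ".job")
home = os.environ["HOME"]
user = os.environ["USER"]

# Make sure the job and log directories exist
os.makedirs(job_directory, exist_ok=True)
os.makedirs(".out", exist_ok=True)

birds = ["bird1", "bird2"]

for bird in birds:
    job_file = os.path.join(job_directory, "%s.job" % bird)

    # Write the SLURM submission script for this input
    with open(job_file, "w") as fh:
        fh.write("#!/bin/bash\n")
        fh.write("#SBATCH --job-name=%s.job\n" % bird)
        fh.write("#SBATCH --output=.out/%s.out\n" % bird)
        fh.write("#SBATCH --error=.out/%s.err\n" % bird)
        fh.write("#SBATCH --time=2-00:00\n")
        fh.write("#SBATCH --mem=12000\n")
        fh.write("#SBATCH --qos=debug\n")
        fh.write("#SBATCH --mail-type=ALL\n")
        fh.write("#SBATCH --mail-user=%s@vcu.edu\n" % user)
        fh.write("Rscript %s/birdfly/run.R %s A B\n" % (home, bird))

    # Submit the newly written job file
    subprocess.run(["sbatch", job_file])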
Good Practices
Finally, here are a few good job submission practices.
Always use full paths. For the scripts themselves, it can be reasonable to use relative paths, because they are run in the context of their own location. However, for data and other files, you should always use full paths (a short sketch of building such paths follows this list).
Don't run large computation on the login nodes! It negatively impacts all cluster users. Grab a development node with sdev.
Think about how much memory you actually need. You want to set a bit of an upper bound so a spike doesn't kill the job, but you also don't want to waste resources that you (or someone else) could be using for more jobs.
You should generally not run anything massive before it is tested. This means that after you write your loop script, you might step through it manually, submit the job, ensure that it runs successfully, and inspect the output.
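As an example of the full-paths point above, here is a small, hypothetical Python pattern: the calling script passes a base directory and a subject identifier, and the executable builds absolute paths from them instead of hard-coding anything (the data.csv and results names are made up for illustration):

import os
import sys

# Hypothetical executable: the caller passes a base directory and a subject ID
base_dir = sys.argv[1]   # e.g. $SCRATCH/project/birdsfly, already expanded by the shell
subject = sys.argv[2]    # e.g. bird1

# Build absolute paths once, rather than hard-coding them inside the script
input_path = os.path.abspath(os.path.join(base_dir, subject, "data.csv"))
output_dir = os.path.abspath(os.path.join(base_dir, "results", subject))

# Create the subject-level output directory if it doesn't already exist
os.makedirs(output_dir, exist_ok=True)

print("Reading from", input_path)
print("Writing to", output_dir)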
And as a reminder, here are some useful SLURM commands for checking your job.
# Show the overall status of each partition
sinfo

# Submit a job
sbatch .jobs/jobFile.job

# See the entire job queue
squeue

# See only jobs for a given user
squeue -u username

# Count the number of running / queued jobs
squeue -u username | wc -l

# Get estimated start times for your jobs
squeue --start -u username

# Show the status of a currently running job
sstat -j jobID

# Show the final status of a finished job
sacct -j jobID

# Kill a job with ID $PID
scancel $PID

# Kill ALL jobs for a user
scancel -u username

# Kill all pending jobs for a user
scancel -u username --state=pending

# Run an interactive node with 12 tasks and 64 GB memory on 1 node for 4 hours
srun -n 12 -N 1 --mem=64000 --time 4:0:0 --pty bash

# Claim an interactive node for exclusive use, 8 hours
srun --exclusive --time 8:0:0 --pty bash

# Same as above, but with X11 forwarding
srun --exclusive --time 8:0:0 --x11 --pty bash

# Same as above, but with priority over your other jobs
srun --nice=9999 --exclusive --time 8:0:0 --x11 --pty -p dev -t 12:00 bash

# Check utilization of the group allocation
sacct

# Run jobs in the group allocation
srun -p groupid
sbatch -p groupid

# Stop/restart jobs interactively
# To stop:
scancel -s SIGSTOP jobID
# To restart (this won't free up memory):
scancel -s SIGCONT jobID

# Get usage for file systems
df -k
df -h $HOME

# Get usage for your home directory
du

# Count files
find . -type f | wc -l