We use the command ssh
to connect to the login nodes of the cluster over the internet. After authentication, we get a bash shell on the cluster.
diehlpk@Patricks-Air ~/g/reusummer22 (main)> ssh rostam
Enter passphrase for key '/Users/diehlpk/.ssh/id_rsa_lm2':
On modern clusters, password authentication is usually disabled because it is less secure than key-based authentication. Instead, ssh keys are used.
We use the command ssh-keygen
to generate a key pair (a private key and a public key):
ssh-keygen -t ed25519 -C "your_email@example.com"
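The short host name rostam used in the example above works because of an alias in the ssh client configuration. A minimal entry in ~/.ssh/config might look like the following sketch; the host name, user name, and key path are placeholders and must be adjusted to your cluster and account:

```
# ~/.ssh/config -- host name, user name, and key path are placeholders
Host rostam
    HostName login.your-cluster.example.edu
    User your_username
    IdentityFile ~/.ssh/id_ed25519
```

With this entry in place, ssh rostam and scp data.csv rostam: work without typing the full host name each time.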
To copy files to or from the cluster, we use the command scp,
which works like the cp
command, but copies data over the network.
scp data.csv rostam:             # copy data.csv to the home directory on the cluster
scp rostam:data_clecius.csv ./   # copy data_clecius.csv from the cluster to the current directory
On most clusters, the command module
is used to switch between software versions and compilers.
[diehlpk@rostam1 ~]$ module list
Currently Loaded Modules:
1) gcc/11.2.0 2) boost/1.78.0-release 3) papi/6.0.0 4) git/2.34.1 5) python/3.10.0
module avail
-------------------------------------------------------------------------- /opt/apps/gcc11/modulefiles
boost/1.76.0-debug boost/1.77.0-debug boost/1.78.0-debug
boost/1.76.0-release mpich/3.4.2 openmpi/4.1.2
module avail cuda
----------------------------------------------------------------------------- /opt/apps/modulefiles
cuda/10.2 (g) cuda/11.2 (g) cuda/11.4 (g) cuda/11.5 (g) cuda/11.6 (g,D)
[diehlpk@rostam1 ~]$ module load cuda/11.6
[diehlpk@rostam1 ~]$ module list
Currently Loaded Modules:
1) gcc/11.2.0 3) papi/6.0.0 5) python/3.10.0 7) ucx/1.12.1 9) hwloc/2.6.0 11) cuda/11.6 (g)
2) boost/1.78.0-release 4) git/2.34.1 6) cmake/3.22.0 8) pmix/4.1.0 10) Rostam2
module unload cuda
CMake is an open-source, cross-platform family of tools designed to build, test, and package software. CMake controls the software compilation process using simple platform- and compiler-independent configuration files and generates native makefiles and workspaces that can be used in the compiler environment of your choice.
git clone https://github.com/ModernCPPBook/Examples.git
cd Examples
mkdir build
cd build
cmake ..
make
In the root folder of your project, we need to provide the file CMakeLists.txt
with the build instructions.
Here is one example that builds the file example.cpp
from the C++ lecture:
cmake_minimum_required(VERSION 3.0)
project(regression)
set(CMAKE_CXX_STANDARD 17)
add_executable(regression example.cpp)
A common workload manager for clusters is the Simple Linux Utility for Resource Management, or Slurm. Slurm distributes jobs fairly among users, so all users of the resources get a share of the computational time. Depending on the cluster, other workload managers are available; however, the LSU clusters use Slurm, so we will look into this tool.
Terminology:
Node: a single machine in the cluster with its own CPUs, memory, and possibly GPUs.
Partition (queue): a group of nodes to which jobs can be submitted.
Job: an allocation of resources (nodes and time) granted to a user to run a program.
The command sinfo
gives us information about the available partitions (queues) to which we can submit jobs:
[diehlpk@rostam1 build]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
medusa* up 3-00:00:00 1 down$ medusa15
medusa* up 3-00:00:00 15 idle medusa[00-14]
buran up 3-00:00:00 16 idle buran[00-15]
cuda up 3-00:00:00 2 maint toranj[0-1]
cuda up 3-00:00:00 3 idle bahram,diablo,geev
cuda-V100 up 3-00:00:00 2 idle diablo,geev
cuda-A100 up 3-00:00:00 2 maint toranj[0-1]
cuda-K80 up 3-00:00:00 1 idle bahram
The command squeue
is used to check the submitted jobs on the cluster.
[diehlpk@rostam1 build]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25853 cuda-A100 Octo-Tig jenkins PD 0:00 1 (ReqNodeNotAvail)
25758 jenkins-a jenkins- jenkins PD 0:00 1 (Resources)
25766 jenkins-a jenkins- jenkins PD 0:00 1 (Priority)
25777 jenkins-a jenkins- jenkins PD 0:00 1 (Priority)
25785 jenkins-a jenkins- jenkins PD 0:00 1 (Priority)
25796 jenkins-a jenkins- jenkins PD 0:00 1 (Priority)
[diehlpk@rostam1 build]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24932 marvin void.sba diehlpk R 6-02:12:23 1 marvin01
24933 marvin circle.s diehlpk R 6-02:08:11 1 marvin00
We look up the JOBID,
which is needed to cancel a running or pending job:
[diehlpk@rostam1 build]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24932 marvin void.sba diehlpk R 6-02:12:23 1 marvin01
24933 marvin circle.s diehlpk R 6-02:08:11 1 marvin00
Now we can cancel the job by using the command scancel
scancel 24932
For debugging purposes, we want to get access to the terminal of a node and run our code or bash commands there.
Here, we use the command srun
to access the terminal:
srun -p medusa -N 1 --pty /bin/bash -l
srun: job 25857 queued and waiting for resources
srun: job 25857 has been allocated resources
[diehlpk@medusa00 build]$ hostname
medusa00.*.lsu.edu
exit
We need to generate a file regression.sh,
which is a bash script with additional directives for requesting
resources. In this case, I want to use one node (-N 1) of the medusa partition (-p medusa) for 72 hours (-t 72:00:00) to run my simulation code.
#!/usr/bin/env bash
#SBATCH -o fine-1.out
#SBATCH -e fine-1.err
#SBATCH -t 72:00:00
#SBATCH -p medusa
#SBATCH -N 1
#SBATCH --mail-user=pdiehl@cct.lsu.edu
#SBATCH -D /home/diehlpk/Simulations/PUM2/rectcrack
Below the configuration, all modules used during compilation are loaded. After that, any bash commands can be used, and lastly the application is executed using the command srun:
module load hpx vtk
i=4
srun PeriHPX -i input-delta-${i}-98.yaml > log${i}-98.txt
To submit the job to the medusa
queue, the command sbatch
is used:
sbatch regression.sh