Introduction to HPC clusters¶

  • ssh to connect to the cluster
  • scp to copy data from and to the cluster
  • module files to load software
  • CMake to build code and libraries
  • slurm to start jobs

Secure Shell or SSH¶

We use the command ssh to connect to the login nodes of the cluster over the internet. After authentication, we get a bash shell on the cluster.

  • ssh user@cluster
    • user is your username on the cluster and not the one on your local device
    • cluster is the hostname or address of the cluster
diehlpk@Patricks-Air ~/g/reusummer22 (main)> ssh rostam
Enter passphrase for key '/Users/diehlpk/.ssh/id_rsa_lm2':
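
The short form ssh rostam in the example above works because of a matching entry in the local ~/.ssh/config file. A minimal sketch of such an entry is shown below; the hostname and key file are placeholders and depend on your cluster and your key:

# ~/.ssh/config -- minimal sketch, hostname and key file are placeholders
Host rostam
    HostName rostam.example.edu
    User diehlpk
    IdentityFile ~/.ssh/id_ed25519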

ssh keys¶

On modern clusters, password authentication is no longer used because it is less secure. Instead, ssh keys are used.
We use the command ssh-keygen to generate two keys:

  • Public key to share with the cluster administrators
  • Private key for your own use. You need this private key instead of a password to connect to the login node.
ssh-keygen -t ed25519 -C "your_email@example.com"
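
Assuming the default file names, ssh-keygen stores the key pair in ~/.ssh/. The file ending in .pub is the public key, which is the one you share:

# print the public key so it can be shared with the cluster administrators
cat ~/.ssh/id_ed25519.pub
# the private key stays on your device and is never shared
ls ~/.ssh/id_ed25519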

Secure file copy or scp¶

To copy files from or to the cluster, we use the command scp. It works like the cp command, but copies data over the network or the internet.

Copy data from your local computer to the cluster¶

scp data.csv rostam:

Copy data from the cluster to your local computer¶

scp rostam:data_clecius.csv ./
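
To copy a whole directory, scp needs the -r flag for a recursive copy; the directory name below is only an example:

# copy the directory results/ from the cluster to the current local directory
scp -r rostam:results ./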

Module files to load software or change compilers¶

On most clusters the command module is used to change software versions or compilers.

Showing the currently loaded items¶

[diehlpk@rostam1 ~]$ module list

Currently Loaded Modules:
  1) gcc/11.2.0   2) boost/1.78.0-release   3) papi/6.0.0   4) git/2.34.1   5) python/3.10.0

Showing the available items¶

module avail
-------------------------------------------------------------------------- /opt/apps/gcc11/modulefiles 
   boost/1.76.0-debug      boost/1.77.0-debug      boost/1.78.0-debug            
   boost/1.76.0-release    mpich/3.4.2         openmpi/4.1.2

Showing all available versions for one module¶

module avail cuda

----------------------------------------------------------------------------- /opt/apps/modulefiles
   cuda/10.2 (g)    cuda/11.2 (g)    cuda/11.4 (g)    cuda/11.5 (g)    cuda/11.6 (g,D)

Loading a module¶

[diehlpk@rostam1 ~]$ module load cuda/11.6
[diehlpk@rostam1 ~]$ module list

Currently Loaded Modules:
  1) gcc/11.2.0             3) papi/6.0.0   5) python/3.10.0   7) ucx/1.12.1   9) hwloc/2.6.0  11) cuda/11.6 (g)
  2) boost/1.78.0-release   4) git/2.34.1   6) cmake/3.22.0    8) pmix/4.1.0  10) Rostam2

Unloading a module¶

module unload cuda
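
To change the compiler, you can swap one loaded module for another or clear all loaded modules and start over. The version numbers below are only examples; check module avail for what is installed on your cluster:

# replace the loaded gcc module by another version in one step
module swap gcc/11.2.0 gcc/12.1.0
# or start from a clean environment and load the tool chain again
module purge
module load gcc/12.1.0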

CMake to build libraries or software¶

CMake is an open-source, cross-platform family of tools designed to build, test, and package software. CMake is used to control the software compilation process using simple platform- and compiler-independent configuration files, and to generate native makefiles and workspaces that can be used in the compiler environment of your choice.

  • CMake is used by 55% of open-source projects as their build system.
  • Many academic codes use CMake to be built, e.g. LAMMPS, GROMACS, astrophysics codes, ...

Building existing projects¶

git clone https://github.com/ModernCPPBook/Examples.git
cd Examples
mkdir build
cd build
cmake ..
make
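
Options can be passed to cmake with -D at the configure step. For example, an optimized build or a specific compiler can be requested; the compiler name below is only an example and usually comes from the loaded module:

# configure an optimized (Release) build with a chosen C++ compiler
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=g++ ..
make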

Generating your own project¶

In the root folder of your project, you need to provide the file CMakeLists.txt with the build instructions.

Here is one example to build example.cpp from the C++ lecture:

cmake_minimum_required(VERSION 3.0)
project(regression)

set(CMAKE_CXX_STANDARD 17)
add_executable(regression example.cpp)
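
With this CMakeLists.txt and example.cpp in the project folder, the project is configured and built the same way as the cloned example above and produces the executable regression:

mkdir build
cd build
cmake ..
make
./regression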

Running jobs on clusters¶

A common workload manager for clusters is the Simple Linux Utility for Resource Management, or Slurm. Slurm distributes the jobs from users fairly so that all users of the resources get some share of the computational time. Depending on the cluster, there are other options; however, the LSU clusters use Slurm, so we will look into this tool.

Terminology:

  • Job: A file submitted by the user with some information, e.g. requested time, memory usage, and number of nodes.
  • Node: One computer within the cluster. The largest cluster has 178k nodes.
  • Queue: Wait list to which jobs are submitted.
  • Accounting: Number of node hours a user receives with the allocation.

Checking queue information¶

The command sinfo gives us information about the available queues where we can submit jobs.

[diehlpk@rostam1 build]$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
medusa*            up 3-00:00:00      1  down$ medusa15
medusa*            up 3-00:00:00     15   idle medusa[00-14]
buran              up 3-00:00:00     16   idle buran[00-15]
cuda               up 3-00:00:00      2  maint toranj[0-1]
cuda               up 3-00:00:00      3   idle bahram,diablo,geev
cuda-V100          up 3-00:00:00      2   idle diablo,geev
cuda-A100          up 3-00:00:00      2  maint toranj[0-1]
cuda-K80           up 3-00:00:00      1   idle bahram

Checking job status¶

The command squeue is used to check the submitted jobs on the cluster.

[diehlpk@rostam1 build]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             25853 cuda-A100 Octo-Tig  jenkins PD       0:00      1 (ReqNodeNotAvail)
             25758 jenkins-a jenkins-  jenkins PD       0:00      1 (Resources)
             25766 jenkins-a jenkins-  jenkins PD       0:00      1 (Priority)
             25777 jenkins-a jenkins-  jenkins PD       0:00      1 (Priority)
             25785 jenkins-a jenkins-  jenkins PD       0:00      1 (Priority)
             25796 jenkins-a jenkins-  jenkins PD       0:00      1 (Priority)

Checking the job status of your own jobs¶

[diehlpk@rostam1 build]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             24932    marvin void.sba  diehlpk  R 6-02:12:23      1 marvin01
             24933    marvin circle.s  diehlpk  R 6-02:08:11      1 marvin00

Canceling a job¶

First, check for the JOBID, which is needed to cancel the running or pending job:

[diehlpk@rostam1 build]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             24932    marvin void.sba  diehlpk  R 6-02:12:23      1 marvin01
             24933    marvin circle.s  diehlpk  R 6-02:08:11      1 marvin00

Now we can cancel the job using the command scancel:

scancel 24932
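
To cancel all of your own jobs at once, scancel also accepts a user name:

scancel -u $USER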

Submitting an interactive job¶

For debugging purposes, we want to get access to a terminal on a node and run our code there or use bash commands.

Here, we use the command srun to access the terminal:

srun -p medusa -N 1 --pty /bin/bash -l
srun: job 25857 queued and waiting for resources
srun: job 25857 has been allocated resources
[diehlpk@medusa00 build]$ hostname
medusa00.*.lsu.edu
exit
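
Further resources can be requested on the srun command line, for example a time limit with -t or a number of tasks with -n; the values below are only an example:

# one task on one medusa node for at most one hour, with an interactive shell
srun -p medusa -N 1 -n 1 -t 01:00:00 --pty /bin/bash -l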

Submitting a long running job¶

We need to generate a file regression.sh, which is a bash script with additional directives for the requested resources. In this case, I want to use one node (-N 1) in the medusa partition (-p medusa) for 72 hours to run my simulation code.

#!/usr/bin/env bash
#SBATCH -o fine-1.out
#SBATCH -e fine-1.err
#SBATCH -t 72:00:00
#SBATCH -p medusa
#SBATCH -N 1
#SBATCH --mail-user=pdiehl@cct.lsu.edu
#SBATCH -D /home/diehlpk/Simulations/PUM2/rectcrack

Below the configuration, all modules used during compilation are loaded. After that, any bash commands can be used, and lastly the application is executed using the command srun.

module load hpx vtk 
i=4

srun PeriHPX -i input-delta-${i}-98.yaml > log${i}-98.txt

To submit the job to the medusa queue, the command sbatch is used.

sbatch regression.sh
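
After the submission, the job appears in the queue and can be monitored with squeue as shown above:

squeue -u $USER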