Slurm Batch System

Slurm User Interfaces (UIs)

Slurm users need to login one of the following user interface nodes to use the slurm batch job system.

User Interface Nodes

OS

Purpose

slurm-ui01.twgrid.org

CentOS 7

Job submission, File download/upload

slurm-ui02.twgrid.org

CentOS 7

Job submission, File download/upload

Note

  • The resources of the user interface node is limited, please don’t run your jobs in the user interfaces, or your jobs will be killed without notice.

  • To avoid ssh password guessing attack, IP addresses with multiple login failures in a short time will be banned for hours.

  • Installed GUI software in UIs for users to preview files. Please use, for example:

    ssh -XY <your_account>@slurm-ui01.twgrid.org
    

    to login in the UI with X11 forwarding enabled. If you are using Windows™ system, please install and execute Xming before connect to the UI. If you are using MacOS™, you will probably install xquartz to have the capability of X11 in the MacOS™.

    Software

    Filetype

    Type

    vim | nano

    text editor

    CLI

    eog

    images

    GUI

    xpdf

    pdf files

    GUI

Slurm Resources and Queues

Slurm Resources 2022-06-27

Cluster

Worker Nodes

Total CPU cores

CPU/node

CPU model

Memory/node

Disk space/node

Network

GPU model

GPU/node

HPC_FDR5

92

2208

24

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

125GB

2TB (System: 400GB)

10GbE

N/A

N/A

HPC_HDR1

2

768

128

AMD EPYC 7662 64-Core Processor

1520GB

1TB (System: 20GB)

100GbE

N/A

N/A

GPU_V100

1

48

48

Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz

768GB

1TB (System: 20GB)

10GbE

V100

8

GPU_A100

1

64

64

AMD EPYC 7302 16-Core Processor

1024GB

1TB (System: 20GB)

100GbE

A100

8

Slurm Partitions 2022-06-27

Partition

Timelimit

CPU Cores

GPU Boards

Nodes

Resource

large

14-00:00:0

840

N/A

42

QDR4

long_serial

14-00:00:0

100

N/A

5

QDR4

short

3-00:00:0

1000

N/A

50

QDR4

development

1:00:0

20

N/A

1

QDR4

a100

5-00:00:0

64

8*A100

50

A100

v100

5-00:00:0

48

8*V100

50

V100

amd

5-00:00:0

768

N/A

6

HDR1

Note

The resources are shared with different queues, so some of the resources are mutually exclusive with different queues.

System Topography

  • The system scheme could be found in the following image. The network connection is majorly in 10G ethernet.

    _images/slurm_scheme_v4.png

Slurm Software

environment-modules

You could use use environment-modules (in module command) for easy setup of your environment with our predefined configurations. Users could find predefined module by:

module avail

Load MPICH2 + gcc48:

module load gcc/4.8.5
module load mpich

Unload all loaded modules:

module purge

Load Openmpi + Intel2018:

module load intel/2018
module load openmpi

Load OpenMPI + gcc48:

module load gcc/4.8.5
module load openmpi

Note

module software tree (version: 20211130a)

── Compiler
│   ├── gcc
│   │   ├── 10.3.0
│   │   ├── 11.1.0
│   │   ├── 4.6.2
│   │   ├── 4.8.1
│   │   ├── 4.8.5
│   │   │   ├── mpich
│   │   │   │   └── 3.4.1
│   │   │   ├── mvapich2
│   │   │   │   └── 2.3.5
│   │   │   └── openmpi
│   │   │       ├── 2.1.6
│   │   │       └── 4.1.0
│   │   └── 9.3.0
│   └── intel
│       ├── 2017
│       ├── 2018
│       │   ├── mpich
│       │   │   └── 3.4.1
│       │   ├── mvapich2
│       │   │   └── 2.3.5
│       │   └── openmpi
│       │       ├── 2.1.6
│       │       └── 4.1.0
│       └── 2020
│           ├── lammaps
│           │   └── jct
│           │       └── 3Mar2020
│           ├── lammps
│           │   └── jct
│           │       └── 3Mar2020
│           ├── mpich
│           │   └── 3.4.1
│           ├── mvapich2
│           │   └── 2.3.5
│           └── openmpi
│               ├── 2.1.6
│               ├── 3.1.6
│               └── 4.1.0
├── CompilerMPI
│   ├── gcc
│   │   └── 4.8.5
│   │       └── openmpi
│   │           ├── 2.1.6
│   │           │   └── hdf5
│   │           │       ├── 1.12.0
│   │           │       └── 1.8.21
│   │           └── 4.1.0
│   └── intel
│       └── 2020
│           └── openmpi
│               ├── 2.1.6
│               ├── 3.1.6
│               └── 4.1.0
├── Core
│   ├── app
│   │   ├── anaconda3
│   │   │   ├── 4.10.3
│   │   │   └── 4.9.2
│   │   ├── binutils
│   │   │   └── 2.35.2
│   │   ├── cmake
│   │   │   └── 3.20.3
│   │   ├── make
│   │   │   └── 4.3
│   │   └── root
│   │       └── 6.24
│   ├── gcc
│   │   ├── 10.3.0
│   │   ├── 11.1.0
│   │   ├── 4.8.5
│   │   └── 9.3.0
│   ├── glibc
│   ├── intel
│   │   ├── 2017
│   │   ├── 2018
│   │   └── 2020
│   ├── nvhpc_sdk
│   │   └── 20.11
│   ├── pgi -> nvhpc_sdk/
│   └── python
│       └── 3.9.5
└── VERSION

ssinfo

ssinfo is made by DiCOS administrator, and available in slurm-ui. It could help users to know some system informations, including accounting, news, and documentation, etc.

  • Show document of slurm

ssinfo docu
  • Show personal information on QDR4 cluster

ssinfo me
  • Show news of slurm and DiCOS

ssinfo news
  • Show current slurm information

ssinfo slurm
  • Show module tree and dependencies

ssinfo modules

CVMFS

CVMFS represented for CernVM-FS. It’s originally used in the grid computing, and try to deliver the updated software for the computation. The file system is read-only, so it is very suitable for the software delivery. In DiCOS system, CVMFS file system is for the software repository for users, and mounted in /cvmfs. The modules environment in slurm system help user to setup the environment for specifically software, and the software is located in CVMFS.

See also

Docs

Slurm Tutorials

On Site Slurm Documents

User documents for SLURM are located in

/ceph/astro_phys/user_document/

Create a working directory, assume it as mpi_work in your HOME directory. Copy all the scripts from the following directory to start.

/ceph/astro_phys/user_document/scripts/*

Request for Specific Software Installation

If you have special requirement for the software installation, please contact to DiCOS-Support@twgrid.org.