Slurm Configuration Basics

1. Install Munge

Munge is a secure authentication service used by Slurm. To install Munge, run the following commands:

sudo apt update

sudo apt install munge

2. Install Slurm

To install Slurm, run the following command:

sudo apt install slurm-wlm

3. Configure Slurm

Once Slurm is installed, you need to configure it by editing the slurm.conf file. To edit the file, run the following command:

sudo nano /etc/slurm/slurm.conf

This opens the slurm.conf file in the Nano text editor, where you can edit the configuration parameters as needed. On older Ubuntu releases the package may keep this file at /etc/slurm-llnl/slurm.conf instead.

BASIC EXAMPLE OF CONFIGURATION FILE

ClusterName=cluster

SlurmctldHost=<add-hostname>

ProctrackType=proctrack/linuxproc

ReturnToService=1

SlurmctldPidFile=/var/run/slurm/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurm/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurm/slurmd

SlurmUser=slurm

StateSaveLocation=/var/spool/slurm/slurmctld

TaskPlugin=task/affinity,task/cgroup

InactiveLimit=0

KillWait=30

MinJobAge=300

SlurmctldTimeout=120

SlurmdTimeout=300

Waittime=0

# SCHEDULING

SchedulerType=sched/backfill

SelectType=select/cons_tres

#JobCompParams=

#JobCompPass=

#JobCompPort=

JobCompType=jobcomp/none

#JobCompUser=

#JobContainerType=

JobAcctGatherFrequency=30

#JobAcctGatherType=

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurm/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurm/slurmd.log

GresTypes=gpu

# COMPUTE NODES

NodeName=<hostname> NodeAddr=<hostname> CPUs=20 RealMemory=64222 Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:M2000:1 State=UNKNOWN

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP AllowGroups=ldft

PartitionName=small_compute Nodes=ALL Default=NO MaxTime=1-00:00:00 State=UP AllowGroups=small_compute MaxCPUsPerNode=10 MaxMemPerNode=32768

PartitionName=large_compute Nodes=ALL Default=NO MaxTime=INFINITE State=UP AllowGroups=large_compute MaxCPUsPerNode=19 MaxMemPerNode=61440
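In a typical node definition, CPUs equals Sockets * CoresPerSocket * ThreadsPerCore (here 1 * 10 * 2 = 20). The following is a minimal sketch of that sanity check in Python; the helper names parse_node_line and cpus_consistent are hypothetical and not part of Slurm.

```python
def parse_node_line(line):
    """Split a slurm.conf NodeName line into a dict of its Key=Value fields."""
    return dict(field.split("=", 1) for field in line.split())


def cpus_consistent(node):
    """Return True when CPUs equals Sockets * CoresPerSocket * ThreadsPerCore."""
    expected = (int(node["Sockets"])
                * int(node["CoresPerSocket"])
                * int(node["ThreadsPerCore"]))
    return int(node["CPUs"]) == expected


# The node line from the example configuration above
node_line = ("NodeName=<hostname> NodeAddr=<hostname> CPUs=20 RealMemory=64222 "
             "Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:M2000:1 "
             "State=UNKNOWN")
node = parse_node_line(node_line)
print(cpus_consistent(node))  # True: 1 * 10 * 2 == 20
```

A mismatch is not always an error (some sites deliberately set CPUs to the number of physical cores), so treat the result as a hint rather than a hard rule.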

3.1. Edit Service files

Edit the slurmctld.service and slurmd.service files so that their PID file locations match the ones defined in slurm.conf. These files are usually found in /usr/lib/systemd/system/, for example /usr/lib/systemd/system/slurmctld.service.
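One way to spot a mismatch is to compare the two values programmatically. The sketch below assumes the unit file contains a PIDFile= entry (not every packaged unit does); the helper names and the sample file contents are hypothetical.

```python
import re


def conf_value(conf_text, key):
    """Return the value of `key` from slurm.conf-style text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith(key.lower() + "="):
            return line.split("=", 1)[1].strip()
    return None


def unit_pidfile(unit_text):
    """Return the PIDFile= value from systemd unit text, or None."""
    match = re.search(r"^PIDFile=(.+)$", unit_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


# Hypothetical file contents, used here instead of reading the real files
sample_conf = "SlurmctldPidFile=/var/run/slurm/slurmctld.pid\n"
sample_unit = "[Service]\nType=forking\nPIDFile=/var/run/slurmctld.pid\n"

conf_pid = conf_value(sample_conf, "SlurmctldPidFile")
unit_pid = unit_pidfile(sample_unit)
print(conf_pid == unit_pid)  # False: the two locations differ
```

To check a real system, read /etc/slurm/slurm.conf and the service file with open() and pass their contents to the same functions.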

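Verify that the node is set up accordingly. The following command prints the hardware configuration that slurmd detects, in slurm.conf format: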

slurmd -C

3.2. Reload Daemon

sudo systemctl daemon-reload

3.3. Enable services

sudo systemctl enable slurmctld slurmd

4. Start Slurm

To start Slurm, start the control daemon (slurmctld) and the compute node daemon (slurmd). To start the control daemon, run the following command:

sudo systemctl start slurmctld

To start the compute node daemon, run the following command on each node:

sudo systemctl start slurmd

Verify status:

sudo systemctl status slurmctld slurmd
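The same check can be scripted. This is a minimal sketch that calls systemctl is-active for each unit; the function name service_states and its run parameter (which allows substituting a stub for subprocess.run) are assumptions made for illustration.

```python
import subprocess


def service_states(units, run=subprocess.run):
    """Return {unit: state} as reported by `systemctl is-active`."""
    states = {}
    for unit in units:
        result = run(["systemctl", "is-active", unit],
                     capture_output=True, text=True)
        # stdout holds a single word such as "active", "inactive" or "failed"
        states[unit] = result.stdout.strip()
    return states
```

Calling service_states(["slurmctld", "slurmd"]) on the workstation should report "active" for both daemons once they are running.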

Congratulations, you have successfully installed and configured Slurm on Ubuntu!

5. Getting Node Information

To get information about a node, you can use the lscpu command. This will provide information about the CPU architecture, including the number of CPUs, cores per socket, sockets, and threads per core. To get information about the memory, you can use the free command.

The following examples show how to collect this information with these commands:

Hostname

To get the hostname of the node, you can use the hostname command:

$ hostname

<node_hostname>

CPU Information

To get the CPU information, you can use the lscpu command:

$ lscpu | grep -E '^CPU\(s\)|^Core|Socket|^Thread|^CPU MHz|^L3'

CPU(s):                <number_of_CPUs>

Thread(s) per core:    <number_of_threads_per_core>

Core(s) per socket:    <number_of_cores_per_socket>

Socket(s):             <number_of_sockets>

CPU MHz:               <CPU_speed_in_MHz>

L3 cache:              <L3_cache_size_in_KB>

Memory Information

To get the memory information, you can use the free command:

$ free -h

               total           used           free           shared  buff/cache      available
Mem:           <total_memory>  <used_memory>  <free_memory>  0B      <buffer_cache>  <available_memory>
Swap:          <total_swap>    0B             <free_swap>

Note that the -h option is used to display the memory sizes in a human-readable format.
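The values collected above map directly onto the NodeName line of slurm.conf. The sketch below builds that line from lscpu output; note that RealMemory is given in megabytes, so the figure from free -m is more directly usable than the human-readable one. The function names, the hostname node01 and the sample numbers are assumptions for illustration.

```python
def parse_lscpu(text):
    """Turn `lscpu` output ("Key:   value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    return info


def node_line(hostname, lscpu_text, real_memory_mb):
    """Build a slurm.conf NodeName line from lscpu output and memory in MB."""
    cpu = parse_lscpu(lscpu_text)
    return (f"NodeName={hostname} CPUs={cpu['CPU(s)']} "
            f"CoresPerSocket={cpu['Core(s) per socket']} "
            f"Sockets={cpu['Socket(s)']} "
            f"ThreadsPerCore={cpu['Thread(s) per core']} "
            f"RealMemory={real_memory_mb} State=UNKNOWN")


# Sample lscpu output with made-up values
sample_lscpu = """CPU(s):                20
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             1
"""
print(node_line("node01", sample_lscpu, 64222))
```

On a real node, the lscpu text could be obtained with subprocess.run(["lscpu"], capture_output=True, text=True).stdout.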

Slurm.conf file

ClusterName=<name>
SlurmctldHost=<hostname>

AuthType=auth/munge

MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

# SCHEDULING
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

#ACCOUNTING
AccountingStorageUser=useracct_gather/linuxacct
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
NodeName=<node_hostname> CPUs=<n> CoresPerSocket=<n> Sockets=<n> ThreadsPerCore=<n> RealMemory=<n> State=UNKNOWN

PartitionName=debug Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP AllowGroups=root
PartitionName=small_compute Nodes=<node_hostname> Default=YES MaxTime=07-00:00:00 State=UP MaxCPUsPerNode=10 AllowGroups=small_compute MaxMemPerNode=10000
PartitionName=large_compute Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP MaxCPUsPerNode=19 AllowGroups=large_compute,root MaxMemPerNode=60000
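In a file of this length it is easy to define the same global parameter twice. The following is a minimal sketch of a duplicate-key check; the function name duplicate_keys and the short list of lines that are allowed to repeat are assumptions, not an exhaustive description of the slurm.conf format.

```python
from collections import Counter


def duplicate_keys(conf_text):
    """Return the sorted global keys that appear more than once."""
    repeatable = ("nodename", "partitionname")  # lines that may legitimately repeat
    counts = Counter()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip().lower()
        if key not in repeatable:
            counts[key] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical configuration fragment with one repeated key
sample = """ClusterName=cluster
SlurmUser=slurm
# ACCOUNTING
SlurmUser=slurm
PartitionName=debug Nodes=ALL
PartitionName=small_compute Nodes=ALL
"""
print(duplicate_keys(sample))  # ['slurmuser']
```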

Creating Modules

To use a specific conda environment, such as one for TensorFlow, create an environment module that can be loaded from within the Slurm batch script.

Create a new file with the name conda in the directory /usr/share/modules/modulefiles/ (you may need root access for this step).

sudo nano /usr/share/modules/modulefiles/conda

Add the following lines to the file:

#%Module

prepend-path PATH /path/to/conda/bin

Replace /path/to/conda with the path where conda is installed. Save the file and exit the editor. Now, load the conda module using the module command:

module load conda

This should add the conda executable to your path and allow you to use it in your Slurm script.

Additional resources

Python code for a GUI that generates a basic Slurm script

Here is a pre-built executable to be run on Linux. The code itself follows.


import tkinter as tk
from tkinter import ttk
from tkinter import messagebox

class SLURMSubmitGUI:

    def __init__(self):
        self.partitions = ["small_compute", "large_compute", "debug"]
        self.num_cores = list(range(1, 20))  # 1 to 19 cores
        self.conda = False
        self.espresso = False
        self.window = tk.Tk()
        self.window.title("SLURM Submit GUI")        
        # Create the menu bar
        self.menu_bar = tk.Menu(self.window)
        # Create the Help menu
        self.help_menu = tk.Menu(self.menu_bar, tearoff=False)
        self.help_menu.add_command(label="Help", command=self.show_help)
        # Add the Help menu to the menu bar
        self.menu_bar.add_cascade(label="Help", menu=self.help_menu)
        # Set the menu bar as the window's menu
        self.window.config(menu=self.menu_bar)           
        #styles
        style = ttk.Style()
        style.configure('TimeEntry.TEntry', padding=5, font=('Arial', 12), width=40)
        # Job name input
        self.job_name_label = ttk.Label(self.window, text="Job Name:")
        self.job_name_label.grid(row=0, column=0, padx=5, pady=5)
        self.job_name_entry = ttk.Entry(self.window, width=40)
        self.job_name_entry.grid(row=0, column=1, padx=5, pady=5)        
        # Partition selection
        self.partition_label = ttk.Label(self.window, text="Partition:")
        self.partition_label.grid(row=1, column=0, padx=5, pady=5)
        self.partition_option = tk.StringVar()
        self.partition_option.set(self.partitions[0])
        self.partition_menu = ttk.OptionMenu(self.window, self.partition_option, *self.partitions)
        self.partition_menu.config(width=40)
        self.partition_menu.grid(row=1, column=1, padx=5, pady=5)
        # Number of cores selection
        self.num_cores_label = ttk.Label(self.window, text="Number of Cores:")
        self.num_cores_label.grid(row=2, column=0, padx=5, pady=5)
        self.num_cores_option = tk.StringVar()
        self.num_cores_option.set(str(self.num_cores[0]))
        self.num_cores_menu = ttk.OptionMenu(self.window, self.num_cores_option, *map(str, self.num_cores))
        self.num_cores_menu.config(width=40)
        self.num_cores_menu.grid(row=2, column=1, padx=5, pady=5)
        # Memory input
        self.memory_label = ttk.Label(self.window, text="Memory (e.g. 1G):")
        self.memory_label.grid(row=3, column=0, padx=5, pady=5)
        self.memory_entry = ttk.Entry(self.window, width=40)
        self.memory_entry.insert(0,'1G')
        self.memory_entry.grid(row=3, column=1, padx=5, pady=5)
        # Time limit input
        self.time_label = ttk.Label(self.window, text="Time Limit (e.g. 1-00:00:00):")
        self.time_label.grid(row=4, column=0, padx=5, pady=5)
        self.time_entry = ttk.Entry(self.window, style='TimeEntry.TEntry',width=40)
        self.time_entry.insert(0, '1-00:00:00')
        self.time_entry.grid(row=4, column=1, padx=5, pady=5)
        # Script path input
        self.script_label = ttk.Label(self.window, text="Running Command:")
        self.script_label.grid(row=5, column=0, padx=5, pady=5)
        self.script_entry = ttk.Entry(self.window, width=40)
        self.script_entry.grid(row=5, column=1, padx=5, pady=5)
        # Check button - conda (unlike a lone radio button, it can be switched off again)
        self.conda_var = tk.BooleanVar(value=False)
        self.conda_button = ttk.Checkbutton(self.window, text='Enable Conda Environment',
                                            variable=self.conda_var, command=self.toggle)
        self.conda_button.grid(row=6, column=0, padx=5, pady=5)
        # Check button - espresso
        self.espresso_var = tk.BooleanVar(value=False)
        self.espresso_button = ttk.Checkbutton(self.window, text='Enable Espresso Environment',
                                               variable=self.espresso_var, command=self.toggle_espresso)
        self.espresso_button.grid(row=6, column=1, padx=5, pady=5)
        # Submit button
        self.submit_button = ttk.Button(self.window, text="Submit", command=self.generate_script)
        self.submit_button.grid(row=7, column=0, padx=5, pady=5)
        self.show_file_button = ttk.Button(self.window, text="Show File", command=self.show_file_content)
        self.show_file_button.grid(column=1, row=7, padx=5, pady=5)        
        self.close_button = tk.Button(self.window, text="Close", command=self.window.destroy)
        self.close_button.grid(column=2, row=7, padx=5, pady=5)
        self.window.mainloop()        

    def toggle(self):
        """Flip the conda flag each time its button is clicked."""
        self.conda = not self.conda

    def toggle_espresso(self):
        """Flip the espresso flag each time its button is clicked."""
        self.espresso = not self.espresso

    def show_file_content(self):
        """Open a new window and display the content of the generated script."""
        job_name = self.job_name_entry.get()
        # Read the file first so that no empty window is left behind on failure
        try:
            with open(f"{job_name}.sbatch", "r") as f:
                content = f.read()
        except FileNotFoundError:
            messagebox.showerror("File not found",
                                 f"{job_name}.sbatch does not exist. Press Submit first.")
            return
        # Create a new window
        file_win = tk.Toplevel(self.window)
        file_win.title("File Content")
        # Create a text widget to display the file content
        text = tk.Text(file_win, wrap="word")
        text.pack(side="left", fill="both", expand=True)
        # Add a scrollbar
        scrollbar = tk.Scrollbar(file_win, command=text.yview)
        scrollbar.pack(side="right", fill="y")
        text.config(yscrollcommand=scrollbar.set)
        text.insert("1.0", content)
        # Disable the text widget to prevent editing
        text.configure(state="disabled")

    def generate_script(self):
        """Write the batch script <job name>.sbatch to the current directory."""
        job_name = self.job_name_entry.get()
        if not job_name:
            messagebox.showerror("Missing job name", "Please enter a job name.")
            return
        partition = self.partition_option.get()
        num_cores = int(self.num_cores_option.get())
        memory = self.memory_entry.get()
        time_limit = self.time_entry.get()
        script = self.script_entry.get()
        conda_activate = "conda activate /home/ldft/.conda/envs/ANN_env"
        with open(f"{job_name}.sbatch", "w") as f:
            # Resource requests
            f.write("#!/bin/bash\n")
            f.write(f"#SBATCH --job-name={job_name}\n")
            f.write(f"#SBATCH --partition={partition}\n")
            f.write("#SBATCH --nodes=1\n")
            f.write(f"#SBATCH --ntasks-per-node={num_cores}\n")
            f.write(f"#SBATCH --mem={memory}\n")
            f.write(f"#SBATCH --time={time_limit}\n")
            f.write(f"#SBATCH --output={job_name}.out\n")
            f.write(f"#SBATCH --error={job_name}.err\n\n")
            # Optional environment setup
            if self.conda:
                f.write("module load conda\n")
                f.write(f"{conda_activate}\n")
            if self.espresso:
                f.write("export PATH=$PATH:/home/ldft/Documents/install/qe-6.6/bin\n")
            # Command to run
            f.write(f"\nsrun mpirun -n {num_cores} {script}\n")

    def show_help(self):
        """Show a short description of the GUI in a message box."""
        messagebox.showinfo("Help", """This GUI (Graphical User Interface) generates batch scripts
for a SLURM (Simple Linux Utility for Resource Management) cluster.

The GUI has several input fields for specifying the job parameters, such as the job name,
the partition to run the job on, the number of cores, the memory limit, the time limit,
and the command to be run.

The GUI also has two options to enable/disable a Conda environment or an Espresso environment.

The generate_script method is called when the "Submit" button is pressed.
It writes a batch script named <job name>.sbatch with the specified parameters,
which can then be submitted to the cluster with sbatch.

The show_file_content method is called when the "Show File" button is pressed,
which opens a new window and displays the content of the generated batch script.""")


if __name__ == "__main__":
    # Build the window; the Tk main loop is started inside __init__
    SLURMSubmitGUI()
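The GUI code writes the .sbatch file but does not call sbatch itself. If submission from Python is wanted, a minimal sketch could look like the following; the function name submit and its run parameter (which allows substituting a stub for subprocess.run) are assumptions, while --parsable is the sbatch option that prints only the job ID, optionally followed by ";" and the cluster name.

```python
import subprocess


def submit(path, run=subprocess.run):
    """Submit the batch script at `path` and return the job ID as a string."""
    result = run(["sbatch", "--parsable", path],
                 capture_output=True, text=True, check=True)
    # --parsable prints "jobid" or "jobid;cluster"
    return result.stdout.strip().split(";")[0]
```

For example, submit("myjob.sbatch") would return the ID of the queued job, and check=True makes a rejected submission raise subprocess.CalledProcessError instead of failing silently.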

Here are some useful pages/tutorials showing the basics of how to configure Slurm Scheduler on a single workstation.

Laboratório de Desenvolvimento em Física Teórica