Slurm Configuration Basics

 

1. Install Munge

Munge is the authentication service that Slurm uses to validate credentials between its daemons. To install Munge, run the following commands:

sudo apt update
sudo apt install munge

2. Install Slurm

To install Slurm, run the following command:

sudo apt install slurm-wlm

3. Configure Slurm

Once Slurm is installed, configure it by editing the slurm.conf file. To edit the file, run the following command:

sudo nano /etc/slurm-llnl/slurm.conf

This opens slurm.conf in the Nano text editor, where you can set the configuration parameters as needed. The /etc/slurm-llnl directory is used by older Ubuntu packages; on newer releases the file may live in /etc/slurm instead.

4. Start Slurm

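To start Slurm, start the control daemon (slurmctld) on the controller and the compute daemon (slurmd) on every compute node. On a single workstation, both daemons run on the same machine. To start the control daemon, run the following command: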

sudo service slurmctld start

To start the compute daemon, run the following command on each compute node:

sudo service slurmd start

Congratulations, you have successfully installed and configured Slurm on Ubuntu!
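
Once both daemons are running, it can be useful to check that the controller answers queries. The following is a minimal Python sketch of such a check, not part of Slurm itself. It assumes the sinfo client is on the PATH; the helper names parse_sinfo and cluster_status and the chosen output format string are illustrative assumptions.

```python
import shutil
import subprocess


def parse_sinfo(text):
    """Parse the output of `sinfo -h -o "%P %a %D %T"` into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        partition, avail, nodes, state = fields
        rows.append({
            # sinfo marks the default partition with a trailing "*"
            "partition": partition.rstrip("*"),
            "default": partition.endswith("*"),
            "avail": avail,
            "nodes": int(nodes),
            "state": state,
        })
    return rows


def cluster_status():
    """Run sinfo and return the parsed rows, or None if it cannot be queried."""
    if shutil.which("sinfo") is None:
        return None  # Slurm client tools are not installed or not on PATH
    try:
        result = subprocess.run(
            ["sinfo", "-h", "-o", "%P %a %D %T"],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None  # slurmctld is probably not reachable
    return parse_sinfo(result.stdout)
```

Calling cluster_status() on the workstation should return one entry per partition and node state; a return value of None suggests that the client tools are missing or that slurmctld is not responding.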

5. Getting Node Information

To collect the hardware details of a node that slurm.conf needs, use the lscpu command. It reports the CPU architecture, including the number of CPUs, cores per socket, sockets, and threads per core. For memory, use the free command.

The following examples show each command with placeholder output:

Hostname

To get the hostname of the node, you can use the hostname command:

$ hostname
<node_hostname>

CPU Information

To get the CPU information, you can use the lscpu command:

$ lscpu | grep -E '^CPU\(s\)|^Core|Socket|^Thread|^CPU MHz|^L3'
CPU(s): <number_of_CPUs>
Thread(s) per core: <number_of_threads_per_core>
Core(s) per socket: <number_of_cores_per_socket>
Socket(s): <number_of_sockets>
CPU MHz: <CPU_speed_in_MHz>
L3 cache: <L3_cache_size_in_KB>

Memory Information

To get the memory information, you can use the free command:

$ free -h
       total           used           free           shared  buff/cache      available
Mem:   <total_memory>  <used_memory>  <free_memory>  0B      <buffer_cache>  <available_memory>
Swap:  <total_swap>    0B             <free_swap>

Note that the -h option displays the memory sizes in a human-readable format. The RealMemory parameter in slurm.conf is given in megabytes, so free -m reports a value in the unit that is needed there.
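
The values gathered above end up in the NodeName line of slurm.conf. As a rough illustration, the Python sketch below assembles that line from lscpu output and the contents of /proc/meminfo. The function names are hypothetical, and the sketch assumes the English lscpu field labels shown above. Slurm can also print this line itself with slurmd -C, which is the more reliable source.

```python
import re


def parse_lscpu(text):
    """Turn lscpu output (one "Key: value" pair per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo text, converted from kB to MB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) // 1024


def node_line(hostname, lscpu_text, meminfo_text):
    """Assemble a NodeName line in the format used in slurm.conf."""
    cpu = parse_lscpu(lscpu_text)
    return (
        f"NodeName={hostname} "
        f"CPUs={cpu['CPU(s)']} "
        f"CoresPerSocket={cpu['Core(s) per socket']} "
        f"Sockets={cpu['Socket(s)']} "
        f"ThreadsPerCore={cpu['Thread(s) per core']} "
        f"RealMemory={mem_total_mb(meminfo_text)} "
        "State=UNKNOWN"
    )
```

On a real node, lscpu_text could come from subprocess.run(["lscpu"], capture_output=True, text=True).stdout and meminfo_text from reading /proc/meminfo. It is common to set RealMemory slightly below the reported total so that the node is not rejected if the kernel reports a little less memory after a reboot.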

Slurm.conf file

ClusterName=<name>
SlurmctldHost=<hostname>

AuthType=auth/munge

MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

# SCHEDULING
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
NodeName=<node_hostname> CPUs=<n> CoresPerSocket=<n> Sockets=<n> ThreadsPerCore=<n> RealMemory=<n> State=UNKNOWN

PartitionName=debug Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP AllowGroups=root
PartitionName=small_compute Nodes=<node_hostname> Default=YES MaxTime=07-00:00:00 State=UP MaxCPUsPerNode=10 AllowGroups=small_compute MaxMemPerNode=10000
PartitionName=large_compute Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP MaxCPUsPerNode=19 AllowGroups=large_compute,root MaxMemPerNode=60000
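
A configuration of this size is easy to get subtly wrong, for example by setting the same parameter twice or by marking more than one partition as the default. The Python sketch below shows one way such slips could be caught before restarting the daemons. It is an illustrative helper, not a Slurm tool: the function name and the two checks are this sketch's own conventions, and Slurm itself may accept configurations that it flags.

```python
# Keys that may legitimately appear on many lines of slurm.conf
REPEATABLE_KEYS = {"nodename", "partitionname", "downnodes", "nodeset"}


def lint_slurm_conf(text):
    """Return a list of warning strings for slurm.conf content."""
    warnings = []
    first_seen = {}          # lower-cased key -> line number of first use
    default_partitions = []  # partitions declared with Default=YES

    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # strip comments
        if "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        key = name.lower()

        if key == "partitionname":
            # Collect the Key=Value tokens of the partition definition
            options = {}
            for token in line.split():
                opt, sep, value = token.partition("=")
                if sep:
                    options[opt.lower()] = value
            if options.get("default", "NO").upper() == "YES":
                default_partitions.append(options["partitionname"])
        elif key not in REPEATABLE_KEYS:
            if key in first_seen:
                warnings.append(
                    f"line {lineno}: {name} already set on line {first_seen[key]}"
                )
            else:
                first_seen[key] = lineno

    if len(default_partitions) != 1:
        warnings.append(
            f"found {len(default_partitions)} partitions with Default=YES"
        )
    return warnings
```

The function takes the file content as a string, so it can be run on the output of open("/etc/slurm-llnl/slurm.conf").read() or on a draft that has not been installed yet.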

Creating Modules

To use a specific conda environment, such as one for TensorFlow, create an environment module that can be loaded from within the Slurm batch script.

Create a new file named conda in the directory /usr/share/modules/modulefiles/ (you may need root access for this step).

sudo nano /usr/share/modules/modulefiles/conda

 

Add the following lines to the file:

#%Module
prepend-path PATH /path/to/conda/bin

 

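Replace /path/to/conda with the path where conda is installed. Save the file and exit the editor. Now, load the conda module using the module command:

module load conda

This adds the conda executable to your PATH so that it can be used in your Slurm script. Note that conda activate also relies on conda's shell initialization; in a batch script it may be necessary to run source /path/to/conda/etc/profile.d/conda.sh before activating an environment.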

Additional resources

 

Python code of a GUI to generate a basic Slurm script

Here is a pre-built executable to be run on Linux. The code itself follows.


import tkinter as tk
from tkinter import ttk
from tkinter import messagebox

class SLURMSubmitGUI:

    def __init__(self):
        self.partitions = ["small_compute", "large_compute", "debug"]
        self.num_cores = list(range(1, 20))  # selectable core counts: 1 to 19
        self.conda = False
        self.espresso = False
        self.window = tk.Tk()
        self.window.title("SLURM Submit GUI")        
        # Create the menu bar
        self.menu_bar = tk.Menu(self.window)
        # Create the Help menu
        self.help_menu = tk.Menu(self.menu_bar, tearoff=False)
        self.help_menu.add_command(label="Help", command=self.show_help)
        # Add the Help menu to the menu bar
        self.menu_bar.add_cascade(label="Help", menu=self.help_menu)
        # Set the menu bar as the window's menu
        self.window.config(menu=self.menu_bar)           
        #styles
        style = ttk.Style()
        style.configure('TimeEntry.TEntry', padding=5, font=('Arial', 12), width=40)
        # Job name input
        self.job_name_label = ttk.Label(self.window, text="Job Name:")
        self.job_name_label.grid(row=0, column=0, padx=5, pady=5)
        self.job_name_entry = ttk.Entry(self.window, width=40)
        self.job_name_entry.grid(row=0, column=1, padx=5, pady=5)        
        # Partition selection
        self.partition_label = ttk.Label(self.window, text="Partition:")
        self.partition_label.grid(row=1, column=0, padx=5, pady=5)
        self.partition_option = tk.StringVar()
        self.partition_option.set(self.partitions[0])
        # ttk.OptionMenu takes the default value first, then the full list of choices
        self.partition_menu = ttk.OptionMenu(self.window, self.partition_option, self.partitions[0], *self.partitions)
        self.partition_menu.config(width=40)
        self.partition_menu.grid(row=1, column=1, padx=5, pady=5)
        # Number of cores selection
        self.num_cores_label = ttk.Label(self.window, text="Number of Cores:")
        self.num_cores_label.grid(row=2, column=0, padx=5, pady=5)
        self.num_cores_option = tk.StringVar()
        self.num_cores_option.set(str(self.num_cores[0]))
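        # ttk.OptionMenu takes the default value first, then the full list of choices
        self.num_cores_menu = ttk.OptionMenu(self.window, self.num_cores_option, str(self.num_cores[0]), *map(str, self.num_cores))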
        self.num_cores_menu.config(width=40)
        self.num_cores_menu.grid(row=2, column=1, padx=5, pady=5)
        # Memory input
        self.memory_label = ttk.Label(self.window, text="Memory (e.g. 1G):")
        self.memory_label.grid(row=3, column=0, padx=5, pady=5)
        self.memory_entry = ttk.Entry(self.window, width=40)
        self.memory_entry.insert(0,'1G')
        self.memory_entry.grid(row=3, column=1, padx=5, pady=5)
        # Time limit input
        self.time_label = ttk.Label(self.window, text="Time Limit (e.g. 1-00:00:00):")
        self.time_label.grid(row=4, column=0, padx=5, pady=5)
        self.time_entry = ttk.Entry(self.window, style='TimeEntry.TEntry',width=40)
        self.time_entry.insert(0, '1-00:00:00')
        self.time_entry.grid(row=4, column=1, padx=5, pady=5)
        # Script path input
        self.script_label = ttk.Label(self.window, text="Running Command:")
        self.script_label.grid(row=5, column=0, padx=5, pady=5)
        self.script_entry = ttk.Entry(self.window, width=40)
        self.script_entry.grid(row=5, column=1, padx=5, pady=5)
        # Checkbutton - conda (a lone radio button cannot be deselected, so a
        # check button is used; the variable is kept on self so it stays alive)
        self.conda_var = tk.BooleanVar(value=False)
        self.conda_button = tk.Checkbutton(self.window, text='Enable Conda Environment',
                                           variable=self.conda_var, command=self.toggle)
        self.conda_button.grid(row=6, column=0, padx=5, pady=5)
        # Checkbutton - espresso
        self.espresso_var = tk.BooleanVar(value=False)
        self.espresso_button = tk.Checkbutton(self.window, text='Enable Espresso Environment',
                                              variable=self.espresso_var, command=self.toggle_espresso)
        self.espresso_button.grid(row=6, column=1, padx=5, pady=5)
        # Submit button
        self.submit_button = ttk.Button(self.window, text="Submit", command=self.generate_script)
        self.submit_button.grid(row=7, column=0, padx=5, pady=5)
        self.show_file_button = ttk.Button(self.window, text="Show File", command=self.show_file_content)
        self.show_file_button.grid(column=1, row=7, padx=5, pady=5)        
        self.close_button = tk.Button(self.window, text="Close", command=self.window.destroy)
        self.close_button.grid(column=2, row=7, padx=5, pady=5)
        self.window.mainloop()        

    def toggle(self):
        # Flip the conda flag each time the option is clicked
        self.conda = not self.conda

    def toggle_espresso(self):
        # Flip the espresso flag each time the option is clicked
        self.espresso = not self.espresso

    def show_file_content(self):
        """Open a new window and display the content of the generated script."""
        job_name = self.job_name_entry.get()
        # Read the file first so that no empty window opens if it is missing
        try:
            with open(f"{job_name}.sbatch", "r") as f:
                content = f.read()
        except FileNotFoundError:
            messagebox.showerror("Error", f"{job_name}.sbatch not found. Generate the script first.")
            return
        # Create a new window
        file_win = tk.Toplevel(self.window)
        file_win.title("File Content")
        # Create a text widget to display the file content
        text = tk.Text(file_win, wrap="word")
        text.pack(side="left", fill="both", expand=True)
        # Add a scrollbar
        scrollbar = tk.Scrollbar(file_win, command=text.yview)
        scrollbar.pack(side="right", fill="y")
        text.config(yscrollcommand=scrollbar.set)
        text.insert("1.0", content)
        # Disable the text widget to prevent editing
        text.configure(state="disabled")

    def generate_script(self):
        """Write <job name>.sbatch from the values entered in the form."""
        job_name = self.job_name_entry.get().strip()
        partition = self.partition_option.get()
        num_cores = int(self.num_cores_option.get())
        memory = self.memory_entry.get()
        time = self.time_entry.get()
        script = self.script_entry.get()
        conda_env = "/home/ldft/.conda/envs/ANN_env"
        # The job name is used as the file name, so it must not be empty
        if not job_name:
            messagebox.showerror("Error", "Please enter a job name.")
            return
        with open(f"{job_name}.sbatch", "w") as f:
            f.write("#!/bin/bash\n")
            f.write(f"#SBATCH --job-name={job_name}\n")
            f.write(f"#SBATCH --partition={partition}\n")
            f.write("#SBATCH --nodes=1\n")
            f.write(f"#SBATCH --ntasks-per-node={num_cores}\n")
            f.write(f"#SBATCH --mem={memory}\n")
            f.write(f"#SBATCH --time={time}\n")
            f.write(f"#SBATCH --output={job_name}.out\n")
            f.write(f"#SBATCH --error={job_name}.err\n\n")
            if self.conda:
                f.write("module load conda\n")
                f.write(f"conda activate {conda_env}\n\n")
            if self.espresso:
                f.write("export PATH=$PATH:/home/ldft/Documents/install/qe-6.6/bin\n\n")
            # mpirun starts the MPI ranks itself; wrapping it in srun would
            # launch one mpirun per allocated task
            f.write(f"mpirun -n {num_cores} {script}\n")

    def show_help(self):
        # Show a short description of the GUI in a message box
        messagebox.showinfo("Help", """This GUI (Graphical User Interface) generates batch scripts
for a SLURM (Simple Linux Utility for Resource Management) cluster.

The GUI has several input fields for specifying the job parameters, such as the job name,
the partition to run the job on, the number of cores, the memory limit, the time limit,
and the command to be run.

The GUI also has two options to enable/disable a Conda environment or an Espresso environment.

The generate_script method is called when the "Submit" button is pressed.
It writes a batch script named <job name>.sbatch with the specified parameters.
The script is not submitted automatically; submit it with: sbatch <job name>.sbatch

The show_file_content method is called when the "Show File" button is pressed,
which opens a new window and displays the content of the generated batch script.""")


if __name__ == "__main__":
    # Building the object creates the window and enters the Tk main loop
    SLURMSubmitGUI()
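
The GUI writes whatever is typed into the job name, memory, and time limit fields straight into the batch script. A small validation step before writing the file could catch typos early. The following Python sketch shows one possible approach. The function name validate_job and the patterns are assumptions made for illustration; the time pattern follows the formats listed for --time in the sbatch man page, and the memory pattern is deliberately strict.

```python
import re

# Formats listed for --time in the sbatch man page: minutes, minutes:seconds,
# hours:minutes:seconds, days-hours, days-hours:minutes,
# days-hours:minutes:seconds
TIME_RE = re.compile(r"(\d+(:\d{1,2}){0,2}|\d+-\d{1,2}(:\d{1,2}){0,2})")
# A number with an optional unit suffix (upper case only in this sketch)
MEM_RE = re.compile(r"\d+[KMGT]?")
# The job name is also used as a file name, so restrict it to safe characters
NAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_job(job_name, memory, time_limit):
    """Return a list of error messages; an empty list means the input is fine."""
    errors = []
    if not NAME_RE.fullmatch(job_name):
        errors.append("job name may only contain letters, digits, '.', '_' and '-'")
    if not MEM_RE.fullmatch(memory):
        errors.append("memory must look like 500M or 1G")
    if not TIME_RE.fullmatch(time_limit):
        errors.append("time limit must look like 30, 12:00:00 or 1-00:00:00")
    return errors
```

In the GUI, generate_script could call validate_job with the three field values and show the returned messages with messagebox.showerror instead of writing the file when the list is not empty.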

 

 

 

Here are some useful pages and tutorials showing the basics of how to configure the Slurm scheduler on a single workstation:

Turn your workstation into a mini-grid (with Slurm)

Setting up a single server SLURM cluster

Installing/emulating SLURM on an Ubuntu 16.04 desktop

Slurm controler node

Really Super Quick Start Guide to Setting Up SLURM