1. Install Munge
Munge is a secure authentication service used by Slurm. To install Munge, run the following commands:
sudo apt update
sudo apt install munge
2. Install Slurm
To install Slurm, run the following command:
sudo apt install slurm-wlm
3. Configure Slurm
Once Slurm is installed, you need to configure it by editing the slurm.conf file. To edit the file, run the following command:
sudo nano /etc/slurm/slurm.conf
This will open the slurm.conf file in the Nano text editor. You can edit the configuration parameters as needed.
A basic example of the configuration file is given later in this guide.
3.1. Edit Service files
Edit the slurmctld.service and slurmd.service files so that the PID file locations match the ones defined in slurm.conf. The files are usually located under /usr/lib/systemd/system/, for example /usr/lib/systemd/system/slurmctld.service
Verify that the node is set up accordingly:
slurmd -C
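The PID file comparison described in this step can also be scripted. The following is a minimal sketch, assuming the unit file declares its PID file with a PIDFile= line; the helper names pid_file_from_conf, pid_file_from_unit and pid_files_match are illustrative and not part of Slurm, and the sample text is made up for the example.

```python
def pid_file_from_conf(conf_text, key):
    """Return the value of `key` (e.g. 'SlurmctldPidFile') in slurm.conf text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            name, value = line.split("=", 1)
            if name.strip().lower() == key.lower():
                return value.strip()
    return None


def pid_file_from_unit(unit_text):
    """Return the PIDFile= value from the text of a systemd unit file, or None."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("PIDFile="):
            return line.split("=", 1)[1].strip()
    return None


def pid_files_match(conf_text, unit_text, key):
    """True only if both files name the same PID file path (plain string comparison)."""
    conf_pid = pid_file_from_conf(conf_text, key)
    unit_pid = pid_file_from_unit(unit_text)
    return conf_pid is not None and conf_pid == unit_pid


# Hypothetical file contents, for illustration only.
sample_conf = "SlurmctldPidFile=/var/run/slurmctld.pid\nSlurmdPidFile=/var/run/slurmd.pid\n"
sample_unit = "[Service]\nType=forking\nPIDFile=/run/slurmctld.pid\n"
print(pid_files_match(sample_conf, sample_unit, "SlurmctldPidFile"))
```

The comparison is purely textual, so /var/run/slurmctld.pid and /run/slurmctld.pid are reported as a mismatch even on systems where /var/run is a symlink to /run. To check real files, read /etc/slurm/slurm.conf and the service file with open() and pass their contents in.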
3.2. Reload Daemon
sudo systemctl daemon-reload
3.3. Enable services
sudo systemctl enable slurmctld slurmd
4. Start Slurm
To start Slurm, you need to start the control daemon and the compute nodes. To start the control daemon, run the following command:
sudo systemctl start slurmctld
To start the compute nodes, run the following command on each node:
sudo systemctl start slurmd
To check that both services are running, run:
sudo systemctl status slurmctld slurmd
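If you prefer a scripted check, the state of each service can be read with systemctl is-active, which prints a single word such as active, inactive or failed. The following is a minimal sketch; the helper name service_states and its run parameter (which only exists so the command runner can be swapped out) are illustrative.

```python
import subprocess


def service_states(units, run=subprocess.run):
    """Return a dict mapping each systemd unit name to the state systemctl reports."""
    states = {}
    for unit in units:
        # `systemctl is-active UNIT` prints the unit state on stdout.
        result = run(["systemctl", "is-active", unit], capture_output=True, text=True)
        states[unit] = result.stdout.strip()
    return states
```

On the workstation, service_states(["slurmctld", "slurmd"]) should report active for both units once the steps above have succeeded.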
Congratulations, you have successfully installed and configured Slurm on Ubuntu!
5. Getting Node Information
To get information about a node, you can use the lscpu command. It reports the CPU architecture, including the number of CPUs, cores per socket, sockets, and threads per core. To get information about the memory, you can use the free command.
Here is an example of how to get this information using these commands:
Hostname
To get the hostname of the node, you can use the hostname command:
$ hostname
<node_hostname>
CPU Information
To get the CPU information, you can use the lscpu command:
$ lscpu | grep -E '^CPU\(s\)|^Core|Socket|^Thread|^CPU MHz|^L3'
CPU(s): <number_of_CPUs>
Thread(s) per core: <number_of_threads_per_core>
Core(s) per socket: <number_of_cores_per_socket>
Socket(s): <number_of_sockets>
CPU MHz: <CPU_speed_in_MHz>
L3 cache: <L3_cache_size_in_KB>
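These fields map directly onto the NodeName line of slurm.conf. The following is a minimal sketch of that mapping, assuming lscpu prints one "Key: value" pair per line; the helper names parse_lscpu and node_line, and the sample values, are made up for illustration. RealMemory is passed in separately because it is expressed in megabytes and does not come from lscpu.

```python
def parse_lscpu(text):
    """Turn each 'Key: value' line of lscpu output into a dict entry."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    return info


def node_line(hostname, lscpu_text, real_memory_mb):
    """Build a NodeName line for slurm.conf from lscpu output."""
    info = parse_lscpu(lscpu_text)
    return (
        f"NodeName={hostname} "
        f"CPUs={info['CPU(s)']} "
        f"CoresPerSocket={info['Core(s) per socket']} "
        f"Sockets={info['Socket(s)']} "
        f"ThreadsPerCore={info['Thread(s) per core']} "
        f"RealMemory={real_memory_mb} State=UNKNOWN"
    )


# Hypothetical lscpu excerpt for an 8-thread, single-socket machine.
sample = """CPU(s):              8
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
"""
print(node_line("node01", sample, 15000))
```

The slurmd -C command prints a comparable line for the local machine, so this sketch is mainly useful for checking the result by hand or for describing a machine on which slurmd is not yet installed.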
Memory Information
To get the memory information, you can use the free command:
$ free -h
total used free shared buff/cache available
Mem: <total_memory> <used_memory> <free_memory> 0B <buffer_cache> <available_memory>
Swap: <total_swap> 0B <free_swap>
The -h option is used to display the memory sizes in a human-readable format.
Slurm.conf file
ClusterName=<name>
SlurmctldHost=<hostname>
AuthType=auth/munge
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# ACCOUNTING
AccountingStorageUser=useracct_gather/linuxacct
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
SlurmUser=slurm
# COMPUTE NODES
NodeName=<node_hostname> CPUs=<n> CoresPerSocket=<n> Sockets=<n> ThreadsPerCore=<n> RealMemory=<n> State=UNKNOWN
PartitionName=debug Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP AllowGroups=root
PartitionName=small_compute Nodes=<node_hostname> Default=YES MaxTime=07-00:00:00 State=UP MaxCPUsPerNode=10 AllowGroups=small_compute MaxMemPerNode=10000
PartitionName=large_compute Nodes=<node_hostname> Default=NO MaxTime=INFINITE State=UP MaxCPUsPerNode=19 AllowGroups=large_compute,root MaxMemPerNode=60000
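Each partition refers to its nodes by name, so a typo or a stray space after Nodes= leaves a partition pointing at a node that is not defined. The following is a minimal sketch of a consistency check, assuming plain node names only (no ranges such as node[01-04] and no Nodes=ALL); the helper names parse_records and check_partitions and the sample host names are illustrative.

```python
def parse_records(conf_text, record_key):
    """Return one dict of Key=value fields per line starting with `record_key=`."""
    records = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith(record_key + "="):
            continue
        fields = dict(token.split("=", 1) for token in line.split() if "=" in token)
        records.append(fields)
    return records


def check_partitions(conf_text):
    """List partitions that reference a node with no matching NodeName line."""
    known = {node["NodeName"] for node in parse_records(conf_text, "NodeName")}
    problems = []
    for part in parse_records(conf_text, "PartitionName"):
        for node in part.get("Nodes", "").split(","):
            if node not in known:
                problems.append(
                    f"partition {part['PartitionName']} references unknown node {node!r}"
                )
    return problems


# Hypothetical configuration excerpt with one misspelled node name.
sample_conf = """NodeName=ws01 CPUs=20 State=UNKNOWN
PartitionName=debug Nodes=ws01 Default=NO
PartitionName=large_compute Nodes=ws02 Default=NO
"""
for problem in check_partitions(sample_conf):
    print(problem)
```

An empty result means that every partition in the text only names nodes that are defined; it says nothing about whether the remaining parameters are valid.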
Creating Modules
To use a specific conda environment, such as one for TensorFlow, we should create a module to be loaded within the Slurm batch script.
Create a new file with the name conda in the directory /usr/share/modules/modulefiles/ (you may need root access for this step).
sudo nano /usr/share/modules/modulefiles/conda
Add the following lines to the file:
#%Module
prepend-path PATH /path/to/conda/bin
Replace /path/to/conda with the path where conda is installed. Save the file and exit the editor. Now, load the conda module using the module command:
module load conda
This should add the conda executable to your path and allow you to use it in your Slurm script.
Additional resources
Python code of a GUI to generate a basic Slurm script
Here is a pre-built executable to be run on Linux. The code itself follows.
import tkinter as tk
from tkinter import ttk
from tkinter import messagebox


class SLURMSubmitGUI:
    def __init__(self):
        self.partitions = ["small_compute", "large_compute", "debug"]
        self.num_cores = list(range(1, 20))  # 1 to 19 cores
        self.conda = False
        self.espresso = False
        self.window = tk.Tk()
        self.window.title("SLURM Submit GUI")
        # Create the menu bar
        self.menu_bar = tk.Menu(self.window)
        # Create the Help menu
        self.help_menu = tk.Menu(self.menu_bar, tearoff=False)
        self.help_menu.add_command(label="Help", command=self.show_help)
        # Add the Help menu to the menu bar
        self.menu_bar.add_cascade(label="Help", menu=self.help_menu)
        # Set the menu bar as the window's menu
        self.window.config(menu=self.menu_bar)
        # Styles
        style = ttk.Style()
        style.configure('TimeEntry.TEntry', padding=5, font=('Arial', 12), width=40)
        # Job name input
        self.job_name_label = ttk.Label(self.window, text="Job Name:")
        self.job_name_label.grid(row=0, column=0, padx=5, pady=5)
        self.job_name_entry = ttk.Entry(self.window, width=40)
        self.job_name_entry.grid(row=0, column=1, padx=5, pady=5)
        # Partition selection
        self.partition_label = ttk.Label(self.window, text="Partition:")
        self.partition_label.grid(row=1, column=0, padx=5, pady=5)
        self.partition_option = tk.StringVar()
        # ttk.OptionMenu takes the default value first, followed by the full list of
        # choices; without the explicit default the first choice is missing from the menu.
        self.partition_menu = ttk.OptionMenu(self.window, self.partition_option,
                                             self.partitions[0], *self.partitions)
        self.partition_menu.config(width=40)
        self.partition_menu.grid(row=1, column=1, padx=5, pady=5)
        # Number of cores selection
        self.num_cores_label = ttk.Label(self.window, text="Number of Cores:")
        self.num_cores_label.grid(row=2, column=0, padx=5, pady=5)
        self.num_cores_option = tk.StringVar()
        core_choices = [str(n) for n in self.num_cores]
        self.num_cores_menu = ttk.OptionMenu(self.window, self.num_cores_option,
                                             core_choices[0], *core_choices)
        self.num_cores_menu.config(width=40)
        self.num_cores_menu.grid(row=2, column=1, padx=5, pady=5)
        # Memory input
        self.memory_label = ttk.Label(self.window, text="Memory (e.g. 1G):")
        self.memory_label.grid(row=3, column=0, padx=5, pady=5)
        self.memory_entry = ttk.Entry(self.window, width=40)
        self.memory_entry.insert(0, '1G')
        self.memory_entry.grid(row=3, column=1, padx=5, pady=5)
        # Time limit input
        self.time_label = ttk.Label(self.window, text="Time Limit (e.g. 1-00:00:00):")
        self.time_label.grid(row=4, column=0, padx=5, pady=5)
        self.time_entry = ttk.Entry(self.window, style='TimeEntry.TEntry', width=40)
        self.time_entry.insert(0, '1-00:00:00')
        self.time_entry.grid(row=4, column=1, padx=5, pady=5)
        # Command input
        self.script_label = ttk.Label(self.window, text="Running Command:")
        self.script_label.grid(row=5, column=0, padx=5, pady=5)
        self.script_entry = ttk.Entry(self.window, width=40)
        self.script_entry.grid(row=5, column=1, padx=5, pady=5)
        # Check box - conda. A Checkbutton can be switched off again, which a single
        # Radiobutton cannot. The variable is kept on self so it is not garbage-collected.
        self.conda_var = tk.BooleanVar(value=False)
        self.conda_button = tk.Checkbutton(self.window, text='Enable Conda Environment',
                                           variable=self.conda_var, command=self.toggle)
        self.conda_button.grid(row=6, column=0, padx=5, pady=5)
        # Check box - espresso
        self.espresso_var = tk.BooleanVar(value=False)
        self.espresso_button = tk.Checkbutton(self.window, text='Enable Espresso Environment',
                                              variable=self.espresso_var,
                                              command=self.toggle_espresso)
        self.espresso_button.grid(row=6, column=1, padx=5, pady=5)
        # Submit button
        self.submit_button = ttk.Button(self.window, text="Submit", command=self.generate_script)
        self.submit_button.grid(row=7, column=0, padx=5, pady=5)
        self.show_file_button = ttk.Button(self.window, text="Show File", command=self.show_file_content)
        self.show_file_button.grid(column=1, row=7, padx=5, pady=5)
        self.close_button = tk.Button(self.window, text="Close", command=self.window.destroy)
        self.close_button.grid(column=2, row=7, padx=5, pady=5)
        self.window.mainloop()

    def toggle(self):
        # Mirror the state of the conda check box
        self.conda = self.conda_var.get()

    def toggle_espresso(self):
        # Mirror the state of the espresso check box
        self.espresso = self.espresso_var.get()

    def show_file_content(self):
        """Open a new window and display the content of the generated script."""
        job_name = self.job_name_entry.get()
        # Read the file first so that a missing file does not leave an empty window
        try:
            with open(f"{job_name}.sbatch", "r") as f:
                content = f.read()
        except FileNotFoundError:
            messagebox.showerror("File not found",
                                 f"{job_name}.sbatch does not exist yet. Press Submit first.")
            return
        # Create a new window
        file_win = tk.Toplevel(self.window)
        file_win.title("File Content")
        # Create a text widget to display the file content
        text = tk.Text(file_win, wrap="word")
        text.pack(side="left", fill="both", expand=True)
        # Add a scrollbar
        scrollbar = tk.Scrollbar(file_win, command=text.yview)
        scrollbar.pack(side="right", fill="y")
        text.config(yscrollcommand=scrollbar.set)
        text.insert("1.0", content)
        # Disable the text widget to prevent editing
        text.configure(state="disabled")

    def generate_script(self):
        """Write the sbatch script <job name>.sbatch from the values in the form."""
        job_name = self.job_name_entry.get()
        partition = self.partition_option.get()
        num_cores = int(self.num_cores_option.get())
        memory = self.memory_entry.get()
        time = self.time_entry.get()
        script = self.script_entry.get()
        conda_activate = "conda activate /home/ldft/.conda/envs/ANN_env"
        with open(f"{job_name}.sbatch", "w") as f:
            f.write("#!/bin/bash\n")
            f.write(f"#SBATCH --job-name={job_name}\n")
            f.write(f"#SBATCH --partition={partition}\n")
            f.write("#SBATCH --nodes=1\n")
            f.write(f"#SBATCH --ntasks-per-node={num_cores}\n")
            f.write(f"#SBATCH --mem={memory}\n")
            f.write(f"#SBATCH --time={time}\n")
            f.write(f"#SBATCH --output={job_name}.out\n")
            f.write(f"#SBATCH --error={job_name}.err\n\n")
            if self.conda:
                f.write("module load conda\n")
                f.write(f"{conda_activate}\n")
            if self.espresso:
                f.write("export PATH=$PATH:/home/ldft/Documents/install/qe-6.6/bin\n")
            f.write(f"srun mpirun -n {num_cores} {script}\n")

    def show_help(self):
        # Display the help information in a simple message box
        messagebox.showinfo("Help", (
            "This program is a GUI (Graphical User Interface) for preparing jobs for a "
            "SLURM (Simple Linux Utility for Resource Management) cluster.\n\n"
            "The GUI has several input fields for specifying the job parameters, such as "
            "the job name, the partition to run the job on, the number of cores, the "
            "memory limit, the time limit, and the command to be run.\n\n"
            "The GUI also has two check boxes to enable or disable a Conda environment "
            "and an Espresso environment.\n\n"
            "Pressing the \"Submit\" button calls generate_script, which writes a batch "
            "script named <job name>.sbatch with the specified parameters. The script is "
            "not submitted automatically; submit it with: sbatch <job name>.sbatch\n\n"
            "Pressing the \"Show File\" button calls show_file_content, which opens a new "
            "window and displays the content of the generated script."))


if __name__ == "__main__":
    SLURMSubmitGUI()
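The GUI writes the memory and time fields into the script exactly as typed, so a malformed value only shows up when sbatch rejects the job. The following is a minimal sketch of a check that could run before the file is written. The helper names is_valid_time_limit and is_valid_memory are illustrative, and the accepted formats are the ones listed in the sbatch manual page for --time and --mem; the sketch may be stricter than Slurm itself.

```python
import re

# Time formats documented for `sbatch --time`:
#   minutes, minutes:seconds, hours:minutes:seconds,
#   days-hours, days-hours:minutes, days-hours:minutes:seconds
_TIME_RE = re.compile(
    r"\d+"                          # minutes
    r"|\d+:\d+"                     # minutes:seconds
    r"|\d+:\d+:\d+"                 # hours:minutes:seconds
    r"|\d+-\d+(?::\d+(?::\d+)?)?"   # days-hours[:minutes[:seconds]]
)

# `sbatch --mem` takes a number with an optional K, M, G or T suffix
# (megabytes when no suffix is given).
_MEM_RE = re.compile(r"\d+[KMGT]?")


def is_valid_time_limit(value):
    """True if `value` matches one of the documented --time formats."""
    return _TIME_RE.fullmatch(value.strip()) is not None


def is_valid_memory(value):
    """True if `value` is a number with an optional K/M/G/T suffix."""
    return _MEM_RE.fullmatch(value.strip()) is not None


for candidate in ["1-00:00:00", "30", "1-", "one hour"]:
    print(candidate, is_valid_time_limit(candidate))
```

In the GUI, generate_script could call these helpers on the values of the memory and time entries and show a messagebox.showerror instead of writing the file when either check fails.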
Here are some useful pages/tutorials showing the basics of how to configure Slurm Scheduler on a single workstation.
Turn your workstation into a mini-grid (with Slurm)
Setting up a single server SLURM cluster
Installing/emulating SLURM on an Ubuntu 16.04 desktop
Slurm controler node