Tuesday, Jan 16
I’ve been teaching students a discipline we’ve called “data engineering”, and the Slurm documentation is quite bad: it obfuscates how to do a very simple thing, namely how to set up a Slurm cluster.
And so I thought I’d put this up here so that it’s google-able.
- Make sure you have two or more nodes (you can do it with one, but that’s trivial) on a private address range with the same users defined (preferably through a directory service, but whatever) and, for sanity’s sake, a shared filesystem.
- Install `slurmctld` on the “head” node and `slurmd` on any compute nodes, as well as `munge` on all nodes. On AlmaLinux 9 these are all in your repos.
- Generate a munge key with `create-munge-key` and copy it to all the nodes (`/etc/munge/munge.key`).
- Enable and start `munge` with `systemctl`.
- Edit `/etc/slurm/slurm.conf` to make sure all the nodes are in it at the end. There should be one node name line; the default one in AlmaLinux has `localhost` in it, so replace it with a comma-separated list of compute nodes and their shape (number of cores). You also need to change the `SlurmctldHost` to an external interface on the “head” node.
- Make sure `/etc/slurm/slurm.conf` is identical on all nodes (a shared filesystem can help here).
- Enable and start `slurmd` on each compute node (with `systemctl`).
- Enable and start `slurmctld` on the head node (with `systemctl`).
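The `slurm.conf` step is the one most likely to trip people up, so here is a rough sketch of the two lines that change. The hostnames (`head01`, `node01`, `node02`) and the core count are made-up placeholders, and `CPUs=` with `State=UNKNOWN` is just one common way of describing a node. The script only prints the lines; it does not touch `/etc/slurm/slurm.conf`.

```shell
#!/bin/sh
# Print the slurm.conf lines that need editing.
# All values below are hypothetical placeholders; substitute your own.
HEAD_NODE="head01"             # hostname of the head node (assumed)
COMPUTE_NODES="node01,node02"  # comma-separated compute nodes (assumed)
CORES_PER_NODE=16              # cores on each compute node (assumed)

slurm_conf_lines() {
    # SlurmctldHost names the machine running slurmctld.
    printf 'SlurmctldHost=%s\n' "$HEAD_NODE"
    # One NodeName line replaces the default localhost entry.
    printf 'NodeName=%s CPUs=%s State=UNKNOWN\n' "$COMPUTE_NODES" "$CORES_PER_NODE"
}

slurm_conf_lines
```

Paste the output over the corresponding lines at the end of the file on the head node, then copy the file to every other node.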
You should now be able to use `sinfo`/`squeue` etc. and submit jobs.
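A quick way to check this is to submit a trivial job. The sketch below writes a small batch script (`hello.sbatch` is a made-up filename, and the `#SBATCH` values are arbitrary) and submits it only if `sbatch` is on the `PATH`, so it is safe to run anywhere.

```shell
#!/bin/sh
# Write a trivial batch job and submit it if Slurm is available.
JOB_SCRIPT="hello.sbatch"   # hypothetical filename

# The job just prints the name of the node it ran on.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/sh
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out
hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
    sinfo                    # list partitions and node states
    sbatch "$JOB_SCRIPT"     # submit the job
    squeue                   # show the queue, including the new job
else
    echo "sbatch not found; run this on a node with the Slurm client tools" >&2
fi
```

If the job ran, the output file should contain the hostname of one of your compute nodes.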
To add a new node, add it to `slurm.conf` (synchronise on every node) and then either restart every instance of `slurmd` and `slurmctld` or run `scontrol reconfigure`.
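As a rough illustration of the edit involved, here is a sketch that appends a node to the comma-separated node list. `add_node` is a hypothetical helper, not part of Slurm, the node names are placeholders, and it assumes the new node has the same shape as the existing ones. It works on a scratch file rather than the real `/etc/slurm/slurm.conf`.

```shell
#!/bin/sh
# Append a node to the comma-separated NodeName line of a slurm.conf copy.
# add_node is a hypothetical helper, not part of Slurm.
add_node() {
    conf="$1"
    new_node="$2"
    # Insert ",<new_node>" right after the existing node list.
    sed "s/^NodeName=\([^ ]*\)/NodeName=\1,${new_node}/" "$conf" > "${conf}.tmp" &&
        mv "${conf}.tmp" "$conf"
}

# Demonstrate on a scratch file rather than /etc/slurm/slurm.conf.
CONF="$(mktemp)"
printf 'NodeName=node01,node02 CPUs=16 State=UNKNOWN\n' > "$CONF"
add_node "$CONF" "node03"
cat "$CONF"

# After editing the real file and copying it to every node, either restart
# slurmd and slurmctld or run:
#   scontrol reconfigure
```

This prints `NodeName=node01,node02,node03 CPUs=16 State=UNKNOWN`.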