Tuesday, Jan 16
I’ve been teaching students a discipline we’ve called “data engineering”, and the documentation of Slurm is quite bad: it obfuscates how to do a very simple thing, namely how to set up a Slurm cluster.
And so I thought I’d put this up here so that it’s google-able.
- Make sure you have two or more nodes (you can do this with one, but that’s trivial) on a private address range, with the same users defined on all of them (preferably through a directory service, but whatever works) and, for sanity’s sake, a shared filesystem.
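For example (these addresses and hostnames are made up), every node could carry the same `/etc/hosts` entries so the nodes can all resolve each other:

```
# hypothetical private range; one line per node, identical on every node
10.0.0.1   head
10.0.0.2   node1
10.0.0.3   node2
```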
- Install `slurmctld` on the “head” node, `slurmd` on any compute nodes, and `munge` on all nodes. On AlmaLinux 9 these are all in your repos.
- Generate a munge key with `create-munge-key` and copy it to all the nodes (`/etc/munge/munge.key`).
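The key itself is just 1 KiB of random data with tight permissions, so if `create-munge-key` isn’t handy you can do the equivalent by hand (the node names in the copy loop below are placeholders):

```shell
# 1 KiB of random data is what create-munge-key produces
dd if=/dev/urandom of=munge.key bs=1024 count=1 2>/dev/null
chmod 600 munge.key

# then push the same key to every node, e.g.:
# for n in node1 node2; do scp munge.key root@"$n":/etc/munge/munge.key; done
```

On each node the key should end up owned by the `munge` user and unreadable by anyone else, or the daemon will refuse to start.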
- Enable and start `munge` with `systemctl`.
- Edit `/etc/slurm/slurm.conf` to make sure all the nodes are in it at the end. There should be one `NodeName` line; the default one in AlmaLinux has `localhost` in it, so replace it with a comma-separated list of compute nodes and their shape (number of cores). You also need to change `SlurmctldHost` to an external interface on the “head” node.
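As a sketch (the hostnames and core counts here are made up, adjust to your machines), the relevant lines might look like:

```
SlurmctldHost=head
NodeName=node1,node2 CPUs=8 State=UNKNOWN
PartitionName=main Nodes=node1,node2 Default=YES MaxTime=INFINITE State=UP
```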
- Make sure `/etc/slurm/slurm.conf` is identical on all nodes (a shared filesystem can help here).
- Enable and start `slurmd` on each compute node (with `systemctl`).
- Enable and start `slurmctld` on the head node (with `systemctl`).
You should now be able to use `sinfo`/`squeue` etc. and submit jobs.
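A quick smoke test: a minimal batch script (the job name and output path are arbitrary) that just prints the hostname of whichever node runs it:

```shell
#!/bin/bash
#SBATCH --job-name=hello        # arbitrary job name
#SBATCH --ntasks=1              # one task is enough for a smoke test
#SBATCH --output=hello-%j.out   # %j expands to the job ID
hostname
```

Submit it with `sbatch`, watch it with `squeue`, and check the `hello-<jobid>.out` file afterwards; it should contain the name of one of your compute nodes.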
To add a new node, add it to `slurm.conf` (synchronised on every node) and then either restart every instance of `slurmd` and `slurmctld`, or run `scontrol reconfigure`.