How to deploy Slurm¶
This how-to guide shows you how to deploy the Slurm workload manager as the resource management and job scheduling service of your Charmed HPC cluster. The deployment, management, and operations of Slurm are controlled by the Slurm charms.
Prerequisites¶
To successfully deploy Slurm in your Charmed HPC cluster, you will need at least:
The Juju CLI client installed on your machine.
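If you want to confirm this prerequisite before continuing, you can print the client's version and list the clouds it knows about; the exact output depends on how Juju was installed and which clouds are registered:
juju version
juju clouds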
Once you have verified that you have met the prerequisites above, proceed to the instructions below.
Deploy Slurm¶
You have two options for deploying Slurm:
Using the Juju CLI client.
Using the Juju Terraform client.
If you want to use Terraform to deploy Slurm, see the
Manage terraform-provider-juju how-to guide for additional
requirements.
First, use juju add-model to create the slurm model on your charmed-hpc
machine cloud:
juju add-model slurm charmed-hpc
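To confirm that the model was created and is now the active model, you can optionally list your models:
juju models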
Now use juju deploy to deploy Slurm’s services with MySQL as
the storage database for slurmdbd:
Deploying slurmctld in high availability
If you wish to deploy slurmctld with high availability enabled, do not follow the instructions below for deploying slurmctld alongside the other Slurm services.
See the Deploying slurmctld in high availability section for instructions on how to deploy slurmctld instead.
Deploying Slurm on LXD
Do not follow the instructions below for deploying Slurm if your backing cloud is LXD.
On LXD, if you deploy the Slurm charms to system containers rather than virtual machines,
Slurm cannot use the recommended process tracking plugin proctrack/cgroup,
and additional modifications must be made to the default LXD profile.
See the Deploying Slurm on LXD section for instructions on the additional constraints that must be passed to Juju so that Slurm is deployed on virtual machines instead of system containers.
juju deploy sackd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmctld --base "ubuntu@24.04" --channel "edge"
juju deploy slurmd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmdbd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmrestd --base "ubuntu@24.04" --channel "edge"
juju deploy mysql --channel "8.0/stable"
After that, use juju integrate to integrate all of Slurm’s services together,
and integrate slurmdbd with MySQL:
juju integrate slurmctld sackd
juju integrate slurmctld slurmd
juju integrate slurmctld slurmdbd
juju integrate slurmctld slurmrestd
juju integrate slurmdbd mysql:database
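While the charms deploy and integrate, you can optionally watch the deployment settle instead of re-running juju status by hand, for example with a two-second refresh interval:
juju status --watch 2s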
First, create the Terraform configuration file slurm/main.tf using
mkdir and touch:
mkdir slurm
touch slurm/main.tf
Now open slurm/main.tf in a text editor and add the Juju Terraform provider to your configuration:
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = "~> 1.0"
    }
  }
}
Next, create the slurm model on your charmed-hpc machine cloud:
resource "juju_model" "slurm" {
name = "slurm"
cloud {
name = "charmed-hpc"
}
}
Now deploy Slurm’s services with MySQL as the storage database for slurmdbd:
Deploying slurmctld in high availability
If you wish to deploy slurmctld with high availability enabled, do not follow the instructions below for deploying slurmctld alongside the other Slurm services.
See the Deploying slurmctld in high availability section for instructions on how to deploy slurmctld instead.
Deploying Slurm on LXD
Do not follow the instructions below for deploying Slurm if your backing cloud is LXD.
On LXD, if you deploy the Slurm charms to system containers rather than virtual machines,
Slurm cannot use the recommended process tracking plugin proctrack/cgroup,
and additional modifications must be made to the default LXD profile.
See the Deploying Slurm on LXD section for instructions on the additional constraints that must be passed to Juju so that Slurm is deployed on virtual machines instead of system containers.
module "sackd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
model_uuid = juju_model.slurm.uuid
}
module "slurmctld" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
model_uuid = juju_model.slurm.uuid
}
module "slurmd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
model_uuid = juju_model.slurm.uuid
}
module "slurmdbd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
model_uuid = juju_model.slurm.uuid
}
module "slurmrestd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
model_uuid = juju_model.slurm.uuid
}
module "mysql" {
source = "git::https://github.com/canonical/mysql-operators//machines/terraform"
model = juju_model.slurm.uuid
}
After that, integrate all of Slurm’s services together, and integrate slurmdbd with MySQL:
resource "juju_integration" "sackd_to_slurmctld" {
model_uuid = juju_model.slurm.uuid
application {
name = module.sackd.app_name
endpoint = module.sackd.provides.slurmctld
}
application {
name = module.slurmctld.app_name
endpoint = module.slurmctld.requires.login-node
}
}
resource "juju_integration" "slurmd_to_slurmctld" {
model_uuid = juju_model.slurm.uuid
application {
name = module.slurmd.app_name
endpoint = module.slurmd.provides.slurmctld
}
application {
name = module.slurmctld.app_name
endpoint = module.slurmctld.requires.slurmd
}
}
resource "juju_integration" "slurmdbd_to_slurmctld" {
model_uuid = juju_model.slurm.uuid
application {
name = module.slurmdbd.app_name
endpoint = module.slurmdbd.provides.slurmctld
}
application {
name = module.slurmctld.app_name
endpoint = module.slurmctld.requires.slurmdbd
}
}
resource "juju_integration" "slurmrestd_to_slurmctld" {
model_uuid = juju_model.slurm.uuid
application {
name = module.slurmrestd.app_name
endpoint = module.slurmrestd.provides.slurmctld
}
application {
name = module.slurmctld.app_name
endpoint = module.slurmctld.requires.slurmrestd
}
}
resource "juju_integration" "slurmdbd_to_mysql" {
model_uuid = juju_model.slurm.uuid
application {
name = module.mysql.app_name
endpoint = module.mysql.provides.database
}
application {
name = module.slurmdbd.app_name
endpoint = module.slurmdbd.requires.database
}
}
You can expand the dropdown below to view the full slurm/main.tf Terraform
configuration before applying it. Now use the terraform command to apply
your configuration:
terraform -chdir=slurm init
terraform -chdir=slurm apply -auto-approve
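If you prefer to review the planned changes before applying them, you can also validate the configuration and run a plan first:
terraform -chdir=slurm validate
terraform -chdir=slurm plan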
Full slurm/main.tf Terraform configuration file
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = "~> 1.0"
    }
  }
}

resource "juju_model" "slurm" {
  name = "slurm"
  cloud {
    name = "charmed-hpc"
  }
}

module "sackd" {
  source     = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_uuid = juju_model.slurm.uuid
}

module "slurmctld" {
  source     = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_uuid = juju_model.slurm.uuid
}

module "slurmd" {
  source     = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_uuid = juju_model.slurm.uuid
}

module "slurmdbd" {
  source     = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_uuid = juju_model.slurm.uuid
}

module "slurmrestd" {
  source     = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_uuid = juju_model.slurm.uuid
}

module "mysql" {
  source = "git::https://github.com/canonical/mysql-operators//machines/terraform"
  model  = juju_model.slurm.uuid
}

resource "juju_integration" "sackd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid
  application {
    name     = module.sackd.app_name
    endpoint = module.sackd.provides.slurmctld
  }
  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.login-node
  }
}

resource "juju_integration" "slurmd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid
  application {
    name     = module.slurmd.app_name
    endpoint = module.slurmd.provides.slurmctld
  }
  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmd
  }
}

resource "juju_integration" "slurmdbd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid
  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.provides.slurmctld
  }
  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmdbd
  }
}

resource "juju_integration" "slurmrestd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid
  application {
    name     = module.slurmrestd.app_name
    endpoint = module.slurmrestd.provides.slurmctld
  }
  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmrestd
  }
}

resource "juju_integration" "slurmdbd_to_mysql" {
  model_uuid = juju_model.slurm.uuid
  application {
    name     = module.mysql.app_name
    endpoint = module.mysql.provides.database
  }
  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.requires.database
  }
}
Your Slurm deployment will become active within a few minutes. The output of
juju status will be similar to the following:
user@host:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
slurm charmed-hpc charmed-hpc/default 3.6.0 unsupported 17:16:37Z
App Version Status Scale Charm Channel Rev Exposed Message
mysql 8.0.39-0ubun... active 1 mysql 8.0/stable 313 no
sackd 23.11.4-1.2u... active 1 sackd latest/edge 4 no
slurmctld 23.11.4-1.2u... active 1 slurmctld latest/edge 86 no primary - UP
slurmd 23.11.4-1.2u... active 1 slurmd latest/edge 107 no
slurmdbd 23.11.4-1.2u... active 1 slurmdbd latest/edge 78 no
slurmrestd 23.11.4-1.2u... active 1 slurmrestd latest/edge 80 no
Unit Workload Agent Machine Public address Ports Message
mysql/0* active idle 5 10.32.18.127 3306,33060/tcp Primary
sackd/0* active idle 4 10.32.18.203
slurmctld/0* active idle 0 10.32.18.15 primary - UP
slurmd/0* active idle 1 10.32.18.207
slurmdbd/0* active idle 2 10.32.18.102
slurmrestd/0* active idle 3 10.32.18.9
Machine State Address Inst id Base AZ Message
0 started 10.32.18.15 juju-d566c2-0 ubuntu@24.04 Running
1 started 10.32.18.207 juju-d566c2-1 ubuntu@24.04 Running
2 started 10.32.18.102 juju-d566c2-2 ubuntu@24.04 Running
3 started 10.32.18.9 juju-d566c2-3 ubuntu@24.04 Running
4 started 10.32.18.203 juju-d566c2-4 ubuntu@24.04 Running
5 started 10.32.18.127 juju-d566c2-5 ubuntu@22.04 Running
Deploying slurmctld in high availability¶
The slurmctld charm optionally supports high availability (HA)
through the native functionality provided by Slurm. This functionality requires a
low-latency shared filesystem; follow the instructions in the
Deploy a shared filesystem section to deploy one.
Choosing a shared filesystem
See the Shared StateSaveLocation using filesystem-client charm
section for guidance on choosing a shared filesystem. It is recommended that the
HA filesystem not be the same as the filesystem used by the cluster's compute nodes,
to prevent I/O-intensive user jobs from impacting slurmctld’s responsiveness.
The suggested approach is to deploy a dedicated HA filesystem and then
provision a separate filesystem for the compute nodes.
Once your chosen shared filesystem has been deployed and made available through a proxy or provider charm,
use the following instructions to deploy slurmctld with HA enabled, substituting [filesystem-provider]
with the name of the provider charm.
In this example, two slurmctld units are deployed. One slurmctld unit acts as the primary Slurm controller, and the other unit serves as the backup controller:
juju deploy filesystem-client --channel latest/edge
juju integrate filesystem-client:filesystem [filesystem-provider]:filesystem
juju deploy slurmctld --base "ubuntu@24.04" --channel "edge" --num-units 2
juju integrate slurmctld:mount filesystem-client:mount
module "filesystem-client" {
source = "git::https://github.com/charmed-hpc/filesystem-charms//charms/filesystem-client/terraform"
model_uuid = juju_model.slurm.uuid
}
resource "juju_integration" "provider_to_filesystem" {
model_uuid = juju_model.slurm.uuid
application {
name = module.[filesystem-provider].app_name
endpoint = module.[filesystem-provider].provides.filesystem
}
application {
name = module.filesystem-client.app_name
endpoint = module.filesystem-client.requires.filesystem
}
}
module "slurmctld" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
model_uuid = juju_model.slurm.uuid
units = 2
}
resource "juju_integration" "filesystem-to-slurmctld" {
model_uuid = juju_model.slurm.uuid
application {
name = module.slurmctld.app_name
endpoint = module.slurmctld.provides.mount
}
application {
name = module.filesystem-client.app_name
endpoint = module.filesystem-client.requires.mount
}
}
Once slurmctld is scaled up, the output of juju status will be similar
to the following. The output may differ depending on the shared filesystem you chose;
CephFS is used in this example to provide the HA filesystem:
user@host:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
slurm charmed-hpc charmed-hpc/default 3.6.0 unsupported 17:16:37Z
App Version Status Scale Charm Channel Rev Exposed Message
cephfs-server-proxy active 1 cephfs-server-proxy latest/edge 25 no
filesystem-client active 1 filesystem-client latest/edge 20 no Integrated with `cephfs` provider
mysql 8.0.39-0ubun... active 1 mysql 8.0/stable 313 no
sackd 23.11.4-1.2u... active 1 sackd latest/edge 4 no
slurmctld 23.11.4-1.2u... active 2 slurmctld latest/edge 86 no primary - UP
slurmd 23.11.4-1.2u... active 1 slurmd latest/edge 107 no
slurmdbd 23.11.4-1.2u... active 1 slurmdbd latest/edge 78 no
slurmrestd 23.11.4-1.2u... active 1 slurmrestd latest/edge 80 no
Unit Workload Agent Machine Public address Ports Message
mysql/0* active idle 5 10.32.18.127 3306,33060/tcp Primary
sackd/0* active idle 4 10.32.18.203
slurmctld/0* active idle 0 10.32.18.15 primary - UP
filesystem-client/0* active idle 10.32.18.15 Mounted filesystem at `/srv/slurmctld-statefs`
slurmctld/1 active idle 6 10.32.18.204 backup - UP
filesystem-client/1 active idle 10.32.18.204 Mounted filesystem at `/srv/slurmctld-statefs`
slurmd/0* active idle 1 10.32.18.207
slurmdbd/0* active idle 2 10.32.18.102
slurmrestd/0* active idle 3 10.32.18.9
Machine State Address Inst id Base AZ Message
0 started 10.32.18.15 juju-d566c2-0 ubuntu@24.04 Running
1 started 10.32.18.207 juju-d566c2-1 ubuntu@24.04 Running
2 started 10.32.18.102 juju-d566c2-2 ubuntu@24.04 Running
3 started 10.32.18.9 juju-d566c2-3 ubuntu@24.04 Running
4 started 10.32.18.203 juju-d566c2-4 ubuntu@24.04 Running
5 started 10.32.18.127 juju-d566c2-5 ubuntu@22.04 Running
6 started 10.32.18.204 juju-d566c2-6 ubuntu@24.04 Running
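You can also confirm the primary and backup roles from Slurm itself. This optional check runs scontrol ping on the login node (the sackd/0 unit in this example), which reports whether each configured slurmctld is UP:
juju exec --unit sackd/0 -- scontrol ping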
Deploying Slurm on LXD¶
Pass the constraint "virt-type=virtual-machine" to Juju to deploy
Slurm on virtual machines instead of system containers:
juju deploy sackd \
--base "ubuntu@24.04" \
--channel "edge" \
--constraints="virt-type=virtual-machine"
juju deploy slurmctld \
--base "ubuntu@24.04" \
--channel "edge" \
--constraints="virt-type=virtual-machine"
juju deploy slurmd \
--base "ubuntu@24.04" \
--channel "edge" \
--constraints="virt-type=virtual-machine"
juju deploy slurmdbd \
--base "ubuntu@24.04" \
--channel "edge" \
--constraints="virt-type=virtual-machine"
juju deploy slurmrestd \
--base "ubuntu@24.04" \
--channel "edge" \
--constraints="virt-type=virtual-machine"
juju deploy mysql \
--channel "8.0/stable" \
--constraints="virt-type=virtual-machine"
module "sackd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
model_uuid = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
module "slurmctld" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
model_uuid = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
module "slurmd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
model_uuid = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
module "slurmdbd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
model_uuid = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
module "slurmrestd" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
model_uuid = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
module "mysql" {
source = "git::https://github.com/canonical/mysql-operators//machines/terraform"
model = juju_model.slurm.uuid
constraints = "virt-type=virtual-machine"
}
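If you would rather not repeat the constraint on every juju deploy command, one possible alternative for the CLI workflow is to set it once as a default for the slurm model so that new machines are provisioned as virtual machines:
juju set-model-constraints --model slurm virt-type=virtual-machine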
Set compute nodes to idle¶
Compute nodes are initially registered with their state set to down after your
Slurm deployment becomes active. You can use the set-node-state action to set the
compute nodes’ state to idle to make them available for scheduled jobs.
For example, to set the state of compute node slurmd-0 to idle, run:
juju run slurmctld/leader set-node-state nodes="slurmd-0" state=idle
Tips
You can get the node name of a compute node by substituting the forward slash (/) character in the unit’s name with the dash (-) character. For example, the unit slurmd/0 in Juju would be named slurmd-0 in Slurm. juju status can be used to find unit names. For example, to list all the units that belong to the slurmd application, run:
user@host:~$ juju status slurmd
Model Controller Cloud/Region Version SLA Timestamp
slurm charmed-hpc charmed-hpc/default 3.6.0 unsupported 17:16:37Z
App Version Status Scale Charm Channel Rev Exposed Message
slurmd 23.11.4-1.2u... active 1 slurmd latest/edge 107 no
Unit Workload Agent Machine Public address Ports Message
slurmd/0* active idle 1 10.32.18.207
Machine State Address Inst id Base AZ Message
1 started 10.32.18.207 juju-d566c2-1 ubuntu@24.04 Running
The nodes parameter of the set-node-state action accepts node name ranges for updating the state of multiple nodes at once. For example, to set the state of compute nodes slurmd-0 through slurmd-9 to idle, the node name range slurmd-[0-9] can be used:
juju run slurmctld/leader set-node-state nodes="slurmd-[0-9]" state=idle
Verify compute nodes are idle¶
You can use sinfo with juju exec to verify that a
compute node’s state is idle. For example, to check if node slurmd-0 is idle:
user@host:~$ juju exec --unit sackd/0 -- sinfo --nodes slurmd-0
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
slurmd up infinite 1 idle slurmd-0
To verify that all the nodes in a partition are idle, run sinfo without the
--nodes flag:
user@host:~$ juju exec --unit sackd/0 -- sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
slurmd up infinite 10 idle slurmd-[0-9]
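As a final smoke test, you can run a trivial job from the login node to confirm that the controller schedules work onto the now-idle compute nodes. A minimal sketch, again using the sackd/0 unit as the login node:
juju exec --unit sackd/0 -- srun --nodes 1 hostname
The job should print the hostname of whichever compute node Slurm allocated.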
Next steps¶
Now that Slurm is deployed, you can deploy the shared filesystem of your Charmed HPC cluster by following the Deploy a shared filesystem guide.
You can also explore the Glossary for further information on sackd, slurmctld, slurmd, slurmdbd, slurmrestd, and MySQL, and how they are managed by their respective charms.