How to deploy Slurm

This how-to guide shows you how to deploy the Slurm workload manager as the resource management and job scheduling service of your Charmed HPC cluster. The deployment, management, and operations of Slurm are controlled by the Slurm charms.

Prerequisites

To successfully deploy Slurm in your Charmed HPC cluster, you will at least need:

Once you have verified that you have met the prerequisites above, proceed to the instructions below.
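
For example, you can quickly confirm that your Juju client can reach a bootstrapped controller and that the machine cloud used throughout this guide is registered (a minimal check; your controller and cloud names may differ from the ones shown in this guide):

juju clouds
juju controllers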

Deploy Slurm

You have two options for deploying Slurm:

  1. Using the Juju CLI client.

  2. Using the Juju Terraform client.

If you want to use Terraform to deploy Slurm, see the Manage terraform-provider-juju how-to guide for additional requirements.

First, use juju add-model to create the slurm model on your charmed-hpc machine cloud:

juju add-model slurm charmed-hpc
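
You can optionally confirm that the model was created and is now the active model by listing the models on your controller:

juju models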

Now use juju deploy to deploy Slurm’s services with MySQL as the storage database for slurmdbd:

Deploying slurmctld in high availability

If you want to deploy slurmctld with high availability enabled, do not follow the instructions below for deploying slurmctld alongside the other Slurm services.

See the Deploying slurmctld in high availability section for instructions on how to deploy slurmctld instead.

Deploying Slurm on LXD

Do not follow the instructions below for deploying Slurm if your backing cloud is LXD. On LXD, if you deploy the Slurm charms to system containers rather than virtual machines, Slurm cannot use the recommended process tracking plugin proctrack/cgroup, and additional modifications must be made to the default LXD profile.

See the Deploying Slurm on LXD section for instructions on the additional constraints that must be passed to Juju so that Slurm is deployed on virtual machines instead of system containers.

juju deploy sackd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmctld --base "ubuntu@24.04" --channel "edge"
juju deploy slurmd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmdbd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmrestd --base "ubuntu@24.04" --channel "edge"
juju deploy mysql --channel "8.0/stable"

After that, use juju integrate to integrate all of Slurm’s services together, and integrate slurmdbd with MySQL:

juju integrate slurmctld sackd
juju integrate slurmctld slurmd
juju integrate slurmctld slurmdbd
juju integrate slurmctld slurmrestd
juju integrate slurmdbd mysql:database
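
You can confirm that the integrations have been established by including the relations section in the status output (an optional check):

juju status --relations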

First, create the Terraform configuration file slurm/main.tf using mkdir and touch:

mkdir slurm
touch slurm/main.tf

Now open slurm/main.tf in a text editor and add the Juju Terraform provider to your configuration:

slurm/main.tf
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = "~> 1.0"
    }
  }
}

Next, create the slurm model on your charmed-hpc machine cloud:

slurm/main.tf
resource "juju_model" "slurm" {
  name = "slurm"
  cloud {
    name = "charmed-hpc"
  }
}

Now deploy Slurm’s services with MySQL as the storage database for slurmdbd:

Deploying slurmctld in high availability

If you want to deploy slurmctld with high availability enabled, do not follow the instructions below for deploying slurmctld alongside the other Slurm services.

See the Deploying slurmctld in high availability section for instructions on how to deploy slurmctld instead.

Deploying Slurm on LXD

Do not follow the instructions below for deploying Slurm if your backing cloud is LXD. On LXD, if you deploy the Slurm charms to system containers rather than virtual machines, Slurm cannot use the recommended process tracking plugin proctrack/cgroup, and additional modifications must be made to the default LXD profile.

See the Deploying Slurm on LXD section for instructions on the additional constraints that must be passed to Juju so that Slurm is deployed on virtual machines instead of system containers.

slurm/main.tf
module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operators//machines/terraform"
  model = juju_model.slurm.uuid
}

After that, integrate all of Slurm’s services together, and integrate slurmdbd with MySQL:

slurm/main.tf
resource "juju_integration" "sackd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.sackd.app_name
    endpoint = module.sackd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.login-node
  }
}

resource "juju_integration" "slurmd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmd.app_name
    endpoint = module.slurmd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmd
  }
}

resource "juju_integration" "slurmdbd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmdbd
  }
}

resource "juju_integration" "slurmrestd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmrestd.app_name
    endpoint = module.slurmrestd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmrestd
  }
}

resource "juju_integration" "slurmdbd_to_mysql" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.mysql.app_name
    endpoint = module.mysql.provides.database
  }

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.requires.database
  }
}

Now use the terraform command to initialize and apply your configuration. You can expand the dropdown below to view the full slurm/main.tf Terraform configuration before applying it:

terraform -chdir=slurm init
terraform -chdir=slurm apply -auto-approve
Full slurm/main.tf Terraform configuration file
slurm/main.tf
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = "~> 1.0"
    }
  }
}

resource "juju_model" "slurm" {
  name = "slurm"
  cloud {
    name = "charmed-hpc"
  }
}

module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_uuid  = juju_model.slurm.uuid
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operators//machines/terraform"
  model = juju_model.slurm.uuid
}

resource "juju_integration" "sackd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.sackd.app_name
    endpoint = module.sackd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.login-node
  }
}

resource "juju_integration" "slurmd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmd.app_name
    endpoint = module.slurmd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmd
  }
}

resource "juju_integration" "slurmdbd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmdbd
  }
}

resource "juju_integration" "slurmrestd_to_slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmrestd.app_name
    endpoint = module.slurmrestd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmrestd
  }
}

resource "juju_integration" "slurmdbd_to_mysql" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.mysql.app_name
    endpoint = module.mysql.provides.database
  }

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.requires.database
  }
}
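
If you would like to review the changes Terraform will make before they are applied, you can run terraform plan after init instead of applying immediately (an optional step):

terraform -chdir=slurm plan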

Your Slurm deployment will become active within a few minutes. The output of juju status will be similar to the following:

user@host:~$
juju status
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  charmed-hpc/default  3.6.0    unsupported  17:16:37Z

App         Version          Status  Scale  Charm       Channel      Rev  Exposed  Message
mysql       8.0.39-0ubun...  active      1  mysql       8.0/stable   313  no
sackd       23.11.4-1.2u...  active      1  sackd       latest/edge    4  no
slurmctld   23.11.4-1.2u...  active      1  slurmctld   latest/edge   86  no       primary - UP
slurmd      23.11.4-1.2u...  active      1  slurmd      latest/edge  107  no
slurmdbd    23.11.4-1.2u...  active      1  slurmdbd    latest/edge   78  no
slurmrestd  23.11.4-1.2u...  active      1  slurmrestd  latest/edge   80  no

Unit           Workload  Agent      Machine  Public address  Ports           Message
mysql/0*       active    idle       5        10.32.18.127    3306,33060/tcp  Primary
sackd/0*       active    idle       4        10.32.18.203
slurmctld/0*   active    idle       0        10.32.18.15                     primary - UP
slurmd/0*      active    idle       1        10.32.18.207
slurmdbd/0*    active    idle       2        10.32.18.102
slurmrestd/0*  active    idle       3        10.32.18.9

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.18.15   juju-d566c2-0  ubuntu@24.04      Running
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
2        started  10.32.18.102  juju-d566c2-2  ubuntu@24.04      Running
3        started  10.32.18.9    juju-d566c2-3  ubuntu@24.04      Running
4        started  10.32.18.203  juju-d566c2-4  ubuntu@24.04      Running
5        started  10.32.18.127  juju-d566c2-5  ubuntu@22.04      Running
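
If the applications are still settling, you can re-run juju status until everything reports active, or poll it periodically with the standard watch utility:

watch -n 10 juju status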

Deploying slurmctld in high availability

The slurmctld charm optionally supports high availability (HA) through the native functionality provided by Slurm. This functionality requires a low-latency shared filesystem; follow the instructions in the Deploy a shared filesystem section to deploy one.

Choosing a shared filesystem

See the Shared StateSaveLocation using filesystem-client charm section for guidance on choosing a shared filesystem. It is recommended that the HA filesystem not be the same as the filesystem used by the cluster's compute nodes, to prevent I/O-intensive user jobs from impacting slurmctld's responsiveness. The suggested approach is to deploy a dedicated HA filesystem and then provision a separate filesystem for the compute nodes.

Once the chosen shared filesystem has been deployed and made available through a proxy or provider charm, use the following instructions to deploy slurmctld with HA enabled, substituting [filesystem-provider] with the name of the provider charm.

In this example, two slurmctld units are deployed. One slurmctld unit acts as the primary Slurm controller, and the other unit serves as the backup controller:

juju deploy filesystem-client --channel latest/edge
juju integrate filesystem-client:filesystem [filesystem-provider]:filesystem

juju deploy slurmctld --base "ubuntu@24.04" --channel "edge" --num-units 2
juju integrate slurmctld:mount filesystem-client:mount
slurm/main.tf
module "filesystem-client" {
  source     = "git::https://github.com/charmed-hpc/filesystem-charms//charms/filesystem-client/terraform"
  model_uuid  = juju_model.slurm.uuid
}

resource "juju_integration" "provider_to_filesystem" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.[filesystem-provider].app_name
    endpoint = module.[filesystem-provider].provides.filesystem
  }

  application {
    name     = module.filesystem-client.app_name
    endpoint = module.filesystem-client.requires.filesystem
  }
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_uuid  = juju_model.slurm.uuid
  units       = 2
}

resource "juju_integration" "filesystem-to-slurmctld" {
  model_uuid = juju_model.slurm.uuid

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.provides.mount
  }

  application {
    name     = module.filesystem-client.app_name
    endpoint = module.filesystem-client.requires.mount
  }
}
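
After adding these blocks to slurm/main.tf, run terraform init again so that the new modules are downloaded, then re-apply the configuration:

terraform -chdir=slurm init
terraform -chdir=slurm apply -auto-approve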

Once slurmctld is scaled up, the output of juju status will be similar to the following. The output may differ depending on the shared filesystem you chose; CephFS is used in this example to provide the HA filesystem:

user@host:~$
juju status
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  charmed-hpc/default  3.6.0    unsupported  17:16:37Z

App                 Version          Status  Scale  Charm                Channel      Rev  Exposed  Message
cephfs-server-proxy                  active      1  cephfs-server-proxy  latest/edge   25  no
filesystem-client                    active      2  filesystem-client    latest/edge   20  no       Integrated with `cephfs` provider
mysql               8.0.39-0ubun...  active      1  mysql                8.0/stable   313  no
sackd               23.11.4-1.2u...  active      1  sackd                latest/edge    4  no
slurmctld           23.11.4-1.2u...  active      2  slurmctld            latest/edge   86  no       primary - UP
slurmd              23.11.4-1.2u...  active      1  slurmd               latest/edge  107  no
slurmdbd            23.11.4-1.2u...  active      1  slurmdbd             latest/edge   78  no
slurmrestd          23.11.4-1.2u...  active      1  slurmrestd           latest/edge   80  no

Unit                    Workload  Agent      Machine  Public address  Ports           Message
mysql/0*                active    idle       5        10.32.18.127    3306,33060/tcp  Primary
sackd/0*                active    idle       4        10.32.18.203
slurmctld/0*            active    idle       0        10.32.18.15                     primary - UP
  filesystem-client/0*  active    idle                10.32.18.15                     Mounted filesystem at `/srv/slurmctld-statefs`
slurmctld/1             active    idle       6        10.32.18.204                    backup - UP
  filesystem-client/1   active    idle                10.32.18.204                    Mounted filesystem at `/srv/slurmctld-statefs`
slurmd/0*               active    idle       1        10.32.18.207
slurmdbd/0*             active    idle       2        10.32.18.102
slurmrestd/0*           active    idle       3        10.32.18.9

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.18.15   juju-d566c2-0  ubuntu@24.04      Running
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
2        started  10.32.18.102  juju-d566c2-2  ubuntu@24.04      Running
3        started  10.32.18.9    juju-d566c2-3  ubuntu@24.04      Running
4        started  10.32.18.203  juju-d566c2-4  ubuntu@24.04      Running
5        started  10.32.18.127  juju-d566c2-5  ubuntu@22.04      Running
6        started  10.32.18.204  juju-d566c2-6  ubuntu@24.04      Running
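
You can also ask Slurm itself which controller is currently acting as primary and whether the backup is reachable by running scontrol ping from a unit that has the Slurm client tools installed, such as the login node used earlier in this guide:

juju exec --unit sackd/0 -- scontrol ping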

Deploying Slurm on LXD

Pass the constraint "virt-type=virtual-machine" to Juju to deploy Slurm on virtual machines instead of system containers:

juju deploy sackd \
  --base "ubuntu@24.04" \
  --channel "edge" \
  --constraints="virt-type=virtual-machine"

juju deploy slurmctld \
  --base "ubuntu@24.04" \
  --channel "edge" \
  --constraints="virt-type=virtual-machine"

juju deploy slurmd \
  --base "ubuntu@24.04" \
  --channel "edge" \
  --constraints="virt-type=virtual-machine"

juju deploy slurmdbd \
  --base "ubuntu@24.04" \
  --channel "edge" \
  --constraints="virt-type=virtual-machine"

juju deploy slurmrestd \
  --base "ubuntu@24.04" \
  --channel "edge" \
  --constraints="virt-type=virtual-machine"

juju deploy mysql \
  --channel "8.0/stable" \
  --constraints="virt-type=virtual-machine"
slurm/main.tf
module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_uuid  = juju_model.slurm.uuid
  constraints = "virt-type=virtual-machine"
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_uuid  = juju_model.slurm.uuid
  constraints = "virt-type=virtual-machine"
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_uuid  = juju_model.slurm.uuid
  constraints = "virt-type=virtual-machine"
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_uuid  = juju_model.slurm.uuid
  constraints = "virt-type=virtual-machine"
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_uuid  = juju_model.slurm.uuid
  constraints = "virt-type=virtual-machine"
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operators//machines/terraform"
  model  = juju_model.slurm.uuid
  constraints     = "virt-type=virtual-machine"
}
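
Once the deployment is up, you can verify that the instances were provisioned as virtual machines rather than system containers by listing them on the LXD host; the TYPE column should read VIRTUAL-MACHINE (this assumes you have access to the lxc client on the host backing your cloud):

lxc list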

Set compute nodes to idle

After your Slurm deployment becomes active, compute nodes are initially registered with their state set to down. You can use the set-node-state action to set a compute node's state to idle so that it becomes available to run scheduled jobs. For example, to set the state of compute node slurmd-0 to idle, run:

juju run slurmctld/leader set-node-state nodes="slurmd-0" state=idle

Tips

  1. You can get the Slurm node name of a compute node by replacing the forward slash (/) in the unit's name with a dash (-). For example, the unit slurmd/0 in Juju would be named slurmd-0 in Slurm.

    juju status can be used to find unit names. For example, to list all the units that belong to the slurmd application, run:

user@host:~$
juju status slurmd
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  charmed-hpc/default  3.6.0    unsupported  17:16:37Z

App         Version          Status  Scale  Charm       Channel      Rev  Exposed  Message
slurmd      23.11.4-1.2u...  active      1  slurmd      latest/edge  107  no

Unit           Workload  Agent      Machine  Public address  Ports           Message
slurmd/0*      active    idle       1        10.32.18.207

Machine  State    Address       Inst id        Base          AZ  Message
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
  2. The nodes parameter of the set-node-state action accepts node name ranges for updating the state of multiple nodes at once. For example, to set the state of compute nodes slurmd-0 through slurmd-9 to idle, use the node name range slurmd-[0-9]:

juju run slurmctld/leader set-node-state nodes="slurmd-[0-9]" state=idle

Verify compute nodes are idle

You can use sinfo with juju exec to verify that a compute node’s state is idle. For example, to check if node slurmd-0 is idle:

user@host:~$
juju exec --unit sackd/0 -- sinfo --nodes slurmd-0
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmd       up   infinite      1   idle slurmd-0

To verify that all the nodes in a partition are idle, run sinfo without the --nodes flag:

user@host:~$
juju exec --unit sackd/0 -- sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmd       up   infinite     10   idle slurmd-[0-9]
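
Once the nodes are idle, you can optionally run a trivial command through srun as a quick smoke test to confirm that jobs are scheduled and executed (this runs the job as root through juju exec, which your cluster's configuration must permit):

juju exec --unit sackd/0 -- srun --nodes 1 hostname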

Next steps

Now that Slurm is deployed, you can deploy the shared filesystem of your Charmed HPC cluster. See the Deploy a shared filesystem section for instructions.

You can also explore the Glossary for further information on sackd, slurmctld, slurmd, slurmdbd, slurmrestd, and MySQL and how they are managed by their respective charms.