Getting started with Charmed HPC¶
This tutorial takes you through multiple aspects of Charmed HPC, such as:
Building a small Charmed HPC cluster with a shared filesystem
Preparing and submitting a multi-node batch job to your Charmed HPC cluster’s workload scheduler
Creating and using a container image to provide the runtime environment for a submitted batch job
By the end of this tutorial, you will have worked with a variety of open source projects, such as:
Multipass
Juju
Charms
Apptainer
Ceph
Slurm
This tutorial assumes that you have had some exposure to high-performance computing concepts such as batch scheduling, but does not assume prior experience building HPC clusters. This tutorial also does not expect you to have any prior experience with the listed projects.
Using Charmed HPC in production
The Charmed HPC cluster built in this tutorial is for learning purposes and should not be used as the basis for a production HPC cluster. For more in-depth steps on how to deploy a fully operational Charmed HPC cluster, see Charmed HPC’s How-to guides.
Prerequisites¶
To successfully complete this tutorial, you will need:
At least 8 CPU cores, 16GB RAM, and 40GB storage available
An active internet connection
Create a virtual machine with Multipass¶
First, download a copy of the cloud initialization (cloud-init) file, charmed-hpc-tutorial-cloud-init.yml, that defines the underlying cloud infrastructure for the virtual machine.
For this tutorial, the file includes instructions for creating and configuring your localhost LXD machine cloud with the charmed-hpc-controller Juju controller, and for creating the workload and submission scripts for the example jobs. The cloud-init step is completed as part of the virtual machine launch, so you do not need to set it up manually. You can expand the dropdown below to view the full cloud-init file before downloading it onto your local system:
charmed-hpc-tutorial-cloud-init.yml
#cloud-config

# Ensure VM is fully up-to-date multipass does not support reboots.
# See: https://github.com/canonical/multipass/issues/4199
# Package management
package_reboot_if_required: false
package_update: true
package_upgrade: true

# Install prerequisites
snap:
  commands:
    00: snap install juju --channel=3/stable
    01: snap install lxd --channel=6/stable

# Configure and initialize prerequisites
lxd:
  init:
    storage_backend: dir

# Commands to run at the end of the cloud-init process
runcmd:
  - lxc network set lxdbr0 ipv6.address none
  - su ubuntu -c 'juju bootstrap localhost charmed-hpc-controller'

# Write files to the Multipass instance
write_files:
  # MPI workload dependencies
  - path: /home/ubuntu/mpi_hello_world.c
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/mpi_hello_world.c
  - path: /home/ubuntu/submit_hello.sh
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/submit_hello.sh
  # Container workload dependencies.
  - path: /home/ubuntu/generate.py
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/generate.py
  - path: /home/ubuntu/workload.py
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/workload.py
  - path: /home/ubuntu/workload.def
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/workload.def
  - path: /home/ubuntu/submit_apptainer_mascot.sh
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    source:
      uri: |
        https://raw.githubusercontent.com/charmed-hpc/docs/refs/heads/main/reuse/tutorial/submit_apptainer_mascot.sh
From the local directory holding the cloud-init file, launch a virtual machine using Multipass:
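For example, a launch command along the following lines, with resource flags matching the prerequisites above (adjust the values and base image to suit your machine):
ubuntu@local:~$ multipass launch 24.04 --name charmed-hpc-tutorial \
    --cpus 8 --memory 16G --disk 40G \
    --cloud-init charmed-hpc-tutorial-cloud-init.yml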
The virtual machine launch process should take five minutes or less to complete, but may take longer depending on your network connection. Once the launch has completed, check the status of cloud-init to confirm that all processes completed successfully.
Enter the virtual machine:
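If you used the instance name above:
ubuntu@local:~$ multipass shell charmed-hpc-tutorial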
Then check the cloud-init status:
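One way to check is the long-form output of cloud-init status, which includes the fields shown below:
ubuntu@charmed-hpc-tutorial:~$ cloud-init status --long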
status: done
extended_status: done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:03:45 +0000
detail: DataSourceNoCloud [seed=/dev/sr0]
errors: []
recoverable_errors: {}
If the status shows done and there are no errors, then you are ready to move on to deploying the cluster charms.
Get compute nodes ready for jobs¶
Now that Slurm and the filesystem have been successfully deployed, the next step is to set up the compute nodes themselves. The compute nodes must be moved from the down state to the idle state so that jobs can run on them. First, check that the compute nodes are still down.
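You can check the node state with Slurm's sinfo command; here it is run from the tutorial environment against the login node unit with juju exec (any unit with the Slurm client tools also works), which will show something similar to:
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit sackd/0 -- sinfo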
PARTITION           AVAIL  TIMELIMIT  NODES  STATE  NODELIST
tutorial-partition     up   infinite      2   down  juju-e16200-[1-2]
Then, bring up the compute nodes:
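One way to do this is with scontrol on the controller unit, resuming each node reported by sinfo (the node names below match the example output above; substitute your own):
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit slurmctld/0 -- scontrol update NodeName=juju-e16200-1 State=RESUME
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit slurmctld/0 -- scontrol update NodeName=juju-e16200-2 State=RESUME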
Then verify that the STATE is now set to idle.
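Running sinfo against the login node unit again should now show:
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit sackd/0 -- sinfo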
PARTITION           AVAIL  TIMELIMIT  NODES  STATE  NODELIST
tutorial-partition     up   infinite      2   idle  juju-e16200-[1-2]
Copy files onto cluster¶
The workload files that were created during the cloud initialization step now need to be copied onto the cluster filesystem from the virtual machine filesystem. First you will make the new example directories, then set appropriate permissions, and finally copy the files over:
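One way to do this from the tutorial environment is with juju exec and juju scp against the login node unit; the /scratch/apptainer_example directory used for the container example here is an illustrative name:
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit sackd/0 -- mkdir -p /scratch/mpi_example /scratch/apptainer_example
ubuntu@charmed-hpc-tutorial:~$ juju exec --unit sackd/0 -- chown -R ubuntu:ubuntu /scratch/mpi_example /scratch/apptainer_example
ubuntu@charmed-hpc-tutorial:~$ juju scp mpi_hello_world.c submit_hello.sh sackd/0:/scratch/mpi_example/
ubuntu@charmed-hpc-tutorial:~$ juju scp generate.py workload.py workload.def submit_apptainer_mascot.sh sackd/0:/scratch/apptainer_example/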
The /scratch directory is mounted on the compute nodes and will be used for reading and writing files during the batch jobs.
Run a batch job¶
In the following steps, you will compile a small Hello World MPI script and run it by submitting a batch job to Slurm.
Compile¶
First, SSH into the login node, sackd/0:
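You can reach it with juju ssh:
ubuntu@charmed-hpc-tutorial:~$ juju ssh sackd/0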
This will place you in your home directory, /home/ubuntu. Next, move to the /scratch/mpi_example directory, install the Open MPI libraries needed for compiling, and then compile the mpi_hello_world.c file with the mpicc command:
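For example, assuming the standard Ubuntu Open MPI packages:
ubuntu@login:~$ cd /scratch/mpi_example
ubuntu@login:~$ sudo apt-get update && sudo apt-get install -y openmpi-bin libopenmpi-dev
ubuntu@login:~$ mpicc mpi_hello_world.c -o mpi_hello_world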
For quick reference, the two files for the MPI Hello World example are provided here:
mpi_hello_world.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of nodes
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Get the rank of the process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the name of the node
    char node_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(node_name, &name_len);

    // Print hello world message
    printf("Hello world from node %s, rank %d out of %d nodes\n",
           node_name, rank, size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
submit_hello.sh
#!/usr/bin/env bash
#SBATCH --job-name=hello_world
#SBATCH --partition=tutorial-partition
#SBATCH --nodes=2
#SBATCH --error=error.txt
#SBATCH --output=output.txt
mpirun ./mpi_hello_world
Submit batch job¶
Now, submit your batch job to the queue using sbatch:
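From the /scratch/mpi_example directory:
ubuntu@login:~$ sbatch submit_hello.sh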
Your job will complete after a few seconds. The generated output.txt file will look similar to the following:
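You can print it with cat:
ubuntu@login:~$ cat output.txt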
Hello world from processor juju-640476-1, rank 0 out of 2 processors
Hello world from processor juju-640476-2, rank 1 out of 2 processors
The batch job successfully spread the MPI job across two nodes that were able to report back their MPI rank to a shared output file.
Run a container job¶
Next, you will go through the steps to generate a random sample of Ubuntu mascot votes and plot the results. The process requires Python and a few specific libraries, so you will use Apptainer to build a container image and run the job on the cluster.
Set up Apptainer¶
Apptainer must be deployed and integrated with the existing Slurm deployment using Juju. These steps need to be completed from the charmed-hpc-tutorial environment; to return to that environment from within sackd/0, use the exit command.
Deploy and integrate Apptainer:
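A sketch of the deployment, assuming the apptainer charm is integrated with the compute partition, login node, and controller applications that appear alongside it in the status output below (check the charm's documentation for the exact relations):
ubuntu@charmed-hpc-tutorial:~$ juju deploy apptainer --channel latest/stable
ubuntu@charmed-hpc-tutorial:~$ juju integrate apptainer tutorial-partition
ubuntu@charmed-hpc-tutorial:~$ juju integrate apptainer sackd
ubuntu@charmed-hpc-tutorial:~$ juju integrate apptainer slurmctld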
After a few minutes, juju status should look similar to the following:
ubuntu@charmed-hpc-tutorial:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
slurm charmed-hpc-controller localhost/localhost 3.6.9 unsupported 17:34:46-04:00
App Version Status Scale Charm Channel Rev Exposed Message
apptainer 1.4.2 active 3 apptainer latest/stable 6 no
ceph-fs 19.2.1 active 1 ceph-fs latest/edge 196 no Unit is ready
scratch active 3 filesystem-client latest/edge 20 no Integrated with `cephfs` provider
microceph active 1 microceph latest/edge 161 no (workload) charm is ready
sackd 23.11.4-1.2u... active 1 sackd latest/edge 38 no
slurmctld 23.11.4-1.2u... active 1 slurmctld latest/edge 120 no primary - UP
tutorial-partition 23.11.4-1.2u... active 2 slurmd latest/edge 141 no
Unit Workload Agent Machine Public address Ports Message
ceph-fs/0* active idle 5 10.196.78.232 Unit is ready
microceph/1* active idle 6 10.196.78.238 (workload) charm is ready
sackd/0* active idle 3 10.196.78.117 6818/tcp
apptainer/2 active idle 10.196.78.117
scratch/2 active idle 10.196.78.117 Mounted filesystem at `/scratch`
slurmctld/0* active idle 0 10.196.78.49 6817,9092/tcp primary - UP
tutorial-partition/0 active idle 1 10.196.78.244 6818/tcp
apptainer/0 active idle 10.196.78.244
scratch/0* active idle 10.196.78.244 Mounted filesystem at `/scratch`
tutorial-partition/1* active idle 2 10.196.78.26 6818/tcp
apptainer/1* active idle 10.196.78.26
scratch/1 active idle 10.196.78.26 Mounted filesystem at `/scratch`
Machine State Address Inst id Base AZ Message
0 started 10.196.78.49 juju-808105-0 ubuntu@24.04 charmed-hpc-tutorial Running
1 started 10.196.78.244 juju-808105-1 ubuntu@24.04 charmed-hpc-tutorial Running
2 started 10.196.78.26 juju-808105-2 ubuntu@24.04 charmed-hpc-tutorial Running
3 started 10.196.78.117 juju-808105-3 ubuntu@24.04 charmed-hpc-tutorial Running
5 started 10.196.78.232 juju-808105-5 ubuntu@24.04 charmed-hpc-tutorial Running
6 started 10.196.78.238 juju-808105-6 ubuntu@24.04 charmed-hpc-tutorial Running
Build the container image using apptainer¶
Before you can submit your container workload to your Charmed HPC cluster, you must build the container image from the build recipe. The build recipe file workload.def defines the environment and libraries that will be in the container image.
To build the image, return to the cluster login node, move to the example directory, and call apptainer build:
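For example, using the example directory from the copy step earlier:
ubuntu@charmed-hpc-tutorial:~$ juju ssh sackd/0
ubuntu@login:~$ cd /scratch/apptainer_example
ubuntu@login:~$ apptainer build workload.sif workload.def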
The files for the Apptainer Mascot Vote example are provided here for reference.
generate.py
#!/usr/bin/env python3

"""Generate example dataset for workload."""

import argparse

from faker import Faker
from faker.providers import DynamicProvider
from pandas import DataFrame


faker = Faker()
favorite_lts_mascot = DynamicProvider(
    provider_name="favorite_lts_mascot",
    elements=[
        "Dapper Drake",
        "Hardy Heron",
        "Lucid Lynx",
        "Precise Pangolin",
        "Trusty Tahr",
        "Xenial Xerus",
        "Bionic Beaver",
        "Focal Fossa",
        "Jammy Jellyfish",
        "Noble Numbat",
    ],
)
faker.add_provider(favorite_lts_mascot)


def main(rows: int) -> None:
    df = DataFrame(
        [
            [faker.email(), faker.country(), faker.favorite_lts_mascot()]
            for _ in range(rows)
        ],
        columns=["email", "country", "favorite_lts_mascot"],
    )
    df.to_csv("favorite_lts_mascot.csv")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--rows", type=int, default=1, help="Rows of fake data to generate"
    )
    args = parser.parse_args()

    main(rows=args.rows)
workload.py
#!/usr/bin/env python3

"""Plot the most popular Ubuntu LTS mascot."""

import argparse
import os

import pandas as pd
import plotext as plt


def main(dataset: str | os.PathLike, file: str | os.PathLike) -> None:
    df = pd.read_csv(dataset)
    mascots = df["favorite_lts_mascot"].value_counts().sort_index()

    plt.simple_bar(
        mascots.index,
        mascots.values,
        title="Favorite LTS mascot",
        color="orange",
        width=150,
    )

    if file:
        plt.save_fig(
            file if os.path.isabs(file) else f"{os.getcwd()}/{file}",
            keep_colors=True,
        )
    else:
        plt.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("dataset", type=str, help="Path to CSV dataset to plot")
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="",
        help="Output file to save plotted graph",
    )
    args = parser.parse_args()

    main(args.dataset, args.output)
workload.def
bootstrap: docker
from: ubuntu:24.04

%files
    generate.py /usr/bin/generate
    workload.py /usr/bin/workload

%environment
    export PATH=/usr/bin/venv/bin:${PATH}
    export PYTHONPATH=/usr/bin/venv:${PYTHONPATH}

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update -y
    apt-get install -y python3-dev python3-venv
    python3 -m venv /usr/bin/venv
    alias python3=/usr/bin/venv/bin/python3
    alias pip=/usr/bin/venv/bin/pip
    pip install -U faker
    pip install -U pandas
    pip install -U plotext
    chmod 755 /usr/bin/generate
    chmod 755 /usr/bin/workload

%runscript
    exec workload "$@"
submit_apptainer_mascot.sh
#!/usr/bin/env bash
#SBATCH --job-name=favorite-lts-mascot
#SBATCH --partition=tutorial-partition
#SBATCH --nodes=2
#SBATCH --error=mascot_error.txt
#SBATCH --output=mascot_output.txt
apptainer exec workload.sif generate --rows 1000000
apptainer run workload.sif favorite_lts_mascot.csv --output graph.out
Use the image to run jobs¶
Now that you have built the container image, you can submit a job to the cluster that uses the new workload.sif image to generate one million lines in a table and then uses the resulting favorite_lts_mascot.csv to build the bar plot:
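From the container example directory on the login node:
ubuntu@login:~$ sbatch submit_apptainer_mascot.sh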
To view the status of the job while it is running, run squeue.
Once the job has completed, view the generated bar plot, which will look similar to the following:
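For example, by printing graph.out:
ubuntu@login:~$ cat graph.out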
────────────────────── Favorite LTS mascot ───────────────────────
│Bionic Beaver ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 101124.00
│Dapper Drake ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99889.00
│Focal Fossa ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99956.00
│Hardy Heron ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99872.00
│Jammy Jellyfish ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99848.00
│Lucid Lynx ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99651.00
│Noble Numbat ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100625.00
│Precise Pangolin ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99670.00
│Trusty Tahr ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99366.00
│Xenial Xerus ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 99999.00
Summary and clean up¶
In this tutorial, you:
Deployed and integrated Slurm and a shared filesystem
Launched an MPI batch job and saw cross-node communication results
Built a container image with Apptainer and used it to run a batch job and generate a bar plot
Now that you have completed the tutorial, if you would like to completely remove the virtual machine, return to your local terminal and delete it with Multipass as follows:
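For example, deleting and purging the instance in a single step:
ubuntu@local:~$ multipass delete --purge charmed-hpc-tutorial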
Next steps¶
Now that you have gotten started with Charmed HPC, check out the Explanation section for details on important concepts and the How-to guides for how to use more of Charmed HPC’s features.