Integrate with Charmed Apache Spark

Note

This feature is experimental and should not be used in a production environment.

This guide describes how to integrate Charmed Kubeflow (CKF) with Charmed Apache Spark using Juju. The integration lets you run Spark jobs in Kubeflow Notebooks and Kubeflow Pipelines.

Requirements

Minimum requirements for this guide are:

Integrate Spark with an existing CKF deployment

This section of the guide assumes that you already have a CKF deployment in a Juju model named kubeflow in your Juju cloud. If you don't, see the CKF getting started guide for details on how to create one, or refer to the Deploy CKF + Charmed Apache Spark solution using Terraform section below.

Note

When using an existing CKF deployment, ensure that the metacontroller-operator charm is up to date with the 4.11/edge channel, since the changes that support the Charmed Apache Spark integration have not yet been released to the stable channel.

To integrate Charmed Kubeflow with Charmed Apache Spark, you need to deploy the Spark Integration Hub, Data-Kubeflow Integrator and Resource Dispatcher charms, configure them, and connect them with Juju relations. The following sections describe these actions in detail.

Spark Integration Hub setup

Charmed Kubeflow is integrated with the Charmed Apache Spark ecosystem with the help of the Spark Integration Hub charm. You can either use an existing deployment of this charm, or deploy a new instance of this charm as described in the subsections below.

Use an existing Spark Integration Hub deployment

Note

If you don’t yet have a deployment of the Spark Integration Hub charm, follow the step Deploy a new instance of Spark Integration Hub instead of this one.

This step assumes that there is an existing deployment of the Spark Integration Hub charm from channel 3/edge, with the app name integration-hub, in a separate Juju model named spark.

Check if the spark-service-account endpoint is already offered by the Spark Integration Hub charm in the spark model.

juju switch kubeflow
juju find-offers

If the offer doesn’t exist, create one with the commands below:

juju switch spark
juju offer integration-hub:spark-service-account

You can then verify that the offer has been created with the juju offers command.

Once you verify that the offer is available for use, switch back to the kubeflow model to consume the offer as follows:

juju switch kubeflow
juju consume spark.integration-hub

If successful, integration-hub is now listed in the SAAS section at the top of the juju status output in the kubeflow model.
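The check, offer and consume steps above can be combined into one small script. The following is only a sketch under stated assumptions: it reuses the model names (spark, kubeflow), the app name (integration-hub) and the endpoint from this guide, and it assumes that an existing offer shows up by name in the juju offers listing. The function names are hypothetical, and the case of an offer that has already been consumed is not handled.

```shell
# Names taken from this guide; adjust them to your deployment.
OFFER_MODEL="spark"
CONSUMER_MODEL="kubeflow"
OFFER_APP="integration-hub"
OFFER_ENDPOINT="spark-service-account"

# Succeeds if the offer name appears in the `juju offers` listing of the
# current model. Matching on the name as a plain substring is an assumption
# about the listing, not a documented contract.
offer_exists() {
    # Capture the listing first so grep never closes the pipe early on juju.
    offers_output="$(juju offers)" || return 1
    printf '%s\n' "$offers_output" | grep -qF -- "$OFFER_APP"
}

# Creates the offer only when it is missing, then consumes it from the
# kubeflow model.
ensure_offer_consumed() {
    juju switch "$OFFER_MODEL" || return 1
    if ! offer_exists; then
        juju offer "${OFFER_APP}:${OFFER_ENDPOINT}" || return 1
    fi
    juju switch "$CONSUMER_MODEL" || return 1
    juju consume "${OFFER_MODEL}.${OFFER_APP}"
}
```

Calling ensure_offer_consumed runs the whole sequence. The script leaves the kubeflow model selected, which is where the remaining steps of this guide take place.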

Deploy a new instance of Spark Integration Hub

Note

If you have an existing deployment of Spark Integration Hub, follow the step Use an existing Spark Integration Hub deployment in place of this step.

Deploy the Spark Integration Hub charm in the kubeflow model by following the commands below.

juju switch kubeflow
juju deploy spark-integration-hub-k8s --channel=3/edge integration-hub --trust

The --trust flag is essential when deploying the spark-integration-hub-k8s charm: it allows the charm to create and watch resources in the Kubernetes cluster.

Data-Kubeflow Integrator setup

Within the kubeflow model, deploy the Data-Kubeflow Integrator as follows:

juju deploy data-kubeflow-integrator --channel=1/edge

Set the profile and spark-service-account config options of the Data-Kubeflow Integrator charm. The profile option names the Kubeflow profile in which Spark should be enabled, and the spark-service-account option names the Spark service account to be created.

juju config data-kubeflow-integrator \
    profile="*" \
    spark-service-account=spark

The value of profile is set to * to enable Spark in all Kubeflow profiles. Refer to this documentation to read more about Spark service accounts.

Integrate the Data-Kubeflow Integrator with Spark Integration Hub as follows:

juju integrate data-kubeflow-integrator integration-hub

Resource Dispatcher setup

Within the kubeflow model, deploy the Resource Dispatcher charm as follows:

juju deploy resource-dispatcher --channel=latest/edge --trust

The --trust flag is essential when deploying the resource-dispatcher charm: it allows the charm to create resources in Kubernetes.

Integrate the Resource Dispatcher charm with the Data-Kubeflow Integrator charm over the secrets, service-accounts, pod-defaults, roles and role-bindings endpoints as follows:

juju integrate resource-dispatcher:secrets data-kubeflow-integrator:secrets
juju integrate resource-dispatcher:service-accounts data-kubeflow-integrator:service-accounts
juju integrate resource-dispatcher:pod-defaults data-kubeflow-integrator:pod-defaults
juju integrate resource-dispatcher:roles data-kubeflow-integrator:roles
juju integrate resource-dispatcher:role-bindings data-kubeflow-integrator:role-bindings
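Because the five relations differ only in the endpoint name, they can also be created in a loop. The sketch below assumes the application names used in this guide. The ENDPOINTS variable and the integrate_endpoints function are hypothetical names introduced for illustration.

```shell
# Endpoints shared by the Resource Dispatcher and Data-Kubeflow Integrator charms.
ENDPOINTS="secrets service-accounts pod-defaults roles role-bindings"

# Relates the identically named endpoint on both charms, one endpoint at a
# time, and stops at the first failure.
integrate_endpoints() {
    for endpoint in $ENDPOINTS; do
        juju integrate \
            "resource-dispatcher:${endpoint}" \
            "data-kubeflow-integrator:${endpoint}" || return 1
    done
}
```

Running integrate_endpoints in the kubeflow model issues the same five juju integrate commands listed above.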

Deploy CKF + Charmed Apache Spark solution using Terraform

Alternatively, you can deploy the entire Charmed Kubeflow-Spark solution from scratch on an existing Juju controller using Terraform. This section of the guide assumes you have a working Juju K8s controller and the terraform and charmcraft CLIs installed.

Clone the charmed-kubeflow-solutions repository and change directory to the Kubeflow-Spark module as follows:

git clone https://github.com/canonical/charmed-kubeflow-solutions
cd charmed-kubeflow-solutions/modules/kubeflow-spark

Initialise Terraform. The following command downloads all the required Terraform modules and installs the Terraform Juju provider:

terraform init

Define the credentials that will later be used to log in to the Kubeflow dashboard:

DEX_USERNAME="your-username"
DEX_PASSWORD="your-password"

Finally, deploy Charmed Kubeflow-Spark solution using Terraform as follows:

terraform apply \
   -var "dex_static_username=$DEX_USERNAME" \
   -var "dex_static_password=$DEX_PASSWORD" \
   -var risk=edge

The command above:

  • Creates a Juju model named kubeflow.

  • Deploys CKF from the 1.10/edge channel.

  • Deploys charms like Spark Integration Hub, Data-Kubeflow Integrator and Resource Dispatcher that are necessary for Spark integration.

  • Configures the dex-auth charm with the static username and password.

  • Deploys the metacontroller-operator and resource-dispatcher charms from the edge channel because the changes necessary for Spark integration aren’t released to the stable channel yet.

Wait until the deployment is complete and the terraform apply command returns.
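As an alternative to -var flags, Terraform also reads input variables from environment variables named TF_VAR_ followed by the variable name. The sketch below exports the three variables used in this section. The function name export_kubeflow_spark_vars is hypothetical, and the default risk of edge simply mirrors the command shown earlier.

```shell
# Exports the Terraform input variables used in this section as TF_VAR_*
# environment variables.
# $1: Dex username, $2: Dex password, $3: risk (optional, defaults to edge)
export_kubeflow_spark_vars() {
    export TF_VAR_dex_static_username="$1"
    export TF_VAR_dex_static_password="$2"
    export TF_VAR_risk="${3:-edge}"
}
```

After calling export_kubeflow_spark_vars "your-username" "your-password", a plain terraform apply in the same shell picks the values up. This keeps the password off the terraform command line and preserves values that contain spaces.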

Verify the deployment

As the first step, verify all charms are in active status by monitoring the Juju model:

juju switch kubeflow
watch -n 1 "juju status"

Note

This may take several minutes, depending on the cluster’s node specifications.
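If you prefer a scripted wait over watching the status by hand, a polling loop can do the same job. The sketch below is illustrative only: wait_until and all_apps_active are hypothetical helper names. The check assumes that jq is installed and that juju status --format=json reports each application's state under applications.<name>."application-status".current, so inspect the JSON output of your Juju version before relying on it.

```shell
# Runs a check command repeatedly until it succeeds or the timeout expires.
# $1: timeout in seconds, $2: polling interval in seconds, rest: the command
wait_until() {
    _timeout="$1"
    _interval="$2"
    shift 2
    _elapsed=0
    while ! "$@"; do
        if [ "$_elapsed" -ge "$_timeout" ]; then
            return 1
        fi
        sleep "$_interval"
        _elapsed=$((_elapsed + _interval))
    done
    return 0
}

# Succeeds when at least one application exists and every application
# reports the "active" status. The JSON layout is an assumption (see above).
all_apps_active() {
    juju status --format=json \
        | jq -e '[.applications[] | .["application-status"].current]
                 | (length > 0) and all(. == "active")' >/dev/null
}
```

For example, wait_until 1800 10 all_apps_active polls every 10 seconds for up to 30 minutes and returns a non-zero exit status if the model never settles.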

For additional validation, Charmed Kubeflow-Spark User Acceptance Tests (UAT) can be run on top of this deployment to verify that the deployment was indeed successful and that Spark is enabled in the Kubeflow notebooks and pipelines.

To run the UAT on top of the deployment, first fetch the UAT tests from the charmed-kubeflow-uats repository:

git clone https://github.com/canonical/charmed-kubeflow-uats.git
cd charmed-kubeflow-uats

Now run the UAT with the following command:

Note

Make sure to use the correct Charmed Apache Spark JupyterLab image for the Apache Spark version of your choice. The command below uses the image for Apache Spark 3.5. OCI images corresponding to other versions of Spark can be found here.

tox -e spark-remote -- \
    --test-image ghcr.io/canonical/charmed-spark-jupyterlab:3.5-22.04_edge@sha256:72a6e89985e35e0920fb40c063b3287425760ebf823b129a87143d5ec0e99af7  \
    --bundle ''

This runs the tests that verify Spark is enabled in both Kubeflow Notebooks and Kubeflow Pipeline steps. The run takes around five minutes to complete. If it was successful, the output ends with lines similar to the following:

spark-remote: OK (223.32=setup[0.07]+cmd[1.15,222.10] seconds)
congratulations :) (223.35 seconds)

Access CKF dashboard to run Spark jobs

Once you have a Charmed Kubeflow deployment with Spark support, set up by following the instructions above, you can access the CKF dashboard through an IP address. See Access CKF dashboard for more details on how to access the CKF dashboard.

Using the Kubeflow dashboard, you can now create Kubeflow Notebooks and Kubeflow Pipelines and write code to run Spark jobs within them. See Run Spark jobs for a guide on how to run sample Spark jobs using a Kubeflow Notebook and a Kubeflow Pipeline.