Enable NVIDIA GPUs

This guide describes how to configure Data Science Stack (DSS) to utilise your NVIDIA GPUs within a Canonical K8s environment.

DSS supports GPU acceleration by leveraging the NVIDIA GPU Operator. The operator ensures that the necessary components, including drivers and runtime, are set up correctly to enable GPU workloads.

Prerequisites

Install the NVIDIA GPU Operator

To enable GPU support, you must install the NVIDIA GPU Operator in your Kubernetes cluster. Follow NVIDIA GPU Operator Installation Guide for installation details.

Verify the NVIDIA Operator is up

Once the NVIDIA GPU Operator is installed, verify that it has been successfully initialized before running workloads.

Ensure DaemonSet is ready

Run the following command to verify that the DaemonSet for the NVIDIA Operator Validator is created:

while ! kubectl get ds -n gpu-operator-resources nvidia-operator-validator; do
   sleep 5
done

Note

It may take a few seconds for the DaemonSet to be created.

Once completed, you should see an output similar to this:

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                   AGE
nvidia-operator-validator   1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true   17s

Ensure the Validator Pod succeeded

Run the following command to check if the NVIDIA Operator validation is successful:

echo "Waiting for the NVIDIA Operator validations to complete..."

while ! kubectl logs -n gpu-operator-resources -l app=nvidia-operator-validator -c nvidia-operator-validator | grep "all validations are successful"; do
    sleep 5
done

Note

If the pod is still initializing, you may see an error like: Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xxxx" is waiting to start: PodInitializing

Once completed, the output should include:

all validations are successful

Use cases for different driver states

The NVIDIA GPU Operator behaves differently depending on whether your system already has an NVIDIA driver installed. Below are the three primary scenarios:

Device with no NVIDIA driver installed

  • The GPU Operator will automatically install the necessary NVIDIA driver.

  • This installation process may take longer as it involves setting up drivers and runtime components.

  • Once the process is complete, GPU should be detected successfully.

Device with an up-to-date NVIDIA driver

  • The GPU Operator detects the existing driver and proceeds without reinstalling it.

  • However, to avoid redundant installations, it is recommended to disable driver installation explicitly when deploying the operator.

  • Follow the upstream documentation for the correct configuration to disable driver installation in this case.

Device with an outdated NVIDIA driver

  • If an older driver version is detected, the operator may attempt to install a newer version.

  • This could lead to conflicts if the outdated driver does not match the required CUDA version.

  • To prevent issues, update the driver manually or remove the outdated version before deploying the GPU Operator.

Verify DSS detects the GPU

After installing and configuring the NVIDIA GPU Operator, verify that DSS detects the GPU by checking its status:

dss status

You should expect an output like this:

MLflow deployment: Ready
MLflow URL: http://10.152.183.74:5000
GPU acceleration: Enabled (NVIDIA-GeForce-RTX-3070-Ti)

Note

The GPU model displayed may differ based on your hardware.

See also

  • To learn how to manage your DSS environment, check Manage DSS.

  • If you are interested in managing Jupyter Notebooks within your DSS environment, see Manage Jupyter Notebooks.