Enable NVIDIA GPUs

This guide describes how to configure Data Science Stack (DSS) to use your NVIDIA GPUs within a Canonical K8s environment.

DSS supports GPU acceleration by leveraging the NVIDIA GPU Operator. The operator ensures that the necessary components, including drivers and runtime, are set up correctly to enable GPU workloads.

Prerequisites

DSS is installed and initialised.
Your machine includes an NVIDIA GPU.
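
One way to confirm the second prerequisite is to look for an NVIDIA device on the PCI bus. The following is a minimal sketch, assuming `lspci` (from the pciutils package) is installed; the helper name `has_nvidia_gpu` is hypothetical and not part of DSS.

```shell
# Hypothetical helper: succeeds if lspci lists at least one NVIDIA device.
has_nvidia_gpu() {
    lspci | grep -qi 'nvidia'
}
```

For example, `has_nvidia_gpu && echo "NVIDIA GPU found"` prints a message only when such a device is listed.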

Install the NVIDIA GPU Operator

To enable GPU support, you must install the NVIDIA GPU Operator in your Kubernetes cluster. Follow the NVIDIA GPU Operator Installation Guide for installation details.
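
For orientation, a Helm-based installation along the lines of the upstream guide can be sketched as follows. This assumes Helm is installed and that the operator is deployed into the `gpu-operator-resources` namespace, which is the namespace the verification commands in this guide query. The upstream guide remains the authoritative reference, and `install_gpu_operator` is a hypothetical wrapper.

```shell
# Assumption: namespace chosen to match the verification commands in this guide.
GPU_OPERATOR_NAMESPACE="${GPU_OPERATOR_NAMESPACE:-gpu-operator-resources}"

# Hypothetical wrapper: add NVIDIA's Helm repository and install the
# GPU Operator chart, stopping at the first step that fails.
install_gpu_operator() {
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
        helm repo update &&
        helm install --wait --generate-name \
            -n "$GPU_OPERATOR_NAMESPACE" --create-namespace \
            nvidia/gpu-operator
}
```

Calling `install_gpu_operator` runs the three steps in order. Check the chart options in the upstream guide before relying on the defaults.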

Verify the NVIDIA Operator is up

Once the NVIDIA GPU Operator is installed, verify that it has been successfully initialized before running workloads.

Ensure DaemonSet is ready

Run the following command to wait until the DaemonSet for the NVIDIA Operator Validator has been created:

# Poll every 5 seconds until the DaemonSet exists.
while ! kubectl get ds -n gpu-operator-resources nvidia-operator-validator; do
    sleep 5
done

Note

It may take a few seconds for the DaemonSet to be created.
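
The loop above retries indefinitely. If you prefer the wait to give up after a fixed number of attempts, for example in a provisioning script, a bounded retry can be sketched as follows. The helper names `retry` and `wait_for_validator_ds` and the limit of 60 attempts are assumptions for illustration.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping
# INTERVAL seconds after each failed attempt. Returns 0 on the first
# success and 1 if every attempt fails.
retry() {
    local attempts="$1" interval="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Wait for the validator DaemonSet to exist: 60 attempts, 5 s apart
# (roughly 5 minutes in total).
wait_for_validator_ds() {
    retry 60 5 kubectl get ds -n gpu-operator-resources nvidia-operator-validator
}
```

A script can then run `wait_for_validator_ds || exit 1` to stop early when the DaemonSet never appears.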

Once the DaemonSet exists, the output is similar to the following (READY and AVAILABLE can still be 0 while the validator pod starts):

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                   AGE
nvidia-operator-validator   1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true   17s

Ensure the Validator Pod succeeded

Run the following command to wait until the NVIDIA Operator validation succeeds:

echo "Waiting for the NVIDIA Operator validations to complete..."

while ! kubectl logs -n gpu-operator-resources -l app=nvidia-operator-validator -c nvidia-operator-validator | grep "all validations are successful"; do
    sleep 5
done

Note

While the pod is still initializing, the loop may print the following error until the container starts: Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xxxx" is waiting to start: PodInitializing

Once completed, the output should include:

all validations are successful
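
While the pod is initializing, each iteration of the loop above prints the `PodInitializing` error to the terminal. A quieter single-shot check, sketched below, discards that error output and only reports whether the success message is present. The function name `validations_passed` is hypothetical.

```shell
# Hypothetical helper: succeeds once the validator logs contain the
# success message. Errors from kubectl (such as PodInitializing) are
# discarded, so a failure only means "not finished yet".
validations_passed() {
    kubectl logs -n gpu-operator-resources \
        -l app=nvidia-operator-validator \
        -c nvidia-operator-validator 2>/dev/null |
        grep -q "all validations are successful"
}
```

It can replace the condition in the loop, for example `while ! validations_passed; do sleep 5; done`.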

Use cases for different driver states

The NVIDIA GPU Operator behaves differently depending on whether your system already has an NVIDIA driver installed. Below are the three primary scenarios:

Device with no NVIDIA driver installed

The GPU Operator will automatically install the necessary NVIDIA driver.
This installation process may take longer as it involves setting up drivers and runtime components.
Once the process is complete, the GPU should be detected successfully.
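
Once the operator has finished, the node should advertise the GPU as a schedulable resource named `nvidia.com/gpu`. A minimal sketch for checking this, assuming `kubectl` points at the DSS cluster; the function name `node_gpu_resources` is hypothetical:

```shell
# Print the nvidia.com/gpu entries from the node descriptions (they
# appear under Capacity and Allocatable). Fails if none is found.
node_gpu_resources() {
    kubectl describe nodes | grep 'nvidia\.com/gpu:'
}
```

If `node_gpu_resources` prints nothing and returns a non-zero status, the driver and device plugin are probably still being set up.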

Device with an up-to-date NVIDIA driver

The GPU Operator detects the existing driver and proceeds without reinstalling it.
To rule out a redundant installation, it is still recommended to disable driver installation explicitly when deploying the operator.
Follow the upstream documentation for the correct configuration to disable driver installation in this case.
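
At the time of writing, the upstream Helm chart exposes this setting as the `driver.enabled` value. The sketch below checks for a working host driver with `nvidia-smi` and prints the value to pass to Helm. Treat the value name as something to confirm against the upstream documentation; the helper names are hypothetical.

```shell
# Print the installed NVIDIA driver version (one line per GPU).
# Fails if nvidia-smi is missing or cannot talk to a driver.
installed_driver_version() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Hypothetical helper: print the Helm value that matches the host state.
# A usable host driver means the operator should not install its own.
driver_helm_value() {
    if installed_driver_version >/dev/null 2>&1; then
        echo "driver.enabled=false"
    else
        echo "driver.enabled=true"
    fi
}
```

The result could then be passed to Helm as `--set "$(driver_helm_value)"` when installing the operator.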

Device with an outdated NVIDIA driver

If an older driver version is detected, the operator may attempt to install a newer version.
This could lead to conflicts if the outdated driver does not match the required CUDA version.
To prevent issues, update the driver manually or remove the outdated version before deploying the GPU Operator.
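
To decide whether the installed driver is too old, you can compare its version against the minimum required by your CUDA workloads. The sketch below assumes GNU `sort` (for `-V`, version sort) and takes the minimum version as an argument, since this guide does not state one. `version_lt` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if version $1 is strictly lower than
# version $2, comparing dotted components numerically via sort -V.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_lt "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" "<minimum version>"` succeeds when the driver needs updating, where `<minimum version>` is a placeholder you supply.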

Verify DSS detects the GPU

After installing and configuring the NVIDIA GPU Operator, verify that DSS detects the GPU by checking its status:

dss status

The output should look similar to this:

MLflow deployment: Ready
MLflow URL: http://10.152.183.74:5000
GPU acceleration: Enabled (NVIDIA-GeForce-RTX-3070-Ti)

Note

The GPU model displayed may differ based on your hardware.
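
In a script, the same check can be automated by searching the status output for the `GPU acceleration: Enabled` text shown above. A minimal sketch, assuming that text stays stable across DSS versions; `gpu_enabled` is a hypothetical helper that reads the status text from standard input:

```shell
# Hypothetical helper: succeeds if the status text on standard input
# reports GPU acceleration as enabled.
gpu_enabled() {
    grep -q 'GPU acceleration: Enabled'
}
```

For example: `dss status 2>&1 | gpu_enabled && echo "GPU detected"`. The `2>&1` is included in case the status is written to standard error.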
