Enable NVIDIA GPUs¶
This guide describes how to configure Data Science Stack (DSS) to utilise your NVIDIA GPUs within a Canonical K8s environment.
DSS supports GPU acceleration by leveraging the NVIDIA GPU Operator. The operator ensures that the necessary components, including drivers and runtime, are set up correctly to enable GPU workloads.
Prerequisites¶
DSS is installed and initialised.
Your machine includes an NVIDIA GPU.
Install the NVIDIA GPU Operator¶
To enable GPU support, you must install the NVIDIA GPU Operator in your Kubernetes cluster. Follow the NVIDIA GPU Operator Installation Guide for installation details.
Verify the NVIDIA Operator is up¶
Once the NVIDIA GPU Operator is installed, verify that it has been successfully initialised before running workloads.
Ensure DaemonSet is ready¶
Run the following command to wait until the DaemonSet for the NVIDIA Operator Validator has been created:
while ! kubectl get ds -n gpu-operator-resources nvidia-operator-validator; do
    sleep 5
done
Note
It may take a few seconds for the DaemonSet to be created.
Once the DaemonSet exists, the command prints output similar to this:
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                   AGE
nvidia-operator-validator   1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true   17s
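In the sample output above, READY is still 0 because the validator pod has only just been scheduled. A minimal sketch of a one-shot readiness check follows, assuming the namespace and DaemonSet name used in this guide. The helper name `daemonset_is_ready` is hypothetical; the `numberReady` and `desiredNumberScheduled` fields come from the standard Kubernetes DaemonSet status.

```shell
# Namespace and DaemonSet name as used in the commands of this guide.
NAMESPACE="gpu-operator-resources"
DAEMONSET="nvidia-operator-validator"

# Hypothetical helper: prints the ready and desired pod counts and
# returns 0 only when every scheduled pod is ready.
# Returns 2 when kubectl itself fails (for example, DaemonSet not found).
daemonset_is_ready() {
    local status ready desired
    # jsonpath prints "<ready>/<desired>", for example "0/1".
    status="$(kubectl get ds -n "$NAMESPACE" "$DAEMONSET" \
        -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}')" || return 2
    ready="${status%%/*}"
    desired="${status##*/}"
    echo "ready=${ready:-0} desired=${desired:-0}"
    [ "${desired:-0}" -gt 0 ] && [ "${ready:-0}" -eq "${desired:-0}" ]
}
```

Calling `daemonset_is_ready` after the wait loop gives a quick yes or no answer; a non-zero exit status shortly after installation usually just means the pod is still starting.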
Ensure the Validator Pod succeeded¶
Run the following command to wait until the NVIDIA Operator validation succeeds:
echo "Waiting for the NVIDIA Operator validations to complete..."
while ! kubectl logs -n gpu-operator-resources -l app=nvidia-operator-validator -c nvidia-operator-validator | grep "all validations are successful"; do
    sleep 5
done
Note
While the pod is still initialising, the command may print an error like:
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xxxx" is waiting to start: PodInitializing
Once completed, the output should include:
all validations are successful
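The loop above retries indefinitely, which can hang an automated setup if validation never succeeds. The following is a sketch of a bounded variant, assuming the same namespace, label, and container name as the command above. The function name `wait_for_validation` and its default timeout are illustrative choices, not part of DSS or the GPU Operator.

```shell
# Namespace as used in the commands of this guide.
NAMESPACE="gpu-operator-resources"

# Hypothetical helper: polls the validator logs until the success message
# appears or the timeout (in seconds) elapses.
# Usage: wait_for_validation [timeout_seconds] [poll_interval_seconds]
wait_for_validation() {
    local timeout="${1:-600}" interval="${2:-5}" elapsed=0
    echo "Waiting for the NVIDIA Operator validations to complete..."
    while [ "$elapsed" -lt "$timeout" ]; do
        # stderr is discarded so the expected PodInitializing error stays quiet.
        if kubectl logs -n "$NAMESPACE" -l app=nvidia-operator-validator \
            -c nvidia-operator-validator 2>/dev/null \
            | grep "all validations are successful"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "Validation did not complete within ${timeout}s" >&2
    return 1
}
```

For example, `wait_for_validation 900` gives up after 15 minutes and returns a non-zero status that a calling script can act on.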
Use cases for different driver states¶
The NVIDIA GPU Operator behaves differently depending on whether your system already has an NVIDIA driver installed. Below are the three primary scenarios:
Device with no NVIDIA driver installed¶
The GPU Operator will automatically install the necessary NVIDIA driver.
This process may take longer than in the other scenarios, because it sets up both the driver and the runtime components.
Once the process is complete, the GPU should be detected successfully.
Device with an up-to-date NVIDIA driver¶
The GPU Operator detects the existing driver and proceeds without reinstalling it.
However, to avoid a redundant installation, explicitly disable driver installation when deploying the operator.
See the upstream documentation for the configuration that disables driver installation in this case.
Device with an outdated NVIDIA driver¶
If an older driver version is detected, the operator may attempt to install a newer version.
This could lead to conflicts if the outdated driver does not match the required CUDA version.
To prevent issues, update the driver manually or remove the outdated version before deploying the GPU Operator.
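The first two scenarios can be told apart before deploying the operator. The sketch below assumes that `nvidia-smi` is on the PATH whenever a working host driver is installed, and that the operator is deployed with the upstream Helm chart, whose `driver.enabled` value controls driver installation. The function name `driver_install_flag` and the `NVIDIA_SMI` override variable are hypothetical. The sketch does not judge whether an existing driver is outdated, so that case still needs the manual update or removal described above.

```shell
# Command used to query the host driver; overridable for testing (assumption).
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Hypothetical helper: prints the Helm flag matching the host driver state.
driver_install_flag() {
    local version
    # Query the loaded driver version; the output is empty when nvidia-smi
    # is missing or no working driver is loaded.
    version="$("$NVIDIA_SMI" --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if [ -n "$version" ]; then
        echo "Host driver $version detected; skipping driver installation" >&2
        echo "--set driver.enabled=false"
    else
        echo "No working host driver detected; the operator installs one" >&2
        echo "--set driver.enabled=true"
    fi
}
```

The printed flag could then be appended to the installation command, for example `helm install --wait --generate-name -n gpu-operator-resources --create-namespace nvidia/gpu-operator $(driver_install_flag)`. The namespace here is an assumption chosen to match the verification commands in this guide; the upstream installation guide remains the authoritative source for the exact options.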
Verify DSS detects the GPU¶
After installing and configuring the NVIDIA GPU Operator, verify that DSS detects the GPU by checking its status:
dss status
Expect output similar to this:
MLflow deployment: Ready
MLflow URL: http://10.152.183.74:5000
GPU acceleration: Enabled (NVIDIA-GeForce-RTX-3070-Ti)
Note
The GPU model displayed may differ based on your hardware.
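In a script, the same check can be automated by parsing the status output. This is a minimal sketch that assumes the `GPU acceleration: Enabled (<model>)` line appears exactly as in the sample above; the helper name `gpu_model_from_status` is hypothetical.

```shell
# Hypothetical helper: reads `dss status` output on stdin, prints the GPU
# model when acceleration is reported as enabled, and returns non-zero
# when no such line is found.
gpu_model_from_status() {
    # sed extracts the text between the parentheses; `grep .` makes the
    # pipeline fail when sed printed nothing.
    sed -n 's/^GPU acceleration: Enabled (\(.*\))$/\1/p' | grep .
}
```

For example, `dss status | gpu_model_from_status` prints the detected model, and its exit status can gate later steps that require a GPU.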
See also¶
To learn how to manage your DSS environment, check Manage DSS.
If you are interested in managing Jupyter Notebooks within your DSS environment, see Manage Jupyter Notebooks.