Enable GPUs on EKS worker nodes¶
GPU-based EKS worker nodes are needed for many applications, such as training deep learning models.
Before enabling GPUs on your worker nodes, you’ll have to create a cluster and node groups with GPU-based instances. You’ll also need SSH access to the nodes.
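If you don’t have a suitable cluster yet, one way to create one is with eksctl. The names, region, instance type and key below are placeholders to adapt to your environment, and you may also want to choose an AMI family that matches the driver instructions in the next section:
eksctl create cluster \
  --name gpu-cluster \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes 2 \
  --ssh-access \
  --ssh-public-key <your-key-name>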
Install a GPU driver on each node¶
For the GPU-based instances to work, you’ll need to install the appropriate NVIDIA drivers on them. For general-purpose GPU usage, we recommend using a Tesla driver and the installation instructions from Option 2 - Public NVIDIA drivers.
After installation, run sudo nvidia-smi
to verify that the driver is successfully installed.
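For illustration, on an Ubuntu-based node the Option 2 flow might look roughly like this, assuming you have already downloaded the matching Tesla installer (NVIDIA-Linux-x86_64-<version>.run) from NVIDIA’s driver download page:
# Build tools and kernel headers required by the NVIDIA installer
sudo apt-get update
sudo apt-get install -y gcc make linux-headers-$(uname -r)
# Run the downloaded installer (filename is illustrative)
chmod +x NVIDIA-Linux-x86_64-<version>.run
sudo sh ./NVIDIA-Linux-x86_64-<version>.run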
Install and set up the ‘NVIDIA Container Toolkit’ on each node¶
To support containerised GPU-accelerated applications, the NVIDIA container runtime (provided by the ‘NVIDIA Container Toolkit’) should be set as the default runtime on all the nodes.
For this, first configure the source repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
Next, install the NVIDIA Container Toolkit:
sudo apt-get install -y nvidia-container-toolkit
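If you manage containerd’s configuration on the nodes yourself, rather than through the EKS bootstrap script used in the next step, the toolkit’s nvidia-ctk helper can register the NVIDIA runtime as containerd’s default. This is an optional, alternative sketch:
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default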
Restart containerd and bootstrap the node with the NVIDIA container runtime, checking from the output that the runtime has been set correctly:
sudo systemctl restart containerd
sudo /etc/eks/bootstrap.sh ${YOUR_CLUSTER_NAME} --container-runtime nvidia-container-runtime
The output should be similar to:
Using nvidia-container-runtime as the container runtime
Aliasing EKS k8s snap commands
Stopping k8s daemons until configured
Stopped.
Cluster "kubernetes" set.
Container runtime is nvidia-container-runtime
Configuring kubelet snap
Starting k8s kubelet daemon
Started.
nvidia-smi found
Persistence mode is already Enabled for GPU 00000000:00:1E.0.
All done.
All done.
GPU 0: Tesla M60 (UUID: GPU-632acfab-13c7-fcf3-a9d5-1833d921cf80)
Applications clocks set to "(MEM 2505, SM 1177)" for GPU 00000000:00:1E.0
All done.
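You can also inspect containerd’s configuration on the node to confirm the NVIDIA runtime is registered, assuming the default configuration path:
sudo grep -i nvidia /etc/containerd/config.toml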
Apply ‘NVIDIA Device Plugin’ to the cluster¶
The ‘NVIDIA Device Plugin’ for Kubernetes is a DaemonSet that automatically exposes and manages the GPUs in each of your nodes and lets you run GPU-enabled containers in your cluster.
Create the DaemonSet by running the following command from your local machine:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
Note
These are ‘quick-start’ instructions that enable the basic features of the plugin. For production settings, use the instructions for deployment via helm.
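For reference, a helm-based deployment roughly follows the pattern below; the repository alias and namespace are illustrative, and you should pin the chart version you intend to run:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.16.1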
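Before checking for allocatable GPUs, you can confirm that the plugin pods have started on the GPU nodes; with the static manifest above they are created in the kube-system namespace:
kubectl get pods -n kube-system | grep nvidia-device-plugin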
Verify that there are allocatable GPUs:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
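Each GPU node should report a non-zero count. The output looks roughly like this, with your own node names and counts:
NAME                                           GPU
ip-192-168-xx-xxx.us-west-2.compute.internal   1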
Test the GPU nodes by deploying a pod¶
Create a file named nvidia-smi.yaml
to act as a Kubernetes manifest for a pod. Include the following contents in it, replacing tag with a CUDA image tag appropriate for your installed driver:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:tag
      args:
        - "nvidia-smi"
      resources:
        limits:
          nvidia.com/gpu: 1
Apply the manifest to create a pod:
kubectl apply -f nvidia-smi.yaml
Once the pod is up and running, check its log using:
kubectl logs nvidia-smi
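If the GPU is exposed to the container correctly, the log shows the usual nvidia-smi table, including the driver version and the Tesla GPU detected earlier. Once you have verified this, the test pod can be removed:
kubectl delete -f nvidia-smi.yaml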