Configure workloads for advanced node-pool scheduling¶
The following guides show how to schedule workloads on specific node pools, both in general and when Charmed Kubeflow is set up with the most advanced scheduling capabilities available.
Configure workloads for general advanced node-pool scheduling¶
This guide describes how to configure different Charmed Kubeflow (CKF) workloads, such as Notebooks, Pipeline steps, and distributed jobs, to align with specific scheduling patterns that might be required.
Requirements¶
A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
An underlying Kubernetes (K8s) cluster with multiple nodes and labels.
Notebooks¶
You can configure Notebooks to be scheduled on specific nodes via the Notebooks page in the Kubeflow dashboard when creating a new Notebook.
Note
Configuring the Notebook creation page is intended only for admins. See this guide for more details.
To do so, configure the Affinity and Tolerations settings during Notebook creation:
Click + Create Notebook.
Scroll to the bottom and expand Advanced Options.
Configure the Affinity and Tolerations sections.
Note
If your cluster uses Taints and Tolerations, see Add Tolerations for more details.
Pipeline steps¶
K8s-specific settings for a Kubeflow pipeline step, such as nodeSelectors and Tolerations, can be configured via the kfp-kubernetes Python package.
The following example sets both in a pipeline step:
from kfp import dsl
from kfp.kubernetes import add_node_selector, add_toleration


@dsl.component(base_image="python:3.12")
def print_node_name():
    """Print the Node's hostname."""
    import socket

    print("Node name: %s" % socket.gethostname())


@dsl.pipeline
def node_scheduling_pipeline():
    print_node_task = print_node_name()
    # Schedule the step only on nodes labelled sku=pool-1.
    task = add_node_selector(print_node_task, "sku", "pool-1")
    # Tolerate any NoSchedule taint with the key "sku".
    task = add_toleration(task, key="sku", operator="Exists", effect="NoSchedule")
Distributed training¶
Distributed training in CKF is achieved via the Katib and Training Operator components. Katib Trials can be implemented with different job types, which may use default settings defined in Trial Templates. These can include standard K8s Jobs or Distributed training jobs via the Training Operator.
All Trial definitions ultimately configure a PodSpec for the Trial’s Pods.
To schedule Trial Pods on a specific node pool, configure the nodeSelector and Tolerations of the PodSpec.
Below is an example of a TFJob that can be used in a Trial definition and sets both for its PS and Worker replicas:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector: # Scheduling
            pool: pool1
          tolerations: # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector: # Scheduling
            pool: pool1
          tolerations: # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              resources:
                limits:
                  nvidia.com/gpu: 1
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
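To reason about whether a PodSpec like the one in the TFJob can land on a given node, the following is a minimal sketch of the two checks involved: nodeSelector label matching and toleration-to-taint matching. It is a simplified model for illustration, not the Kubernetes scheduler itself. The function names and the dictionary describing the node are hypothetical, and the node's label and taint values are assumptions chosen to match the TFJob example.

```python
def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration tolerates a single taint."""
    # A toleration without an effect matches taints with any effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")  # "Equal" is the K8s default
    if operator == "Exists":
        # An empty key with "Exists" tolerates every taint key.
        return not key or key == taint.get("key")
    # "Equal": both key and value must match.
    return (
        key == taint.get("key")
        and toleration.get("value", "") == taint.get("value", "")
    )


def pod_fits_node(pod_spec: dict, node: dict) -> bool:
    """Simplified check: nodeSelector labels plus hard taints only."""
    labels = node.get("labels", {})
    selector = pod_spec.get("nodeSelector", {})
    # Every nodeSelector entry must be present on the node with the same value.
    if any(labels.get(k) != v for k, v in selector.items()):
        return False
    tolerations = pod_spec.get("tolerations", [])
    for taint in node.get("taints", []):
        if taint.get("effect") == "PreferNoSchedule":
            continue  # soft taint, does not block scheduling
        if not any(toleration_matches(t, taint) for t in tolerations):
            return False
    return True


# Hypothetical node in the target pool (label and taint values are assumptions).
node = {
    "labels": {"pool": "pool1"},
    "taints": [{"key": "sku", "value": "pool1", "effect": "NoSchedule"}],
}

# Scheduling fields copied from the TFJob replicas.
pod_spec = {
    "nodeSelector": {"pool": "pool1"},
    "tolerations": [
        {"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "pool1"}
    ],
}

print(pod_fits_node(pod_spec, node))
```

With the Equal operator, the toleration's value must be the same as the taint's value. The Exists operator used in the pipeline step example only checks the key.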
KServe InferenceServices¶
KServe InferenceServices expose PodSpec attributes that can be used for configuring advanced scheduling scenarios.
The following example sets a nodeSelector and a Toleration on the predictor:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    nodeSelector: # Scheduling
      sku: pool-1
    tolerations: # Scheduling
      - key: "sku"
        operator: "Exists"
        effect: "NoSchedule"
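If the manifest is generated from Python, the scheduling fields can be added before it is submitted. The sketch below assumes the manifest is held as a plain dictionary; the helper name `with_scheduling` is hypothetical and not part of KServe. It illustrates that nodeSelector and tolerations sit on the predictor, next to `model`, rather than inside it.

```python
import copy


def with_scheduling(isvc: dict, node_selector: dict, tolerations: list) -> dict:
    """Return a copy of an InferenceService manifest with scheduling fields set."""
    patched = copy.deepcopy(isvc)  # leave the caller's manifest untouched
    predictor = patched.setdefault("spec", {}).setdefault("predictor", {})
    # PodSpec-level fields belong on the predictor, alongside "model".
    predictor["nodeSelector"] = dict(node_selector)
    predictor["tolerations"] = [dict(t) for t in tolerations]
    return patched


isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model",
            }
        }
    },
}

scheduled = with_scheduling(
    isvc,
    node_selector={"sku": "pool-1"},
    tolerations=[{"key": "sku", "operator": "Exists", "effect": "NoSchedule"}],
)
print(sorted(scheduled["spec"]["predictor"]))
```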
Configure workloads for the most advanced node-pool scheduling possible¶
The following guide shows how to schedule workloads to available, preconfigured node pools as selectively as possible.
Warning
Notebooks created from the Dashboard can rely on PodDefaults to add the labels that disable namespace-node-affinity-operator, and can therefore target a node pool other than the Profile’s default one. The API for creating Kubeflow Notebooks programmatically does not allow for that.
Note
This guide does not support migrating or rescheduling Profile workloads. A practical workaround may be to delete and recreate workloads using whichever method best suits your needs.
Requirements¶
Charmed Kubeflow installed by following the Install allowing for advanced node-pool scheduling guide. If Charmed Kubeflow was not installed that way, refer to Configure workloads for general advanced node-pool scheduling for alternative, more general scheduling guidelines.
Procedure¶
Workloads can either be scheduled to their respective Profiles’ default node pools, which is the default behaviour, or deployed to other arbitrary node pools.
Deploy workloads to their respective Profiles’ default node pools¶
Create workloads without extra precautions. By default, they are already injected with affinities for their Profile’s default node pool, and also with tolerations when Juju-system components are segregated.
Deploy workloads to some other, arbitrary node pools¶
Ensure the specific workloads are defined with:
The (set of) label(s) configured for disabling namespace-node-affinity-operator for the Profile. For instance, following the example set in Install allowing for advanced node-pool scheduling, the label exclude-me-from-namespace-node-affinity-operator="true".
Node affinity (of type requiredDuringSchedulingIgnoredDuringExecution and not preferredDuringSchedulingIgnoredDuringExecution) matching the label of the target node pool.
Tolerations matching the taint of the target node pool.
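The three requirements above can be sketched as a Pod template fragment built in Python. This is a minimal illustration under stated assumptions: the helper name `pod_template_for_pool` is hypothetical, the node pool label `pool=pool2` and the taint `sku=pool2:NoSchedule` are placeholder values, and the opt-out label is the one from the installation guide's example. Use the labels and taints actually configured on your cluster.

```python
def pod_template_for_pool(
    label_key: str,
    label_value: str,
    taint_key: str,
    taint_value: str,
    opt_out_labels: dict,
) -> dict:
    """Build Pod template metadata and spec fields targeting one node pool."""
    return {
        # 1. Labels that disable namespace-node-affinity-operator for this Pod.
        "metadata": {"labels": dict(opt_out_labels)},
        "spec": {
            # 2. Hard (required) node affinity on the target pool's label.
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": label_key,
                                        "operator": "In",
                                        "values": [label_value],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            # 3. Toleration for the target pool's taint (NoSchedule assumed).
            "tolerations": [
                {
                    "key": taint_key,
                    "operator": "Equal",
                    "value": taint_value,
                    "effect": "NoSchedule",
                }
            ],
        },
    }


template = pod_template_for_pool(
    label_key="pool",       # assumed node pool label key
    label_value="pool2",    # assumed node pool label value
    taint_key="sku",        # assumed taint key
    taint_value="pool2",    # assumed taint value
    opt_out_labels={"exclude-me-from-namespace-node-affinity-operator": "true"},
)
print(template["metadata"]["labels"])
```

The resulting dictionary can be merged into the Pod template of the workload, for example under `template` in a TFJob replica spec.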
For more information on configuring node affinity and tolerations to specific workloads, see configure workloads for general advanced node-pool scheduling above.