Job Scheduling
This guide describes how to use Kueue, Volcano Scheduler, and Scheduler Plugins with coscheduling to support gang-scheduling in Kubeflow, so that all pods of a job are scheduled and started together.
Running jobs with gang-scheduling
Training Operator and MPI Operator support running jobs with gang-scheduling using Kueue, Volcano Scheduler, and Scheduler Plugins with coscheduling.
Using Kueue with Training Operator Jobs
Follow this guide to learn how to use Kueue with Training Operator Jobs and manage queues for your ML training jobs.
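As a rough sketch of what that setup tends to look like, a Training Operator job is usually pointed at a Kueue queue through the `kueue.x-k8s.io/queue-name` label. The job name and the LocalQueue name `user-queue` below are hypothetical, and the sketch assumes that a LocalQueue with that name already exists in the job's namespace:

```yaml
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-kueue-example                  # hypothetical job name
  namespace: kubeflow
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # hypothetical LocalQueue in this namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
```

With this label in place, Kueue is expected to hold the job until the queue has enough quota to admit it as a whole.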
Scheduler Plugins with coscheduling
First install Scheduler Plugins with coscheduling in your cluster, either as the default scheduler or as a secondary Kubernetes scheduler. Then configure the operator to select that scheduler for gang-scheduling as follows:
- training-operator
...
    spec:
      containers:
      - command:
        - /manager
+       - --gang-scheduler-name=scheduler-plugins
        image: kubeflow/training-operator
        name: training-operator
...
- mpi-operator (with scheduler-plugins installed as the default scheduler)
...
    spec:
      containers:
      - args:
+       - --gang-scheduling=default-scheduler
        - -alsologtostderr
        - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
...
- mpi-operator (with scheduler-plugins installed as a secondary scheduler)
...
    spec:
      containers:
      - args:
+       - --gang-scheduling=scheduler-plugins-scheduler
        - -alsologtostderr
        - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
...
- Follow the instructions in the kubernetes-sigs/scheduler-plugins repository to install the Scheduler Plugins with coscheduling.
Note: Scheduler Plugins and the Kubeflow operators achieve gang-scheduling by using PodGroup resources. The operator creates the PodGroup for the job automatically.
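For orientation, a PodGroup for the coscheduling plugin might look roughly like the sketch below. You do not create it yourself, since the operator does that. The name, namespace, and values are illustrative assumptions, and the exact fields the operator sets may differ between versions:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tfjob-simple            # assumed here to match the job name
  namespace: kubeflow
spec:
  minMember: 2                  # pods that must be schedulable together before any is bound
  scheduleTimeoutSeconds: 60    # illustrative value, not a documented default
```

Listing the PodGroup resources in the job's namespace is one way to check that the operator created the group for your job.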
If you install Scheduler Plugins in your cluster as a secondary scheduler, you need to specify the scheduler name in your custom job resources (e.g., TFJob). For example:
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
+         schedulerName: scheduler-plugins-scheduler
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
If you install Scheduler Plugins as the default scheduler, you don’t need to specify the scheduler name in your custom job resources (e.g., TFJob).
Volcano Scheduler
First install the Volcano scheduler in your cluster as a secondary Kubernetes scheduler. Then configure the operator to select that scheduler for gang-scheduling as follows:
- training-operator
...
    spec:
      containers:
      - command:
        - /manager
+       - --gang-scheduler-name=volcano
        image: kubeflow/training-operator
        name: training-operator
...
- mpi-operator
...
    spec:
      containers:
      - args:
+       - --gang-scheduling=volcano
        - -alsologtostderr
        - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
...
- Follow the instructions in the volcano repository to install Volcano.
Note: The Volcano scheduler and the Kubeflow operators achieve gang-scheduling by using PodGroup resources. The operator creates the PodGroup for the job automatically.
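To make the mechanism concrete, a Volcano PodGroup for a job with two pods in total might look roughly like this. It is a sketch for illustration only: the operator creates the real object, and the name, `minMember` value, and queue shown here are assumptions:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tfjob-gang-scheduling   # hypothetical, assumed to match the job name
spec:
  minMember: 2                  # e.g. one worker plus one parameter server
  queue: default                # assumes Volcano's default queue is used
```

Volcano binds the pods of the group only once at least `minMember` of them can be placed.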
The YAML for running a job as a gang with the Volcano scheduler is the same as for a job without gang-scheduling. For example:
apiVersion: "kubeflow.org/v1beta1"
kind: "TFJob"
metadata:
  name: "tfjob-gang-scheduling"
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              resources:
                limits:
                  cpu: "1"
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
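Although no gang-specific fields are required, a job can optionally tune how its PodGroup is created. The sketch below assumes the `kubeflow.org/v1` API and its `runPolicy.schedulingPolicy` block. The job name, image, and values are hypothetical, so check the API reference of your Training Operator version before relying on these fields:

```yaml
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-gang-tuned          # hypothetical job name
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 2             # pods that must be schedulable together
      queue: default              # scheduler queue to submit the PodGroup to
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/my-training-image:latest   # placeholder image
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/my-training-image:latest   # placeholder image
```

When `schedulingPolicy` is omitted, the operator is expected to derive the group size from the total number of replicas.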
About gang-scheduling
When gang-scheduling is applied with Volcano Scheduler or Scheduler Plugins with coscheduling, a job runs only if there are enough resources for all of its pods. Otherwise, all of its pods stay in the Pending state until enough resources become available. For example, if a job requires N pods and there are only enough resources to schedule N-2 of them, then all N pods of the job stay pending.
Note: Under heavy load, if a pod of a running job dies, other pods may get a chance to occupy the freed resources, which can cause a deadlock.
Troubleshooting
If you keep running into RBAC-related problems with your Volcano scheduler, try adding the following rules to the ClusterRole used by the Volcano scheduler:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'