Volcano
This guide describes how to enable gang scheduling and advanced resource management with the Volcano Scheduler in Kubeflow Trainer.
With Volcano integrated, all Pods of a training job start together (gang scheduling), and you can use AI-oriented scheduling capabilities such as priority scheduling, queue-based resource management, and network topology–aware scheduling.
Prerequisites
You have to install Volcano in your Kubernetes cluster before enabling the Volcano gang scheduling policy.
Enable Volcano Plugin
Volcano scheduling is enabled through the podGroupPolicy field of the runtime (TrainingRuntime or ClusterTrainingRuntime) that your TrainJob references.
Gang Scheduling
To enable gang scheduling, specify the volcano policy in your runtime:
podGroupPolicy:
  volcano: {}
This configuration automatically creates Volcano PodGroups for your training job.
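To show where the policy sits in a complete manifest, here is a minimal sketch of a runtime with gang scheduling enabled. The runtime name, container image, node count, and API version are illustrative assumptions rather than values from this guide, and your installed version of Kubeflow Trainer may require additional fields or labels.

```yaml
# Sketch only: adjust apiVersion, names, and image to match your cluster.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: ClusterTrainingRuntime
metadata:
  name: torch-volcano-example               # hypothetical runtime name
spec:
  mlPolicy:
    numNodes: 2                             # illustrative node count
  podGroupPolicy:
    volcano: {}                             # enables Volcano gang scheduling
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                # Step label used by the stock runtimes (assumed to be required here)
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:latest   # placeholder image
```

A TrainJob would then select this runtime by name through its runtimeRef field.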
Topology Aware Scheduling
Volcano also supports network topology–aware scheduling, which helps place Pods close to each other to minimize communication latency in distributed training. You can configure this behavior under the volcano policy:
podGroupPolicy:
  volcano:
    networkTopology:
      mode: hard
      highestTierAllowed: 1
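In Volcano, the hard mode treats the topology constraint as mandatory, and highestTierAllowed caps the HyperNode tier that the Pods of a job may span. Tiers are defined by HyperNode resources that describe the cluster's network topology. The following is a rough sketch of a tier-1 HyperNode, assuming the topology.volcano.sh/v1alpha1 API; the HyperNode name and node names are hypothetical.

```yaml
# Sketch only: groups two nodes under one tier-1 HyperNode (for example, one switch).
apiVersion: topology.volcano.sh/v1alpha1   # assumed API version
kind: HyperNode
metadata:
  name: s0                                 # hypothetical HyperNode name
spec:
  tier: 1                                  # lower tiers mean closer network proximity
  members:
    - type: Node
      selector:
        exactMatch:
          name: node-0                     # hypothetical node name
    - type: Node
      selector:
        exactMatch:
          name: node-1                     # hypothetical node name
```

With highestTierAllowed: 1 and mode: hard, the Pods of a job would have to fit within a single tier-1 HyperNode such as this one.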
Using Queues for Priority Scheduling
Volcano supports queue-based resource management, where multiple PodGroups are placed in queues and scheduled based on their priority and available capacity.
First, you have to create a custom queue.
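A queue is a cluster-scoped Volcano Queue resource. As a minimal sketch, the manifest below creates a queue named to match the annotation used in this section; the weight and reclaimable values are illustrative assumptions, not recommendations.

```yaml
# Sketch only: tune weight and reclaimable for your cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high-priority-queue
spec:
  weight: 10           # relative share of cluster resources among queues (illustrative)
  reclaimable: false   # do not let other queues reclaim this queue's resources
```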
Then, reference this queue in the annotations of the TrainJob:
spec:
  annotations:
    scheduling.volcano.sh/queue-name: "high-priority-queue"
Alternatively, you can set the queue annotation in the runtime, so that it applies to every TrainJob that uses that runtime:
spec:
  podGroupPolicy:
    volcano: {}
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/queue-name: "high-priority-queue"