Overview
This guide describes how to enable gang scheduling with Kubeflow Trainer. Gang scheduling ensures that a group of related training nodes (e.g. Pods) starts only when all required resources are available. This is crucial when working with expensive and limited GPU accelerators.
Before exploring this guide, make sure to follow the Runtime guide to understand the basics of Kubeflow Trainer Runtimes.
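The all-or-nothing behavior described above can be illustrated with a small sketch. This is a toy model, not Kubeflow Trainer or scheduler code: the `Pod` class, the `gang_schedule` and `greedy_schedule` functions, and the GPU counts are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical stand-in for one training node and its GPU request."""
    name: str
    gpus: int


def gang_schedule(pods: list[Pod], free_gpus: int) -> list[str]:
    """Admit every Pod in the group, or none of them."""
    needed = sum(p.gpus for p in pods)
    if needed > free_gpus:
        return []  # not enough capacity: nothing starts, no GPU is held
    return [p.name for p in pods]


def greedy_schedule(pods: list[Pod], free_gpus: int) -> list[str]:
    """Pod-by-pod admission, shown only for contrast."""
    started = []
    for p in pods:
        if p.gpus <= free_gpus:
            free_gpus -= p.gpus
            started.append(p.name)
    return started


group = [Pod("node-0", 4), Pod("node-1", 4), Pod("node-2", 4)]

print(gang_schedule(group, free_gpus=8))    # []
print(greedy_schedule(group, free_gpus=8))  # ['node-0', 'node-1']
print(gang_schedule(group, free_gpus=12))   # ['node-0', 'node-1', 'node-2']
```

With 8 free GPUs, pod-by-pod admission starts two of the three nodes, which then hold 8 GPUs while waiting for a third node that cannot be placed. The gang variant starts nothing in that case and leaves the GPUs available to other jobs.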
PodGroupPolicy Overview
The PodGroupPolicy API defines the configuration for gang scheduling. When this API is used, the Kubeflow Trainer controller creates the appropriate PodGroup to enable gang scheduling for the TrainJob.
Types of PodGroupPolicy
The PodGroupPolicy API supports multiple policies, known as PodGroupPolicySources. Each policy represents the plugin configuration that enables gang scheduling through a specific integration. You can specify one of the supported policies in the PodGroupPolicy API to enable gang scheduling with the corresponding plugin.
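The "one of the supported policies" rule can be sketched as a small check over a policy expressed as a Python dictionary. This is an illustration under assumptions, not the controller's validation logic: the keys `coscheduling` and `volcano` are inferred from the plugins named in this guide, `scheduleTimeoutSeconds` is an assumed field name, and `selected_source` is a hypothetical helper. Consult the PodGroupPolicy API reference for the actual schema.

```python
# Assumed source keys, inferred from the plugins this guide mentions.
SUPPORTED_SOURCES = {"coscheduling", "volcano"}


def selected_source(pod_group_policy: dict) -> str:
    """Return the single policy source set in the dictionary.

    Raises ValueError if an unknown key is present, or if zero or
    several sources are set (interpreting "one of" as exactly one).
    """
    unknown = sorted(k for k in pod_group_policy if k not in SUPPORTED_SOURCES)
    if unknown:
        raise ValueError(f"unsupported policy source(s): {unknown}")
    if len(pod_group_policy) != 1:
        raise ValueError("exactly one policy source must be set")
    return next(iter(pod_group_policy))


# Assumed field name and value, for illustration only.
policy = {"coscheduling": {"scheduleTimeoutSeconds": 120}}
print(selected_source(policy))  # coscheduling
```

Treating the sources as mutually exclusive keeps the choice of scheduler integration explicit in a single place in the policy.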
Next Steps
- Learn how to enable gang scheduling with the Coscheduling plugin.
- Learn how to configure advanced scheduling with Volcano Scheduler.