Gang Scheduling
This guide describes how to enable gang scheduling with Kubeflow Trainer. Gang scheduling ensures that a group of related training Pods starts only when all the resources they require are available. This is crucial when working with expensive and limited GPU accelerators.
Before exploring this guide, make sure to follow the Runtime guide to understand the basics of Kubeflow Trainer Runtimes.
PodGroupPolicy Overview
The PodGroupPolicy API defines the configuration for gang scheduling. When this API is used, the Kubeflow Trainer controller creates the appropriate PodGroup to enable gang scheduling for the TrainJob.
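As a rough illustration of what "the appropriate PodGroup" can look like, the sketch below shows a PodGroup of the kind used by the Coscheduling plugin covered later in this guide. It is not taken from the Kubeflow Trainer source: the object name, namespace, and the minMember value are hypothetical, and the exact fields the controller sets may differ.

```yaml
# Sketch only: a PodGroup as defined by the scheduler-plugins project.
# Name, namespace, and values are hypothetical placeholders.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-trainjob          # assumed to match the TrainJob name
  namespace: default
spec:
  # Minimum number of Pods that must be schedulable together
  # before any of them is allowed to start.
  minMember: 4
  # How long the scheduler waits for the whole group to fit.
  scheduleTimeoutSeconds: 30
```

The key idea is that the scheduler treats the Pods as a unit: if fewer than minMember Pods can be placed, none of them start.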
Types of PodGroupPolicy
The PodGroupPolicy API supports multiple policies, known as PodGroupPolicySources. Each policy represents the plugin configuration needed to enable gang scheduling with that specific integration. You can specify one of the supported policies in the PodGroupPolicy API to enable gang scheduling with the corresponding plugin.
Coscheduling
The Coscheduling policy configures gang scheduling with the Coscheduling plugin:
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 30
You have to install and enable the Coscheduling plugin in your Kubernetes cluster before using this policy.
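To show where the podGroupPolicy snippet above might sit, here is a minimal sketch of a runtime that includes it. The apiVersion, the runtime name, and the mlPolicy values are assumptions for illustration, and the Job template is left out; see the Runtime guide for the authoritative layout.

```yaml
# Sketch only: placement of podGroupPolicy inside a runtime spec.
# apiVersion, metadata.name, and mlPolicy values are assumptions.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang   # hypothetical name
spec:
  mlPolicy:
    numNodes: 2                  # hypothetical number of training nodes
  podGroupPolicy:
    coscheduling:
      # Time the scheduler waits for all Pods in the group to fit.
      scheduleTimeoutSeconds: 30
  # template: the Job template for the training nodes is omitted here.
```

With a runtime like this, every TrainJob that references it would be gang-scheduled, so the policy does not need to be repeated per job.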