Overview

Introduction to gang scheduling with Kubeflow Trainer

This guide describes how to enable gang scheduling with Kubeflow Trainer. Gang scheduling ensures that a group of related training nodes (e.g. Pods) starts only when all required resources are available. This is crucial when working with expensive and limited GPU accelerators, because a partially scheduled job would otherwise hold accelerators idle while waiting for its remaining Pods.

Before following this guide, read the Runtime guide to understand the basics of Kubeflow Trainer Runtimes.

PodGroupPolicy Overview

The PodGroupPolicy API defines the configuration for gang scheduling. When this API is used, the Kubeflow Trainer controller creates the appropriate PodGroup to enable gang scheduling for the TrainJob.

Types of PodGroupPolicy

The PodGroupPolicy API supports multiple policies, known as PodGroupPolicySources. Each policy holds the plugin configuration that enables gang scheduling through one specific integration. To enable gang scheduling, specify one of the supported policies in the PodGroupPolicy API.
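
The "one of the supported policies" rule can be illustrated with a minimal Python sketch that builds a runtime manifest fragment as a plain dictionary. Several names here are assumptions rather than something this page confirms: the source keys `coscheduling` and `volcano`, the `scheduleTimeoutSeconds` field, the `trainer.kubeflow.org/v1alpha1` API version, and the `ClusterTrainingRuntime` kind. The helper functions are hypothetical. Check all of these against the CRDs installed in your cluster before relying on them.

```python
import json

# Assumption: these keys mirror the integrations named in this guide.
# Verify them against the PodGroupPolicy schema of your Trainer version.
ASSUMED_SOURCES = ("coscheduling", "volcano")


def pod_group_policy(**sources):
    """Build a podGroupPolicy fragment, allowing exactly one policy source."""
    unknown = sorted(set(sources) - set(ASSUMED_SOURCES))
    if unknown:
        raise ValueError(f"unsupported policy source(s): {unknown}")
    if len(sources) != 1:
        raise ValueError("specify exactly one policy source")
    return {"podGroupPolicy": dict(sources)}


def runtime_fragment(name, policy):
    """Wrap the policy in an incomplete runtime manifest (no Pod template)."""
    return {
        # Assumption: API version and kind as commonly used by Trainer runtimes.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "ClusterTrainingRuntime",
        "metadata": {"name": name},
        "spec": {**policy},
    }


if __name__ == "__main__":
    # Assumption: scheduleTimeoutSeconds is a setting of the coscheduling source.
    policy = pod_group_policy(coscheduling={"scheduleTimeoutSeconds": 120})
    print(json.dumps(runtime_fragment("example-runtime", policy), indent=2))
```

The fragment deliberately omits the job template and any ML policy, so it is not a complete runtime. It only shows where the chosen policy source would sit and that mixing two sources is rejected.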

Next Steps

Learn how to enable gang scheduling with the Coscheduling plugin.
Learn how to configure advanced scheduling with Volcano Scheduler.
Learn how to configure job queueing and resource management with Kueue.

Last modified November 12, 2025: trainer: Add Kueue cross-reference to Job Scheduling documentation (#4235) (f376e4de)