KAI Scheduler

Configure gang scheduling with NVIDIA KAI Scheduler

This guide describes how to enable gang scheduling and advanced resource management with the NVIDIA KAI Scheduler in Kubeflow Trainer.

By integrating KAI Scheduler, you ensure “all-or-nothing” scheduling for distributed training jobs. This means the job only starts if all requested GPU resources are available simultaneously, preventing resource deadlocks in multi-node training.

Prerequisites

  1. Install KAI Scheduler: Follow the KAI Installation Guide to set up the scheduler and the podgrouper service in your Kubernetes cluster.
  2. Define a Queue: KAI uses queues to manage resources. Ensure you have a KAI Queue created (e.g., training-queue) or use the default-queue created during installation.

Enable KAI Plugin

KAI scheduling can be enabled by setting the schedulerName to kai-scheduler in the pod template of your TrainingRuntime or ClusterTrainingRuntime specification.

Note: KAI integrates externally via its PodGrouper component, which monitors pods requesting the kai-scheduler.

Example: ClusterTrainingRuntime with KAI

You can enforce KAI scheduling at the runtime level. This ensures that every job using this runtime automatically utilizes KAI gang-scheduling.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: pytorch-kai-runtime
spec:
  mlPolicy:
    torch:
      numNodes: 1
  template:
    spec:
      schedulerName: kai-scheduler
      containers:
        - name: train
          image: pytorch/pytorch:latest

Example: TrainJob with KAI

Once your runtime is created, you can submit a TrainJob that references it. You can also add the kai.scheduler/queue label to your job to route it to a specific resource queue in KAI.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-kai-job
  labels:
    kai.scheduler/queue: "prod-queue" # KAI Scheduler uses this to route the job
spec:
  runtimeRef:
    name: pytorch-kai-runtime
  trainer:
    numNodes: 4
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1

How it Works

When a TrainJob is created using a runtime configured with the kai-scheduler:

  1. Metadata Propagation: The Trainer Operator applies the necessary labels and annotations to the underlying JobSet.
  2. Pod Grouping: The KAI podgrouper component detects the training pods via the OwnerReference chain and automatically creates a KAI PodGroup resource.
  3. Gang Scheduling: The KAI Scheduler identifies the PodGroup and ensures all replicas (workers) are scheduled at once on nodes assigned to the specified queue.

Feedback

Was this page helpful?