ML Policy

How to configure MLPolicy in Kubeflow Trainer Runtimes

This guide describes how to configure the MLPolicy API in the Kubeflow Trainer Runtimes.

Before exploring this guide, make sure to follow the Runtime guide to understand the basics of Kubeflow Trainer Runtimes.

MLPolicy Overview

The MLPolicy API defines the ML-specific configuration for training jobs, such as the number of training nodes (i.e., Pods) to launch and framework-specific settings like PyTorch options. For example, the following policy runs three training nodes with one process per GPU on each node:

mlPolicy:
  numNodes: 3
  torch:
    numProcPerNode: gpu

Types of MLPolicy

The MLPolicy API supports multiple types, known as MLPolicySources. Each type defines how the training job is launched and orchestrated. You can specify one of the supported sources in the MLPolicy API.

Torch

The Torch policy configures distributed training for PyTorch.

TrainJobs using this policy are launched via the torchrun CLI. You can customize torchrun options such as numProcPerNode, which defines the number of processes (e.g., GPUs) to launch per training node.
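As a minimal sketch, the Torch policy fragment of a runtime spec might look like this (the comments describe the common numProcPerNode values; verify the accepted values against your Kubeflow Trainer version):

mlPolicy:
  numNodes: 2
  torch:
    # Launch one training process per GPU on each node.
    # numProcPerNode also accepts a fixed integer, "cpu", or "auto".
    numProcPerNode: gpu

With this policy, each of the two training nodes runs torchrun, which spawns one worker process per available GPU and wires up the PyTorch distributed environment (rank, world size, rendezvous) automatically.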

MPI

The MPI policy configures distributed training using Message Passing Interface (MPI).

TrainJobs using this policy are launched via the mpirun CLI, the standard entrypoint for MPI-based applications. This makes it compatible with frameworks such as DeepSpeed that rely on OpenMPI for distributed training.

You can customize the MPI options such as numProcPerNode to define the number of slots per training node in the MPI hostfile.
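As an illustrative sketch, an MPI policy fragment might look like the following (the mpiImplementation and sshAuthMountPath fields are assumptions based on the Trainer MPI API; check the field names against your Kubeflow Trainer version):

mlPolicy:
  numNodes: 4
  mpi:
    # Number of slots per training node in the generated MPI hostfile.
    numProcPerNode: 2
    # Assumed fields: which MPI implementation to target and where the
    # SSH keys for the launcher-to-worker connections are mounted.
    mpiImplementation: OpenMPI
    sshAuthMountPath: /root/.ssh

Here the launcher invokes mpirun against a hostfile listing four nodes with two slots each, for eight MPI processes in total.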
