Kubeflow Trainer Extension Framework

Core concepts and architecture of the Kubeflow Trainer Extension Framework

This guide describes the Kubeflow Trainer Extension Framework’s core concepts and its four execution phases: startup, pre-execution, build, and post-execution.

Overview

The Kubeflow Trainer Extension Framework is an internal mechanism of Kubeflow Trainer that allows new combinations of Runtimes and TrainJobs to be supported through plugins. It exposes extension points and manages the component lifecycle across four phases.

[Figure: Kubeflow Trainer Extension Framework architecture]

Purpose and Intended Users

The Kubeflow Trainer Extension Framework provides a flexible, extensible mechanism for managing and executing machine learning training jobs within Kubeflow Trainer. It is designed for platform administrators who need to extend Kubeflow Trainer with custom plugins that fit their specific requirements.

Examples

To illustrate how the framework can be used, consider the following scenarios:

- Custom Validation: A user wants to add validation logic that ensures certain fields in the TrainJob are correctly configured before execution.
- Dynamic Resource Deployment: A user needs to deploy Kubernetes resources dynamically, based on the requirements of a specific ML framework.

Core Concepts

Phases

The Kubeflow Trainer Extension Framework follows a structured, step-by-step execution flow. Each phase represents a logical part of the workflow:

- Startup Phase: Executes once during the initialization of the kubeflow-trainer-controller-manager. This phase sets up necessary internal components.
- PreExecution Phase: Triggered when a TrainJob is created or updated. This phase validates and prepares the job for execution.
- Build Phase: Builds and deploys the required Kubernetes resources for training jobs.
- PostExecution Phase: Runs after the job has been built and executed, checking the status and applying relevant conditions to the job.
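
The ordering of the phases can be illustrated with a small Go sketch. It is not taken from the Kubeflow Trainer code base: the `Phase`, `Framework`, and `HandleEvent` names are hypothetical, and the sketch only assumes what the list above states, namely that the Startup phase runs once while the other three phases run for each TrainJob create or update event.

```go
package main

import "fmt"

// Phase names mirror the four phases described above; the type is hypothetical.
type Phase string

const (
	Startup       Phase = "Startup"
	PreExecution  Phase = "PreExecution"
	Build         Phase = "Build"
	PostExecution Phase = "PostExecution"
)

// TrainJob is a simplified stand-in for the real custom resource.
type TrainJob struct {
	Name string
}

// Framework records which phases ran, in order (illustrative only).
type Framework struct {
	Trace []Phase
}

// NewFramework models the Startup phase, which runs once per
// controller-manager process.
func NewFramework() *Framework {
	return &Framework{Trace: []Phase{Startup}}
}

// HandleEvent models the per-TrainJob phases, which run for every
// create or update event.
func (f *Framework) HandleEvent(job TrainJob) {
	_ = job // the sketch does not inspect the job itself
	f.Trace = append(f.Trace, PreExecution, Build, PostExecution)
}

func main() {
	f := NewFramework()
	f.HandleEvent(TrainJob{Name: "job-a"})
	f.HandleEvent(TrainJob{Name: "job-b"})
	// Startup appears once; the other phases repeat per event.
	fmt.Println(f.Trace)
}
```

Keeping Startup in the constructor and the remaining phases in a per-event method is one way to make the "once versus per job" distinction explicit in code.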

APIs and Extension Points

In each phase, there are two types of components:

- Internal APIs: These APIs are used internally by the framework and cannot be extended or modified by the user.
- Extension Points: These points are exposed to the user and allow for customization through plugins that can be added to the framework.

[Figure: Kubeflow Trainer Extension Framework]

Phase Details

1. Startup Phase

Purpose: Initialize the Kubeflow Trainer Extension Framework and set up necessary components for managing training jobs.

Internal APIs:
- Initialize Kubeflow Trainer Extension Framework: Sets up the framework itself, including its registered plugins.
- TrainJobController: Configures the TrainJob controller and registers it with the Manager.
- Built-in Webhook Servers: Initializes the validation webhook servers that handle TrainJob creation and update requests.
- Start Manager: Starts the main management process.

Extension Point:
- WatchExtension: Registers custom reconciler builders that watch specific Kubernetes resources and trigger TrainJob reconciliations as necessary.
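
The idea behind WatchExtension can be sketched as follows. This is a simplified model, not the framework's actual interface: the `WatchExtension` interface shown here, the `Event` and `Controller` types, and the owner-based mapping are all assumptions made for illustration. Each plugin declares which resource kinds it watches and how an event on such a resource maps back to a TrainJob that should be reconciled.

```go
package main

import "fmt"

// Event is a hypothetical, simplified notification about a watched resource.
type Event struct {
	Kind     string // kind of the resource that changed, e.g. "PodGroup"
	OwnerJob string // name of the owning TrainJob, empty if none
}

// WatchExtension is a hypothetical plugin interface for this sketch.
type WatchExtension interface {
	// Kinds lists the resource kinds the plugin wants to watch.
	Kinds() []string
	// MapToTrainJob returns the TrainJob to reconcile for an event, if any.
	MapToTrainJob(e Event) (string, bool)
}

// ownerWatcher enqueues the owning TrainJob of any watched resource.
type ownerWatcher struct{ kinds []string }

func (w ownerWatcher) Kinds() []string { return w.kinds }

func (w ownerWatcher) MapToTrainJob(e Event) (string, bool) {
	return e.OwnerJob, e.OwnerJob != ""
}

// Controller keeps the registered watchers and a reconcile queue.
type Controller struct {
	watchers map[string][]WatchExtension
	Queue    []string
}

// NewController registers every extension under each kind it watches.
func NewController(exts ...WatchExtension) *Controller {
	c := &Controller{watchers: map[string][]WatchExtension{}}
	for _, ext := range exts {
		for _, k := range ext.Kinds() {
			c.watchers[k] = append(c.watchers[k], ext)
		}
	}
	return c
}

// Observe dispatches an event to the watchers registered for its kind.
func (c *Controller) Observe(e Event) {
	for _, ext := range c.watchers[e.Kind] {
		if name, ok := ext.MapToTrainJob(e); ok {
			c.Queue = append(c.Queue, name)
		}
	}
}

func main() {
	c := NewController(ownerWatcher{kinds: []string{"PodGroup"}})
	c.Observe(Event{Kind: "PodGroup", OwnerJob: "job-a"})  // enqueued
	c.Observe(Event{Kind: "ConfigMap", OwnerJob: "job-b"}) // kind not watched
	c.Observe(Event{Kind: "PodGroup"})                     // no owner, ignored
	fmt.Println(c.Queue)
}
```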

2. PreExecution Phase

Purpose: Triggered when a TrainJob is created or updated, this phase validates the job and prepares it for execution.

Extension Point:
- CustomValidation: Registers custom validation logic that checks resources before a TrainJob is executed, for example by verifying specific fields or configurations.
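
A minimal sketch of a custom validation plugin follows. The `CustomValidation` interface, the `ValidatorFunc` adapter, the `RunPreExecution` helper, and the `TrainJob` fields are hypothetical stand-ins chosen for this example and should not be read as the framework's real signatures. The sketch assumes that every registered validator runs and that their errors are aggregated, so a user sees all problems at once.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainJob is a simplified stand-in with hypothetical fields.
type TrainJob struct {
	Name     string
	NumNodes int
	Image    string
}

// CustomValidation is a hypothetical plugin interface for this sketch.
type CustomValidation interface {
	Validate(job TrainJob) []error
}

// ValidatorFunc adapts a plain function to the CustomValidation interface.
type ValidatorFunc func(TrainJob) []error

func (f ValidatorFunc) Validate(job TrainJob) []error { return f(job) }

// requireImage rejects jobs without a container image.
var requireImage = ValidatorFunc(func(job TrainJob) []error {
	if job.Image == "" {
		return []error{fmt.Errorf("trainJob %q: image must be set", job.Name)}
	}
	return nil
})

// positiveNodes rejects jobs that request fewer than one node.
var positiveNodes = ValidatorFunc(func(job TrainJob) []error {
	if job.NumNodes < 1 {
		return []error{fmt.Errorf("trainJob %q: numNodes must be at least 1", job.Name)}
	}
	return nil
})

// RunPreExecution runs every validator and joins all reported errors.
// It returns nil when no validator reports a problem.
func RunPreExecution(job TrainJob, plugins ...CustomValidation) error {
	var all []error
	for _, p := range plugins {
		all = append(all, p.Validate(job)...)
	}
	return errors.Join(all...)
}

func main() {
	bad := TrainJob{Name: "demo"}
	if err := RunPreExecution(bad, requireImage, positiveNodes); err != nil {
		fmt.Println("rejected:")
		fmt.Println(err)
	}
	good := TrainJob{Name: "demo", NumNodes: 2, Image: "example.com/train:latest"}
	fmt.Println("accepted:", RunPreExecution(good, requireImage, positiveNodes) == nil)
}
```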

3. Build Phase

Purpose: In this phase, the required Kubernetes resources are built and deployed to the cluster for execution.

Internal API:
- TrainJobController: Deploys the built components (Kubernetes resources) to the cluster as part of the reconciliation process.

Extension Points:
- EnforcePodGroupPolicy: Configures pod-specific parameters, such as those specified in TrainingRuntime.spec.podGroupPolicy, for any relevant resources (like PodSpecs).
- EnforceMLPolicy: Configures machine-learning-specific parameters from TrainingRuntime.spec.mlPolicy to adjust the deployment of training resources.
- ComponentBuilder: Builds Kubernetes resources from the RuntimeInfo and TrainJob objects, allowing for a dynamic, runtime-specific deployment configuration.
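
How the three extension points might cooperate can be sketched as below. Everything here is an assumption for illustration: the `RuntimeInfo` fields, the `Enforcer` and `Builder` function types, the `example.com/pod-group` label, and the `WORLD_SIZE` variable are hypothetical. The text does not specify the relative order of the two enforce steps, so the sketch simply applies them in the order they are passed in, before any builder runs.

```go
package main

import (
	"fmt"
	"strconv"
)

// TrainJob is a simplified stand-in with hypothetical fields.
type TrainJob struct {
	Name     string
	NumNodes int
}

// RuntimeInfo is a hypothetical, reduced version of the shared build state.
type RuntimeInfo struct {
	NumNodes int
	Env      map[string]string
	Labels   map[string]string
}

// Object is a placeholder for a Kubernetes resource to deploy.
type Object struct{ Kind, Name string }

// Enforcer mutates RuntimeInfo; Builder turns it into resources.
type Enforcer func(info *RuntimeInfo, job TrainJob)
type Builder func(info RuntimeInfo, job TrainJob) []Object

// enforcePodGroupPolicy stamps a (hypothetical) scheduling label.
func enforcePodGroupPolicy(info *RuntimeInfo, job TrainJob) {
	info.Labels["example.com/pod-group"] = job.Name
}

// enforceMLPolicy copies ML-specific settings into the runtime state.
func enforceMLPolicy(info *RuntimeInfo, job TrainJob) {
	info.NumNodes = job.NumNodes
	info.Env["WORLD_SIZE"] = strconv.Itoa(job.NumNodes)
}

// buildWorkers emits one placeholder Pod per node.
func buildWorkers(info RuntimeInfo, job TrainJob) []Object {
	objs := make([]Object, 0, info.NumNodes)
	for i := 0; i < info.NumNodes; i++ {
		objs = append(objs, Object{Kind: "Pod", Name: fmt.Sprintf("%s-node-%d", job.Name, i)})
	}
	return objs
}

// RunBuild applies all enforcers first, then collects builder output.
func RunBuild(job TrainJob, enforcers []Enforcer, builders []Builder) (RuntimeInfo, []Object) {
	info := RuntimeInfo{Env: map[string]string{}, Labels: map[string]string{}}
	for _, e := range enforcers {
		e(&info, job)
	}
	var objs []Object
	for _, b := range builders {
		objs = append(objs, b(info, job)...)
	}
	return info, objs
}

func main() {
	job := TrainJob{Name: "demo", NumNodes: 2}
	info, objs := RunBuild(job,
		[]Enforcer{enforcePodGroupPolicy, enforceMLPolicy},
		[]Builder{buildWorkers})
	fmt.Println(info.Env["WORLD_SIZE"], info.Labels["example.com/pod-group"])
	fmt.Println(objs)
}
```

Separating the enforce steps (which only adjust shared state) from the builders (which only read it) keeps each plugin small; in this model the TrainJobController would then deploy the returned objects.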

4. PostExecution Phase

Purpose: After the TrainJob has been executed, the framework checks the state of the job and applies terminal conditions if necessary.
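
The status check described above can be sketched as a function that derives a terminal condition from the observed workload state. The `WorkloadStatus` fields, the condition type and reason strings, and the rule that any failure is terminal are assumptions made for this example, not the framework's documented behavior.

```go
package main

import "fmt"

// Condition is a reduced version of a Kubernetes-style status condition.
type Condition struct{ Type, Reason string }

// TrainJob is a simplified stand-in that only carries conditions.
type TrainJob struct {
	Name       string
	Conditions []Condition
}

// WorkloadStatus is a hypothetical summary of the underlying workload.
type WorkloadStatus struct {
	Desired, Succeeded, Failed int
}

// terminalCondition reports a terminal condition, if one applies.
// Assumed rule: any failure is terminal; success needs all nodes done.
func terminalCondition(s WorkloadStatus) (Condition, bool) {
	switch {
	case s.Failed > 0:
		return Condition{Type: "Failed", Reason: "WorkloadFailed"}, true
	case s.Desired > 0 && s.Succeeded == s.Desired:
		return Condition{Type: "Complete", Reason: "WorkloadSucceeded"}, true
	}
	return Condition{}, false
}

// RunPostExecution applies the terminal condition at most once per type,
// so repeated reconciliations do not duplicate it.
func RunPostExecution(job *TrainJob, s WorkloadStatus) {
	cond, ok := terminalCondition(s)
	if !ok {
		return // job is still running; nothing to apply
	}
	for _, c := range job.Conditions {
		if c.Type == cond.Type {
			return
		}
	}
	job.Conditions = append(job.Conditions, cond)
}

func main() {
	job := &TrainJob{Name: "demo"}
	RunPostExecution(job, WorkloadStatus{Desired: 2, Succeeded: 1}) // still running
	RunPostExecution(job, WorkloadStatus{Desired: 2, Succeeded: 2}) // completes
	RunPostExecution(job, WorkloadStatus{Desired: 2, Succeeded: 2}) // no duplicate
	fmt.Println(job.Conditions)
}
```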

Last modified September 23, 2025: trainer: Introduce trainer pipeline framework for new users in kubeflow trainer v2 (#4039) (0d3d2c81)