How to install Training Operator

This guide describes how to install Training Operator on your Kubernetes cluster. Training Operator is a lightweight Kubernetes controller that orchestrates appropriate Kubernetes workloads to perform distributed ML training and fine-tuning.


These are the minimum requirements for installing Training Operator:

  • Kubernetes >= 1.27
  • kubectl >= 1.27
  • Python >= 3.7
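The version requirements above can be checked programmatically. The following is a minimal sketch, not part of Training Operator: the helper names `parse_version` and `meets_minimum` are hypothetical, and the string `v1.27.3` is only an illustrative value of the kind a cluster or kubectl might report. Versions are compared as integer tuples because a plain string comparison would wrongly rank 1.9 above 1.27.

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MIN_KUBERNETES = (1, 27)
MIN_KUBECTL = (1, 27)
MIN_PYTHON = (3, 7)


def parse_version(text):
    """Extract (major, minor) from strings such as 'v1.27.3' or '3.11.4'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_text, minimum):
    """Return True if the reported version is at least the (major, minor) minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    # The running interpreter can be checked directly.
    print("Python OK:", sys.version_info[:2] >= MIN_PYTHON)
    # Hypothetical version string; substitute what your kubectl reports.
    print("kubectl OK:", meets_minimum("v1.27.3", MIN_KUBECTL))
```

To check a real cluster, pass in the version strings that your own tooling reports.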

Installing Training Operator

To create training jobs, you need to install both the Training Operator control plane and the Python SDK.

Installing Control Plane

You can skip these steps if you have already installed the Kubeflow platform using manifests or package distributions, because the Kubeflow platform includes Training Operator.

Otherwise, you can install Training Operator as a standalone component.

Run the following command to install the stable release (v1.7.0) of the Training Operator control plane:

kubectl apply -k ""

Run the following command to install the Training Operator control plane with the latest changes:

kubectl apply -k ""

After installation, verify that the Training Operator controller is running:

$ kubectl get pods -n kubeflow

NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
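If you want to automate this check, one possible approach is to parse the `kubectl get pods` output. The sketch below assumes the column layout shown above and that controller pod names start with `training-operator`, as in the sample output. The function name `running_operator_pods` is hypothetical, and `kubectl` is only invoked when the script is run directly.

```python
import subprocess


def running_operator_pods(kubectl_output):
    """Return names of training-operator pods that are Running and fully ready."""
    pods = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line that lacks NAME, READY and STATUS.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if (
            name.startswith("training-operator")
            and status == "Running"
            and ready_count == total
        ):
            pods.append(name)
    return pods


if __name__ == "__main__":
    # Requires kubectl on PATH and access to the cluster.
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "kubeflow"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(running_operator_pods(result.stdout))
```

An empty list means no controller pod is ready yet; the pod may still be starting.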

Run the following command to list the installed Kubernetes CRDs for each supported ML framework:

$ kubectl get crd

NAME                                             CREATED AT
                                                 2023-06-09T00:31:07Z
                                                 2023-06-09T00:31:05Z
                                                 2023-06-09T00:31:09Z
                                                 2023-06-09T00:31:06Z
                                                 2023-06-09T00:31:04Z
                                                 2023-06-09T00:31:04Z

Installing Python SDK

Training Operator provides a Python SDK that simplifies creating distributed training and fine-tuning jobs for data scientists.

Run the following command to install the stable release of the Training Operator SDK:

pip install -U kubeflow-training
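To confirm that the package landed in the active environment, you can query the installed distribution metadata. This is a minimal sketch using the standard library. The helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or later, so on Python 3.7 the `importlib_metadata` backport would be needed instead.

```python
from importlib import metadata


def installed_version(distribution="kubeflow-training"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("kubeflow-training is not installed in this environment")
    else:
        print("kubeflow-training version:", version)
```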

You can also install the Python SDK from a specific GitHub commit, for example:

pip install git+

Install Python SDK with Fine-Tuning Capabilities

To use the train API for LLM fine-tuning with Training Operator, install the Python SDK with the additional Hugging Face packages:

pip install -U "kubeflow-training[huggingface]"
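A quick way to see whether the extra dependencies are importable is to probe for their modules without importing them. In the sketch below, the module names `transformers` and `peft` are an assumption about what the `huggingface` extra installs, so adjust the list to match the package metadata of your SDK version. The helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Assumption: the huggingface extra provides these top-level modules.
ASSUMED_MODULES = ("transformers", "peft")


def missing_modules(names=ASSUMED_MODULES):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed fine-tuning dependencies are importable")
```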

Next steps

Run your first Training Operator Job by following the Getting Started guide.

