How to Fine-Tune LLMs with Kubeflow
Old Version
This page is about the Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.
This page describes how to use the train API from the Training Python SDK, which simplifies fine-tuning LLMs with distributed PyTorchJob workers.
If you want to learn more about how the fine-tuning API fits in the Kubeflow ecosystem, head to the explanation guide.
Prerequisites
You need to install the Training Python SDK with fine-tuning support to use this API.
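For example, the SDK with Hugging Face fine-tuning support is typically installed from PyPI (the extras name may differ between SDK releases):

pip install -U "kubeflow-training[huggingface]"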
How to use the Fine-Tuning API?
You need to provide the following parameters to use the train API:
- Pre-trained model parameters.
- Dataset parameters.
- Trainer parameters.
- Number of PyTorch workers and resources per worker.
For example, you can use the train API to fine-tune the BERT model using the Yelp Review dataset from the Hugging Face Hub with the code below:
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 3000 samples from Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            eval_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=4,  # nnodes parameter for torchrun command.
    num_procs_per_worker=2,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "gpu": 2,
        "cpu": 5,
        "memory": "10G",
    },
)
After you execute train, the Training Operator will orchestrate the appropriate PyTorchJob resources to fine-tune the LLM.
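For example, you can follow the job with other TrainingClient methods. This is a minimal sketch that assumes the job name used above and the SDK's default PyTorchJob kind:

from kubeflow.training import TrainingClient

client = TrainingClient()

# Check whether the PyTorchJob created by train() has already succeeded.
print(client.is_job_succeeded(name="fine-tune-bert"))

# Stream training logs from the master worker while the job runs.
client.get_job_logs(name="fine-tune-bert", follow=True)

# Delete the PyTorchJob and its pods once fine-tuning is finished.
client.delete_job(name="fine-tune-bert")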
Dataset and Model Parameter Classes
HuggingFaceModelParams
Description
The HuggingFaceModelParams dataclass holds configuration parameters for initializing Hugging Face models, with validation checks.
Attribute | Type | Description |
---|---|---|
model_uri | str | URI or path to the Hugging Face model (must not be empty). |
transformer_type | TRANSFORMER_TYPES | Specifies the model type for various NLP/ML tasks. |
access_token | Optional[str] (default: None) | Token for accessing private models on Hugging Face. |
num_labels | Optional[int] (default: None) | Number of output labels (used for classification tasks). |
Supported Transformer Types (TRANSFORMER_TYPES)
Model Type | Task |
---|---|
AutoModelForSequenceClassification | Text classification |
AutoModelForTokenClassification | Named entity recognition |
AutoModelForQuestionAnswering | Question answering |
AutoModelForCausalLM | Text generation (causal) |
AutoModelForMaskedLM | Masked language modeling |
AutoModelForImageClassification | Image classification |
Example Usage
from transformers import AutoModelForSequenceClassification
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

params = HuggingFaceModelParams(
    model_uri="bert-base-uncased",
    transformer_type=AutoModelForSequenceClassification,
    access_token="huggingface_access_token",
    num_labels=2,  # For binary classification
)
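As another sketch, the same dataclass can describe a text-generation model by selecting a causal LM transformer type from the table above (the TinyLlama URI below is only an illustration):

from transformers import AutoModelForCausalLM
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

# Hypothetical causal LM configuration for text generation.
causal_lm_params = HuggingFaceModelParams(
    model_uri="hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    transformer_type=AutoModelForCausalLM,
)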
HuggingFaceDatasetParams
Description
The HuggingFaceDatasetParams class holds configuration parameters for loading datasets from Hugging Face, with validation checks.
Attribute | Type | Description |
---|---|---|
repo_id | str | Identifier of the dataset repository on Hugging Face (must not be empty). |
access_token | Optional[str] (default: None) | Token for accessing private datasets on Hugging Face. |
split | Optional[str] (default: None) | Dataset split to load (e.g., "train", "test"). |
Example Usage
from kubeflow.storage_initializer.hugging_face import HuggingFaceDatasetParams

dataset_params = HuggingFaceDatasetParams(
    repo_id="imdb",     # Public dataset repository ID on Hugging Face
    split="train",      # Dataset split to load
    access_token=None,  # Not needed for public datasets
)
HuggingFaceTrainerParams
Description
The HuggingFaceTrainerParams class defines parameters for the training process in the Hugging Face framework. It includes the training arguments and the LoRA configuration to optimize model training.
Parameter | Type | Description |
---|---|---|
training_parameters | transformers.TrainingArguments | Contains the training arguments like learning rate, epochs, batch size, etc. |
lora_config | LoraConfig | LoRA configuration to reduce the number of trainable parameters in the model. |
Example Usage
from transformers import TrainingArguments
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

trainer_params = HuggingFaceTrainerParams(
    training_parameters=TrainingArguments(
        output_dir="results",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=8,
    ),
    lora_config=LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
    ),
)
S3DatasetParams
Description
The S3DatasetParams class is used for loading datasets from S3-compatible object storage. It includes validation checks to ensure proper configuration.
Parameter | Type | Description |
---|---|---|
endpoint_url | str | URL of the S3-compatible storage service. |
bucket_name | str | Name of the S3 bucket containing the dataset. |
file_key | str | Key (path) to the dataset file within the bucket. |
region_name | str, optional | The AWS region of the S3 bucket. |
access_key | str, optional | The access key for authentication with S3. |
secret_key | str, optional | The secret key for authentication with S3. |
Implementation Details
The S3DatasetParams class includes validation checks to ensure that required parameters are provided and that the endpoint URL is valid. The actual dataset download is handled by the S3 class, which uses boto3 to interact with the S3-compatible storage.
Example Usage
from kubeflow.storage_initializer.s3 import S3DatasetParams

s3_params = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-dataset-bucket",
    file_key="datasets/train.csv",
    region_name="us-west-2",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
)
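Like the Hugging Face dataset parameters, the resulting object can be passed to train as the dataset provider. The sketch below assumes the params and trainer_params objects from the earlier examples and an illustrative job name:

from kubeflow.training import TrainingClient

TrainingClient().train(
    name="fine-tune-from-s3",                # illustrative job name
    model_provider_parameters=params,        # HuggingFaceModelParams from the example above
    dataset_provider_parameters=s3_params,   # S3DatasetParams defined above
    trainer_parameters=trainer_params,       # HuggingFaceTrainerParams from the example above
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "8G"},
)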
Using custom images with the Fine-Tuning API
Platform engineers can customize the storage initializer and trainer images by setting the STORAGE_INITIALIZER_IMAGE and TRAINER_TRANSFORMER_IMAGE environment variables before executing the train command. For example, in your Python code, set the environment variables before calling train:
...

os.environ['STORAGE_INITIALIZER_IMAGE'] = 'docker.io/<username>/<custom-storage-initializer-image>'
os.environ['TRAINER_TRANSFORMER_IMAGE'] = 'docker.io/<username>/<custom-trainer-transformer-image>'

TrainingClient().train(...)
Next Steps
- Run the example to fine-tune the TinyLlama LLM.
- Check this example to compare the create_job and train Python APIs for fine-tuning the BERT LLM.
- Understand the architecture behind the train API.