GenAI Use Cases
Powering GenAI Use Cases with Kubeflow
Kubeflow Projects are powering every stage of the GenAI application lifecycle
From generating synthetic data to retrieval-augmented generation (RAG), fine-tuning large language models (LLMs), hyperparameter optimization, inference at scale, and evaluation, Kubeflow’s modular, Kubernetes-native architecture makes building end-to-end GenAI pipelines both reproducible and production-ready.
Table of Contents
- Synthetic Data Generation
- Retrieval-Augmented Generation (RAG)
- Scaling RAG Data Transformation with Spark
- Fine-Tuning LLMs
- Hyperparameter Optimization
- Inference at Scale
- Model Evaluation
- Beyond Core Use Cases
Synthetic Data Generation
When real data is scarce or sensitive, Kubeflow Pipelines automates the creation of high-fidelity synthetic datasets using techniques like GANs, VAEs, and statistical copulas. By defining parameterized pipeline components that:
- Train synthesizer models
- Validate synthetic output against privacy/fidelity criteria
- Parallelize jobs across on-prem or cloud clusters
teams can accelerate experimentation without compromising compliance. Learn more in the Synthetic Data Generation with KFP guide.
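A minimal sketch of such a pipeline with the KFP v2 SDK is shown below; the component bodies, base image, and synthesizer techniques are illustrative placeholders rather than code from the guide.

```python
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def train_synthesizer(dataset_uri: str, technique: str, epochs: int) -> str:
    """Placeholder: fit a GAN/VAE/copula synthesizer and return the synthetic dataset URI."""
    # A real component would load the source data, train the chosen synthesizer,
    # and write the generated dataset to object storage.
    return f"{dataset_uri}/synthetic-{technique}"


@dsl.component(base_image="python:3.11")
def validate_synthetic(synthetic_uri: str, min_fidelity: float) -> bool:
    """Placeholder: check privacy and fidelity criteria before releasing the data."""
    print(f"Validating {synthetic_uri} against fidelity >= {min_fidelity}")
    return True


@dsl.pipeline(name="synthetic-data-generation")
def synthetic_data_pipeline(dataset_uri: str, epochs: int = 50, min_fidelity: float = 0.9):
    # Fan out one synthesizer per technique; each trial runs in its own pod.
    with dsl.ParallelFor(["gan", "vae", "copula"]) as technique:
        trained = train_synthesizer(
            dataset_uri=dataset_uri, technique=technique, epochs=epochs
        )
        validate_synthetic(synthetic_uri=trained.output, min_fidelity=min_fidelity)


compiler.Compiler().compile(synthetic_data_pipeline, "synthetic_data_pipeline.yaml")
```

The ParallelFor loop fans each synthesizer out to its own pod, so the techniques are trained and validated concurrently on whatever cluster the pipeline runs against.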
Retrieval-Augmented Generation (RAG)
RAG combines vector-based retrieval with generative models to produce contextually grounded outputs:
- Indexing & Storage
  - Ingest document embeddings into Feast’s vector store (e.g., Milvus)
- Retrieval at Inference
  - Fetch top-k relevant chunks via Feast → Milvus
  - Concatenate retrieved context into the LLM prompt
- Tuning Retrieval Parameters
  - Use Katib to sweep over top_k, temperature, etc., and optimize metrics like BLEU
This workflow is detailed in the Katib RAG blog post and the Feast + Docling RAG tutorial.
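As an illustrative sketch of the retrieval step (not the tutorial's exact code), the snippet below embeds a user question, fetches the top-k chunks from a Feast-backed vector store, and concatenates them into the LLM prompt. The feature view name, the embedding helper, and the exact retrieve_online_documents signature are assumptions; consult the Feast + Docling tutorial for the authoritative setup.

```python
from feast import FeatureStore

# Assumes a Feast repo configured with a vector-enabled online store (e.g., Milvus)
# and a document feature view named "doc_chunks"; names and the retrieval API
# may differ across Feast releases.
store = FeatureStore(repo_path=".")

question = "How do I autoscale a KServe InferenceService?"
query_embedding = embed_query(question)  # hypothetical embedding helper -> list[float]

# Assumed retrieval call: fetch the top-k most similar chunks for the query vector.
results = store.retrieve_online_documents(
    feature="doc_chunks:chunk_text",
    query=query_embedding,
    top_k=5,
).to_dict()

context = "\n---\n".join(str(chunk) for chunk in results.get("chunk_text", []))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
```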
Scaling RAG Data Transformation with Spark
Preprocessing massive document collections for RAG pipelines—text cleaning, chunking, and embedding generation—can become a bottleneck. Kubeflow integrates with the Spark Operator to run distributed Spark jobs on Kubernetes, allowing you to:
- Ingest & Process Raw Documents: Read text files or PDFs from cloud storage (S3, GCS) or databases.
- Text Chunking: Normalize, tokenize, and split content into fixed-size passages with overlap.
- Distributed Embedding: Call embedding services (e.g., OpenAI, HuggingFace) inside Spark UDFs or map operations to parallelize across executor pods.
- Direct Write to Feature Store: Persist chunk embeddings and metadata back into Feast’s offline store or vector store without intermediate bottlenecks.
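A condensed PySpark sketch of this flow, suitable for submission as a SparkApplication via the Spark Operator, might look like the following; the storage paths, chunk sizes, and the sentence-transformers model are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("rag-preprocessing").getOrCreate()

# 1. Ingest raw documents (path and format are illustrative).
docs = spark.read.text("s3a://my-bucket/raw-docs/*.txt").withColumnRenamed("value", "text")

# 2. Normalize and split each document into overlapping fixed-size passages.
@F.udf(returnType=ArrayType(StringType()))
def chunk(text, size=512, overlap=64):
    text = " ".join(text.split())
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

chunks = docs.select(F.explode(chunk("text")).alias("chunk"))

# 3. Embed each partition locally so executor pods parallelize the work.
def embed_partition(rows):
    from sentence_transformers import SentenceTransformer  # loaded once per partition

    model = SentenceTransformer("all-MiniLM-L6-v2")
    rows = list(rows)
    vectors = model.encode([r["chunk"] for r in rows])
    for row, vec in zip(rows, vectors):
        yield (row["chunk"], [float(x) for x in vec])

embedded = chunks.rdd.mapPartitions(embed_partition).toDF(["chunk", "embedding"])

# 4. Persist chunks and embeddings for ingestion into Feast's offline/vector store.
embedded.write.mode("overwrite").parquet("s3a://my-bucket/rag/chunk-embeddings/")
```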
Fine-Tuning LLMs
Kubeflow Trainer provides a seamless way to fine-tune large language models (LLMs) at scale using popular frameworks such as PyTorch, DeepSpeed, MLX, and others.
With the Kubeflow SDK and the train() API, you can define custom fine-tuning scripts and scale them across thousands of GPUs. For detailed usage, see the Kubeflow Trainer documentation.
You can implement a wide range of advanced algorithms to fine-tune your LLMs, such as supervised fine-tuning (SFT), knowledge distillation, Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), quantization-aware training, and more.
Kubeflow Trainer enables efficient distributed training with both data parallelism and model parallelism, ensuring optimal GPU utilization and reduced training time.
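For example, a custom fine-tuning function can be submitted through the train() API; the sketch below assumes the pre-installed torch-distributed runtime and placeholder resources, and leaves the actual training loop to your own script.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def fine_tune():
    # Placeholder training loop: load the base model, wrap it in DDP/FSDP,
    # and run your supervised fine-tuning or preference-optimization steps here.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    print(f"Rank {dist.get_rank()} of {dist.get_world_size()} started")
    dist.destroy_process_group()


job_name = TrainerClient().train(
    runtime=TrainerClient().get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=fine_tune,
        num_nodes=2,                       # assumed node count
        resources_per_node={"gpu": 4},     # assumed per-node resources
    ),
)
```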
For a faster start, you can use the BuiltinTrainer, which provides pre-configured blueprints for LLM fine-tuning. Simply specify the desired configuration to get started:
```python
from kubeflow.trainer import (
    TrainerClient,
    Initializer,
    HuggingFaceModelInitializer,
    BuiltinTrainer,
    TorchTuneConfig,
    TorchTuneInstructDataset,
    DataFormat,
    DataType,
)

TrainerClient().train(
    runtime=TrainerClient().get_runtime("torchtune-llama3.2-1b"),
    initializer=Initializer(
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
                split="train[:1000]",
                new_system_prompt="You are an AI assistant. ",
            ),
            epochs=10,
            dtype=DataType.BF16,
            resources_per_node={
                "memory": "200G",
                "gpu": 4,
            },
        )
    ),
)
```
For details, see the Kubeflow Trainer TorchTune guide.
Hyperparameter Optimization
Automated tuning is essential to maximize model performance:
- Katib Experiments let you define hyperparameter search spaces and optimization objectives (e.g., minimize loss, maximize BLEU).
- Katib allows you to effortlessly optimize hyperparameters of LLMs using distributed PyTorchJobs.
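As a minimal sketch, Katib's Python SDK can launch such an experiment directly from a notebook; the objective function below is a placeholder, and the metric name, search ranges, and per-trial resources are assumptions to adapt to your own training code.

```python
from kubeflow import katib


def objective(parameters):
    # Placeholder trial: train/evaluate with the sampled hyperparameters, then
    # print the metric in "name=value" form so Katib's metrics collector can parse it.
    lr = parameters["learning_rate"]
    temperature = parameters["temperature"]
    bleu = 0.0  # replace with a real evaluation of your model or RAG pipeline
    print(f"bleu={bleu}")


katib_client = katib.KatibClient(namespace="kubeflow")
katib_client.tune(
    name="llm-hpo",
    objective=objective,
    parameters={
        "learning_rate": katib.search.double(min=1e-5, max=1e-3),
        "temperature": katib.search.double(min=0.1, max=1.0),
    },
    objective_metric_name="bleu",
    objective_type="maximize",
    max_trial_count=12,
    parallel_trial_count=3,
    resources_per_trial={"gpu": "1"},  # assumed per-trial resources
)
```

Katib runs the trials as parallel pods, collects the printed metric from each, and reports the best hyperparameters when the experiment completes.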
Inference at Scale
After training and tuning, KServe delivers scalable, framework-agnostic inference:
- Expose models via Kubernetes Custom Resources
- Automate blue/green or canary rollouts
- Autoscale based on real-time metrics (CPU, GPU, request latency)
Integrate KServe endpoints into Pipelines to orchestrate rollout strategies and ensure consistent low-latency GenAI services.
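To illustrate the first point, here is a rough sketch that creates an InferenceService with KServe's Python client; the huggingface model format, storage URI, namespace, and GPU limit are assumptions, and many teams apply the equivalent YAML manifest with kubectl instead.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
    constants,
)

# Assumed model format, storage URI, namespace, and resources; adjust to your runtime.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="llama-3-2-1b", namespace="kubeflow-user"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        )
    ),
)

KServeClient().create(isvc)
```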
Model Evaluation
Rigorous evaluation guards against drift and degradation:
Pipeline-Embedded Eval Steps:
- Statistical benchmarks for synthetic data
- Text metrics (BLEU, ROUGE) for generation quality
- Leverage Katib’s metric collector to enforce early-stop rules
- Close the loop: evaluation → tuning → retrain
By codifying evaluation in Pipelines, you maintain reproducibility and versioned lineage from data to model to metrics.
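For instance, a generation-quality check can be codified as a small KFP component that logs BLEU as a pipeline metric; the sacrebleu dependency and the component signature below are illustrative assumptions, not a prescribed evaluation step.

```python
from typing import List

from kfp import dsl


@dsl.component(base_image="python:3.11", packages_to_install=["sacrebleu"])
def evaluate_generation(
    predictions: List[str],
    references: List[str],
    metrics: dsl.Output[dsl.Metrics],
) -> float:
    """Compute corpus BLEU for generated text and log it as a pipeline metric."""
    import sacrebleu

    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    metrics.log_metric("bleu", bleu)
    return bleu
```

Downstream steps can compare the returned score against a threshold and trigger a retraining run or a Katib sweep when quality regresses.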
Beyond Core Use Cases
Kubeflow’s ecosystem continues to expand:
- Notebooks UI for interactive development (Jupyter, VSCode)
- Model Registry for artifact versioning & lineage
- Spark & Batch Operators for data-intensive preprocessing
- Feast with vector similarity search for RAG and other next-gen AI workloads
Each Kubeflow Project plugs seamlessly into Kubernetes, ensuring portability and consistency.