Architecture

An overview of Kubeflow’s architecture

This guide introduces Kubeflow projects and how they fit in each stage of the AI lifecycle.

Read the introduction to learn more about Kubeflow, Kubeflow projects, and the Kubeflow AI reference platform.

Kubeflow Ecosystem

The following diagram gives an overview of the Kubeflow Ecosystem and how it relates to the wider Kubernetes and AI landscape. Kubeflow builds on Kubernetes as a system for deploying, scaling, and managing AI platforms.

Kubeflow Ecosystem Diagram

Introducing the AI Lifecycle

When you develop and deploy an AI application, the AI lifecycle typically consists of several stages. Developing an AI system is an iterative process. You need to evaluate the output of various stages of the AI lifecycle, and apply changes to the model and parameters when necessary to ensure the model keeps producing the results you need.

The following diagram shows the AI lifecycle stages in sequence:

AI Lifecycle

Looking at the stages in more detail:

  • In the Data Preparation step you ingest raw data, perform feature engineering to extract ML features for the offline feature store, and prepare training data for model development. Usually, this step is associated with data processing tools such as Spark, Dask, Flink, or Ray.

  • In the Model Development step you choose an ML framework, develop your model architecture, and explore existing pre-trained models, such as BERT or Llama, for fine-tuning.

  • In the Model Training step you train or fine-tune your models in a large-scale compute environment. Use distributed training if a single GPU can’t handle your model size. The result of model training is a trained model artifact that you can store in the Model Registry.

  • In the Model Optimization step you tune your model hyperparameters and optimize your model with various AutoML algorithms, such as neural architecture search and model compression. During model optimization you can store ML metadata in the Model Registry.

  • In the Model Serving step you serve your model artifact for online or batch inference. Your model may perform predictive or generative AI tasks depending on the use case. During the model serving step you may use an online feature store to extract features. You monitor the model performance and feed the results back into the previous steps of the AI lifecycle.
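The iterative nature of these stages can be sketched in plain Python. This is a conceptual illustration only and does not use any Kubeflow API: the `run_lifecycle` function, the `evaluate` callback, and the `target` threshold are hypothetical names introduced here to stand in for "monitor the results and go around again when necessary".

```python
from enum import Enum


class Stage(Enum):
    """The lifecycle stages, in the order the list above describes them."""
    DATA_PREPARATION = "Data Preparation"
    MODEL_DEVELOPMENT = "Model Development"
    MODEL_TRAINING = "Model Training"
    MODEL_OPTIMIZATION = "Model Optimization"
    MODEL_SERVING = "Model Serving"


def run_lifecycle(evaluate, target, max_iterations=5):
    """Walk the stages in order and repeat until the monitored score is good enough.

    `evaluate` is a hypothetical callback that returns a quality score for an
    iteration; `target` is a hypothetical threshold. Neither is a Kubeflow API.
    Returns the number of iterations run and the (iteration, stage) history.
    """
    history = []
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # One pass through every stage, in sequence.
        for stage in Stage:
            history.append((iteration, stage))
        # Monitoring feeds back into the earlier stages: stop only when the
        # score reaches the target, otherwise start another pass.
        if evaluate(iteration) >= target:
            break
    return iteration, history


if __name__ == "__main__":
    # Toy monitor: pretend the score equals the iteration number.
    iterations, history = run_lifecycle(lambda i: i, target=3)
    print(iterations, len(history))
```

In a real platform each stage would be a long-running job rather than a loop entry; the sketch only shows that the sequence repeats until monitoring says the model is producing the results you need.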

AI Lifecycle for Production and Development Phases

The AI lifecycle for AI applications may be conceptually split between development and production phases. This diagram shows which stages fit into each phase:

AI Lifecycle with Development and Production

Kubeflow Projects in the AI Lifecycle

The next diagram shows how Kubeflow projects fit into each stage of the AI lifecycle:

Kubeflow Projects in the AI Lifecycle

See the following links for more information about each Kubeflow project:

  • Kubeflow Spark Operator can be used for the data preparation and feature engineering step.

  • Kubeflow Notebooks can be used for model development and interactive data science to experiment with your AI workflows.

  • Kubeflow Trainer can be used for large-scale distributed training or LLM fine-tuning.

  • Kubeflow Katib can be used for model optimization and hyperparameter tuning using various AutoML algorithms.

  • Kubeflow Model Registry can be used to store ML metadata and model artifacts, and to prepare models for production serving.

  • KServe can be used for online and batch inference in the model serving step.

  • Feast can be used as a feature store and to manage offline and online features.

  • Kubeflow Pipelines can be used to build, deploy, and manage each step in the AI lifecycle.

AI platform teams can build on top of Kubeflow by using each project independently or deploying the entire AI reference platform to meet their specific needs.
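As a quick reference, the project list above can be written down as a small lookup table. The sketch below is plain Python, not a Kubeflow API: the `Project` class and `projects_for` helper are hypothetical names. The stage assignments for Kubeflow Model Registry and Feast are one reading of the stage descriptions earlier on this page (training and optimization both store results in the Model Registry; the feature store appears in both data preparation and serving), so treat them as an assumption rather than an official mapping.

```python
from dataclasses import dataclass

# Lifecycle stages as named earlier on this page.
ALL_STAGES = (
    "Data Preparation",
    "Model Development",
    "Model Training",
    "Model Optimization",
    "Model Serving",
)


@dataclass(frozen=True)
class Project:
    """Hypothetical record pairing a Kubeflow project with lifecycle stages."""
    name: str
    stages: tuple


PROJECTS = (
    Project("Kubeflow Spark Operator", ("Data Preparation",)),
    Project("Kubeflow Notebooks", ("Model Development",)),
    Project("Kubeflow Trainer", ("Model Training",)),
    Project("Kubeflow Katib", ("Model Optimization",)),
    # Assumption: stages inferred from where the text mentions the registry.
    Project("Kubeflow Model Registry", ("Model Training", "Model Optimization")),
    Project("KServe", ("Model Serving",)),
    # Assumption: offline features in preparation, online features in serving.
    Project("Feast", ("Data Preparation", "Model Serving")),
    # Pipelines can build, deploy, and manage every step.
    Project("Kubeflow Pipelines", ALL_STAGES),
)


def projects_for(stage):
    """Return the names of the projects associated with a lifecycle stage."""
    if stage not in ALL_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return [p.name for p in PROJECTS if stage in p.stages]


if __name__ == "__main__":
    for stage in ALL_STAGES:
        print(f"{stage}: {', '.join(projects_for(stage))}")
```

Because each project can be used independently, a platform team could trim `PROJECTS` to the subset it actually deploys and use the same lookup to spot stages that are left uncovered.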

Kubeflow Interfaces

This section introduces the interfaces that you can use to interact with Kubeflow projects.

Kubeflow Dashboard

The Kubeflow Central Dashboard looks like this:

Kubeflow Central Dashboard - Homepage

The Kubeflow AI reference platform includes the Kubeflow Central Dashboard, which acts as a hub for your AI platform and tools by exposing the UIs of components running in the cluster.

Kubeflow APIs and SDKs

Various Kubeflow projects offer APIs and Python SDKs.

See the following sets of reference documentation:

  • Pipelines reference docs for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL).
  • Kubeflow Python SDK to interact with Kubeflow Trainer APIs and to manage TrainJobs.
  • Katib Python SDK to manage Katib hyperparameter tuning Experiments using Python APIs.
