Overview
The Kubeflow SDK allows you to run TrainJobs on your local machine without deploying to a Kubernetes cluster. This is ideal for:
- Development and testing of training scripts
- Quick prototyping and experimentation
- Learning and educational purposes
- Environments where Kubernetes is not available
Available Backends
Local Process Backend
Run TrainJobs directly using native Python processes and virtual environments. This is the fastest option for simple, single-node training.
Best for:
- Quick prototyping and development
- Testing training scripts without container overhead
- Environments where Docker/Podman is not available
Learn more about Local Process Backend
Container Backend with Docker
Run distributed TrainJobs in isolated Docker containers with full multi-node support.
Best for:
- General use cases, especially on macOS/Windows
- Distributed training with multiple containers
- Reproducible containerized environments
Learn more about Docker Backend
Container Backend with Podman
Run distributed TrainJobs using Podman, a daemonless container engine with enhanced security.
Best for:
- Security-focused environments
- Rootless containerized training
- Linux servers with systemd integration
Learn more about Podman Backend
Backend Comparison
| Feature | Local Process | Docker | Podman |
|---|---|---|---|
| Setup | No additional software | Docker Desktop/Engine | Podman installation |
| Isolation | Virtual environments | Full container isolation | Full container isolation |
| Multi-node | Not supported | Supported | Supported |
| Root Required | No | Docker group or root | Rootless supported |
| Startup Time | Fast (seconds) | Medium (container start) | Medium (container start) |
| Best For | Quick prototyping | General use, wide ecosystem | Security, Linux servers |
Switching Between Backends
All backends use the same TrainerClient interface, making it easy to progress from local development to production deployment. The same training code works across all backends - only the backend configuration changes.
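The examples below assume a train_model function is already defined. A minimal sketch is shown here purely for illustration (the body is an assumption, not SDK-provided code; any PyTorch-style training loop works):

def train_model():
    # Illustrative training loop - replace with your own framework code.
    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(100):
        inputs = torch.randn(32, 10)
        targets = torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            print(f"step={step} loss={loss.item():.4f}")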
Local Process Backend
Start with quick local testing:
from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient

backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)
Container Backend
Use Docker/Podman for multi-node distributed training:
# Switch to Docker Backend - same trainer works!
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(
    container_runtime="docker",
)
client = TrainerClient(backend_config=backend_config)
# Same trainer, same train() call!
trainer = CustomTrainer(func=train_model, num_nodes=4) # Now with multi-node support
job_name = client.train(trainer=trainer)
Kubernetes Backend
Production environment with the Kubernetes backend:
# Deploy to Kubernetes - same trainer still works!
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")
client = TrainerClient(backend_config=backend_config)
# Same trainer, same train() call!
trainer = CustomTrainer(func=train_model, num_nodes=4)
job_name = client.train(trainer=trainer)
Job Management
All backends support the same job management operations through the TrainerClient interface.
Listing Jobs
# List all jobs
jobs = client.list_jobs()
for job in jobs:
    print(f"Job: {job.name}, Status: {job.status}")
Viewing Logs
# Stream logs from a specific node
for log_line in client.get_job_logs(job_name, node_index=0, follow=True):
    print(log_line, end='')

# Get logs from all nodes (Container backends only)
for node_index in range(trainer.num_nodes):
    print(f"\n=== Logs from node {node_index} ===")
    for log_line in client.get_job_logs(job_name, node_index=node_index):
        print(log_line, end='')
Waiting for Job Completion
from kubeflow.trainer.constants import constants
# Wait for job to complete
job = client.wait_for_job_status(
    job_name,
    status={constants.TRAINJOB_COMPLETE},
    timeout=600
)
print(f"Job completed with status: {job.status}")
Deleting Jobs
# Delete job and clean up resources
client.delete_job(job_name)
This removes:
- All containers/processes for the job
- Networks created for the job (Container backends)
- Job metadata
Working with Runtimes
Runtimes provide pre-configured training environments with specific frameworks and settings.
Listing Available Runtimes
# List available runtimes
runtimes = client.list_runtimes()
for runtime in runtimes:
    print(f"Runtime: {runtime.name}")
Using a Specific Runtime
# Get a specific runtime
runtime = client.get_runtime("torch-distributed")
# Train with the runtime
job_name = client.train(
    trainer=trainer,
    runtime=runtime
)
Custom Runtime Sources (Container Backends)
By default, the Container Backends load runtimes from:
- GitHub: github://kubeflow/trainer (official runtimes, cached for 24 hours)
- Fallback: Built-in default images (e.g., pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime)
You can customize where runtimes are loaded from using the runtime_source configuration:
from kubeflow.trainer import ContainerBackendConfig, TrainingRuntimeSource
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # or "podman"
    runtime_source=TrainingRuntimeSource(sources=[
        "github://kubeflow/trainer",                # Official Kubeflow runtimes
        "github://myorg/myrepo/path/to/runtimes",   # Custom GitHub repository
        "https://example.com/custom-runtime.yaml",  # HTTP(S) endpoint
        "file:///absolute/path/to/runtime.yaml",    # Local YAML file
        "/absolute/path/to/runtime.yaml",           # Local YAML file (alternate)
    ])
)
client = TrainerClient(backend_config=backend_config)
Source Priority: Sources are checked in order. If a runtime is not found in any source, the system falls back to the default image for the framework.
Runtime YAML Example:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-custom
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: myregistry.com/pytorch-custom:latest
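Once a source containing this YAML is configured, the custom runtime can be used like any built-in one. A short sketch, reusing the client and trainer from the earlier examples:

# Load the custom runtime defined above and train with it
runtime = client.get_runtime("torch-custom")
job_name = client.train(trainer=trainer, runtime=runtime)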
Switching Between Container Backends
The unified Container Backend API makes it easy to switch between Docker and Podman:
# Use Docker
backend_config = ContainerBackendConfig(
    container_runtime="docker",
)

# Switch to Podman - just change one line!
backend_config = ContainerBackendConfig(
    container_runtime="podman",
)

# Or let it auto-detect
backend_config = ContainerBackendConfig(
    container_runtime=None,  # Auto-detect (tries Docker first, then Podman)
)
Key Points:
- Your training function (func=train_model) doesn't change
- Job management operations (list_jobs(), get_job_logs(), delete_job()) work the same across all backends
- Only the backend configuration import and instantiation changes
- This progression allows you to test locally first, validate with containers, then deploy to production
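Put together, a typical end-to-end flow looks like the sketch below, which reuses only the calls shown above (the Docker backend and num_nodes=2 are illustrative choices):

from kubeflow.trainer import ContainerBackendConfig, CustomTrainer, TrainerClient
from kubeflow.trainer.constants import constants

client = TrainerClient(backend_config=ContainerBackendConfig(container_runtime="docker"))

# Submit the job
job_name = client.train(trainer=CustomTrainer(func=train_model, num_nodes=2))

# Wait for completion, inspect logs from the first node, then clean up
client.wait_for_job_status(job_name, status={constants.TRAINJOB_COMPLETE}, timeout=600)
for log_line in client.get_job_logs(job_name, node_index=0):
    print(log_line, end='')
client.delete_job(job_name)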
Next Steps
Choose the backend that best fits your needs:
- Local Process Backend - quick prototyping and testing without containers
- Docker Backend - general-purpose, multi-node containerized training
- Podman Backend - rootless, security-focused containerized training