Docker Backend
Overview
The Container Backend with Docker enables you to run distributed TrainJobs in isolated Docker containers on your local machine. This backend provides:
- Full Container Isolation: Each TrainJob runs in its own Docker container with isolated filesystem, network, and resources
- Multi-Node Support: Run distributed training across multiple containers with automatic networking
- Reproducibility: Each TrainJob runs in a consistent containerized environment
- Flexible Configuration: Customize image pulling policies, resource allocation, and container settings
The Docker backend uses the adapter pattern to provide a unified interface, making it easy to switch between Docker and Podman without code changes.
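Because of this, switching runtimes is a one-line configuration change. The snippet below is a minimal sketch, assuming both Docker and Podman are available locally:

from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# The training code stays the same; only the runtime name in the config changes.
docker_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="docker")
)
podman_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="podman")
)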
Prerequisites
Required Software
- Docker: Install Docker Desktop (macOS/Windows) or Docker Engine (Linux)
  - macOS/Windows: Download from docker.com
  - Linux: Follow the Docker Engine installation guide
- Python 3.9+
- Kubeflow SDK: Install with Docker support:

  pip install "kubeflow[docker]"
Verify Installation
# Check Docker is running
docker version
# Test Docker daemon connectivity
docker ps
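You can also confirm that the Docker daemon is reachable from Python. This quick check assumes the docker Python package is available in your environment (for example, pulled in by the kubeflow[docker] extra above):

import docker

# Raises an exception if the daemon cannot be reached
client = docker.from_env()
client.ping()
print("Docker daemon is reachable")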
Basic Example
Here’s a simple example using the Docker Container Backend:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def train_model():
    """Simple training function."""
    import os

    import torch

    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))
    print(f"Training on rank {rank}/{world_size}")

    # Your training code
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        loss = torch.nn.functional.mse_loss(
            model(torch.randn(32, 10)),
            torch.randn(32, 1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"[Rank {rank}] Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")

    print(f"[Rank {rank}] Training completed!")

# Configure the Docker backend
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # Explicitly use Docker
    pull_policy="IfNotPresent",  # Pull the image only if not cached locally
    auto_remove=True,            # Clean up containers after completion
)

# Create the client
client = TrainerClient(backend_config=backend_config)

# Create a trainer with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=2,  # Run distributed training across 2 containers
)

# Start the TrainJob
job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")

# Wait for completion
job = client.wait_for_job_status(job_name)
print(f"Job completed with status: {job.status}")
Configuration Options
ContainerBackendConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| container_runtime | str or None | None | Force a specific runtime: "docker", "podman", or None (auto-detect). Use "docker" to ensure Docker is used. |
| pull_policy | str | "IfNotPresent" | Image pull policy: "IfNotPresent" (pull if missing), "Always" (always pull), "Never" (use cached images only). |
| auto_remove | bool | True | Automatically remove containers and networks after job completion or deletion. Set to False for debugging. |
| container_host | str or None | None | Override the Docker daemon connection URL (e.g., "unix:///var/run/docker.sock", "tcp://192.168.1.100:2375"). |
| runtime_source | TrainingRuntimeSource | GitHub sources | Configuration for training runtime sources. See "Working with Runtimes" below. |
Configuration Examples
Basic Configuration
backend_config = ContainerBackendConfig(
    container_runtime="docker",
)
Always Pull Latest Image
backend_config = ContainerBackendConfig(
    container_runtime="docker",
    pull_policy="Always",  # Always pull the latest image
)
Keep Containers for Debugging
backend_config = ContainerBackendConfig(
    container_runtime="docker",
    auto_remove=False,  # Containers remain after job completion
)
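Connect to a Remote Docker Daemon
The container_host option points the backend at a non-default Docker daemon. This is a minimal sketch; the TCP address is the placeholder from the configuration table above, so substitute your own daemon URL:

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    container_host="tcp://192.168.1.100:2375",  # Placeholder address; replace with your daemon URL
)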
Multi-Node Distributed Training
The Docker backend automatically sets up networking and environment variables for distributed training:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def distributed_train():
    """PyTorch distributed training example."""
    import os

    import torch
    import torch.distributed as dist

    # Environment variables set by torchrun
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    print(f"Initializing process group: rank={rank}, world_size={world_size}")

    # Initialize distributed training
    dist.init_process_group(
        backend='gloo',  # Use 'gloo' for CPU, 'nccl' for GPU
        rank=rank,
        world_size=world_size,
    )

    # Your distributed training code
    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    # Training loop
    for epoch in range(5):
        # Your training code here
        print(f"[Rank {rank}] Training epoch {epoch + 1}")

    dist.destroy_process_group()
    print(f"[Rank {rank}] Training complete")

backend_config = ContainerBackendConfig(
    container_runtime="docker",
)

client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(
    func=distributed_train,
    num_nodes=4,  # Run across 4 containers
)

job_name = client.train(trainer=trainer)
Job Management
For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
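As a quick reference, the common operations look roughly like the sketch below, continuing from the Basic Example above. The method names list_jobs and get_job_logs are assumptions here; the overview is the authoritative reference for the exact API:

# List TrainJobs known to this backend (method name assumed; see the overview)
for job in client.list_jobs():
    print(job.name, job.status)

# Fetch logs for a job (method name assumed; see the overview)
print(client.get_job_logs(job_name))

# Delete the job; with auto_remove=True its containers and network are cleaned up
client.delete_job(job_name)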
Inspecting Containers
When auto_remove=False, you can inspect containers after job completion:
# List containers for a job
docker ps -a --filter "label=kubeflow.org/job-name=<job-name>"
# Inspect a specific container
docker inspect <job-name>-node-0
# View logs directly
docker logs <job-name>-node-0
# Restart a stopped container and open a shell inside it
docker start <job-name>-node-0
docker exec -it <job-name>-node-0 /bin/bash
Working with Runtimes
For information about using runtimes and custom runtime sources, see the Working with Runtimes section in the overview.
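As a rough sketch, working with runtimes usually amounts to listing what the backend provides and passing one to train(), continuing from the examples above. The method names and the runtime name below are assumptions; follow the overview for the exact API:

# Inspect available runtimes (method names assumed; see the overview)
for runtime in client.list_runtimes():
    print(runtime.name)

# Pass a runtime explicitly when starting the TrainJob
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),  # Example runtime name
    trainer=trainer,
)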
Troubleshooting
Docker Daemon Not Running
Error: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(61, 'Connection refused'))
Solution:
# macOS/Windows: Start Docker Desktop
# Linux: Start Docker daemon
sudo systemctl start docker
# Verify Docker is running
docker ps
Permission Denied
Error: Got permission denied while trying to connect to the Docker daemon socket
Solution (Linux):
# Add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in, or run
newgrp docker
GPU Not Available in Container
Error: RuntimeError: No CUDA GPUs are available
Solution:
# 1. Verify NVIDIA drivers on host
nvidia-smi
# 2. Verify NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# 3. Request GPU in your trainer
trainer = CustomTrainer(
    func=train_model,
    resources_per_node={"gpu": "1"},
)
Containers Not Removed
Problem: Containers remain after job completion
Solution:
# Ensure auto_remove is enabled
backend_config = ContainerBackendConfig(
    container_runtime="docker",
    auto_remove=True,  # Default
)
# Or manually clean up
client.delete_job(job_name)
# Or use Docker CLI
docker rm -f $(docker ps -aq --filter "label=kubeflow.org/job-name=<job-name>")
Network Conflicts
Error: network with name <job-name>-net already exists
Solution:
# Remove conflicting network
docker network rm <job-name>-net
# Or delete the previous job
# client.delete_job(job_name)
Next Steps
- Try the MNIST example notebook for a complete end-to-end example
- Learn about the Container Backend with Podman for rootless containerized training
- Learn about the Local Process Backend for non-containerized local execution