Docker Backend

How to run TrainJobs with Docker containers

Overview

The Container Backend with Docker enables you to run distributed TrainJobs in isolated Docker containers on your local machine. This backend provides:

  • Full Container Isolation: Each TrainJob runs in its own Docker container with isolated filesystem, network, and resources
  • Multi-Node Support: Run distributed training across multiple containers with automatic networking
  • Reproducibility: TrainJob runs in consistent containerized environments
  • Flexible Configuration: Customize image pulling policies, resource allocation, and container settings

The Docker backend uses the adapter pattern to provide a unified interface, making it easy to switch between Docker and Podman without code changes.
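
Because the runtime is selected purely through configuration, switching between Docker and Podman is a one-line change. A minimal sketch:

from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# The same client code works with either runtime; only the config value differs.
docker_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="docker")
)
podman_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="podman")
)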

Prerequisites

Required Software

  • Docker: Install Docker Desktop (macOS/Windows) or Docker Engine (Linux)
  • Python 3.9+
  • Kubeflow SDK: Install with Docker support:
    pip install "kubeflow[docker]"
    

Verify Installation

# Check Docker is running
docker version

# Test Docker daemon connectivity
docker ps
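
You can also verify connectivity from Python, assuming the docker Python package is available (pip install docker if it is not):

import docker

# Raises an exception if the daemon is unreachable
client = docker.from_env()
client.ping()
print(f"Docker daemon version: {client.version()['Version']}")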

Basic Example

Here’s a simple example using the Docker Container Backend:

from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def train_model():
    """Simple training function."""
    import torch
    import os

    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))

    print(f"Training on rank {rank}/{world_size}")

    # Your training code
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        loss = torch.nn.functional.mse_loss(
            model(torch.randn(32, 10)),
            torch.randn(32, 1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"[Rank {rank}] Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")

    print(f"[Rank {rank}] Training completed!")

# Configure the Docker backend
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # Explicitly use Docker
    pull_policy="IfNotPresent",  # Pull image if not cached locally
    auto_remove=True             # Clean up containers after completion
)

# Create the client
client = TrainerClient(backend_config=backend_config)

# Create a trainer with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=2  # Run distributed training across 2 containers
)

# Start the TrainJob
job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")

# Wait for completion
job = client.wait_for_job_status(job_name)

print(f"Job completed with status: {job.status}")

Configuration Options

ContainerBackendConfig

ContainerBackendConfig accepts the following parameters:

  • container_runtime (str | None, default: None): Force a specific runtime: "docker", "podman", or None (auto-detect). Use "docker" to ensure Docker is used.
  • pull_policy (str, default: "IfNotPresent"): Image pull policy: "IfNotPresent" (pull if missing), "Always" (always pull), "Never" (use cached image only).
  • auto_remove (bool, default: True): Automatically remove containers and networks after job completion or deletion. Set to False for debugging.
  • container_host (str | None, default: None): Override the Docker daemon connection URL (e.g., "unix:///var/run/docker.sock", "tcp://192.168.1.100:2375").
  • runtime_source (TrainingRuntimeSource, default: GitHub sources): Configuration for training runtime sources. See the "Working with Runtimes" section below.

Configuration Examples

Basic Configuration

backend_config = ContainerBackendConfig(
    container_runtime="docker",
)

Always Pull Latest Image

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    pull_policy="Always"  # Always pull latest image
)

Keep Containers for Debugging

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    auto_remove=False  # Containers remain after job completion
)
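
Connect to a Remote Docker Daemon

The container_host option overrides the daemon connection URL; the address below is a placeholder:

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    container_host="tcp://192.168.1.100:2375"  # Placeholder address for a remote daemon
)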

Multi-Node Distributed Training

The Docker backend automatically sets up networking and environment variables for distributed training:

from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def distributed_train():
    """PyTorch distributed training example."""
    import os
    import torch
    import torch.distributed as dist

    # Environment variables set by torchrun
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])

    print(f"Initializing process group: rank={rank}, world_size={world_size}")

    # Initialize distributed training
    dist.init_process_group(
        backend='gloo',  # Use 'gloo' for CPU, 'nccl' for GPU
        rank=rank,
        world_size=world_size
    )

    # Your distributed training code
    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    # Training loop
    for epoch in range(5):
        # Your training code here
        print(f"[Rank {rank}] Training epoch {epoch + 1}")

    dist.destroy_process_group()
    print(f"[Rank {rank}] Training complete")

backend_config = ContainerBackendConfig(
    container_runtime="docker",
)

client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(
    func=distributed_train,
    num_nodes=4  # Run across 4 containers
)

job_name = client.train(trainer=trainer)

Job Management

For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
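
As a quick sketch of what those operations look like with this backend (method names follow the overview; treat this as illustrative rather than the authoritative API):

# List TrainJobs and their statuses
for job in client.list_jobs():
    print(job.name, job.status)

# Fetch logs for a job
print(client.get_job_logs(job_name))

# Delete the job and its containers/networks
client.delete_job(job_name)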

Inspecting Containers

When auto_remove=False, you can inspect containers after job completion:

# List containers for a job
docker ps -a --filter "label=kubeflow.org/job-name=<job-name>"

# Inspect a specific container
docker inspect <job-name>-node-0

# View logs directly
docker logs <job-name>-node-0

# Execute commands in a stopped container
docker start <job-name>-node-0
docker exec -it <job-name>-node-0 /bin/bash

Working with Runtimes

For information about using runtimes and custom runtime sources, see the Working with Runtimes section in the overview.
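
As a brief, illustrative sketch (the runtime name "torch-distributed" is a placeholder; see the overview for the full runtime API):

# Discover the runtimes available to this backend
for runtime in client.list_runtimes():
    print(runtime.name)

# Start a job with a specific runtime
job_name = client.train(
    trainer=trainer,
    runtime=client.get_runtime("torch-distributed"),  # Placeholder name
)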

Troubleshooting

Docker Daemon Not Running

Error: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(61, 'Connection refused'))

Solution:

# macOS/Windows: Start Docker Desktop
# Linux: Start Docker daemon
sudo systemctl start docker

# Verify Docker is running
docker ps

Permission Denied

Error: Got permission denied while trying to connect to the Docker daemon socket

Solution (Linux):

# Add your user to docker group
sudo usermod -aG docker $USER

# Log out and back in, or run
newgrp docker

GPU Not Available in Container

Error: RuntimeError: No CUDA GPUs are available

Solution:

# 1. Verify NVIDIA drivers on host
nvidia-smi

# 2. Verify NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# 3. Request GPU in your trainer
trainer = CustomTrainer(
    func=train_model,
    resources_per_node={"gpu": "1"}
)

Containers Not Removed

Problem: Containers remain after job completion

Solution:

# Ensure auto_remove is enabled
backend_config = ContainerBackendConfig(
    container_runtime="docker",
    auto_remove=True  # Default
)

# Or manually clean up
client.delete_job(job_name)

# Or use Docker CLI
docker rm -f $(docker ps -aq --filter "label=kubeflow.org/job-name=<job-name>")

Network Conflicts

Error: network with name <job-name>-net already exists

Solution:

# Remove conflicting network
docker network rm <job-name>-net

# Or delete the previous job
# client.delete_job(job_name)
