Podman Backend

How to run TrainJobs with Podman containers

Overview

The Container Backend with Podman lets you run distributed TrainJobs in isolated containers. Podman is a daemonless container engine that offers several advantages over Docker:

  • Daemonless Architecture: No background daemon required, reducing attack surface
  • Rootless Containers: Run containers without root privileges for enhanced security
  • Full Container Isolation: Each training process runs in its own container with isolated filesystem, network, and resources
  • Multi-Node Support: Run distributed training across multiple containers with automatic DNS-enabled networking
  • Docker Compatibility: Works with Docker images and supports Docker-compatible CLI syntax
  • systemd Integration: Better integration with systemd for service management

The Podman backend uses the same adapter pattern as Docker, providing a unified interface for container operations.
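
Because both runtimes share the same adapter interface, switching between them is only a configuration change. For example (a minimal sketch; omitting container_runtime lets the backend auto-detect whichever runtime is available):

from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# The same client API is used for Docker and Podman; only the runtime selection differs
client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="podman")
)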

Prerequisites

Required Software & Initial Setup

  • Podman 3.0+: Install Podman for your platform by following the podman installation instructions
  • Kubeflow SDK: Install with Podman support:
    pip install "kubeflow[podman]"
    

Verify Installation

podman version
podman ps

Custom Socket Location (Optional)

By default, Podman uses different socket locations than Docker. You can specify a custom socket:

# Start the Podman API service on a custom socket (macOS/Linux)
podman system service --time=0 unix:///tmp/podman.sock

# Or use systemd (Linux)
systemctl --user enable --now podman.socket
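
Once the service is running, you can confirm the socket is reachable before pointing the SDK at it (the path below matches the custom socket started above):

# Check that the Podman API service answers on the custom socket
podman --url unix:///tmp/podman.sock info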

Basic Example

Here’s a simple example using the Podman Container Backend:

from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def train_model():
    """Simple training function."""
    import torch
    import os

    # Environment variables set by torchrun
    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))

    print(f"Training on rank {rank}/{world_size}")

    # Your training code
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        loss = torch.nn.functional.mse_loss(
            model(torch.randn(32, 10)),
            torch.randn(32, 1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"[Rank {rank}] Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")

    print(f"[Rank {rank}] Training completed!")

# Configure the Podman backend
backend_config = ContainerBackendConfig(
    container_runtime="podman",   # Explicitly use Podman
    pull_policy="IfNotPresent",   # Pull image if not cached locally
    auto_remove=True              # Clean up containers after completion
)

# Create the client
client = TrainerClient(backend_config=backend_config)

# Create a trainer with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=2  # Run distributed training across 2 containers
)

# Start the TrainJob
job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")

# Wait for completion
job = client.wait_for_job_status(
    job_name,
)

print(f"Job completed with status: {job.status}")

Configuration Options

ContainerBackendConfig for Podman

container_runtime (str | None, default: None)
  Force a specific runtime: "podman", "docker", or None (auto-detect). Use "podman" to ensure Podman is used.

pull_policy (str, default: "IfNotPresent")
  Image pull policy: "IfNotPresent" (pull if missing), "Always" (always pull), "Never" (use the cached image only).

auto_remove (bool, default: True)
  Automatically remove containers and networks after job completion or deletion. Set to False for debugging.

container_host (str | None, default: None)
  Override the Podman socket URL (e.g., "unix:///tmp/podman.sock", "unix:///run/user/1000/podman/podman.sock").

runtime_source (TrainingRuntimeSource, default: GitHub sources)
  Configuration for training runtime sources. See the "Working with Runtimes" section below.

Configuration Examples

Basic Podman Configuration

backend_config = ContainerBackendConfig(
    container_runtime="podman",
)

Custom Socket Location

# macOS with Podman machine
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host="unix:///tmp/podman.sock"
)

# Linux rootless (user-specific socket)
import os
uid = os.getuid()
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host=f"unix:///run/user/{uid}/podman/podman.sock"
)

Always Pull Latest Image

backend_config = ContainerBackendConfig(
    container_runtime="podman",
    pull_policy="Always"  # Always pull latest image
)

Keep Containers for Debugging

backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=False  # Containers remain after job completion
)

Multi-Node Distributed Training

The Podman backend automatically sets up networking and environment variables for distributed training:

from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig

def distributed_train():
    """PyTorch distributed training example."""
    import os
    import torch
    import torch.distributed as dist

    # Environment variables set by torchrun
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])

    print(f"Initializing process group: rank={rank}, world_size={world_size}")

    # Initialize distributed training
    dist.init_process_group(
        backend='gloo',  # Use 'gloo' for CPU, 'nccl' for GPU
        rank=rank,
        world_size=world_size
    )

    # Your distributed training code
    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    # Training loop
    for epoch in range(5):
        # Your training code here
        print(f"[Rank {rank}] Training epoch {epoch + 1}")

    dist.destroy_process_group()
    print(f"[Rank {rank}] Training complete")

backend_config = ContainerBackendConfig(
    container_runtime="podman",
)

client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(
    func=distributed_train,
    num_nodes=4  # Run across 4 containers
)

job_name = client.train(trainer=trainer)

Podman-Specific Networking

Podman creates networks with DNS enabled by default, allowing containers to resolve each other by hostname. The backend implementation uses IP addresses for the MASTER_ADDR environment variable to ensure reliable communication:

IP Address Resolution

The Podman backend automatically retrieves the IP address of the rank-0 container using podman inspect:

podman inspect --format '{{.NetworkSettings.Networks.<network-name>.IPAddress}}' <container-name>

This IP address is then set as MASTER_ADDR for all nodes in the job, ensuring that:

  • Communication works even if DNS resolution has timing issues
  • The master address is available immediately when containers start
  • Distributed training frameworks (PyTorch, TensorFlow) can connect reliably
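
For illustration, the rank-0 IP can also be retrieved programmatically. The helper below is a sketch, not the backend's actual code; the container and network names are hypothetical placeholders:

import json
import subprocess

def get_rank0_ip(container: str, network: str) -> str:
    """Return the IP address of a container on a given Podman network."""
    # 'podman inspect' prints a JSON array with one entry per container
    out = subprocess.run(
        ["podman", "inspect", container],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)[0]
    return data["NetworkSettings"]["Networks"][network]["IPAddress"]

# Example: use the rank-0 IP as MASTER_ADDR for the remaining nodes
master_addr = get_rank0_ip("myjob-node-0", "myjob-net")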

DNS Resolution

While DNS is enabled and containers can resolve each other by hostname, the backend uses IP addresses for reliability:

def test_networking():
    import os
    import socket

    rank = int(os.environ['RANK'])
    master_addr = os.environ['MASTER_ADDR']

    print(f"Rank {rank}: My hostname is {socket.gethostname()}")
    print(f"Rank {rank}: Master address (IP): {master_addr}")
    # DNS is also available - containers can resolve by hostname
    if rank == 0:
        container_name = os.environ.get('HOSTNAME')
        print(f"Rank {rank}: My IP address: {socket.gethostbyname(container_name)}")
    
    # Containers can ping each other by name or IP
    if rank != 0:
        import subprocess
        result = subprocess.run(['ping', '-c', '1', master_addr], capture_output=True)
        print(f"Ping to master IP: {result.returncode == 0}")

Network Architecture

For a job with num_nodes=3, the Podman backend:

  1. Creates a dedicated network: <job-name>-net with DNS enabled
  2. Launches rank-0 container and waits for it to be running
  3. Inspects rank-0 container to get its IP address
  4. Sets MASTER_ADDR to this IP for all containers
  5. Launches remaining containers (rank 1, 2, …) connected to the same network

This approach combines the benefits of DNS (hostname resolution) with the reliability of IP addresses for critical communication paths.
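
The equivalent steps can be reproduced manually with the Podman CLI, which helps when reasoning about what the backend does under the hood. This is a sketch with hypothetical names; the image, command, and explicitly passed environment variables are placeholders for illustration:

# 1. Create a dedicated network (newly created networks have DNS enabled by default)
podman network create myjobnet

# 2. Launch the rank-0 container on that network
podman run -d --name myjob-node-0 --network myjobnet <image> <training-command>

# 3. Retrieve its IP address
MASTER_ADDR=$(podman inspect --format '{{.NetworkSettings.Networks.myjobnet.IPAddress}}' myjob-node-0)

# 4-5. Launch the remaining nodes with MASTER_ADDR pointing at rank 0
podman run -d --name myjob-node-1 --network myjobnet \
  -e MASTER_ADDR=$MASTER_ADDR -e RANK=1 -e WORLD_SIZE=3 <image> <training-command>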

Job Management

For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
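
As a quick illustration with the same client, a typical inspection and cleanup flow might look like this (a sketch; list_jobs and get_job_logs are assumed to be available as described in the overview, and the shape of their return values depends on the SDK version):

# List jobs managed by this backend (assumed API from the overview)
for job in client.list_jobs():
    print(job.name, job.status)

# Fetch logs for a job (assumed API; return format depends on the SDK version)
print(client.get_job_logs(job_name))

# Remove the job's containers and network
client.delete_job(job_name)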

Inspecting Containers with Podman CLI

When auto_remove=False, you can inspect containers:

# List containers for a job
podman ps -a --filter "label=kubeflow.org/job-name=<job-name>"

# Inspect a specific container
podman inspect <job-name>-node-0

# View logs directly
podman logs <job-name>-node-0

# Execute commands in a container
podman exec -it <job-name>-node-0 /bin/bash

# With custom socket
podman --url unix:///tmp/podman.sock logs <job-name>-node-0

Working with Runtimes

For information about using runtimes and custom runtime sources, see the Working with Runtimes section in the overview.

Troubleshooting

Podman Service Not Running (macOS)

Error: ConnectionRefusedError: [Errno 61] Connection refused

Solution:

# Check Podman machine status
podman machine list

# Start Podman machine
podman machine start

# Or restart the Podman machine
podman machine stop
podman machine start
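
If the client still cannot connect after the machine is running, check which socket the machine exposes and pass it to the SDK via container_host:

# Show the connection URIs registered for the Podman machine
podman system connection list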

Socket Not Found (Linux)

Error: FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/podman/podman.sock'

Solution:

# Start Podman socket service
systemctl --user start podman.socket
systemctl --user enable podman.socket

# Verify socket exists
ls -la /run/user/$(id -u)/podman/podman.sock

# Or specify custom socket in config
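
For example, mirroring the rootless configuration shown earlier:

import os

backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host=f"unix:///run/user/{os.getuid()}/podman/podman.sock"
)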

Permission Denied (Rootless)

Error: Error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting

Solution:

# Enable user namespaces (Linux)
sudo sysctl -w user.max_user_namespaces=15000

# Or edit /etc/sysctl.conf
echo "user.max_user_namespaces=15000" | sudo tee -a /etc/sysctl.conf

# For persistent storage, configure subuid/subgid
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER

DNS Resolution Issues

Error: Containers cannot resolve each other’s hostnames

Solution:

# Verify DNS is enabled in network
podman network inspect <job-name>-net | grep dns_enabled

# Should show: "dns_enabled": true

# Podman networks created by the backend have DNS enabled by default

Containers Not Removed

Problem: Containers remain after job completion

Solution:

# Ensure auto_remove is enabled
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=True  # Default
)

# Or manually clean up
client.delete_job(job_name)

# Or use Podman CLI
podman rm -f $(podman ps -aq --filter "label=kubeflow.org/job-name=<job-name>")
