Podman Backend
Overview
The Container Backend can run distributed TrainJobs in isolated containers using Podman, a daemonless container engine. Podman offers several advantages over Docker:
- Daemonless Architecture: No background daemon required, reducing attack surface
- Rootless Containers: Run containers without root privileges for enhanced security
- Full Container Isolation: Each training process runs in its own container with isolated filesystem, network, and resources
- Multi-Node Support: Run distributed training across multiple containers with automatic DNS-enabled networking
- Docker Compatibility: Compatible with Docker images and Docker CLI syntax
- systemd Integration: Better integration with systemd for service management
The Podman backend uses the same adapter pattern as Docker, providing a unified interface for container operations.
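In practice, the unified interface means switching between Docker and Podman is a one-line configuration change; the rest of your training code stays the same. A minimal sketch using only the types shown later on this page:
from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# The same client interface works for either runtime; only the
# container_runtime value changes (or omit it for auto-detection).
podman_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="podman")
)
docker_client = TrainerClient(
    backend_config=ContainerBackendConfig(container_runtime="docker")
)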
Prerequisites
Required Software & Initial Setup
- Podman 3.0+: Install Podman for your platform by following the Podman installation instructions
- Kubeflow SDK: Install with Podman support:
pip install "kubeflow[podman]"
Verify Installation
podman version
podman ps
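You can also confirm that the Python bindings can reach the Podman service. This sketch assumes the kubeflow[podman] extra installs the podman client library and that you are using the default rootless socket on Linux; adjust the socket path for your setup (e.g. unix:///tmp/podman.sock on macOS with a Podman machine):
import os
from podman import PodmanClient

# Example rootless socket path on Linux; adjust for your system.
socket_url = f"unix:///run/user/{os.getuid()}/podman/podman.sock"

with PodmanClient(base_url=socket_url) as client:
    print(client.version())  # raises if the Podman service is unreachable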
Custom Socket Location (Optional)
By default, Podman uses different socket locations than Docker. You can specify a custom socket:
# Start Podman with custom socket (macOS/Linux)
podman system service --time=0 unix:///tmp/podman.sock
# Or use systemd (Linux)
systemctl --user enable --now podman.socket
Basic Example
Here’s a simple example using the Podman Container Backend:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig
def train_model():
    """Simple training function."""
    import torch
    import os
    # Environment variables set by torchrun
    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))
    print(f"Training on rank {rank}/{world_size}")
    # Your training code
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(5):
        loss = torch.nn.functional.mse_loss(
            model(torch.randn(32, 10)),
            torch.randn(32, 1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"[Rank {rank}] Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")
    print(f"[Rank {rank}] Training completed!")
# Configure the Podman backend
backend_config = ContainerBackendConfig(
    container_runtime="podman",    # Explicitly use Podman
    pull_policy="IfNotPresent",    # Pull image if not cached locally
    auto_remove=True               # Clean up containers after completion
)
# Create the client
client = TrainerClient(backend_config=backend_config)
# Create a trainer with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=2  # Run distributed training across 2 containers
)
# Start the TrainJob
job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")
# Wait for completion
job = client.wait_for_job_status(job_name)
print(f"Job completed with status: {job.status}")
Configuration Options
ContainerBackendConfig for Podman
| Parameter | Type | Default | Description |
|---|---|---|---|
| container_runtime | str \| None | None | Force a specific runtime: "podman", "docker", or None (auto-detect). Use "podman" to ensure Podman is used. |
| pull_policy | str | "IfNotPresent" | Image pull policy: "IfNotPresent" (pull if missing), "Always" (always pull), "Never" (use cached images only). |
| auto_remove | bool | True | Automatically remove containers and networks after job completion or deletion. Set to False for debugging. |
| container_host | str \| None | None | Override the Podman socket URL (e.g., "unix:///tmp/podman.sock", "unix:///run/user/1000/podman/podman.sock"). |
| runtime_source | TrainingRuntimeSource | GitHub sources | Configuration for training runtime sources. See the Working with Runtimes section below. |
Configuration Examples
Basic Podman Configuration
backend_config = ContainerBackendConfig(
    container_runtime="podman",
)
Custom Socket Location
# macOS with Podman machine
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host="unix:///tmp/podman.sock"
)
# Linux rootless (user-specific socket)
import os
uid = os.getuid()
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host=f"unix:///run/user/{uid}/podman/podman.sock"
)
Always Pull Latest Image
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    pull_policy="Always"  # Always pull the latest image
)
Keep Containers for Debugging
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=False  # Containers remain after job completion
)
Multi-Node Distributed Training
The Podman backend automatically sets up networking and environment variables for distributed training:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig
def distributed_train():
    """PyTorch distributed training example."""
    import os
    import torch
    import torch.distributed as dist
    # Environment variables set by torchrun
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    print(f"Initializing process group: rank={rank}, world_size={world_size}")
    # Initialize distributed training
    dist.init_process_group(
        backend='gloo',  # Use 'gloo' for CPU, 'nccl' for GPU
        rank=rank,
        world_size=world_size
    )
    # Your distributed training code
    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    # Training loop
    for epoch in range(5):
        # Your training code here
        print(f"[Rank {rank}] Training epoch {epoch + 1}")
    dist.destroy_process_group()
    print(f"[Rank {rank}] Training complete")
backend_config = ContainerBackendConfig(
    container_runtime="podman",
)
client = TrainerClient(backend_config=backend_config)
trainer = CustomTrainer(
    func=distributed_train,
    num_nodes=4  # Run across 4 containers
)
job_name = client.train(trainer=trainer)
Podman-Specific Networking
Podman creates networks with DNS enabled by default, allowing containers to resolve each other by hostname. The backend implementation uses IP addresses for the MASTER_ADDR environment variable to ensure reliable communication:
IP Address Resolution
The Podman backend automatically retrieves the IP address of the rank-0 container using podman inspect:
podman inspect --format '{{.NetworkSettings.Networks.<network-name>.IPAddress}}' <container-name>
This IP address is then set as MASTER_ADDR for all nodes in the job, ensuring that:
- Communication works even if DNS resolution has timing issues
- The master address is available immediately when containers start
- Distributed training frameworks (PyTorch, TensorFlow) can connect reliably
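To make this concrete, here is a minimal Python sketch of such a lookup using the Podman CLI. It is illustrative only, not the backend's actual implementation; it assumes the <job-name>-net network and <job-name>-node-0 container naming patterns shown elsewhere on this page:
import subprocess

def get_master_ip(job_name: str) -> str:
    """Return the rank-0 container's IP on the job network via podman inspect."""
    fmt = f"{{{{.NetworkSettings.Networks.{job_name}-net.IPAddress}}}}"
    result = subprocess.run(
        ["podman", "inspect", "--format", fmt, f"{job_name}-node-0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()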
DNS Resolution
While DNS is enabled and containers can resolve each other by hostname, the backend uses IP addresses for reliability:
def test_networking():
    import os
    import socket
    rank = int(os.environ['RANK'])
    master_addr = os.environ['MASTER_ADDR']
    print(f"Rank {rank}: My hostname is {socket.gethostname()}")
    print(f"Rank {rank}: Master address (IP): {master_addr}")
    # DNS is also available - containers can resolve each other by hostname
    if rank == 0:
        hostname = socket.gethostname()
        print(f"Rank {rank}: My IP address: {socket.gethostbyname(hostname)}")
    # Containers can ping each other by name or IP
    if rank != 0:
        import subprocess
        result = subprocess.run(['ping', '-c', '1', master_addr], capture_output=True)
        print(f"Ping to master IP: {result.returncode == 0}")
Network Architecture
For a job with num_nodes=3, the Podman backend:
- Creates a dedicated network <job-name>-net with DNS enabled
- Launches the rank-0 container and waits for it to be running
- Inspects the rank-0 container to get its IP address
- Sets MASTER_ADDR to this IP for all containers
- Launches the remaining containers (rank 1, 2, …) connected to the same network
This approach combines the benefits of DNS (hostname resolution) with the reliability of IP addresses for critical communication paths.
Job Management
For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
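As a quick reference, the same client calls used with other backends apply here. The method names below follow the TrainerClient API described in the overview; treat this as a sketch and check the exact names and return types against your SDK version:
# List jobs known to the backend
for job in client.list_jobs():
    print(job.name, job.status)

# Fetch logs for a job (output format may vary between SDK versions)
logs = client.get_job_logs(job_name)
print(logs)

# Delete the job along with its containers and network
client.delete_job(job_name)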
Inspecting Containers with Podman CLI
When auto_remove=False, you can inspect containers:
# List containers for a job
podman ps -a --filter "label=kubeflow.org/job-name=<job-name>"
# Inspect a specific container
podman inspect <job-name>-node-0
# View logs directly
podman logs <job-name>-node-0
# Execute commands in a container
podman exec -it <job-name>-node-0 /bin/bash
# With custom socket
podman --url unix:///tmp/podman.sock logs <job-name>-node-0
Working with Runtimes
For information about using runtimes and custom runtime sources, see the Working with Runtimes section in the overview.
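As a brief illustration (see the overview for details; the method and runtime names below are assumptions that may differ by SDK version), selecting a runtime typically looks like this:
# List available training runtimes and pick one for the job
for runtime in client.list_runtimes():
    print(runtime.name)

runtime = client.get_runtime("torch-distributed")
job_name = client.train(runtime=runtime, trainer=trainer)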
Troubleshooting
Podman Service Not Running (macOS)
Error: ConnectionRefusedError: [Errno 61] Connection refused
Solution:
# Check Podman machine status
podman machine list
# Start Podman machine
podman machine start
# Or restart the Podman machine
podman machine stop
podman machine start
Socket Not Found (Linux)
Error: FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/podman/podman.sock'
Solution:
# Start Podman socket service
systemctl --user start podman.socket
systemctl --user enable podman.socket
# Verify socket exists
ls -la /run/user/$(id -u)/podman/podman.sock
# Or specify custom socket in config
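The config-based alternative mentioned in the last comment looks like this (the socket path is an example; use the path that actually exists on your machine):
import os
from kubeflow.trainer import ContainerBackendConfig

backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host=f"unix:///run/user/{os.getuid()}/podman/podman.sock",
)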
Permission Denied (Rootless)
Error: Error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting
Solution:
# Enable user namespaces (Linux)
sudo sysctl -w user.max_user_namespaces=15000
# Or edit /etc/sysctl.conf
echo "user.max_user_namespaces=15000" | sudo tee -a /etc/sysctl.conf
# For persistent storage, configure subuid/subgid
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER
DNS Resolution Issues
Error: Containers cannot resolve each other’s hostnames
Solution:
# Verify DNS is enabled in network
podman network inspect <job-name>-net | grep dns_enabled
# Should show: "dns_enabled": true
# Podman networks created by the backend have DNS enabled by default
Containers Not Removed
Problem: Containers remain after job completion
Solution:
# Ensure auto_remove is enabled
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=True  # Default
)
# Or manually clean up
client.delete_job(job_name)
# Or use Podman CLI
podman rm -f $(podman ps -aq --filter "label=kubeflow.org/job-name=<job-name>")
Next Steps
- Try the MNIST example notebook for a complete end-to-end example
- Learn about the Container Backend with Docker for Docker-specific features
- Learn about the Local Process Backend for non-containerized local execution