XGBoost Guide
This guide describes how to use TrainJob to run distributed XGBoost training on Kubernetes.
Prerequisites
Before exploring this guide, make sure to follow the Getting Started guide to understand the basics of Kubeflow Trainer.
XGBoost Overview
XGBoost supports distributed training through the Collective communication protocol (historically known as Rabit). In a distributed setting, multiple worker processes each operate on a shard of the data and synchronize histogram bin statistics via AllReduce to agree on the best tree splits.
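To make the AllReduce step concrete, here is a minimal pure-Python simulation (illustrative only, not XGBoost's actual implementation): each worker builds a histogram from its own data shard, and an element-wise sum shared back to all workers guarantees every worker sees identical global statistics and therefore agrees on the same split.

```python
def allreduce_sum(local_histograms):
    """Simulate AllReduce: element-wise sum, with the result
    distributed back to every worker."""
    global_hist = [sum(bins) for bins in zip(*local_histograms)]
    # Every worker receives its own copy of the identical global histogram.
    return [global_hist[:] for _ in local_histograms]

# Two workers, each holding per-bin gradient sums for 4 histogram bins
# computed from its own data shard.
worker_hists = [
    [1.0, 2.0, 0.5, 0.0],  # worker 0's shard
    [0.5, 1.0, 1.5, 2.0],  # worker 1's shard
]
synced = allreduce_sum(worker_hists)
assert synced[0] == synced[1] == [1.5, 3.0, 2.0, 2.0]
```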
Kubeflow Trainer integrates with XGBoost by:
- Deploying worker pods as a JobSet.
- Automatically injecting the DMLC_* environment variables required by XGBoost's Collective communication layer (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER).
- Providing the rank-0 pod with the tracker address so user code can start a RabitTracker for worker coordination.
- Supporting both CPU and GPU training workloads.
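As a sketch of how a training script might consume the injected variables, the stdlib-only helper below parses them into a plain dict. The variable names come from the list above; the helper itself is hypothetical and not part of Kubeflow Trainer or XGBoost.

```python
import os

def read_collective_config(environ=os.environ):
    # DMLC_* variables are injected by Kubeflow Trainer into each worker pod.
    cfg = {
        "tracker_host": environ["DMLC_TRACKER_URI"],
        "tracker_port": int(environ["DMLC_TRACKER_PORT"]),
        "rank": int(environ["DMLC_TASK_ID"]),
        "world_size": int(environ["DMLC_NUM_WORKER"]),
    }
    # Rank 0 is the pod expected to start the RabitTracker.
    cfg["is_rank_zero"] = cfg["rank"] == 0
    return cfg
```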
The built-in runtime is called xgboost-distributed and uses the container image
ghcr.io/kubeflow/trainer/xgboost-runtime:latest, which includes XGBoost with
CUDA 12 support, NumPy, and scikit-learn.
Worker Count
The total number of XGBoost workers is calculated as:
DMLC_NUM_WORKER = numNodes Ă— workersPerNode
- CPU training: 1 worker per node. Each worker uses OpenMP to parallelize across all available CPU cores.
- GPU training: 1 worker per GPU. The GPU count is derived from
resourcesPerNode limits in the TrainJob.
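The rule above can be sketched as a small helper (a hedged illustration of the calculation, not Kubeflow Trainer code; gpus_per_node stands in for the GPU count taken from the resourcesPerNode limits):

```python
def dmlc_num_worker(num_nodes, gpus_per_node=0):
    # GPU training: one worker per GPU; CPU training: one worker per node,
    # with OpenMP parallelizing across the node's CPU cores.
    workers_per_node = gpus_per_node if gpus_per_node > 0 else 1
    return num_nodes * workers_per_node

assert dmlc_num_worker(num_nodes=4) == 4                    # CPU: 4 nodes x 1
assert dmlc_num_worker(num_nodes=2, gpus_per_node=4) == 8   # GPU: 2 nodes x 4 GPUs
```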
Next Steps
- Check out the XGBoost example.
- Learn more about the TrainerClient() APIs in the Kubeflow SDK.
- Explore the XGBoost documentation for advanced configuration options.