Distributed XGBoost on Kubernetes

Distributed XGBoost training and batch prediction on Kubernetes are supported via the Kubeflow XGBoost Operator.


To run an XGBoost job in a Kubernetes cluster, perform the following steps:

  1. Install the XGBoost Operator on the Kubernetes cluster.

    1. The XGBoost Operator is designed to manage the scheduling and monitoring of XGBoost jobs. Follow this installation guide to install the XGBoost Operator.

  2. Write application code that will be executed by the XGBoost Operator.

    1. To use the XGBoost Operator, you need to write Python scripts that implement the distributed training logic for XGBoost. Please refer to the Iris classification example.

    2. Data reader/writer: implement the data reader and writer according to the requirements of your chosen data source. For example, if your dataset is stored in a Hive table, write code that reads from or writes to the Hive table based on the index of the worker, so that each worker handles its own portion of the data.

    3. Model persistence: in the Iris classification example, the model is stored in Alibaba OSS. If you want to store your model in another storage system such as Amazon S3 or Google NFS, you need to implement the model persistence logic according to the requirements of that storage system.

  3. Configure the XGBoost job using a YAML file.

    1. The YAML file configures the computational resources and environment in which your XGBoost job runs, e.g. the number of workers/masters and the number of CPUs/GPUs. Please refer to this YAML template for an example.

  4. Submit the XGBoost job to a Kubernetes cluster.

    1. Use kubectl to submit a distributed XGBoost job as illustrated here.
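
The data reader described in step 2 selects data based on the index of the worker. A minimal sketch of one way to do this is shown below. The function name shard_for_worker and the parameters rank and world_size are hypothetical and not part of the XGBoost Operator; how a worker learns its index and the total number of workers depends on your job setup, and the in-memory list stands in for a real data source such as a Hive table.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_worker(rows: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the rows assigned to worker `rank` out of `world_size` workers.

    Hypothetical helper: `rank` is the zero-based worker index and
    `world_size` is the total number of workers in the job.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided slice: worker r takes rows r, r + world_size, r + 2 * world_size, ...
    # Every row goes to exactly one worker, and shard sizes differ by at most one.
    return list(rows[rank::world_size])


# Stand-in dataset of ten row identifiers, split across three workers.
rows = list(range(10))
shards = [shard_for_worker(rows, rank, 3) for rank in range(3)]
```

A strided split keeps the shards balanced without knowing the dataset size in advance. For a source that supports range queries, contiguous blocks per worker may be a better fit.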


Please submit an issue on the XGBoost Operator repo for any feature requests or problems.