Automated testing in XGBoost project

This document collects tips for using the Continuous Integration (CI) service of the XGBoost project.

GitHub Actions

The configuration files are located under the directory .github/workflows.

Most of the tests listed in the configuration files run automatically for every incoming pull request and every update to branches. A few tests, however, require manual activation:

R tests with noLD option: Run R tests using a custom-built R with compilation flag --disable-long-double. See this page for more details about noLD. This is a requirement for keeping XGBoost on CRAN (the R package index). To invoke this test suite for a particular pull request, add a review comment /gha run r-nold-test. (An ordinary comment won’t work; it must be a review comment.)

GitHub Actions is also used to build Python wheels targeting macOS on Intel and Apple Silicon. See .github/workflows/python_wheels.yml. The python_wheels pipeline sets up environment variables prefixed CIBW_* to indicate the target OS and processor. The pipeline then invokes the script build_python_wheels.sh, which in turn calls cibuildwheel to build the wheel. cibuildwheel is a library that sets up a suitable Python environment for each OS and processor target. Since we don’t have Apple Silicon machines in GitHub Actions, cross-compilation is needed; cibuildwheel takes care of the complex task of cross-compiling a Python wheel. (Note that cibuildwheel will call setup.py bdist_wheel. Since XGBoost has a native library component, setup.py contains glue code that calls CMake and a C++ compiler to build the native library on the fly.)
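The environment-variable handoff to cibuildwheel can be sketched as follows. This is a minimal illustration, not the contents of build_python_wheels.sh: the helper cibw_arch_for and the target names intel and apple_silicon are hypothetical, while CIBW_ARCHS and CIBW_BUILD are documented cibuildwheel options. The commented-out command assumes the Python package lives in a directory named python-package.

```shell
# Hypothetical helper: map a processor target to a CIBW_ARCHS value.
cibw_arch_for() {
  case "$1" in
    intel)         printf '%s\n' "x86_64" ;;
    apple_silicon) printf '%s\n' "arm64" ;;
    *) printf 'unknown target: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Select the architecture and restrict the build to matching macOS CPython wheels.
export CIBW_ARCHS="$(cibw_arch_for apple_silicon)"
export CIBW_BUILD="cp3*-macosx_${CIBW_ARCHS}"
printf 'CIBW_ARCHS=%s CIBW_BUILD=%s\n' "${CIBW_ARCHS}" "${CIBW_BUILD}"

# To build for real (requires cibuildwheel; package directory is an assumption):
# python -m cibuildwheel python-package --output-dir wheelhouse
```

Because cibuildwheel reads its configuration from the environment, the same invocation can serve both processor targets; only the exported values change.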

Reproduce CI testing environments using Docker containers

In our CI pipelines, we use Docker containers extensively to bundle many software packages together. You can reproduce the same testing environment as the CI pipelines by running Docker locally.

Prerequisites

Install Docker: https://docs.docker.com/engine/install/ubuntu/
Install NVIDIA Docker runtime, which lets you access NVIDIA GPUs inside a Docker container: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-ubuntu-and-debian

Building and Running Docker containers locally

For your convenience, we provide the wrapper script tests/ci_build/ci_build.sh. You can use it as follows:

tests/ci_build/ci_build.sh <CONTAINER_TYPE> <DOCKER_BINARY> --build-arg <BUILD_ARG> \
  <COMMAND> ...

where:

<CONTAINER_TYPE> is the identifier for the container. The wrapper script will use the container definition (Dockerfile) located at tests/ci_build/Dockerfile.<CONTAINER_TYPE>. For example, setting the container type to gpu will cause the script to load the Dockerfile tests/ci_build/Dockerfile.gpu.
<DOCKER_BINARY> must be either docker or nvidia-docker. Choose nvidia-docker if you need to run any GPU code.
<BUILD_ARG> is a build argument to be passed to Docker. Must be of form VAR=VALUE. Example: --build-arg CUDA_VERSION_ARG=11.0. You can pass multiple --build-arg.
<COMMAND> is the command to run inside the Docker container. This can be more than one argument. Example: tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON.
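To see how the four placeholders fit together, the sketch below prints, without running, the Dockerfile path and the full wrapper invocation for a given container type. The helper names dockerfile_for and ci_build_command are hypothetical and exist only for this illustration; the paths and arguments are the examples given above.

```shell
# Hypothetical helper: the Dockerfile the wrapper loads for a container type.
dockerfile_for() {
  printf 'tests/ci_build/Dockerfile.%s\n' "$1"
}

# Hypothetical helper: print (do not run) the wrapper invocation.
# Usage: ci_build_command <CONTAINER_TYPE> <DOCKER_BINARY> [--build-arg ...] <COMMAND> ...
ci_build_command() {
  container_type="$1"
  docker_binary="$2"
  shift 2
  printf '%s\n' "tests/ci_build/ci_build.sh ${container_type} ${docker_binary} $*"
}

dockerfile_for gpu
ci_build_command gpu nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 \
  tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON
```

Running the printed command from the repository root would build the gpu container and then compile XGBoost with CUDA and NCCL inside it.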

Optionally, you can set the environment variable CI_DOCKER_EXTRA_PARAMS_INIT to pass extra arguments to Docker. For example:

# Allocate extra space in /dev/shm to enable NCCL
export CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=4g'
# Run multi-GPU test suite
tests/ci_build/ci_build.sh gpu nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 \
  tests/ci_build/test_python.sh mgpu

To pass multiple extra arguments:

export CI_DOCKER_EXTRA_PARAMS_INIT='-e VAR1=VAL1 -e VAR2=VAL2 -e VAR3=VAL3'

Update pipeline definitions for BuildKite CI

BuildKite is a SaaS (Software as a Service) platform that orchestrates cloud machines to host CI pipelines. The BuildKite platform allows us to define CI pipelines as a declarative YAML file.

The pipeline definitions are found in tests/buildkite/:

tests/buildkite/pipeline-win64.yml: This pipeline builds and tests XGBoost for the Windows platform.
tests/buildkite/pipeline-mgpu.yml: This pipeline builds and tests XGBoost with access to multiple NVIDIA GPUs.
tests/buildkite/pipeline.yml: This pipeline builds and tests XGBoost with access to a single NVIDIA GPU. Most tests are located here.

Managing Elastic CI Stack with BuildKite

BuildKite allows us to define cloud resources in a declarative fashion. Every configuration step is now documented explicitly as code.

Prerequisite: You should have some knowledge of CloudFormation. CloudFormation lets us define a stack of cloud resources (EC2 machines, Lambda functions, S3, etc.) using a single YAML file.

Prerequisite: Gain access to the XGBoost project’s AWS account (admin@xgboost-ci.net), and then set up a credential pair in order to provision resources on AWS. See Creating an IAM user in your AWS account.

Option 1. Give full admin privileges to your IAM user. This is the simplest option.
Option 2. Give a limited set of permissions to your IAM user, to reduce the possibility of messing up other resources. For this, use the script tests/buildkite/infrastructure/service-user/create_service_user.py.

Worker Image Pipeline

Building images for worker machines used to be a chore: you’d provision an EC2 machine, SSH into it, and manually install the necessary packages. This process is not only laborious but also error-prone. You may forget to install a package or change a system configuration.

No more. Now we have an automated pipeline for building images for worker machines.

Run tests/buildkite/infrastructure/worker-image-pipeline/create_worker_image_pipelines.py in order to provision CloudFormation stacks named buildkite-linux-amd64-gpu-worker and buildkite-windows-gpu-worker. They are pipelines that create AMIs (Amazon Machine Images) for Linux and Windows workers, respectively.
Navigate to the CloudFormation web console to verify that the image builder pipelines have been provisioned. It may take some time.
Once the pipelines have been fully provisioned, run the script tests/buildkite/infrastructure/worker-image-pipeline/run_pipelines.py to execute the pipelines. New AMIs will be uploaded to the EC2 service. You can locate them in the EC2 console.
Make sure to modify tests/buildkite/infrastructure/aws-stack-creator/metadata.py to use the correct AMI IDs. (For linux-amd64-cpu and linux-arm64-cpu, use the AMIs provided by BuildKite. Consult the AWSRegion2AMI section of https://s3.amazonaws.com/buildkite-aws-stack/latest/aws-stack.yml.)
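The steps above can be sketched as a short script. This is an illustration under stated assumptions: the run_step helper and the DRY_RUN variable are hypothetical, the scripts are assumed to be launched with python from the repository root, and any arguments the scripts may require are not shown. By default the sketch only prints each step.

```shell
# Set DRY_RUN=0 to execute the steps; the default only prints them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run_step() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

pipeline_dir=tests/buildkite/infrastructure/worker-image-pipeline

# Step 1: provision the CloudFormation stacks for the image builder pipelines.
run_step python "${pipeline_dir}/create_worker_image_pipelines.py"

# Step 2 is manual: confirm in the CloudFormation console that provisioning finished.

# Step 3: execute the pipelines to produce new AMIs.
run_step python "${pipeline_dir}/run_pipelines.py"

# Step 4 is manual: copy the new AMI IDs into
# tests/buildkite/infrastructure/aws-stack-creator/metadata.py
```

Keeping a dry-run default is a deliberate choice here, since the real steps create billable AWS resources.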

EC2 Autoscaling Groups

In EC2, you can create auto-scaling groups, where you can dynamically adjust the number of worker instances according to workload. When a pull request is submitted, the following steps take place:

GitHub sends a signal to the registered webhook, which connects to the BuildKite server.
BuildKite sends a signal to a Lambda function named Autoscaling.
The Lambda function sends a signal to the auto-scaling group. The group scales up and adds additional worker instances.
New worker instances run the test jobs. Test results are reported back to BuildKite.
When the test jobs complete, BuildKite sends a signal to Autoscaling, which in turn requests the auto-scaling group to scale down. Idle worker instances are shut down.
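The scale-up and scale-down behavior can be pictured with a small calculation. The sketch below is purely illustrative and is not the logic of the Autoscaling Lambda function: the function name desired_capacity, its parameters, and the rounding rule are all assumptions made for this example. It computes how many workers a group would want for a given number of pending jobs, clamped to the group's size limits.

```shell
# Hypothetical model of a scaling decision (not the actual Lambda logic).
# Usage: desired_capacity <pending_jobs> <jobs_per_instance> <min_size> <max_size>
desired_capacity() {
  pending_jobs="$1"
  jobs_per_instance="$2"
  min_size="$3"
  max_size="$4"
  # Round up so that a partially filled instance still gets provisioned.
  wanted=$(( (pending_jobs + jobs_per_instance - 1) / jobs_per_instance ))
  if [ "${wanted}" -lt "${min_size}" ]; then wanted="${min_size}"; fi
  if [ "${wanted}" -gt "${max_size}" ]; then wanted="${max_size}"; fi
  printf '%s\n' "${wanted}"
}

desired_capacity 3 2 0 4   # jobs arrive: the group scales up
desired_capacity 0 2 0 4   # jobs finish: the group scales back down
```

With no pending jobs the wanted size falls to the minimum, which corresponds to idle worker instances being shut down.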

To set up the auto-scaling group, run the script tests/buildkite/infrastructure/aws-stack-creator/create_stack.py. Check the CloudFormation web console to verify successful provisioning of auto-scaling groups.
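As an alternative to the web console, the stack status can be read with the AWS CLI. The sketch below is a hedged example: the helper classify_stack_status is hypothetical, and the stack name in the commented-out command is a placeholder that must be replaced with a name created by create_stack.py. The status strings are standard CloudFormation stack statuses.

```shell
# Hypothetical helper: interpret a CloudFormation stack status string.
classify_stack_status() {
  case "$1" in
    *ROLLBACK*|*FAILED)              printf '%s\n' "failed" ;;
    CREATE_COMPLETE|UPDATE_COMPLETE) printf '%s\n' "ready" ;;
    *_IN_PROGRESS)                   printf '%s\n' "pending" ;;
    *)                               printf '%s\n' "unknown" ;;
  esac
}

# Querying a real stack requires AWS credentials; the stack name is a placeholder:
# status="$(aws cloudformation describe-stacks --stack-name <STACK_NAME> \
#   --query 'Stacks[0].StackStatus' --output text)"
# classify_stack_status "${status}"

classify_stack_status CREATE_IN_PROGRESS
classify_stack_status CREATE_COMPLETE
```

Rollback statuses are checked first, so a stack that is still rolling back is reported as failed rather than pending.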