This page contains information about GPU algorithms supported in XGBoost. To install GPU support, check out the Installation Guide.
Note
CUDA 8.0, Compute Capability 3.5 required
The GPU algorithms in XGBoost require a graphics card with compute capability 3.5 or higher and CUDA toolkit 8.0 or later. (See this list to look up the compute capability of your GPU card.)
Tree construction (training) and prediction can be accelerated with CUDA-capable GPUs.
Specify the tree_method parameter as one of the following algorithms.
tree_method | Description |
---|---|
gpu_exact | The standard XGBoost tree construction algorithm. Performs exact search for splits. Slower and uses considerably more memory than gpu_hist . |
gpu_hist | Equivalent to the XGBoost fast histogram algorithm. Much faster and uses considerably less memory. NOTE: Will run very slowly on GPUs older than Pascal architecture. |
parameter | gpu_exact | gpu_hist |
---|---|---|
subsample | ✘ | ✔ |
colsample_bytree | ✘ | ✔ |
colsample_bylevel | ✘ | ✔ |
max_bin | ✘ | ✔ |
gpu_id | ✔ | ✔ |
n_gpus | ✘ | ✔ |
predictor | ✔ | ✔ |
grow_policy | ✘ | ✔ |
monotone_constraints | ✘ | ✔ |
single_precision_histogram | ✘ | ✔ |
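The parameter support table can be turned into a quick pre-flight check. The sketch below is a hypothetical helper (`unsupported_params` is not part of XGBoost) that only transcribes the table; how XGBoost itself reacts to an unsupported parameter is not stated here, so the helper simply reports the names.

```python
# Parameters marked as supported by both GPU tree methods in the table.
SUPPORTED_BY_BOTH = {"gpu_id", "predictor"}

# Parameters marked as supported by gpu_hist only.
GPU_HIST_ONLY = {
    "subsample",
    "colsample_bytree",
    "colsample_bylevel",
    "max_bin",
    "n_gpus",
    "grow_policy",
    "monotone_constraints",
    "single_precision_histogram",
}


def unsupported_params(params):
    """Return the sorted names in `params` that the table marks as
    unsupported for the chosen GPU tree_method (hypothetical helper)."""
    method = params.get("tree_method")
    if method not in ("gpu_exact", "gpu_hist"):
        raise ValueError("tree_method must be 'gpu_exact' or 'gpu_hist'")
    if method == "gpu_hist":
        return []  # every parameter in the table is supported by gpu_hist
    return sorted(name for name in params if name in GPU_HIST_ONLY)
```

For example, passing `{"tree_method": "gpu_exact", "max_bin": 16, "gpu_id": 0}` reports `max_bin`, while the same dictionary with `gpu_hist` reports nothing.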
GPU-accelerated prediction is enabled by default for the tree_method values listed above, but can be switched to CPU prediction by setting predictor to cpu_predictor. This can be useful if you want to conserve GPU memory. Likewise, when using CPU algorithms, GPU-accelerated prediction can be enabled by setting predictor to gpu_predictor.
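As a minimal sketch of the two combinations described above, the following builds parameter dictionaries with the predictor set explicitly. The helper name `with_predictor` is hypothetical; the parameter names and values come from this page.

```python
def with_predictor(params, predict_on_gpu):
    """Return a copy of `params` with the predictor chosen explicitly."""
    updated = dict(params)
    updated["predictor"] = "gpu_predictor" if predict_on_gpu else "cpu_predictor"
    return updated


# GPU training with CPU prediction, e.g. to conserve GPU memory.
gpu_train_cpu_predict = with_predictor({"tree_method": "gpu_hist"}, predict_on_gpu=False)

# CPU training with GPU-accelerated prediction.
cpu_train_gpu_predict = with_predictor({"tree_method": "hist"}, predict_on_gpu=True)
```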
The experimental parameter single_precision_histogram can be set to True to enable building histograms using single precision. This may improve speed, in particular on older architectures.
The device ordinal can be selected using the gpu_id parameter, which defaults to 0.
Multiple GPUs can be used with the gpu_hist tree method using the n_gpus parameter, which defaults to 1. If this is set to -1, all available GPUs will be used. If gpu_id is specified as non-zero, the selected GPU devices will be from gpu_id up to (but not including) gpu_id+n_gpus. Note that gpu_id+n_gpus must be less than or equal to the number of available GPUs on your system. As with GPU vs. CPU, multi-GPU will not always be faster than a single GPU, because PCI bus bandwidth can limit performance.
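The device-selection rule can be illustrated with a small sketch. `selected_devices` is a hypothetical helper, not an XGBoost function. Two points are assumptions rather than statements from this page: with n_gpus set to -1 and a non-zero gpu_id, the sketch counts "all available GPUs" from gpu_id upward, and it treats n_gpus equal to 0 as selecting no GPU.

```python
def selected_devices(gpu_id=0, n_gpus=1, n_available=1):
    """Return the device ordinals implied by gpu_id and n_gpus.

    Hypothetical illustration of the rule gpu_id + n_gpus <= n_available.
    """
    if gpu_id < 0 or n_gpus < -1:
        raise ValueError("gpu_id must be >= 0 and n_gpus must be >= -1")
    if n_gpus == 0:
        return []  # assumption: 0 means no GPU is used
    if gpu_id >= n_available:
        raise ValueError("gpu_id is not a valid device ordinal")
    if n_gpus == -1:
        # Assumption: "all available" starts counting at gpu_id.
        n_gpus = n_available - gpu_id
    if gpu_id + n_gpus > n_available:
        raise ValueError("gpu_id + n_gpus exceeds the number of available GPUs")
    return list(range(gpu_id, gpu_id + n_gpus))
```

On a machine with four GPUs, `selected_devices(gpu_id=1, n_gpus=2, n_available=4)` yields devices 1 and 2, while `gpu_id=2, n_gpus=3` is rejected because 2 + 3 exceeds 4.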
Note
Enabling multi-GPU training
The default installation may not enable multi-GPU training. To use multiple GPUs, make sure to read Building with GPU support.
The GPU algorithms currently work with the CLI, Python, and R packages. See the Installation Guide for details.
param['gpu_id'] = 0
param['max_bin'] = 16
param['tree_method'] = 'gpu_hist'
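For context, the snippet above can be placed in a complete training call. The following is a sketch, assuming a GPU-enabled XGBoost build and the Python package; the function names `gpu_hist_params` and `train_on_gpu` and the choice of 10 boosting rounds are illustrative, not taken from this page.

```python
def gpu_hist_params(gpu_id=0, max_bin=16):
    """Build the parameter dictionary shown in the snippet above."""
    param = {}
    param["gpu_id"] = gpu_id
    param["max_bin"] = max_bin
    param["tree_method"] = "gpu_hist"
    return param


def train_on_gpu(X, y, num_boost_round=10):
    """Train a booster with gpu_hist on features X and labels y.

    Requires an XGBoost build with GPU support and a CUDA-capable device.
    """
    # Imported lazily so gpu_hist_params() is usable without xgboost installed.
    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(gpu_hist_params(), dtrain, num_boost_round=num_boost_round)
```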
Most of the objective functions implemented in XGBoost can be run on GPU. The following table shows the current support status.
Objectives | GPU support |
---|---|
reg:squarederror | ✔ |
reg:logistic | ✔ |
binary:logistic | ✔ |
binary:logitraw | ✔ |
binary:hinge | ✔ |
count:poisson | ✔ |
reg:gamma | ✔ |
reg:tweedie | ✔ |
multi:softmax | ✔ |
multi:softprob | ✔ |
survival:cox | ✘ |
rank:pairwise | ✘ |
rank:ndcg | ✘ |
rank:map | ✘ |
For multi-GPU support, objective functions also honor the n_gpus parameter, which is set to 1 by default. To disable running objectives on GPU, set n_gpus to 0.
The following table shows the current support status for evaluation metrics on the GPU.
Metric | GPU Support |
---|---|
rmse | ✔ |
mae | ✔ |
logloss | ✔ |
error | ✔ |
merror | ✘ |
mlogloss | ✘ |
auc | ✘ |
aucpr | ✘ |
ndcg | ✘ |
map | ✘ |
poisson-nloglik | ✔ |
gamma-nloglik | ✔ |
cox-nloglik | ✘ |
gamma-deviance | ✔ |
tweedie-nloglik | ✔ |
As with objective functions, metrics honor the n_gpus parameter, which is set to 1 by default. To disable running metrics on GPU, set n_gpus to 0.
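The objective and metric tables, together with the n_gpus rule, can be combined into one lookup. This is a sketch only: `runs_on_gpu` is a hypothetical helper that transcribes the two tables and does not query XGBoost. Names missing from the tables are rejected rather than guessed.

```python
# Transcribed from the objective table (True = GPU support).
OBJECTIVE_ON_GPU = {
    "reg:squarederror": True, "reg:logistic": True, "binary:logistic": True,
    "binary:logitraw": True, "binary:hinge": True, "count:poisson": True,
    "reg:gamma": True, "reg:tweedie": True, "multi:softmax": True,
    "multi:softprob": True, "survival:cox": False, "rank:pairwise": False,
    "rank:ndcg": False, "rank:map": False,
}

# Transcribed from the metric table (True = GPU support).
METRIC_ON_GPU = {
    "rmse": True, "mae": True, "logloss": True, "error": True,
    "merror": False, "mlogloss": False, "auc": False, "aucpr": False,
    "ndcg": False, "map": False, "poisson-nloglik": True,
    "gamma-nloglik": True, "cox-nloglik": False, "gamma-deviance": True,
    "tweedie-nloglik": True,
}


def runs_on_gpu(objective, eval_metric, n_gpus=1):
    """Return (objective_on_gpu, metric_on_gpu) according to the tables."""
    if objective not in OBJECTIVE_ON_GPU or eval_metric not in METRIC_ON_GPU:
        raise ValueError("objective or metric is not listed in the tables")
    if n_gpus == 0:
        return (False, False)  # n_gpus = 0 disables GPU objectives and metrics
    return (OBJECTIVE_ON_GPU[objective], METRIC_ON_GPU[eval_metric])
```

For instance, `binary:logistic` with `auc` gives a GPU objective but a CPU metric, so the pair is reported as `(True, False)`.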
You can run benchmarks on synthetic data for binary classification:
python tests/benchmark/benchmark.py
Training time on 1,000,000 rows x 50 columns with 500 boosting iterations and a 0.25/0.75 test/train split on an i7-6700K CPU @ 4.00GHz and a Pascal Titan X yields the following results:
tree_method | Time (s) |
---|---|
gpu_hist | 13.87 |
hist | 63.55 |
gpu_exact | 161.08 |
exact | 1082.20 |
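To read the table as relative speedups, divide the baseline time by the candidate time. The short sketch below does that arithmetic on the figures above; the function name `speedup` is illustrative.

```python
# Training times in seconds, copied from the benchmark table.
TIMES_S = {"gpu_hist": 13.87, "hist": 63.55, "gpu_exact": 161.08, "exact": 1082.20}


def speedup(baseline, candidate, times=TIMES_S):
    """Return how many times faster `candidate` is than `baseline`."""
    return times[baseline] / times[candidate]


for baseline, candidate in [("hist", "gpu_hist"), ("exact", "gpu_exact"), ("exact", "gpu_hist")]:
    print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.1f}x")
```

On this particular hardware and dataset, gpu_hist is roughly 4.6 times faster than hist, gpu_exact is roughly 6.7 times faster than exact, and gpu_hist is roughly 78 times faster than exact.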
See GPU Accelerated XGBoost and Updates to the XGBoost GPU algorithms for additional performance benchmarks of the gpu_exact and gpu_hist tree methods.
The application may be profiled with annotations by specifying USE_NVTX to cmake and providing the path to the stand-alone NVTX header via NVTX_HEADER_DIR. Regions covered by the ‘Monitor’ class in CUDA code will automatically appear in the Nsight profiler.
Nvidia Parallel Forall: Gradient Boosting, Decision Trees and XGBoost with CUDA
Many thanks to the following contributors (alphabetical order):
* Andrey Adinets
* Jiaming Yuan
* Jonathan C. McKinney
* Matthew Jones
* Philip Cho
* Rory Mitchell
* Shankara Rao Thejaswi Nanditale
* Vinay Deshpande
Please report bugs to the user forum https://discuss.xgboost.ai/.