3.2.0 (2026 Feb 09)
We are excited to announce the XGBoost 3.2 release. This release features significant progress on multi-target tree support with vector leaf, enhanced GPU external memory training, various optimizations, and the removal of the deprecated CLI.
External Memory
The latest XGBoost release features enhanced support for external memory training with GPUs. XGBoost now has experimental support for the CUDA async memory pool, which users can opt into for asynchronous memory management and more efficient external memory training. Prior to 3.2, this required the RMM plugin. The feature is Linux-only at the moment. (#11706, #11715, #11718, #11931, #11865, #11959, #11962)
The adaptive cache is now used for all device types, including devices with full C2C
bandwidth, like GH200 and DGX station. Users can continue to specify the
cache_host_ratio parameter in case of memory fragmentation. XGBoost now supports
devices with mixed GPU models for configuring the host cache (#11998). As part of the
work for improved NUMA system support, we co-developed the pyhwloc project
(#11992).
Lastly, the old page-concat option for GPU external memory has been removed. XGBoost will use the full dataset for training. (#11882, #11897)
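To make the role of cache_host_ratio concrete, the following is a minimal sketch of the arithmetic only. It assumes the ratio is defined as the host cache size divided by the total (host plus device) cache size, as documented for earlier releases. The helper name split_cache and the 64 GiB figure are hypothetical and not part of XGBoost.

```python
def split_cache(total_cache_bytes: int, cache_host_ratio: float) -> tuple[int, int]:
    """Split a cache between host and device memory (illustrative only).

    Assumes cache_host_ratio = host_cache / (host_cache + device_cache).
    """
    if not 0.0 <= cache_host_ratio <= 1.0:
        raise ValueError("cache_host_ratio must be within [0, 1]")
    host_bytes = int(total_cache_bytes * cache_host_ratio)
    device_bytes = total_cache_bytes - host_bytes
    return host_bytes, device_bytes


# Hypothetical 64 GiB cache with three quarters of it kept on the host.
host, device = split_cache(64 * 1024**3, 0.75)
print(f"host: {host / 1024**3:.0f} GiB, device: {device / 1024**3:.0f} GiB")
# host: 48 GiB, device: 16 GiB
```

Under this reading, a larger ratio keeps more of the cache in host memory and leaves more device memory free, which may help when device memory is fragmented.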
Multi-Target/Class
This release brings substantial progress on the vector-leaf-based multi-target tree model, building on the multi-target intercept work from 3.1. The vector leaf tree stores a vector of weights in each leaf node, enabling the model to capture correlations across targets during tree construction. In 3.2, we expanded the feature set to cover most of the commonly used training configurations.
Warning
The vector leaf is still a work in progress. Feedback is welcome.
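As a rough picture of what a vector leaf means, here is a toy tree in plain Python. It is an illustrative sketch, not XGBoost's internal data structure: the Node class and predict function are hypothetical names. The point is that a single tree structure is shared by all targets and each leaf returns one weight per target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A node in a toy vector-leaf tree (illustrative, not XGBoost's layout)."""

    feature: Optional[int] = None  # split feature index; None marks a leaf
    threshold: float = 0.0  # rows with x[feature] < threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weights: Optional[List[float]] = None  # one weight per target, leaves only


def predict(node: Node, x: List[float]) -> List[float]:
    """Walk a single tree and return the weight vector of the reached leaf."""
    while node.feature is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.weights


# One split shared by both targets; each leaf carries a 2-element vector.
tree = Node(
    feature=0,
    threshold=0.5,
    left=Node(weights=[-1.0, 0.2]),
    right=Node(weights=[1.5, -0.3]),
)
print(predict(tree, [0.1]))  # [-1.0, 0.2]
print(predict(tree, [0.9]))  # [1.5, -0.3]
```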
New features for the multi-target tree include:
Reduced gradient (sketch boost) for the hist tree method, which avoids using the full gradient matrix when searching for tree structures and thereby improves scalability with the number of targets. Users can use a custom objective to define the tree split gradient in addition to the full leaf gradient. Built-in objectives are not yet supported.
Support for all regression objectives, including MAE and the quantile loss.
GPU hist tree method implementation has features on par with the CPU one.
Regularization parameters including L1/L2, min_split_loss, and max_delta_step.
Row subsampling with both uniform sampling and gradient-based sampling.
Column sampling (feature selection), including feature weights.
Feature importance variants (gain and coverage).
Model dump support for all formats (JSON, text, graphviz).
External memory.
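The options listed above can be combined in an ordinary training configuration. The sketch below uses parameter names from the existing XGBoost Python interface (multi_strategy, sampling_method, and so on). The specific values and the helper names multi_target_params and train_and_inspect are illustrative assumptions, not recommendations from this release.

```python
def multi_target_params(device: str = "cpu") -> dict:
    """Parameters for a vector-leaf model; the values are illustrative."""
    return {
        "tree_method": "hist",
        "multi_strategy": "multi_output_tree",  # one tree with vector leaves
        "device": device,  # "cpu" or "cuda"
        "objective": "reg:squarederror",
        "reg_alpha": 0.1,  # L1 regularization
        "reg_lambda": 1.0,  # L2 regularization
        "min_split_loss": 0.0,
        "max_delta_step": 0.0,
        "subsample": 0.8,
        "sampling_method": "uniform",  # or "gradient_based"
        "colsample_bytree": 0.8,
    }


def train_and_inspect(X, y, device: str = "cpu", num_boost_round: int = 16):
    """Train on a 2-D label matrix, then dump the model and its importances."""
    import xgboost as xgb  # imported here so the helper above has no dependency

    dtrain = xgb.DMatrix(X, label=y)  # y has shape (n_samples, n_targets)
    booster = xgb.train(multi_target_params(device), dtrain, num_boost_round)
    gain = booster.get_score(importance_type="gain")
    trees_json = booster.get_dump(dump_format="json")
    return booster, gain, trees_json
```

Calling train_and_inspect requires XGBoost to be installed, and device="cuda" additionally requires a CUDA-capable GPU.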
In addition, intercept initialization for the multinomial logistic objective now adheres to GLM semantics.
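One way to read "GLM semantics" here is that the intercept lives in link space, so applying the inverse link (softmax) to the intercept alone recovers the observed class frequencies. The sketch below demonstrates that property in plain Python. The function names are hypothetical, and the zero-sum centering is an arbitrary choice for illustration; the exact normalization XGBoost uses is not stated in these notes.

```python
import math
from collections import Counter
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_intercept(labels: Sequence[int], n_classes: int) -> List[float]:
    """Intercept of an intercept-only multinomial logistic model.

    Uses log class frequencies, centered to sum to zero. The centering is
    arbitrary because softmax ignores a constant shift. Every class must
    appear at least once, otherwise log(0) is undefined.
    """
    counts = Counter(labels)
    log_p = [math.log(counts[k] / len(labels)) for k in range(n_classes)]
    mean = sum(log_p) / n_classes
    return [v - mean for v in log_p]


labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]  # class frequencies 0.5, 0.3, 0.2
intercept = multinomial_intercept(labels, 3)
print([round(p, 3) for p in softmax(intercept)])  # [0.5, 0.3, 0.2]
```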
Related PRs: #11950, #11914, #11913, #11965, #11941, #11967, #11940, #11896, #11894, #11889, #11917, #11883, #11786, #11881, #11862, #11855, #11829, #11825, #11820, #11814, #11729, #11724, #11747, #11798, #11791, #11789, #11781, #11778, #11777, #11744, #11922, #11920
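For the reduced gradient item in the list above, the general idea is to search for splits with a narrower gradient matrix while still computing leaf values from the full one. The sketch below shows one possible reduction, keeping the target columns with the largest Euclidean norm. This particular strategy and the name top_output_sketch are assumptions for illustration; these notes do not say which reduction XGBoost applies, and a custom objective may define the split gradient differently.

```python
import math
from typing import List


def top_output_sketch(grad: List[List[float]], k: int) -> List[List[float]]:
    """Keep the k gradient columns (targets) with the largest Euclidean norm."""
    n_targets = len(grad[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in grad)) for j in range(n_targets)]
    # Pick the k largest-norm columns, then restore their original order.
    keep = sorted(sorted(range(n_targets), key=lambda j: -norms[j])[:k])
    return [[row[j] for j in keep] for row in grad]


# 3 rows, 4 targets: columns 1 and 3 carry most of the gradient signal.
full_grad = [
    [0.1, 2.0, 0.0, -1.5],
    [0.0, -1.0, 0.1, 1.0],
    [0.2, 1.5, 0.0, -2.0],
]
split_grad = top_output_sketch(full_grad, k=2)
print(split_grad)  # [[2.0, -1.5], [-1.0, 1.0], [1.5, -2.0]]
```

In this picture, split_grad (3 x 2) would drive the split search, while full_grad (3 x 4) would still be used to compute the weight vector in each leaf.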
Currently missing features for the hist tree method with vector leaf:
Distributed training
Categorical features
Feature interaction constraints
Monotone constraints, which are not defined when the output is a vector.
Shapley values
Features
As part of the vector leaf work, CPU hist now supports gradient-based sampling.
The deprecated CLI (command line interface) has been removed. It was deprecated in 2.1. (#11720)
Expose the categories container to the C API, allowing C users to access category information from the trained model. (#11794)
Support the oneAPI 2026 release. (#11994)
Compatibility fixes for the latest versions of nvcomp, RMM, and CCCL. (#11930, #11834, #11871, #11995, #11861, #11785, #11997). A nightly CI pipeline was added to test XGBoost with the latest versions of CCCL and RMM. (#11863)
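As background for the gradient-based sampling item above: the XGBoost parameter documentation describes the selection probability of a row as proportional to sqrt(g^2 + lambda * h^2), where g and h are the row's gradient and Hessian. The sketch below computes those normalized weights for scalar gradients only. The function name selection_weights and the default lam=1.0 are hypothetical, and the actual sampler does more than this (for example, it must also rescale the gradients of the sampled rows).

```python
import math
from typing import List


def selection_weights(
    grad: List[float], hess: List[float], lam: float = 1.0
) -> List[float]:
    """Normalized row weights proportional to sqrt(g^2 + lam * h^2)."""
    raw = [math.sqrt(g * g + lam * h * h) for g, h in zip(grad, hess)]
    total = sum(raw)
    return [r / total for r in raw]


grad = [0.05, -2.0, 0.5, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
weights = selection_weights(grad, hess)
# Rows with larger gradient magnitude are more likely to be selected.
print(max(range(len(weights)), key=weights.__getitem__))  # 1
```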
Optimizations
Various optimizations for the GPU hist tree method, some of which were done as part of the vector leaf work. (#11895)
Enable multi-threaded data initialization for CPU. (#11974)
Make the block_size of the CPU histogram building kernel adaptive based on model parameters and CPU cache size, demonstrating up to 2x speedup for certain workloads. (#11808)
Small optimizations for some GPU kernels to use TMA. (#11841, #11802)
We now use device memory for storing the tree model, which eliminates data copies between host and device during training and inference. (#11759, #11735, #11750, #11741, #11752)
Fixes
Python Package
R Package
Fix RCHK warnings and memory safety issues. (#11938, #11935, #11847)
Error out on factors passed to DMatrix with an informative message. (#11810)
Remove calls to R’s global RNG that are no longer needed. (#11848, #11887)
Various documentation fixes and updates. (#11773, #11890, #11732, #11846, #11981, #11842)
JVM Packages
Remove synchronized from predict, as internal prediction is already thread-safe, with a concurrency test added to verify. (#11746)
Set GPU device ID explicitly at the beginning of training and avoid CUDA API guard for the tracker process, allowing Spark executors to run in exclusive mode. (#11939, #11929)
Use inferBatchSizeParameter instead of a hardcoded value. (#11745)
Documentation updates, maintenance. (#11691, #11915, #11743)
Documents
CI and Maintenance
Support pre-commit for various linting and formatting tasks. clang-format is now required by the CI. (#11984, #11978, #11980, #11958, #11953, #11946, #11993)
We added sccache integration to XGBoost’s CI workflows, which brings significant speedup since a majority of the time is spent on compiling variants of XGBoost. In addition, most of the workflows now use GHA container support. (#11956, #11952, #11949, #11937, #11934, #11927, #11932, #11924, #11979)
Various dependency updates, fixes, test refactoring, and cleanups. (#11955, #11957, #11963, #11945, #11912, #11909, #11888, #11898, #11925, #11877, #11824, #11748, #11721, #11705, #11699, #11832, #11796, #11828, #11852, #11800, #11999, #11991)