DART booster¶

XGBoost mostly combines a huge number of regression trees with a small learning rate. In this situation, trees added early are significant and trees added late are unimportant.

Vinayak and Gilad-Bachrach proposed a new method to add dropout techniques from the deep neural net community to boosted trees, and reported better results in some situations.

This is a instruction of new tree booster dart.

Original paper¶

Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive Regression Trees.” JMLR.

Features¶

Drop trees in order to solve the over-fitting.
- Trivial trees (to correct trivial errors) may be prevented.

Because of the randomness introduced in the training, expect the following few differences:

Training can be slower than gbtree because the random dropout prevents usage of the prediction buffer.
The early stop might not be stable, due to the randomness.

How it works¶

In \(m\)-th training round, suppose \(k\) trees are selected to be dropped.
Let \(D = \sum_{i \in \mathbf{K}} F_i\) be the leaf scores of dropped trees and \(F_m = \eta \tilde{F}_m\) be the leaf scores of a new tree.
The objective function is as follows:

\[\mathrm{Obj} = \sum_{j=1}^n L \left( y_j, \hat{y}_j^{m-1} - D_j + \tilde{F}_m \right) + \Omega \left( \tilde{F}_m \right).\]

\(D\) and \(F_m\) are overshooting, so using scale factor

\[\hat{y}_j^m = \sum_{i \not\in \mathbf{K}} F_i + a \left( \sum_{i \in \mathbf{K}} F_i + b F_m \right) .\]

Parameters¶

The booster dart inherits gbtree booster, so it supports all parameters that gbtree does, such as eta, gamma, max_depth etc.

Additional parameters are noted below:

sample_type: type of sampling algorithm.
- uniform: (default) dropped trees are selected uniformly.
- weighted: dropped trees are selected in proportion to weight.
normalize_type: type of normalization algorithm.
- tree: (default) New trees have the same weight of each of dropped trees.
\[\begin{split}a \left( \sum_{i \in \mathbf{K}} F_i + \frac{1}{k} F_m \right) &= a \left( \sum_{i \in \mathbf{K}} F_i + \frac{\eta}{k} \tilde{F}_m \right) \\ &\sim a \left( 1 + \frac{\eta}{k} \right) D \\ &= a \frac{k + \eta}{k} D = D , \\ &\quad a = \frac{k}{k + \eta}\end{split}\]
- forest: New trees have the same weight of sum of dropped trees (forest).
\[\begin{split}a \left( \sum_{i \in \mathbf{K}} F_i + F_m \right) &= a \left( \sum_{i \in \mathbf{K}} F_i + \eta \tilde{F}_m \right) \\ &\sim a \left( 1 + \eta \right) D \\ &= a (1 + \eta) D = D , \\ &\quad a = \frac{1}{1 + \eta} .\end{split}\]
rate_drop: dropout rate.
- range: [0.0, 1.0]
skip_drop: probability of skipping dropout.
- If a dropout is skipped, new trees are added in the same manner as gbtree.
- range: [0.0, 1.0]

Sample Script¶

import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'booster': 'dart',
         'max_depth': 5, 'learning_rate': 0.1,
         'objective': 'binary:logistic', 'silent': True,
         'sample_type': 'uniform',
         'normalize_type': 'tree',
         'rate_drop': 0.1,
         'skip_drop': 0.5}
num_round = 50
bst = xgb.train(param, dtrain, num_round)
# make prediction
# ntree_limit must not be 0
preds = bst.predict(dtest, ntree_limit=num_round)

Note

Specify ntree_limit when predicting with test sets

By default, bst.predict() will perform dropouts on trees. To obtain correct results on test sets, disable dropouts by specifying a nonzero value for ntree_limit.

Table Of Contents