Parse a boosted tree model text dump into a data.table structure.

Usage

xgb.model.dt.tree(model, trees = NULL, use_int_id = FALSE, ...)

Arguments

model

Object of class xgb.Booster. If it contains feature names (they can be set through setinfo()), they will be used in the output from this function.

If the model contains categorical features, an error will be thrown.

trees

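An integer vector of one-based tree indices to include. The default (NULL) uses all trees. Useful, e.g., in multiclass classification to get only the trees of one class. Note that the "Tree" column of the output is zero-based, so tree index 1 here appears as Tree 0 in the result.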

use_int_id

A logical flag indicating whether nodes in columns "Yes", "No", and "Missing" should be represented as integers (when TRUE) or as "Tree-Node" character strings (when FALSE, default).

...

Not used.

Some arguments that were part of this function in previous XGBoost versions are now deprecated or have been renamed. If a deprecated or renamed argument is passed, the function throws a warning (by default) and uses its current equivalent instead. This warning becomes an error when the 'strict mode' option is enabled.

If some additional argument is passed that is neither a current function argument nor a deprecated or renamed argument, a warning or error will be thrown depending on the 'strict mode' option.

Important: ... will be removed in a future version, and all the current deprecation warnings will become errors. Please use only arguments that form part of the function signature.
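As a hedged illustration of the trees argument, the sketch below fits a small three-class model on the built-in iris data and keeps only the trees of the first class. It assumes that trees are stored round by round with one tree per class in each round, so the trees of class k (zero-based) sit at one-based positions k + 1, k + 1 + num_class, and so on. That layout is an assumption about the model's internal ordering rather than something stated on this page, and the variable names are illustrative.

```r
library(xgboost)

# Small three-class problem built from base R's iris data.
x <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1L  # class labels must start at 0

num_class <- 3L
nrounds <- 2L

bst_multi <- xgb.train(
  data = xgb.DMatrix(x, label = y, nthread = 1),
  nrounds = nrounds,
  params = list(
    objective = "multi:softprob",
    num_class = num_class,
    max_depth = 2,
    nthread = 1
  )
)

# ASSUMPTION: one tree per class per round, interleaved by class.
# One-based indices of the trees belonging to class 0: 1, 4, ...
class_id <- 0L
class_trees <- seq(class_id + 1L, by = num_class, length.out = nrounds)

dt_class <- xgb.model.dt.tree(bst_multi, trees = class_trees)

# The "Tree" column is zero-based, so the selected trees should be
# reported with IDs one lower than the indices passed in 'trees'.
unique(dt_class$Tree)
```

Building the index vector with seq() keeps the selection correct if nrounds or num_class changes.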

Value

A data.table with detailed information about tree nodes. It has the following columns:

  • Tree: integer ID of a tree in a model (zero-based index).

  • Node: integer ID of a node in a tree (zero-based index).

  • ID: character identifier of a node in a model (only when use_int_id = FALSE).

  • Feature: for a branch node, a feature ID or name (when available); for a leaf node, it simply labels it as "Leaf".

  • Split: location of the split for a branch node (split condition is always "less than").

  • Yes: ID of the next node when the split condition is met.

  • No: ID of the next node when the split condition is not met.

  • Missing: ID of the next node when the branch value is missing.

  • Gain: either the split gain (change in loss) or the leaf value.

  • Cover: metric related to the number of observations either seen by a split or collected by a leaf during training.

When use_int_id = FALSE, columns "Yes", "No", and "Missing" point to model-wide node identifiers in the "ID" column. When use_int_id = TRUE, those columns point to node identifiers from the corresponding trees in the "Node" column.
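The difference between the two ID schemes matters mostly when joining the table to itself. The following is a minimal sketch, assuming the agaricus data shipped with the package and the column names listed above. With character IDs a join on "ID" is enough, whereas with integer IDs the join also has to include "Tree" because node numbers restart in every tree. The object names (dt_chr, dt_int, by_id, by_node) are illustrative only.

```r
library(xgboost)
library(data.table)

data(agaricus.train, package = "xgboost")
data.table::setDTthreads(1)

bst <- xgb.train(
  data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label, nthread = 1),
  nrounds = 2,
  params = list(max_depth = 2, nthread = 1, objective = "binary:logistic")
)

dt_chr <- xgb.model.dt.tree(bst)                     # character "Tree-Node" IDs
dt_int <- xgb.model.dt.tree(bst, use_int_id = TRUE)  # integer node IDs, no "ID" column

# Character IDs are unique across the whole model, so joining on "ID" suffices.
by_id <- merge(
  dt_chr,
  dt_chr[, .(ID, Y.Feature = Feature)],
  by.x = "Yes", by.y = "ID", all.x = TRUE
)[order(Tree, Node)]

# Integer IDs are unique only within a tree, so the join must also use "Tree".
by_node <- merge(
  dt_int,
  dt_int[, .(Tree, Yes = Node, Y.Feature = Feature)],
  by = c("Tree", "Yes"), all.x = TRUE
)[order(Tree, Node)]

# Both routes should attach the same 'Yes'-child feature to every node.
identical(by_id$Y.Feature, by_node$Y.Feature)
```

Leaf rows have no children, so their "Yes", "No", and "Missing" entries are NA and the joined Y.Feature is NA as well.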

Details

Note that this function does not work with models that were fitted to categorical data, and is only applicable to tree-based boosters (not gblinear).

Examples

# Basic use:

data(agaricus.train, package = "xgboost")
## Keep the number of threads to 1 for examples
nthread <- 1
data.table::setDTthreads(nthread)

bst <- xgb.train(
  data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label, nthread = nthread),
  nrounds = 2,
  params = xgb.params(
    max_depth = 2,
    nthread = nthread,
    objective = "binary:logistic"
  )
)

# The bst model already has feature names stored with it, so they are used
# in the "Feature" column of the output:
dt <- xgb.model.dt.tree(bst)

# Attach to each node the feature name of the split reached through its 'Yes' branch:
merge(
  dt,
  dt[, .(ID, Y.Feature = Feature)], by.x = "Yes", by.y = "ID", all.x = TRUE
)[
  order(Tree, Node)
]
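
Another possible use of the returned table is a rough per-feature summary of split gain. The sketch below relies only on the "Feature" and "Gain" columns described under Value. Because leaf rows store the leaf value in "Gain", they are filtered out before summing. The result name gain_by_feature and its column names are illustrative, and the totals are raw sums rather than normalized importance scores.

```r
library(xgboost)
library(data.table)

data(agaricus.train, package = "xgboost")
data.table::setDTthreads(1)

bst <- xgb.train(
  data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label, nthread = 1),
  nrounds = 2,
  params = list(max_depth = 2, nthread = 1, objective = "binary:logistic")
)

dt <- xgb.model.dt.tree(bst)

# Keep branch nodes only: for leaves, "Gain" holds the leaf value, not a split gain.
gain_by_feature <- dt[
  Feature != "Leaf",
  .(TotalGain = sum(Gain), Splits = .N),
  by = Feature
][order(-TotalGain)]

gain_by_feature
```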