Represents previously calculated feature importance as a bar graph.

  • xgb.plot.importance() uses base R graphics, while

  • xgb.ggplot.importance() uses "ggplot".

Usage

xgb.ggplot.importance(
  importance_matrix = NULL,
  top_n = NULL,
  measure = NULL,
  rel_to_first = FALSE,
  n_clusters = seq_len(10),
  ...
)

xgb.plot.importance(
  importance_matrix = NULL,
  top_n = NULL,
  measure = NULL,
  rel_to_first = FALSE,
  left_margin = 10,
  cex = NULL,
  plot = TRUE,
  ...
)

Arguments

importance_matrix

A data.table as returned by xgb.importance().

top_n

Maximum number of top features to include in the plot.

measure

The name of the importance measure to plot. When NULL, 'Gain' is used for tree models and 'Weight' is used for gblinear.

rel_to_first

Whether importance values should be represented as relative to the highest ranked feature, see Details.

n_clusters

A numeric vector giving the range (minimum and maximum) of the possible number of bar clusters. Only used in xgb.ggplot.importance().

...

Other parameters passed to graphics::barplot() (except horiz, border, cex.names, names.arg, and las). Only used in xgb.plot.importance().

left_margin

Adjust the left margin size to fit feature names. When NULL, the existing par("mar") is used.

cex

Passed as cex.names parameter to graphics::barplot().

plot

Should the barplot be shown? Default is TRUE.

Value

The return value depends on the function:

  • xgb.plot.importance(): Invisibly, a "data.table" with the top_n features sorted by importance. If plot = TRUE, the values are also drawn as a bar plot.

  • xgb.ggplot.importance(): A customizable "ggplot" object. For example, to change the title, add + ggplot2::ggtitle("A GRAPH NAME").
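
The two return values can be used as in the following minimal sketch. It assumes the same xgboost interface and agaricus.train data as the Examples section, and that the ggplot2 package is installed; the object names top and gg and the choice top_n = 3 are illustrative only.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")

# Small model, fitted the same way as in the Examples section.
model <- xgboost(
  agaricus.train$data, factor(agaricus.train$label),
  nrounds = 2,
  max_depth = 3,
  nthreads = 2
)
importance_matrix <- xgb.importance(model)

# plot = FALSE skips the drawing step; the sorted table is still returned.
top <- xgb.plot.importance(importance_matrix, top_n = 3, plot = FALSE)
print(top)

# The ggplot variant returns an object that can be extended with `+`.
gg <- xgb.ggplot.importance(importance_matrix)
gg + ggplot2::ggtitle("A GRAPH NAME")
```

Because xgb.plot.importance() returns its table invisibly, the result has to be assigned or printed explicitly to be inspected.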

Details

The graph represents each feature as a horizontal bar of length proportional to the importance of a feature. Features are sorted by decreasing importance. It works for both "gblinear" and "gbtree" models.

When rel_to_first = FALSE, the values are plotted as they appear in importance_matrix. For a "gbtree" model, this means they are normalized to sum to 1 ("what is the feature's importance contribution relative to the whole model?"). For linear models, rel_to_first = FALSE shows the actual values of the coefficients. Setting rel_to_first = TRUE shows the picture from the perspective of "what is the feature's importance contribution relative to the most important feature?"
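
The difference between the two scalings can be illustrated with plain arithmetic. The sketch below uses a hypothetical hand-written table rather than real xgb.importance() output, and it assumes that rel_to_first = TRUE amounts to dividing every value by the largest absolute value; the package's exact rescaling is not stated here.

```r
# Hypothetical importance values for a tree model; Gain sums to 1.
imp <- data.frame(
  Feature = c("f1", "f2", "f3"),
  Gain = c(0.5, 0.3, 0.2)
)

# rel_to_first = FALSE: values are used exactly as stored.
as_is <- imp$Gain

# rel_to_first = TRUE (assumed rescaling): divide by the top value,
# so the most important feature is shown as 1.
relative <- imp$Gain / max(abs(imp$Gain))

print(data.frame(Feature = imp$Feature, as_is = as_is, relative = relative))
```

Under this assumption the ranking of features is unchanged; only the axis scale differs.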

The "ggplot" backend performs 1-D clustering of the importance values, with bar colors corresponding to different clusters having similar importance values.
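
To give a rough idea of what 1-D clustering of importance values means, the sketch below groups a hypothetical importance vector into two clusters. It is not the routine the package uses: stats::kmeans() with a fixed number of clusters is swapped in purely for illustration, whereas the "ggplot" backend selects the number of clusters from n_clusters.

```r
# Hypothetical importance values: two strong features, two weak ones.
importance <- c(f1 = 0.50, f2 = 0.45, f3 = 0.03, f4 = 0.02)

# Illustrative stand-in for 1-D clustering: k-means with k = 2.
set.seed(1)
fit <- kmeans(importance, centers = 2, nstart = 10)

# Features with similar importance share a cluster label;
# in the plot, each cluster would get its own bar color.
clusters <- fit$cluster
print(split(names(importance), clusters))
```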

Examples

data(agaricus.train)

## Keep the number of threads to 2 for examples
nthread <- 2
data.table::setDTthreads(nthread)

model <- xgboost(
  agaricus.train$data, factor(agaricus.train$label),
  nrounds = 2,
  max_depth = 3,
  nthreads = nthread
)

importance_matrix <- xgb.importance(model)
xgb.plot.importance(
  importance_matrix, rel_to_first = TRUE, xlab = "Relative importance"
)

gg <- xgb.ggplot.importance(
  importance_matrix, measure = "Frequency", rel_to_first = TRUE
)
gg
gg + ggplot2::ylab("Frequency")