The cross-validation function of xgboost.
Usage
xgb.cv(
params = xgb.params(),
data,
nrounds,
nfold,
prediction = FALSE,
showsd = TRUE,
metrics = list(),
objective = NULL,
custom_metric = NULL,
stratified = "auto",
folds = NULL,
train_folds = NULL,
verbose = TRUE,
print_every_n = 1L,
early_stopping_rounds = NULL,
maximize = NULL,
callbacks = list(),
...
)
Arguments
- params
List of XGBoost parameters which control the model building process. See the online documentation and the documentation for xgb.params() for details.
Should be passed as a list with named entries. Parameters that are not specified in this list will use their default values.
A list of named parameters can be created through the function xgb.params(), which accepts all valid parameters as function arguments.
- data
An xgb.DMatrix object, with corresponding fields like label or bounds as required for model training by the objective.
Note that only the basic xgb.DMatrix class is supported - variants such as xgb.QuantileDMatrix or xgb.ExtMemDMatrix are not supported here.
- nrounds
Maximum number of boosting iterations.
- nfold
The original dataset is randomly partitioned into nfold equally sized subsamples.
- prediction
A logical value indicating whether to return the test fold predictions from each CV model. This parameter engages the xgb.cb.cv.predict() callback.
- showsd
Logical value indicating whether to show the standard deviation of the cross-validation results.
- metrics
List of evaluation metrics to be used in cross-validation. When not specified, the evaluation metric is chosen according to the objective function. Possible options are:
- error: Binary classification error rate
- rmse: Root mean square error
- logloss: Negative log-likelihood function
- mae: Mean absolute error
- mape: Mean absolute percentage error
- auc: Area under curve
- aucpr: Area under PR curve
- merror: Exact matching error used to evaluate multi-class classification
- objective
Customized objective function. Should take two arguments: the first one will be the current predictions (either a numeric vector or matrix depending on the number of targets / classes), and the second one will be the data DMatrix object that is used for training.
It should return a list with two elements grad and hess (in that order), as either numeric vectors or numeric matrices depending on the number of targets / classes (same dimension as the predictions that are passed as the first argument). See the sketch after this argument list for an illustration.
- custom_metric
Customized evaluation function. Just like objective, should take two arguments, with the first one being the predictions and the second one the data DMatrix.
Should return a list with two elements: metric (name that will be displayed for this metric, should be a string / character), and value (the number that the function calculates, should be a numeric scalar). An example is included in the sketch after this argument list.
Note that even when passing custom_metric, objectives also have an associated default metric that will be evaluated in addition to it. In order to disable the built-in metric, one can pass the parameter disable_default_eval_metric = TRUE.
- stratified
Logical flag indicating whether sampling of folds should be stratified by the values of outcome labels. For real-valued labels in regression objectives, stratification will be done by discretizing the labels into up to 5 buckets beforehand.
If passing "auto", will be set to TRUE if the objective in params is a classification objective (from XGBoost's built-in objectives, doesn't apply to custom ones), and to FALSE otherwise.
This parameter is ignored when data has a group field - in such a case, the splitting will be based on whole groups (note that this might make the folds have different sizes).
Value TRUE here is not supported for custom objectives.
- folds
List with pre-defined CV folds (each element must be a vector of the test fold's indices). When folds are supplied, the nfold and stratified parameters are ignored. A usage sketch with manually constructed folds is included at the end of the 'Details' section.
If data has a group field and the objective requires this field, each fold (list element) must additionally have two attributes (retrievable through attributes) named group_test and group_train, which should hold the group to assign through setinfo.xgb.DMatrix() to the resulting DMatrices.
- train_folds
List specifying which indices to use for training. If NULL (the default), all indices not specified in folds will be used for training.
This is not supported when data has a group field.
- verbose
If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the xgb.cb.print.evaluation(period=1) callback function.
- print_every_n
When passing verbose > 0, evaluation logs (metrics calculated on the data passed under evals) will be printed every nth iteration according to the value passed here. The first and last iteration are always included regardless of this 'n'.
Only has an effect when passing data under evals and when passing verbose > 0. The parameter is passed to the xgb.cb.print.evaluation() callback.
- early_stopping_rounds
Number of boosting rounds after which training will be stopped if there is no improvement in performance (as measured by the evaluation metric that is supplied or selected by default for the objective) on the evaluation data passed under evals.
Must pass evals in order to use this functionality. Setting this parameter adds the xgb.cb.early.stop() callback. A short usage sketch follows the examples below.
If NULL, early stopping will not be used.
- maximize
If custom_metric and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better. This parameter is passed to the xgb.cb.early.stop() callback.
- callbacks
A list of callback functions to perform various tasks during boosting. See xgb.Callback(). Some of the callbacks are automatically created depending on the parameters' values. The user can provide either existing or their own callback methods in order to customize the training process.
- ...
Not used.
Some arguments that were part of this function in previous XGBoost versions are currently deprecated or have been renamed. If a deprecated or renamed argument is passed, will throw a warning (by default) and use its current equivalent instead. This warning will become an error if using the 'strict mode' option.
If some additional argument is passed that is neither a current function argument nor a deprecated or renamed argument, a warning or error will be thrown depending on the 'strict mode' option.
Important: ... will be removed in a future version, and all the current deprecation warnings will become errors. Please use only arguments that form part of the function signature.
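As an illustration of the objective and custom_metric descriptions above, here is a minimal, hypothetical sketch (not part of the package's official examples): a squared-error custom objective and a mean-absolute-deviation custom metric, both invented here purely for demonstration.
# Hypothetical sketch: custom objective and custom metric with xgb.cv().
data(agaricus.train, package = "xgboost")
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

# Custom objective: receives the current predictions and the training DMatrix,
# and returns list(grad, hess) with the same dimension as the predictions.
squared_error_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  list(grad = preds - labels, hess = rep(1, length(labels)))
}

# Custom metric: returns list(metric = <display name>, value = <numeric scalar>).
mad_metric <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  list(metric = "mad", value = mean(abs(preds - labels)))
}

cv_custom <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nfold = 5,
  params = xgb.params(nthread = 2, max_depth = 3),
  objective = squared_error_obj,
  custom_metric = mad_metric
)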
Value
An object of class 'xgb.cv.synchronous' with the following elements:
- call: Function call.
- params: Parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the xgb.cb.reset.parameters() callback.
- evaluation_log: Evaluation history stored as a data.table with the first column corresponding to iteration number and the rest corresponding to the CV-based evaluation means and standard deviations for the training and test CV-sets. It is created by the xgb.cb.evaluation.log() callback.
- niter: Number of boosting iterations.
- nfeatures: Number of features in training data.
- folds: The list of CV folds' indices - either those passed through the folds parameter or randomly generated.
- best_iteration: Iteration number with the best evaluation metric value (only available with early stopping).
Plus other potential elements that are the result of callbacks, such as a list cv_predict with a sub-element pred when passing prediction = TRUE, which is added by the xgb.cb.cv.predict() callback (note that one can also pass it manually under callbacks with different settings, such as saving also the models created during cross-validation); or a list early_stop which will contain elements such as best_iteration when using the early stopping callback (xgb.cb.early.stop()).
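A minimal sketch of retrieving one of those callback-created elements, assuming the agaricus data used in the examples below; the cv_predict$pred access follows the description above.
data(agaricus.train, package = "xgboost")
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

cv <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nfold = 5,
  prediction = TRUE,
  params = xgb.params(nthread = 2, max_depth = 3, objective = "binary:logistic")
)

# Out-of-fold predictions added by the xgb.cb.cv.predict() callback: one
# prediction per training row, made by the model whose test fold contained that row.
head(cv$cv_predict$pred)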
Details
The original sample is randomly partitioned into nfold equally sized subsamples.
Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.
The cross-validation process is then repeated nfold times, with each of the nfold subsamples used exactly once as the validation data.
All observations are used for both training and validation.
Adapted from https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
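The partitioning described above can also be supplied manually through the folds argument. A minimal sketch, assuming a simple random assignment of rows to 5 test folds (the helper objects fold_id and manual_folds are illustrative, not part of the package):
data(agaricus.train, package = "xgboost")
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

set.seed(1)
n <- nrow(dtrain)
fold_id <- sample(rep(1:5, length.out = n))   # random fold assignment per row
manual_folds <- split(seq_len(n), fold_id)    # list of test-fold index vectors

cv_manual <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nfold = length(manual_folds),  # ignored when 'folds' is supplied, per the argument description
  folds = manual_folds,
  params = xgb.params(nthread = 2, max_depth = 3, objective = "binary:logistic")
)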
Examples
data(agaricus.train, package = "xgboost")
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))
cv <- xgb.cv(
data = dtrain,
nrounds = 3,
params = xgb.params(
nthread = 2,
max_depth = 3,
objective = "binary:logistic"
),
nfold = 5,
metrics = list("rmse", "auc")
)
print(cv)
print(cv, verbose = TRUE)
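A further hedged sketch, continuing with dtrain from the example above: enabling early stopping during cross-validation. The assumption here is that the metric computed on the held-out folds drives the stopping decision; the metric and round counts are illustrative.
cv_es <- xgb.cv(
  data = dtrain,
  nrounds = 50,
  nfold = 5,
  early_stopping_rounds = 3,   # stop if no improvement for 3 consecutive rounds
  params = xgb.params(
    nthread = 2,
    max_depth = 3,
    objective = "binary:logistic",
    eval_metric = "logloss"
  )
)
# Best iteration, per the 'early_stop' element described under 'Value':
cv_es$early_stop$best_iteration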