Python API Reference<a class="headerlink" href="#python-api-reference" title="Permalink to this heading">

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (XGBRegressor) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (XGBRegressor) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (XGBRegressor) –

Returns:

self – The updated object.

Return type:

class xgboost.XGBClassifier(*, objective='binary:logistic', **kwargs)

Bases: XGBModel, ClassifierMixin

Implementation of the scikit-learn API for XGBoost classification. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (Any) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

XGBClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

predict_proba(X, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

Returns:

a numpy array of shape array-like of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (XGBClassifier) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_proba_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict_proba.
self (XGBClassifier) –

Returns:

self – The updated object.

Return type:

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (XGBClassifier) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (XGBClassifier) –

Returns:

self – The updated object.

Return type:

class xgboost.XGBRanker(*, objective='rank:ndcg', **kwargs)

Bases: XGBModel, XGBRankerMixIn

Implementation of the Scikit-Learn API for XGBoost Ranking.

See Learning to Rank for an introducion.

See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

A custom objective function is currently not supported by XGBRanker.

Note

Query group information is only required for ranking training but not prediction. Multiple groups can be predicted on a single call to predict().

When fitting the model with the group parameter, your data need to be sorted by the query group first. group is an array that contains the size of each query group.

Similarly, when fitting the model with the qid parameter, the data should be sorted according to query index and qid is an array that contains the query index for each training sample.

For example, if your original data look like:

qid

label

features

1

0

x_1

1

1

x_2

1

0

x_3

2

0

x_4

2

1

x_5

2

1

x_6

2

1

x_7

then fit() method can be called with either group array as [3, 4] or with qid as [1, 1, 1, 2, 2, 2, 2], that is the qid column. Also, the qid can be a special column of input X instead of a separated parameter, see fit() for more info.

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting ranker

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (Any) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When this is a pandas.DataFrame or a cudf.DataFrame, it may contain a special column called qid for specifying the query index. Using a special column is the same as using the qid parameter, except for being compatible with sklearn utility functions like sklearn.model_selection.cross_validation(). The same convention applies to the XGBRanker.score() and XGBRanker.predict().

qid

feat_0

feat_1

0

$x_{00}$

$x_{01}$

1

$x_{10}$

$x_{11}$

1

$x_{20}$

$x_{21}$

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
group (Any | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then user must provide qid.
qid (Any | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then user must provide group or a special column in X.
sample_weight (Any | None) –
Query group weights

Note

Weights are per-group for ranking tasks

In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (Any | None) – Global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_group (Sequence[Any] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.
eval_qid (Sequence[Any] | None) – A list in which eval_qid[i] is the array containing query ID of i-th pair in eval_set. The special column convention in X applies to validation datasets as well.
eval_metric (str, list of str, optional) –

Deprecated since version 1.6.0: use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0: use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) –
A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.

Note

Weights are per-group for ranking tasks

In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

XGBRanker

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y)

Evaluate score for data using the last evaluation metric. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (Union[pd.DataFrame, cudf.DataFrame]) – Feature matrix. A DataFrame with a special qid column.
y (Any) – Labels

Returns:

The result of the first evaluation metric for the ranker.

Return type:

score

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_group='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_qid='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', group='$UNCHANGED$', qid='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_group (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_group parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_qid (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_qid parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
group (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for group parameter in fit.
qid (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for qid parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (XGBRanker) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (XGBRanker) –

Returns:

self – The updated object.

Return type:

class xgboost.XGBRFRegressor(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: XGBRegressor

scikit-learn API for XGBoost random forest regression. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (Any) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

XGBRFRegressor

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination $R^2$ is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares ((y_true - y_pred)** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – $R^2$ of self.predict(X) w.r.t. y.

Return type:

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (XGBRFRegressor) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (XGBRFRegressor) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (XGBRFRegressor) –

Returns:

self – The updated object.

Return type:

class xgboost.XGBRFClassifier(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: XGBClassifier

scikit-learn API for XGBoost random forest classification. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (Any) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

XGBRFClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

predict_proba(X, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (Any) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

Returns:

a numpy array of shape array-like of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (XGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_proba_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict_proba.
self (XGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (XGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (XGBRFClassifier) –

Returns:

self – The updated object.

Return type:

Plotting API

Plotting Library.

xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', fmap='', importance_type='weight', max_num_features=None, grid=True, show_values=True, values_format='{v}', **kwargs)

Plot importance based on fitted trees.

Parameters:

booster (XGBModel | Booster | dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()
ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
grid (bool) – Turn the axes grids on or off. Default is True (On).
importance_type (str) –
How the importance is calculated: either “weight”, “gain”, or “cover”
- ”weight” is the number of times a feature appears in a tree
- ”gain” is the average gain of splits which use the feature
- ”cover” is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split
max_num_features (int | None) – Maximum number of top features displayed on plot. If None, all features will be displayed.
height (float) – Bar height, passed to ax.barh()
xlim (tuple | None) – Tuple passed to axes.xlim()
ylim (tuple | None) – Tuple passed to axes.ylim()
title (str) – Axes title. To disable, pass None.
xlabel (str) – X axis title label. To disable, pass None.
ylabel (str) – Y axis title label. To disable, pass None.
fmap (str | PathLike) – The name of feature map file.
show_values (bool) – Show values on plot. To disable, pass False.
values_format (str) – Format string for values. “v” will be replaced by the value of the feature importance. e.g. Pass “{v:.2f}” in order to limit the number of digits after the decimal point to two, for each value printed on the graph.
kwargs (Any) – Other keywords passed to ax.barh()

Returns:

ax

Return type:

matplotlib Axes

xgboost.plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs)

Plot specified tree.

Parameters:

booster (Booster, XGBModel) – Booster or XGBModel instance
fmap (str (optional)) – The name of feature map file
num_trees (int, default 0) – Specify the ordinal number of target tree
rankdir (str, default "TB") – Passed to graphviz via graph_attr
ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
kwargs (Any) – Other keywords passed to to_graphviz

Returns:

ax

Return type:

matplotlib Axes

xgboost.to_graphviz(booster, fmap='', num_trees=0, rankdir=None, yes_color=None, no_color=None, condition_node_params=None, leaf_node_params=None, **kwargs)

Convert specified tree to graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call .render() method of the returned graphviz instance.

Parameters:

booster (Booster | XGBModel) – Booster or XGBModel instance
fmap (str | PathLike) – The name of feature map file
num_trees (int) – Specify the ordinal number of target tree
rankdir (str | None) – Passed to graphviz via graph_attr
yes_color (str | None) – Edge color when meets the node condition.
no_color (str | None) – Edge color when doesn’t meet the node condition.
condition_node_params (dict | None) –
Condition node configuration for for graphviz. Example:
```
{'shape': 'box',
 'style': 'filled,rounded',
 'fillcolor': '#78bceb'}
```
leaf_node_params (dict | None) –
Leaf node configuration for graphviz. Example:
```
{'shape': 'box',
 'style': 'filled',
 'fillcolor': '#e48038'}
```
kwargs (Any) – Other keywords passed to graphviz graph_attr, e.g. graph [ {key} = {value} ]

Returns:

graph

Return type:

graphviz.Source

Callback API

Callback library containing training routines. See Callback Functions for a quick introduction.

class xgboost.callback.TrainingCallback

Interface for training callback.

New in version 1.3.0.

after_iteration(model, epoch, evals_log)

Run after each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

after_training(model)

Run after training is finished.

Parameters:: model (Any) –
Return type:: Any

before_iteration(model, epoch, evals_log)

Run before each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

before_training(model)

Run before training starts.

Parameters:: model (Any) –
Return type:: Any

class xgboost.callback.EvaluationMonitor(rank=0, period=1, show_stdv=False)

Print the evaluation result at each iteration.

New in version 1.3.0.

Parameters:

rank (int) – Which worker should be used for printing the result.
period (int) – How many epoches between printing.
show_stdv (bool) – Used in cv to show standard deviation. Users should not specify it.

after_iteration(model, epoch, evals_log)

Run after each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

after_training(model)

Run after training is finished.

Parameters:: model (Any) –
Return type:: Any

before_iteration(model, epoch, evals_log)

Run before each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

before_training(model)

Run before training starts.

Parameters:: model (Any) –
Return type:: Any

class xgboost.callback.EarlyStopping(rounds, metric_name=None, data_name=None, maximize=None, save_best=False, min_delta=0.0)

Callback function for early stopping

New in version 1.3.0.

Parameters:

rounds (int) – Early stopping rounds.
metric_name (str | None) – Name of metric that is used for early stopping.
data_name (str | None) – Name of dataset that is used for early stopping.
maximize (bool | None) – Whether to maximize evaluation metric. None means auto (discouraged).
save_best (bool | None) – Whether training should return the best model or the last model.
min_delta (float) –

New in version 1.5.0.

Minimum absolute change in score to be qualified as an improvement.

Examples

es = xgboost.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="mlogloss",
)
clf = xgboost.XGBClassifier(tree_method="hist", device="cuda", callbacks=[es])

X, y = load_digits(return_X_y=True)
clf.fit(X, y, eval_set=[(X, y)])

after_iteration(model, epoch, evals_log)

Run after each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

after_training(model)

Run after training is finished.

Parameters:: model (Any) –
Return type:: Any

before_iteration(model, epoch, evals_log)

Run before each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

before_training(model)

Run before training starts.

Parameters:: model (Any) –
Return type:: Any

class xgboost.callback.LearningRateScheduler(learning_rates)

Callback function for scheduling learning rate.

New in version 1.3.0.

Parameters:: learning_rates (Callable[[int], float] | Sequence[float]) – If it’s a callable object, then it should accept an integer parameter epoch and returns the corresponding learning rate. Otherwise it should be a sequence like list or tuple with the same size of boosting rounds.

after_iteration(model, epoch, evals_log)

Run after each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

after_training(model)

Run after training is finished.

Parameters:: model (Any) –
Return type:: Any

before_iteration(model, epoch, evals_log)

Run before each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

before_training(model)

Run before training starts.

Parameters:: model (Any) –
Return type:: Any

class xgboost.callback.TrainingCheckPoint(directory, name='model', as_pickle=False, iterations=100)

Checkpointing operation.

New in version 1.3.0.

Parameters:

directory (str | PathLike) – Output model directory.
name (str) – pattern of output model file. Models will be saved as name_0.json, name_1.json, name_2.json ….
as_pickle (bool) – When set to True, all training parameters will be saved in pickle format, instead of saving only the model.
iterations (int) – Interval of checkpointing. Checkpointing is slow so setting a larger number can reduce performance hit.

after_iteration(model, epoch, evals_log)

Run after each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

after_training(model)

Run after training is finished.

Parameters:: model (Any) –
Return type:: Any

before_iteration(model, epoch, evals_log)

Run before each iteration. Return True when training should stop.

Parameters:

model (Any) –
epoch (int) –
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –

Return type:

before_training(model)

Run before training starts.

Parameters:: model (Any) –
Return type:: Any

Dask API

Dask extensions for distributed training

See Distributed XGBoost with Dask for simple tutorial. Also XGBoost Dask Feature Walkthrough for some examples.

There are two sets of APIs in this module, one is the functional API including train and predict methods. Another is stateful Scikit-Learner wrapper inherited from single-node Scikit-Learn interface.

The implementation is heavily influenced by dask_xgboost: https://github.com/dask/dask-xgboost

Optional dask configuration

xgboost.scheduler_address: Specify the scheduler address, see Troubleshooting.

New in version 1.6.0.

dask.config.set({"xgboost.scheduler_address": "192.0.0.100"})
# We can also specify the port.
dask.config.set({"xgboost.scheduler_address": "192.0.0.100:12345"})

class xgboost.dask.DaskDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: object

DMatrix holding on references to Dask DataFrame or Dask Array. Constructing a DaskDMatrix forces all lazy computation to be carried out. Wait for the input data explicitly if you want to see actual computation of constructing DaskDMatrix.

See doc for xgboost.DMatrix constructor for other parameters. DaskDMatrix accepts only dask collection.

Note

DaskDMatrix does not repartition or move data between workers. It’s the caller’s responsibility to balance the data.

New in version 1.0.0.

Parameters:

client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
data (da.Array | dd.DataFrame) –
label (da.Array | dd.DataFrame | dd.Series | None) –
weight (da.Array | dd.DataFrame | dd.Series | None) –
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
missing (float | None) –
silent (bool) –
feature_names (Sequence[str] | None) –
feature_types (Sequence[str] | None) –
group (da.Array | dd.DataFrame | dd.Series | None) –
qid (da.Array | dd.DataFrame | dd.Series | None) –
label_lower_bound (da.Array | dd.DataFrame | dd.Series | None) –
label_upper_bound (da.Array | dd.DataFrame | dd.Series | None) –
feature_weights (da.Array | dd.DataFrame | dd.Series | None) –
enable_categorical (bool) –

num_col()

Get the number of columns (features) in the DMatrix.

Return type:: number of columns

class xgboost.dask.DaskQuantileDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, max_bin=None, ref=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: DaskDMatrix

A dask version of QuantileDMatrix.

Parameters:

client (distributed.Client) –
data (da.Array | dd.DataFrame) –
label (da.Array | dd.DataFrame | dd.Series | None) –
weight (da.Array | dd.DataFrame | dd.Series | None) –
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
missing (float | None) –
silent (bool) –
feature_names (Sequence[str] | None) –
feature_types (Any | List[Any] | None) –
max_bin (int | None) –
ref (DMatrix | None) –
group (da.Array | dd.DataFrame | dd.Series | None) –
qid (da.Array | dd.DataFrame | dd.Series | None) –
label_lower_bound (da.Array | dd.DataFrame | dd.Series | None) –
label_upper_bound (da.Array | dd.DataFrame | dd.Series | None) –
feature_weights (da.Array | dd.DataFrame | dd.Series | None) –
enable_categorical (bool) –

num_col()

Get the number of columns (features) in the DMatrix.

Return type:: number of columns

xgboost.dask.train(client, params, dtrain, num_boost_round=10, *, evals=None, obj=None, feval=None, early_stopping_rounds=None, xgb_model=None, verbose_eval=True, callbacks=None, custom_metric=None)

Train XGBoost model.

New in version 1.0.0.

Note

Other parameters are the same as xgboost.train() except for evals_result, which is returned as part of function return value instead of argument.

Parameters:

client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
params (Dict[str, Any]) –
dtrain (DaskDMatrix) –
num_boost_round (int) –
evals (Sequence[Tuple[DaskDMatrix, str]] | None) –
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) –
feval (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
early_stopping_rounds (int | None) –
xgb_model (Booster | None) –
verbose_eval (int | bool) –
callbacks (Sequence[TrainingCallback] | None) –
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –

Returns:

results – A dictionary containing trained booster and evaluation history. history field is the same as eval_result from xgboost.train.

{'booster': xgboost.Booster,
 'history': {'train': {'logloss': ['0.48253', '0.35953']},
             'eval': {'logloss': ['0.480385', '0.357756']}}}

Return type:

dict

xgboost.dask.predict(client, model, data, output_margin=False, missing=nan, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, iteration_range=(0, 0), strict_shape=False)

Run prediction with a trained booster.

Note

Using inplace_predict might be faster when some features are not needed. See xgboost.Booster.predict() for details on various parameters. When output has more than 2 dimensions (shap value, leaf with strict_shape), input should be da.Array or DaskDMatrix.

New in version 1.0.0.

Parameters:

client (distributed.Client | None) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
model (TrainReturnT | Booster | distributed.Future) – The trained model. It can be a distributed.Future so user can pre-scatter it onto all workers.
data (DaskDMatrix | da.Array | dd.DataFrame) – Input data used for prediction. When input is a dataframe object, prediction output is a series.
missing (float) – Used when input data is not DaskDMatrix. Specify the value considered as missing.
output_margin (bool) –
pred_leaf (bool) –
pred_contribs (bool) –
approx_contribs (bool) –
pred_interactions (bool) –
validate_features (bool) –
iteration_range (Tuple[int, int]) –
strict_shape (bool) –

Returns:

prediction – When input data is dask.array.Array or DaskDMatrix, the return value is an array, when input data is dask.dataframe.DataFrame, return value can be dask.dataframe.Series, dask.dataframe.DataFrame, depending on the output shape.

Return type:

dask.array.Array/dask.dataframe.Series

xgboost.dask.inplace_predict(client, model, data, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)

Inplace prediction. See doc in xgboost.Booster.inplace_predict() for details.

New in version 1.1.0.

Parameters:

client (distributed.Client | None) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
model (TrainReturnT | Booster | distributed.Future) – See xgboost.dask.predict() for details.
data (da.Array | dd.DataFrame) – dask collection.
iteration_range (Tuple[int, int]) – See xgboost.Booster.predict() for details.
predict_type (str) – See xgboost.Booster.inplace_predict() for details.
missing (float) – Value in the input data which needs to be present as a missing value. If None, defaults to np.nan.
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
See xgboost.DMatrix for details.

New in version 1.4.0.
strict_shape (bool) –
See xgboost.Booster.predict() for details.

New in version 1.4.0.
validate_features (bool) –

Returns:

When input data is dask.array.Array, the return value is an array, when input data is dask.dataframe.DataFrame, return value can be dask.dataframe.Series, dask.dataframe.DataFrame, depending on the output shape.

Return type:

prediction

class xgboost.dask.DaskXGBClassifier(max_depth=None, max_leaves=None, max_bin=None, grow_policy=None, learning_rate=None, n_estimators=None, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, sampling_method=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, device=None, validate_parameters=None, enable_categorical=False, feature_types=None, max_cat_to_onehot=None, max_cat_threshold=None, multi_strategy=None, eval_metric=None, early_stopping_rounds=None, callbacks=None, **kwargs)

Bases: DaskScikitLearnBase, ClassifierMixin

Implementation of the scikit-learn API for XGBoost classification. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (int | None) – Maximum number of leaves; 0 indicates no limit.
max_bin (int | None) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (str | None) – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (str | None) –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property client: distributed.Client: The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (da.Array | dd.DataFrame) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

DaskXGBClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

predict_proba(X, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame | dd.Series) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

Returns:

a numpy array of shape array-like of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (DaskXGBClassifier) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_proba_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict_proba.
self (DaskXGBClassifier) –

Returns:

self – The updated object.

Return type:

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (DaskXGBClassifier) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (DaskXGBClassifier) –

Returns:

self – The updated object.

Return type:

class xgboost.dask.DaskXGBRegressor(max_depth=None, max_leaves=None, max_bin=None, grow_policy=None, learning_rate=None, n_estimators=None, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, sampling_method=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, device=None, validate_parameters=None, enable_categorical=False, feature_types=None, max_cat_to_onehot=None, max_cat_threshold=None, multi_strategy=None, eval_metric=None, early_stopping_rounds=None, callbacks=None, **kwargs)

Bases: DaskScikitLearnBase, RegressorMixin

Implementation of the Scikit-Learn API for XGBoost. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (int | None) – Maximum number of leaves; 0 indicates no limit.
max_bin (int | None) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (str | None) – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (str | None) –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property client: distributed.Client: The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (da.Array | dd.DataFrame) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

DaskXGBRegressor

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination $R^2$ is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares ((y_true - y_pred)** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – $R^2$ of self.predict(X) w.r.t. y.

Return type:

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (DaskXGBRegressor) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (DaskXGBRegressor) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (DaskXGBRegressor) –

Returns:

self – The updated object.

Return type:

class xgboost.dask.DaskXGBRanker(*, objective='rank:pairwise', **kwargs)

Bases: DaskScikitLearnBase, XGBRankerMixIn

Implementation of the Scikit-Learn API for XGBoost Ranking.

New in version 1.4.0.

See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

For dask implementation, group is not supported, use qid instead.

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property client: distributed.Client: The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting ranker

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (da.Array | dd.DataFrame) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When this is a pandas.DataFrame or a cudf.DataFrame, it may contain a special column called qid for specifying the query index. Using a special column is the same as using the qid parameter, except for being compatible with sklearn utility functions like sklearn.model_selection.cross_validation(). The same convention applies to the XGBRanker.score() and XGBRanker.predict().

qid

feat_0

feat_1

0

$x_{00}$

$x_{01}$

1

$x_{10}$

$x_{11}$

1

$x_{20}$

$x_{21}$

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (da.Array | dd.DataFrame | dd.Series) – Labels
group (da.Array | dd.DataFrame | dd.Series | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then user must provide qid.
qid (da.Array | dd.DataFrame | dd.Series | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then user must provide group or a special column in X.
sample_weight (da.Array | dd.DataFrame | dd.Series | None) –
Query group weights

Note

Weights are per-group for ranking tasks

In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_group (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.
eval_qid (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list in which eval_qid[i] is the array containing query ID of i-th pair in eval_set. The special column convention in X applies to validation datasets as well.
eval_metric (str, list of str, optional) –

Deprecated since version 1.6.0: use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0: use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) –
A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.

Note

Weights are per-group for ranking tasks

In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

DaskXGBRanker

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_group='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_qid='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', group='$UNCHANGED$', qid='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_group (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_group parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_qid (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_qid parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
group (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for group parameter in fit.
qid (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for qid parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (DaskXGBRanker) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (DaskXGBRanker) –

Returns:

self – The updated object.

Return type:

class xgboost.dask.DaskXGBRFRegressor(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: DaskXGBRegressor

Implementation of the Scikit-Learn API for XGBoost Random Forest Regressor.

New in version 1.4.0.

See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property client: distributed.Client: The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (da.Array | dd.DataFrame) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

DaskXGBRFRegressor

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination $R^2$ is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares ((y_true - y_pred)** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – $R^2$ of self.predict(X) w.r.t. y.

Return type:

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (DaskXGBRFRegressor) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (DaskXGBRFRegressor) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (DaskXGBRFRegressor) –

Returns:

self – The updated object.

Return type:

class xgboost.dask.DaskXGBRFClassifier(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: DaskXGBClassifier

Implementation of the Scikit-Learn API for XGBoost Random Forest Classifier.

New in version 1.4.0.

See Using the Scikit-Learn Estimator Interface for more information.

Parameters:

n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of hist tree method.
- uniform: select random training instances uniformly.
- gradient_based select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –

New in version 2.0.0.

Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –

New in version 2.0.0.

Note

This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
- one_output_per_tree: One model for each target.
- multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.
- Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
- If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
- If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
- If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point

apply(X, iteration_range=None)

Return the predicted leaf every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
iteration_range (Tuple[int, int] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int: The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.

property best_score: float: The best score obtained by early stopping.

property client: distributed.Client: The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: coef_
Return type:: array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:: evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.

Returns:

feature_importances_ (array of shape [n_features] except for multi-class)
linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray: Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X (da.Array | dd.DataFrame) –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0.

Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0.

Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

Return type:

DaskXGBRFClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called

Returns:: booster
Return type:: a xgboost booster of underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:: int

get_params(deep=True)

Get parameters.

Parameters:: deep (bool) –
Return type:: Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:: Dict[str, Any]

property intercept_: ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:: intercept_
Return type:: array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.load_model("model.json")
# or
model.load_model("model.ubj")

Parameters:: fname (str | bytearray | PathLike) – Input file name or memory buffer(see also save_raw)
Return type:: None

property n_features_in_: int: Number of features seen during fit().

predict(X, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

New in version 1.4.0.

Return type:

prediction

predict_proba(X, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (da.Array | dd.DataFrame | dd.Series) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

Returns:

a numpy array of shape array-like of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:: fname (str | PathLike) – Output file name
Return type:: None

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', callbacks='$UNCHANGED$', early_stopping_rounds='$UNCHANGED$', eval_metric='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (DaskXGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:: self
Parameters:: params (Any) –

set_predict_proba_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict_proba.
self (DaskXGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (DaskXGBRFClassifier) –

Returns:

self – The updated object.

Return type:

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (DaskXGBRFClassifier) –

Returns:

self – The updated object.

Return type: