Python API Reference

This page gives the Python API reference of xgboost. Please also refer to Python Package Introduction for more information about the Python package.

Global Configuration

xgboost.config_context(**new_config)

Context manager for global XGBoost configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See https://xgboost.readthedocs.io/en/stable/parameter.html for the full list of parameters supported in the global configuration.

Note

All settings, not just those presently modified, will be returned to their previous values when the context manager is exited. This is not thread-safe.

New in version 1.4.0.

Parameters

new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

See also

set_config

Set global XGBoost configuration

get_config

Get current values of the global configuration

xgboost.set_config(**new_config)

Set global configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See https://xgboost.readthedocs.io/en/stable/parameter.html for the full list of parameters supported in the global configuration.

New in version 1.4.0.

Parameters

new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

xgboost.get_config()

Get current values of the global configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See https://xgboost.readthedocs.io/en/stable/parameter.html for the full list of parameters supported in the global configuration.

New in version 1.4.0.

Returns

args – A dictionary of global parameters and their values

Return type

Dict[str, Any]

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

Core Data Structure

Core XGBoost Library.

class xgboost.DMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: object

Data Matrix used in XGBoost.

DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.

Parameters
  • data (os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame/cudf.DataFrame/cupy.array/dlpack) – Data source of DMatrix. When data is of string or os.PathLike type, it represents the path to a libsvm format txt file, a csv file (by specifying the uri parameter ‘path_to_csv?format=csv’), or a binary file that xgboost can read from.

  • label (array_like) – Label of the training data.

  • weight (array_like) –

    Weight for each instance.

    Note

    For ranking task, weights are per-group.

    In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin (array_like) – Base margin used for boosting from existing model.

  • missing (float, optional) – Value in the input data which is to be treated as a missing value. If None, defaults to np.nan.

  • silent (boolean, optional) – Whether to print messages during construction.

  • feature_names (list, optional) – Set names for features.

  • feature_types (Optional[List[str]]) – Set types for features. When enable_categorical is set to True, string “c” represents categorical data type.

  • nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.

  • group (array_like) – Group size for all ranking groups.

  • qid (array_like) – Query ID for data samples, used for ranking.

  • label_lower_bound (array_like) – Lower bound for survival training.

  • label_upper_bound (array_like) – Upper bound for survival training.

  • feature_weights (array_like, optional) – Set feature weights for column sampling.

  • enable_categorical (boolean, optional) –

    New in version 1.3.0.

    Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format is required.

Return type

None

property feature_names: Optional[List[str]]

Get feature names (column labels).

Returns

feature_names

Return type

list or None

property feature_types: Optional[List[str]]

Get feature types (column types).

Returns

feature_types

Return type

list or None

get_base_margin()

Get the base margin of the DMatrix.

Returns

base_margin

Return type

array

get_float_info(field)

Get float property from the DMatrix.

Parameters

field (str) – The field name of the information

Returns

info – a numpy array of float information of the data

Return type

array

get_label()

Get the label of the DMatrix.

Returns

label

Return type

array

get_uint_info(field)

Get unsigned integer property from the DMatrix.

Parameters

field (str) – The field name of the information

Returns

info – a numpy array of unsigned integer information of the data

Return type

array

get_weight()

Get the weight of the DMatrix.

Returns

weight

Return type

array

num_col()

Get the number of columns (features) in the DMatrix.

Returns

number of columns

Return type

int

num_row()

Get the number of rows in the DMatrix.

Returns

number of rows

Return type

int

save_binary(fname, silent=True)

Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix() as input.

Parameters
  • fname (string or os.PathLike) – Name of the output buffer file.

  • silent (bool (optional; default: True)) – If set, the output is suppressed.

set_base_margin(margin)

Set base margin of booster to start from.

This can be used to specify a prediction value of an existing model to be the base_margin. Note that the raw margin is needed instead of the transformed prediction, e.g. for logistic regression, supply the value before the logistic transformation. See also example/demo.py

Parameters

margin (array like) – Prediction margin of each datapoint

set_float_info(field, data)

Set float type property into the DMatrix.

Parameters
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

set_float_info_npy2d(field, data)

Set float type property into the DMatrix for numpy 2d array input.

Parameters
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

set_group(group)

Set group size of DMatrix (used for ranking).

Parameters

group (array like) – Group size of each group

set_info(*, label=None, weight=None, base_margin=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None, feature_weights=None)

Set meta info for DMatrix. See doc string for xgboost.DMatrix.

Parameters
  • feature_names (Optional[List[str]]) –

  • feature_types (Optional[List[str]]) –

Return type

None

set_label(label)

Set label of DMatrix.

Parameters

label (array like) – The label information to be set into DMatrix

set_uint_info(field, data)

Set uint type property into the DMatrix.

Parameters
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

set_weight(weight)

Set weight of each instance.

Parameters

weight (array like) –

Weight for each data point

Note

For ranking task, weights are per-group.

In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

slice(rindex, allow_groups=False)

Slice the DMatrix and return a new DMatrix that only contains rindex.

Parameters
  • rindex (Union[List[int], numpy.ndarray]) – List of indices to be selected.

  • allow_groups (bool) – Allow slicing of a matrix with a groups attribute

Returns

A new DMatrix containing only selected indices.

Return type

DMatrix

class xgboost.DeviceQuantileDMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, max_bin=256, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: xgboost.core.DMatrix

Device memory Data Matrix used in XGBoost for training with tree_method=’gpu_hist’. Do not use this for test/validation tasks as some information may be lost in quantisation. This DMatrix is primarily designed to save memory in training from device memory inputs by avoiding intermediate storage. Set max_bin to control the number of bins during quantisation. See the doc string of xgboost.DMatrix for documentation on meta info.

You can construct DeviceQuantileDMatrix from cupy/cudf/dlpack.

New in version 1.1.0.

Parameters
  • data (os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame/cudf.DataFrame/cupy.array/dlpack) – Data source of DMatrix. When data is of string or os.PathLike type, it represents the path to a libsvm format txt file, a csv file (by specifying the uri parameter ‘path_to_csv?format=csv’), or a binary file that xgboost can read from.

  • label (array_like) – Label of the training data.

  • weight (array_like) –

    Weight for each instance.

    Note

    For ranking task, weights are per-group.

    In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin (array_like) – Base margin used for boosting from existing model.

  • missing (float, optional) – Value in the input data which is to be treated as a missing value. If None, defaults to np.nan.

  • silent (boolean, optional) – Whether to print messages during construction.

  • feature_names (list, optional) – Set names for features.

  • feature_types – Set types for features. When enable_categorical is set to True, string “c” represents categorical data type.

  • nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.

  • group (array_like) – Group size for all ranking groups.

  • qid (array_like) – Query ID for data samples, used for ranking.

  • label_lower_bound (array_like) – Lower bound for survival training.

  • label_upper_bound (array_like) – Upper bound for survival training.

  • feature_weights (array_like, optional) – Set feature weights for column sampling.

  • enable_categorical (boolean, optional) –

    New in version 1.3.0.

    Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format is required.

  • max_bin (int) – The number of bins used during quantisation.

class xgboost.Booster(params=None, cache=(), model_file=None)

Bases: object

A Booster of XGBoost.

Booster is the model of xgboost, that contains low level routines for training, prediction and evaluation.

Parameters
  • params (dict) – Parameters for boosters.

  • cache (list) – List of cache items.

  • model_file (string/os.PathLike/Booster/bytearray) – Path to the model file if it’s string or PathLike.

attr(key)

Get attribute string from the Booster.

Parameters

key (str) – The key to get attribute from.

Returns

value – The attribute value of the key, returns None if attribute do not exist.

Return type

str

attributes()

Get attributes stored in the Booster as a dictionary.

Returns

result – Returns an empty dict if there’s no attributes.

Return type

dictionary of attribute_name: attribute_value pairs of strings.

boost(dtrain, grad, hess)

Boost the booster for one iteration, with customized gradient statistics. Like xgboost.Booster.update(), this function should not be called directly by users.

Parameters
  • dtrain (DMatrix) – The training DMatrix.

  • grad (list) – The first order of gradient.

  • hess (list) – The second order of gradient.

copy()

Copy the booster object.

Returns

booster – a copied booster model

Return type

Booster

dump_model(fout, fmap='', with_stats=False, dump_format='text')

Dump model into a text or JSON file. Unlike save_model, the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.

Parameters
  • fout (string or os.PathLike) – Output file name.

  • fmap (string or os.PathLike, optional) – Name of the file containing feature map names.

  • with_stats (bool, optional) – Controls whether the split statistics are output.

  • dump_format (string, optional) – Format of model dump file. Can be ‘text’ or ‘json’.

eval(data, name='eval', iteration=0)

Evaluate the model on data.

Parameters
  • data (DMatrix) – The dmatrix storing the input.

  • name (str, optional) – The name of the dataset.

  • iteration (int, optional) – The current iteration number.

Returns

result – Evaluation result string.

Return type

str

eval_set(evals, iteration=0, feval=None)

Evaluate a set of data.

Parameters
  • evals (list of tuples (DMatrix, string)) – List of items to be evaluated.

  • iteration (int) – Current iteration.

  • feval (function) – Custom evaluation function.

Returns

result – Evaluation result string.

Return type

str

property feature_names: Optional[List[str]]

Feature names for this booster. Can be directly set by input data or by assignment.

property feature_types: Optional[List[str]]

Feature types for this booster. Can be directly set by input data or by assignment.

get_dump(fmap='', with_stats=False, dump_format='text')

Returns the model dump as a list of strings. Unlike save_model, the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.

Parameters
  • fmap (string or os.PathLike, optional) – Name of the file containing feature map names.

  • with_stats (bool, optional) – Controls whether the split statistics are output.

  • dump_format (string, optional) – Format of model dump. Can be ‘text’, ‘json’ or ‘dot’.

get_fscore(fmap='')

Get feature importance of each feature.

Note

Zero-importance features will not be included

Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.

Parameters

fmap (str or os.PathLike (optional)) – The name of feature map file

get_score(fmap='', importance_type='weight')

Get feature importance of each feature. For tree model Importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Note

For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

Note

Zero-importance features will not be included

Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.

Parameters
  • fmap (str or os.PathLike (optional)) – The name of feature map file.

  • importance_type (str, default 'weight') – One of the importance types defined above.

Returns

  • A map between feature names and their scores. When gblinear is used for

  • multi-class classification the scores for each feature is a list with length

  • n_classes, otherwise they’re scalars.

Return type

Dict[str, Union[float, List[float]]]

get_split_value_histogram(feature, fmap='', bins=None, as_pandas=True)

Get split value histogram of a feature

Parameters
  • feature (str) – The name of the feature.

  • fmap (str or os.PathLike (optional)) – The name of feature map file.

  • bins (int, default None) – The maximum number of bins. Number of bins equals number of unique split values n_unique, if bins == None or bins > n_unique.

  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.

Returns

  • a histogram of used splitting values for the specified feature

  • either as numpy array or pandas DataFrame.

Return type

Union[numpy.ndarray, pandas.DataFrame]

inplace_predict(data, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)

Run prediction in-place. Unlike the predict() method, inplace prediction does not cache the prediction result.

Calling only inplace_predict in multiple threads is safe and lock free. But the safety does not hold when used in conjunction with other methods. E.g. you can’t train the booster in one thread and perform prediction in the other.

booster.set_param({'predictor': 'gpu_predictor'})
booster.inplace_predict(cupy_array)

booster.set_param({'predictor': 'cpu_predictor'})
booster.inplace_predict(numpy_array)

New in version 1.1.0.

Parameters
  • data (numpy.ndarray/scipy.sparse.csr_matrix/cupy.ndarray/cudf.DataFrame/pd.DataFrame) – The input data; must not be a view for numpy array. Set predictor to gpu_predictor for running prediction on CuPy array or CuDF DataFrame.

  • iteration_range (Tuple[int, int]) – See xgboost.Booster.predict() for details.

  • predict_type (str) –

    • value Output model prediction values.

    • margin Output the raw untransformed margin value.

  • missing (float) – See xgboost.DMatrix for details.

  • validate_features (bool) – See xgboost.Booster.predict() for details.

  • base_margin (Optional[Any]) –

    See xgboost.DMatrix for details.

    New in version 1.4.0.

  • strict_shape (bool) –

    See xgboost.Booster.predict() for details.

    New in version 1.4.0.

Returns

prediction – The prediction result. When input data is on GPU, prediction result is stored in a cupy array.

Return type

numpy.ndarray/cupy.ndarray

load_config(config)

Load configuration returned by save_config.

New in version 1.0.0.

load_model(fname)

Load the model from a file or bytearray. Path to file can be local or as an URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

num_boosted_rounds()

Get number of boosted rounds. For gblinear this is reset to 0 after serializing the model.

Return type

int

num_features()

Number of features in booster.

Return type

int

predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, training=False, iteration_range=(0, 0), strict_shape=False)

Predict with data. The full model will be used unless iteration_range is specified, meaning the user has to either slice the model or use the best_iteration attribute to get predictions from the best model returned by early stopping.

Note

See Prediction for issues like thread safety and a summary of outputs from this function.

Parameters
  • data (xgboost.core.DMatrix) – The dmatrix storing the input.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.

  • pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.

  • approx_contribs (bool) – Approximate the contributions of each feature. Used when pred_contribs or pred_interactions is set to True. Changing the default of this parameter (False) is not recommended.

  • pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • training (bool) –

    Whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations but uses all trees for inference. If you want the result with dropouts applied, set this parameter to True. The parameter is also set to True when obtaining predictions for a custom objective function.

    New in version 1.0.0.

  • iteration_range (Tuple[int, int]) –

    Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

    New in version 1.4.0.

  • strict_shape (bool) –

    When set to True, output shape is invariant to whether classification is used. For both value and margin prediction, the output shape is (n_samples, n_groups), n_groups == 1 when multi-class is not used. Default to False, in which case the output shape can be (n_samples, ) if multi-class is not used.

    New in version 1.4.0.

Returns

prediction

Return type

numpy array

save_config()

Output internal parameter configuration of Booster as a JSON string.

New in version 1.0.0.

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

save_raw()

Save the model to an in-memory buffer representation instead of a file.

Returns

Return type

An in-memory buffer representation of the model

set_attr(**kwargs)

Set the attribute of the Booster.

Parameters
  • **kwargs – The attributes to set. Setting a value to None deletes an attribute.

  • kwargs (Optional[str]) –

Return type

None

set_param(params, value=None)

Set parameters into the Booster.

Parameters
  • params (dict/list/str) – list of (key, value) pairs, dict of key to value, or simply a str key

  • value (optional) – value of the specified parameter, when params is str key

trees_to_dataframe(fmap='')

Parse a boosted tree model text dump into a pandas DataFrame structure.

This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).

Parameters

fmap (str or os.PathLike (optional)) – The name of feature map file.

update(dtrain, iteration, fobj=None)

Update for one iteration, with objective function calculated internally. This function should not be called directly by users.

Parameters
  • dtrain (DMatrix) – Training data.

  • iteration (int) – Current iteration number.

  • fobj (function) – Customized objective function.

Learning API

Training Library containing training routines.

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=None, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None)

Train a booster with given parameters.

Parameters
  • params (dict) – Booster params.

  • dtrain (DMatrix) – Data to be trained.

  • num_boost_round (int) – Number of boosting iterations.

  • evals (list of pairs (DMatrix, string)) – List of validation sets for which metrics will be evaluated during training. Validation metrics will help us track the performance of the model.

  • obj (function) – Customized objective function.

  • feval (function) – Customized evaluation function.

  • maximize (bool) – Whether to maximize feval.

  • early_stopping_rounds (int) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. The method returns the model from the last iteration (not the best one). Use custom callback or model slicing if the best model is desired. If there’s more than one item in evals, the last entry will be used for early stopping. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping. If early stopping occurs, the model will have two additional fields: bst.best_score, bst.best_iteration.

  • evals_result (dict) –

    This dictionary stores the evaluation results of all the items in watchlist.

    Example: with a watchlist containing [(dtest,'eval'), (dtrain,'train')] and a parameter containing {'eval_metric': 'logloss'}, the evals_result returns

    {'train': {'logloss': ['0.48253', '0.35953']},
     'eval': {'logloss': ['0.480385', '0.357756']}}
    

  • verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.

  • xgb_model (file name of stored xgb model or 'Booster' instance) – Xgb model to be loaded before training (allows training continuation).

  • callbacks (list of callback functions) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API. Example:

    [xgb.callback.LearningRateScheduler(custom_rates)]
    

Returns

Booster

Return type

a trained booster model

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=None, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)

Cross-validation with given parameters.

Parameters
  • params (dict) – Booster params.

  • dtrain (DMatrix) – Data to be trained.

  • num_boost_round (int) – Number of boosting iterations.

  • nfold (int) – Number of folds in CV.

  • stratified (bool) – Perform stratified sampling.

  • folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. Alternatively may explicitly pass sample indices for each fold. For n folds, folds should be a length n list of tuples. Each tuple is (in,out) where in is a list of indices to be used as the training samples for the n th fold and out is a list of indices to be used as the testing samples for the n th fold.

  • metrics (string or list of strings) – Evaluation metrics to be watched in CV.

  • obj (function) – Custom objective function.

  • feval (function) – Custom evaluation function.

  • maximize (bool) – Whether to maximize feval.

  • early_stopping_rounds (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.

  • fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.

  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray

  • verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.

  • show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.

  • seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

  • callbacks (list of callback functions) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks from the Callback API. Example:

    [xgb.callback.LearningRateScheduler(custom_rates)]
    

  • shuffle (bool) – Shuffle data before creating folds.

Returns

evaluation history

Return type

list(string)

Scikit-Learn API

Scikit-Learn Wrapper interface for XGBoost.

class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)

Bases: xgboost.sklearn.XGBModel, object

Implementation of the scikit-learn API for XGBoost regression.

Parameters
  • n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used with DataFrame input.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point

Return type

None

apply(X, ntree_limit=0, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (int) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

property feature_importances_: numpy.ndarray

Feature importances property, return depends on importance_type parameter.

Returns

  • feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Any) – Feature matrix

  • y (Any) – Labels

  • sample_weight (Optional[Any]) – instance weights

  • base_margin (Optional[Any]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Any, Any]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The callable custom evaluation metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have additional attributes: clf.best_score and clf.best_iteration.

  • verbose (Optional[bool]) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel, str]]) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Any]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Any]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing the base margin for the i-th validation set.

  • feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks from the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

xgboost.sklearn.XGBModel

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit has not been called.

Returns

booster

Return type

an xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The file path can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU (e.g. a cupy array or cuDF dataframe) and predictor is not specified, the prediction is run on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Any) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) uses only the forests built during rounds [10, 20) (half-open set) in the prediction.

    New in version 1.4.0.

Returns

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.XGBClassifier(*, objective='binary:logistic', use_label_encoder=True, **kwargs)

Bases: xgboost.sklearn.XGBModel, object

Implementation of the scikit-learn API for XGBoost classification.

Parameters
  • n_estimators (int) – Number of boosting rounds.

  • use_label_encoder (bool) – (Deprecated) Use the label encoder from scikit-learn to encode the labels. For new code, we recommend that you set this parameter to False.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used with DataFrame input.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point

Return type

None

apply(X, ntree_limit=0, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (int) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBClassifier(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

property feature_importances_: numpy.ndarray

Feature importances property, return depends on importance_type parameter.

Returns

  • feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Any) – Feature matrix

  • y (Any) – Labels

  • sample_weight (Optional[Any]) – instance weights

  • base_margin (Optional[Any]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Any, Any]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The callable custom evaluation metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have additional attributes: clf.best_score and clf.best_iteration.

  • verbose (Optional[bool]) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, str, xgboost.sklearn.XGBModel]]) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Any]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Any]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing the base margin for the i-th validation set.

  • feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks from the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

xgboost.sklearn.XGBClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit has not been called.

Returns

booster

Return type

an xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The file path can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU (e.g. a cupy array or cuDF dataframe) and predictor is not specified, the prediction is run on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Any) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) uses only the forests built during rounds [10, 20) (half-open set) in the prediction.

    New in version 1.4.0.

Returns

Return type

prediction

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (array_like) – Feature matrix.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (array_like) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) uses only the forests built during rounds [10, 20) (half-open set) in the prediction.

Returns

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.XGBRanker(*, objective='rank:pairwise', **kwargs)

Bases: xgboost.sklearn.XGBModel, xgboost.sklearn.XGBRankerMixIn

Implementation of the Scikit-Learn API for XGBoost Ranking.

Parameters
  • n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use a specific predictor; available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input data is a dataframe.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    A custom objective function is currently not supported by XGBRanker. Likewise, a custom metric function is not supported either.

    Note

    Query group information is required for ranking tasks by either using the group parameter or qid parameter in fit method.

    Before fitting the model, your data need to be sorted by query group. When fitting the model, you need to provide an additional array that contains the size of each query group.

    For example, if your original data look like:

        qid   label   features
        1     0       x_1
        1     1       x_2
        1     0       x_3
        2     0       x_4
        2     1       x_5
        2     1       x_6
        2     1       x_7

    then your group array should be [3, 4]. Sometimes using query id (qid) instead of group can be more convenient.

apply(X, ntree_limit=0, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (int) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the return value depends on the importance_type parameter.

Returns

feature_importances_ – an array of shape [n_features], except for the multi-class linear model, which returns an array of shape (n_features, n_classes)

fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting ranker

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Any) – Feature matrix

  • y (Any) – Labels

  • group (Optional[Any]) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then the user must provide qid.

  • qid (Optional[Any]) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then the user must provide group.

  • sample_weight (Optional[Any]) –

    Query group weights

    Note

    Weights are per-group for ranking tasks

    In a ranking task, one weight is assigned to each query group/id (not to each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin (Optional[Any]) – Global bias for each instance.

  • eval_set (Optional[List[Tuple[Any, Any]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_group (Optional[List[Any]]) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.

  • eval_qid (Optional[List[Any]]) – A list in which eval_qid[i] is the array containing query ID of i-th pair in eval_set.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) – If a str, should be a built-in evaluation metric to use. See doc/parameter.rst. If a list of str, should be the list of multiple built-in evaluation metrics to use. The custom evaluation metric is not yet supported for the ranker.

  • early_stopping_rounds (Optional[int]) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set. The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping. If early stopping occurs, the model will have three additional fields: clf.best_score, clf.best_iteration and clf.best_ntree_limit.

  • verbose (Optional[bool]) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, str, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Any]]) –

    A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.

    Note

    Weights are per-group for ranking tasks

    In a ranking task, one weight is assigned to each query group (not to each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin_eval_set (Optional[List[Any]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.

  • feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks via the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

xgboost.sklearn.XGBRanker

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit has not been called.

Returns

booster

Return type

an xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU (such as a cupy array or cuDF dataframe) and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Any) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.XGBRFRegressor(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: xgboost.sklearn.XGBRegressor

scikit-learn API for XGBoost random forest regression.

Parameters
  • n_estimators (int) – Number of trees in random forest to fit.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use a specific predictor; available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input data is a dataframe.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples] – The target values.

    y_pred: array_like of shape [n_samples] – The predicted values.

    grad: array_like of shape [n_samples] – The value of the gradient for each sample point.

    hess: array_like of shape [n_samples] – The value of the second derivative for each sample point.

Return type

None
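The documented signature can be illustrated with a squared-error objective (a sketch for illustration, not the library’s built-in implementation; the function name is ours):

```python
import numpy as np

def squared_error(y_true, y_pred):
    """objective(y_true, y_pred) -> grad, hess for the loss 0.5 * (pred - true)**2."""
    grad = y_pred - y_true        # first derivative w.r.t. the prediction
    hess = np.ones_like(y_pred)   # second derivative is constant
    return grad, hess
```

An estimator would then be constructed as, e.g., XGBRFRegressor(objective=squared_error).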

apply(X, ntree_limit=0, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (int) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the return value depends on the importance_type parameter.

Returns

feature_importances_ – an array of shape [n_features], except for the multi-class linear model, which returns an array of shape (n_features, n_classes)

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Any) – Feature matrix

  • y (Any) – Labels

  • sample_weight (Optional[Any]) – instance weights

  • base_margin (Optional[Any]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Any, Any]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object, so you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The callable custom metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: clf.best_score and clf.best_iteration.

  • verbose (Optional[bool]) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, str, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Any]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Any]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.

  • feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks via the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

xgboost.sklearn.XGBRFRegressor
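The callable form of eval_metric described above can be sketched as follows; the metric name "mae" and the function name are illustrative. Note that the second argument arrives as a DMatrix, so the labels come from get_label:

```python
import numpy as np

def mean_abs_error(y_predicted, dtrain):
    # dtrain is a DMatrix; pull the raw label array out of it.
    y_true = dtrain.get_label()
    return "mae", float(np.mean(np.abs(y_predicted - y_true)))
```

Passed as fit(..., eval_metric=mean_abs_error), the value is recorded under "mae" in evals_result() and, with early stopping, is minimized.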

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit has not been called.

Returns

booster

Return type

an xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU (such as a cupy array or cuDF dataframe) and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Any) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.XGBRFClassifier(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, use_label_encoder=True, **kwargs)

Bases: xgboost.sklearn.XGBClassifier

scikit-learn API for XGBoost random forest classification.

Parameters
  • n_estimators (int) – Number of trees in random forest to fit.

  • use_label_encoder (bool) – (Deprecated) Use the label encoder from scikit-learn to encode the labels. For new code, we recommend that you set this parameter to False.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use a specific predictor; available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used with DataFrame input.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point
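As a minimal sketch of this signature (assuming NumPy, and using squared error purely as an illustrative loss; the function name is hypothetical):

```python
import numpy as np

# Illustrative custom objective with the documented
# objective(y_true, y_pred) -> grad, hess signature.
# For the squared-error loss 0.5 * (y_pred - y_true)**2, the first
# derivative w.r.t. y_pred is (y_pred - y_true) and the second is 1.
def squared_error(y_true, y_pred):
    grad = y_pred - y_true        # gradient for each sample point
    hess = np.ones_like(y_pred)   # second derivative for each sample point
    return grad, hess
```

It could then be passed as objective=squared_error to the estimator constructor.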

apply(X, ntree_limit=0, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (int) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]
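The documented numbering bound can be sketched as a one-liner (the helper name is hypothetical):

```python
# Leaves returned by apply() are numbered within [0, 2**(max_depth + 1)),
# possibly with gaps, so this gives the exclusive upper bound on leaf indices.
def leaf_index_bound(max_depth):
    return 2 ** (max_depth + 1)
```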

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBClassifier(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

property feature_importances_: numpy.ndarray

Feature importances property, return depends on importance_type parameter.

Returns

  • feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters
  • X (Any) – Feature matrix

  • y (Any) – Labels

  • sample_weight (Optional[Any]) – instance weights

  • base_margin (Optional[Any]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Any, Any]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The callable custom metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional attributes: clf.best_score and clf.best_iteration.

  • verbose (Optional[bool]) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, str, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Any]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Any]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing the base margin for the i-th validation set.

  • feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

xgboost.sklearn.XGBRFClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit has not been called.

Returns

booster

Return type

an xgboost booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw).

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU (e.g. a cupy array or cuDF dataframe) and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Any) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open set) are used in this prediction.

    New in version 1.4.0.

Returns

prediction

Return type

numpy.ndarray
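The half-open convention for iteration_range matches Python's built-in range(); a quick sketch:

```python
# iteration_range=(10, 20) on a 100-round model selects boosting rounds
# 10 through 19, exactly like Python's built-in range().
used_rounds = list(range(10, 20))
```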

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (array_like) – Feature matrix.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (array_like) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open set) are used in this prediction.

Returns

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

Plotting API

Plotting Library.

xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', fmap='', importance_type='weight', max_num_features=None, grid=True, show_values=True, **kwargs)

Plot importance based on fitted trees.

Parameters
  • booster (Booster, XGBModel or dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()

  • ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.

  • grid (bool, default True) – Turn the axes grid on or off.

  • importance_type (str, default "weight") –

    How the importance is calculated: either “weight”, “gain”, or “cover”

    • ”weight” is the number of times a feature appears in a tree

    • ”gain” is the average gain of splits which use the feature

    • ”cover” is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split

  • max_num_features (int, default None) – Maximum number of top features displayed on plot. If None, all features will be displayed.

  • height (float, default 0.2) – Bar height, passed to ax.barh()

  • xlim (tuple, default None) – Tuple passed to axes.xlim()

  • ylim (tuple, default None) – Tuple passed to axes.ylim()

  • title (str, default "Feature importance") – Axes title. To disable, pass None.

  • xlabel (str, default "F score") – X axis title label. To disable, pass None.

  • ylabel (str, default "Features") – Y axis title label. To disable, pass None.

  • fmap (str or os.PathLike (optional)) – The name of feature map file.

  • show_values (bool, default True) – Show values on plot. To disable, pass False.

  • kwargs – Other keywords passed to ax.barh()

Returns

ax

Return type

matplotlib Axes
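Since booster may also be a dict as returned by Booster.get_fscore(), the top-feature selection performed by max_num_features can be mimicked in plain Python (the scores below are illustrative, not from a real model):

```python
# Illustrative fscore-style dict; in real use it would come from
# booster.get_fscore().
scores = {"f0": 12, "f1": 5, "f2": 30}

# plot_importance(scores, max_num_features=2) would plot the two
# highest-scoring features; the same selection in plain Python:
max_num_features = 2
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:max_num_features]
```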

xgboost.plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs)

Plot specified tree.

Parameters
  • booster (Booster, XGBModel) – Booster or XGBModel instance

  • fmap (str (optional)) – The name of feature map file

  • num_trees (int, default 0) – Specify the ordinal number of target tree

  • rankdir (str, default "TB") – Passed to graphviz via graph_attr

  • ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.

  • kwargs – Other keywords passed to to_graphviz

Returns

ax

Return type

matplotlib Axes

xgboost.to_graphviz(booster, fmap='', num_trees=0, rankdir=None, yes_color=None, no_color=None, condition_node_params=None, leaf_node_params=None, **kwargs)

Convert specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance.

Parameters
  • booster (Booster, XGBModel) – Booster or XGBModel instance

  • fmap (str (optional)) – The name of feature map file

  • num_trees (int, default 0) – Specify the ordinal number of target tree

  • rankdir (str, default "UT") – Passed to graphviz via graph_attr

  • yes_color (str, default '#0000FF') – Edge color when the node condition is met.

  • no_color (str, default '#FF0000') – Edge color when the node condition is not met.

  • condition_node_params (dict, optional) –

    Condition node configuration for graphviz. Example:

    {'shape': 'box',
     'style': 'filled,rounded',
     'fillcolor': '#78bceb'}
    

  • leaf_node_params (dict, optional) –

    Leaf node configuration for graphviz. Example:

    {'shape': 'box',
     'style': 'filled',
     'fillcolor': '#e48038'}
    

  • **kwargs (dict, optional) – Other keywords passed to graphviz graph_attr, e.g. graph [ {key} = {value} ]

Returns

graph

Return type

graphviz.Source

Callback API

xgboost.callback.TrainingCallback()

Interface for training callback.

New in version 1.3.0.

xgboost.callback.EvaluationMonitor(rank=0, period=1, show_stdv=False)

Print the evaluation result at each iteration.

New in version 1.3.0.

Parameters
  • metric (callable) – Extra user defined metric.

  • rank (int) – Which worker should be used for printing the result.

  • period (int) – How many epochs between printing.

  • show_stdv (bool) – Used in cv to show standard deviation. Users should not specify it.

Return type

None

xgboost.callback.EarlyStopping(rounds, metric_name=None, data_name=None, maximize=None, save_best=False, min_delta=0.0)

Callback function for early stopping.

New in version 1.3.0.

Parameters
  • rounds (int) – Early stopping rounds.

  • metric_name (Optional[str]) – Name of metric that is used for early stopping.

  • data_name (Optional[str]) – Name of dataset that is used for early stopping.

  • maximize (Optional[bool]) – Whether to maximize evaluation metric. None means auto (discouraged).

  • save_best (Optional[bool]) – Whether training should return the best model or the last model.

  • min_delta (float) –

    Minimum absolute change in score to be qualified as an improvement.

    New in version 1.5.0.

    from sklearn.datasets import load_digits
    import xgboost

    clf = xgboost.XGBClassifier(tree_method="gpu_hist")
    es = xgboost.callback.EarlyStopping(
        rounds=2,
        min_delta=1e-3,
        save_best=True,
        maximize=False,
        data_name="validation_0",
        metric_name="mlogloss",
    )

    X, y = load_digits(return_X_y=True)
    clf.fit(X, y, eval_set=[(X, y)], callbacks=[es])
    

Return type

None

xgboost.callback.LearningRateScheduler(learning_rates)

Callback function for scheduling learning rate.

New in version 1.3.0.

Parameters

learning_rates (callable/collections.Sequence) – If it’s a callable object, it should accept an integer parameter epoch and return the corresponding learning rate. Otherwise it should be a sequence, such as a list or tuple, with the same length as the number of boosting rounds.

Return type

None
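A minimal sketch of the callable form (the decay function and its constants are purely illustrative, not a recommendation):

```python
# A callable schedule: takes the epoch (boosting round) index and
# returns the learning rate to use for that round.
def eta_decay(epoch):
    return 0.3 * (0.95 ** epoch)

# Usage (not executed here):
#   scheduler = xgboost.callback.LearningRateScheduler(eta_decay)
#   bst = xgboost.train(params, dtrain, callbacks=[scheduler])
```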

xgboost.callback.TrainingCheckPoint(directory, name='model', as_pickle=False, iterations=100)

Checkpointing operation.

New in version 1.3.0.

Parameters
  • directory (os.PathLike) – Output model directory.

  • name (str) – Pattern of the output model file. Models will be saved as name_0.json, name_1.json, name_2.json ….

  • as_pickle (boolean) – When set to True, all training parameters will be saved in pickle format, instead of saving only the model.

  • iterations (int) – Interval of checkpointing. Checkpointing is slow, so setting a larger interval can reduce the performance hit.

Dask API

Dask extensions for distributed training. See https://xgboost.readthedocs.io/en/latest/tutorials/dask.html for simple tutorial. Also xgboost/demo/dask for some examples.

There are two sets of APIs in this module: one is the functional API, including the train and predict methods; the other is a stateful Scikit-Learn wrapper inherited from the single-node Scikit-Learn interface.

The implementation is heavily influenced by dask_xgboost: https://github.com/dask/dask-xgboost

class xgboost.dask.DaskDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: object

DMatrix holding references to a Dask DataFrame or Dask Array. Constructing a DaskDMatrix forces all lazy computation to be carried out. Wait for the input data explicitly if you want to see the actual computation of constructing a DaskDMatrix.

See the doc for the xgboost.DMatrix constructor for other parameters. DaskDMatrix accepts only dask collections.

Note

DaskDMatrix does not repartition or move data between workers. It’s the caller’s responsibility to balance the data.

New in version 1.0.0.

Parameters
  • client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.

  • data (Union[da.Array, dd.DataFrame, dd.Series]) –

  • label (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • missing (float) –

  • silent (bool) –

  • feature_names (Optional[Union[List[str], str]]) –

  • feature_types (Optional[Union[Any, List[Any]]]) –

  • group (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • qid (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • label_lower_bound (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • label_upper_bound (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • enable_categorical (bool) –

Return type

None

class xgboost.dask.DaskDeviceQuantileDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, max_bin=256, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

Bases: xgboost.dask.DaskDMatrix

Specialized data type for the gpu_hist tree method. This class is used to reduce memory usage by eliminating data copies. Internally, all partitions/chunks of data are merged by weighted GK sketching, so the number of partitions from dask may affect training accuracy, as GK generates a bounded error for each merge. See the doc strings for xgboost.DeviceQuantileDMatrix and xgboost.DMatrix for other parameters.

New in version 1.2.0.

Parameters
  • max_bin (int, default 256) – Number of bins for histogram construction.

  • client (distributed.Client) –

  • data (Union[da.Array, dd.DataFrame, dd.Series]) –

  • label (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • missing (float) –

  • silent (bool) –

  • feature_names (Optional[Union[List[str], str]]) –

  • feature_types (Optional[Union[Any, List[Any]]]) –

  • group (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • qid (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • label_lower_bound (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • label_upper_bound (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

  • enable_categorical (bool) –

Return type

None

xgboost.dask.train(client, params, dtrain, num_boost_round=10, evals=None, obj=None, feval=None, early_stopping_rounds=None, xgb_model=None, verbose_eval=True, callbacks=None)

Train XGBoost model.

New in version 1.0.0.

Note

Other parameters are the same as xgboost.train() except for evals_result, which is returned as part of function return value instead of argument.

Parameters
Returns

results – A dictionary containing trained booster and evaluation history. history field is the same as eval_result from xgboost.train.

{'booster': xgboost.Booster,
 'history': {'train': {'logloss': ['0.48253', '0.35953']},
             'eval': {'logloss': ['0.480385', '0.357756']}}}

Return type

dict
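A sketch of unpacking this return value; the results dict below mirrors the documented example rather than coming from an actual training run:

```python
# Shape of the dict returned by xgboost.dask.train (values copied from
# the documented example; 'booster' is an xgboost.Booster in real use).
results = {
    "booster": None,
    "history": {
        "train": {"logloss": ["0.48253", "0.35953"]},
        "eval": {"logloss": ["0.480385", "0.357756"]},
    },
}

booster = results["booster"]
# Final evaluation metric from the last boosting round:
final_eval_logloss = float(results["history"]["eval"]["logloss"][-1])
```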

xgboost.dask.predict(client, model, data, output_margin=False, missing=nan, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, iteration_range=(0, 0), strict_shape=False)

Run prediction with a trained booster.

Note

Using inplace_predict might be faster when some features are not needed. See xgboost.Booster.predict() for details on various parameters. When output has more than 2 dimensions (shap value, leaf with strict_shape), input should be da.Array or DaskDMatrix.

New in version 1.0.0.

Parameters
  • client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.

  • model (Union[Dict[str, Any], xgboost.core.Booster, distributed.Future]) – The trained model. It can be a distributed.Future so user can pre-scatter it onto all workers.

  • data (Union[xgboost.dask.DaskDMatrix, da.Array, dd.DataFrame, dd.Series]) – Input data used for prediction. When input is a dataframe object, prediction output is a series.

  • missing (float) – Used when input data is not DaskDMatrix. Specify the value considered as missing.

  • output_margin (bool) –

  • pred_leaf (bool) –

  • pred_contribs (bool) –

  • approx_contribs (bool) –

  • pred_interactions (bool) –

  • validate_features (bool) –

  • iteration_range (Tuple[int, int]) –

  • strict_shape (bool) –

Returns

prediction – When the input data is dask.array.Array or DaskDMatrix, the return value is an array; when the input data is dask.dataframe.DataFrame, the return value can be dask.dataframe.Series or dask.dataframe.DataFrame, depending on the output shape.

Return type

dask.array.Array/dask.dataframe.Series

xgboost.dask.inplace_predict(client, model, data, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)

Inplace prediction. See doc in xgboost.Booster.inplace_predict() for details.

New in version 1.1.0.

Parameters
  • client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.

  • model (Union[Dict[str, Any], xgboost.core.Booster, distributed.Future]) – See xgboost.dask.predict() for details.

  • data (Union[da.Array, dd.DataFrame, dd.Series]) – dask collection.

  • iteration_range (Tuple[int, int]) – See xgboost.Booster.predict() for details.

  • predict_type (str) – See xgboost.Booster.inplace_predict() for details.

  • missing (float) – Value in the input data which needs to be present as a missing value. If None, defaults to np.nan.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

    See xgboost.DMatrix for details. Right now the classifier is not well supported with base_margin, as it requires the size of the base margin to be n_classes * n_samples.

    New in version 1.4.0.

  • strict_shape (bool) –

    See xgboost.Booster.predict() for details.

    New in version 1.4.0.

  • validate_features (bool) –

Returns

When the input data is dask.array.Array, the return value is an array; when the input data is dask.dataframe.DataFrame, the return value can be dask.dataframe.Series or dask.dataframe.DataFrame, depending on the output shape.

Return type

prediction

class xgboost.dask.DaskXGBClassifier(max_depth=None, learning_rate=None, n_estimators=100, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, gpu_id=None, validate_parameters=None, predictor=None, enable_categorical=False, **kwargs)

Bases: xgboost.dask.DaskScikitLearnBase, object

Implementation of the scikit-learn API for XGBoost classification.

Parameters
  • n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic, as it uses the Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data to be treated as missing.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used with DataFrame input.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Return type

None

apply(X, ntree_limit=None, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property client: distributed.Client

The dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed evaluation sets. When eval_metric is also passed to fit(), evals_result will contain the metrics supplied there.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the return value depends on the importance_type parameter.

Returns

  • feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Feature matrix

  • y (Union[da.Array, dd.DataFrame, dd.Series]) – Labels

  • sample_weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – instance weights

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Union[da.Array, dd.DataFrame, dd.Series], Union[da.Array, dd.DataFrame, dd.Series]]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The custom evaluation metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: clf.best_score and clf.best_iteration.

  • verbose (bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is written to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing the base margin for the i-th validation set.

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

DaskXGBClassifier
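The feature_weights argument above makes column sampling proportional to each feature's weight. The exact sampling scheme is internal to XGBoost; this pure-Python sketch only illustrates the proportional interpretation:

```python
# Illustration only: feature_weights defines the probability of each
# feature being selected under colsample. All weights must be > 0,
# mirroring the ValueError described above.
def selection_probabilities(feature_weights):
    if any(w <= 0 for w in feature_weights):
        raise ValueError("all feature weights must be greater than 0")
    total = sum(feature_weights)
    return [w / total for w in feature_weights]

probs = selection_probabilities([1.0, 1.0, 2.0])
print(probs)  # -> [0.25, 0.25, 0.5]
```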

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() was not called.

Returns

booster

Return type

the XGBoost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be a local path or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU, such as a cupy array or a cuDF dataframe, and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

Return type

prediction
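The half-open iteration_range described above behaves exactly like Python slicing; a small pure-Python illustration (the round counts are hypothetical):

```python
# iteration_range=(10, 20) selects the boosting rounds built during
# [10, 20), i.e. rounds 10 through 19 inclusive -- the same half-open
# semantics as a Python slice.
n_rounds = 100
rounds = list(range(n_rounds))
iteration_range = (10, 20)
selected = rounds[iteration_range[0]:iteration_range[1]]
assert selected[0] == 10 and selected[-1] == 19
print(len(selected))  # -> 10
```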

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (array_like) – Feature matrix.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (array_like) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

Returns

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.dask.DaskXGBRegressor(max_depth=None, learning_rate=None, n_estimators=100, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, gpu_id=None, validate_parameters=None, predictor=None, enable_categorical=False, **kwargs)

Bases: xgboost.dask.DaskScikitLearnBase, object

Implementation of the Scikit-Learn API for XGBoost.

Parameters
  • n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which is to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree models, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear models, only “weight” is defined, and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input is a DataFrame.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Return type

None
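For the linear booster, importance_type="weight" is described above as the normalized coefficients without bias. A rough pure-Python sketch of that normalization follows; the exact normalization XGBoost applies internally is an assumption here, and the coefficients are hypothetical:

```python
# Hedged illustration for booster=gblinear: importances as coefficients
# normalized to sum to 1, bias excluded. The precise scheme is internal
# to XGBoost; this only conveys "normalized coefficients without bias".
coefs = [2.0, 1.0, 1.0]   # hypothetical gblinear coefficients, no bias term
total = sum(abs(c) for c in coefs)
importances = [abs(c) / total for c in coefs]
print(importances)  # -> [0.5, 0.25, 0.25]
```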

apply(X, ntree_limit=None, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property client: distributed.Client

The Dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set on that worker.

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed evaluation sets. When eval_metric is also passed to fit(), evals_result will contain the metrics supplied there.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the return value depends on the importance_type parameter.

Returns

  • feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Feature matrix

  • y (Union[da.Array, dd.DataFrame, dd.Series]) – Labels

  • sample_weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – instance weights

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Union[da.Array, dd.DataFrame, dd.Series], Union[da.Array, dd.DataFrame, dd.Series]]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a (str, value) pair where the str is a name for the evaluation and value is the value of the evaluation function. The custom evaluation metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: clf.best_score and clf.best_iteration.

  • verbose (bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is written to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing the base margin for the i-th validation set.

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

DaskXGBRegressor
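The early_stopping_rounds rule described above can be sketched in plain Python. The per-round validation losses below are hypothetical, and the metric is assumed to be minimized:

```python
# Hedged sketch of early_stopping_rounds: training stops once the
# validation metric fails to improve for `early_stopping_rounds`
# consecutive rounds; the best iteration is tracked separately.
def stopping_round(scores, early_stopping_rounds):
    best, best_round = float("inf"), -1
    for i, score in enumerate(scores):
        if score < best:
            best, best_round = score, i
        elif i - best_round >= early_stopping_rounds:
            return i  # stop here; the best iteration was best_round
    return len(scores) - 1

# Hypothetical per-round validation losses: improvement stalls after round 2.
scores = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]
print(stopping_round(scores, early_stopping_rounds=3))  # -> 5
```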

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() was not called.

Returns

booster

Return type

the XGBoost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be a local path or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU, such as a cupy array or a cuDF dataframe, and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

Return type

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.dask.DaskXGBRanker(*, objective='rank:pairwise', **kwargs)

Bases: xgboost.dask.DaskScikitLearnBase, xgboost.sklearn.XGBRankerMixIn

Implementation of the Scikit-Learn API for XGBoost Ranking.

New in version 1.4.0.

Parameters
  • n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which is to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree models, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear models, only “weight” is defined, and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input is a DataFrame.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    For dask implementation, group is not supported, use qid instead.
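Since the dask implementation takes qid rather than group, it helps to see how the two encodings relate. A small pure-Python conversion for sorted data (function name and the group sizes are illustrative):

```python
# group gives the size of each query group; qid gives one query id per row.
# For data sorted by query, the two encodings carry the same information.
def group_to_qid(group_sizes):
    qid = []
    for query_id, size in enumerate(group_sizes):
        qid.extend([query_id] * size)
    return qid

# A hypothetical ranking set with two queries of 3 and 2 documents:
print(group_to_qid([3, 2]))  # -> [0, 0, 0, 1, 1]
```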

apply(X, ntree_limit=None, iteration_range=None)

Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property client: distributed.Client

The Dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set on that worker.

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed evaluation sets. When eval_metric is also passed to fit(), evals_result will contain the metrics supplied there.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the return value depends on the importance_type parameter.

Returns

  • feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))

fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting ranker.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Feature matrix

  • y (Union[da.Array, dd.DataFrame, dd.Series]) – Labels

  • group (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then the user must provide qid.

  • qid (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then the user must provide group.

  • sample_weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) –

    Query group weights

    Note

    Weights are per-group for ranking tasks

    In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Global bias for each instance.

  • eval_set (Optional[List[Tuple[Union[da.Array, dd.DataFrame, dd.Series], Union[da.Array, dd.DataFrame, dd.Series]]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_group (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.

  • eval_qid (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list in which eval_qid[i] is the array containing query ID of i-th pair in eval_set.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) – If a str, should be a built-in evaluation metric to use. See doc/parameter.rst. If a list of str, should be the list of multiple built-in evaluation metrics to use. The custom evaluation metric is not yet supported for the ranker.

  • early_stopping_rounds (int) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set. The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping. If early stopping occurs, the model will have three additional fields: clf.best_score, clf.best_iteration and clf.best_ntree_limit.

  • verbose (bool) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance; the model is loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) –

    A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.

    Note

    Weights are per-group for ranking tasks

    In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

DaskXGBRanker
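Since group and qid are interchangeable encodings of the same query structure, one can be derived from the other. A small stdlib-only sketch (the helper name qid_to_group is hypothetical, not part of the XGBoost API; it assumes samples of each query are contiguous, as the ranker requires):

```python
from itertools import groupby

def qid_to_group(qid):
    """Collapse a contiguous per-sample qid sequence into per-query group sizes."""
    return [sum(1 for _ in members) for _, members in groupby(qid)]

# Six samples belonging to three queries:
qid = [0, 0, 0, 1, 1, 2]
print(qid_to_group(qid))  # [3, 2, 1]
```

Passing group=[3, 2, 1] or qid=[0, 0, 0, 1, 1, 2] to fit() describes the same query grouping.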

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() has not been called.

Returns

booster

Return type

the xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU (such as a cupy array or a cuDF dataframe) and predictor is not specified, the prediction runs on GPU automatically; otherwise it runs on CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

prediction
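The half-open convention of iteration_range means the upper bound is excluded. A quick stdlib-only illustration of which boosting rounds are selected:

```python
# iteration_range=(10, 20) selects boosting rounds 10 .. 19 (the upper
# bound is excluded), i.e. 10 rounds in total.
iteration_range = (10, 20)
used_rounds = list(range(*iteration_range))
print(len(used_rounds), used_rounds[0], used_rounds[-1])  # 10 10 19
```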

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.dask.DaskXGBRFRegressor(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: xgboost.dask.DaskXGBRegressor

Implementation of the Scikit-Learn API for XGBoost Random Forest Regressor.

New in version 1.4.0.

Parameters
  • n_estimators (int) – Number of trees in random forest to fit.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameter.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input is a dataframe.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point

Return type

None
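The custom objective signature described in the note above can be illustrated with squared error, whose gradient is y_pred - y_true and whose second derivative is 1 for every sample (a sketch; numpy is assumed, and the function name squared_error is illustrative):

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Custom objective for the per-sample loss 0.5 * (y_pred - y_true)**2."""
    grad = y_pred - y_true           # first derivative w.r.t. y_pred
    hess = np.ones_like(y_pred)      # second derivative is constant 1
    return grad, hess

grad, hess = squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.0]))
print(grad.tolist(), hess.tolist())  # [0.5, -1.0] [1.0, 1.0]
```

An objective like this would be passed as objective=squared_error to the constructor; the booster then evaluates it against the current predictions each round.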

apply(X, ntree_limit=None, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]
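The bound on leaf indices quoted above follows from complete-binary-tree numbering: a tree of depth max_depth has node indices below 2**(max_depth+1) - 1, so every leaf index is strictly less than 2**(max_depth+1). A quick check of the bound:

```python
max_depth = 3
leaf_index_bound = 2 ** (max_depth + 1)
print(leaf_index_bound)  # 16: every leaf of a depth-3 tree has index < 16
```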

property client: distributed.Client

The dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the shape of the returned array depends on the importance_type parameter.

Returns

feature_importances_

Return type

array of shape [n_features], except for multi-class linear models, which return an array of shape (n_features, n_classes)

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Feature matrix

  • y (Union[da.Array, dd.DataFrame, dd.Series]) – Labels

  • sample_weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – instance weights

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Union[da.Array, dd.DataFrame, dd.Series], Union[da.Array, dd.DataFrame, dd.Series]]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object, so you may need to call its get_label method. It must return a (str, value) pair where str is a name for the evaluation and value is the value of the evaluation function. The callable custom metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: clf.best_score and clf.best_iteration.

  • verbose (bool) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel]]) – File name of a stored XGBoost model or a Booster instance; the model is loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

DaskXGBRFRegressor
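The early-stopping rule above (the validation metric must improve at least once in every early_stopping_rounds rounds) can be sketched in plain Python. stopping_round is a hypothetical helper, not part of the XGBoost API, and assumes lower metric values are better:

```python
def stopping_round(scores, early_stopping_rounds):
    """Return the round at which training would stop, given per-round metrics."""
    best, best_round = float("inf"), 0
    for i, s in enumerate(scores):
        if s < best:
            best, best_round = s, i            # metric improved
        elif i - best_round >= early_stopping_rounds:
            return i                           # no improvement for too long: stop
    return len(scores) - 1                     # never triggered: run all rounds

scores = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
print(stopping_round(scores, early_stopping_rounds=3))  # 5
```

Note that, as documented above, the fitted model still contains the trees from every round up to the stopping point, not just the best one.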

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() has not been called.

Returns

booster

Return type

the xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The path to the file can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU (such as a cupy array or a cuDF dataframe) and predictor is not specified, the prediction runs on GPU automatically; otherwise it runs on CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –

class xgboost.dask.DaskXGBRFClassifier(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: xgboost.dask.DaskXGBClassifier

Implementation of the Scikit-Learn API for XGBoost Random Forest Classifier.

New in version 1.4.0.

Parameters
  • n_estimators (int) – Number of trees in random forest to fit.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document: https://xgboost.readthedocs.io/en/latest/treemethod.html.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data which needs to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameter.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Experimental support for categorical data. Do not set to true unless you are interested in development. Only valid when the gpu_hist tree method is used and the input is a dataframe.

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point

Return type

None
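The constructor signature above shows how the random-forest wrappers override the usual boosting defaults so that a single boosting round grows a bagged ensemble of num_parallel_tree trees. The dict below is illustrative only (values copied from the signature), not an API object:

```python
rf_overrides = {
    "learning_rate": 1,        # no shrinkage: trees contribute at full weight
    "subsample": 0.8,          # each tree sees a random 80% of the rows
    "colsample_bynode": 0.8,   # each split considers a random 80% of the columns
    "reg_lambda": 1e-05,       # near-zero L2 regularization
}
print(rf_overrides["learning_rate"])  # 1
```

Row and column subsampling provide the randomization that distinguishes a random forest from plain boosting, while learning_rate=1 keeps each tree's contribution unscaled.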

apply(X, ntree_limit=None, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.

  • iteration_range (Optional[Tuple[int, int]]) – See xgboost.XGBRegressor.predict().

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

Returns

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type

array_like, shape=[n_samples, n_trees]

property client: distributed.Client

The dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.

property coef_: numpy.ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

coef_

Return type

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns

evals_result

Return type

dictionary

Example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}

clf = xgb.XGBModel(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)

evals_result = clf.evals_result()

The variable evals_result will contain:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
property feature_importances_: numpy.ndarray

Feature importances property; the shape of the returned array depends on the importance_type parameter.

Returns

feature_importances_

Return type

array of shape [n_features], except for multi-class linear models, which return an array of shape (n_features, n_classes)

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Feature matrix

  • y (Union[da.Array, dd.DataFrame, dd.Series]) – Labels

  • sample_weight (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – instance weights

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – global bias for each instance.

  • eval_set (Optional[List[Tuple[Union[da.Array, dd.DataFrame, dd.Series], Union[da.Array, dd.DataFrame, dd.Series]]]]) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (Optional[Union[str, List[str], Callable[[numpy.ndarray, xgboost.core.DMatrix], Tuple[str, float]]]]) –

    If a str, should be a built-in evaluation metric to use. See doc/parameter.rst.

    If a list of str, should be the list of multiple built-in evaluation metrics to use.

    If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object, so you may need to call its get_label method. It must return a (str, value) pair where str is a name for the evaluation and value is the value of the evaluation function. The callable custom metric is always minimized.

  • early_stopping_rounds (Optional[int]) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set.

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping.

    If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: clf.best_score and clf.best_iteration.

  • verbose (bool) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.

  • xgb_model (Optional[Union[xgboost.core.Booster, xgboost.sklearn.XGBModel]]) – file name of a stored XGBoost model or a Booster instance; the model is loaded before training (allows training continuation).

  • sample_weight_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Optional[List[Union[da.Array, dd.DataFrame, dd.Series]]]) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown. Only available for hist, gpu_hist and exact tree methods.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API. Example:

    callbacks = [xgb.callback.EarlyStopping(rounds=early_stopping_rounds,
                                            save_best=True)]
    

Return type

DaskXGBRFClassifier
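The custom eval_metric call signature and the early-stopping rule described above can be sketched in plain Python. The FakeDMatrix class, mean_abs_error metric, and early_stop helper below are hypothetical stand-ins for illustration only; they are not part of the xgboost API:

```python
from typing import List, Tuple


class FakeDMatrix:
    """Stand-in for xgboost.DMatrix, used only to illustrate the
    call signature of a custom eval_metric."""

    def __init__(self, labels: List[float]) -> None:
        self._labels = labels

    def get_label(self) -> List[float]:
        return self._labels


def mean_abs_error(y_predicted: List[float], y_true: FakeDMatrix) -> Tuple[str, float]:
    # A custom metric receives raw predictions and a DMatrix-like object,
    # so the labels must be pulled out with get_label().
    labels = y_true.get_label()
    value = sum(abs(p - t) for p, t in zip(y_predicted, labels)) / len(labels)
    return "mae", value  # (name, value) pair; the metric is minimized


def early_stop(metric_history: List[float], rounds: int) -> int:
    """Return the round at which training would stop: the metric must
    improve (decrease) at least once in every `rounds` rounds."""
    best, best_round = float("inf"), 0
    for i, m in enumerate(metric_history):
        if m < best:
            best, best_round = m, i
        elif i - best_round >= rounds:
            return i  # no improvement for `rounds` rounds: stop here
    return len(metric_history) - 1
```

Note that, as stated above, the fitted model is the one from the last iteration, not the best one.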

get_booster()

Get the underlying xgboost Booster of this model.

This raises an exception if fit() has not been called.

Returns

booster

Return type

an xgboost Booster of the underlying model

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type

int

get_params(deep=True)

Get parameters.

Parameters

deep (bool) –

Return type

Dict[str, Any]

get_xgb_params()

Get xgboost specific parameters.

Return type

Dict[str, Any]

property intercept_: numpy.ndarray

Intercept (bias) property

Note

Intercept is defined only for linear learners

Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns

intercept_

Return type

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or bytearray. The file path can be local or a URI.

The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (Union[str, bytearray, os.PathLike]) – Input file name or memory buffer (see also save_raw)

Return type

None

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when the data is on GPU (e.g. a cupy array or cuDF dataframe) and predictor is not specified, the prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (Union[da.Array, dd.DataFrame, dd.Series]) – Data to predict with.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (Optional[int]) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Union[da.Array, dd.DataFrame, dd.Series]]) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

    New in version 1.4.0.

Returns

Return type

prediction
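The half-open iteration_range semantics can be illustrated with a small sketch. trees_used below is a hypothetical helper, not an xgboost function; it assumes one forest is built per boosting round, and that (0, 0) means "use all rounds":

```python
def trees_used(iteration_range, total_rounds=100):
    """Return the boosting rounds whose trees contribute to a
    prediction for a given half-open iteration_range."""
    start, end = iteration_range
    if end == 0:  # (0, 0) conventionally selects every round
        end = total_rounds
    return [r for r in range(total_rounds) if start <= r < end]


# iteration_range=(10, 20) selects rounds 10 through 19 inclusive;
# round 20 itself is excluded because the interval is half open.
selected = trees_used((10, 20))
```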

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters
  • X (array_like) – Feature matrix.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (array_like) – Margin added to prediction.

  • iteration_range (Optional[Tuple[int, int]]) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

Returns

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type

prediction
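For a multi-class classifier, predict_proba transforms raw margin scores into per-class probabilities; with a softmax objective that transformation is the softmax function. A minimal stdlib-only sketch of that relationship (not xgboost internals verbatim):

```python
import math


def softmax(margins):
    """Map one row of raw margin scores to class probabilities.
    Scores are shifted by the row maximum for numerical stability."""
    exps = [math.exp(m - max(margins)) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# One row of predict_proba: probabilities sum to 1, and the argmax
# is the class that predict() would return for this example.
probs = softmax([0.5, 2.0, -1.0])
```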

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON instead. See: Model IO for more info.

Parameters

fname (string or os.PathLike) – Output file name

Return type

None

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Returns

Return type

self

Parameters

params (Any) –