Python API Reference
This page gives the Python API reference of xgboost. Please also refer to the Python Package Introduction for more information about the Python package.
Global Configuration
- xgboost.config_context(**new_config)
Context manager for global XGBoost configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
Note
All settings, not just those presently modified, will be returned to their previous values when the context manager is exited. This is not thread-safe.
New in version 1.4.0.
- Parameters:
new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values
- Return type:
Iterator[None]
Example
import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored
Nested configuration context is also supported:
Example
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
See also
set_config
Set global XGBoost configuration
get_config
Get current values of the global configuration
- xgboost.set_config(**new_config)
Set global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
New in version 1.4.0.
- Parameters:
new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values
- Return type:
None
Example
import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored
Nested configuration context is also supported:
Example
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
- xgboost.get_config()
Get current values of the global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
New in version 1.4.0.
- Returns:
args – The list of global parameters and their values
- Return type:
Dict[str, Any]
Example
import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored
Nested configuration context is also supported:
Example
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
Core Data Structure
Core XGBoost Library.
- class xgboost.DMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)
Bases:
object
Data Matrix used in XGBoost.
DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed. You can construct a DMatrix from multiple different sources of data.
- Parameters:
data (os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame/cudf.DataFrame/cupy.array/dlpack/arrow.Table) –
Data source of DMatrix.
When data is a string or os.PathLike type, it represents the path to a libsvm format txt file, a csv file (by specifying the uri parameter ‘path_to_csv?format=csv’), or a binary file that xgboost can read from.
label (array_like) – Label of the training data.
weight (array_like) –
Weight for each instance.
Note
For ranking task, weights are per-group.
In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (array_like) – Base margin used for boosting from existing model.
missing (float, optional) – Value in the input data which is to be treated as missing. If None, defaults to np.nan.
silent (boolean, optional) – Whether to print messages during construction.
feature_names (list, optional) – Set names for features.
feature_types (FeatureTypes) – Set types for features. When enable_categorical is set to True, string “c” represents categorical data type while “q” represents numerical feature type. For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.
nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.
group (array_like) – Group size for all ranking groups.
qid (array_like) – Query ID for data samples, used for ranking.
label_lower_bound (array_like) – Lower bound for survival training.
label_upper_bound (array_like) – Upper bound for survival training.
feature_weights (array_like, optional) – Set feature weights for column sampling.
enable_categorical (boolean, optional) –
New in version 1.3.0.
Note
This parameter is experimental
Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Also, JSON/UBJSON serialization format is required.
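Example
A minimal sketch of constructing a DMatrix from in-memory numpy data; the arrays, weights, and feature names below are synthetic and only for illustration:

import numpy as np
import xgboost as xgb

# Dense numpy input with labels and per-instance weights (synthetic data)
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)
w = np.ones(100)

dtrain = xgb.DMatrix(X, label=y, weight=w, feature_names=["f0", "f1", "f2"])
assert dtrain.feature_names == ["f0", "f1", "f2"]
assert len(dtrain.get_label()) == 100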
- property feature_names: Sequence[str] | None
Get feature names (column labels).
- Returns:
feature_names
- Return type:
list or None
- property feature_types: Sequence[str] | None
Get feature types (column types).
- Returns:
feature_types
- Return type:
list or None
- get_base_margin()
Get the base margin of the DMatrix.
- Return type:
base_margin
- get_data()
Get the predictors from DMatrix as a CSR matrix. This getter is mostly for testing purposes. If this is a quantized DMatrix then quantized values are returned instead of input values.
New in version 1.7.0.
- Return type:
scipy.sparse.csr_matrix
- get_float_info(field)
Get float property from the DMatrix.
- Parameters:
field (str) – The field name of the information
- Returns:
info – a numpy array of float information of the data
- Return type:
array
- get_group()
Get the group of the DMatrix.
- Return type:
group
- get_label()
Get the label of the DMatrix.
- Returns:
label
- Return type:
array
- get_uint_info(field)
Get unsigned integer property from the DMatrix.
- Parameters:
field (str) – The field name of the information
- Returns:
info – a numpy array of unsigned integer information of the data
- Return type:
array
- get_weight()
Get the weight of the DMatrix.
- Returns:
weight
- Return type:
array
- save_binary(fname, silent=True)
Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix() as input.
- Parameters:
fname (string or os.PathLike) – Name of the output buffer file.
silent (bool (optional; default: True)) – If set, the output is suppressed.
- Return type:
None
- set_base_margin(margin)
Set base margin of booster to start from.
This can be used to specify a prediction value of an existing model to be the base margin. Note that the raw margin is needed rather than the transformed prediction: e.g. for logistic regression, supply the value before the logistic transformation. See also example/demo.py.
- Parameters:
margin (array like) – Prediction margin of each datapoint
- Return type:
None
- set_float_info(field, data)
Set float type property into the DMatrix.
- Parameters:
field (str) – The field name of the information
data (numpy array) – The array of data to be set
- Return type:
None
- set_float_info_npy2d(field, data)
- Set float type property into the DMatrix for numpy 2d array input.
- Parameters:
field (str) – The field name of the information
data (numpy array) – The array of data to be set
- Return type:
None
- set_group(group)
Set group size of DMatrix (used for ranking).
- Parameters:
group (array like) – Group size of each group
- Return type:
None
- set_info(*, label=None, weight=None, base_margin=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None, feature_weights=None)
Set meta info for DMatrix. See the doc string for xgboost.DMatrix.
- Parameters:
- Return type:
None
- set_label(label)
Set label of DMatrix.
- Parameters:
label (array like) – The label information to be set into DMatrix
- Return type:
None
- set_uint_info(field, data)
Set uint type property into the DMatrix.
- Parameters:
field (str) – The field name of the information
data (numpy array) – The array of data to be set
- Return type:
None
- set_weight(weight)
Set weight of each instance.
- Parameters:
weight (array like) –
Weight for each data point
Note
For ranking task, weights are per-group.
In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
- Return type:
None
- slice(rindex, allow_groups=False)
Slice the DMatrix and return a new DMatrix that contains only the rows given by rindex.
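Example
A minimal sketch of slicing a DMatrix to a subset of rows (synthetic data):

import numpy as np
import xgboost as xgb

X = np.random.rand(10, 2)
y = np.arange(10)
dmat = xgb.DMatrix(X, label=y)

# Keep only rows 0, 2 and 4
sub = dmat.slice([0, 2, 4])
assert sub.num_row() == 3
assert list(sub.get_label()) == [0.0, 2.0, 4.0]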
- class xgboost.QuantileDMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, max_bin=None, ref=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)
Bases:
DMatrix
A DMatrix variant that generates quantilized data directly from input for the hist and gpu_hist tree methods. This DMatrix is primarily designed to save memory in training by avoiding intermediate storage. Set max_bin to control the number of bins during quantisation, which should be consistent with the training parameter max_bin. When QuantileDMatrix is used for the validation/test dataset, ref should be another QuantileDMatrix (or a DMatrix, though that is not recommended as it defeats the purpose of saving memory) constructed from the training dataset. See xgboost.DMatrix for documentation on meta info.
Note
Do not use QuantileDMatrix as a validation/test dataset without supplying a reference (the training dataset QuantileDMatrix) via ref, as some information may be lost in quantisation.
New in version 1.7.0.
- Parameters:
max_bin (int | None) – The number of histogram bins, should be consistent with the training parameter max_bin.
ref (DMatrix | None) – The training dataset that provides quantile information, needed when creating validation/test dataset with QuantileDMatrix. Supplying the training DMatrix as a reference means that the same quantisation applied to the training data is applied to the validation/test data.
data (os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame/cudf.DataFrame/cupy.array/dlpack/arrow.Table) –
Data source of DMatrix.
When data is a string or os.PathLike type, it represents the path to a libsvm format txt file, a csv file (by specifying the uri parameter ‘path_to_csv?format=csv’), or a binary file that xgboost can read from.
label (array_like) – Label of the training data.
weight (array_like) –
Weight for each instance.
Note
For ranking task, weights are per-group.
In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (array_like) – Base margin used for boosting from existing model.
missing (float, optional) – Value in the input data which is to be treated as missing. If None, defaults to np.nan.
silent (boolean, optional) – Whether to print messages during construction.
feature_names (list, optional) – Set names for features.
feature_types (FeatureTypes) – Set types for features. When enable_categorical is set to True, string “c” represents categorical data type while “q” represents numerical feature type. For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.
nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.
group (array_like) – Group size for all ranking groups.
qid (array_like) – Query ID for data samples, used for ranking.
label_lower_bound (array_like) – Lower bound for survival training.
label_upper_bound (array_like) – Upper bound for survival training.
feature_weights (array_like, optional) – Set feature weights for column sampling.
enable_categorical (boolean, optional) –
New in version 1.3.0.
Note
This parameter is experimental
Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Also, JSON/UBJSON serialization format is required.
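Example
A minimal sketch of using QuantileDMatrix for training and validation data with the hist tree method (synthetic data; the parameter values are illustrative):

import numpy as np
import xgboost as xgb

X_train, y_train = np.random.rand(1000, 10), np.random.rand(1000)
X_valid, y_valid = np.random.rand(200, 10), np.random.rand(200)

# Quantised training matrix; max_bin should match the training parameter.
Xy_train = xgb.QuantileDMatrix(X_train, label=y_train, max_bin=256)
# The validation matrix reuses the quantile cuts of the training data via ref.
Xy_valid = xgb.QuantileDMatrix(X_valid, label=y_valid, ref=Xy_train)

booster = xgb.train(
    {"tree_method": "hist", "max_bin": 256},
    Xy_train,
    num_boost_round=10,
    evals=[(Xy_valid, "valid")],
)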
- class xgboost.Booster(params=None, cache=None, model_file=None)
Bases:
object
A Booster of XGBoost.
Booster is the model of xgboost, containing low-level routines for training, prediction and evaluation.
- Parameters:
- attr(key)
Get attribute string from the Booster.
- attributes()
Get attributes stored in the Booster as a dictionary.
- Returns:
result – Returns an empty dict if there are no attributes.
- Return type:
dictionary of attribute_name: attribute_value pairs of strings.
- boost(dtrain, grad, hess)
Boost the booster for one iteration, with customized gradient statistics. Like xgboost.Booster.update(), this function should not be called directly by users.
- copy()
Copy the booster object.
- Returns:
booster – a copied booster model
- Return type:
Booster
- dump_model(fout, fmap='', with_stats=False, dump_format='text')
Dump model into a text or JSON file. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.
- Parameters:
fout (string or os.PathLike) – Output file name.
fmap (string or os.PathLike, optional) – Name of the file containing feature map names.
with_stats (bool, optional) – Controls whether the split statistics are output.
dump_format (string, optional) – Format of model dump file. Can be ‘text’ or ‘json’.
- Return type:
None
- eval(data, name='eval', iteration=0)
Evaluate the model on the given data.
- eval_set(evals, iteration=0, feval=None, output_margin=True)
Evaluate a set of data.
- property feature_names: Sequence[str] | None
Feature names for this booster. Can be directly set by input data or by assignment.
- property feature_types: Sequence[str] | None
Feature types for this booster. Can be directly set by input data or by assignment. See
DMatrix
for details.
- get_dump(fmap='', with_stats=False, dump_format='text')
Returns the model dump as a list of strings. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.
- get_fscore(fmap='')
Get feature importance of each feature.
Note
Zero-importance features will not be included
Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.
- get_score(fmap='', importance_type='weight')
Get feature importance of each feature. For tree models, the importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
Note
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
Note
Zero-importance features will not be included
Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.
- Parameters:
- Returns:
A map between feature names and their scores. When gblinear is used for multi-class classification, the scores for each feature are a list with length n_classes; otherwise they’re scalars.
- Return type:
Dict[str, float | List[float]]
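Example
A minimal sketch of reading feature importances from a trained booster (synthetic data; features that were never used in a split are absent from the result):

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 4)
y = np.random.rand(200)
bst = xgb.train({"max_depth": 3}, xgb.DMatrix(X, label=y), num_boost_round=5)

# Importance by total gain; keys are feature names (f0, f1, ... by default).
scores = bst.get_score(importance_type="total_gain")
for name, score in scores.items():
    print(name, score)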
- get_split_value_histogram(feature, fmap='', bins=None, as_pandas=True)
Get split value histogram of a feature
- Parameters:
feature (str) – The name of the feature.
fmap (str or os.PathLike (optional)) – The name of feature map file.
bins (int, default None) – The maximum number of bins. Number of bins equals number of unique split values n_unique, if bins == None or bins > n_unique.
as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.
- Returns:
a histogram of used splitting values for the specified feature, either as numpy array or pandas DataFrame.
- Return type:
- inplace_predict(data, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)
Run prediction in-place. Unlike the predict() method, inplace prediction does not cache the prediction result.
Calling only inplace_predict in multiple threads is safe and lock free. But the safety does not hold when used in conjunction with other methods. E.g. you can’t train the booster in one thread and perform prediction in the other.

booster.set_param({"predictor": "gpu_predictor"})
booster.inplace_predict(cupy_array)
booster.set_param({"predictor": "cpu_predictor"})
booster.inplace_predict(numpy_array)
New in version 1.1.0.
- Parameters:
data (numpy.ndarray/scipy.sparse.csr_matrix/cupy.ndarray/cudf.DataFrame/pd.DataFrame) – The input data; must not be a view for numpy array. Set predictor to gpu_predictor for running prediction on CuPy array or CuDF DataFrame.
iteration_range (Tuple[int, int]) – See predict() for details.
predict_type (str) –
value: Output model prediction values.
margin: Output the raw untransformed margin value.
missing (float) – See xgboost.DMatrix for details.
validate_features (bool) – See xgboost.Booster.predict() for details.
base_margin (Any | None) – See xgboost.DMatrix for details. New in version 1.4.0.
strict_shape (bool) – See xgboost.Booster.predict() for details. New in version 1.4.0.
- Returns:
prediction – The prediction result. When input data is on GPU, prediction result is stored in a cupy array.
- Return type:
numpy.ndarray/cupy.ndarray
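Example
A minimal sketch of in-place prediction on a numpy array (synthetic data):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.rand(100)
bst = xgb.train({}, xgb.DMatrix(X, label=y), num_boost_round=5)

# No DMatrix is constructed for the new data; results are not cached.
preds = bst.inplace_predict(np.random.rand(10, 4))
margins = bst.inplace_predict(np.random.rand(10, 4), predict_type="margin")
assert preds.shape == (10,)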
- load_config(config)
Load configuration returned by save_config.
New in version 1.0.0.
- Parameters:
config (str) –
- Return type:
None
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json")
# or
model.load_model("model.ubj")
- num_boosted_rounds()
Get number of boosted rounds. For gblinear this is reset to 0 after serializing the model.
- Return type:
- predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, training=False, iteration_range=(0, 0), strict_shape=False)
Predict with data. The full model will be used unless iteration_range is specified, meaning users have to either slice the model or use the best_iteration attribute to get prediction from the best model returned from early stopping.
Note
See Prediction for issues like thread safety and a summary of outputs from this function.
- Parameters:
data (DMatrix) – The dmatrix storing the input.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int) – Deprecated, use iteration_range instead.
pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.
pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
approx_contribs (bool) – Approximate the contributions of each feature. Used when pred_contribs or pred_interactions is set to True. Changing the default of this parameter (False) is not recommended.
pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
training (bool) –
Whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations but uses all trees for inference. If you want to obtain results with dropouts, set this parameter to True. Also, the parameter is set to true when obtaining predictions for a custom objective function.
New in version 1.0.0.
iteration_range (Tuple[int, int]) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during [10, 20) (half open set) rounds are used in this prediction.
New in version 1.4.0.
strict_shape (bool) –
When set to True, output shape is invariant to whether classification is used. For both value and margin prediction, the output shape is (n_samples, n_groups), n_groups == 1 when multi-class is not used. Default to False, in which case the output shape can be (n_samples, ) if multi-class is not used.
New in version 1.4.0.
- Returns:
prediction
- Return type:
numpy array
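Example
A minimal sketch of predicting with only a prefix of the boosted rounds via iteration_range (synthetic data):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({}, dtrain, num_boost_round=20)

# Use only the first 10 boosting rounds for this prediction.
partial = bst.predict(dtrain, iteration_range=(0, 10))
full = bst.predict(dtrain)
assert partial.shape == full.shape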
- save_config()
Output internal parameter configuration of Booster as a JSON string.
New in version 1.0.0.
- Return type:
str
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- save_raw(raw_format='deprecated')
Save the model to an in-memory buffer representation instead of a file.
- Parameters:
raw_format (str) – Format of output buffer. Can be json, ubj or deprecated. Right now the default is deprecated but it will be changed to ubj (universal binary json) in the future.
- Return type:
An in memory buffer representation of the model
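Example
A minimal sketch of round-tripping a model through an in-memory buffer; the "ubj" format choice is illustrative, and load_model() accepts the returned bytearray as described above:

import numpy as np
import xgboost as xgb

X = np.random.rand(50, 3)
y = np.random.rand(50)
bst = xgb.train({}, xgb.DMatrix(X, label=y), num_boost_round=2)

# Serialize to an in-memory buffer, then restore into a fresh Booster.
raw = bst.save_raw(raw_format="ubj")
restored = xgb.Booster()
restored.load_model(raw)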
- set_attr(**kwargs)
Set the attribute of the Booster.
- Parameters:
**kwargs (str | None) – The attributes to set. Setting a value to None deletes an attribute.
- Return type:
None
- set_param(params, value=None)
Set parameters into the Booster.
- Parameters:
params (dict/list/str) – list of key,value pairs, dict of key to value or simply str key
value (optional) – value of the specified parameter, when params is str key
- Return type:
None
- trees_to_dataframe(fmap='')
Parse a boosted tree model text dump into a pandas DataFrame structure.
This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).
- Parameters:
fmap (str or os.PathLike (optional)) – The name of feature map file.
- Return type:
- update(dtrain, iteration, fobj=None)
Update for one iteration, with objective function calculated internally. This function should not be called directly by users.
Learning API
Training Library containing training routines.
- xgboost.train(params, dtrain, num_boost_round=10, *, evals=None, obj=None, feval=None, maximize=None, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, custom_metric=None)
Train a booster with given parameters.
- Parameters:
params (dict) – Booster params.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
evals (Sequence[Tuple[DMatrix, str]] | None) – List of validation sets for which metrics will be evaluated during training. Validation metrics will help us track the performance of the model.
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) – Custom objective function. See Custom Objective for details.
feval (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
Deprecated since version 1.6.0: Use custom_metric instead.
maximize (bool) – Whether to maximize feval.
early_stopping_rounds (int | None) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. The method returns the model from the last iteration (not the best one). Use custom callback or model slicing if the best model is desired. If there’s more than one item in evals, the last entry will be used for early stopping. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping. If early stopping occurs, the model will have two additional fields: bst.best_score, bst.best_iteration.
evals_result (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
This dictionary stores the evaluation results of all the items in watchlist.
Example: with a watchlist containing [(dtest, 'eval'), (dtrain, 'train')] and a parameter containing {'eval_metric': 'logloss'}, the evals_result returns
{'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}
verbose_eval (bool | int | None) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.
xgb_model (str | PathLike | Booster | bytearray | None) – Xgb model to be loaded before training (allows training continuation).
callbacks (Sequence[TrainingCallback] | None) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
Custom metric function. See Custom Metric for details.
- Returns:
Booster – a trained booster model
- Return type:
Booster
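Example
A minimal sketch of train() with a validation set and early stopping (synthetic data; 'rmse' is the default metric for reg:squarederror):

import numpy as np
import xgboost as xgb

X, y = np.random.rand(500, 10), np.random.rand(500)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

evals_result = {}
bst = xgb.train(
    {"objective": "reg:squarederror", "eta": 0.1},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=10,
    evals_result=evals_result,
    verbose_eval=False,
)
print(bst.best_iteration, evals_result["valid"]["rmse"][-1])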
- xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=None, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True, custom_metric=None)
Cross-validation with given parameters.
- Parameters:
params (dict) – Booster params.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
nfold (int) – Number of folds in CV.
stratified (bool) – Perform stratified sampling.
folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. Alternatively may explicitly pass sample indices for each fold. For n folds, folds should be a length n list of tuples. Each tuple is (in, out) where in is a list of indices to be used as the training samples for the n-th fold and out is a list of indices to be used as the testing samples for the n-th fold.
metrics (string or list of strings) – Evaluation metrics to be watched in CV.
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) – Custom objective function. See Custom Objective for details.
feval (function) –
Deprecated since version 1.6.0: Use custom_metric instead.
maximize (bool) – Whether to maximize feval.
early_stopping_rounds (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.
fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray
verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected and always contain the std.
seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
callbacks (Sequence[TrainingCallback] | None) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
shuffle (bool) – Shuffle data before creating folds.
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
Custom metric function. See Custom Metric for details.
- Returns:
evaluation history
- Return type:
list(string)
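Example
A minimal sketch of running cross-validation (synthetic data; the returned history is a pandas DataFrame when pandas is installed):

import numpy as np
import xgboost as xgb

X, y = np.random.rand(500, 10), np.random.randint(0, 2, size=500)
dtrain = xgb.DMatrix(X, label=y)

history = xgb.cv(
    {"objective": "binary:logistic", "eta": 0.1},
    dtrain,
    num_boost_round=50,
    nfold=5,
    metrics=("logloss",),
    early_stopping_rounds=5,
    seed=0,
)
# Columns include train-logloss-mean/std and test-logloss-mean/std.
print(history.tail())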
Scikit-Learn API
Scikit-Learn Wrapper interface for XGBoost.
- class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)
Bases:
XGBModel
,RegressorMixin
Implementation of the scikit-learn API for XGBoost regression.
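Example
A minimal sketch of fitting and scoring an XGBRegressor (synthetic data from scikit-learn; the hyperparameter values are illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
reg.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(reg.score(X_test, y_test))  # R^2 on the held-out data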
- Parameters:
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
- Sampling method. Used only by gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See
DMatrix
for details.max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see
xgboost.callback.EarlyStopping
.See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in the fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in the fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess (a sketch follows the list below):
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
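Example
A minimal sketch of a custom objective following this signature; the gradient/hessian below implement squared log error in the spirit of the Custom Objective tutorial, and the data is synthetic:

import numpy as np
import xgboost as xgb

def squared_log(y_true: np.ndarray, y_pred: np.ndarray):
    # Gradient and hessian of squared log error, evaluated per sample.
    y_pred[y_pred < -1] = -1 + 1e-6
    grad = (np.log1p(y_pred) - np.log1p(y_true)) / (y_pred + 1)
    hess = (-np.log1p(y_pred) + np.log1p(y_true) + 1) / np.power(y_pred + 1, 2)
    return grad, hess

X = np.abs(np.random.rand(100, 4))
y = np.abs(np.random.rand(100))
reg = xgb.XGBRegressor(objective=squared_log, n_estimators=10)
reg.fit(X, y)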
- apply(X, ntree_limit=0, iteration_range=None)
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape [n_features] or [n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for multi-class linear model, which returns an array with shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (Any) – Feature matrix
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) – Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | str | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
XGBModel
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape (1,) or [n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json")
# or
model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during [10, 20) (half open set) rounds are used in this prediction.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
- class xgboost.XGBClassifier(*, objective='binary:logistic', use_label_encoder=None, **kwargs)
Bases:
XGBModel
,ClassifierMixin
Implementation of the scikit-learn API for XGBoost classification.
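Example
A minimal sketch of fitting an XGBClassifier and obtaining class labels and probabilities (synthetic data from scikit-learn; the hyperparameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = xgb.XGBClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:5])        # predicted class labels
print(clf.predict_proba(X_test)[:5])  # class probabilities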
- Parameters:
n_estimators (int) – Number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
- Sampling method. Used only by gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See
DMatrix
for details.max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter; a minimal sketch is shown after this parameter list. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
use_label_encoder (bool | None) –
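As a minimal, hedged sketch of the custom objective signature described in the note above (not an official recipe from this page), a hand-written binary logistic objective could look like the following, assuming y_pred is the raw, untransformed margin:

import numpy as np
import xgboost as xgb

def logistic_objective(y_true, y_pred):
    # gradient and hessian of the binary logistic loss w.r.t. the raw margin
    prob = 1.0 / (1.0 + np.exp(-y_pred))
    grad = prob - y_true
    hess = prob * (1.0 - prob)
    return grad, hess

# illustrative usage; logistic_objective is a user-defined helper, not part of xgboost
clf = xgb.XGBClassifier(objective=logistic_objective, n_estimators=10)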
- apply(X, ntree_limit=0, iteration_range=None)
Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape [n_features] or [n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
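A brief usage sketch (assuming xgboost is imported as xgb and that X_train, y_train, X_valid, y_valid already exist; the metric name is illustrative):

clf = xgb.XGBClassifier(eval_metric="logloss")
clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)])
history = clf.evals_result()
print(history["validation_0"]["logloss"])  # metric per boosting round on the first eval set
print(history["validation_1"]["logloss"])  # metric per boosting round on the second eval set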
- property feature_importances_: ndarray
Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting classifier.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (Any) – Feature matrix
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape (1,) or [n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json")
# or
model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict the probability of each X example being of a given class.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
- Returns:
a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
- Return type:
prediction
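A small, hedged sketch (clf is assumed to be a fitted classifier and X_test an existing feature matrix):

proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes)
labels = proba.argmax(axis=1)       # column order follows clf.classes_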
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
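To tie together eval_set, eval_metric and early_stopping_rounds for the classifier documented above, a minimal sketch might look like the following (the dataset and split are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

clf = xgb.XGBClassifier(
    n_estimators=500,
    eval_metric="logloss",
    early_stopping_rounds=10,
)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print(clf.best_iteration, clf.score(X_valid, y_valid))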
- class xgboost.XGBRanker(*, objective='rank:pairwise', **kwargs)
Bases: XGBModel, XGBRankerMixIn
Implementation of the Scikit-Learn API for XGBoost Ranking.
- Parameters:
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which is to be treated as missing.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameters.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
The default objective for XGBRanker is “rank:pairwise”
Note
A custom objective function is currently not supported by XGBRanker. Likewise, a custom metric function is not supported either.
Note
Query group information is required for ranking tasks by either using the group parameter or qid parameter in fit method. This information is not required in ‘predict’ method and multiple groups can be predicted on a single call to predict.
When fitting the model with the group parameter, your data need to be sorted by query group first. group must be an array that contains the size of each query group. When fitting the model with the qid parameter, your data does not need sorting. qid must be an array that contains the group of each training sample.
For example, if your original data look like:
qid   label   features
1     0       x_1
1     1       x_2
1     0       x_3
2     0       x_4
2     1       x_5
2     1       x_6
2     1       x_7
then the fit method can be called with either the group array as [3, 4] or with qid as [1, 1, 1, 2, 2, 2, 2], that is, the qid column.
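A minimal, hedged sketch of the qid workflow described above (the synthetic arrays mirror the example table; nothing here is prescribed beyond the documented fit signature):

import numpy as np
import xgboost as xgb

X = np.random.rand(7, 5)                  # 7 samples, 5 features
y = np.array([0, 1, 0, 0, 1, 1, 1])       # relevance labels
qid = np.array([1, 1, 1, 2, 2, 2, 2])     # query id for each sample

ranker = xgb.XGBRanker(n_estimators=10)
ranker.fit(X, y, qid=qid)                 # or, with data sorted by query: group=[3, 4]
scores = ranker.predict(X)                # one relevance score per row; no group info needed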
- apply(X, ntree_limit=0, iteration_range=None)
Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape [n_features] or [n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting ranker
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (Any) – Feature matrix
y (Any) – Labels
group (Any | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then user must provide qid.
qid (Any | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then user must provide group.
sample_weight (Any | None) –
Query group weights
Note
Weights are per-group for ranking tasks
In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (Any | None) – Global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_group (Sequence[Any] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.
eval_qid (Sequence[Any] | None) – A list in which eval_qid[i] is the array containing the query IDs of the i-th pair in eval_set.
eval_metric (str, list of str, optional) –
Deprecated since version 1.6.0: use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –
Deprecated since version 1.6.0: use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) –
A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.
Note
Weights are per-group for ranking tasks
In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape (1,) or [n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json")
# or
model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- class xgboost.XGBRFRegressor(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)
Bases:
XGBRegressor
scikit-learn API for XGBoost random forest regression.
- Parameters:
n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which is to be treated as missing.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameters.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
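As a short, hedged usage sketch for this class (the dataset choice is an illustrative assumption):

from sklearn.datasets import load_diabetes
import xgboost as xgb

X, y = load_diabetes(return_X_y=True)

# a random forest of 100 trees; note the class defaults shown in the signature above:
# learning_rate=1.0, subsample=0.8, colsample_bynode=0.8
rf = xgb.XGBRFRegressor(n_estimators=100, max_depth=6)
rf.fit(X, y)
print(rf.score(X, y))  # R^2 on the training data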
- apply(X, ntree_limit=0, iteration_range=None)
Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape [n_features] or [n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property, return depends on importance_type parameter. When model trained with multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is sum of loss change for each split from all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (Any) – Feature matrix
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape (1,) or [n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json")
# or
model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
- class xgboost.XGBRFClassifier(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)
Bases:
XGBClassifier
scikit-learn API for XGBoost random forest classification.
- Parameters:
n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which is to be treated as missing.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameters.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
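A short, hedged usage sketch for this class (the dataset choice is an illustrative assumption):

from sklearn.datasets import load_breast_cancer
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)

# a random forest classifier of 100 trees; like XGBRFRegressor, the defaults
# differ from the boosting classes (learning_rate=1.0, subsample=0.8, ...)
rf_clf = xgb.XGBRFClassifier(n_estimators=100, max_depth=6)
rf_clf.fit(X, y)
proba = rf_clf.predict_proba(X)  # shape (n_samples, 2) for this binary dataset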
- apply(X, ntree_limit=0, iteration_range=None)
Return the predicted leaf for every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within
[0; 2**(self.max_depth+1))
, possibly with gaps in the numbering.- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the
fit()
function, you can callevals_result()
to get evaluation results for all passed eval_sets. When eval_metric is also passed to thefit()
function, the evals_result will contain the eval_metrics passed to thefit()
function.The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
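A minimal sketch of how evals_result() is typically used, assuming a toy binary classification dataset (not part of the upstream documentation):
from sklearn.datasets import make_classification
import xgboost as xgb

X, y = make_classification(n_samples=200, random_state=0)
clf = xgb.XGBClassifier(n_estimators=5, eval_metric="logloss")
clf.fit(X, y, eval_set=[(X, y)], verbose=False)

history = clf.evals_result()
print(history["validation_0"]["logloss"])  # one entry per boosting round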
- property feature_importances_: ndarray
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of the loss change for each split from all trees.
- Returns:
feature_importances_ (array of shape
[n_features]
, except for the multi-class linear model, which returns an array of shape (n_features, n_classes))
- property feature_names_in_: ndarray
Names of features seen during
fit()
. Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting classifier.
Note that calling
fit()
multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly passxgb_model
argument.- Parameters:
X (Any) – Feature matrix
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – global bias for each instance.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in
__init__()
orset_params()
instead.early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in
__init__()
orset_params()
instead.verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in
__init__()
orset_params()
instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit() has not been called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (Any) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, then specifying
iteration_range=(10, 20)
means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
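For example, a small sketch (toy data assumed) that restricts prediction to the first ten boosting rounds via iteration_range:
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

reg = xgb.XGBRegressor(n_estimators=20).fit(X, y)
# Use only the trees from boosting rounds [0, 10) for this prediction.
preds = reg.predict(X, iteration_range=(0, 10))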
- predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict the probability of each X example being of a given class.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, then specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
- Returns:
a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of
self.predict(X)
w.r.t. y.- Return type:
Plotting API
Plotting Library.
- xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', fmap='', importance_type='weight', max_num_features=None, grid=True, show_values=True, **kwargs)
Plot importance based on fitted trees.
- Parameters:
booster (Booster, XGBModel or dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()
ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
grid (bool, default True) – Turn the axes grids on or off.
importance_type (str, default "weight") –
How the importance is calculated: either “weight”, “gain”, or “cover”
”weight” is the number of times a feature appears in a tree
”gain” is the average gain of splits which use the feature
”cover” is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split
max_num_features (int, default None) – Maximum number of top features displayed on plot. If None, all features will be displayed.
height (float, default 0.2) – Bar height, passed to ax.barh()
xlim (tuple, default None) – Tuple passed to axes.xlim()
ylim (tuple, default None) – Tuple passed to axes.ylim()
title (str, default "Feature importance") – Axes title. To disable, pass None.
xlabel (str, default "F score") – X axis title label. To disable, pass None.
ylabel (str, default "Features") – Y axis title label. To disable, pass None.
fmap (str or os.PathLike (optional)) – The name of feature map file.
show_values (bool, default True) – Show values on plot. To disable, pass False.
kwargs (Any) – Other keywords passed to ax.barh()
- Returns:
ax
- Return type:
matplotlib Axes
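A minimal usage sketch, assuming matplotlib is installed and using a toy dataset; the parameter choices are illustrative only:
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(n_estimators=10).fit(X, y)

# Plot the five most important features, ranked by average gain.
ax = xgb.plot_importance(reg, importance_type="gain", max_num_features=5)
plt.show()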
- xgboost.plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs)
Plot specified tree.
- Parameters:
booster (Booster, XGBModel) – Booster or XGBModel instance
fmap (str (optional)) – The name of feature map file
num_trees (int, default 0) – Specify the ordinal number of target tree
rankdir (str, default "TB") – Passed to graphviz via graph_attr
ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
kwargs (Any) – Other keywords passed to to_graphviz
- Returns:
ax
- Return type:
matplotlib Axes
- xgboost.to_graphviz(booster, fmap='', num_trees=0, rankdir=None, yes_color=None, no_color=None, condition_node_params=None, leaf_node_params=None, **kwargs)
Convert the specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance.
- Parameters:
booster (Booster, XGBModel) – Booster or XGBModel instance
fmap (str (optional)) – The name of feature map file
num_trees (int, default 0) – Specify the ordinal number of target tree
rankdir (str, default "UT") – Passed to graphviz via graph_attr
yes_color (str, default '#0000FF') – Edge color when meets the node condition.
no_color (str, default '#FF0000') – Edge color when doesn’t meet the node condition.
condition_node_params (dict, optional) –
Condition node configuration for graphviz. Example:
{'shape': 'box', 'style': 'filled,rounded', 'fillcolor': '#78bceb'}
leaf_node_params (dict, optional) –
Leaf node configuration for graphviz. Example:
{'shape': 'box', 'style': 'filled', 'fillcolor': '#e48038'}
**kwargs (dict, optional) – Other keywords passed to graphviz graph_attr, e.g.
graph [ {key} = {value} ]
- Returns:
graph
- Return type:
graphviz.Source
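A minimal sketch, assuming the graphviz Python package and binaries are installed; the output file name is an arbitrary choice:
import xgboost as xgb
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(n_estimators=3, max_depth=2).fit(X, y)

graph = xgb.to_graphviz(reg, num_trees=1)  # graphviz.Source instance
graph.render("tree_1")  # writes tree_1 (DOT source) and tree_1.pdf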
Callback API
Callback library containing training routines. See Callback Functions for a quick introduction.
- class xgboost.callback.TrainingCallback
Interface for training callback.
New in version 1.3.0.
- after_iteration(model, epoch, evals_log)
Run after each iteration. Return True when training should stop.
- before_iteration(model, epoch, evals_log)
Run before each iteration. Return True when training should stop.
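A minimal sketch of a user-defined callback implementing this interface; the class name and printing behaviour are illustrative assumptions:
import xgboost as xgb

class IterationLogger(xgb.callback.TrainingCallback):
    # Print the latest value of every tracked metric after each iteration.
    def after_iteration(self, model, epoch, evals_log):
        for data, metrics in evals_log.items():
            for name, values in metrics.items():
                print(f"[{epoch}] {data}-{name}: {values[-1]}")
        return False  # returning True would stop training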
- class xgboost.callback.EvaluationMonitor(rank=0, period=1, show_stdv=False)
Bases:
TrainingCallback
Print the evaluation result at each iteration.
New in version 1.3.0.
- Parameters:
- after_iteration(model, epoch, evals_log)
Run after each iteration. Return True when training should stop.
- before_iteration(model, epoch, evals_log)
Run before each iteration. Return True when training should stop.
- class xgboost.callback.EarlyStopping(rounds, metric_name=None, data_name=None, maximize=None, save_best=False, min_delta=0.0)
Bases:
TrainingCallback
Callback function for early stopping
New in version 1.3.0.
- Parameters:
rounds (int) – Early stopping rounds.
metric_name (str | None) – Name of metric that is used for early stopping.
data_name (str | None) – Name of dataset that is used for early stopping.
maximize (bool | None) – Whether to maximize evaluation metric. None means auto (discouraged).
save_best (bool | None) – Whether training should return the best model or the last model.
min_delta (float) –
Minimum absolute change in score to be qualified as an improvement.
New in version 1.5.0.
clf = xgboost.XGBClassifier(tree_method="gpu_hist")
es = xgboost.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="mlogloss",
)
X, y = load_digits(return_X_y=True)
clf.fit(X, y, eval_set=[(X, y)], callbacks=[es])
- after_iteration(model, epoch, evals_log)
Run after each iteration. Return True when training should stop.
- before_iteration(model, epoch, evals_log)
Run before each iteration. Return True when training should stop.
- class xgboost.callback.LearningRateScheduler(learning_rates)
Bases:
TrainingCallback
Callback function for scheduling learning rate.
New in version 1.3.0.
- Parameters:
learning_rates (Callable[[int], float] | Sequence[float]) – If it’s a callable object, it should accept an integer parameter epoch and return the corresponding learning rate. Otherwise it should be a sequence, such as a list or tuple, with the same length as the number of boosting rounds.
- after_iteration(model, epoch, evals_log)
Run after each iteration. Return True when training should stop.
- before_iteration(model, epoch, evals_log)
Run before each iteration. Return True when training should stop.
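A minimal sketch using a callable schedule with the native training interface; the decay schedule and the toy data are assumptions:
import numpy as np
import xgboost as xgb

def custom_rates(epoch):
    # Decay the learning rate by 1% per boosting round.
    return 0.3 * (0.99 ** epoch)

rng = np.random.default_rng(0)
Xy = xgb.DMatrix(rng.normal(size=(100, 4)), label=rng.normal(size=100))
booster = xgb.train(
    {"objective": "reg:squarederror"},
    Xy,
    num_boost_round=20,
    callbacks=[xgb.callback.LearningRateScheduler(custom_rates)],
)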
- class xgboost.callback.TrainingCheckPoint(directory, name='model', as_pickle=False, iterations=100)
Bases:
TrainingCallback
Checkpointing operation.
New in version 1.3.0.
- Parameters:
name (str) – pattern of output model file. Models will be saved as name_0.json, name_1.json, name_2.json ….
as_pickle (bool) – When set to True, all training parameters will be saved in pickle format, instead of saving only the model.
iterations (int) – Interval of checkpointing. Checkpointing is slow, so setting a larger interval reduces the performance hit.
- after_iteration(model, epoch, evals_log)
Run after each iteration. Return True when training should stop.
- before_iteration(model, epoch, evals_log)
Run before each iteration. Return True when training should stop.
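A minimal sketch; the directory name is an assumption, and the train call is shown commented out because it depends on params and dtrain defined elsewhere:
import os
import xgboost as xgb

os.makedirs("checkpoints", exist_ok=True)
# Save a model snapshot every 10 boosting rounds as checkpoints/model_*.json.
ckpt = xgb.callback.TrainingCheckPoint(directory="checkpoints", name="model", iterations=10)
# booster = xgb.train(params, dtrain, num_boost_round=100, callbacks=[ckpt])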
Dask API
Dask extensions for distributed training
See Distributed XGBoost with Dask for simple tutorial. Also XGBoost Dask Feature Walkthrough for some examples.
There are two sets of APIs in this module: one is the functional API, including the
train
and
predict
methods; the other is the stateful Scikit-Learn wrapper inherited from the single-node Scikit-Learn interface.
The implementation is heavily influenced by dask_xgboost: https://github.com/dask/dask-xgboost
Optional dask configuration
xgboost.scheduler_address: Specify the scheduler address, see Troubleshooting.
New in version 1.6.0.
dask.config.set({"xgboost.scheduler_address": "192.0.0.100"}) # We can also specify the port. dask.config.set({"xgboost.scheduler_address": "192.0.0.100:12345"})
- class xgboost.dask.DaskDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)
Bases:
object
DMatrix holding references to a Dask DataFrame or Dask Array. Constructing a DaskDMatrix forces all lazy computation to be carried out. Wait for the input data explicitly if you want to see the actual computation of constructing a DaskDMatrix.
See doc for
xgboost.DMatrix
constructor for other parameters. DaskDMatrix accepts only dask collection.Note
DaskDMatrix does not repartition or move data between workers. It’s the caller’s responsibility to balance the data.
New in version 1.0.0.
- Parameters:
client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
data (da.Array | dd.DataFrame) –
label (da.Array | dd.DataFrame | dd.Series | None) –
weight (da.Array | dd.DataFrame | dd.Series | None) –
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
missing (float) –
silent (bool) –
group (da.Array | dd.DataFrame | dd.Series | None) –
qid (da.Array | dd.DataFrame | dd.Series | None) –
label_lower_bound (da.Array | dd.DataFrame | dd.Series | None) –
label_upper_bound (da.Array | dd.DataFrame | dd.Series | None) –
feature_weights (da.Array | dd.DataFrame | dd.Series | None) –
enable_categorical (bool) –
- num_col()
Get the number of columns (features) in the DMatrix.
- Return type:
number of columns
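A minimal construction sketch using a local cluster; the cluster setup and the toy arrays are assumptions (dask and distributed must be installed):
from dask import array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random(1000, chunks=100)
    dtrain = dxgb.DaskDMatrix(client, X, y)
    print(dtrain.num_col())  # 10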
- class xgboost.dask.DaskQuantileDMatrix(client, data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, max_bin=None, ref=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)
Bases:
DaskDMatrix
- Parameters:
client (distributed.Client) –
data (da.Array | dd.DataFrame) –
label (da.Array | dd.DataFrame | dd.Series | None) –
weight (da.Array | dd.DataFrame | dd.Series | None) –
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
missing (float) –
silent (bool) –
max_bin (int | None) –
ref (DMatrix | None) –
group (da.Array | dd.DataFrame | dd.Series | None) –
qid (da.Array | dd.DataFrame | dd.Series | None) –
label_lower_bound (da.Array | dd.DataFrame | dd.Series | None) –
label_upper_bound (da.Array | dd.DataFrame | dd.Series | None) –
feature_weights (da.Array | dd.DataFrame | dd.Series | None) –
enable_categorical (bool) –
- num_col()
Get the number of columns (features) in the DMatrix.
- Return type:
number of columns
- xgboost.dask.train(client, params, dtrain, num_boost_round=10, *, evals=None, obj=None, feval=None, early_stopping_rounds=None, xgb_model=None, verbose_eval=True, callbacks=None, custom_metric=None)
Train XGBoost model.
New in version 1.0.0.
Note
Other parameters are the same as
xgboost.train()
except for evals_result, which is returned as part of function return value instead of argument.- Parameters:
client (distributed.Client) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
dtrain (DaskDMatrix) –
num_boost_round (int) –
evals (Sequence[Tuple[DaskDMatrix, str]] | None) –
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) –
feval (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
early_stopping_rounds (int | None) –
xgb_model (Booster | None) –
callbacks (Sequence[TrainingCallback] | None) –
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
- Returns:
results – A dictionary containing trained booster and evaluation history. history field is the same as eval_result from xgboost.train.
{'booster': xgboost.Booster, 'history': {'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}}
- Return type:
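A minimal end-to-end sketch with a local cluster, following the return structure described above; the parameters and toy data are assumptions:
from dask import array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random(1000, chunks=100)
    dtrain = dxgb.DaskDMatrix(client, X, y)
    output = dxgb.train(
        client,
        {"objective": "reg:squarederror", "tree_method": "hist"},
        dtrain,
        num_boost_round=10,
        evals=[(dtrain, "train")],
    )
    booster = output["booster"]  # trained model
    history = output["history"]  # evaluation history, as shown above
    preds = dxgb.predict(client, booster, dtrain)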
- xgboost.dask.predict(client, model, data, output_margin=False, missing=nan, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, iteration_range=(0, 0), strict_shape=False)
Run prediction with a trained booster.
Note
Using
inplace_predict
might be faster when some features are not needed. Seexgboost.Booster.predict()
for details on various parameters. When output has more than 2 dimensions (shap value, leaf with strict_shape), input should beda.Array
orDaskDMatrix
.New in version 1.0.0.
- Parameters:
client (distributed.Client | None) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
model (TrainReturnT | Booster | distributed.Future) – The trained model. It can be a distributed.Future so user can pre-scatter it onto all workers.
data (DaskDMatrix | da.Array | dd.DataFrame) – Input data used for prediction. When input is a dataframe object, prediction output is a series.
missing (float) – Used when input data is not DaskDMatrix. Specify the value considered as missing.
output_margin (bool) –
pred_leaf (bool) –
pred_contribs (bool) –
approx_contribs (bool) –
pred_interactions (bool) –
validate_features (bool) –
strict_shape (bool) –
- Returns:
prediction – When input data is
dask.array.Array
orDaskDMatrix
, the return value is an array, when input data isdask.dataframe.DataFrame
, return value can bedask.dataframe.Series
,dask.dataframe.DataFrame
, depending on the output shape.- Return type:
dask.array.Array/dask.dataframe.Series
- xgboost.dask.inplace_predict(client, model, data, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)
Inplace prediction. See doc in
xgboost.Booster.inplace_predict()
for details.New in version 1.1.0.
- Parameters:
client (distributed.Client | None) – Specify the dask client used for training. Use default client returned from dask if it’s set to None.
model (TrainReturnT | Booster | distributed.Future) – See
xgboost.dask.predict()
for details.data (da.Array | dd.DataFrame) – dask collection.
iteration_range (Tuple[int, int]) – See
xgboost.Booster.predict()
for details.predict_type (str) – See
xgboost.Booster.inplace_predict()
for details.missing (float) – Value in the input data which needs to be present as a missing value. If None, defaults to np.nan.
base_margin (da.Array | dd.DataFrame | dd.Series | None) –
See
xgboost.DMatrix
for details.New in version 1.4.0.
strict_shape (bool) –
See
xgboost.Booster.predict()
for details.New in version 1.4.0.
validate_features (bool) –
- Returns:
When input data is
dask.array.Array
, the return value is an array, when input data isdask.dataframe.DataFrame
, return value can bedask.dataframe.Series
,dask.dataframe.DataFrame
, depending on the output shape.- Return type:
prediction
- class xgboost.dask.DaskXGBClassifier(max_depth=None, max_leaves=None, max_bin=None, grow_policy=None, learning_rate=None, n_estimators=100, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, sampling_method=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, gpu_id=None, validate_parameters=None, predictor=None, enable_categorical=False, feature_types=None, max_cat_to_onehot=None, max_cat_threshold=None, eval_metric=None, early_stopping_rounds=None, callbacks=None, **kwargs)
Bases:
DaskScikitLearnBase
,ClassifierMixin
Implementation of the scikit-learn API for XGBoost classification.
- Parameters:
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (int | None) – Maximum number of leaves; 0 indicates no limit.
max_bin (int | None) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (str | None) – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (str | None) –
- Sampling method. Used only by gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data to be treated as missing.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g.
[[0, 1], [2, 3, 4]]
, where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more informationimportance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See
DMatrix
for details.max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in
sklearn.metrics
, or any other user-defined metric that looks like the ones in sklearn.metrics. If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see
xgboost.callback.EarlyStopping
.See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in
fit()
method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in
fit()
.The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields:
best_score
,best_iteration
andbest_ntree_limit
.Note
This parameter replaces early_stopping_rounds in
fit()
method.callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
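A minimal usage sketch of this estimator with dask collections; the cluster setup and the toy data are assumptions:
from dask import array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = (da.random.random(1000, chunks=100) > 0.5).astype(int)
    clf = dxgb.DaskXGBClassifier(n_estimators=10, tree_method="hist")
    clf.client = client  # optional; the current client is used by default
    clf.fit(X, y)
    proba = clf.predict_proba(X)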
- apply(X, ntree_limit=None, iteration_range=None)
Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within
[0; 2**(self.max_depth+1))
, possibly with gaps in the numbering.- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property client: distributed.Client
The dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the
fit()
function, you can callevals_result()
to get evaluation results for all passed eval_sets. When eval_metric is also passed to thefit()
function, the evals_result will contain the eval_metrics passed to thefit()
function.The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of the loss change for each split from all trees.
- Returns:
feature_importances_ (array of shape
[n_features]
, except for the multi-class linear model, which returns an array of shape (n_features, n_classes))
- property feature_names_in_: ndarray
Names of features seen during
fit()
. Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling
fit()
multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly passxgb_model
argument.- Parameters:
X (da.Array | dd.DataFrame) – Feature matrix
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in
__init__()
orset_params()
instead.early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in
__init__()
orset_params()
instead.verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in
__init__()
orset_params()
instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit() has not been called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be a local path or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, then specifying
iteration_range=(10, 20)
means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict the probability of each X example being of a given class.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, then specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
- Returns:
a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of
self.predict(X)
w.r.t. y.- Return type:
- class xgboost.dask.DaskXGBRegressor(max_depth=None, max_leaves=None, max_bin=None, grow_policy=None, learning_rate=None, n_estimators=100, verbosity=None, objective=None, booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, sampling_method=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type=None, gpu_id=None, validate_parameters=None, predictor=None, enable_categorical=False, feature_types=None, max_cat_to_onehot=None, max_cat_threshold=None, eval_metric=None, early_stopping_rounds=None, callbacks=None, **kwargs)
Bases:
DaskScikitLearnBase
,RegressorMixin
Implementation of the Scikit-Learn API for XGBoost.
- Parameters:
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (int | None) – Maximum number of leaves; 0 indicates no limit.
max_bin (int | None) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (str | None) – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (str | None) –
- Sampling method. Used only by gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data to be treated as missing.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g.
[[0, 1], [2, 3, 4]]
, where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more informationimportance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See
DMatrix
for details.max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in
sklearn.metrics
, or any other user-defined metric that looks like the ones in sklearn.metrics. If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see
xgboost.callback.EarlyStopping
.See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in
fit()
method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in
fit()
.The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields:
best_score
,best_iteration
andbest_ntree_limit
.Note
This parameter replaces early_stopping_rounds in
fit()
method.callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
- apply(X, ntree_limit=None, iteration_range=None)
Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within
[0; 2**(self.max_depth+1))
, possibly with gaps in the numbering.- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property client: distributed.Client
The dask client used in this model. The Client object cannot be serialized for transmission, so if the task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the
fit()
function, you can callevals_result()
to get evaluation results for all passed eval_sets. When eval_metric is also passed to thefit()
function, the evals_result will contain the eval_metrics passed to thefit()
function.The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of the loss change for each split from all trees.
- Returns:
feature_importances_ (array of shape
[n_features]
, except for the multi-class linear model, which returns an array of shape (n_features, n_classes))
- property feature_names_in_: ndarray
Names of features seen during
fit()
. Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling
fit()
multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly passxgb_model
argument.- Parameters:
X (da.Array | dd.DataFrame) – Feature matrix
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –
Deprecated since version 1.6.0: Use eval_metric in
__init__()
orset_params()
instead.early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in
__init__()
orset_params()
instead.verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) –
Deprecated since version 1.6.0: Use callbacks in
__init__()
orset_params()
instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit() has not been called.
- Returns:
booster
- Return type:
an xgboost booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be local or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open set) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
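A hedged sketch of how these arguments combine (X and y are assumed to be dask collections prepared elsewhere; the estimator name and the 100-round setup are illustrative only, not taken from this page):
model = xgb.dask.DaskXGBRegressor(n_estimators=100)
model.fit(X, y)
margins = model.predict(X, output_margin=True)        # raw, untransformed margin values
partial = model.predict(X, iteration_range=(10, 20))  # use only trees from rounds [10, 20)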
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
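To make the formula above concrete, a minimal numpy sketch of the same computation (the y_true/y_pred values are made up for illustration):
import numpy as np
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
u = ((y_true - y_pred) ** 2).sum()           # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
print(1.0 - u / v)                           # ~0.9486, matching sklearn.metrics.r2_score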
- class xgboost.dask.DaskXGBRanker(*, objective='rank:pairwise', **kwargs)
Bases: DaskScikitLearnBase, XGBRankerMixIn
Implementation of the Scikit-Learn API for XGBoost Ranking.
New in version 1.4.0.
- Parameters:
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.
If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, such as directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in the fit() method. The old one receives un-transformed prediction regardless of whether a custom objective is being used.
from sklearn.datasets import load_diabetes from sklearn.metrics import mean_absolute_error X, y = load_diabetes(return_X_y=True) reg = xgb.XGBRegressor( tree_method="hist", eval_metric=mean_absolute_error, ) reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there's more than one item in eval_set, the last entry will be used for early stopping. If there's more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in the fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid: # be sure to (re)initialize the callbacks before each run callbacks = [xgb.callback.LearningRateScheduler(custom_rates)] xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
For the Dask implementation, group is not supported; use qid instead, as shown in the sketch below.
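A minimal usage sketch with qid (assuming a local dask cluster; the synthetic data below only illustrates the expected shapes, and qid must be sorted within each partition):
import numpy as np
from dask import array as da
from distributed import Client, LocalCluster
import xgboost as xgb

with Client(LocalCluster(n_workers=2)) as client:
    rng = np.random.default_rng(0)
    X = da.from_array(rng.random((100, 10)), chunks=(50, 10))
    y = da.from_array(rng.integers(0, 5, size=100), chunks=50)   # relevance labels
    qid = da.from_array(np.repeat([0, 1], 50), chunks=50)        # one query per partition
    ranker = xgb.dask.DaskXGBRanker(n_estimators=10)
    ranker.fit(X, y, qid=qid)
    scores = ranker.predict(X).compute()                         # per-row relevance scores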
- apply(X, ntree_limit=None, iteration_range=None)
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property client: distributed.Client
The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
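Continuing the ranker sketch above, a hedged example of reading the history back after fitting with an evaluation set (eval_qid mirrors eval_set; the metric name in the result depends on the objective/eval_metric in use):
ranker.fit(X, y, qid=qid, eval_set=[(X, y)], eval_qid=[qid])
history = ranker.evals_result()
print(list(history))               # ['validation_0']
print(history["validation_0"])     # e.g. {'map': [...]}, one list of scores per metric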
- property feature_importances_: ndarray
Feature importances property; what is returned depends on the importance_type parameter. When the model is trained on a multi-class/multi-label/multi-target dataset, the feature importance is "averaged" over all targets, where the "average" is defined by the importance type. For instance, if the importance type is "total_gain", the score is the sum of the loss change of each split, accumulated over all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for multi-class linear models, which return an array of shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, eval_metric=None, early_stopping_rounds=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting ranker.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (da.Array | dd.DataFrame) – Feature matrix
y (da.Array | dd.DataFrame | dd.Series) – Labels
group (da.Array | dd.DataFrame | dd.Series | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then user must provide qid.
qid (da.Array | dd.DataFrame | dd.Series | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then user must provide group.
sample_weight (da.Array | dd.DataFrame | dd.Series | None) –
Query group weights
Note
Weights are per-group for ranking tasks
In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_group (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.
eval_qid (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list in which eval_qid[i] is the array containing the query IDs of the i-th pair in eval_set.
eval_metric (str, list of str, optional) –
Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) – Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) –
A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.
Note
Weights are per-group for ranking tasks
In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) – Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an XGBoost Booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be local or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open set) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- class xgboost.dask.DaskXGBRFRegressor(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)
Bases: DaskXGBRegressor
Implementation of the Scikit-Learn API for XGBoost Random Forest Regressor.
New in version 1.4.0.
- Parameters:
n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.
If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, such as directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in the fit() method. The old one receives un-transformed prediction regardless of whether a custom objective is being used.
from sklearn.datasets import load_diabetes from sklearn.metrics import mean_absolute_error X, y = load_diabetes(return_X_y=True) reg = xgb.XGBRegressor( tree_method="hist", eval_metric=mean_absolute_error, ) reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there's more than one item in eval_set, the last entry will be used for early stopping. If there's more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in the fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid: # be sure to (re)initialize the callbacks before each run callbacks = [xgb.callback.LearningRateScheduler(custom_rates)] xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point.
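A minimal sketch of such a function for plain squared error (the helper name squared_error is illustrative, not part of the API):
import numpy as np

def squared_error(y_true: np.ndarray, y_pred: np.ndarray):
    # Gradient and hessian of 0.5 * (y_pred - y_true)**2 with respect to y_pred.
    grad = y_pred - y_true
    hess = np.ones_like(y_pred)
    return grad, hess

# reg = xgb.dask.DaskXGBRFRegressor(objective=squared_error)  # hypothetical usage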
- apply(X, ntree_limit=None, iteration_range=None)
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property client: distributed.Client
The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property; what is returned depends on the importance_type parameter. When the model is trained on a multi-class/multi-label/multi-target dataset, the feature importance is "averaged" over all targets, where the "average" is defined by the importance type. For instance, if the importance type is "total_gain", the score is the sum of the loss change of each split, accumulated over all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for multi-class linear models, which return an array of shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (da.Array | dd.DataFrame) – Feature matrix
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) – Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) – Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) – Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an XGBoost Booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be local or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open set) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
- class xgboost.dask.DaskXGBRFClassifier(*, learning_rate=1, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)
Bases: DaskXGBClassifier
Implementation of the Scikit-Learn API for XGBoost Random Forest Classifier.
New in version 1.4.0.
- Parameters:
n_estimators (int) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the gpu_hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See the tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.
If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, such as directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in the fit() method. The old one receives un-transformed prediction regardless of whether a custom objective is being used.
from sklearn.datasets import load_diabetes from sklearn.metrics import mean_absolute_error X, y = load_diabetes(return_X_y=True) reg = xgb.XGBRegressor( tree_method="hist", eval_metric=mean_absolute_error, ) reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
The method returns the model from the last iteration (not the best one). If there's more than one item in eval_set, the last entry will be used for early stopping. If there's more than one metric in eval_metric, the last metric will be used for early stopping.
If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.
Note
This parameter replaces early_stopping_rounds in the fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid: # be sure to (re)initialize the callbacks before each run callbacks = [xgb.callback.LearningRateScheduler(custom_rates)] xgboost.train(params, Xy, callbacks=callbacks)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
- y_true: array_like of shape [n_samples]
The target values
- y_pred: array_like of shape [n_samples]
The predicted values
- grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
- hess: array_like of shape [n_samples]
The value of the second derivative for each sample point.
- apply(X, ntree_limit=None, iteration_range=None)
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
- Parameters:
- Returns:
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
- Return type:
array_like, shape=[n_samples, n_trees]
- property best_iteration: int
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
- property client: distributed.Client
The dask client used in this model. The Client object can not be serialized for transmission, so if task is launched from a worker instead of directly from the client process, this attribute needs to be set at that worker.
- property coef_: ndarray
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
coef_
- Return type:
array of shape
[n_features]
or[n_classes, n_features]
- evals_result()
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
- Return type:
evals_result
- property feature_importances_: ndarray
Feature importances property; what is returned depends on the importance_type parameter. When the model is trained on a multi-class/multi-label/multi-target dataset, the feature importance is "averaged" over all targets, where the "average" is defined by the importance type. For instance, if the importance type is "total_gain", the score is the sum of the loss change of each split, accumulated over all trees.
- Returns:
feature_importances_ – array of shape [n_features], except for multi-class linear models, which return an array of shape (n_features, n_classes)
- property feature_names_in_: ndarray
Names of features seen during fit(). Defined only when X has feature names that are all strings.
- fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X (da.Array | dd.DataFrame) – Feature matrix
y (da.Array | dd.DataFrame | dd.Series) – Labels
sample_weight (da.Array | dd.DataFrame | dd.Series | None) – instance weights
base_margin (da.Array | dd.DataFrame | dd.Series | None) – global bias for each instance.
eval_set (Sequence[Tuple[da.Array | dd.DataFrame | dd.Series, da.Array | dd.DataFrame | dd.Series]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) – Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) – Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose (int | bool) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | None) – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[da.Array | dd.DataFrame | dd.Series] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (da.Array | dd.DataFrame | dd.Series | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks (Sequence[TrainingCallback] | None) – Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- Return type:
- get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called.
- Returns:
booster
- Return type:
an XGBoost Booster of the underlying model
- property intercept_: ndarray
Intercept (bias) property
Note
Intercept is defined only for linear learners
Intercept (bias) is only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
- Returns:
intercept_
- Return type:
array of shape
(1,)
or[n_classes]
- load_model(fname)
Load the model from a file or bytearray. The path to the file can be local or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.load_model("model.json") # or model.load_model("model.ubj")
- predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (da.Array | dd.DataFrame) – Data to predict with.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int | None) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (da.Array | dd.DataFrame | dd.Series | None) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open set) are used in this prediction.
New in version 1.4.0.
- Return type:
prediction
- predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)
Predict the probability of each X example being of a given class.
Note
This function is only thread safe for gbtree and dart.
- Parameters:
X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range (Tuple[int, int] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, then only the forests built during rounds [10, 20) (half-open set) are used in this prediction.
- Returns:
a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
- Return type:
prediction
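Continuing the same sketch, the returned probabilities have one column per class and each row sums to one:
proba = clf_full.predict_proba(X)
print(proba.shape)   # (n_samples, n_classes); here (100, 2) for binary data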
- save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
model.save_model("model.json") # or model.save_model("model.ubj")
- Parameters:
fname (string or os.PathLike) – Output file name
- Return type:
None
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type:
float
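And the mean accuracy, continuing the same illustrative sketch:
acc = clf_full.score(X, y)   # accuracy of clf_full.predict(X) w.r.t. y
print(acc)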
PySpark API
PySpark XGBoost integration interface
- class xgboost.spark.SparkXGBClassifier(**kwargs)
Bases: _SparkXGBEstimator, HasProbabilityCol, HasRawPredictionCol
SparkXGBClassifier is a PySpark ML estimator. It implements the XGBoost classification algorithm based on the XGBoost Python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator / TrainValidationSplit / OneVsRest.
SparkXGBClassifier automatically supports most of the parameters in the xgboost.XGBClassifier constructor and most of the parameters used in the xgboost.XGBClassifier fit and predict methods.
SparkXGBClassifier doesn't support setting gpu_id, but supports another param, use_gpu; see the doc below for more details.
SparkXGBClassifier doesn't support setting base_margin explicitly either, but supports another param called base_margin_col; see the doc below for more details.
SparkXGBClassifier doesn't support setting output_margin, but the output margin can be obtained from the raw prediction column. See the raw_prediction_col param doc below for more details.
SparkXGBClassifier doesn't support the validate_features and output_margin params.
SparkXGBClassifier doesn't support setting the nthread xgboost param; instead, the nthread param for each xgboost worker will be set equal to the spark.task.cpus config value.
- Parameters:
callbacks – The export and import of the callback functions are at best effort. For details, see the xgboost.spark.SparkXGBClassifier.callbacks param doc.
raw_prediction_col – The output_margin=True is implicitly supported by the rawPredictionCol output column, which is always returned with the predicted margin values.
validation_indicator_col – For params related to xgboost.XGBClassifier training with an evaluation dataset's supervision, set the xgboost.spark.SparkXGBClassifier.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBClassifier fit method.
weight_col – To specify the weight of the training and validation datasets, set the xgboost.spark.SparkXGBClassifier.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBClassifier fit method.
xgb_model – Set the value to be the instance returned by xgboost.spark.SparkXGBClassifierModel.get_booster().
num_workers – Integer that specifies the number of XGBoost workers to use. Each XGBoost worker corresponds to one spark task.
use_gpu – Boolean that specifies whether the executors are running on GPU instances.
base_margin_col – To specify the base margins of the training and validation datasets, set the xgboost.spark.SparkXGBClassifier.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBClassifier fit method. Note: this isn't available for distributed training.
Note – The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.
Note – This API is experimental.
Examples
>>> from xgboost.spark import SparkXGBClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
...     (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
...     (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), ),
... ], ["features"])
>>> xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0,
...     validation_indicator_col='isVal', weight_col='weight',
...     early_stopping_rounds=1, eval_metric='logloss')
>>> xgb_clf_model = xgb_classifier.fit(df_train)
>>> xgb_clf_model.transform(df_test).show()
- clear(param)
Clears a param from the param map if it has been explicitly set.
- Parameters:
param (Param) –
- Return type:
None
- copy(extra=None)
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
self (P) –
- Returns:
Copy of this instance
- Return type:
Params
- explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()
Returns the documentation of all params with their optional default values and user-supplied values.
- Return type:
str
- extractParamMap(extra=None)
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
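A hedged sketch of inspecting params on a SparkXGBClassifier (the variable names are illustrative):
from xgboost.spark import SparkXGBClassifier

spark_clf = SparkXGBClassifier(max_depth=5)
print(spark_clf.explainParams())          # one line of documentation per param
param_map = spark_clf.extractParamMap()   # {Param: value} with defaults and user-set values merged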
- fit(dataset, params=None)
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
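A hedged sketch of fit with a param-map override, reusing the df_train DataFrame and xgb_classifier estimator from the Examples above; the overridden param name is illustrative:
model = xgb_classifier.fit(df_train)
# Override an embedded Param for this call only, without mutating the estimator:
override = {xgb_classifier.getParam("featuresCol"): "features"}
model_2 = xgb_classifier.fit(df_train, override)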
- fitMultiple(dataset, paramMaps)
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
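fitMultiple is normally driven by tuning utilities such as CrossValidator; a hedged sketch of calling it directly, again reusing xgb_classifier and df_train from the Examples above with an illustrative single-entry param-map sequence:
param_maps = [{xgb_classifier.getParam("featuresCol"): "features"}]
index, fitted = next(xgb_classifier.fitMultiple(df_train, param_maps))
# fitted was trained with param_maps[index]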
- getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getValidationIndicatorCol()
Gets the value of validationIndicatorCol or its default value.
- Return type:
str
- hasDefault(param)
Checks whether a param has a default value.
- hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.
- isDefined(param)
Checks whether a param is explicitly set by user or has a default value.
- isSet(param)
Checks whether a param is explicitly set by user.
- classmethod load(path)
Reads an ML instance from the input path, a shortcut of read().load(path).
- Parameters:
path (str) –
- Return type:
RL
- property params: List[Param]
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- classmethod read()
Return the reader for loading the estimator.
- save(path)
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- Parameters:
path (str) –
- Return type:
None
- set(param, value)
Sets a parameter in the embedded param map.
- setParams(**kwargs)
Set params for the estimator.
- uid
A unique id for the object.
- write()
Return the writer for saving the estimator.
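A hedged sketch of persisting and restoring the estimator, reusing the SparkXGBClassifier import and xgb_classifier from the Examples above; the path is illustrative:
xgb_classifier.save("/tmp/spark_xgb_classifier")
restored = SparkXGBClassifier.load("/tmp/spark_xgb_classifier")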
- class xgboost.spark.SparkXGBClassifierModel(xgb_sklearn_model=None)
Bases: _SparkXGBModel, HasProbabilityCol, HasRawPredictionCol
The model returned by xgboost.spark.SparkXGBClassifier.fit().
Note
This API is experimental.
- clear(param)
Clears a param from the param map if it has been explicitly set.
- Parameters:
param (Param) –
- Return type:
None
- copy(extra=None)
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
self (P) –
- Returns:
Copy of this instance
- Return type:
Params
- explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()
Returns the documentation of all params with their optional default values and user-supplied values.
- Return type:
str
- extractParamMap(extra=None)
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getValidationIndicatorCol()
Gets the value of validationIndicatorCol or its default value.
- Return type:
str
- get_booster()
Return the xgboost.core.Booster instance.
- get_feature_importances(importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
- Parameters:
importance_type (str, default 'weight') – One of the importance types defined above.
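A hedged sketch, assuming xgb_clf_model is a fitted SparkXGBClassifierModel (see the Examples further above):
importances = xgb_clf_model.get_feature_importances(importance_type="gain")
print(importances)                       # dict mapping feature names to importance scores
booster = xgb_clf_model.get_booster()    # underlying xgboost.core.Booster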
- hasDefault(param)
Checks whether a param has a default value.
- hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.