# Categorical Data

Starting from version 1.5, XGBoost has experimental support for categorical data, available for public testing. At the moment, the support is implemented as one-hot-encoding-based categorical tree splits. For numerical data, the split condition is defined as $$value < threshold$$, while for categorical data it is defined as $$value == category$$, where category is a discrete value. More advanced categorical split strategies are planned for future releases. This tutorial details how to inform XGBoost about the data type. Note that training support is currently limited to the gpu_hist tree method.
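
The two split conditions can be illustrated with a small pure-Python sketch. The function names and sample data below are hypothetical and only mirror the conditions stated above; this is not XGBoost's internal implementation, and it makes no claim about which child node receives the matching rows.

```python
from typing import Callable, List, Tuple


def numerical_condition(value: float, threshold: float) -> bool:
    """Numerical split condition: value < threshold."""
    return value < threshold


def categorical_condition(value: int, category: int) -> bool:
    """One-hot style categorical split condition: value == category."""
    return value == category


def partition(values: List, condition: Callable[[object], bool]) -> Tuple[List, List]:
    """Separate values into those that satisfy the condition and the rest."""
    matched = [v for v in values if condition(v)]
    rest = [v for v in values if not condition(v)]
    return matched, rest


# Hypothetical sample data: a numerical feature and an integer-encoded categorical one.
ages = [18.0, 35.5, 42.0, 61.0]
colors = [0, 2, 1, 2, 0]

below, not_below = partition(ages, lambda v: numerical_condition(v, 40.0))
equal, not_equal = partition(colors, lambda v: categorical_condition(v, 2))

print(below, not_below)
print(equal, not_equal)
```

A one-hot style split isolates exactly one category per split, so separating several categories from the rest takes several splits.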

## Training with scikit-learn Interface

The easiest way to pass categorical data into XGBoost is to use a dataframe together with the scikit-learn interface, such as XGBClassifier. To prepare the data, users need to set the data type of each categorical input predictor to category. For a pandas/cudf DataFrame, this can be achieved with

# astype returns a new column, so assign the result back
X["cat_feature"] = X["cat_feature"].astype("category")


for all columns that represent categorical features. After that, users can tell XGBoost to enable training with categorical data. Assuming that you are using XGBClassifier for a classification problem, specify the parameter enable_categorical:

import xgboost as xgb

# Only gpu_hist is supported for categorical data as mentioned previously
clf = xgb.XGBClassifier(
    tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False
)
# X is the dataframe we created in the previous snippet; y holds the labels
clf.fit(X, y)
# Must use JSON for serialization, otherwise the categorical information is lost
clf.save_model("categorical-model.json")
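
The data-preparation step described earlier can be sketched for several columns at once. The example below assumes pandas is installed and uses a made-up dataframe with hypothetical column names; it only shows the dtype conversion and the integer codes pandas assigns, not XGBoost itself.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numerical column.
X = pd.DataFrame(
    {
        "color": ["red", "green", "blue", "green"],
        "size": ["S", "M", "L", "M"],
        "weight": [1.5, 2.0, 0.7, 1.1],
    }
)

# Convert every categorical column in one call; astype returns a new dataframe.
categorical_columns = ["color", "size"]
X = X.astype({name: "category" for name in categorical_columns})

print(X.dtypes)

# pandas stores each category as an integer code in [0, n_categories).
categories = list(X["color"].cat.categories)
codes = X["color"].cat.codes.tolist()
print(categories, codes)
```

Numerical columns such as weight keep their original dtype, so only the columns listed in categorical_columns are marked as categorical.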


Once training is finished, most other features can use the model. For instance, one can plot the model and calculate the global feature importance:

# Get a graph
graph = xgb.to_graphviz(clf, num_trees=1)
# Or get a matplotlib axis
ax = xgb.plot_tree(clf, num_trees=1)
# Get feature importances
clf.feature_importances_


The scikit-learn interface for dask is similar to the single-node version. The basic idea is to create a dataframe with category feature types and tell XGBoost to use gpu_hist with the parameter enable_categorical. See Getting started with categorical data for a worked example of using categorical data with the scikit-learn interface. A comparison between using one-hot encoded data and XGBoost’s categorical data support can be found in Train XGBoost with cat_in_the_dat dataset.

## Using the Native Interface

The scikit-learn interface is user friendly, but it lacks some features that are only available in the native interface. For instance, users cannot compute SHAP values directly or use a quantized DMatrix. The native interface also supports data types other than dataframes, such as numpy/cupy arrays. To use the native interface with categorical data, we need to pass similar parameters to DMatrix and the train function. For dataframe input:

# X is a dataframe we created in previous snippet
Xy = xgb.DMatrix(X, y, enable_categorical=True)
booster = xgb.train({"tree_method": "gpu_hist"}, Xy)
# Must use JSON for serialization, otherwise the information is lost
booster.save_model("categorical-model.json")


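SHAP value computation (pred_interactions=True returns SHAP interaction values; pass pred_contribs=True instead for per-feature SHAP values):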

SHAP = booster.predict(Xy, pred_interactions=True)

# categorical features are listed as "c"
print(booster.feature_types)


For other types of input, such as a numpy array, we can tell XGBoost about the feature types by using the feature_types parameter in DMatrix:

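# "q" marks a numerical feature, while "c" marks a categorical feature
ft = ["q", "c", "c"]
# X is a numpy array with three columns, in the same order as ft
Xy = xgb.DMatrix(X, y, feature_types=ft, enable_categorical=True)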

For numerical data, the feature type can be "q" or "float", while for categorical features it’s specified as "c". The Dask module in XGBoost has the same interface, so dask.Array can also be used for categorical data.

By default, XGBoost assumes input categories are integers from 0 up to, but not including, the number of categories: $$[0, n\_categories)$$. However, users might provide inputs with invalid values due to mistakes or missing values. An invalid value can be a negative value, an integer value that cannot be accurately represented by a 32-bit floating point number, or a value that is larger than the actual number of unique categories. Such values are validated during training, but during prediction they are treated the same as missing values for performance reasons. Lastly, missing values are handled the same way as for numerical features (using the learned split direction).
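
Because invalid categories are only rejected during training, it can help to check a column before passing it in. The following is a minimal sketch of such a check using only the standard library. The helper names and the sample column are hypothetical, and the logic simply follows the conditions listed above; it is not the validation XGBoost performs internally.

```python
import math
import struct


def fits_in_float32(value: float) -> bool:
    """True if the value is unchanged by a round trip through a 32-bit float."""
    try:
        return struct.unpack("f", struct.pack("f", float(value)))[0] == value
    except OverflowError:
        # Magnitude too large for a 32-bit float.
        return False


def classify_category(value, n_categories: int) -> str:
    """Label one category value according to the conditions in the text."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    if math.isinf(value) or value != int(value):
        return "not an integer"
    if value < 0:
        return "negative"
    if not fits_in_float32(value):
        return "not representable in float32"
    if value >= n_categories:
        # Valid codes lie in the half-open interval [0, n_categories).
        return "out of range"
    return "valid"


# Hypothetical integer-encoded column with three real categories: 0, 1, 2.
column = [0, 2, 1, -1, 3, float("nan"), 16777217]
labels = [classify_category(v, n_categories=3) for v in column]

for value, label in zip(column, labels):
    print(value, "->", label)
```

The value 16777217 (2^24 + 1) is used because it is the smallest positive integer that a 32-bit float cannot represent exactly.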