Introduction to Model IO
Since 2.1.0, the default model format for XGBoost is UBJSON. It is used when serializing models to a file, when serializing models to a buffer, and for memory snapshots (pickle and the like).
In XGBoost 1.0.0, we introduced support for using JSON to save and load XGBoost models and the related training hyper-parameters, aiming to replace the old internal binary format with an open format that can be easily reused. Later, in XGBoost 1.6.0, support for Universal Binary JSON was added as an optimization for more efficient model IO, and it became the default in 2.1.
JSON and UBJSON have the same document structure with different representations, and we will refer to them collectively as the JSON format. This tutorial aims to share some basic insights into the JSON serialisation method used in XGBoost. Unless explicitly mentioned otherwise, the following sections assume you are using one of these two output formats, which can be selected by giving the file name a .json (or .ubj for binary JSON) extension when saving or loading a model: booster.save_model('model.json'). More details below.
Before we get started, recall that XGBoost is a gradient boosting library focused on tree models, which means that inside XGBoost there are 2 distinct parts:
The model, consisting of trees, and
The hyperparameters and configurations used for building the model.
If you come from the deep learning community, the distinction should be familiar: it is the difference between a neural network structure, composed of weights with fixed tensor operations, and the optimizer (like RMSprop) used to train it.
So when one calls booster.save_model (xgb.save in R), XGBoost saves the trees, some model parameters like the number of input columns in the trained trees, and the objective function, which together represent the concept of “model” in XGBoost. We save the objective as part of the model because the objective controls the transformation of the global bias (called base_score in XGBoost) and carries task-specific information. Users can share this model with others for prediction or evaluation, or continue training with a different set of hyper-parameters.
However, this is not the end of the story. There are cases where we need to save something more than just the model itself. For example, in distributed training, XGBoost performs checkpointing operations. Or, for some reason, your favorite distributed computing framework may decide to copy the model from one worker to another and continue the training there. In such cases, the serialisation output is required to contain enough information to continue the previous training without the user providing any parameters again. We consider such a scenario a memory snapshot (or memory-based serialisation method) and distinguish it from the normal model IO operation. Currently, memory snapshots are used in the following places:
Python package: when the Booster object is pickled with the built-in pickle module.
R package: when the xgb.Booster object is persisted with the built-in functions saveRDS or save.
JVM packages: when the Booster object is serialized with the built-in function saveModel.
Other language bindings are still a work in progress.
Note
The old binary format doesn’t distinguish between the model and the raw memory serialisation format; it’s a mix of everything, which is part of the reason why we want to replace it with a more robust serialisation method. The JVM package has its own memory-based serialisation methods.
To enable JSON format support for model IO (saving only the trees and objective), provide a filename with .json or .ubj as the file extension; the latter is the extension for Universal Binary JSON:
bst.save_model('model_file_name.json')
xgb.save(bst, 'model_file_name.json')
val format = "json" // or val format = "ubj"
model.write.option("format", format).save("model_directory_path")
Note
Only load models from JSON files that were produced by XGBoost. Attempting to load JSON files that were produced by an external source may lead to undefined behaviors and crashes.
For memory snapshots, UBJSON has been the default since XGBoost 1.6. When loading the model back, XGBoost recognizes the file extensions .json and .ubj and dispatches accordingly. If the extension is not specified, XGBoost tries to guess the right one.
A note on backward compatibility of models and memory snapshots
We guarantee backward compatibility for models but not for memory snapshots.
Models (trees and objective) use a stable representation, so that models produced in earlier versions of XGBoost are accessible in later versions of XGBoost. If you’d like to store or archive your model for the long term, use save_model (Python) and xgb.save (R).
On the other hand, a memory snapshot (serialisation) captures many things internal to XGBoost, and its format is not stable and is subject to frequent changes. Therefore, memory snapshots are suitable for checkpointing only, where you persist the complete snapshot of the training configuration so that you can recover robustly from possible failures and resume the training process. Loading a memory snapshot generated by an earlier version of XGBoost may result in errors or undefined behavior.
If a model is persisted with pickle.dump (Python) or saveRDS (R), then the model may not be accessible in later versions of XGBoost.
Custom objective and metric
XGBoost accepts user-provided objective and metric functions as an extension. These functions are not saved in the model file, as they are language-dependent features. With Python, users can pickle the model to include these functions in the saved binary. One drawback is that the output from pickle is not a stable serialization format: it doesn’t work across different Python versions or XGBoost versions, not to mention different language environments. Another way to work around this limitation is to provide these functions again after the model is loaded. If the customized function is useful, please consider making a PR to implement it inside XGBoost, so that it works with the different language bindings.
Loading pickled file from different version of XGBoost
As noted, a pickled model is neither portable nor stable, but in some cases pickled models are valuable. One way to restore such a model in the future is to load it back with that specific version of Python and XGBoost, then export the model by calling save_model.
A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are able to install an older version of XGBoost using the remotes package:
library(remotes)
remotes::install_version("xgboost", "0.90.0.1") # Install version 0.90.0.1
Once the desired version is installed, you can load the RDS file with readRDS and recover the xgb.Booster object. Then call xgb.save to export the model using the stable representation.
Now you should be able to use the model in the latest version of XGBoost.
Saving and Loading the internal parameters configuration
XGBoost’s C API, Python API and R API support saving and loading the internal configuration directly as a JSON string. In the Python package:
bst = xgboost.train(...)
config = bst.save_config()
print(config)
or in R:
config <- xgb.config(bst)
print(config)
This will print out something similar to the following (not actual output, as it’s too long for demonstration):
{
"Learner": {
"generic_parameter": {
"device": "cuda:0",
"gpu_page_size": "0",
"n_jobs": "0",
"random_state": "0",
"seed": "0",
"seed_per_iteration": "0"
},
"gradient_booster": {
"gbtree_train_param": {
"num_parallel_tree": "1",
"process_type": "default",
"tree_method": "hist",
"updater": "grow_gpu_hist",
"updater_seq": "grow_gpu_hist"
},
"name": "gbtree",
"updater": {
"grow_gpu_hist": {
"gpu_hist_train_param": {
"debug_synchronize": "0"
},
"train_param": {
"alpha": "0",
"cache_opt": "1",
"colsample_bylevel": "1",
"colsample_bynode": "1",
"colsample_bytree": "1",
"default_direction": "learn",
...
"subsample": "1"
}
}
}
},
"learner_train_param": {
"booster": "gbtree",
"disable_default_eval_metric": "0",
"objective": "reg:squarederror"
},
"metrics": [],
"objective": {
"name": "reg:squarederror",
"reg_loss_param": {
"scale_pos_weight": "1"
}
}
},
"version": [1, 0, 0]
}
You can load it back into a model generated by the same version of XGBoost with:
bst.load_config(config)
This way users can study the internal representation more closely. Please note that some JSON generators use locale-dependent floating point serialization methods, which are not supported by XGBoost.
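One way to study such a configuration is to flatten it into dotted paths. The sketch below uses only the Python standard library and operates on an abbreviated excerpt of the illustrative configuration above, so its key names follow that excerpt and may differ from what a given XGBoost version actually emits; with a real booster, the string would come from bst.save_config(). The helper flatten is a hypothetical name introduced for this example.

```python
import json

# Abbreviated excerpt of the illustrative configuration shown above.
config_text = """
{
  "Learner": {
    "generic_parameter": {"device": "cuda:0", "n_jobs": "0", "seed": "0"},
    "learner_train_param": {"booster": "gbtree", "objective": "reg:squarederror"},
    "metrics": []
  },
  "version": [1, 0, 0]
}
"""


def flatten(node, prefix=""):
    """Yield (dotted_path, value) pairs for every non-container value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


config = json.loads(config_text)
flat = dict(flatten(config))
for path, value in flat.items():
    print(f"{path} = {value!r}")

# Parameter values are stored as strings, so convert them before comparing.
seed = int(flat["Learner.generic_parameter.seed"])
```

Because parameters are stored as strings, numeric comparisons need an explicit conversion, as in the last line.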
Difference between saving model and dumping model
XGBoost has a function called dump_model in the Booster object, which lets you export the model in a readable format like text, json or dot (graphviz). Its primary use case is model interpretation or visualization, and the output is not supposed to be loaded back into XGBoost. The JSON version has a schema. See the next section for more info.
JSON Schema
Another important feature of the JSON format is a documented schema, based on which one can easily reuse the output model from XGBoost. Here is the JSON schema for the output model (not the serialization, which will not be stable as noted above). For an example of parsing an XGBoost tree model, see /demo/json-model. Please note the “weight_drop” field used in the “dart” booster: XGBoost does not scale tree leaves directly; instead it saves the weights as a separate array.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {
"gbtree": {
"type": "object",
"properties": {
"name": {
"const": "gbtree"
},
"model": {
"type": "object",
"properties": {
"gbtree_model_param": {
"$ref": "#/definitions/gbtree_model_param"
},
"trees": {
"type": "array",
"items": {
"type": "object",
"properties": {
"tree_param": {
"$ref": "#/definitions/tree_param"
},
"id": {
"type": "integer"
},
"loss_changes": {
"type": "array",
"items": {
"type": "number"
}
},
"sum_hessian": {
"type": "array",
"items": {
"type": "number"
}
},
"base_weights": {
"type": "array",
"items": {
"type": "number"
}
},
"left_children": {
"type": "array",
"items": {
"type": "integer"
}
},
"right_children": {
"type": "array",
"items": {
"type": "integer"
}
},
"parents": {
"type": "array",
"items": {
"type": "integer"
}
},
"split_indices": {
"type": "array",
"items": {
"type": "integer"
}
},
"split_conditions": {
"type": "array",
"items": {
"type": "number"
}
},
"split_type": {
"type": "array",
"items": {
"type": "integer"
}
},
"default_left": {
"type": "array",
"items": {
"type": "integer"
}
},
"categories": {
"type": "array",
"items": {
"type": "integer"
}
},
"categories_nodes": {
"type": "array",
"items": {
"type": "integer"
}
},
"categories_segments": {
"type": "array",
"items": {
"type": "integer"
}
},
"categories_sizes": {
"type": "array",
"items": {
"type": "integer"
}
}
},
"required": [
"tree_param",
"loss_changes",
"sum_hessian",
"base_weights",
"left_children",
"right_children",
"parents",
"split_indices",
"split_conditions",
"default_left",
"categories",
"categories_nodes",
"categories_segments",
"categories_sizes"
]
}
},
"tree_info": {
"type": "array",
"items": {
"type": "integer"
}
}
},
"required": [
"gbtree_model_param",
"trees",
"tree_info"
]
}
},
"required": [
"name",
"model"
]
},
"gbtree_model_param": {
"type": "object",
"properties": {
"num_trees": {
"type": "string"
},
"num_parallel_tree": {
"type": "string"
}
},
"required": [
"num_trees",
"num_parallel_tree"
]
},
"tree_param": {
"type": "object",
"properties": {
"num_nodes": {
"type": "string"
},
"size_leaf_vector": {
"type": "string"
},
"num_feature": {
"type": "string"
}
},
"required": [
"num_nodes",
"num_feature",
"size_leaf_vector"
]
},
"reg_loss_param": {
"type": "object",
"properties": {
"scale_pos_weight": {
"type": "string"
}
}
},
"pseudo_huber_param": {
"type": "object",
"properties": {
"huber_slope": {
"type": "string"
}
}
},
"aft_loss_param": {
"type": "object",
"properties": {
"aft_loss_distribution": {
"type": "string"
},
"aft_loss_distribution_scale": {
"type": "string"
}
}
},
"softmax_multiclass_param": {
"type": "object",
"properties": {
"num_class": { "type": "string" }
}
},
"lambda_rank_param": {
"type": "object",
"properties": {
"num_pairsample": { "type": "string" },
"fix_list_weight": { "type": "string" }
}
},
"lambdarank_param": {
"type": "object",
"properties": {
"lambdarank_num_pair_per_sample": { "type": "string" },
"lambdarank_pair_method": { "type": "string" },
"lambdarank_unbiased": {"type": "string" },
"lambdarank_bias_norm": {"type": "string" },
"ndcg_exp_gain": {"type": "string"}
}
}
},
"type": "object",
"properties": {
"version": {
"type": "array",
"items": [
{
"type": "number",
"minimum": 1
},
{
"type": "number",
"minimum": 0
},
{
"type": "number",
"minimum": 0
}
],
"minItems": 3,
"maxItems": 3
},
"learner": {
"type": "object",
"properties": {
"feature_names": {
"type": "array",
"items": {
"type": "string"
}
},
"feature_types": {
"type": "array",
"items": {
"type": "string"
}
},
"gradient_booster": {
"oneOf": [
{
"$ref": "#/definitions/gbtree"
},
{
"type": "object",
"properties": {
"name": { "const": "gblinear" },
"model": {
"type": "object",
"properties": {
"weights": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
}
},
{
"type": "object",
"properties": {
"name": { "const": "dart" },
"gbtree": {
"$ref": "#/definitions/gbtree"
},
"weight_drop": {
"type": "array",
"items": {
"type": "number"
}
}
},
"required": [
"name",
"gbtree",
"weight_drop"
]
}
]
},
"objective": {
"oneOf": [
{
"type": "object",
"properties": {
"name": { "const": "reg:squarederror" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:pseudohubererror" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:squaredlogerror" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:linear" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:logistic" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "binary:logistic" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "binary:logitraw" },
"reg_loss_param": { "$ref": "#/definitions/reg_loss_param"}
},
"required": [
"name",
"reg_loss_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "count:poisson" },
"poisson_regression_param": {
"type": "object",
"properties": {
"max_delta_step": { "type": "string" }
}
}
},
"required": [
"name",
"poisson_regression_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:tweedie" },
"tweedie_regression_param": {
"type": "object",
"properties": {
"tweedie_variance_power": { "type": "string" }
}
}
},
"required": [
"name",
"tweedie_regression_param"
]
},
{
"properties": {
"name": {
"const": "reg:absoluteerror"
}
},
"type": "object"
},
{
"properties": {
"name": {
"const": "reg:quantileerror"
},
"quantile_loss_param": {
"type": "object",
"properties": {
"quantile_alpha": {"type": "array"}
}
}
},
"type": "object"
},
{
"type": "object",
"properties": {
"name": { "const": "survival:cox" }
},
"required": [ "name" ]
},
{
"type": "object",
"properties": {
"name": { "const": "reg:gamma" }
},
"required": [ "name" ]
},
{
"type": "object",
"properties": {
"name": { "const": "multi:softprob" },
"softmax_multiclass_param": { "$ref": "#/definitions/softmax_multiclass_param"}
},
"required": [
"name",
"softmax_multiclass_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "multi:softmax" },
"softmax_multiclass_param": { "$ref": "#/definitions/softmax_multiclass_param"}
},
"required": [
"name",
"softmax_multiclass_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "rank:pairwise" },
"lambdarank_param": { "$ref": "#/definitions/lambdarank_param"}
},
"required": [
"name",
"lambdarank_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "rank:ndcg" },
"lambdarank_param": { "$ref": "#/definitions/lambdarank_param"}
},
"required": [
"name",
"lambdarank_param"
]
},
{
"type": "object",
"properties": {
"name": { "const": "rank:map" },
"lambda_rank_param": { "$ref": "#/definitions/lambda_rank_param"}
},
"required": [
"name",
"lambda_rank_param"
]
},
{
"type": "object",
"properties": {
"name": {"const": "survival:aft"},
"aft_loss_param": { "$ref": "#/definitions/aft_loss_param"}
}
},
{
"type": "object",
"properties": {
"name": {"const": "binary:hinge"}
}
}
]
},
"learner_model_param": {
"type": "object",
"properties": {
"base_score": { "type": "string" },
"num_class": { "type": "string" },
"num_feature": { "type": "string" },
"num_target": { "type": "string" }
}
}
},
"required": [
"gradient_booster",
"objective"
]
}
},
"required": [
"version",
"learner"
]
}
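To show how the per-tree arrays in the schema fit together, here is a minimal, standard-library-only sketch that walks a single hand-written tree and combines tree outputs with “weight_drop” as a “dart” model would. It is not the library's implementation and rests on several assumptions that should be checked against /demo/json-model: a node is a leaf when its left child is -1, a leaf's value is stored in its split_conditions entry, a sample goes left when its feature value is less than the split condition, and missing values follow default_left. Categorical splits, base_score and the objective's output transformation are ignored. The names predict_tree and predict_dart and all numbers are made up.

```python
import math

# Hand-written stand-in for one entry of "trees"; the numbers are made up.
#   node 0: x[0] < 0.5 ?  yes -> node 1 (leaf 1.0), no -> node 2 (leaf -1.0)
tree = {
    "left_children": [1, -1, -1],
    "right_children": [2, -1, -1],
    "split_indices": [0, 0, 0],
    "split_conditions": [0.5, 1.0, -1.0],
    "default_left": [1, 0, 0],
}


def predict_tree(tree, x):
    """Walk one tree for a single sample and return the value at the leaf reached."""
    node = 0
    while tree["left_children"][node] != -1:  # assumption: -1 marks a leaf
        value = x[tree["split_indices"][node]]
        if math.isnan(value):
            # Assumption: missing values follow the default direction.
            go_left = bool(tree["default_left"][node])
        else:
            go_left = value < tree["split_conditions"][node]
        children = tree["left_children"] if go_left else tree["right_children"]
        node = children[node]
    # Assumption: for a leaf, split_conditions holds the leaf value.
    return tree["split_conditions"][node]


def predict_dart(trees, weight_drop, x):
    """Sum per-tree outputs, each scaled by its entry in weight_drop."""
    return sum(w * predict_tree(t, x) for t, w in zip(trees, weight_drop))


left = predict_tree(tree, [0.2])
right = predict_tree(tree, [0.9])
missing = predict_tree(tree, [float("nan")])
scaled = predict_dart([tree, tree], [1.0, 0.5], [0.2])
print(left, right, missing, scaled)
```

Keeping the scaling in a separate array means the stored trees stay untouched, which matches the remark above that XGBoost does not scale the leaves directly.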