Construct an 'xgb.DMatrix' object from a given data source, which can then be passed to functions
such as xgb.train() or predict().
Usage
xgb.DMatrix(
  data,
  label = NULL,
  weight = NULL,
  base_margin = NULL,
  missing = NA,
  silent = FALSE,
  feature_names = colnames(data),
  feature_types = NULL,
  nthread = NULL,
  group = NULL,
  qid = NULL,
  label_lower_bound = NULL,
  label_upper_bound = NULL,
  feature_weights = NULL,
  data_split_mode = "row",
  ...
)
xgb.QuantileDMatrix(
  data,
  label = NULL,
  weight = NULL,
  base_margin = NULL,
  missing = NA,
  feature_names = colnames(data),
  feature_types = NULL,
  nthread = NULL,
  group = NULL,
  qid = NULL,
  label_lower_bound = NULL,
  label_upper_bound = NULL,
  feature_weights = NULL,
  ref = NULL,
  max_bin = NULL
)
Arguments
- data
Data from which to create a DMatrix, which can then be used for fitting models or for getting predictions out of a fitted model.
Supported input types are as follows:
  - matrix objects, with types numeric, integer, or logical.
  - data.frame objects, with columns of types numeric, integer, logical, or factor.
    Note that xgboost uses base-0 encoding for categorical types, hence factor types (which use base-1 encoding) will be converted inside the function call. Be aware that the encoding used for factor types is not kept as part of the model, so in subsequent calls to predict, it is the user's responsibility to ensure that factor columns have the same levels as the ones from which the DMatrix was constructed.
    Other column types are not supported.
  - CSR matrices, as class dgRMatrix from package Matrix.
  - CSC matrices, as class dgCMatrix from package Matrix.
    These are not supported by xgb.QuantileDMatrix.
  - XGBoost's own binary format for DMatrices, as produced by xgb.DMatrix.save().
  - Single-row CSR matrices, as class dsparseVector from package Matrix, which is interpreted as a single row (only when making predictions from a fitted model).
- label
Label of the training data. For classification problems, should be passed encoded as integers with numeration starting at zero.
- weight
Weight for each instance.
Note that, for ranking tasks, weights are per-group: one weight is assigned to each group (not to each data point). This is because only the relative ordering of data points within each group matters, so it doesn't make sense to assign weights to individual data points.
- base_margin
Base margin used for boosting from an existing model.
In the case of multi-output models, one can also pass multi-dimensional base_margin.
- missing
A float value that represents missing values in data (not used when creating a DMatrix from text files). It is useful to change this when a zero, infinite, or some other extreme value represents missing values in data.
- silent
Whether to suppress printing an informational message after loading from a file.
- feature_names
Set names for features. Overrides column names in data frame and matrix.
Note: columns are not referenced by name when calling predict, so the column order there must be the same as in the DMatrix construction, regardless of the column names.
- feature_types
Set types for features.
If data is a data.frame and feature_types is not supplied, feature types will be deduced automatically from the column types.
Otherwise, one can pass a character vector with the same length as the number of columns in data, with the following possible values:
  - "c", which represents categorical columns.
  - "q", which represents numeric columns.
  - "int", which represents integer columns.
  - "i", which represents logical (boolean) columns.
Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.
Important: Categorical features, if specified manually through feature_types, must be encoded as integers with numeration starting at zero, and the same encoding needs to be applied when passing data to predict(). Even if passing factor types, the encoding will not be saved, so make sure that factor columns passed to predict have the same levels.
- nthread
Number of threads used for creating DMatrix.
- group
Group sizes for all ranking groups.
- qid
Query ID for data samples, used for ranking.
- label_lower_bound
Lower bound for survival training.
- label_upper_bound
Upper bound for survival training.
- feature_weights
Set feature weights for column sampling.
- data_split_mode
Not used yet. This parameter is for distributed training, which is not yet available for the R package.
- ...
Not used.
Some arguments that were part of this function in previous XGBoost versions are currently deprecated or have been renamed. If a deprecated or renamed argument is passed, the function will throw a warning (by default) and use its current equivalent instead. This warning will become an error if using the 'strict mode' option.
If some additional argument is passed that is neither a current function argument nor a deprecated or renamed argument, a warning or error will be thrown depending on the 'strict mode' option.
Important: ... will be removed in a future version, and all the current deprecation warnings will become errors. Please use only arguments that form part of the function signature.
- ref
The training dataset that provides quantile information, needed when creating a validation/test dataset with xgb.QuantileDMatrix(). Supplying the training DMatrix as a reference means that the same quantization applied to the training data is applied to the validation/test data.
- max_bin
The number of histogram bins, which should be consistent with the training parameter max_bin.
This is only supported when constructing a QuantileDMatrix.
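The notes on factor encoding under data and feature_types can be illustrated with base R alone. The sketch below is only an illustration of base-1 versus base-0 codes and of why matching levels matter; it assumes the internal conversion is equivalent to subtracting one from the factor codes, and the column values are made up for the example.

```r
# Levels seen when the training data was prepared (factor() sorts them).
train_col <- factor(c("red", "green", "blue", "green"))
levels(train_col)              # "blue" "green" "red"

# R stores factors as base-1 integer codes; subtracting 1 gives base-0 codes.
as.integer(train_col) - 1L     # 2 1 0 1

# New data that lacks a level gets different codes for the same values.
new_raw <- c("red", "green")
mismatched <- factor(new_raw)
as.integer(mismatched) - 1L    # 1 0  ("red" was 2 in the training data)

# Reusing the training levels keeps the codes consistent.
aligned <- factor(new_raw, levels = levels(train_col))
as.integer(aligned) - 1L       # 2 1
```

Applying the training levels to every factor column before calling predict is one way to meet the requirement that levels match.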
Value
An 'xgb.DMatrix' object. If calling xgb.QuantileDMatrix, it will have additional
subclass xgb.QuantileDMatrix.
Details
Function xgb.QuantileDMatrix() will construct a DMatrix with quantization for the histogram
method already applied to it, which can be used to reduce memory usage (compared to using
a regular DMatrix first and then creating a quantization out of it) when using the histogram
method (tree_method = "hist", which is the default algorithm), but is not usable for the
sorted-indices method (tree_method = "exact"), nor for the approximate method
(tree_method = "approx").
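As a minimal sketch of the quantized workflow described above, the following builds a training QuantileDMatrix and a validation one that reuses its quantization through ref. The data is synthetic and the variable names are illustrative; dense matrices are used, and max_bin is left at its default in both calls so the two stay consistent.

```r
library(xgboost)

set.seed(123)
# Synthetic dense data (illustrative only).
x_train <- matrix(rnorm(200 * 4), nrow = 200)
y_train <- rnorm(200)
x_valid <- matrix(rnorm(50 * 4), nrow = 50)
y_valid <- rnorm(50)

# Quantization is computed from the training data.
dtrain <- xgb.QuantileDMatrix(x_train, label = y_train, nthread = 1)

# The validation set reuses the training quantization via 'ref'.
dvalid <- xgb.QuantileDMatrix(x_valid, label = y_valid, nthread = 1, ref = dtrain)

inherits(dtrain, "xgb.QuantileDMatrix")
dim(dvalid)
```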
Note that DMatrix objects are not serializable through R functions such as saveRDS() or save().
If a DMatrix gets serialized and then de-serialized (for example, when saving data in an R session or caching
chunks in an Rmd file), the resulting object will not be usable anymore and will need to be reconstructed
from the original source of data.
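One possible way to work around this, sketched here under the assumption that the raw data fits comfortably on disk, is to serialize the original data rather than the DMatrix and rebuild the DMatrix after loading. The helper make_dtrain() and the file name are hypothetical names introduced for this example.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")

# Hypothetical helper: rebuilds the DMatrix from the raw list(data, label).
make_dtrain <- function(src) {
  xgb.DMatrix(src$data, label = src$label, nthread = 1)
}

# Serialize the raw source data (not the DMatrix) with base R.
raw_path <- file.path(tempdir(), "agaricus_train.rds")
saveRDS(agaricus.train, raw_path)

# Later, or in another session: reload the raw data and reconstruct.
restored <- readRDS(raw_path)
dtrain <- make_dtrain(restored)
dim(dtrain)
```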
Examples
data(agaricus.train, package = "xgboost")
## Keep the number of threads to 1 for examples
nthread <- 1
data.table::setDTthreads(nthread)
## Build a DMatrix from the feature matrix and its labels
dtrain <- with(
  agaricus.train, xgb.DMatrix(data, label = label, nthread = nthread)
)
## Save in XGBoost's binary format, then load it back from the file
fname <- file.path(tempdir(), "xgb.DMatrix.data")
xgb.DMatrix.save(dtrain, fname)
dtrain <- xgb.DMatrix(fname, nthread = nthread)
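As a further sketch, the missing argument can be used when a sentinel value stands in for missing entries. The value -999 and the small matrix below are made up for illustration.

```r
library(xgboost)

# Dense matrix in which -999 marks a missing entry (illustrative data).
x <- matrix(c(1.5, -999, 3.0,
              0.2, 0.7, -999), nrow = 3)
y <- c(0, 1, 0)

# Tell the constructor to treat -999 as missing instead of the default NA.
d <- xgb.DMatrix(x, label = y, missing = -999, nthread = 1)

dim(d)
getinfo(d, "label")
```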