Create an xgb.QuantileDMatrix object (the exact same class as would be returned by calling xgb.QuantileDMatrix(), with the same advantages and limitations) from external data supplied by an xgb.DataIter(), potentially passed in batches from a bigger set that might not fit entirely in memory, in the same way as xgb.ExtMemDMatrix().
Note that, while external data is only loaded through the iterator (so the full data need not be held in memory at once), the quantized representation of the data is created in memory, concatenated from multiple calls to the data iterator. The quantized version is typically lighter than the original data, so this representation may fit in memory even when the full data does not.
For more information, see the guide 'Using XGBoost External Memory Version': https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html
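The batch-wise construction described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the `mtcars` data, the batch size of 10, and the names `iterator_env`, `iterator_next`, `iterator_reset` and `batch_rows` are all assumptions chosen for the example, and the batches are sliced from an in-memory matrix to stand in for chunks that would normally be read from disk.

```r
library(xgboost)

# Small in-memory dataset standing in for data too large to load at once.
data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars[, 1]

# Row indices of each batch (illustrative batch size of 10 rows).
batch_rows <- split(seq_len(nrow(x)), ceiling(seq_len(nrow(x)) / 10))

# Environment shared between calls to the iterator callbacks.
iterator_env <- new.env()
iterator_env$iter <- 0L
iterator_env$batch_rows <- batch_rows

# Returns the next batch wrapped in xgb.DataBatch(), or NULL when exhausted.
iterator_next <- function(env) {
  if (env$iter >= length(env$batch_rows)) {
    return(NULL)
  }
  env$iter <- env$iter + 1L
  rows <- env$batch_rows[[env$iter]]
  xgb.DataBatch(data = x[rows, , drop = FALSE], label = y[rows])
}

# Rewinds the iterator so the same batches can be read again in order.
iterator_reset <- function(env) {
  env$iter <- 0L
}

data_iterator <- xgb.DataIter(
  env = iterator_env,
  f_next = iterator_next,
  f_reset = iterator_reset
)

dtrain <- xgb.QuantileDMatrix.from_iterator(
  data_iterator,
  nthread = 1,
  max_bin = 64
)
```

In a real external-data setting, `iterator_next` would presumably read each batch from a file or database rather than subset an existing matrix; only the quantized result is then kept in memory.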
Usage
xgb.QuantileDMatrix.from_iterator(
  data_iterator,
  missing = NA,
  nthread = NULL,
  ref = NULL,
  max_bin = NULL
)
Arguments
- data_iterator
A data iterator structure as returned by xgb.DataIter(), which includes an environment shared between function calls, and functions to access the data in batches on-demand.
- missing
A float value that represents missing values in the data.
Note that, while functions like xgb.DMatrix() can take a generic NA and interpret it correctly for different types like numeric and integer, if an NA value is passed here, it will not be adapted for different input types.
For example, in R integer types, missing values are represented by the integer number -2147483648 (since machine 'integer' types do not have an inherent 'NA' value); hence, if one passes NA, which is interpreted as a floating-point NaN by xgb.ExtMemDMatrix() and by xgb.QuantileDMatrix.from_iterator(), these integer missing values will not be treated as missing. This should not pose any problem for numeric types, since they do have an inherent NaN value.
- nthread
Number of threads used for creating DMatrix.
- ref
The training dataset that provides quantile information, needed when creating a validation/test dataset with xgb.QuantileDMatrix(). Supplying the training DMatrix as a reference means that the same quantisation applied to the training data is applied to the validation/test data.
- max_bin
The number of histogram bins; this should be consistent with the training parameter max_bin. This is only supported when constructing a QuantileDMatrix.
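As an illustration of the `missing` and `ref` arguments, the sketch below builds a training matrix and then a validation matrix that reuses the training quantisation. It rests on several assumptions: `make_batch_iterator` is a hypothetical helper written for this example (it is not part of xgboost), the integer-valued `mtcars` columns and the 24/8 row split are arbitrary, and converting each batch to double storage is just one possible way to turn integer NAs into floating-point NA values that the default `missing = NA` should then recognise.

```r
library(xgboost)

# Hypothetical helper (not part of xgboost): wraps an in-memory matrix
# in an xgb.DataIter() that yields it in row batches.
make_batch_iterator <- function(x, y, batch_size = 10L) {
  env <- new.env()
  env$iter <- 0L
  env$rows <- split(seq_len(nrow(x)), ceiling(seq_len(nrow(x)) / batch_size))
  f_next <- function(env) {
    if (env$iter >= length(env$rows)) {
      return(NULL)
    }
    env$iter <- env$iter + 1L
    rows <- env$rows[[env$iter]]
    batch <- x[rows, , drop = FALSE]
    # Convert integer storage to double so integer NAs become NA_real_.
    storage.mode(batch) <- "double"
    xgb.DataBatch(data = batch, label = y[rows])
  }
  f_reset <- function(env) {
    env$iter <- 0L
  }
  xgb.DataIter(env = env, f_next = f_next, f_reset = f_reset)
}

# Integer-typed features with one missing entry.
data(mtcars)
x <- as.matrix(mtcars[, c("cyl", "hp", "gear", "carb")])
storage.mode(x) <- "integer"
x[1, 1] <- NA
y <- mtcars$mpg

train_idx <- 1:24
valid_idx <- 25:32

dtrain <- xgb.QuantileDMatrix.from_iterator(
  make_batch_iterator(x[train_idx, , drop = FALSE], y[train_idx]),
  nthread = 1
)

# Pass the training matrix as `ref` so the validation data is
# quantised with the same bin boundaries.
dvalid <- xgb.QuantileDMatrix.from_iterator(
  make_batch_iterator(x[valid_idx, , drop = FALSE], y[valid_idx]),
  nthread = 1,
  ref = dtrain
)
```

Doing the type conversion inside the iterator keeps the original integer data untouched while ensuring every batch handed to xgb.DataBatch() uses a type with a native NaN representation.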