Create an xgb.QuantileDMatrix object (the exact same class as would be returned by calling xgb.QuantileDMatrix(), with the same advantages and limitations) from external data supplied by an xgb.DataIter() object. The data can be passed in batches from a larger set that might not fit entirely in memory, in the same way as with xgb.ExtMemDMatrix().

Note that, while external data will only be loaded through the iterator (thus the full data might not be held entirely in memory), the quantized representation of the data is created in memory, being concatenated from multiple calls to the data iterator. The quantized version is typically lighter than the original data, so this representation might fit in memory even when the full data does not.

For more information, see the guide 'Using XGBoost External Memory Version': https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html

Usage

xgb.QuantileDMatrix.from_iterator(
  data_iterator,
  missing = NA,
  nthread = NULL,
  ref = NULL,
  max_bin = NULL
)

Arguments

data_iterator

A data iterator structure as returned by xgb.DataIter(), which includes an environment shared between function calls, and functions to access the data in batches on-demand.
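
As a minimal sketch of such a structure, assuming a recent version of the xgboost R package is installed: the shared environment tracks the current batch number, the "next" function returns one batch per call through xgb.DataBatch() (and NULL once the data is exhausted), and the "reset" function rewinds the iterator. The two-batch split of `mtcars` and the names `iterator_env`, `iterator_next` and `iterator_reset` are illustrative choices only.

```r
library(xgboost)

data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars[, 1]

# Shared state: this environment is passed to both functions on every call.
iterator_env <- as.environment(list(iter = 0L, x = x, y = y))

# Return the next batch, or NULL once both batches have been served.
iterator_next <- function(iterator_env) {
  curr <- iterator_env[["iter"]]
  if (curr >= 2L) {
    return(NULL)
  }
  rows <- if (curr == 0L) 1:16 else 17:32
  iterator_env[["iter"]] <- curr + 1L
  xgb.DataBatch(
    data = iterator_env[["x"]][rows, , drop = FALSE],
    label = iterator_env[["y"]][rows]
  )
}

# Move the iterator back to its first batch.
iterator_reset <- function(iterator_env) {
  iterator_env[["iter"]] <- 0L
  invisible(NULL)
}

data_iterator <- xgb.DataIter(
  env = iterator_env,
  f_next = iterator_next,
  f_reset = iterator_reset
)

dm <- xgb.QuantileDMatrix.from_iterator(data_iterator, nthread = 1)
```

Here the data is already in memory, so batching brings no benefit; in a real use case each call to the "next" function would typically read one chunk from disk or another external source.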

missing

A float value that represents missing values in the data.

Note that, while functions like xgb.DMatrix() can take a generic NA and interpret it correctly for different types like numeric and integer, if an NA value is passed here, it will not be adapted for different input types.

For example, R integer types represent missing values with the integer -2147483648 (since machine 'integer' types do not have an inherent 'NA' value). Hence, if one passes NA, which xgb.ExtMemDMatrix() and xgb.QuantileDMatrix.from_iterator() interpret as a floating-point NaN, these integer missing values will not be treated as missing. This should not pose any problem for numeric types, since they do have an inherent NaN value.
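
One possible way to sidestep this, sketched here in base R only and assuming a conversion to floating point is acceptable for the data at hand, is to coerce integer batches to numeric before handing them to xgb.DataBatch(). The coercion turns the integer NA into NA_real_, which is a NaN at the floating-point level. The matrix `batch_int` is a made-up example.

```r
# An integer batch with one missing entry (stored internally as -2147483648).
batch_int <- matrix(c(1L, NA_integer_, 3L, 4L), nrow = 2)

# Coerce a copy to double: the integer NA becomes NA_real_.
batch_num <- batch_int
storage.mode(batch_num) <- "double"

is_double <- is.double(batch_num)
na_preserved <- is.na(batch_num[2, 1])
```

The coerced `batch_num` is what would then be passed as the `data` argument inside the iterator's "next" function.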

nthread

Number of threads used for creating the DMatrix.

ref

The training dataset that provides quantile information, needed when creating a validation/test dataset with xgb.QuantileDMatrix(). Supplying the training DMatrix as a reference means that the same quantisation applied to the training data is applied to the validation/test data.
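
The following sketch illustrates the idea, assuming a recent version of the xgboost R package is installed. The helper `make_iterator()` is hypothetical (it is not part of xgboost) and simply wraps an in-memory matrix as a two-batch iterator; the 24/8 row split of `mtcars` is likewise arbitrary.

```r
library(xgboost)

# Hypothetical helper: serve an in-memory matrix as two batches.
make_iterator <- function(x, y) {
  half <- nrow(x) %/% 2L
  batches <- list(seq_len(half), seq(half + 1L, nrow(x)))
  xgb.DataIter(
    env = as.environment(list(iter = 0L)),
    f_next = function(iterator_env) {
      i <- iterator_env[["iter"]] + 1L
      if (i > length(batches)) {
        return(NULL)
      }
      iterator_env[["iter"]] <- i
      rows <- batches[[i]]
      xgb.DataBatch(data = x[rows, , drop = FALSE], label = y[rows])
    },
    f_reset = function(iterator_env) {
      iterator_env[["iter"]] <- 0L
      invisible(NULL)
    }
  )
}

data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars[, 1]
train_rows <- 1:24
valid_rows <- 25:32

# Quantile cuts are computed from the training data.
dtrain <- xgb.QuantileDMatrix.from_iterator(
  make_iterator(x[train_rows, ], y[train_rows]),
  nthread = 1
)

# The validation data reuses the training cuts through 'ref'.
dvalid <- xgb.QuantileDMatrix.from_iterator(
  make_iterator(x[valid_rows, ], y[valid_rows]),
  nthread = 1,
  ref = dtrain
)
```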

max_bin

The number of histogram bins, which should be consistent with the training parameter max_bin.

This is only supported when constructing a QuantileDMatrix.

Value

An 'xgb.DMatrix' object, with subclass 'xgb.QuantileDMatrix'.