Quantile DMatrix and external memory DMatrix can be created from batches of data.
int | XGDMatrixCreateFromDataIter (DataIterHandle data_handle, XGBCallbackDataIterNext *callback, const char *cache_info, float missing, DMatrixHandle *out) | Create a DMatrix from a data iterator. |
int | XGProxyDMatrixCreate (DMatrixHandle *out) | Create a DMatrix proxy for setting data, can be freed by XGDMatrixFree. |
int | XGDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out) | Create an external memory DMatrix with data iterator. |
int | XGQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterHandle ref, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out) | Create a Quantile DMatrix with a data iterator. |
int | XGExtMemQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterHandle ref, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out) | Create a Quantile DMatrix backed by external memory. |
int | XGProxyDMatrixSetDataCudaArrayInterface (DMatrixHandle handle, const char *data) | Set data on a DMatrix proxy. |
int | XGProxyDMatrixSetDataColumnar (DMatrixHandle handle, char const *data) | Set columnar (table) data on a DMatrix proxy. |
int | XGProxyDMatrixSetDataCudaColumnar (DMatrixHandle handle, const char *data) | Set CUDA-based columnar (table) data on a DMatrix proxy. |
int | XGProxyDMatrixSetDataDense (DMatrixHandle handle, char const *data) | Set data on a DMatrix proxy. |
int | XGProxyDMatrixSetDataCSR (DMatrixHandle handle, char const *indptr, char const *indices, char const *data, bst_ulong ncol) | Set data on a DMatrix proxy. |
There are two sets of data callbacks for DMatrix. The first is currently used exclusively by the JVM packages. It uses XGBoostBatchCSR to accept batches of CSR-formatted input and concatenates them into one final CSR matrix. The related functions are:
- XGBCallbackSetData
- XGBCallbackDataIterNext
- XGDMatrixCreateFromDataIter
The second set is used by the external data iterator. It accepts foreign data iterators as callbacks. There are two scenarios in which users might want to pass callbacks instead of raw data. The first is the Quantile DMatrix used by the hist and GPU-based hist tree methods. In this case, the data is first compressed by quantile sketching and then merged. This is particularly useful in distributed settings because it eliminates two copies of the data: one made when an external library concatenates the data into a single blob for normal DMatrix initialization, and another made by the internal CSR copy of the DMatrix.
The second use case is external memory support, where users pass a custom data iterator into XGBoost to load data in batches. In both cases, the iterator is used only during the construction of the DMatrix and can be safely freed after construction finishes. Each DMatrix factory function carries short notes on its use case.
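Related functions are:
Factory functions
- XGDMatrixCreateFromCallback for external memory
- XGQuantileDMatrixCreateFromCallback for Quantile DMatrix
- XGExtMemQuantileDMatrixCreateFromCallback for external memory Quantile DMatrix
Proxy that callers can use to pass data to XGBoost
- XGProxyDMatrixCreate
- XGDMatrixCallbackNext
- DataIterResetCallback
- XGProxyDMatrixSetDataCudaArrayInterface
- XGProxyDMatrixSetDataCudaColumnar
- XGProxyDMatrixSetDataColumnar
- XGProxyDMatrixSetDataDense
- XGProxyDMatrixSetDataCSR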
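The reset/next protocol of the second callback set can be sketched without linking XGBoost. The typedefs below are local stand-ins that mirror the signatures listed on this page (real code would include the XGBoost C API header instead), and BatchIter and consume_all are hypothetical names that only imitate how a consumer might drive the callbacks. The sketch assumes that next returns 1 after supplying a batch and 0 once the data is exhausted; confirm that convention against the header and the external_memory.c example before relying on it.

```c
#include <stddef.h>

/* Local stand-ins mirroring the signatures on this page; real code would
 * include the XGBoost C API header instead. */
typedef void *DataIterHandle;
typedef void DataIterResetCallback(DataIterHandle handle);
typedef int XGDMatrixCallbackNext(DataIterHandle iter);

/* Hypothetical user-defined iterator over a fixed number of batches. */
typedef struct {
  size_t n_batches; /* total number of batches */
  size_t cursor;    /* index of the next batch to hand out */
} BatchIter;

/* reset: rewind to the first batch. */
static void batch_iter_reset(DataIterHandle handle) {
  BatchIter *self = (BatchIter *)handle;
  self->cursor = 0;
}

/* next: assumed convention is 1 = a batch was supplied, 0 = exhausted. */
static int batch_iter_next(DataIterHandle handle) {
  BatchIter *self = (BatchIter *)handle;
  if (self->cursor == self->n_batches) {
    return 0;
  }
  /* Real code would call one of the XGProxyDMatrixSetData* setters here. */
  self->cursor++;
  return 1;
}

/* Hypothetical consumer: one full pass over the iterator. */
static size_t consume_all(DataIterHandle iter, DataIterResetCallback *reset,
                          XGDMatrixCallbackNext *next) {
  size_t seen = 0;
  reset(iter);
  while (next(iter) != 0) {
    seen++;
  }
  return seen;
}
```

Because the consumer may traverse the data more than once, reset has to restore the iterator to a state from which next yields the same sequence of batches again.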
◆ DataHolderHandle
Handle to an internal data holder.
◆ DataIterHandle
Handle to an external data iterator.
◆ DataIterResetCallback
Callback function prototype for resetting the external iterator.
◆ XGBCallbackDataIterNext
The data reading callback function. Each call can yield one batch, that is, a subset of the data.
If data is available, the function calls set_function to set the data.
- Parameters
-
data_handle | The handle to the callback. |
set_function | The callback used to set the batch returned by the iterator. |
set_function_handle | The handle to be passed to set function. |
- Returns
- 0 if the end of the data is reached and no batch is returned.
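The push-style flow described above, where the reading callback hands each batch to set_function, can be sketched as follows. This is a minimal illustration, not the JVM packages' implementation: DemoBatch is a hypothetical placeholder for XGBoostBatchCSR that carries only a row count, and every Demo* or demo_* name is made up for this sketch. The non-zero return value for "a batch was set" is an assumption; the text above only fixes 0 as the end-of-data value.

```c
#include <stddef.h>

typedef void *DataIterHandle;   /* stand-in for the real typedef */
typedef void *DataHolderHandle; /* stand-in for the real typedef */

/* Hypothetical placeholder for XGBoostBatchCSR: only a row count. */
typedef struct {
  size_t size;
} DemoBatch;

/* Stand-ins shaped like XGBCallbackSetData and XGBCallbackDataIterNext. */
typedef int DemoSetData(DataHolderHandle handle, DemoBatch batch);
typedef int DemoDataIterNext(DataIterHandle data_handle, DemoSetData *set_function,
                             DataHolderHandle set_function_handle);

/* Hypothetical source: `remaining` batches of `rows_per_batch` rows each. */
typedef struct {
  size_t remaining;
  size_t rows_per_batch;
} DemoSource;

/* Reading callback: push one batch through set_function, or return 0 at
 * the end without setting anything. */
static int demo_next(DataIterHandle data_handle, DemoSetData *set_function,
                     DataHolderHandle set_function_handle) {
  DemoSource *src = (DemoSource *)data_handle;
  DemoBatch batch;
  if (src->remaining == 0) {
    return 0;
  }
  batch.size = src->rows_per_batch;
  src->remaining--;
  /* The setter's return value is ignored in this sketch. */
  set_function(set_function_handle, batch);
  return 1;
}

/* Hypothetical receiver that only accumulates the number of rows. */
static int demo_set_data(DataHolderHandle handle, DemoBatch batch) {
  size_t *total_rows = (size_t *)handle;
  *total_rows += batch.size;
  return 0;
}

/* Drain the source the way a consumer of the callback might. */
static size_t demo_drain(DemoSource *src) {
  DemoDataIterNext *next = demo_next;
  size_t total_rows = 0;
  while (next(src, demo_set_data, &total_rows) != 0) {
  }
  return total_rows;
}
```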
◆ XGBCallbackSetData
Callback to set the data on the handle.
- Parameters
-
handle | The handle to the callback. |
batch | The data content to be set. |
◆ XGDMatrixCallbackNext
Callback function prototype for getting next batch of data.
- Parameters
-
iter | A handle to the user-defined iterator. |
- Returns
- 0 when success, -1 when failure happens.
◆ XGDMatrixCreateFromCallback()
Create an external memory DMatrix with data iterator.
Short note on how to use the second set of callbacks for external memory data support:
- Step 0: Define a data iterator with 2 methods, reset and next.
- Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
- Step 2: Pass the iterator handle, the proxy handle, and the 2 methods into XGDMatrixCreateFromCallback, along with other parameters encoded as a JSON object.
- Step 3: Call the appropriate data setters in the next function.
- Parameters
-
iter | A handle to the external data iterator. |
proxy | A DMatrix proxy handle created by XGProxyDMatrixCreate. |
reset | Callback function resetting the iterator state. |
next | Callback function yielding the next batch of data. |
config | JSON encoded parameters for DMatrix construction. Accepted fields are:
- missing: The value that represents a missing value.
- cache_prefix: The path of the cache file. The caller must create all the directories in this path.
- nthread (optional): Number of threads used for initializing DMatrix.
|
[out] out | The created external memory DMatrix. |
- Returns
- 0 when success, -1 when failure happens
- Examples
- external_memory.c.
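The config argument is an ordinary JSON string, so it can be assembled with snprintf. The helper below is a hypothetical sketch, not part of the XGBoost API. It assumes a finite missing value (%g would print a NaN in a form the JSON parser may not accept) and a cache_prefix without characters that need JSON escaping. A non-positive nthread leaves the optional field out.

```c
#include <stdio.h>

/* Build the JSON config string for XGDMatrixCreateFromCallback.
 * nthread <= 0 omits the optional "nthread" field.
 * Returns 0 on success, -1 if `out` is too small or formatting fails. */
static int make_extmem_config(char *out, size_t out_len, double missing,
                              const char *cache_prefix, int nthread) {
  int n;
  if (nthread > 0) {
    n = snprintf(out, out_len,
                 "{\"missing\": %g, \"cache_prefix\": \"%s\", \"nthread\": %d}",
                 missing, cache_prefix, nthread);
  } else {
    n = snprintf(out, out_len, "{\"missing\": %g, \"cache_prefix\": \"%s\"}",
                 missing, cache_prefix);
  }
  /* snprintf reports the length it wanted to write; treat truncation as an error. */
  return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The resulting buffer would be passed as config. The directories named in cache_prefix still have to exist before the DMatrix is constructed.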
◆ XGDMatrixCreateFromDataIter()
Create a DMatrix from a data iterator.
- Parameters
-
data_handle | The handle to the data. |
callback | The callback to get the data. |
cache_info | Additional information about cache file, can be null. |
missing | The value that represents a missing value. |
out | The created DMatrix |
- Returns
- 0 when success, -1 when failure happens.
◆ XGExtMemQuantileDMatrixCreateFromCallback()
Create a Quantile DMatrix backed by external memory.
- Since
- 3.0.0
- Note
- This is experimental and subject to change.
- Parameters
-
iter | A handle to external data iterator. |
proxy | A DMatrix proxy handle created by XGProxyDMatrixCreate. |
ref | Reference DMatrix for providing quantile information. |
reset | Callback function resetting the iterator state. |
next | Callback function yielding the next batch of data. |
config | JSON encoded parameters for DMatrix construction. Accepted fields are:
- missing: The value that represents a missing value.
- cache_prefix: The path of the cache file. The caller must create all the directories in this path.
- nthread (optional): Number of threads used for initializing DMatrix.
- max_bin (optional): Maximum number of bins for building histogram. Must be consistent with the corresponding booster training parameter.
- on_host (optional): Whether the data should be placed on host memory. Used by GPU inputs.
- min_cache_page_bytes (optional): The minimum number of bytes for each internal GPU page. Set to 0 to disable page concatenation. Configured automatically if the parameter is not provided or is set to None.
- max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming batches with multiple growing sub-streams. This parameter sets the maximum number of batches before XGBoost can cut the sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows new sub-streams exponentially until batches are exhausted. Only used for the training dataset and the default is None (unbounded).
|
out | The created Quantile DMatrix. |
- Returns
- 0 when success, -1 when failure happens
◆ XGProxyDMatrixCreate()
Create a DMatrix proxy for setting data, can be freed by XGDMatrixFree.
This is part of the second set of callback functions, used for constructing a Quantile DMatrix or an external memory DMatrix from a custom iterator.
The DMatrix proxy is only a temporary reference (wrapper) to the actual user data. For instance, if a dense matrix (like a numpy array) is passed into the proxy DMatrix via XGProxyDMatrixSetDataDense, the proxy DMatrix holds only a reference, and the input array must not be freed until the next iteration starts, which XGBoost signals by calling XGDMatrixCallbackNext. It is called ProxyDMatrix because it reuses the interface of the DMatrix class in XGBoost, but it is just an intermediate interface through which XGDMatrixCreateFromCallback and the related constructors consume various user input types.
User inputs -> Proxy DMatrix (wrapper) -> Actual DMatrix
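The lifetime rule above suggests a simple ownership pattern: the iterator keeps the buffer of the current batch alive and releases it only at the start of the following next call. The sketch below illustrates that pattern under stated assumptions. OwningIter and its functions are hypothetical, DataIterHandle is a local stand-in typedef, the proxy setter call is left as a comment, and the return convention of next (1 while a batch is supplied, 0 at the end) is assumed rather than taken from this page.

```c
#include <stdlib.h>

typedef void *DataIterHandle; /* stand-in for the real typedef */

/* Hypothetical iterator that owns the buffer of the current batch. */
typedef struct {
  size_t n_batches;
  size_t cursor;
  size_t batch_len; /* floats per batch, must be > 0 */
  float *current;   /* buffer the proxy would currently reference, or NULL */
} OwningIter;

/* reset only rewinds; the buffer is released by the following next call. */
static void owning_iter_reset(DataIterHandle handle) {
  OwningIter *self = (OwningIter *)handle;
  self->cursor = 0;
}

static int owning_iter_next(DataIterHandle handle) {
  OwningIter *self = (OwningIter *)handle;
  size_t i;
  /* A new call to next signals that the previous batch is no longer referenced. */
  free(self->current);
  self->current = NULL;
  if (self->cursor == self->n_batches) {
    return 0; /* assumed end-of-data value */
  }
  self->current = (float *)malloc(self->batch_len * sizeof(float));
  if (self->current == NULL) {
    abort(); /* keep the sketch simple: no error reporting path */
  }
  for (i = 0; i < self->batch_len; ++i) {
    self->current[i] = (float)self->cursor; /* dummy payload */
  }
  /* Real code would describe self->current as an array-interface JSON
   * string and pass it to XGProxyDMatrixSetDataDense here. */
  self->cursor++;
  return 1;
}

/* Call after DMatrix construction has finished. */
static void owning_iter_destroy(OwningIter *self) {
  free(self->current);
  self->current = NULL;
}
```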
- Parameters
-
out | The created Proxy DMatrix. |
- Returns
- 0 when success, -1 when failure happens.
- Examples
- external_memory.c.
◆ XGProxyDMatrixSetDataColumnar()
int XGProxyDMatrixSetDataColumnar (DMatrixHandle handle, char const *data)
Set columnar (table) data on a DMatrix proxy.
- Parameters
-
- Returns
- 0 when success, -1 when failure happens
◆ XGProxyDMatrixSetDataCSR()
int XGProxyDMatrixSetDataCSR (DMatrixHandle handle, char const *indptr, char const *indices, char const *data, bst_ulong ncol)
Set data on a DMatrix proxy.
- Parameters
-
handle | A DMatrix proxy created by XGProxyDMatrixCreate |
indptr | JSON encoded array_interface to row pointer in CSR. |
indices | JSON encoded array_interface to column indices in CSR. |
data | JSON encoded array_interface to values in CSR. |
ncol | The number of columns of input CSR matrix. |
- Returns
- 0 when success, -1 when failure happens
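Each of the three CSR arguments is a JSON string describing one contiguous buffer. The sketch below builds such strings in the style of the NumPy array interface (data pointer plus read-only flag, shape, typestr, version 3). The helper names and the CsrJson struct are hypothetical, and the typestr values assume a little-endian host with uint64 row pointers, uint32 column indices, and float32 values; whether a given XGBoost build accepts these exact dtypes should be checked rather than taken from this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Describe a contiguous 1-D buffer as an array-interface JSON string.
 * Returns 0 on success, -1 if `out` is too small or formatting fails. */
static int make_array_interface_1d(const void *ptr, size_t n, const char *typestr,
                                   char *out, size_t out_len) {
  int written = snprintf(
      out, out_len,
      "{\"data\": [%llu, true], \"shape\": [%llu], \"typestr\": \"%s\", \"version\": 3}",
      (unsigned long long)(uintptr_t)ptr, (unsigned long long)n, typestr);
  return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}

/* Hypothetical holder for the three JSON strings of one CSR batch. */
typedef struct {
  char indptr[128];
  char indices[128];
  char data[128];
} CsrJson;

/* indptr has n_rows + 1 entries; indices and values have nnz entries each. */
static int describe_csr(const uint64_t *indptr, const uint32_t *indices,
                        const float *values, size_t n_rows, size_t nnz,
                        CsrJson *out) {
  if (make_array_interface_1d(indptr, n_rows + 1, "<u8", out->indptr,
                              sizeof out->indptr) != 0) {
    return -1;
  }
  if (make_array_interface_1d(indices, nnz, "<u4", out->indices,
                              sizeof out->indices) != 0) {
    return -1;
  }
  return make_array_interface_1d(values, nnz, "<f4", out->data, sizeof out->data);
}
```

Inside a next callback, the three strings would be passed as indptr, indices, and data, together with the column count as ncol. The buffers they point to must stay valid until the next iteration starts.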
◆ XGProxyDMatrixSetDataCudaArrayInterface()
int XGProxyDMatrixSetDataCudaArrayInterface (DMatrixHandle handle, const char *data)
Set data on a DMatrix proxy.
- Parameters
-
handle | A DMatrix proxy created by XGProxyDMatrixCreate |
data | Null terminated JSON document string representation of CUDA array interface. |
- Returns
- 0 when success, -1 when failure happens
◆ XGProxyDMatrixSetDataCudaColumnar()
int XGProxyDMatrixSetDataCudaColumnar (DMatrixHandle handle, const char *data)
Set CUDA-based columnar (table) data on a DMatrix proxy.
- Parameters
-
- Returns
- 0 when success, -1 when failure happens
◆ XGProxyDMatrixSetDataDense()
int XGProxyDMatrixSetDataDense (DMatrixHandle handle, char const *data)
Set data on a DMatrix proxy.
- Parameters
-
handle | A DMatrix proxy created by XGProxyDMatrixCreate |
data | Null terminated JSON document string representation of array interface. |
- Returns
- 0 when success, -1 when failure happens
- Examples
- external_memory.c.
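For a dense batch, the data argument describes a single two-dimensional buffer. The helper below is a hypothetical sketch that formats a row-major float32 matrix in the style of the NumPy array interface. The "<f4" typestr assumes a little-endian host, and the read-only flag is set to true as an illustrative choice.

```c
#include <stdint.h>
#include <stdio.h>

/* Describe a contiguous row-major float32 matrix as an array-interface
 * JSON string. Returns 0 on success, -1 if `out` is too small. */
static int make_dense_interface(const float *data, size_t n_rows, size_t n_cols,
                                char *out, size_t out_len) {
  int written = snprintf(
      out, out_len,
      "{\"data\": [%llu, true], \"shape\": [%llu, %llu], \"typestr\": \"<f4\", "
      "\"version\": 3}",
      (unsigned long long)(uintptr_t)data, (unsigned long long)n_rows,
      (unsigned long long)n_cols);
  return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}
```

The string only carries the address of the buffer, so the matrix itself must remain allocated for as long as the proxy references it.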
◆ XGQuantileDMatrixCreateFromCallback()
Create a Quantile DMatrix with a data iterator.
Short note on how to use the second set of callbacks for the (GPU)Hist tree method:
- Step 0: Define a data iterator with 2 methods, reset and next.
- Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
- Step 2: Pass the iterator handle, the proxy handle, and the 2 methods into XGQuantileDMatrixCreateFromCallback.
- Step 3: Call the appropriate data setters in the next function.
See test_iterative_dmatrix.cu or Python interface for examples.
- Parameters
-
iter | A handle to external data iterator. |
proxy | A DMatrix proxy handle created by XGProxyDMatrixCreate. |
ref | Reference DMatrix for providing quantile information. |
reset | Callback function resetting the iterator state. |
next | Callback function yielding the next batch of data. |
config | JSON encoded parameters for DMatrix construction. Accepted fields are:
- missing: The value that represents a missing value.
- nthread (optional): Number of threads used for initializing DMatrix.
- max_bin (optional): Maximum number of bins for building histogram. Must be consistent with the corresponding booster training parameter.
- max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming batches with multiple growing substreams. This parameter sets the maximum number of batches before XGBoost can cut the sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows new sub-streams exponentially until batches are exhausted. Only used for the training dataset and the default is None (unbounded).
|
out | The created Quantile DMatrix. |
- Returns
- 0 when success, -1 when failure happens