xgboost
Streaming

Quantile DMatrix and external memory DMatrix can be created from batches of data.

Classes

struct  XGBoostBatchCSR
 Mini batch used in XGBoost Data Iteration.
 

Typedefs

typedef void * DataIterHandle
 Handle to an external data iterator.
 
typedef void * DataHolderHandle
 Handle to an internal data holder.
 
typedef int XGBCallbackSetData(DataHolderHandle handle, XGBoostBatchCSR batch)
 Callback to set the data on the handle.
 
typedef int XGBCallbackDataIterNext(DataIterHandle data_handle, XGBCallbackSetData *set_function, DataHolderHandle set_function_handle)
 The data-reading callback function. The iterator can return the data one batch (a subset of the data) at a time.
 
typedef int XGDMatrixCallbackNext(DataIterHandle iter)
 Callback function prototype for getting the next batch of data.
 
typedef void DataIterResetCallback(DataIterHandle handle)
 Callback function prototype for resetting the external iterator.
 
 

Functions

int XGDMatrixCreateFromDataIter (DataIterHandle data_handle, XGBCallbackDataIterNext *callback, const char *cache_info, float missing, DMatrixHandle *out)
 Create a DMatrix from a data iterator.
 
int XGProxyDMatrixCreate (DMatrixHandle *out)
 Create a DMatrix proxy for setting data. It can be freed by XGDMatrixFree.
 
int XGDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out)
 Create an external memory DMatrix with a data iterator.
 
int XGQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterHandle ref, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out)
 Create a Quantile DMatrix with a data iterator.
 
int XGExtMemQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterHandle ref, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out)
 Create a Quantile DMatrix backed by external memory.
 
int XGProxyDMatrixSetDataCudaArrayInterface (DMatrixHandle handle, const char *data)
 Set CUDA array interface data on a DMatrix proxy.
 
int XGProxyDMatrixSetDataColumnar (DMatrixHandle handle, char const *data)
 Set columnar (table) data on a DMatrix proxy.
 
int XGProxyDMatrixSetDataCudaColumnar (DMatrixHandle handle, const char *data)
 Set CUDA-based columnar (table) data on a DMatrix proxy.
 
int XGProxyDMatrixSetDataDense (DMatrixHandle handle, char const *data)
 Set dense data on a DMatrix proxy.
 
int XGProxyDMatrixSetDataCSR (DMatrixHandle handle, char const *indptr, char const *indices, char const *data, bst_ulong ncol)
 Set CSR data on a DMatrix proxy.
 
 

Detailed Description

Quantile DMatrix and external memory DMatrix can be created from batches of data.

There are 2 sets of data callbacks for DMatrix. The first one is currently used exclusively by the JVM packages. It uses XGBoostBatchCSR to accept batches of CSR-formatted input and concatenates them into a single large CSR matrix. The related functions are:

  • XGBCallbackSetData
  • XGBCallbackDataIterNext
  • XGDMatrixCreateFromDataIter

The other set is used by external data iterators. It accepts foreign data iterators as callbacks. There are 2 scenarios where users might want to pass in callbacks instead of raw data. The first is the Quantile DMatrix used by the hist and GPU-based hist tree methods. In this case, the data is first compressed by quantile sketching and then merged. This is particularly useful in a distributed setting because it eliminates 2 copies of the data: one made when an external library concatenates the data into a single blob for normal DMatrix initialization, and another made by the internal CSR copy of the DMatrix.

The second use case is external memory support, where users can pass a custom data iterator into XGBoost to load data in batches. In both cases, the iterator is only used during the construction of the DMatrix and can be safely freed after construction finishes. There are short notes on each use case in the respective DMatrix factory function.

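Related functions are:

Factory functions

  • XGDMatrixCreateFromCallback for the external memory DMatrix
  • XGQuantileDMatrixCreateFromCallback for the Quantile DMatrix
  • XGExtMemQuantileDMatrixCreateFromCallback for the Quantile DMatrix backed by external memory

Proxy that callers can use to pass data to XGBoost

  • XGProxyDMatrixCreate
  • XGDMatrixCallbackNext
  • DataIterResetCallback
  • XGProxyDMatrixSetDataCudaArrayInterface
  • XGProxyDMatrixSetDataColumnar
  • XGProxyDMatrixSetDataCudaColumnar
  • XGProxyDMatrixSetDataDense
  • XGProxyDMatrixSetDataCSR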

Typedef Documentation

◆ DataHolderHandle

typedef void* DataHolderHandle

Handle to an internal data holder.

◆ DataIterHandle

typedef void* DataIterHandle

Handle to an external data iterator.

◆ DataIterResetCallback

typedef void DataIterResetCallback(DataIterHandle handle)

Callback function prototype for resetting the external iterator.

◆ XGBCallbackDataIterNext

typedef int XGBCallbackDataIterNext( DataIterHandle data_handle, XGBCallbackSetData *set_function, DataHolderHandle set_function_handle)

The data-reading callback function. The iterator can return the data one batch (a subset of the data) at a time.

If there is data, the function will call set_function to set the data.

Parameters
data_handle  The handle to the callback.
set_function  The callback used to set the batch returned by the iterator.
set_function_handle  The handle to be passed to set_function.
Returns
0 if the end has been reached and no batch is returned.
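
The control flow of this callback can be sketched without linking against XGBoost. The following is a minimal mock, not the library's implementation: MockBatch is a hypothetical, simplified stand-in for XGBoostBatchCSR (whose real fields are not reproduced here), and RowSource, RowHolder, holder_set, source_next and consume_all are names invented for illustration. It assumes the callback returns 1 after delivering a batch and 0 at the end, in line with the Returns note above.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins so the sketch compiles without the XGBoost headers. */
typedef void *DataIterHandle;
typedef void *DataHolderHandle;

/* Hypothetical, simplified stand-in for XGBoostBatchCSR. */
typedef struct {
  size_t n_rows; /* rows carried by this batch */
} MockBatch;

/* Same shapes as XGBCallbackSetData and XGBCallbackDataIterNext. */
typedef int MockSetData(DataHolderHandle handle, MockBatch batch);
typedef int MockDataIterNext(DataIterHandle data_handle,
                             MockSetData *set_function,
                             DataHolderHandle set_function_handle);

/* Hypothetical data source: hands out rows in fixed-size batches. */
typedef struct {
  size_t rows_left;
  size_t batch_rows;
} RowSource;

/* Hypothetical holder on the consumer side: tallies what it receives. */
typedef struct {
  size_t total_rows;
  size_t n_calls;
} RowHolder;

static int holder_set(DataHolderHandle handle, MockBatch batch) {
  RowHolder *holder = (RowHolder *)handle;
  holder->total_rows += batch.n_rows;
  holder->n_calls++;
  return 0;
}

/* Assumed convention: 1 after delivering a batch, 0 once exhausted. */
static int source_next(DataIterHandle data_handle, MockSetData *set_function,
                       DataHolderHandle set_function_handle) {
  RowSource *src = (RowSource *)data_handle;
  assert(src->batch_rows > 0);
  if (src->rows_left == 0) {
    return 0;
  }
  MockBatch batch;
  batch.n_rows =
      src->rows_left < src->batch_rows ? src->rows_left : src->batch_rows;
  src->rows_left -= batch.n_rows;
  set_function(set_function_handle, batch);
  return 1;
}

/* Mock consumer loop: pull batches until the callback reports the end. */
static void consume_all(DataIterHandle src, MockDataIterNext *next,
                        RowHolder *holder) {
  while (next(src, holder_set, holder) != 0) {
  }
}
```

In real code the consumer loop lives inside XGBoost; the caller only supplies a function shaped like source_next together with its data handle.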

◆ XGBCallbackSetData

typedef int XGBCallbackSetData( DataHolderHandle handle, XGBoostBatchCSR batch)

Callback to set the data on the handle.

Parameters
handle  The handle to the callback.
batch  The data content to be set.

◆ XGDMatrixCallbackNext

typedef int XGDMatrixCallbackNext(DataIterHandle iter)

Callback function prototype for getting next batch of data.

Parameters
iter  A handle to the user-defined iterator.
Returns
1 if a new batch was set on the proxy, 0 if there are no more batches (the convention followed by the external_memory.c example).

Function Documentation

◆ XGDMatrixCreateFromCallback()

int XGDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
char const *  config,
DMatrixHandle *  out 
)

Create an external memory DMatrix with data iterator.

A short note on using the second set of callbacks for external memory support:

  • Step 0: Define a data iterator with 2 methods, reset and next.
  • Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
  • Step 2: Pass the iterator handle, the proxy handle and the 2 methods into XGDMatrixCreateFromCallback, along with other parameters encoded as a JSON object.
  • Step 3: Call the appropriate data setter on the proxy inside the next function.
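
The iterator from Step 0 can be sketched as a struct plus two functions. The following is a minimal mock that compiles without the XGBoost headers: DataIterHandle is redeclared locally, and BatchIter, set_batch_on_proxy and drive_one_pass are hypothetical names. It assumes next returns 1 after setting a batch and 0 once the data is exhausted, which is the convention of the external_memory.c example. In real code, set_batch_on_proxy would call one of the XGProxyDMatrixSetData* functions (Step 3), and the driver loop is run by XGBoost rather than by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the typedef documented on this page. */
typedef void *DataIterHandle;

/* Hypothetical user-defined iterator over batches held in memory. */
typedef struct {
  const float *const *batches; /* batches[i] points at the i-th batch */
  size_t n_batches;            /* number of batches */
  size_t cur;                  /* index of the next batch to hand out */
  size_t n_set;                /* how many times the setter hook ran */
} BatchIter;

/* Placeholder for Step 3: real code would pass the batch to the proxy,
 * for example through XGProxyDMatrixSetDataDense. */
static void set_batch_on_proxy(BatchIter *self, const float *batch) {
  assert(batch != NULL);
  self->n_set++;
}

/* Same shape as DataIterResetCallback: rewind to the first batch. */
static void batch_iter_reset(DataIterHandle handle) {
  BatchIter *self = (BatchIter *)handle;
  self->cur = 0;
}

/* Same shape as XGDMatrixCallbackNext.
 * Assumed convention: 1 = batch set, 0 = no more batches. */
static int batch_iter_next(DataIterHandle handle) {
  BatchIter *self = (BatchIter *)handle;
  if (self->cur == self->n_batches) {
    return 0;
  }
  set_batch_on_proxy(self, self->batches[self->cur]);
  self->cur++;
  return 1;
}

/* Mock of the consumer side: one full pass over the data.
 * Returns the number of batches seen. */
static size_t drive_one_pass(DataIterHandle iter,
                             void (*reset)(DataIterHandle),
                             int (*next)(DataIterHandle)) {
  size_t n = 0;
  reset(iter);
  while (next(iter) != 0) {
    ++n;
  }
  return n;
}
```

Because reset rewinds the iterator, the consumer can make several passes over the same data, which is why both callbacks are required.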
Parameters
iter  A handle to the external data iterator.
proxy  A DMatrix proxy handle created by XGProxyDMatrixCreate.
reset  Callback function resetting the iterator state.
next  Callback function yielding the next batch of data.
config  JSON-encoded parameters for DMatrix construction. Accepted fields are:
  • missing: The value that represents a missing entry.
  • cache_prefix: The path of the cache file. The caller must create all the directories in this path.
  • nthread (optional): Number of threads used for initializing the DMatrix.
[out] out  The created external memory DMatrix.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.

◆ XGDMatrixCreateFromDataIter()

int XGDMatrixCreateFromDataIter ( DataIterHandle  data_handle,
XGBCallbackDataIterNext *  callback,
const char *  cache_info,
float  missing,
DMatrixHandle *  out 
)

Create a DMatrix from a data iterator.

Parameters
data_handle  The handle to the data.
callback  The callback to get the data.
cache_info  Additional information about the cache file, can be null.
missing  The value that represents a missing entry.
out  The created DMatrix.
Returns
0 on success, -1 on failure.

◆ XGExtMemQuantileDMatrixCreateFromCallback()

int XGExtMemQuantileDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterHandle  ref,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
char const *  config,
DMatrixHandle *  out 
)

Create a Quantile DMatrix backed by external memory.

Since
3.0.0
Note
This is experimental and subject to change.
Parameters
iter  A handle to the external data iterator.
proxy  A DMatrix proxy handle created by XGProxyDMatrixCreate.
ref  Reference DMatrix for providing quantile information.
reset  Callback function resetting the iterator state.
next  Callback function yielding the next batch of data.
config  JSON-encoded parameters for DMatrix construction. Accepted fields are:
  • missing: The value that represents a missing entry.
  • cache_prefix: The path of the cache file. The caller must create all the directories in this path.
  • nthread (optional): Number of threads used for initializing the DMatrix.
  • max_bin (optional): Maximum number of bins for building the histogram. Must be consistent with the corresponding booster training parameter.
  • on_host (optional): Whether the data should be placed in host memory. Used by GPU inputs.
  • min_cache_page_bytes (optional): The minimum number of bytes for each internal GPU page. Set to 0 to disable page concatenation. Configured automatically if the parameter is not provided or is set to None.
  • max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming batches with multiple growing sub-streams. This parameter sets the maximum number of batches before XGBoost can cut the sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows new sub-streams exponentially until the batches are exhausted. Only used for the training dataset; the default is None (unbounded).
out  The created Quantile DMatrix.
Returns
0 on success, -1 on failure.
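
The config argument is an ordinary JSON string, so it can be assembled with snprintf. The following is a minimal sketch that fills in a few of the fields listed above; build_extmem_config is a hypothetical helper, not part of the XGBoost API. It assumes a finite missing value and a cache prefix that needs no JSON escaping, and the values shown in the usage note are placeholders.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a JSON config into buf. Returns the string
 * length, or -1 if the prefix would need escaping or buf is too small. */
static int build_extmem_config(char *buf, size_t cap, double missing,
                               const char *cache_prefix, int nthread,
                               int max_bin) {
  assert(buf != NULL && cache_prefix != NULL);
  /* This sketch does no JSON escaping, so reject quotes and backslashes. */
  if (strchr(cache_prefix, '"') != NULL ||
      strchr(cache_prefix, '\\') != NULL) {
    return -1;
  }
  int n = snprintf(buf, cap,
                   "{\"missing\": %g, \"cache_prefix\": \"%s\", "
                   "\"nthread\": %d, \"max_bin\": %d}",
                   missing, cache_prefix, nthread, max_bin);
  /* snprintf reports the length it wanted; treat truncation as failure. */
  return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling build_extmem_config(buf, sizeof buf, -1.0, "cache/train", 4, 256) yields a string that could be passed as config. The directories in the prefix still have to exist beforehand, as noted for cache_prefix.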

◆ XGProxyDMatrixCreate()

int XGProxyDMatrixCreate ( DMatrixHandle *  out )

Create a DMatrix proxy for setting data, can be freed by XGDMatrixFree.

This is part of the second set of callback functions, used for constructing a Quantile DMatrix or an external memory DMatrix from a custom iterator.

The DMatrix proxy is only a temporary reference (wrapper) to the actual user data. For instance, if a dense matrix (like a numpy array) is passed into the proxy DMatrix via the XGProxyDMatrixSetDataDense method, the proxy DMatrix holds only a reference, and the input array must not be freed until the next iteration starts, which XGBoost signals by calling XGDMatrixCallbackNext. It is called ProxyDMatrix because it reuses the interface of the DMatrix class in XGBoost, but it is only an intermediate interface that lets XGDMatrixCreateFromCallback and related constructors consume various user input types.

User inputs -> Proxy DMatrix (wrapper) -> Actual DMatrix
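
The lifetime rule above suggests a simple ownership pattern for the iterator: keep the buffer that the proxy currently references, and release it only when next is called again. The following is a minimal sketch of that pattern with no XGBoost calls; OwningIter, release_live, owning_iter_next and owning_iter_destroy are hypothetical names, DataIterHandle is redeclared locally, and the 1/0 return convention is an assumption taken from the external_memory.c example.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-in mirroring the typedef documented on this page. */
typedef void *DataIterHandle;

/* Hypothetical iterator that owns the buffer referenced by the proxy. */
typedef struct {
  size_t n_batches; /* total number of batches to produce */
  size_t cur;       /* index of the next batch */
  size_t batch_len; /* number of floats per batch */
  float *live;      /* buffer the proxy currently points at, or NULL */
  size_t n_freed;   /* bookkeeping for this sketch only */
} OwningIter;

static void release_live(OwningIter *self) {
  if (self->live != NULL) {
    free(self->live);
    self->live = NULL;
    self->n_freed++;
  }
}

/* Same shape as XGDMatrixCallbackNext. */
static int owning_iter_next(DataIterHandle handle) {
  OwningIter *self = (OwningIter *)handle;
  assert(self != NULL);
  /* A new call to next means the previous batch is no longer referenced. */
  release_live(self);
  if (self->cur == self->n_batches) {
    return 0; /* assumed convention: 0 = no more batches */
  }
  self->live = (float *)malloc(self->batch_len * sizeof(float));
  if (self->live == NULL) {
    abort(); /* sketch only: no recovery from allocation failure */
  }
  for (size_t i = 0; i < self->batch_len; ++i) {
    self->live[i] = (float)self->cur;
  }
  /* Real code would now hand self->live to the proxy through one of the
   * XGProxyDMatrixSetData* functions. */
  self->cur++;
  return 1; /* assumed convention: 1 = batch set */
}

/* Call once DMatrix construction has finished. */
static void owning_iter_destroy(OwningIter *self) {
  release_live(self);
}
```

Freeing the buffer immediately after the setter returns would leave the proxy with a dangling reference, which is why the release is deferred to the following call.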
Parameters
out  The created Proxy DMatrix.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.

◆ XGProxyDMatrixSetDataColumnar()

int XGProxyDMatrixSetDataColumnar ( DMatrixHandle  handle,
char const *  data 
)

Set columnar (table) data on a DMatrix proxy.

Parameters
handle  A DMatrix proxy created by XGProxyDMatrixCreate.
data  See XGDMatrixCreateFromColumnar for details.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataCSR()

int XGProxyDMatrixSetDataCSR ( DMatrixHandle  handle,
char const *  indptr,
char const *  indices,
char const *  data,
bst_ulong  ncol 
)

Set CSR data on a DMatrix proxy.

Parameters
handle  A DMatrix proxy created by XGProxyDMatrixCreate.
indptr  JSON-encoded array_interface to the row pointers in CSR.
indices  JSON-encoded array_interface to the column indices in CSR.
data  JSON-encoded array_interface to the values in CSR.
ncol  The number of columns of the input CSR matrix.
Returns
0 on success, -1 on failure.
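
For readers less familiar with CSR, the three arrays can be illustrated by converting a small dense matrix. The following is a minimal sketch and is not how XGBoost builds its internal storage; dense_to_csr is a hypothetical helper, the element types are chosen for illustration only, and it assumes a finite missing value (a NaN marker would need an isnan check instead of the comparison used here). Each resulting array would then be described by its own array-interface JSON string before being passed to XGProxyDMatrixSetDataCSR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts a row-major dense matrix to CSR, skipping
 * entries equal to `missing`. The caller provides indptr with n_rows + 1
 * slots and indices/values with n_rows * n_cols slots.
 * Returns the number of stored entries. */
static size_t dense_to_csr(const float *dense, size_t n_rows, size_t n_cols,
                           float missing, size_t *indptr, unsigned *indices,
                           float *values) {
  assert(dense != NULL && indptr != NULL && indices != NULL && values != NULL);
  size_t nnz = 0;
  indptr[0] = 0;
  for (size_t i = 0; i < n_rows; ++i) {
    for (size_t j = 0; j < n_cols; ++j) {
      float v = dense[i * n_cols + j];
      if (v != missing) {
        indices[nnz] = (unsigned)j; /* column of this stored value */
        values[nnz] = v;
        ++nnz;
      }
    }
    indptr[i + 1] = nnz; /* row i occupies [indptr[i], indptr[i + 1]) */
  }
  return nnz;
}
```

For the 2 x 3 matrix {1, 0, 2; 0, 0, 3} with 0 treated as missing, the row pointers are {0, 2, 3}, the column indices are {0, 2, 2}, the values are {1, 2, 3}, and ncol is 3.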

◆ XGProxyDMatrixSetDataCudaArrayInterface()

int XGProxyDMatrixSetDataCudaArrayInterface ( DMatrixHandle  handle,
const char *  data 
)

Set CUDA array interface data on a DMatrix proxy.

Parameters
handle  A DMatrix proxy created by XGProxyDMatrixCreate.
data  Null-terminated JSON document string representation of the CUDA array interface.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataCudaColumnar()

int XGProxyDMatrixSetDataCudaColumnar ( DMatrixHandle  handle,
const char *  data 
)

Set CUDA-based columnar (table) data on a DMatrix proxy.

Parameters
handle  A DMatrix proxy created by XGProxyDMatrixCreate.
data  See XGDMatrixCreateFromColumnar for details.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataDense()

int XGProxyDMatrixSetDataDense ( DMatrixHandle  handle,
char const *  data 
)

Set dense data on a DMatrix proxy.

Parameters
handle  A DMatrix proxy created by XGProxyDMatrixCreate.
data  Null-terminated JSON document string representation of the array interface.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.
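
The data argument follows the NumPy array interface layout: a JSON object holding the buffer address, the shape, a type string and a version number. The following is a minimal sketch that builds such a string for a row-major float32 matrix; make_dense_array_interface is a hypothetical helper, not part of the XGBoost API, and the "<f4" type string assumes a little-endian host.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: describes a row-major float32 buffer as an
 * array-interface JSON string. Returns the string length, or -1 if buf is
 * too small. The string only refers to `data`; it does not copy it. */
static int make_dense_array_interface(char *buf, size_t cap, const float *data,
                                      size_t n_rows, size_t n_cols) {
  assert(buf != NULL && data != NULL);
  int n = snprintf(buf, cap,
                   "{\"data\": [%" PRIuPTR ", false], "
                   "\"shape\": [%zu, %zu], "
                   "\"typestr\": \"<f4\", \"version\": 3}",
                   (uintptr_t)data, n_rows, n_cols);
  /* snprintf reports the length it wanted; treat truncation as failure. */
  return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

The second element of the "data" pair is the read-only flag. Since the string carries only an address, the buffer must stay alive for as long as the proxy references it.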

◆ XGQuantileDMatrixCreateFromCallback()

int XGQuantileDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterHandle  ref,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
char const *  config,
DMatrixHandle *  out 
)

Create a Quantile DMatrix with a data iterator.

A short note on using the second set of callbacks for the (GPU) hist tree method:

  • Step 0: Define a data iterator with 2 methods, reset and next.
  • Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
  • Step 2: Pass the iterator handle, the proxy handle and the 2 methods into XGQuantileDMatrixCreateFromCallback.
  • Step 3: Call the appropriate data setter on the proxy inside the next function.

See test_iterative_dmatrix.cu or the Python interface for examples.

Parameters
iter  A handle to the external data iterator.
proxy  A DMatrix proxy handle created by XGProxyDMatrixCreate.
ref  Reference DMatrix for providing quantile information.
reset  Callback function resetting the iterator state.
next  Callback function yielding the next batch of data.
config  JSON-encoded parameters for DMatrix construction. Accepted fields are:
  • missing: The value that represents a missing entry.
  • nthread (optional): Number of threads used for initializing the DMatrix.
  • max_bin (optional): Maximum number of bins for building the histogram. Must be consistent with the corresponding booster training parameter.
  • max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming batches with multiple growing sub-streams. This parameter sets the maximum number of batches before XGBoost can cut the sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows new sub-streams exponentially until the batches are exhausted. Only used for the training dataset; the default is None (unbounded).
out  The created Quantile DMatrix.
Returns
0 on success, -1 on failure.