xgboost
Streaming

Quantile DMatrix and external memory DMatrix can be created from batches of data.

Classes

struct  XGBoostBatchCSR
 Mini batch used in XGBoost Data Iteration.

Typedefs

typedef void * DataIterHandle
 Handle to an external data iterator.

typedef void * DataHolderHandle
 Handle to an internal data holder.

typedef int XGBCallbackSetData(DataHolderHandle handle, XGBoostBatchCSR batch)
 Callback to set the data on the handle.

typedef int XGBCallbackDataIterNext(DataIterHandle data_handle, XGBCallbackSetData *set_function, DataHolderHandle set_function_handle)
 The data reading callback function. The iterator can supply the data one subset (batch) at a time.

typedef int XGDMatrixCallbackNext(DataIterHandle iter)
 Callback function prototype for getting the next batch of data.

typedef void DataIterResetCallback(DataIterHandle handle)
 Callback function prototype for resetting the external iterator.

Functions

int XGDMatrixCreateFromDataIter (DataIterHandle data_handle, XGBCallbackDataIterNext *callback, const char *cache_info, DMatrixHandle *out)
 Create a DMatrix from a data iterator.

int XGProxyDMatrixCreate (DMatrixHandle *out)
 Create a DMatrix proxy for setting data; it can be freed by XGDMatrixFree.

int XGDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out)
 Create an external memory DMatrix with a data iterator.

int XGQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterHandle ref, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, char const *config, DMatrixHandle *out)
 Create a Quantile DMatrix with a data iterator.

int XGDeviceQuantileDMatrixCreateFromCallback (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, float missing, int nthread, int max_bin, DMatrixHandle *out)
 Create a Device Quantile DMatrix with a data iterator.

int XGProxyDMatrixSetDataCudaArrayInterface (DMatrixHandle handle, const char *c_interface_str)
 Set data on a DMatrix proxy.

int XGProxyDMatrixSetDataColumnar (DMatrixHandle handle, char const *c_interface_str)
 Set columnar (table) data on a DMatrix proxy.

int XGProxyDMatrixSetDataCudaColumnar (DMatrixHandle handle, const char *c_interface_str)
 Set data on a DMatrix proxy.

int XGProxyDMatrixSetDataDense (DMatrixHandle handle, char const *c_interface_str)
 Set data on a DMatrix proxy.

int XGProxyDMatrixSetDataCSR (DMatrixHandle handle, char const *indptr, char const *indices, char const *data, bst_ulong ncol)
 Set data on a DMatrix proxy.

Detailed Description

Quantile DMatrix and external memory DMatrix can be created from batches of data.

There are two sets of data callbacks for DMatrix. The first is currently used exclusively by the JVM packages. It uses XGBoostBatchCSR to accept batches of CSR-formatted input and concatenates them into one final CSR matrix. The related functions are:

  • XGBCallbackSetData
  • XGBCallbackDataIterNext
  • XGDMatrixCreateFromDataIter

The second set is used by external data iterators. It accepts foreign data iterators as callbacks. There are two scenarios in which users might want to pass callbacks instead of raw data. The first is the Quantile DMatrix used by the hist and GPU hist tree methods. In this case, the data is first compressed by quantile sketching and then merged. This is particularly useful in a distributed setting because it eliminates two copies of the data: one made when an external library concatenates the data into a single blob for normal DMatrix initialization, and another made by the internal CSR copy of DMatrix. The second use case is external memory support, where users pass a custom data iterator into XGBoost to load data in batches. Short notes on each use case appear in the documentation of the respective DMatrix factory function.

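Related functions are:

Factory functions

  • XGDMatrixCreateFromCallback for external memory
  • XGQuantileDMatrixCreateFromCallback for Quantile DMatrix

Proxy that callers can use to pass data to XGBoost

  • XGProxyDMatrixCreate
  • XGDMatrixCallbackNext
  • DataIterResetCallback
  • XGProxyDMatrixSetDataCudaArrayInterface
  • XGProxyDMatrixSetDataCudaColumnar
  • XGProxyDMatrixSetDataDense
  • XGProxyDMatrixSetDataCSR
  • XGProxyDMatrixSetDataColumnar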
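
As a rough illustration of the second callback set, the sketch below models an iterator with the two methods, reset and next, over batches held in memory. It is not taken from XGBoost: BatchIter, batch_iter_reset, batch_iter_next, and count_batches are hypothetical names, and the callbacks take void * because DataIterHandle is a typedef for void *. The return convention used for next (1 while a batch is available, 0 at the end) is an assumption modeled on the external_memory.c example; check it against the headers of the XGBoost version in use.

```c
#include <stddef.h>

/* Hypothetical iterator state: a list of in-memory batches. */
typedef struct {
  const float *const *batches; /* one dense row-major buffer per batch */
  size_t n_batches;            /* total number of batches */
  size_t cur;                  /* index of the next batch to hand out */
  const float *staged;         /* batch most recently handed out */
} BatchIter;

/* Same shape as DataIterResetCallback: void(DataIterHandle). */
static void batch_iter_reset(void *handle) {
  BatchIter *self = (BatchIter *)handle;
  self->cur = 0;
  self->staged = NULL;
}

/* Same shape as XGDMatrixCallbackNext: int(DataIterHandle).
 * Assumed convention: 1 = a batch was staged, 0 = no more batches. */
static int batch_iter_next(void *handle) {
  BatchIter *self = (BatchIter *)handle;
  if (self->cur == self->n_batches) {
    return 0;
  }
  /* A real iterator would call one of the XGProxyDMatrixSetData*
   * setters on its proxy handle at this point. */
  self->staged = self->batches[self->cur];
  self->cur++;
  return 1;
}

/* Hypothetical driver that mimics one full pass over the data. */
static size_t count_batches(BatchIter *it) {
  size_t n = 0;
  batch_iter_reset(it);
  while (batch_iter_next(it)) {
    n++;
  }
  return n;
}
```

Keeping all state behind a single handle means the same pair of functions can be passed to either factory function, and the consumer can reset and replay the iterator as many times as it needs.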

Typedef Documentation

◆ DataHolderHandle

typedef void* DataHolderHandle

Handle to an internal data holder.

◆ DataIterHandle

typedef void* DataIterHandle

Handle to an external data iterator.

◆ DataIterResetCallback

typedef void DataIterResetCallback(DataIterHandle handle)

Callback function prototype for resetting external iterator.

◆ XGBCallbackDataIterNext

typedef int XGBCallbackDataIterNext( DataIterHandle data_handle, XGBCallbackSetData *set_function, DataHolderHandle set_function_handle)

The data reading callback function. The iterator can supply the data one subset (batch) at a time.

If there is data, the function calls set_function to set the data.

Parameters
data_handle: The handle to the callback.
set_function: The callback used to set the batch returned by the iterator.
set_function_handle: The handle to be passed to the set function.
Returns
0 if the end has been reached and no batch is returned.

◆ XGBCallbackSetData

typedef int XGBCallbackSetData( DataHolderHandle handle, XGBoostBatchCSR batch)

Callback to set the data on the handle.

Parameters
handle: The handle to the callback.
batch: The data content to be set.

◆ XGDMatrixCallbackNext

typedef int XGDMatrixCallbackNext(DataIterHandle iter)

Callback function prototype for getting the next batch of data.

Parameters
iter: A handle to the user-defined iterator.
Returns
0 on success, -1 on failure.

Function Documentation

◆ XGDeviceQuantileDMatrixCreateFromCallback()

int XGDeviceQuantileDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
float  missing,
int  nthread,
int  max_bin,
DMatrixHandle *  out 
)

Create a Device Quantile DMatrix with data iterator.

Deprecated:
since 1.7.0
See also
XGQuantileDMatrixCreateFromCallback()

◆ XGDMatrixCreateFromCallback()

int XGDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
char const *  config,
DMatrixHandle *  out 
)

Create an external memory DMatrix with a data iterator.

Short note on how to use the second set of callbacks for external memory data support:

  • Step 0: Define a data iterator with two methods, reset and next.
  • Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
  • Step 2: Pass the iterator handle, the proxy handle, and the two methods into XGDMatrixCreateFromCallback, along with other parameters encoded as a JSON object.
  • Step 3: Call the appropriate data setter inside the next function.
Parameters
iter: A handle to the external data iterator.
proxy: A DMatrix proxy handle created by XGProxyDMatrixCreate.
reset: Callback function that resets the iterator state.
next: Callback function that yields the next batch of data.
config: JSON-encoded parameters for DMatrix construction. Accepted fields are:
  • missing: The value used to represent missing values.
  • cache_prefix: The path of the cache file. The caller must create all directories in this path.
  • nthread (optional): Number of threads used for initializing the DMatrix.
[out] out: The created external memory DMatrix.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.
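
The config argument can be assembled with ordinary string formatting. The sketch below is one way to do it and makes several assumptions: format_extmem_config is a hypothetical helper, not part of the C API; writing a missing value of NaN as the bare token NaN assumes the JSON parser in XGBoost accepts that token; and the cache prefix is copied without escaping, so prefixes containing quotes or backslashes are rejected.

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a config object with the fields missing,
 * cache_prefix, and nthread into buf.
 * Returns the number of characters written, or -1 on error. */
static int format_extmem_config(char *buf, size_t len, float missing,
                                const char *cache_prefix, int nthread) {
  char missing_str[32];
  int n;
  /* No escaping is done, so reject prefixes that would need it. */
  if (strpbrk(cache_prefix, "\"\\") != NULL) {
    return -1;
  }
  /* Infinite values have no plain JSON spelling; reject them. */
  if (isinf(missing)) {
    return -1;
  }
  if (isnan(missing)) {
    /* Assumption: the parser accepts a bare NaN token. */
    strcpy(missing_str, "NaN");
  } else {
    snprintf(missing_str, sizeof missing_str, "%g", (double)missing);
  }
  n = snprintf(buf, len,
               "{\"missing\": %s, \"cache_prefix\": \"%s\", \"nthread\": %d}",
               missing_str, cache_prefix, nthread);
  /* snprintf reports truncation by returning a length >= len. */
  return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string would be passed as config in Step 2. Since the documentation says the caller must create the directories in the cache path, that would have to happen before the factory function is called.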

◆ XGDMatrixCreateFromDataIter()

int XGDMatrixCreateFromDataIter ( DataIterHandle  data_handle,
XGBCallbackDataIterNext *  callback,
const char *  cache_info,
DMatrixHandle *  out 
)

Create a DMatrix from a data iterator.

Parameters
data_handle: The handle to the data.
callback: The callback to get the data.
cache_info: Additional information about the cache file; can be null.
out: The created DMatrix.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixCreate()

int XGProxyDMatrixCreate ( DMatrixHandle *  out)

Create a DMatrix proxy for setting data; it can be freed by XGDMatrixFree.

Part of the second set of callback functions, used for constructing a Quantile DMatrix or an external memory DMatrix from a custom iterator.

Parameters
out: The created DMatrix proxy.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.

◆ XGProxyDMatrixSetDataColumnar()

int XGProxyDMatrixSetDataColumnar ( DMatrixHandle  handle,
char const *  c_interface_str 
)

Set columnar (table) data on a DMatrix proxy.

Parameters
handle: A DMatrix proxy created by XGProxyDMatrixCreate.
c_interface_str: See XGBoosterPredictFromColumnar for details.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataCSR()

int XGProxyDMatrixSetDataCSR ( DMatrixHandle  handle,
char const *  indptr,
char const *  indices,
char const *  data,
bst_ulong  ncol 
)

Set data on a DMatrix proxy.

Parameters
handle: A DMatrix proxy created by XGProxyDMatrixCreate.
indptr: JSON-encoded array interface for the row pointers of the CSR matrix.
indices: JSON-encoded array interface for the column indices of the CSR matrix.
data: JSON-encoded array interface for the values of the CSR matrix.
ncol: The number of columns of the input CSR matrix.
Returns
0 on success, -1 on failure.
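
To make the three JSON arguments concrete, the sketch below describes the arrays of a small CSR matrix as array-interface documents. It is illustrative only: format_array_1d, format_csr, and CsrInterface are hypothetical names; the keys (data, shape, typestr, version) follow the NumPy array interface; and the element types chosen here (uint64 row pointers, uint32 column indices, float32 values, all little-endian) are assumptions rather than requirements stated in this reference.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: describes a 1-D buffer of n elements.
 * typestr follows the NumPy convention, e.g. "<u8", "<u4", "<f4". */
static int format_array_1d(char *buf, size_t len, const void *ptr, size_t n,
                           const char *typestr) {
  int w = snprintf(buf, len,
                   "{\"data\": [%llu, true], \"shape\": [%llu], "
                   "\"typestr\": \"%s\", \"version\": 3}",
                   (unsigned long long)(uintptr_t)ptr,
                   (unsigned long long)n, typestr);
  return (w < 0 || (size_t)w >= len) ? -1 : w;
}

/* Hypothetical container for the three JSON documents. */
typedef struct {
  char indptr[128];
  char indices[128];
  char data[128];
} CsrInterface;

/* Fills out with descriptions of the three CSR arrays.
 * Returns 0 on success, -1 if any description does not fit. */
static int format_csr(CsrInterface *out, const uint64_t *indptr, size_t n_rows,
                      const uint32_t *indices, const float *values) {
  /* The last row pointer equals the number of stored values. */
  size_t nnz = (size_t)indptr[n_rows];
  memset(out, 0, sizeof *out); /* start from empty strings */
  if (format_array_1d(out->indptr, sizeof out->indptr, indptr, n_rows + 1,
                      "<u8") < 0 ||
      format_array_1d(out->indices, sizeof out->indices, indices, nnz,
                      "<u4") < 0 ||
      format_array_1d(out->data, sizeof out->data, values, nnz, "<f4") < 0) {
    return -1;
  }
  return 0;
}
```

The three strings would then be passed as indptr, indices, and data, together with the column count as ncol. Because the strings carry only addresses, the underlying buffers have to remain valid for as long as XGBoost may read the batch.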

◆ XGProxyDMatrixSetDataCudaArrayInterface()

int XGProxyDMatrixSetDataCudaArrayInterface ( DMatrixHandle  handle,
const char *  c_interface_str 
)

Set data on a DMatrix proxy.

Parameters
handle: A DMatrix proxy created by XGProxyDMatrixCreate.
c_interface_str: Null-terminated JSON document string representation of the CUDA array interface.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataCudaColumnar()

int XGProxyDMatrixSetDataCudaColumnar ( DMatrixHandle  handle,
const char *  c_interface_str 
)

Set data on a DMatrix proxy.

Parameters
handle: A DMatrix proxy created by XGProxyDMatrixCreate.
c_interface_str: Null-terminated JSON document string representation of the CUDA array interface, with an array of columns.
Returns
0 on success, -1 on failure.

◆ XGProxyDMatrixSetDataDense()

int XGProxyDMatrixSetDataDense ( DMatrixHandle  handle,
char const *  c_interface_str 
)

Set data on a DMatrix proxy.

Parameters
handle: A DMatrix proxy created by XGProxyDMatrixCreate.
c_interface_str: Null-terminated JSON document string representation of the array interface.
Returns
0 on success, -1 on failure.
Examples
external_memory.c.
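
For a dense batch, c_interface_str describes a two-dimensional buffer. The sketch below formats such a description for a row-major float32 buffer. It is a minimal illustration under stated assumptions: format_dense_f32 is a hypothetical helper, the keys follow the NumPy array interface, and the typestr "<f4" assumes a little-endian host.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: describes an n_rows x n_cols row-major float32
 * buffer as an array-interface JSON document.
 * Returns the number of characters written, or -1 if buf is too small. */
static int format_dense_f32(char *buf, size_t len, const float *data,
                            size_t n_rows, size_t n_cols) {
  int w = snprintf(buf, len,
                   "{\"data\": [%llu, true], \"shape\": [%llu, %llu], "
                   "\"typestr\": \"<f4\", \"version\": 3}",
                   (unsigned long long)(uintptr_t)data,
                   (unsigned long long)n_rows, (unsigned long long)n_cols);
  /* snprintf reports truncation by returning a length >= len. */
  return (w < 0 || (size_t)w >= len) ? -1 : w;
}
```

Inside a next callback, the formatted string would be handed to XGProxyDMatrixSetDataDense together with the proxy handle. The buffer itself is not copied into the string, so it has to stay alive while XGBoost consumes the batch.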

◆ XGQuantileDMatrixCreateFromCallback()

int XGQuantileDMatrixCreateFromCallback ( DataIterHandle  iter,
DMatrixHandle  proxy,
DataIterHandle  ref,
DataIterResetCallback *  reset,
XGDMatrixCallbackNext *  next,
char const *  config,
DMatrixHandle *  out 
)

Create a Quantile DMatrix with a data iterator.

Short note on how to use the second set of callbacks with the (GPU) hist tree method:

  • Step 0: Define a data iterator with two methods, reset and next.
  • Step 1: Create a DMatrix proxy with XGProxyDMatrixCreate and hold the handle.
  • Step 2: Pass the iterator handle, the proxy handle, and the two methods into XGQuantileDMatrixCreateFromCallback.
  • Step 3: Call the appropriate data setter inside the next function.

See test_iterative_dmatrix.cu or the Python interface for examples.

Parameters
iter: A handle to the external data iterator.
proxy: A DMatrix proxy handle created by XGProxyDMatrixCreate.
ref: Reference DMatrix for providing quantile information.
reset: Callback function that resets the iterator state.
next: Callback function that yields the next batch of data.
config: JSON-encoded parameters for DMatrix construction. Accepted fields are:
  • missing: The value used to represent missing values.
  • nthread (optional): Number of threads used for initializing the DMatrix.
  • max_bin (optional): Maximum number of bins for building the histogram.
out: The created Quantile DMatrix.
Returns
0 on success, -1 on failure.