Using XGBoost External Memory Version (beta)

Using the external memory version is almost the same as using the in-memory version. The only difference is the filename format.

The external memory version takes in the following URI format:

filename#cacheprefix

The filename is the normal path to the libsvm-format file you want to load, and cacheprefix is a path to a cache file that XGBoost will use to cache preprocessed data in binary form.
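
To make the URI format concrete, here is a minimal Python sketch that splits such a string into its two parts. The helper split_external_memory_uri and the file names are hypothetical and only illustrate the convention; this is not XGBoost's own parsing code.

```python
def split_external_memory_uri(uri):
    """Split 'filename#cacheprefix' into (filename, cacheprefix).

    Hypothetical helper for illustration; XGBoost parses the URI internally.
    cacheprefix is None when the URI contains no '#', i.e. no cache is requested.
    """
    filename, sep, cacheprefix = uri.partition("#")
    return filename, (cacheprefix if sep else None)


print(split_external_memory_uri("train.libsvm#train.cache"))  # ('train.libsvm', 'train.cache')
print(split_external_memory_uri("train.libsvm"))              # ('train.libsvm', None)
```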

Note

External memory is also available with GPU algorithms (i.e., when tree_method is set to gpu_hist).

As a simple illustration, the following code is extracted from demo/guide-python/external_memory.py. If you have a dataset stored in a file similar to agaricus.txt.train in libSVM format, external memory support can be enabled by:

import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

XGBoost will first load agaricus.txt.train, preprocess it, and then write the result to a new file named dtrain.cache, an on-disk cache that stores the preprocessed data in an internal binary format. For more notes about text input formats, see Text Input Format of DMatrix.
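
Once created, the cached DMatrix can be passed to training as usual. The sketch below assumes that the xgboost Python package is installed, that the agaricus demo file exists at the path shown, and that the installed release accepts the URI form described here. The function name, booster parameters, and round count are illustrative choices, not taken from the demo script.

```python
def train_with_external_memory(uri, num_round=2):
    """Train a small booster from a 'filename#cacheprefix' URI."""
    import xgboost as xgb  # imported lazily so the definition loads without xgboost

    dtrain = xgb.DMatrix(uri)  # the '#cacheprefix' suffix requests the on-disk cache
    # Illustrative parameters (an assumption, not prescribed by this document).
    params = {"objective": "binary:logistic", "max_depth": 2, "eta": 1.0}
    return xgb.train(params, dtrain, num_boost_round=num_round)


# Example call (requires xgboost and the demo data):
# bst = train_with_external_memory("../data/agaricus.txt.train#dtrain.cache")
```

Apart from the URI passed to DMatrix, the training call is the same as in the in-memory case.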

For the CLI version, simply add the cache suffix to the data path, e.g. "../data/agaricus.txt.train#dtrain.cache".

Performance Note

  • The parameter nthread should be set to the number of physical cores

    • Most modern CPUs use hyperthreading, which means a 4-core CPU may expose 8 threads

    • Set nthread to 4 for maximum performance in such a case
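
A rough way to derive this value in Python is sketched below. The standard library only reports logical CPUs, so the helper assumes two hardware threads per physical core (typical with hyperthreading, but not guaranteed). The name suggested_nthread is hypothetical, and the result should be checked against the actual CPU topology.

```python
import os


def suggested_nthread(logical_cpus=None, threads_per_core=2):
    """Estimate the physical core count from the logical CPU count.

    threads_per_core=2 is an assumption (hyperthreading enabled).
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, logical_cpus // threads_per_core)


# A 4-core CPU that exposes 8 hardware threads, as in the note above.
params = {"nthread": suggested_nthread(logical_cpus=8)}
print(params)  # {'nthread': 4}
```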

Distributed Version

The external memory mode works naturally with the distributed version. You can simply set the data path like this:

data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data at the local location. When you run on YARN, the current folder is temporary, so you can directly use dtrain.cache to cache to the current folder.

Usage Note

  • This is an experimental version

  • Currently only importing from libsvm format is supported

    • Contributions of ingestion from other common external memory data sources are welcome

  • OSX is not tested