Text Input Format of DMatrix

Basic Input Format

XGBoost currently supports two text formats for ingesting data: LIBSVM and CSV. The rest of this document will describe the LIBSVM format. (See this Wikipedia article for a description of the CSV format.). Please be careful that, XGBoost does not understand file extensions, nor try to guess the file format, as there is no universal agreement upon file extension of LIBSVM or CSV. Instead it employs URI format for specifying the precise input file type. For example if you provide a csv file ./data.train.csv as input, XGBoost will blindly use the default LIBSVM parser to digest it and generate a parser error. Instead, users need to provide an URI in the form of train.csv?format=csv. For external memory input, the URI should of a form similar to train.csv?format=csv#dtrain.cache. See Data Interface and Using XGBoost External Memory Version also.

For training or predicting, XGBoost takes an instance file with the format as below:

train.txt

101:1.2 102:0.03
1:2.1 10001:300 10002:400
0:1.3 1:0.3
0:0.01 1:0.3
0:0.2 1:0.3

Each line represent a single instance, and in the first line ‘1’ is the instance label, ‘101’ and ‘102’ are feature indices, ‘1.2’ and ‘0.03’ are feature values. In the binary classification case, ‘1’ is used to indicate positive samples, and ‘0’ is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive.

Auxiliary Files for Additional Information

Note: all information below is applicable only to single-node version of the package. If you’d like to perform distributed training with multiple nodes, skip to the section Embedding additional information inside LIBSVM file.

Group Input Format

For ranking task, XGBoost supports the group input format. In ranking task, instances are categorized into query groups in real world scenarios. For example, in the learning to rank web pages scenario, the web page instances are grouped by their queries. XGBoost requires an file that indicates the group information. For example, if the instance file is the train.txt shown above, the group file should be named train.txt.group and be of the following format:

train.txt.group

2
3

This means that, the data set contains 5 instances, and the first two instances are in a group and the other three are in another group. The numbers in the group file are actually indicating the number of instances in each group in the instance file in order. At the time of configuration, you do not have to indicate the path of the group file. If the instance file name is xxx, XGBoost will check whether there is a file named xxx.group in the same directory.

Instance Weight File

Instances in the training data may be assigned weights to differentiate relative importance among them. For example, if we provide an instance weight file for the train.txt file in the example as below:

train.txt.weight

It means that XGBoost will emphasize more on the first and fourth instance (i.e. the positive instances) while training. The configuration is similar to configuring the group information. If the instance file name is xxx, XGBoost will look for a file named xxx.weight in the same directory. If the file exists, the instance weights will be extracted and used at the time of training.

Note

Binary buffer format and instance weights

If you choose to save the training data as a binary buffer (using save_binary()), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use the set_weight() function.

Initial Margin File

XGBoost supports providing each instance an initial margin prediction. For example, if we have a initial prediction using logistic regression for train.txt file, we can create the following file:

train.txt.base_margin

-0.4
1.0
3.4

XGBoost will take these values as initial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use pred_margin=1 to output margin values.

Embedding additional information inside LIBSVM file

This section is applicable to both single- and multiple-node settings.

Query ID Columns

This is most useful for ranking task, where the instances are grouped into query groups. You may embed query group ID for each instance in the LIBSVM file by adding a token of form qid:xx in each row:

train.txt

qid:1 101:1.2 102:0.03
qid:1 1:2.1 10001:300 10002:400
qid:2 0:1.3 1:0.3
qid:2 0:0.01 1:0.3
qid:3 0:0.2 1:0.3
qid:3 3:-0.1 10:-0.3
qid:3 6:0.2 10:0.15

Keep in mind the following restrictions:

You are not allowed to specify query ID’s for some instances but not for others. Either every row is assigned query ID’s or none at all.
The rows have to be sorted in ascending order by the query IDs. So, for instance, you may not have one row having large query ID than any of the following rows.

Instance weights

You may specify instance weights in the LIBSVM file by appending each instance label with the corresponding weight in the form of [label]:[weight], as shown by the following example:

train.txt

1.0 101:1.2 102:0.03
0.5 1:2.1 10001:300 10002:400
0.5 0:1.3 1:0.3
1.0 0:0.01 1:0.3
0.5 0:0.2 1:0.3

where the negative instances are assigned half weights compared to the positive instances.