Text Input Format of DMatrix
Basic Input Format
XGBoost currently supports two text formats for ingesting data: LIBSVM and CSV. The rest of this document will describe the LIBSVM format. (See this Wikipedia article for a description of the CSV format.). Please be careful that, XGBoost does not understand file extensions, nor try to guess the file format, as there is no universal agreement upon file extension of LIBSVM or CSV. Instead it employs URI format for specifying the precise input file type. For example if you provide a csv file
./data.train.csv as input, XGBoost will blindly use the default LIBSVM parser to digest it and generate a parser error. Instead, users need to provide an URI in the form of
train.csv?format=csv. For external memory input, the URI should of a form similar to
train.csv?format=csv#dtrain.cache. See Data Interface and Using XGBoost External Memory Version also.
For training or predicting, XGBoost takes an instance file with the format as below:
1 101:1.2 102:0.03 0 1:2.1 10001:300 10002:400 0 0:1.3 1:0.3 1 0:0.01 1:0.3 0 0:0.2 1:0.3
Each line represent a single instance, and in the first line ‘1’ is the instance label, ‘101’ and ‘102’ are feature indices, ‘1.2’ and ‘0.03’ are feature values. In the binary classification case, ‘1’ is used to indicate positive samples, and ‘0’ is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive.
Auxiliary Files for Additional Information
Note: all information below is applicable only to single-node version of the package. If you’d like to perform distributed training with multiple nodes, skip to the section Embedding additional information inside LIBSVM file.
Group Input Format
For ranking task, XGBoost supports the group input format. In ranking task, instances are categorized into query groups in real world scenarios. For example, in the learning to rank web pages scenario, the web page instances are grouped by their queries. XGBoost requires an file that indicates the group information. For example, if the instance file is the
train.txt shown above, the group file should be named
train.txt.group and be of the following format:
This means that, the data set contains 5 instances, and the first two instances are in a group and the other three are in another group. The numbers in the group file are actually indicating the number of instances in each group in the instance file in order.
At the time of configuration, you do not have to indicate the path of the group file. If the instance file name is
xxx, XGBoost will check whether there is a file named
xxx.group in the same directory.
Instance Weight File
Instances in the training data may be assigned weights to differentiate relative importance among them. For example, if we provide an instance weight file for the
train.txt file in the example as below:
1 0.5 0.5 1 0.5
It means that XGBoost will emphasize more on the first and fourth instance (i.e. the positive instances) while training.
The configuration is similar to configuring the group information. If the instance file name is
xxx, XGBoost will look for a file named
xxx.weight in the same directory. If the file exists, the instance weights will be extracted and used at the time of training.
Binary buffer format and instance weights
If you choose to save the training data as a binary buffer (using
save_binary()), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use the
Initial Margin File
XGBoost supports providing each instance an initial margin prediction. For example, if we have a initial prediction using logistic regression for
train.txt file, we can create the following file:
-0.4 1.0 3.4
XGBoost will take these values as initial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use
pred_margin=1 to output margin values.
Embedding additional information inside LIBSVM file
This section is applicable to both single- and multiple-node settings.
Query ID Columns
This is most useful for ranking task, where the instances are grouped into query groups. You may embed query group ID for each instance in the LIBSVM file by adding a token of form
qid:xx in each row:
1 qid:1 101:1.2 102:0.03 0 qid:1 1:2.1 10001:300 10002:400 0 qid:2 0:1.3 1:0.3 1 qid:2 0:0.01 1:0.3 0 qid:3 0:0.2 1:0.3 1 qid:3 3:-0.1 10:-0.3 0 qid:3 6:0.2 10:0.15
Keep in mind the following restrictions:
You are not allowed to specify query ID’s for some instances but not for others. Either every row is assigned query ID’s or none at all.
The rows have to be sorted in ascending order by the query IDs. So, for instance, you may not have one row having large query ID than any of the following rows.
You may specify instance weights in the LIBSVM file by appending each instance label with the corresponding weight in the form of
[label]:[weight], as shown by the following example:
1:1.0 101:1.2 102:0.03 0:0.5 1:2.1 10001:300 10002:400 0:0.5 0:1.3 1:0.3 1:1.0 0:0.01 1:0.3 0:0.5 0:0.2 1:0.3
where the negative instances are assigned half weights compared to the positive instances.