Data Preparation¶
dataset_info.json contains all pre-processed local datasets and online datasets. If you wish to use a custom dataset, you must add the definition of the dataset and its content to the dataset_info.json file.
Currently, we support datasets in Alpaca format and ShareGPT format.
Alpaca¶
The dataset format requirements for different tasks are as follows:
Supervised Fine-Tuning Datasets¶
Sample Dataset: SFT Sample Dataset
Instruct Tuning optimizes the model’s performance on specific instructions by enabling it to learn from detailed instructions and their corresponding responses.
The content of the instruction column corresponds to human instructions, the input column corresponds to human input, and the output column corresponds to the model’s response. Below is an example:
"alpaca_zh_demo.json"
{
"instruction": "计算这些物品的总费用。 ",
"input": "输入:汽车 - $3000,衣服 - $100,书 - $20。",
"output": "汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。"
},
During supervised fine-tuning, the content of the instruction column is concatenated with the content of the input column to form the final human input, i.e., instruction
input. The content of the output column serves as the model’s response. In the example above, the final human input is:
计算这些物品的总费用。
输入:汽车 - $3000,衣服 - $100,书 - $20。
The model’s response is:
汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。
If specified, the content of the system column will be used as the system prompt.
The history column is a list consisting of multiple string tuples, representing the instruction and response for each turn in the history. Note that during supervised fine-tuning, the responses in the history messages are also used for model learning.
The format requirements for Supervised Fine-Tuning datasets are as follows:
[
{
"instruction": "人类指令(必填)",
"input": "人类输入(选填)",
"output": "模型回答(必填)",
"system": "系统提示词(选填)",
"history": [
["第一轮指令(选填)", "第一轮回答(选填)"],
["第二轮指令(选填)", "第二轮回答(选填)"]
]
}
]
Below is an example of multi-turn conversation in Alpaca format. For single-turn conversations, simply omit the history column.
[
{
"instruction": "今天的天气怎么样?",
"input": "",
"output": "今天的天气不错,是晴天。",
"history": [
[
"今天会下雨吗?",
"今天不会下雨,是个好天气。"
],
[
"今天适合出去玩吗?",
"非常适合,空气质量很好。"
]
]
}
]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"system": "system",
"history": "history"
}
}
Pre-training Datasets¶
Sample Dataset: Pre-training Sample Dataset
Large Language Models (LLMs) are pre-trained by learning from unlabeled text to acquire language representations. Typically, pre-training datasets are obtained from the internet, as it provides a vast amount of text information from different domains, which helps improve the model’s generalization ability.
The text description format for pre-training datasets is as follows:
[
{"text": "document"},
{"text": "document"}
]
During pre-training, only the content in the text column (i.e., the document) is used for model learning.
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "text"
}
}
Preference Datasets¶
Preference datasets are used for Reward Model training, DPO training, and ORPO training. For a given system instruction and human input, the preference dataset provides a better response and a worse response.
Some research suggests that teaching models “what is better” can make them more aligned with human needs. It can even enable models with fewer parameters to outperform those with more parameters.
Preference datasets need to provide the better response in the chosen column and the worse response in the rejected column. The format for a single turn is as follows:
[
{
"instruction": "人类指令(必填)",
"input": "人类输入(选填)",
"chosen": "优质回答(必填)",
"rejected": "劣质回答(必填)"
}
]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"ranking": true,
"columns": {
"prompt": "instruction",
"query": "input",
"chosen": "chosen",
"rejected": "rejected"
}
}
KTO Datasets¶
KTO datasets are similar to preference datasets, but instead of providing a better and a worse response, KTO datasets only provide a true/false label for each turn.
In addition to the final human input (formed by instruction and input) and the model response output, KTO datasets require an additional kto_tag column (true/false) to represent human feedback.
- The format for a single turn is as follows:
[ { "instruction": "人类指令(必填)", "input": "人类输入(选填)", "output": "模型回答(必填)", "kto_tag": "人类反馈 [true/false](必填)" } ]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"kto_tag": "kto_tag"
}
}
Multimodal Datasets¶
Currently, we support multimodal image datasets, video datasets, and audio datasets.
Image Datasets¶
Multimodal image datasets require an additional images column containing the paths to the input images. Note that the number of images must strictly match the number of <image> tags in the text.
[
{
"instruction": "人类指令(必填)",
"input": "人类输入(选填)",
"output": "模型回答(必填)",
"images": [
"图像路径(必填)"
]
}
]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"images": "images"
}
}
Video Datasets¶
Multimodal video datasets require an additional videos column containing the paths to the input videos. Note that the number of videos must strictly match the number of <video> tags in the text.
[
{
"instruction": "人类指令(必填)",
"input": "人类输入(选填)",
"output": "模型回答(必填)",
"videos": [
"视频路径(必填)"
]
}
]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"videos": "videos"
}
}
Audio Datasets¶
Multimodal audio datasets require an additional audio column containing the paths to the input audios. Note that the number of audios must strictly match the number of <audio> tags in the text.
[
{
"instruction": "人类指令(必填)",
"input": "人类输入(选填)",
"output": "模型回答(必填)",
"audios": [
"音频路径(必填)"
]
}
]
For data in the above format, the dataset description in dataset_info.json should be:
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"audios": "audios"
}
}