---
license: cc-by-nc-4.0
language:
- en
tags:
- instruction-finetuning
pretty_name: Alpaca
task_categories:
- text-generation
---

# Dataset Card for Alpaca

## Dataset Description

- **Homepage:** https://crfm.stanford.edu/2023/03/13/alpaca.html
- **Repository:** https://github.com/tatsu-lab/stanford_alpaca
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** Rohan Taori

### Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better.

The authors built on the data generation pipeline of the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirements for instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the distinction between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of the 2 to 3 instances in Self-Instruct.

This produced an instruction-following dataset of 52K examples at a much lower cost (less than $500).
In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

### Supported Tasks and Leaderboards

The Alpaca dataset is designed for instruction-tuning pretrained language models.

### Languages

The data in Alpaca are in English (BCP-47 en).

## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 52K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `text-davinci-003`.
* `text`: the `instruction`, `input`, and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models.

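To illustrate how the `text` field is assembled from the other three fields, here is a minimal Python sketch of the prompt template. The with-input variant is copied from the example instance above; the no-input variant is an assumption based on the upstream `stanford_alpaca` repository, and the helper name `build_text` is hypothetical, not the authors' code.

```python
def build_text(instruction: str, input: str = "", output: str = "") -> str:
    """Reconstruct the `text` field from `instruction`, `input`, and `output`.

    The with-input template mirrors the example instance in this card;
    the no-input variant is an assumption based on the upstream repo.
    """
    if input:
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    # In the dataset, the model's answer follows the response header.
    return prompt + output
```

At fine-tuning time the loss is typically computed only on the part after `### Response:`, but that detail belongs to the training code, not the dataset itself.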
### Data Splits

|        | train |
|--------|------:|
| alpaca | 52002 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

Excerpt from the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset:

> We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models. At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release. Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions. Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies. Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement. We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and to propose new methods to filter or improve the imperfections.
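As one minimal example of the kind of filtering the paragraph above encourages, the sketch below drops examples with empty outputs and exact-duplicate (instruction, input) pairs. The criteria and helper name are illustrative assumptions, not part of the dataset release.

```python
def filter_examples(examples):
    """Keep examples with non-empty outputs and unique
    (instruction, input) pairs. Criteria are illustrative only."""
    seen = set()
    kept = []
    for ex in examples:
        if not ex["output"].strip():
            continue  # drop empty generations
        key = (ex["instruction"].strip(), ex["input"].strip())
        if key in seen:
            continue  # drop exact duplicates, keeping the first occurrence
        seen.add(key)
        kept.append(ex)
    return kept
```

More sophisticated pipelines might also score outputs with another model or remove near-duplicates via embedding similarity; this sketch only covers the trivial cases.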

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.

### Citation Information

```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

### Contributions

[More Information Needed]