---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All eight model sizes are trained on the exact
same data, in the exact same order. We also provide 154 intermediate
checkpoints per model, hosted on Hugging Face as branches.

The Pythia model suite was deliberately designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar and same-sized models, such as those in the OPT and GPT-Neo suites.

<details>
<summary style="font-weight:600">Details on the previous early release and naming convention.</summary>

Previously, we released an early version of the Pythia suite to the public.
However, we decided to retrain the model suite to address a few hyperparameter
discrepancies. This model card <a href="#changelog">lists the changes</a>;
see Appendix B in the Pythia paper for further discussion. We found no
difference in benchmark performance between the two Pythia versions.
The old models are
[still available](https://huggingface.co/models?other=pythia_v0), but we
suggest the retrained suite if you are just starting to use Pythia.<br>
**This is the current release.**

Please note that all models in the *Pythia* suite were renamed in January
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.
</details>
<br>

# Pythia-160M

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
for training procedure, config files, and details on how to use.
[See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
Please read the existing *Pythia* documentation before asking about it in the
EleutherAI Discord. For general correspondence:
[contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>
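
As a sanity check, the parameter counts in this table can be reproduced from
the layer count and model dimension alone. The sketch below is our own
back-of-the-envelope reconstruction, not an official formula; it assumes a
GPT-NeoX-style block (QKV and output projections with biases, a 4× MLP, two
LayerNorms per layer plus a final LayerNorm), untied input/output embeddings,
and a vocabulary of 50,304 tokens:

```python
# Hypothetical reconstruction of the Pythia parameter counts.
def pythia_params(layers: int, d: int, vocab: int = 50_304):
    # Per layer: QKV (3*d*d + 3*d) + attention out projection (d*d + d)
    #          + MLP up (d*4d + 4d) + MLP down (4d*d + d)
    #          + two LayerNorms (2 * 2d)  =>  12*d^2 + 13*d
    non_embedding = layers * (12 * d * d + 13 * d) + 2 * d  # + final LayerNorm
    total = non_embedding + 2 * vocab * d  # untied input + output embeddings
    return non_embedding, total

print(pythia_params(12, 768))  # Pythia-160M -> (85056000, 162322944)
print(pythia_params(6, 512))   # Pythia-70M  -> (18915328, 70426624)
```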

## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: the initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `143000` corresponds exactly to the model checkpoint on the `main`
branch of each model.
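
Because the branch names follow a fixed scheme, the full checkpoint list can
be reconstructed programmatically; a minimal sketch:

```python
# Reconstruct all 154 checkpoint branch names: step0, ten log-spaced early
# checkpoints (step1, step2, step4, ..., step512), then every 1000 steps.
steps = [0] + [2**i for i in range(10)] + list(range(1000, 144000, 1000))
branches = [f"step{s}" for s in steps]
assert len(branches) == 154
print(branches[:5], "...", branches[-1])  # ['step0', 'step1', ...] 'step143000'
```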

You may also further fine-tune and adapt Pythia-160M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-160M as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.
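
For example, a minimal causal-LM fine-tuning sketch using the Transformers
`Trainer` API; the data file `my_corpus.txt` and all hyperparameters below are
placeholders, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPTNeoXForCausalLM, Trainer, TrainingArguments)

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.pad_token = tokenizer.eos_token  # tokenizer ships without a pad token

# "my_corpus.txt" is a placeholder for your own training text.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pythia-160m-finetuned",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    # mlm=False yields standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```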

### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or powering commercial chatbots. This means Pythia-160M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token deemed statistically most likely by the
model need not produce the most “accurate” text. Never rely on Pythia-160M to
produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pythia-160M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-160M.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here
for the third `pythia-70m-deduped` checkpoint:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# `revision` selects the checkpoint branch; here, the model state after
# 3000 training steps.
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on
the `main` branch of each model.<br>
For more information on how to use all Pythia models, see [documentation on
GitHub](https://github.com/EleutherAI/pythia).

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-160M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
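
These figures are consistent with one another, as a quick arithmetic check
confirms:

```python
# Sanity-checking the token counts quoted above.
tokens_per_step = 2_097_152       # 2M-token batch size
print(143_000 * tokens_per_step)  # 299892736000 tokens seen during training
print(1_000 * tokens_per_step)    # 2097152000 tokens between checkpoints
```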

All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
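
Since the tokenizer is shared, loading it from either repository should yield
identical encodings; a quick check (assuming both repositories ship the same
tokenizer files, as stated above):

```python
from transformers import AutoTokenizer

pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
text = "The Pile is an 825GiB general-purpose dataset."
# Same tokenizer => same token IDs for the same text.
assert pythia_tok(text)["input_ids"] == neox_tok(text)["input_ids"]
```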

## Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.
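
To reproduce a single data point yourself, the harness exposes a Python entry
point; a minimal sketch (the API has shifted between harness versions, so
treat this as indicative and consult the harness README):

```python
# Evaluate one Pythia checkpoint on two tasks with the LM Evaluation Harness
# (lm-eval >= 0.4 style API; earlier versions used a different entry point).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step143000",
    tasks=["lambada_openai", "piqa"],
)
print(results["results"])
```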

<details>
<summary>LAMBADA – OpenAI</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
</details>

<details>
<summary>Physical Interaction: Question Answering (PIQA)</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
</details>

<details>
<summary>WinoGrande</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
</details>

<details>
<summary>AI2 Reasoning Challenge – Easy Set</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
</details>

<details>
<summary>SciQ</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
</details>

## Changelog

This section lists the differences between the previously released
[Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
models. See Appendix B of the Pythia paper for further discussion of these
changes and the motivation behind them. We found that retraining Pythia had no
impact on benchmark performance.

- All model sizes are now trained with a uniform batch size of 2M tokens.
Previously, the models of size 160M, 410M, and 1.4B parameters were trained
with batch sizes of 4M tokens.
- We added checkpoints at initialization (step 0) and steps
{1,2,4,8,16,32,64,128,256,512}, in addition to every 1000 training steps.
- Flash Attention was used in the new retrained suite.
- We remedied a minor inconsistency that existed in the original suite: all
models of size 2.8B parameters or smaller had a learning rate (LR) schedule
which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
12B models used an LR schedule which decayed to a minimum LR of 0. In
the redone training runs, we rectified this inconsistency: all models are now
trained with an LR decaying to a minimum of 0.1× their maximum LR.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
</figure>