---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All 8 model sizes are trained on the exact
same data, in the exact same order. We also provide 154 intermediate
checkpoints per model, hosted on Hugging Face as branches.

The Pythia model suite was deliberately designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar and same-sized models, such as those in the OPT and GPT-Neo suites.

<details>
<summary style="font-weight:600">Details on previous early release and naming convention.</summary>

Previously, we released an early version of the Pythia suite to the public.
However, we decided to retrain the model suite to address a few hyperparameter
discrepancies. This model card <a href="#changelog">lists the changes</a>;
see Appendix B in the Pythia paper for further discussion. We found no
difference in benchmark performance between the two Pythia versions.
The old models are
[still available](https://huggingface.co/models?other=pythia_v0), but we
suggest the retrained suite if you are just starting to use Pythia.<br>
**This is the current release.**

Please note that all models in the *Pythia* suite were renamed in January
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.
</details>
<br>

# Pythia-160M

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
for training procedure, config files, and details on how to use it.
[See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
Please read the existing *Pythia* documentation before asking about it in the
EleutherAI Discord. For general correspondence:
[contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>

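As a quick cross-check (not part of the original Pythia tooling), the architecture columns for Pythia-160M can be read back from the published model config. A minimal sketch; field names follow the Transformers `GPTNeoXConfig`:

```python
from transformers import AutoConfig

# Read the Pythia-160M config from the Hugging Face Hub and compare it with the
# table above (12 layers, model dim 768, 12 attention heads).
config = AutoConfig.from_pretrained("EleutherAI/pythia-160m")
print(config.num_hidden_layers)    # layers
print(config.hidden_size)          # model dim
print(config.num_attention_heads)  # heads
```
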
## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `step143000` corresponds exactly to the model checkpoint on the
`main` branch of each model.

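If you want to see these branches programmatically, they can be listed with the `huggingface_hub` client. A minimal sketch (return types may differ slightly across library versions):

```python
from huggingface_hub import list_repo_refs

# List the checkpoint branches (step0, step1, ..., step143000) of Pythia-160M.
refs = list_repo_refs("EleutherAI/pythia-160m")
print(sorted(branch.name for branch in refs.branches))
```
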
You may also further fine-tune and adapt Pythia-160M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-160M as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.

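As one illustration of such adaptation, the sketch below fine-tunes Pythia-160M on a plain-text corpus with the Transformers `Trainer`. The corpus file name and hyperparameters are placeholders, not recommendations:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPTNeoXForCausalLM,
    Trainer,
    TrainingArguments,
)

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer ships without a pad token

# "my_corpus.txt" is a placeholder for whatever text you fine-tune on.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-160m-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=train_dataset,
    # mlm=False yields standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
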
### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or commercial chatbots. This means Pythia-160M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token the model deems most likely need not
produce the most “accurate” text. Never rely on Pythia-160M to produce
factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pythia-160M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-160M.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here
for the `step3000` checkpoint of `pythia-70m-deduped`:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model weights from the `step3000` branch of the repository.
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# The tokenizer is identical across checkpoints; pinning the revision keeps the
# cache consistent with the model above.
tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on
the `main` branch of each model.<br>
For more information on how to use all Pythia models, see [documentation on
GitHub](https://github.com/EleutherAI/pythia).

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-160M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

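The token counts above are consistent with the step count and batch size reported below; a quick arithmetic check:

```python
# 143,000 optimizer steps at a batch size of 2,097,152 tokens per step.
tokens_per_step = 2_097_152
total_steps = 143_000

assert total_steps * tokens_per_step == 299_892_736_000  # total tokens seen
assert 1_000 * tokens_per_step == 2_097_152_000          # tokens between checkpoints
```
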
All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).

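A minimal sketch of one way to verify this locally (the comparison itself is ours, not part of the Pythia tooling):

```python
from transformers import AutoTokenizer

# Both tokenizers should encode text identically and share a vocabulary.
pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

sample = "The Pythia Scaling Suite"
assert pythia_tok(sample)["input_ids"] == neox_tok(sample)["input_ids"]
assert pythia_tok.get_vocab() == neox_tok.get_vocab()
```
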
## Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.

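To rerun a subset of these evaluations yourself, the harness exposes a Python entry point. A hedged sketch, written against the `lm-eval` 0.4-style API (earlier harness releases use a different interface):

```python
import lm_eval

# Evaluate Pythia-160M on a few of the tasks plotted below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "sciq"],
)
print(results["results"])
```
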
<details>
<summary>LAMBADA – OpenAI</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
</details>

<details>
<summary>Physical Interaction: Question Answering (PIQA)</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
</details>

<details>
<summary>WinoGrande</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
</details>

<details>
<summary>AI2 Reasoning Challenge – Easy Set</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
</details>

<details>
<summary>SciQ</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
</details>

## Changelog

This section compares differences between previously released
[Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
models. See Appendix B of the Pythia paper for further discussion of these
changes and the motivation behind them. We found that retraining Pythia had no
impact on benchmark performance.

- All model sizes are now trained with a uniform batch size of 2M tokens.
Previously, the models of size 160M, 410M, and 1.4B parameters were trained
with batch sizes of 4M tokens.
- We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
128,256,512} in addition to every 1000 training steps.
- Flash Attention was used in the new retrained suite.
- We remedied a minor inconsistency that existed in the original suite: all
models of size 2.8B parameters or smaller had a learning rate (LR) schedule
which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
12B models all used an LR schedule which decayed to a minimum LR of 0. In
the redone training runs, we rectified this inconsistency: all models were
trained with LR decaying to a minimum of 0.1× their maximum LR.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
</figure>
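
The "total params" column can be reproduced from the published weights. A minimal sketch for Pythia-160M; the printed count should match the 162,322,944 listed in the table above:

```python
from transformers import GPTNeoXForCausalLM

# Count every parameter tensor element in Pythia-160M, including the untied
# input embedding and output (unembedding) matrices.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")
```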