---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All 8 model sizes are trained on the exact
same data, in the exact same order. We also provide 154 intermediate
checkpoints per model, hosted on Hugging Face as branches.

The Pythia model suite was deliberately designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar and same-sized models, such as those in the OPT and GPT-Neo suites.

<details>
<summary style="font-weight:600">Details on previous early release and naming convention.</summary>

Previously, we released an early version of the Pythia suite to the public.
However, we decided to retrain the model suite to address a few hyperparameter
discrepancies. This model card <a href="#changelog">lists the changes</a>;
see Appendix B in the Pythia paper for further discussion. We found no
difference in benchmark performance between the two Pythia versions.
The old models are
[still available](https://huggingface.co/models?other=pythia_v0), but we
suggest the retrained suite if you are just starting to use Pythia.<br>
**This is the current release.**

Please note that all models in the *Pythia* suite were renamed in January
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.
</details>
<br>

# Pythia-160M

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
for training procedure, config files, and details on how to use it.
[See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
Please read the existing *Pythia* documentation before asking about it in the
EleutherAI Discord. For general correspondence:
[contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>

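As a quick cross-check (not part of the original Pythia tooling), the architecture columns for Pythia-160M can be read back from the published model config. A minimal sketch; field names follow the Transformers `GPTNeoXConfig`:

```python
from transformers import AutoConfig

# Read the Pythia-160M config from the Hugging Face Hub and compare it with the
# table above (12 layers, model dim 768, 12 attention heads).
config = AutoConfig.from_pretrained("EleutherAI/pythia-160m")
print(config.num_hidden_layers)    # layers
print(config.hidden_size)          # model dim
print(config.num_attention_heads)  # heads
```
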
## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `step143000` corresponds exactly to the model checkpoint on the
`main` branch of each model.

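If you want to see these branches programmatically, they can be listed with the `huggingface_hub` client. A minimal sketch (return types may differ slightly across library versions):

```python
from huggingface_hub import list_repo_refs

# List the checkpoint branches (step0, step1, ..., step143000) of Pythia-160M.
refs = list_repo_refs("EleutherAI/pythia-160m")
print(sorted(branch.name for branch in refs.branches))
```
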
You may also further fine-tune and adapt Pythia-160M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-160M as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.

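As one illustration of such adaptation, the sketch below fine-tunes Pythia-160M on a plain-text corpus with the Transformers `Trainer`. The corpus file name and hyperparameters are placeholders, not recommendations:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPTNeoXForCausalLM,
    Trainer,
    TrainingArguments,
)

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer ships without a pad token

# "my_corpus.txt" is a placeholder for whatever text you fine-tune on.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-160m-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=train_dataset,
    # mlm=False yields standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
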
### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or commercial chatbots. This means Pythia-160M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token the model deems most likely need not
produce the most “accurate” text. Never rely on Pythia-160M to produce
factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pythia-160M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-160M.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here
for the `step3000` checkpoint of `pythia-70m-deduped`:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model weights from the `step3000` branch of the repository.
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# The tokenizer is identical across checkpoints; pinning the revision keeps the
# cache consistent with the model above.
tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on
the `main` branch of each model.<br>
For more information on how to use all Pythia models, see [documentation on
GitHub](https://github.com/EleutherAI/pythia).

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-160M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

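The token counts above are consistent with the step count and batch size reported below; a quick arithmetic check:

```python
# 143,000 optimizer steps at a batch size of 2,097,152 tokens per step.
tokens_per_step = 2_097_152
total_steps = 143_000

assert total_steps * tokens_per_step == 299_892_736_000  # total tokens seen
assert 1_000 * tokens_per_step == 2_097_152_000          # tokens between checkpoints
```
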
All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).

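A minimal sketch of one way to verify this locally (the comparison itself is ours, not part of the Pythia tooling):

```python
from transformers import AutoTokenizer

# Both tokenizers should encode text identically and share a vocabulary.
pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

sample = "The Pythia Scaling Suite"
assert pythia_tok(sample)["input_ids"] == neox_tok(sample)["input_ids"]
assert pythia_tok.get_vocab() == neox_tok.get_vocab()
```
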
## Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.

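To rerun a subset of these evaluations yourself, the harness exposes a Python entry point. A hedged sketch, written against the `lm-eval` 0.4-style API (earlier harness releases use a different interface):

```python
import lm_eval

# Evaluate Pythia-160M on a few of the tasks plotted below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "sciq"],
)
print(results["results"])
```
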
<details>
<summary>LAMBADA – OpenAI</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
</details>

<details>
<summary>Physical Interaction: Question Answering (PIQA)</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
</details>

<details>
<summary>WinoGrande</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
</details>

<details>
<summary>AI2 Reasoning Challenge – Easy Set</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
</details>

<details>
<summary>SciQ</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
</details>

## Changelog

This section compares differences between previously released
[Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
models. See Appendix B of the Pythia paper for further discussion of these
changes and the motivation behind them. We found that retraining Pythia had no
impact on benchmark performance.

- All model sizes are now trained with a uniform batch size of 2M tokens.
Previously, the models of size 160M, 410M, and 1.4B parameters were trained
with batch sizes of 4M tokens.
- We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
128,256,512} in addition to every 1000 training steps.
- Flash Attention was used in the new retrained suite.
- We remedied a minor inconsistency that existed in the original suite: all
models of size 2.8B parameters or smaller had a learning rate (LR) schedule
which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
12B models all used an LR schedule which decayed to a minimum LR of 0. In
the redone training runs, we rectified this inconsistency: all models were
trained with LR decaying to a minimum of 0.1× their maximum LR.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
</figure>
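
The "total params" column can be reproduced from the published weights. A minimal sketch for Pythia-160M; the printed count should match the 162,322,944 listed in the table above:

```python
from transformers import GPTNeoXForCausalLM

# Count every parameter tensor element in Pythia-160M, including the untied
# input embedding and output (unembedding) matrices.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")
```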