---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---

# RoBERTa base model

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between english and English.

Disclaimer: The team releasing RoBERTa did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), using an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.

| 38 | |
| 39 | You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. |
| 40 | See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that |
| 41 | interests you. |
| 42 | |
| 43 | Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) |
| 44 | to make decisions, such as sequence classification, token classification or question answering. For tasks such as text |
| 45 | generation you should look at a model like GPT2. |
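
As an illustration, here is a minimal sketch of what fine-tuning for sequence classification could look like. The texts, labels and two-class setup below are made up for the example; a real run would iterate over a labeled dataset with an optimizer:

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# num_labels is a made-up choice for this toy two-class example
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

texts = ["I loved this movie.", "This was a waste of time."]  # hypothetical data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)

outputs.loss.backward()  # one backward pass; a full training loop would follow
```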

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")

[{'sequence': "<s>Hello I'm a male model.</s>",
  'score': 0.3306540250778198,
  'token': 2943,
  'token_str': 'Ġmale'},
 {'sequence': "<s>Hello I'm a female model.</s>",
  'score': 0.04655390977859497,
  'token': 2182,
  'token_str': 'Ġfemale'},
 {'sequence': "<s>Hello I'm a professional model.</s>",
  'score': 0.04232972860336304,
  'token': 2038,
  'token_str': 'Ġprofessional'},
 {'sequence': "<s>Hello I'm a fashion model.</s>",
  'score': 0.037216778844594955,
  'token': 2734,
  'token_str': 'Ġfashion'},
 {'sequence': "<s>Hello I'm a Russian model.</s>",
  'score': 0.03253649175167084,
  'token': 1083,
  'token_str': 'ĠRussian'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("The man worked as a <mask>.")

[{'sequence': '<s>The man worked as a mechanic.</s>',
  'score': 0.08702439814805984,
  'token': 25682,
  'token_str': 'Ġmechanic'},
 {'sequence': '<s>The man worked as a waiter.</s>',
  'score': 0.0819653645157814,
  'token': 38233,
  'token_str': 'Ġwaiter'},
 {'sequence': '<s>The man worked as a butcher.</s>',
  'score': 0.073323555290699,
  'token': 32364,
  'token_str': 'Ġbutcher'},
 {'sequence': '<s>The man worked as a miner.</s>',
  'score': 0.046322137117385864,
  'token': 18678,
  'token_str': 'Ġminer'},
 {'sequence': '<s>The man worked as a guard.</s>',
  'score': 0.040150221437215805,
  'token': 2510,
  'token_str': 'Ġguard'}]

>>> unmasker("The Black woman worked as a <mask>.")

[{'sequence': '<s>The Black woman worked as a waitress.</s>',
  'score': 0.22177888453006744,
  'token': 35698,
  'token_str': 'Ġwaitress'},
 {'sequence': '<s>The Black woman worked as a prostitute.</s>',
  'score': 0.19288744032382965,
  'token': 36289,
  'token_str': 'Ġprostitute'},
 {'sequence': '<s>The Black woman worked as a maid.</s>',
  'score': 0.06498628109693527,
  'token': 29754,
  'token_str': 'Ġmaid'},
 {'sequence': '<s>The Black woman worked as a secretary.</s>',
  'score': 0.05375480651855469,
  'token': 2971,
  'token_str': 'Ġsecretary'},
 {'sequence': '<s>The Black woman worked as a nurse.</s>',
  'score': 0.05245552211999893,
  'token': 9008,
  'token_str': 'Ġnurse'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The RoBERTa model was pretrained on the union of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.

Together these datasets total 160GB of text.

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
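
For instance, a quick way to inspect the byte-level BPE tokens and these special markers for a single sequence (the printed tokens are indicative, not an exact transcript):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
ids = tokenizer("Hello world")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# roughly: ['<s>', 'Hello', 'Ġworld', '</s>'] (the 'Ġ' marks a preceding space)
```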

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
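
As a rough sketch of that 80/10/10 rule (illustrative Python, not the original fairseq implementation; the helper name is invented):

```python
import random

def dynamically_mask(tokens, vocab, mask_token="<mask>", mask_prob=0.15):
    """Corrupt a token sequence; a fresh random draw each call, so the masks change every epoch."""
    corrupted = []
    for tok in tokens:
        if random.random() >= mask_prob:
            corrupted.append(tok)                   # 85%: left untouched, not predicted
            continue
        r = random.random()
        if r < 0.8:
            corrupted.append(mask_token)            # 80% of masked positions: <mask>
        elif r < 0.9:
            corrupted.append(random.choice(vocab))  # 10%: a random token
        else:
            corrupted.append(tok)                   # 10%: kept as is (still predicted)
    return corrupted
```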

### Pretraining

The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate after.
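
That optimizer and schedule can be sketched with standard PyTorch and `transformers` helpers (step counts and hyperparameters copied from the description above; this is an illustration, not the original fairseq training script):

```python
import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained('roberta-base')
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4,
                             betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)
# 24,000 warmup steps, then linear decay over the 500K total steps
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=24_000,
                                            num_training_steps=500_000)
# after each optimization step: optimizer.step(); scheduler.step()
```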

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 87.6 | 91.9 | 92.8 | 94.8  | 63.6 | 91.2  | 90.2 | 78.7 |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
  author        = {Yinhan Liu and
                   Myle Ott and
                   Naman Goyal and
                   Jingfei Du and
                   Mandar Joshi and
                   Danqi Chen and
                   Omer Levy and
                   Mike Lewis and
                   Luke Zettlemoyer and
                   Veselin Stoyanov},
  title         = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal       = {CoRR},
  volume        = {abs/1907.11692},
  year          = {2019},
  url           = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint        = {1907.11692},
  timestamp     = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=roberta-base">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>