---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---

# RoBERTa large model

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
distinguishes between english and English.

Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by
the Hugging Face team.

## Model description

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.
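The contrast with autoregressive models can be made concrete with a small sketch. The helper `visibility` below is hypothetical and not part of any library; it only builds a boolean matrix in which entry `[i][j]` says whether position `i` may look at position `j`. A causal (GPT-style) model sees positions up to and including its own, while an MLM-style encoder sees the whole sentence.

```python
def visibility(n_tokens, causal):
    """Return an n x n matrix; row i says which positions token i may attend to."""
    return [
        [(j <= i) if causal else True for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


causal_mask = visibility(4, causal=True)          # future positions hidden
bidirectional_mask = visibility(4, causal=False)  # every position visible

# Print the causal pattern: "x" = visible, "." = hidden.
for row in causal_mask:
    print("".join("x" if visible else "." for visible in row))
```

In the causal case the printed pattern is lower-triangular; in the bidirectional case every entry is visible, which is what lets the model use context on both sides of a masked word.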

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-large')
>>> unmasker("Hello I'm a <mask> model.")

[{'sequence': "<s>Hello I'm a male model.</s>",
  'score': 0.3317350447177887,
  'token': 2943,
  'token_str': 'Ġmale'},
 {'sequence': "<s>Hello I'm a fashion model.</s>",
  'score': 0.14171843230724335,
  'token': 2734,
  'token_str': 'Ġfashion'},
 {'sequence': "<s>Hello I'm a professional model.</s>",
  'score': 0.04291723668575287,
  'token': 2038,
  'token_str': 'Ġprofessional'},
 {'sequence': "<s>Hello I'm a freelance model.</s>",
  'score': 0.02134818211197853,
  'token': 18150,
  'token_str': 'Ġfreelance'},
 {'sequence': "<s>Hello I'm a young model.</s>",
  'score': 0.021098261699080467,
  'token': 664,
  'token_str': 'Ġyoung'}]
```
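The `Ġ` at the start of each `token_str` is how the byte-level BPE vocabulary marks a token that is preceded by a space. A minimal sketch for turning such a result list into plain `(word, score)` pairs follows. `top_words` is a hypothetical helper, and `results` is copied (and shortened) from the output above rather than produced by a live model call.

```python
def top_words(results):
    """Map fill-mask results to (word, score) pairs, dropping the 'Ġ' space marker."""
    return [
        (r["token_str"].lstrip("Ġ").strip(), round(r["score"], 3))
        for r in results
    ]


# First three predictions from the example output above.
results = [
    {"score": 0.3317350447177887, "token": 2943, "token_str": "Ġmale"},
    {"score": 0.14171843230724335, "token": 2734, "token_str": "Ġfashion"},
    {"score": 0.04291723668575287, "token": 2038, "token_str": "Ġprofessional"},
]

words = top_words(results)
print(words)
```

The extra `.strip()` is there on the assumption that some library versions return the marker already decoded to a plain leading space.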

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# Forward pass; the output holds the per-token features
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = TFRobertaModel.from_pretrained('roberta-large')
text = "Replace me by any text you'd like."
# Tokenize the text and return TensorFlow tensors
encoded_input = tokenizer(text, return_tensors='tf')
# Forward pass; the output holds the per-token features
output = model(encoded_input)
```
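The model returns one feature vector per token, while a standard classifier usually expects one vector per text. One common way to bridge the two (an assumption here, not something this model card prescribes) is masked mean pooling: average the token vectors while skipping padding positions. The sketch below uses plain Python lists as stand-ins for the model's per-token features and the tokenizer's attention mask; `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy stand-in: three token vectors of size 2, the last one being padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real model outputs, the same function could be fed the per-token hidden states and the attention mask converted to lists; the exact attribute names depend on the library version.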

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-large')
>>> unmasker("The man worked as a <mask>.")

[{'sequence': '<s>The man worked as a mechanic.</s>',
  'score': 0.08260300755500793,
  'token': 25682,
  'token_str': 'Ġmechanic'},
 {'sequence': '<s>The man worked as a driver.</s>',
  'score': 0.05736079439520836,
  'token': 1393,
  'token_str': 'Ġdriver'},
 {'sequence': '<s>The man worked as a teacher.</s>',
  'score': 0.04709019884467125,
  'token': 3254,
  'token_str': 'Ġteacher'},
 {'sequence': '<s>The man worked as a bartender.</s>',
  'score': 0.04641604796051979,
  'token': 33080,
  'token_str': 'Ġbartender'},
 {'sequence': '<s>The man worked as a waiter.</s>',
  'score': 0.04239227622747421,
  'token': 38233,
  'token_str': 'Ġwaiter'}]

>>> unmasker("The woman worked as a <mask>.")

[{'sequence': '<s>The woman worked as a nurse.</s>',
  'score': 0.2667474150657654,
  'token': 9008,
  'token_str': 'Ġnurse'},
 {'sequence': '<s>The woman worked as a waitress.</s>',
  'score': 0.12280137836933136,
  'token': 35698,
  'token_str': 'Ġwaitress'},
 {'sequence': '<s>The woman worked as a teacher.</s>',
  'score': 0.09747499972581863,
  'token': 3254,
  'token_str': 'Ġteacher'},
 {'sequence': '<s>The woman worked as a secretary.</s>',
  'score': 0.05783602222800255,
  'token': 2971,
  'token_str': 'Ġsecretary'},
 {'sequence': '<s>The woman worked as a cleaner.</s>',
  'score': 0.05576248839497566,
  'token': 16126,
  'token_str': 'Ġcleaner'}]
```
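A quick way to see how far the two prediction lists diverge is to compare them as sets. The sketch below hard-codes the top-5 words from the outputs above instead of calling the model, so it only illustrates the comparison; the variable names are chosen here for illustration.

```python
# Top-5 predictions copied from the two outputs above.
man = ["mechanic", "driver", "teacher", "bartender", "waiter"]
woman = ["nurse", "waitress", "teacher", "secretary", "cleaner"]

shared = sorted(set(man) & set(woman))
only_man = [w for w in man if w not in woman]
only_woman = [w for w in woman if w not in man]

print("shared:", shared)          # shared: ['teacher']
print("only man:", only_man)
print("only woman:", only_woman)
```

Only one of the five occupations appears for both prompts, even though the prompts differ by a single word.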

This bias will also affect all fine-tuned versions of this model.

## Training data

The RoBERTa model was pretrained on the union of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.

Together these datasets contain 160GB of text.

## Training procedure

### Preprocessing

The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
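The idea of fixed-length pieces that span documents can be sketched as follows. This is a simplified illustration, not the original preprocessing code: `pack_documents` is a hypothetical helper, it ignores sentence boundaries, and it drops the incomplete tail, all of which are assumptions made for brevity. A block size of 4 is used instead of 512 so the result is easy to read.

```python
def pack_documents(documents, block_size, bos="<s>", eos="</s>"):
    """Wrap each tokenized document in bos/eos, concatenate, and cut into full blocks."""
    stream = []
    for doc in documents:
        stream.append(bos)   # beginning of a new document
        stream.extend(doc)
        stream.append(eos)   # end of the document
    # Keep only complete blocks; the trailing remainder is dropped in this sketch.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


docs = [["a", "b", "c"], ["d", "e"], ["f"]]  # three already-tokenized toy documents
blocks = pack_documents(docs, block_size=4)
for block in blocks:
    print(block)
```

The second block starts with `</s>` followed by `<s>`, which shows a single input crossing a document boundary.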

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
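The masking rules above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the fairseq implementation: `mask_tokens` is a hypothetical helper, the token ids and `mask_id=4` are made up, and `-100` as the "no prediction needed" label follows a common PyTorch convention rather than anything stated in this card. Calling the function again on the same input draws a fresh mask, which is the essence of dynamic masking.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are -100 wherever no prediction is required."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            candidate = rng.randrange(vocab_size)
            while candidate == tok:
                candidate = rng.randrange(vocab_size)
            inputs[i] = candidate
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
tokens = list(range(100, 1100))  # 1,000 hypothetical token ids
epoch1, labels1 = mask_tokens(tokens, vocab_size=50000, mask_id=4, rng=rng)
epoch2, labels2 = mask_tokens(tokens, vocab_size=50000, mask_id=4, rng=rng)

selected = sum(label != -100 for label in labels1)
print("fraction selected:", selected / len(tokens))
print("masks differ between epochs:", epoch1 != epoch2)
```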

### Pretraining

The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 4e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning
rate afterwards.
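The warmup-then-decay schedule can be written as a small function. This is a sketch under assumptions: the warmup is taken to be linear from zero, the decay is taken to reach zero exactly at the last of the 500K steps, and `learning_rate` is a hypothetical helper, not a library call.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=30_000, total_steps=500_000):
    """Linear warmup to peak_lr, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Start, mid-warmup, peak, mid-decay, and end of training.
for step in (0, 15_000, 30_000, 265_000, 500_000):
    print(step, learning_rate(step))
```

The rate rises to the peak of 4e-4 at step 30,000 and is back to half of that at step 265,000, the midpoint of the decay phase.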

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 90.2 | 92.2 | 94.7 | 96.4  | 68.0 | 96.4  | 90.9 | 86.6 |


### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
  author    = {Yinhan Liu and
               Myle Ott and
               Naman Goyal and
               Jingfei Du and
               Mandar Joshi and
               Danqi Chen and
               Omer Levy and
               Mike Lewis and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal   = {CoRR},
  volume    = {abs/1907.11692},
  year      = {2019},
  url       = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint    = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=roberta-base">
    <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>