---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# DistilBERT base model (uncased)

This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-uncased). It was
introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found
[here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). This model is uncased: it does
not make a difference between english and English.

## Model description

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a
self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only,
with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic
process to generate inputs and labels from those texts with the BERT base model. More precisely, it was pretrained
with three objectives:

- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a
  sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the
  model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that
  usually see the words one after the other, or from autoregressive models like GPT which internally mask the future
  tokens. It allows the model to learn a bidirectional representation of the sentence.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the
  BERT base model.

This way, the model learns the same inner representation of the English language as its teacher model, while being
faster for inference or downstream tasks.
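
To make the three objectives concrete, here is a rough, simplified sketch of how they could be combined into a single
training loss. This is not the actual distillation code (see the repository linked above for that); the temperature,
loss weights, function name and the assumption that all positions contribute to every loss term are illustrative
simplifications.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      mlm_labels, temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # 1) Distillation loss: match the teacher's output distribution (soft targets).
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Masked language modeling loss: recover the original tokens at masked positions
    #    (mlm_labels is -100 everywhere except at the masked positions).
    loss_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # 3) Cosine embedding loss: pull the student's hidden states toward the teacher's.
    student_flat = student_hidden.view(-1, student_hidden.size(-1))
    teacher_flat = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(student_flat.size(0))
    loss_cos = F.cosine_embedding_loss(student_flat, teacher_flat, target)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```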

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=distilbert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT-2.
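
As an illustration of what fine-tuning looks like, the sketch below loads the pretrained weights with a randomly
initialized sequence-classification head; the `num_labels=2` choice and the example sentence are placeholders for
whatever task you fine-tune on, not something shipped with this checkpoint.

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# The classification head on top of the pretrained backbone is randomly initialized.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# The head still has to be trained (e.g. with the Trainer API) on labelled data for your task;
# until then the logits below are essentially random.
inputs = tokenizer("This movie was surprisingly good.", return_tensors='pt')
logits = model(**inputs).logits
```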

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.05292855575680733,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.03968575969338417,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a business model. [SEP]",
  'score': 0.034743521362543106,
  'token': 2449,
  'token_str': 'business'},
 {'sequence': "[CLS] hello i'm a model model. [SEP]",
  'score': 0.03462274372577667,
  'token': 2944,
  'token_str': 'model'},
 {'sequence': "[CLS] hello i'm a modeling model. [SEP]",
  'score': 0.018145186826586723,
  'token': 11643,
  'token_str': 'modeling'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
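
Here `output.last_hidden_state` is a tensor of shape `(batch_size, sequence_length, 768)`, i.e. one 768-dimensional
contextual vector per input token; pooling it (for example taking the first, `[CLS]`, position) gives a sentence-level
feature vector.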

and in TensorFlow:

```python
from transformers import DistilBertTokenizer, TFDistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions. It also inherits some of
[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")

[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
  'score': 0.1235365942120552,
  'token': 20987,
  'token_str': 'blacksmith'},
 {'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
  'score': 0.10142576694488525,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the white man worked as a farmer. [SEP]',
  'score': 0.04985016956925392,
  'token': 7500,
  'token_str': 'farmer'},
 {'sequence': '[CLS] the white man worked as a miner. [SEP]',
  'score': 0.03932540491223335,
  'token': 18594,
  'token_str': 'miner'},
 {'sequence': '[CLS] the white man worked as a butcher. [SEP]',
  'score': 0.03351764753460884,
  'token': 14998,
  'token_str': 'butcher'}]

>>> unmasker("The Black woman worked as a [MASK].")

[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
  'score': 0.13283951580524445,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
  'score': 0.12586183845996857,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the black woman worked as a maid. [SEP]',
  'score': 0.11708822101354599,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
  'score': 0.11499975621700287,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
  'score': 0.04722772538661957,
  'token': 22583,
  'token_str': 'housekeeper'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

DistilBERT was pretrained on the same data as BERT, which is [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset
consisting of 11,038 unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
(excluding lists, tables and headers).
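
If you want to look at comparable data yourself, both corpora are available through the Hugging Face `datasets`
library. Note that these are reconstructions, and the Wikipedia dump date below is an arbitrary recent snapshot, not
the one used for pretraining in 2019:

```python
from datasets import load_dataset

# Reconstructions of the two pretraining corpora (not the exact 2019 snapshots).
bookcorpus = load_dataset("bookcorpus", split="train")
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

print(bookcorpus[0]["text"])
print(wikipedia[0]["text"][:200])
```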

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, sentence B is another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.
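
A small sketch of that input format, using this model's tokenizer (the two example sentences are placeholders):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Encoding a sentence pair produces the [CLS] Sentence A [SEP] Sentence B [SEP] layout described above.
encoded = tokenizer("The cat sat on the mat.", "It then fell asleep.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> a [CLS] token, the tokens of sentence A, a [SEP], the tokens of sentence B, and a final [SEP]
```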

The details of the masking procedure for each sentence are the following (sketched in code right after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
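
A minimal sketch of this 15% / 80-10-10 policy, written as a standalone function for illustration only; the actual
preprocessing (see the training code linked in the next section) additionally avoids masking special tokens and
guarantees the random replacement differs from the original token:

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Return (inputs, labels); labels mark the positions the model has to predict."""
    inputs, labels = list(token_ids), []
    for i, token_id in enumerate(token_ids):
        if random.random() < mlm_probability:    # 15% of the tokens are selected for masking
            labels.append(token_id)              # the model must recover the original token here
            roll = random.random()
            if roll < 0.8:                       # 80% of those: replace with [MASK]
                inputs[i] = mask_token_id
            elif roll < 0.9:                     # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: leave the token unchanged
        else:
            labels.append(-100)                  # position ignored by the MLM loss
    return inputs, labels
```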

### Pretraining

The model was trained on 8 × 16 GB V100 GPUs for 90 hours. See the
[training code](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) for all hyperparameter
details.

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task  | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:-----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| Score | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |


### BibTeX entry and citation info

```bibtex
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
```

<a href="https://huggingface.co/exbert/?model=distilbert-base-uncased">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
| 219 | |