---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# BERT base model (cased)

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case-sensitive: it makes a difference between
english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict if the two sentences were following each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
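
As an illustration of this feature-extraction route, here is a minimal sketch: the texts, labels and the scikit-learn classifier below are made up for illustration, and any standard classifier would do.

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

# Hypothetical labeled sentences (replace with your own dataset).
texts = ["I loved this film.", "This film was terrible."]
labels = [1, 0]

# Use the final hidden state of the [CLS] token as a fixed-size sentence feature.
with torch.no_grad():
    encoded = tokenizer(texts, padding=True, return_tensors='pt')
    features = model(**encoded).last_hidden_state[:, 0, :].numpy()

# Train any standard classifier on top of the extracted features.
classifier = LogisticRegression().fit(features, labels)
```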

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] Hello I'm a fashion model. [SEP]",
  'score': 0.09019174426794052,
  'token': 4633,
  'token_str': 'fashion'},
 {'sequence': "[CLS] Hello I'm a new model. [SEP]",
  'score': 0.06349995732307434,
  'token': 1207,
  'token_str': 'new'},
 {'sequence': "[CLS] Hello I'm a male model. [SEP]",
  'score': 0.06228214129805565,
  'token': 2581,
  'token_str': 'male'},
 {'sequence': "[CLS] Hello I'm a professional model. [SEP]",
  'score': 0.0441727414727211,
  'token': 1848,
  'token_str': 'professional'},
 {'sequence': "[CLS] Hello I'm a super model. [SEP]",
  'score': 0.03326151892542839,
  'token': 7688,
  'token_str': 'super'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
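
The raw model can also be queried for next sentence prediction. A minimal sketch in PyTorch (the two sentences below are made up for illustration; the first logit corresponds to the "sentence B follows sentence A" class):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-cased')

# Two made-up sentences for illustration.
sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."

encoding = tokenizer(sentence_a, sentence_b, return_tensors='pt')
with torch.no_grad():
    logits = model(**encoding).logits

# Index 0: "sentence B follows sentence A"; index 1: "sentence B is random".
probabilities = torch.softmax(logits, dim=-1)
print(probabilities)
```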

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] The man worked as a lawyer. [SEP]',
  'score': 0.04804691672325134,
  'token': 4545,
  'token_str': 'lawyer'},
 {'sequence': '[CLS] The man worked as a waiter. [SEP]',
  'score': 0.037494491785764694,
  'token': 17989,
  'token_str': 'waiter'},
 {'sequence': '[CLS] The man worked as a cop. [SEP]',
  'score': 0.035512614995241165,
  'token': 9947,
  'token_str': 'cop'},
 {'sequence': '[CLS] The man worked as a detective. [SEP]',
  'score': 0.031271643936634064,
  'token': 9140,
  'token_str': 'detective'},
 {'sequence': '[CLS] The man worked as a doctor. [SEP]',
  'score': 0.027423162013292313,
  'token': 3995,
  'token_str': 'doctor'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] The woman worked as a nurse. [SEP]',
  'score': 0.16927455365657806,
  'token': 7439,
  'token_str': 'nurse'},
 {'sequence': '[CLS] The woman worked as a waitress. [SEP]',
  'score': 0.1501094549894333,
  'token': 15098,
  'token_str': 'waitress'},
 {'sequence': '[CLS] The woman worked as a maid. [SEP]',
  'score': 0.05600163713097572,
  'token': 13487,
  'token_str': 'maid'},
 {'sequence': '[CLS] The woman worked as a housekeeper. [SEP]',
  'score': 0.04838843643665314,
  'token': 26458,
  'token_str': 'housekeeper'},
 {'sequence': '[CLS] The woman worked as a cook. [SEP]',
  'score': 0.029980547726154327,
  'token': 9834,
  'token_str': 'cook'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
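
The tokenizer produces this format automatically when it is given a pair of texts; a quick check (the two sentences below are placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Passing two texts encodes them as a single [CLS] ... [SEP] ... [SEP] sequence.
encoding = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoding['input_ids']))
# [CLS] Sentence A [SEP] Sentence B [SEP]
```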

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, sentence B is another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the combined length of the
two "sentences" is less than 512 tokens.

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
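
A conceptual sketch of this 80/10/10 dynamic is shown below; it is written here purely for illustration, and the actual pretraining code in the original BERT repository differs in details (whole-word handling, random tokens never matching the original, etc.):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15):
    """Illustrative 80/10/10 masking over a list of token ids."""
    input_ids, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if random.random() >= mlm_probability:
            labels.append(-100)          # not masked: ignored by the loss (a common convention)
            continue
        labels.append(token)             # masked: the model must predict the original token
        roll = random.random()
        if roll < 0.8:
            input_ids[i] = mask_id       # 80% of the time: replace with [MASK]
        elif roll < 0.9:
            input_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return input_ids, labels
```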

### Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.
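
In PyTorch, an optimizer and schedule with these hyperparameters could be set up roughly as follows; this is only a sketch (the original implementation used its own Adam-with-weight-decay variant and excluded some parameters from decay):

```python
from torch.optim import AdamW
from transformers import BertConfig, BertForPreTraining, get_linear_schedule_with_warmup

# A randomly initialized BERT-base model with MLM and NSP heads.
model = BertForPreTraining(BertConfig())

num_training_steps = 1_000_000

# Adam with the hyperparameters quoted above.
optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

# Warm up the learning rate for 10,000 steps, then decay it linearly.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=num_training_steps
)
```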

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      | 84.6/83.4   | 71.2 | 90.5 | 93.5  | 52.1 | 85.8  | 88.9 | 66.4 | 79.6    |


### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=bert-base-cased">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>