---
language:
- multilingual
- af
- sq
- ar
- an
- hy
- ast
- az
- ba
- eu
- bar
- be
- bn
- inc
- bs
- br
- bg
- my
- ca
- ceb
- ce
- zh
- cv
- hr
- cs
- da
- nl
- en
- et
- fi
- fr
- gl
- ka
- de
- el
- gu
- ht
- he
- hi
- hu
- is
- io
- id
- ga
- it
- ja
- jv
- kn
- kk
- ky
- ko
- la
- lv
- lt
- roa
- nds
- lm
- mk
- mg
- ms
- ml
- mr
- mn
- min
- ne
- new
- nb
- nn
- oc
- fa
- pms
- pl
- pt
- pa
- ro
- ru
- sco
- sr
- hr
- scn
- sk
- sl
- aze
- es
- su
- sw
- sv
- tl
- tg
- th
- ta
- tt
- te
- tr
- uk
- ud
- uz
- vi
- vo
- war
- cy
- fry
- pnb
- yo
license: apache-2.0
datasets:
- wikipedia
---

# BERT multilingual base model (cased)

Pretrained model on the top 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case sensitive: it makes a difference
between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence. A minimal code sketch of this objective is shown after this list.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict if the two sentences were following each other or not.

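
As a rough illustration of the MLM objective (not the actual pretraining code, which also applies the corruption scheme described in the preprocessing section below), the loss over a masked position can be computed as follows; the example sentence is arbitrary and assumes the target word is a single WordPiece token:

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')

inputs = tokenizer("The capital of France is [MASK].", return_tensors='pt')
targets = tokenizer("The capital of France is Paris.", return_tensors='pt')['input_ids']
# Compute the loss only on the masked position(s); -100 is ignored by the loss.
labels = targets.masked_fill(inputs['input_ids'] != tokenizer.mask_token_id, -100)

outputs = model(**inputs, labels=labels)
print(outputs.loss)  # cross-entropy loss over the masked token(s)
```
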
This way, the model learns an inner representation of the languages in the training set that can then be used to
extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a
standard classifier using the features produced by the BERT model as inputs.

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT2.

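
As an illustration, fine-tuning for sequence classification would typically start from this checkpoint with a task-specific head on top; the head weights are newly initialized and have to be trained on labeled data, and the number of labels here is just an example:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
# Pretrained encoder plus a freshly initialized classification head (2 labels as an example).
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=2)
# `model` can now be fine-tuned on a labeled dataset, for example with the Trainer API.
```
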
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] Hello I'm a model model. [SEP]",
  'score': 0.10182085633277893,
  'token': 13192,
  'token_str': 'model'},
 {'sequence': "[CLS] Hello I'm a world model. [SEP]",
  'score': 0.052126359194517136,
  'token': 11356,
  'token_str': 'world'},
 {'sequence': "[CLS] Hello I'm a data model. [SEP]",
  'score': 0.048930276185274124,
  'token': 11165,
  'token_str': 'data'},
 {'sequence': "[CLS] Hello I'm a flight model. [SEP]",
  'score': 0.02036019042134285,
  'token': 23578,
  'token_str': 'flight'},
 {'sequence': "[CLS] Hello I'm a business model. [SEP]",
  'score': 0.020079681649804115,
  'token': 14155,
  'token_str': 'business'}]
```

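
The raw model can also be used for next sentence prediction, for which there is no dedicated pipeline. A minimal sketch with arbitrary example sentences (in the output logits, index 0 means "sentence B follows sentence A" and index 1 means it does not):

```python
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-multilingual-cased')

sentence_a = "The weather was terrible."
sentence_b = "So we stayed at home."
encoding = tokenizer(sentence_a, sentence_b, return_tensors='pt')

with torch.no_grad():
    logits = model(**encoding).logits
# Probability that sentence B follows sentence A (index 0) vs. is a random sentence (index 1).
print(torch.softmax(logits, dim=-1))
```
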
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

# Load the tokenizer and the pretrained model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# Tokenize the text and run a forward pass to get the hidden states
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel

# Load the tokenizer and the pretrained model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = TFBertModel.from_pretrained('bert-base-multilingual-cased')

# Tokenize the text and run a forward pass to get the hidden states
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

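
If you only want to use these features as fixed inputs to a conventional classifier rather than fine-tuning end to end, one possible sketch (assuming scikit-learn is installed; the sentences and labels are toy placeholders) is to take the final hidden state of the `[CLS]` token as the sentence representation:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

sentences = ["I loved this film.", "This film was terrible."]  # toy examples
labels = [1, 0]                                                # toy labels

with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors='pt')
    # Final hidden state of the [CLS] token (first position) for each sentence
    features = model(**encoded).last_hidden_state[:, 0, :].numpy()

# Any standard classifier can consume these fixed features
clf = LogisticRegression().fit(features, labels)
```
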
## Training data

The BERT model was pretrained on the 104 languages with the largest Wikipedias. You can find the complete list
[here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece and a shared vocabulary of 110,000 tokens. The languages with a larger
Wikipedia are under-sampled and the ones with lower resources are oversampled. For languages such as Chinese, Japanese
Kanji and Korean Hanja, whose scripts do not use spaces, spaces are added around every character in the CJK Unicode
range.


The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, sentence B is a random sentence from the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the combined length of the
two "sentences" is less than 512 tokens.

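
This is exactly the layout the tokenizer produces when it is given a pair of texts, so you can reproduce it directly (the example sentences are arbitrary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
encoded = tokenizer("This is sentence A.", "This is sentence B.")
print(tokenizer.decode(encoded['input_ids']))
# -> [CLS] This is sentence A. [SEP] This is sentence B. [SEP]
```
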
The details of the masking procedure for each sentence are the following (a small sketch of the scheme follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.

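
A minimal sketch of that corruption scheme over a list of token ids, in plain Python rather than the original pretraining code (`mask_token_id` and `vocab_size` stand in for the real tokenizer values):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt `token_ids` with the 15% / 80-10-10 scheme and return (inputs, labels)."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:             # select 15% of the tokens
            labels.append(tok)                     # these positions must be predicted
            r = random.random()
            if r < 0.8:                            # 80%: replace with [MASK]
                inputs[i] = mask_token_id
            elif r < 0.9:                          # 10%: replace with a different random token
                new_tok = random.randrange(vocab_size)
                while new_tok == tok:
                    new_tok = random.randrange(vocab_size)
                inputs[i] = new_tok
            # remaining 10%: keep the original token
        else:
            labels.append(None)                    # position ignored by the loss
    return inputs, labels
```
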

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```