---
language:
- multilingual
- en
- fr
- es
- de
- el
- bg
- ru
- tr
- ar
- vi
- th
- zh
- hi
- sw
- ur
tags:
- text-classification
- pytorch
- tensorflow
datasets:
- multi_nli
- xnli
license: mit
pipeline_tag: zero-shot-classification
widget:
- text: "За кого вы голосуете в 2020 году?"
  candidate_labels: "politique étrangère, Europe, élections, affaires, politique"
  multi_class: true
- text: "لمن تصوت في 2020؟"
  candidate_labels: "السياسة الخارجية, أوروبا, الانتخابات, الأعمال, السياسة"
  multi_class: true
- text: "2020'de kime oy vereceksiniz?"
  candidate_labels: "dış politika, Avrupa, seçimler, ticaret, siyaset"
  multi_class: true
---

# xlm-roberta-large-xnli

## Model Description

This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).

## Intended Usage

This model is intended to be used for zero-shot text classification, especially in languages other than English. It is fine-tuned on XNLI, which is a multilingual NLI dataset. The model can therefore be used with any of the languages in the XNLI corpus:

- English
- French
- Spanish
- German
- Greek
- Bulgarian
- Russian
- Turkish
- Arabic
- Vietnamese
- Thai
- Chinese
- Hindi
- Swahili
- Urdu

Since the base model was pre-trained on 100 different languages, the
model has shown some effectiveness in languages beyond those listed above as
well. See the full list of pre-trained languages in appendix A of the
[XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).

For English-only classification, it is recommended to use
[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).

#### With the zero-shot classification pipeline

The model can be loaded with the `zero-shot-classification` pipeline like so:

```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
```

You can then classify in any of the above languages. You can even pass the labels in one language and the sequence to
classify in another:

```python
# we will classify the Russian translation of, "Who are you voting for in 2020?"
sequence_to_classify = "За кого вы голосуете в 2020 году?"
# we can specify candidate labels in Russian or any other language above:
candidate_labels = ["Europe", "public health", "politics"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['politics', 'Europe', 'public health'],
#  'scores': [0.9048484563827515, 0.05722189322113991, 0.03792969882488251],
#  'sequence': 'За кого вы голосуете в 2020 году?'}
```

The default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it
may be worthwhile to translate this to the language you are working with:

```python
sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# {'labels': ['política', 'Europa', 'salud pública'],
#  'scores': [0.9109585881233215, 0.05954807624220848, 0.029493311420083046],
#  'sequence': '¿A quién vas a votar en 2020?'}
```

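As a rough illustration of what the template does: each candidate label is substituted into the `{}` placeholder to give one NLI hypothesis per label, and each hypothesis is paired with the sequence as the premise. The helper name `build_nli_pairs` is hypothetical and not part of `transformers`; this is only a sketch of the idea, not the pipeline's actual code.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "¿A quién vas a votar en 2020?",
    ["Europa", "salud pública", "política"],
    hypothesis_template="Este ejemplo es {}.",
)
for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)
```

Each pair is then scored by the NLI model, so a template that reads naturally in the target language gives the model a more natural hypothesis to judge.
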
#### With manual PyTorch

```python
# pose sequence as a NLI premise and label as a hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# example inputs: the text to classify and one candidate label
sequence = 'За кого вы голосуете в 2020 году?'
label = 'politics'

premise = sequence
hypothesis = f'This example is {label}.'

# run through the model fine-tuned on NLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]
```

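The manual approach scores one label at a time. To compare several labels, the per-label logits have to be combined somehow. The sketch below shows two common options using made-up logits, so no model download is needed. It assumes the `[contradiction, neutral, entailment]` ordering used in the snippet above, and the numbers are illustrative only, not model outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up NLI logits per label, ordered [contradiction, neutral, entailment].
logits_by_label = {
    "politics": [-2.0, 0.5, 3.0],
    "Europe": [1.0, 0.5, 0.2],
    "public health": [2.5, 0.3, -1.5],
}

# Independent labels: for each label, softmax over [contradiction, entailment]
# and keep the entailment probability. Scores need not sum to 1.
independent = {
    label: softmax([logits[0], logits[2]])[1]
    for label, logits in logits_by_label.items()
}

# Mutually exclusive labels: softmax over the entailment logits of all labels,
# which yields a single distribution that sums to 1.
labels = list(logits_by_label)
exclusive = dict(zip(labels, softmax([logits_by_label[l][2] for l in labels])))

ranked = sorted(exclusive, key=exclusive.get, reverse=True)
print(ranked)
```

The independent variant suits cases where several labels may apply to the same text, while the exclusive variant suits cases where exactly one label is expected.
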
## Training

This model was pre-trained on a set of 100 languages, as described in
[the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the task of NLI on the concatenated
MNLI train set and the XNLI validation and test sets. Finally, it was trained for one additional epoch on only XNLI
data, with the translations shuffled so that the premise and hypothesis of each example come from the same original
English example but are in different languages.

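The cross-lingual shuffling in the final epoch can be pictured with a small sketch. The data layout, the example sentences, and the helper `make_cross_lingual_pair` are all hypothetical; the exact sampling procedure used for this model is not specified here, so this only illustrates the idea of pairing a premise and a hypothesis in two different languages.

```python
import random

# Hypothetical parallel item: one English-origin NLI example with translations.
parallel_example = {
    "premise": {
        "en": "A man is playing a guitar.",
        "fr": "Un homme joue de la guitare.",
        "es": "Un hombre toca la guitarra.",
    },
    "hypothesis": {
        "en": "A person is making music.",
        "fr": "Une personne fait de la musique.",
        "es": "Una persona hace música.",
    },
    "label": "entailment",
}


def make_cross_lingual_pair(example, rng):
    """Pick two different languages, one for the premise and one for the hypothesis."""
    # Assumes premise and hypothesis are available in the same set of languages.
    premise_lang, hypothesis_lang = rng.sample(sorted(example["premise"]), 2)
    return {
        "premise": example["premise"][premise_lang],
        "hypothesis": example["hypothesis"][hypothesis_lang],
        "premise_lang": premise_lang,
        "hypothesis_lang": hypothesis_lang,
        "label": example["label"],  # the label is unchanged by translation
    }


pair = make_cross_lingual_pair(parallel_example, random.Random(0))
print(pair)
```

Because `rng.sample` draws without replacement, the two languages always differ, which matches the constraint described above.
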