---
language:
- multilingual
- en
- fr
- es
- de
- el
- bg
- ru
- tr
- ar
- vi
- th
- zh
- hi
- sw
- ur
tags:
- text-classification
- pytorch
- tensorflow
datasets:
- multi_nli
- xnli
license: mit
pipeline_tag: zero-shot-classification
widget:
- text: "За кого вы голосуете в 2020 году?"
  candidate_labels: "politique étrangère, Europe, élections, affaires, politique"
  multi_class: true
- text: "لمن تصوت في 2020؟"
  candidate_labels: "السياسة الخارجية, أوروبا, الانتخابات, الأعمال, السياسة"
  multi_class: true
- text: "2020'de kime oy vereceksiniz?"
  candidate_labels: "dış politika, Avrupa, seçimler, ticaret, siyaset"
  multi_class: true
---

# xlm-roberta-large-xnli

## Model Description

This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).

## Intended Usage

This model is intended to be used for zero-shot text classification, especially in languages other than English. It is fine-tuned on XNLI, which is a multilingual NLI dataset. The model can therefore be used with any of the languages in the XNLI corpus:

- English
- French
- Spanish
- German
- Greek
- Bulgarian
- Russian
- Turkish
- Arabic
- Vietnamese
- Thai
- Chinese
- Hindi
- Swahili
- Urdu

Since the base model was pre-trained on 100 different languages, the
model has shown some effectiveness in languages beyond those listed above as
well. See the full list of pre-trained languages in appendix A of the
[XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).

For English-only classification, it is recommended to use
[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).

#### With the zero-shot classification pipeline

The model can be loaded with the `zero-shot-classification` pipeline like so:

```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
```

You can then classify in any of the above languages. You can even pass the labels in one language and the sequence to
classify in another:

```python
# we will classify the Russian translation of, "Who are you voting for in 2020?"
sequence_to_classify = "За кого вы голосуете в 2020 году?"
# we can specify candidate labels in Russian or any other language above:
candidate_labels = ["Europe", "public health", "politics"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['politics', 'Europe', 'public health'],
#  'scores': [0.9048484563827515, 0.05722189322113991, 0.03792969882488251],
#  'sequence': 'За кого вы голосуете в 2020 году?'}
```

The default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it
may be worthwhile to translate the template into that language:

```python
sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# {'labels': ['política', 'Europa', 'salud pública'],
#  'scores': [0.9109585881233215, 0.05954807624220848, 0.029493311420083046],
#  'sequence': '¿A quién vas a votar en 2020?'}
```

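To make the role of the template concrete, the sketch below shows how a sequence and its candidate labels could be expanded into premise/hypothesis pairs before they reach the NLI model. The helper `build_nli_pairs` is a hypothetical name used for illustration and is not part of `transformers`; the pipeline is expected to perform an equivalent step internally.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    if "{}" not in hypothesis_template:
        raise ValueError("hypothesis_template must contain a '{}' placeholder")
    # the sequence is the premise; each label is slotted into the template
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "¿A quién vas a votar en 2020?",
    ["Europa", "salud pública", "política"],
    hypothesis_template="Este ejemplo es {}.",
)
for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)
```

Each pair is then scored by the model as an NLI problem, which is why a template in the same language as the labels tends to read more naturally.
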
#### With manual PyTorch

```python
# pose the sequence as an NLI premise and the label as a hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# example inputs (assumed for illustration); substitute your own
sequence = "За кого вы голосуете в 2020 году?"
label = "politics"

premise = sequence
hypothesis = f'This example is {label}.'

# run through the model fine-tuned on NLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]
```

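The snippet above scores a single label. With several candidate labels, the per-label logits can be combined in two ways: independently per label (entailment vs. contradiction, as above), or jointly with a softmax over the entailment logits of all labels so that the scores sum to 1. The sketch below illustrates both with the standard library only. The function names and logit values are made up for illustration, and the `[contradiction, neutral, entailment]` order is taken from the comment in the code above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_scores(logits_per_label):
    """Score each label independently: softmax over (contradiction, entailment)."""
    return [softmax([row[0], row[2]])[1] for row in logits_per_label]


def single_label_scores(logits_per_label):
    """Score labels jointly: softmax over the entailment logits of all labels."""
    return softmax([row[2] for row in logits_per_label])


# hypothetical logits, one row per candidate label,
# ordered [contradiction, neutral, entailment]
example_logits = [
    [-2.0, 0.5, 3.0],  # label A: strongly entailed
    [1.5, 0.2, -1.0],  # label B: leaning towards contradiction
]
independent = multi_label_scores(example_logits)
joint = single_label_scores(example_logits)
print(independent)
print(joint)
```

Independent scoring suits cases where several labels may apply at once, which is roughly the distinction behind the `multi_class: true` setting in the widget examples; joint scoring suits cases where exactly one label is expected.
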
## Training

This model was pre-trained on a set of 100 languages, as described in
[the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the task of NLI on the concatenated
MNLI train set and the XNLI validation and test sets. Finally, it was trained for one additional epoch on only XNLI
data in which the translations of the premise and hypothesis are shuffled, so that the premise and hypothesis of
each example come from the same original English example but are in different languages.

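The final cross-lingual step can be pictured as follows. This is a minimal sketch under stated assumptions, not the original training code: it assumes each example is available as parallel translations keyed by language code, and both the function name `mix_languages` and the data layout are hypothetical.

```python
import random


def mix_languages(example, rng):
    """Pick the premise and hypothesis of one example from two different languages.

    `example` maps "premise" and "hypothesis" to {language_code: text} dicts
    holding translations of the same original English example.
    """
    languages = sorted(example["premise"])
    if len(languages) < 2:
        raise ValueError("need at least two languages to mix")
    # sample without replacement so the two languages always differ
    premise_lang, hypothesis_lang = rng.sample(languages, 2)
    return {
        "premise": example["premise"][premise_lang],
        "hypothesis": example["hypothesis"][hypothesis_lang],
        "label": example["label"],
        "languages": (premise_lang, hypothesis_lang),
    }


# toy parallel example (hypothetical data, not taken from XNLI)
example = {
    "premise": {
        "en": "A man is playing a guitar.",
        "es": "Un hombre toca la guitarra.",
        "de": "Ein Mann spielt Gitarre.",
    },
    "hypothesis": {
        "en": "A person is making music.",
        "es": "Una persona hace música.",
        "de": "Eine Person macht Musik.",
    },
    "label": "entailment",
}
mixed = mix_languages(example, random.Random(0))
print(mixed)
```

The label is unchanged because both sides still derive from the same English example; only the languages of the premise and hypothesis differ.
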