---
language:
- multilingual
- en
- fr
- es
- de
- el
- bg
- ru
- tr
- ar
- vi
- th
- zh
- hi
- sw
- ur
tags:
- text-classification
- pytorch
- tensorflow
datasets:
- multi_nli
- xnli
license: mit
pipeline_tag: zero-shot-classification
widget:
- text: "За кого вы голосуете в 2020 году?"
  candidate_labels: "politique étrangère, Europe, élections, affaires, politique"
  multi_class: true
- text: "لمن تصوت في 2020؟"
  candidate_labels: "السياسة الخارجية, أوروبا, الانتخابات, الأعمال, السياسة"
  multi_class: true
- text: "2020'de kime oy vereceksiniz?"
  candidate_labels: "dış politika, Avrupa, seçimler, ticaret, siyaset"
  multi_class: true
---

# xlm-roberta-large-xnli

## Model Description

This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).

## Intended Usage

This model is intended to be used for zero-shot text classification, especially in languages other than English. It is fine-tuned on XNLI, which is a multilingual NLI dataset. The model can therefore be used with any of the languages in the XNLI corpus:

- English
- French
- Spanish
- German
- Greek
- Bulgarian
- Russian
- Turkish
- Arabic
- Vietnamese
- Thai
- Chinese
- Hindi
- Swahili
- Urdu

Since the base model was pre-trained on 100 different languages, the
model has shown some effectiveness in languages beyond those listed above as
well. See the full list of pre-trained languages in appendix A of the
[XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).

For English-only classification, it is recommended to use
[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).

#### With the zero-shot classification pipeline

The model can be loaded with the `zero-shot-classification` pipeline like so:

```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
```

You can then classify in any of the above languages. You can even pass the labels in one language and the sequence to
classify in another:

```python
# we will classify the Russian translation of, "Who are you voting for in 2020?"
sequence_to_classify = "За кого вы голосуете в 2020 году?"
# we can specify candidate labels in Russian or any other language above:
candidate_labels = ["Europe", "public health", "politics"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['politics', 'Europe', 'public health'],
#  'scores': [0.9048484563827515, 0.05722189322113991, 0.03792969882488251],
#  'sequence': 'За кого вы голосуете в 2020 году?'}
```

The default hypothesis template is in English: `This example is {}.` If you are working strictly within one language, it
may be worthwhile to translate this to the language you are working with:

```python
sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# {'labels': ['política', 'Europa', 'salud pública'],
#  'scores': [0.9109585881233215, 0.05954807624220848, 0.029493311420083046],
#  'sequence': '¿A quién vas a votar en 2020?'}
```

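To make the role of the hypothesis template more concrete, here is a minimal sketch of how a template and a list of candidate labels can be expanded into premise/hypothesis pairs for an NLI model. The helper `build_nli_pairs` is hypothetical and written only for illustration; it is not part of `transformers`, and the pipeline's internal implementation may differ.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template):
    """Pair the sequence (premise) with one hypothesis per candidate label.

    Hypothetical helper for illustration; not a transformers API.
    """
    if "{}" not in hypothesis_template:
        raise ValueError("hypothesis_template must contain a '{}' placeholder")
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "¿A quién vas a votar en 2020?",
    ["Europa", "salud pública", "política"],
    "Este ejemplo es {}.",
)
for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)
```

Each pair is then scored by the NLI model, and the entailment score of a pair is read as the score of its label.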
#### With manual PyTorch

```python
# pose sequence as an NLI premise and label as a hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# example inputs; substitute your own sequence and candidate label
sequence = 'За кого вы голосуете в 2020 году?'
label = 'politics'

premise = sequence
hypothesis = f'This example is {label}.'

# run through the model fine-tuned on NLI, truncating only the premise if needed
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]
```

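The snippet above scores a single label. With several candidate labels, the per-label logits can be combined in two ways: each label judged independently (entailment vs. contradiction, as above), or all labels competing through a softmax over their entailment logits so that the scores sum to 1. The sketch below illustrates both in plain Python. The function `scores_from_logits` and the logit values are made up for illustration, and the `[contradiction, neutral, entailment]` order is assumed from the snippet above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scores_from_logits(logits_per_label, multi_label=False):
    """Turn one [contradiction, neutral, entailment] triple per label into scores.

    Hypothetical helper for illustration; not a transformers API.
    """
    if multi_label:
        # each label is judged on its own: entailment vs. contradiction
        return [softmax([row[0], row[2]])[1] for row in logits_per_label]
    # labels compete: softmax over the entailment logits of all labels
    return softmax([row[2] for row in logits_per_label])


# made-up logits for three candidate labels, for illustration only
example_logits = [
    [-2.0, 0.5, 3.0],
    [1.5, 0.2, -1.0],
    [0.0, 0.0, 0.0],
]
single = scores_from_logits(example_logits)
multi = scores_from_logits(example_logits, multi_label=True)
print(single)
print(multi)
```

In the independent mode the scores need not sum to 1, which is useful when more than one label can apply to the same sequence.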
## Training

This model was pre-trained on a set of 100 languages, as described in
[the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the task of NLI on the concatenated
MNLI train set and the XNLI validation and test sets. Finally, it was trained for one additional epoch on only XNLI
data in which the translations are shuffled so that the premise and hypothesis of each example come from the same
original English example but are in different languages.

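The cross-lingual shuffling described above could be sketched roughly as follows. This is an illustration, not the original training script: the data layout (a dict mapping a language code to the translated premise and hypothesis), the function `make_cross_lingual_pair`, and the toy sentences are all assumptions.

```python
import random


def make_cross_lingual_pair(translations, label, rng):
    """Take the premise and hypothesis of one example from two different languages.

    Hypothetical helper; `translations` maps language code -> {"premise", "hypothesis"}.
    """
    premise_lang, hypothesis_lang = rng.sample(sorted(translations), 2)
    return {
        "premise": translations[premise_lang]["premise"],
        "hypothesis": translations[hypothesis_lang]["hypothesis"],
        "premise_lang": premise_lang,
        "hypothesis_lang": hypothesis_lang,
        "label": label,
    }


# toy parallel example (all translations of the same English pair)
example = {
    "en": {"premise": "A man is playing a guitar.",
           "hypothesis": "A person is making music."},
    "fr": {"premise": "Un homme joue de la guitare.",
           "hypothesis": "Une personne fait de la musique."},
    "es": {"premise": "Un hombre toca la guitarra.",
           "hypothesis": "Una persona hace música."},
}
rng = random.Random(0)
pair = make_cross_lingual_pair(example, "entailment", rng)
print(pair)
```

Because `rng.sample` draws two distinct language codes, the premise and hypothesis always differ in language while keeping the label of the shared source example.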