---
language: multilingual
widget:
- text: "🤗"
- text: "T'estimo! ❤️"
- text: "I love you!"
- text: "I hate you 🤮"
- text: "Mahal kita!"
- text: "사랑해!"
- text: "난 너가 싫어"
- text: "😍😍😍"
---

# twitter-XLM-roBERTa-base for Sentiment Analysis

This is a multilingual XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for sentiment analysis. The sentiment fine-tuning was done on 8 languages (Arabic, English, French, German, Hindi, Italian, Spanish, and Portuguese), but the model can be used for more languages (see the paper for details).

- Paper: [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250).
- Git Repo: [XLM-T official repository](https://github.com/cardiffnlp/xlm-t).

This model has been integrated into the [TweetNLP library](https://github.com/cardiffnlp/tweetnlp).

## Example Pipeline
```python
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("T'estimo!")
```
```
[{'label': 'Positive', 'score': 0.6600581407546997}]
```

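The pipeline output shown above is a list of dicts, each with a `label` and a `score`. When classifying many texts, it can help to tally the predictions. The following is a minimal sketch, assuming the pipeline is called with a list of texts and returns one such dict per text. The `results` list is hard-coded with made-up values in place of real model output, and `count_confident_labels` and `min_score` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical results in the same shape the pipeline example prints:
# one {'label': ..., 'score': ...} dict per input text. Scores are made up.
results = [
    {"label": "Positive", "score": 0.66},
    {"label": "Negative", "score": 0.91},
    {"label": "Positive", "score": 0.48},
    {"label": "Neutral", "score": 0.73},
]


def count_confident_labels(results, min_score=0.5):
    """Count predicted labels, ignoring predictions scored below min_score."""
    return Counter(r["label"] for r in results if r["score"] >= min_score)


counts = count_confident_labels(results)
print(dict(counts))
```

The threshold of 0.5 is arbitrary here; a suitable value depends on how much uncertainty the downstream use can tolerate.
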
## Full classification example

```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Optional: keep a local copy in a directory named after MODEL.
# The tokenizer is saved too, so that directory can be reloaded on its own.
model.save_pretrained(MODEL)
tokenizer.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output[0] holds the logits; take the first (only) item in the batch
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# from transformers import TFAutoModelForSequenceClassification
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# text = preprocess(text)
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print labels and scores, highest score first
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

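The `preprocess` helper in the example above can be tried on its own, without loading the model. The sketch below copies the function and applies it to a made-up tweet; the tweet text and the `example.com` URL are hypothetical.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical tweet, made up for illustration.
tweet = "@alice loved it 😊 https://example.com/post"
cleaned = preprocess(tweet)
print(cleaned)  # → @user loved it 😊 http
```

Because the function splits on single spaces, a token such as `@alice,` is replaced as a whole, trailing punctuation included, while a lone `@` is left unchanged.
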
Output:

```
1) Positive 0.7673
2) Neutral 0.2015
3) Negative 0.0313
```

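The ranked output comes from two steps: a softmax that turns the model's raw logits into probabilities, and an `argsort` that orders the labels from most to least likely. The sketch below reproduces those steps on made-up logits so they can be inspected without downloading the model. It uses a plain NumPy softmax in place of `scipy.special.softmax`, and both the `logits` values and the `id2label` mapping are assumptions for illustration; the real mapping comes from `config.id2label`.

```python
import numpy as np

# Hypothetical label mapping and logits, made up for illustration;
# in the full example the mapping is read from config.id2label.
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}
logits = np.array([-1.0, 0.5, 2.0])

# Numerically stable softmax: subtract the max before exponentiating.
exp = np.exp(logits - logits.max())
scores = exp / exp.sum()

# Indices sorted from highest to lowest score.
ranking = np.argsort(scores)[::-1]
ranked = [(id2label[int(i)], float(scores[i])) for i in ranking]

for rank, (label, score) in enumerate(ranked, start=1):
    print(f"{rank}) {label} {score:.4f}")
```
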
### Reference
```
@inproceedings{barbieri-etal-2022-xlm,
    title = "{XLM}-{T}: Multilingual Language Models in {T}witter for Sentiment Analysis and Beyond",
    author = "Barbieri, Francesco and
      Espinosa Anke, Luis and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.27",
    pages = "258--266"
}
```