---
language: multilingual
widget:
- text: "🤗"
- text: "T'estimo! ❤️"
- text: "I love you!"
- text: "I hate you 🤮"
- text: "Mahal kita!"
- text: "사랑해!"
- text: "난 너가 싫어"
- text: "😍😍😍"
---
# twitter-XLM-roBERTa-base for Sentiment Analysis

This is a multilingual XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for sentiment analysis. The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Es, Pt), but the model can be used for more languages (see the paper for details).

- Paper: [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250).
- Git Repo: [XLM-T official repository](https://github.com/cardiffnlp/xlm-t).

This model has been integrated into the [TweetNLP library](https://github.com/cardiffnlp/tweetnlp).
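With TweetNLP, the model can be used without calling `transformers` directly. A minimal sketch based on TweetNLP's documented loader (the `multilingual=True` flag is TweetNLP's switch for this XLM-R checkpoint; see the repository above for the current API):

```python
import tweetnlp

# Load the multilingual sentiment model and classify a Catalan example.
model = tweetnlp.load_model('sentiment', multilingual=True)
print(model.sentiment("T'estimo!"))
```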
## Example Pipeline

```python
from transformers import pipeline

model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("T'estimo!")
```
```
[{'label': 'Positive', 'score': 0.6600581407546997}]
```
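Because the checkpoint is multilingual, the same pipeline can score texts in several languages at once. A minimal sketch, not from the original card (the inputs are the widget examples above; outputs are omitted):

```python
from transformers import pipeline

model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

# The pipeline accepts a list of strings, so one call can cover several languages.
texts = ["I love you!", "Mahal kita!", "난 너가 싫어", "😍😍😍"]
for text, result in zip(texts, sentiment_task(texts)):
    print(f"{text!r} -> {result['label']} ({result['score']:.4f})")
```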
## Full classification example

```python
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)  # optional: write a local copy of the weights

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)  # optional: write a local copy of the weights

# text = "Good night 😊"
# text = preprocess(text)
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print labels and scores, ranked from most to least likely
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    label = config.id2label[ranking[i]]
    score = scores[ranking[i]]
    print(f"{i+1}) {label} {np.round(float(score), 4)}")
```

Output:

```
1) Positive 0.7673
2) Neutral 0.2015
3) Negative 0.0313
```
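To get all three class probabilities without the manual softmax above, the pipeline interface can return every label in one call. A minimal sketch, assuming a recent `transformers` version where the text-classification pipeline accepts `top_k`:

```python
from transformers import pipeline

model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

# top_k=None returns every label with its probability, ranked best-first,
# mirroring the Positive/Neutral/Negative ranking printed above.
print(sentiment_task("Good night 😊", top_k=None))
```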
## Reference

```
@inproceedings{barbieri-etal-2022-xlm,
    title = "{XLM}-{T}: Multilingual Language Models in {T}witter for Sentiment Analysis and Beyond",
    author = "Barbieri, Francesco and
      Espinosa Anke, Luis and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.27",
    pages = "258--266"
}
```