---
language: en
widget:
- text: Covid cases are increasing fast!
datasets:
- tweet_eval
license: cc-by-4.0
---

# Twitter-roBERTa-base for Sentiment Analysis - UPDATED (2022)

This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021 and fine-tuned for sentiment analysis on the TweetEval benchmark.
The original Twitter-based RoBERTa model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m), and the original reference paper is [TweetEval](https://github.com/cardiffnlp/tweeteval). This model is suitable for English.

- Reference Paper: [TimeLMs paper](https://arxiv.org/abs/2202.03829).
- Git Repo: [TimeLMs official repository](https://github.com/cardiffnlp/timelms).

<b>Labels</b>:
0 -> Negative;
1 -> Neutral;
2 -> Positive

This sentiment analysis model has been integrated into [TweetNLP](https://github.com/cardiffnlp/tweetnlp). You can access the demo [here](https://tweetnlp.org).

## Example Pipeline
```python
from transformers import pipeline

# Model identifier on the Hugging Face Hub (same as MODEL in the full example)
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("Covid cases are increasing fast!")
```
```
[{'label': 'Negative', 'score': 0.7236}]
```

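The pipeline returns a list with one dictionary per input text, each holding a `label` and a `score`, as in the output above. A minimal sketch of reading that structure follows. It uses the documented output as stand-in data so that it runs without downloading the model, and the helper name `format_prediction` is hypothetical rather than part of `transformers`.

```python
def format_prediction(text, prediction):
    """Render one pipeline result dict as a readable line."""
    return f"{text!r} -> {prediction['label']} ({prediction['score']:.2%})"


# Stand-in for sentiment_task(texts); values copied from the output above.
texts = ["Covid cases are increasing fast!"]
predictions = [{"label": "Negative", "score": 0.7236}]

for text, prediction in zip(texts, predictions):
    print(format_prediction(text, prediction))
```
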
## Full classification example

```python
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax


# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)
text = "Covid cases are increasing fast!"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)
# text = "Covid cases are increasing fast!"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print labels and scores, most probable class first
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

Output:

```
1) Negative 0.7236
2) Neutral 0.2287
3) Positive 0.0477
```

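To illustrate the preprocessing step on its own, the sketch below repeats the `preprocess` function from the full example and applies it to a made-up tweet. The handle and URL are invented for illustration, and no model download is needed.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical tweet; a lone "@" is left untouched because of the len(t) > 1 check.
tweet = "@some_handle look at https://example.com @ noon"
print(preprocess(tweet))  # prints "@user look at http @ noon"
```
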
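The post-processing in the full example (softmax over the raw model scores, then sorting the classes by probability) can also be sketched without NumPy or SciPy. The logits below are invented for illustration and are not real model output. `ID2LABEL`, `softmax`, and `rank_labels` are hypothetical names, with the mapping copied from the label list in this card.

```python
import math

# Illustrative mapping, copied from the label list in this card.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits):
    """Return (label, probability) pairs, most probable first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(ID2LABEL[i], probs[i]) for i in order]


# Made-up logits in class-index order: [Negative, Neutral, Positive].
for rank, (label, prob) in enumerate(rank_labels([2.0, 0.5, -1.0]), start=1):
    print(f"{rank}) {label} {prob:.4f}")
```
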
### References
```
@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title = "{T}weet{NLP}: Cutting-Edge Natural Language Processing for Social Media",
    author = "Camacho-collados, Jose and
      Rezaee, Kiamehr and
      Riahi, Talayeh and
      Ushio, Asahi and
      Loureiro, Daniel and
      Antypas, Dimosthenis and
      Boisson, Joanne and
      Espinosa Anke, Luis and
      Liu, Fangyu and
      Mart{\'\i}nez C{\'a}mara, Eugenio and others",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.5",
    pages = "38--49"
}
```

```
@inproceedings{loureiro-etal-2022-timelms,
    title = "{T}ime{LM}s: Diachronic Language Models from {T}witter",
    author = "Loureiro, Daniel and
      Barbieri, Francesco and
      Neves, Leonardo and
      Espinosa Anke, Luis and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-demo.25",
    doi = "10.18653/v1/2022.acl-demo.25",
    pages = "251--260"
}
```