1	`---`
2	`language: en`
3	`widget:`
4	`- text: Covid cases are increasing fast!`
5	`datasets:`
6	`- tweet_eval`
7	`license: cc-by-4.0`
8	`---`
9
10
# Twitter-roBERTa-base for Sentiment Analysis - UPDATED (2022)

This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021 and fine-tuned for sentiment analysis on the TweetEval benchmark.
The original Twitter-based RoBERTa model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m), and the original reference paper is [TweetEval](https://github.com/cardiffnlp/tweeteval). This model is intended for English text.

- Reference Paper: [TimeLMs paper](https://arxiv.org/abs/2202.03829).
- Git Repo: [TimeLMs official repository](https://github.com/cardiffnlp/timelms).

<b>Labels</b>:
0 -> Negative;
1 -> Neutral;
2 -> Positive

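To make the label mapping concrete, the sketch below turns a vector of three raw scores into a label name. It uses only the standard library. The logits are hypothetical values chosen for illustration, not real model output, and `ID2LABEL`, `softmax`, and `predict_label` are illustrative names, not part of the model's API.

```python
import math

# Label mapping as listed above
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, assumed for illustration only
example_logits = [1.8, 0.6, -1.0]
label, prob = predict_label(example_logits)
print(label, round(prob, 4))
```

Index 0 holds the largest score here, so the predicted label is `Negative`.
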
This sentiment analysis model has been integrated into [TweetNLP](https://github.com/cardiffnlp/tweetnlp). You can access the demo [here](https://tweetnlp.org).

## Example Pipeline
```python
from transformers import pipeline

# Hub ID of this model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("Covid cases are increasing fast!")
```
```
[{'label': 'Negative', 'score': 0.7236}]
```

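The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. As a minimal sketch of consuming that structure, the helper below keeps a label only when its score clears a threshold. `confident_labels` and the `min_score` default are hypothetical choices for illustration, not something this model card recommends, and the input is the sample output shown above rather than a live model call.

```python
def confident_labels(results, min_score=0.5):
    """Return each result's label, or None when its score is below min_score."""
    return [r["label"] if r["score"] >= min_score else None for r in results]


# Sample pipeline output copied from the example above
results = [{"label": "Negative", "score": 0.7236}]

print(confident_labels(results))                  # ['Negative']
print(confident_labels(results, min_score=0.8))   # [None]
```
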
## Full classification example

```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax


# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)
text = "Covid cases are increasing fast!"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF (uncomment the import too; it requires a TensorFlow-enabled install)
# from transformers import TFAutoModelForSequenceClassification
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)
# text = "Covid cases are increasing fast!"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print labels and scores, highest score first
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

Output:

```
1) Negative 0.7236
2) Neutral 0.2287
3) Positive 0.0477
```

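The preprocessing step in the example can be tried on its own, without downloading the model. The sketch below repeats the `preprocess` function and applies it to a made-up tweet. The sample text, including the handle and URL, is hypothetical.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        # A lone '@' is left untouched; any longer token starting with '@' is masked
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical tweet, made up for illustration
sample = "@alice check https://example.com now @"
masked = preprocess(sample)
print(masked)  # @user check http now @
```

The function splits on single spaces and checks prefixes only, so punctuation attached to a mention is dropped along with the handle, and any token that merely starts with `http` is replaced as well.
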
### References
```
@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title = "{T}weet{NLP}: Cutting-Edge Natural Language Processing for Social Media",
    author = "Camacho-collados, Jose and
      Rezaee, Kiamehr and
      Riahi, Talayeh and
      Ushio, Asahi and
      Loureiro, Daniel and
      Antypas, Dimosthenis and
      Boisson, Joanne and
      Espinosa Anke, Luis and
      Liu, Fangyu and
      Mart{\'\i}nez C{\'a}mara, Eugenio and others",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.5",
    pages = "38--49"
}
```

```
@inproceedings{loureiro-etal-2022-timelms,
    title = "{T}ime{LM}s: Diachronic Language Models from {T}witter",
    author = "Loureiro, Daniel and
      Barbieri, Francesco and
      Neves, Leonardo and
      Espinosa Anke, Luis and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-demo.25",
    doi = "10.18653/v1/2022.acl-demo.25",
    pages = "251--260"
}
```