---
language:
- ko
tags:
- classification
license: mit
datasets:
- nsmc
widget:
- text: "불후의 명작입니다! 이렇게 감동적인 내용은 처음이에요"
  example_title: "Positive"
- text: "시간이 정말 아깝습니다. 10점 만점에 1점도 아까워요.."
  example_title: "Negative"
metrics:
- accuracy
- f1
- precision
- recall
---

# Sentiment Binary Classification (fine-tuned from KoELECTRA-Small-v3 on the Naver Sentiment Movie Corpus dataset)

## Usage (Amazon SageMaker inference applicable)
The handler script below follows the SageMaker Inference Toolkit interface (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) as is, so the model can be deployed to a SageMaker endpoint without modification.
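
For reference, a deployment with the SageMaker Python SDK could look roughly like the sketch below. The S3 artifact path, IAM role, framework/Python versions, and instance type are assumptions, not values from this card; the `inference_nsmc.py` script in the next section is packaged as the entry point.

```python
# Sketch only: deploy the handler script below to a real-time SageMaker endpoint.
# model_data, role, framework_version, py_version, and instance_type are assumptions.
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://<your-bucket>/koelectra-nsmc/model.tar.gz",  # assumed artifact location
    role="<your-sagemaker-execution-role>",                       # assumed IAM role
    entry_point="inference_nsmc.py",                              # handler script shown below
    framework_version="1.10",                                     # assumed PyTorch container version
    py_version="py38",                                            # assumed Python version
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumed instance type
)
```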

### inference_nsmc.py

```python
import json
import sys
import logging
import torch
from torch import nn
from transformers import ElectraConfig
from transformers import ElectraModel, AutoTokenizer, ElectraTokenizer, ElectraForSequenceClassification

logging.basicConfig(
    level=logging.INFO,
    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(filename='tmp.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

max_seq_length = 128
classes = ['Neg', 'Pos']

tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/koelectra-small-v3-nsmc")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def model_fn(model_path=None):
    ####
    # If you have your own trained model
    # Huggingface pre-trained model: 'monologg/koelectra-small-v3-discriminator'
    ####
    #config = ElectraConfig.from_json_file(f'{model_path}/config.json')
    #model = ElectraForSequenceClassification.from_pretrained(f'{model_path}/model.pth', config=config)

    # Download model from the Huggingface hub
    model = ElectraForSequenceClassification.from_pretrained('daekeun-ml/koelectra-small-v3-nsmc')
    model.to(device)
    return model


def input_fn(input_data, content_type="application/jsonlines"):
    data_str = input_data.decode("utf-8")
    jsonlines = data_str.split("\n")
    transformed_inputs = []

    for jsonline in jsonlines:
        text = json.loads(jsonline)["text"][0]
        logger.info("input text: {}".format(text))
        encode_plus_token = tokenizer.encode_plus(
            text,
            max_length=max_seq_length,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        transformed_inputs.append(encode_plus_token)

    return transformed_inputs


def predict_fn(transformed_inputs, model):
    predicted_classes = []

    for data in transformed_inputs:
        data = data.to(device)
        output = model(**data)

        softmax_fn = nn.Softmax(dim=1)
        softmax_output = softmax_fn(output[0])
        _, prediction = torch.max(softmax_output, dim=1)

        predicted_class_idx = prediction.item()
        predicted_class = classes[predicted_class_idx]
        score = softmax_output[0][predicted_class_idx]
        logger.info("predicted_class: {}".format(predicted_class))

        prediction_dict = {}
        prediction_dict["predicted_label"] = predicted_class
        prediction_dict["score"] = score.cpu().detach().numpy().tolist()

        jsonline = json.dumps(prediction_dict)
        logger.info("jsonline: {}".format(jsonline))
        predicted_classes.append(jsonline)

    predicted_classes_jsonlines = "\n".join(predicted_classes)
    return predicted_classes_jsonlines


def output_fn(outputs, accept="application/jsonlines"):
    return outputs, accept
```

### test.py
```python
>>> from inference_nsmc import model_fn, input_fn, predict_fn, output_fn
>>> with open('samples/nsmc.txt', mode='rb') as file:
...     model_input_data = file.read()
>>> model = model_fn()
>>> transformed_inputs = input_fn(model_input_data)
>>> predicted_classes_jsonlines = predict_fn(transformed_inputs, model)
>>> model_outputs = output_fn(predicted_classes_jsonlines)
>>> print(model_outputs[0])

[{inference_nsmc.py:47} INFO - input text: 이 영화는 최고의 영화입니다
[{inference_nsmc.py:47} INFO - input text: 최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다
[{inference_nsmc.py:77} INFO - predicted_class: Pos
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Pos", "score": 0.9619030952453613}
[{inference_nsmc.py:77} INFO - predicted_class: Neg
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Neg", "score": 0.9994170665740967}
{"predicted_label": "Pos", "score": 0.9619030952453613}
{"predicted_label": "Neg", "score": 0.9994170665740967}
```
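
### Quick check with the transformers pipeline
For a quick local check without the SageMaker handler code, the checkpoint can also be loaded through the standard `transformers` pipeline. This is a minimal sketch; the returned label names depend on the `id2label` mapping in the model config and may appear as `LABEL_0`/`LABEL_1`, which correspond to `Neg`/`Pos` in the `classes` list above.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint directly from the Hugging Face Hub.
classifier = pipeline("text-classification", model="daekeun-ml/koelectra-small-v3-nsmc")

print(classifier("이 영화는 최고의 영화입니다"))
# e.g. [{'label': 'LABEL_1', 'score': 0.96...}]  (label names depend on the model config)
```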

### Sample data (samples/nsmc.txt)
```
{"text": ["이 영화는 최고의 영화입니다"]}
{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}
```
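
### Invoking a deployed endpoint (sketch)
Once deployed to a SageMaker endpoint, the same JSON Lines payload can be sent with `boto3`. This is a sketch; the endpoint name is an assumption.

```python
import boto3

# Sketch: send the sample JSON Lines payload to a deployed endpoint.
runtime = boto3.client("sagemaker-runtime")

payload = "\n".join([
    '{"text": ["이 영화는 최고의 영화입니다"]}',
    '{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}',
])

response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",   # assumed endpoint name
    ContentType="application/jsonlines",
    Accept="application/jsonlines",
    Body=payload.encode("utf-8"),
)
print(response["Body"].read().decode("utf-8"))
# {"predicted_label": "Pos", "score": ...}
# {"predicted_label": "Neg", "score": ...}
```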

## References
- KoELECTRA: https://github.com/monologg/KoELECTRA
- Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc