---
language:
- ko
tags:
- classification
license: mit
datasets:
- nsmc
widget:
- text: "불후의 명작입니다! 이렇게 감동적인 내용은 처음이에요"
  example_title: "Positive"
- text: "시간이 정말 아깝습니다. 10점 만점에 1점도 아까워요.."
  example_title: "Negative"
metrics:
- accuracy
- f1
- precision
- recall
---
20
# Sentiment Binary Classification (fine-tuning with KoELECTRA-Small-v3 model and Naver Sentiment Movie Corpus dataset)
22
## Usage (Amazon SageMaker inference applicable)
This model implements the SageMaker Inference Toolkit handler interface (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) as is, so it can be deployed to a SageMaker endpoint without modification.
25
### inference_nsmc.py

```python
import json
import sys
import logging
import torch
from torch import nn
from transformers import AutoTokenizer, ElectraConfig, ElectraForSequenceClassification

logging.basicConfig(
    level=logging.INFO,
    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(filename='tmp.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

max_seq_length = 128
classes = ['Neg', 'Pos']

tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/koelectra-small-v3-nsmc")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def model_fn(model_path=None):
    ####
    # If you have your own fine-tuned model, load it from model_path instead.
    # (The base pre-trained model is 'monologg/koelectra-small-v3-discriminator'.)
    ####
    #config = ElectraConfig.from_json_file(f'{model_path}/config.json')
    #model = ElectraForSequenceClassification.from_pretrained(f'{model_path}/model.pth', config=config)

    # Download the fine-tuned model from the Hugging Face Hub
    model = ElectraForSequenceClassification.from_pretrained('daekeun-ml/koelectra-small-v3-nsmc')
    model.to(device)
    return model


def input_fn(input_data, content_type="application/jsonlines"):
    data_str = input_data.decode("utf-8")
    jsonlines = data_str.split("\n")
    transformed_inputs = []

    for jsonline in jsonlines:
        if not jsonline.strip():  # skip blank lines (e.g., a trailing newline)
            continue
        text = json.loads(jsonline)["text"][0]
        logger.info("input text: {}".format(text))
        encode_plus_token = tokenizer.encode_plus(
            text,
            max_length=max_seq_length,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        transformed_inputs.append(encode_plus_token)

    return transformed_inputs


def predict_fn(transformed_inputs, model):
    predicted_classes = []

    for data in transformed_inputs:
        data = data.to(device)
        with torch.no_grad():  # inference only; no gradients needed
            output = model(**data)

        softmax_fn = nn.Softmax(dim=1)
        softmax_output = softmax_fn(output[0])
        _, prediction = torch.max(softmax_output, dim=1)

        predicted_class_idx = prediction.item()
        predicted_class = classes[predicted_class_idx]
        score = softmax_output[0][predicted_class_idx]
        logger.info("predicted_class: {}".format(predicted_class))

        prediction_dict = {
            "predicted_label": predicted_class,
            "score": score.cpu().detach().numpy().tolist()
        }

        jsonline = json.dumps(prediction_dict)
        logger.info("jsonline: {}".format(jsonline))
        predicted_classes.append(jsonline)

    return "\n".join(predicted_classes)


def output_fn(outputs, accept="application/jsonlines"):
    return outputs, accept
```
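To make the scoring step in `predict_fn` concrete, here is a minimal, framework-free sketch of the same softmax-and-argmax logic. The logit values are made up for illustration; the real model produces one logit per class (`Neg`, `Pos`):

```python
import math

classes = ["Neg", "Pos"]

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]  # hypothetical model output for one sentence
probs = softmax(logits)
idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(classes[idx], round(probs[idx], 4))  # → Pos 0.9707
```

The predicted label is the argmax of the probabilities, and the reported `score` is the probability of that winning class, exactly as in `predict_fn`.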
122
### test.py
```python
>>> from inference_nsmc import model_fn, input_fn, predict_fn, output_fn
>>> with open('samples/nsmc.txt', mode='rb') as file:
...     model_input_data = file.read()
>>> model = model_fn()
>>> transformed_inputs = input_fn(model_input_data)
>>> predicted_classes_jsonlines = predict_fn(transformed_inputs, model)
>>> model_outputs = output_fn(predicted_classes_jsonlines)
>>> print(model_outputs[0])

[{inference_nsmc.py:47} INFO - input text: 이 영화는 최고의 영화입니다
[{inference_nsmc.py:47} INFO - input text: 최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다
[{inference_nsmc.py:77} INFO - predicted_class: Pos
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Pos", "score": 0.9619030952453613}
[{inference_nsmc.py:77} INFO - predicted_class: Neg
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Neg", "score": 0.9994170665740967}
{"predicted_label": "Pos", "score": 0.9619030952453613}
{"predicted_label": "Neg", "score": 0.9994170665740967}
```
143
### Sample data (samples/nsmc.txt)
```
{"text": ["이 영화는 최고의 영화입니다"]}
{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}
```
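The request body is plain JSON Lines: one JSON object per line, each with a single-element `text` list, which is exactly what `input_fn` parses. A small sketch of building such a payload programmatically (using the two sample sentences above):

```python
import json

texts = [
    "이 영화는 최고의 영화입니다",
    "최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다",
]

# One JSON object per line; ensure_ascii=False keeps the Korean text readable
payload = "\n".join(json.dumps({"text": [t]}, ensure_ascii=False) for t in texts)
body = payload.encode("utf-8")  # input_fn expects bytes and decodes as UTF-8
print(payload)
```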
149
## References
- KoELECTRA: https://github.com/monologg/KoELECTRA
- Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc