---
language:
- ko
tags:
- classification
license: mit
datasets:
- nsmc
widget:
- text: "불후의 명작입니다! 이렇게 감동적인 내용은 처음이에요"
  example_title: "Positive"
- text: "시간이 정말 아깝습니다. 10점 만점에 1점도 아까워요.."
  example_title: "Negative"
metrics:
- accuracy
- f1
- precision
- recall
---

# Sentiment Binary Classification (fine-tuned from KoELECTRA-Small-v3 on the Naver Sentiment Movie Corpus dataset)

## Usage (Amazon SageMaker inference applicable)
The handler script below follows the SageMaker Inference Toolkit interface (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) as is, so the model can be deployed to a SageMaker endpoint without modification.
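
For reference, a deployment with the SageMaker Python SDK could look roughly like the sketch below. The S3 artifact path, IAM role, framework/Python versions, and instance type are assumptions, not values from this card; the `inference_nsmc.py` script in the next section is packaged as the entry point.

```python
# Sketch only: deploy the handler script below to a real-time SageMaker endpoint.
# model_data, role, framework_version, py_version, and instance_type are assumptions.
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://<your-bucket>/koelectra-nsmc/model.tar.gz",  # assumed artifact location
    role="<your-sagemaker-execution-role>",                       # assumed IAM role
    entry_point="inference_nsmc.py",                              # handler script shown below
    framework_version="1.10",                                     # assumed PyTorch container version
    py_version="py38",                                            # assumed Python version
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumed instance type
)
```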

### inference_nsmc.py

```python
import json
import sys
import logging
import torch
from torch import nn
from transformers import ElectraConfig
from transformers import ElectraModel, AutoTokenizer, ElectraTokenizer, ElectraForSequenceClassification

logging.basicConfig(
    level=logging.INFO,
    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(filename='tmp.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

max_seq_length = 128
classes = ['Neg', 'Pos']

tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/koelectra-small-v3-nsmc")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def model_fn(model_path=None):
    ####
    # If you have your own trained model
    # Huggingface pre-trained model: 'monologg/koelectra-small-v3-discriminator'
    ####
    #config = ElectraConfig.from_json_file(f'{model_path}/config.json')
    #model = ElectraForSequenceClassification.from_pretrained(f'{model_path}/model.pth', config=config)

    # Download model from the Huggingface hub
    model = ElectraForSequenceClassification.from_pretrained('daekeun-ml/koelectra-small-v3-nsmc')
    model.to(device)
    return model


def input_fn(input_data, content_type="application/jsonlines"):
    data_str = input_data.decode("utf-8")
    jsonlines = data_str.split("\n")
    transformed_inputs = []

    for jsonline in jsonlines:
        text = json.loads(jsonline)["text"][0]
        logger.info("input text: {}".format(text))
        encode_plus_token = tokenizer.encode_plus(
            text,
            max_length=max_seq_length,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        transformed_inputs.append(encode_plus_token)

    return transformed_inputs


def predict_fn(transformed_inputs, model):
    predicted_classes = []

    for data in transformed_inputs:
        data = data.to(device)
        output = model(**data)

        softmax_fn = nn.Softmax(dim=1)
        softmax_output = softmax_fn(output[0])
        _, prediction = torch.max(softmax_output, dim=1)

        predicted_class_idx = prediction.item()
        predicted_class = classes[predicted_class_idx]
        score = softmax_output[0][predicted_class_idx]
        logger.info("predicted_class: {}".format(predicted_class))

        prediction_dict = {}
        prediction_dict["predicted_label"] = predicted_class
        prediction_dict["score"] = score.cpu().detach().numpy().tolist()

        jsonline = json.dumps(prediction_dict)
        logger.info("jsonline: {}".format(jsonline))
        predicted_classes.append(jsonline)

    predicted_classes_jsonlines = "\n".join(predicted_classes)
    return predicted_classes_jsonlines


def output_fn(outputs, accept="application/jsonlines"):
    return outputs, accept
```

### test.py
```python
>>> from inference_nsmc import model_fn, input_fn, predict_fn, output_fn
>>> with open('samples/nsmc.txt', mode='rb') as file:
...     model_input_data = file.read()
>>> model = model_fn()
>>> transformed_inputs = input_fn(model_input_data)
>>> predicted_classes_jsonlines = predict_fn(transformed_inputs, model)
>>> model_outputs = output_fn(predicted_classes_jsonlines)
>>> print(model_outputs[0])

[{inference_nsmc.py:47} INFO - input text: 이 영화는 최고의 영화입니다
[{inference_nsmc.py:47} INFO - input text: 최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다
[{inference_nsmc.py:77} INFO - predicted_class: Pos
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Pos", "score": 0.9619030952453613}
[{inference_nsmc.py:77} INFO - predicted_class: Neg
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Neg", "score": 0.9994170665740967}
{"predicted_label": "Pos", "score": 0.9619030952453613}
{"predicted_label": "Neg", "score": 0.9994170665740967}
```
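
### Quick check with the transformers pipeline
For a quick local check without the SageMaker handler code, the checkpoint can also be loaded through the standard `transformers` pipeline. This is a minimal sketch; the returned label names depend on the `id2label` mapping in the model config and may appear as `LABEL_0`/`LABEL_1`, which correspond to `Neg`/`Pos` in the `classes` list above.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint directly from the Hugging Face Hub.
classifier = pipeline("text-classification", model="daekeun-ml/koelectra-small-v3-nsmc")

print(classifier("이 영화는 최고의 영화입니다"))
# e.g. [{'label': 'LABEL_1', 'score': 0.96...}]  (label names depend on the model config)
```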

### Sample data (samples/nsmc.txt)
```
{"text": ["이 영화는 최고의 영화입니다"]}
{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}
```
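
### Invoking a deployed endpoint (sketch)
Once deployed to a SageMaker endpoint, the same JSON Lines payload can be sent with `boto3`. This is a sketch; the endpoint name is an assumption.

```python
import boto3

# Sketch: send the sample JSON Lines payload to a deployed endpoint.
runtime = boto3.client("sagemaker-runtime")

payload = "\n".join([
    '{"text": ["이 영화는 최고의 영화입니다"]}',
    '{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}',
])

response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",   # assumed endpoint name
    ContentType="application/jsonlines",
    Accept="application/jsonlines",
    Body=payload.encode("utf-8"),
)
print(response["Body"].read().decode("utf-8"))
# {"predicted_label": "Pos", "score": ...}
# {"predicted_label": "Neg", "score": ...}
```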

## References
- KoELECTRA: https://github.com/monologg/KoELECTRA
- Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc