---
language:
- vi
- vn
- en
tags:
- question-answering
- pytorch
datasets:
- squad
license: cc-by-nc-4.0
pipeline_tag: question-answering
metrics:
- squad
widget:
- text: "Bình là chuyên gia về gì ?"
  context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
- text: "Bình được công nhận với danh hiệu gì ?"
  context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
---

## Model Description

- Language model: [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)
- Fine-tune: [MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc)
- Language: Vietnamese, English
- Downstream-task: Extractive QA
- Dataset (combined English and Vietnamese):
  - [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)
  - [mailong25](https://github.com/mailong25/bert-vietnamese-question-answering/tree/master/dataset)
  - [VLSP MRC 2021](https://vlsp.org.vn/vlsp2021/eval/mrc)
  - [MultiLingual Question Answering](https://github.com/facebookresearch/MLQA)

This model is intended for question answering in Vietnamese, so the validation set is Vietnamese only (English also works fine). The evaluation results below were obtained on the VLSP MRC 2021 test set, where this experiment ranked first on the leaderboard.

| Model | EM | F1 |
| ------------- | ------------- | ------------- |
| [large](https://huggingface.co/nguyenvulebinh/vi-mrc-large) public_test_set | 85.847 | 83.826 |
| [large](https://huggingface.co/nguyenvulebinh/vi-mrc-large) private_test_set | 82.072 | 78.071 |
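
EM (exact match) and F1 in the table are the usual SQuAD-style metrics. As a rough illustration of what they measure for a single prediction, here is a minimal sketch. It assumes lowercasing and whitespace tokenization only; the official evaluation scripts apply their own normalization, and the function names `normalize`, `exact_match`, and `f1_score` are illustrative, not taken from the project.

```python
from collections import Counter


def normalize(text):
    # Assumption: lowercase and collapse whitespace only.
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    # 1.0 when the normalized strings are identical, else 0.0.
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Google Developer Expert", "google developer expert"))
print(f1_score("chứng chỉ Google Developer Expert", "Google Developer Expert"))
```

In the second call the prediction contains the whole reference plus two extra tokens, so recall is 1.0, precision is 3/5, and F1 comes out at about 0.75 while EM would be 0. Corpus-level scores are averages of these per-question values.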

| Public leaderboard | Private leaderboard |
|:-------------------------:|:-------------------------:|
| ![Public leaderboard](https://i.ibb.co/tJX6V6T/public-leaderboard.jpg) | ![Private leaderboard](https://i.ibb.co/nmsX2pG/private-leaderboard.jpg) |

[MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc) uses [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html) as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, I recombine the sub-word representations (after they are encoded by the BERT layer) into word representations using a sum strategy.
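
The sum strategy can be pictured with a small sketch. This is not the repository's code: it uses plain Python lists instead of batched tensors, and the function name `sum_subwords_to_words` and the `word_ids` alignment format (one word index per sub-word) are assumptions made for illustration.

```python
def sum_subwords_to_words(subword_vectors, word_ids):
    """Sum the vectors of all sub-words that belong to the same word.

    subword_vectors: list of equal-length lists of floats, one per sub-word.
    word_ids: word_ids[i] is the index of the word containing sub-word i
              (hypothetical alignment format).
    """
    if len(subword_vectors) != len(word_ids):
        raise ValueError("need exactly one word id per sub-word vector")
    if not subword_vectors:
        return []
    num_words = max(word_ids) + 1
    dim = len(subword_vectors[0])
    word_vectors = [[0.0] * dim for _ in range(num_words)]
    for vector, word_id in zip(subword_vectors, word_ids):
        for k, value in enumerate(vector):
            word_vectors[word_id][k] += value
    return word_vectors


# Three sub-words: the first two form word 0, the third is word 1.
subwords = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sum_subwords_to_words(subwords, [0, 0, 1]))
# [[4.0, 6.0], [5.0, 6.0]]
```

After this step the sequence has one vector per word rather than one per sub-word, so the answer start and end can be predicted at word positions.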

## Using pre-trained model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Yqgdfaca7L94OyQVnq5iQq8wRTFvVZjv?usp=sharing)

- Hugging Face pipeline style (**NOT using sum features strategy**).

```python
from transformers import pipeline

# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)
QA_input = {
  'question': "Bình là chuyên gia về gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}
res = nlp(QA_input)
print('pipeline: {}'.format(res))
# {'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
```

- More accurate inference process ([**Using sum features strategy**](https://github.com/nguyenvulebinh/extractive-qa-mrc))

```python
# `infer` and `model.mrc_model` are modules from the extractive-qa-mrc
# repository linked above, so run this from a checkout of that repository.
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = "nguyenvulebinh/vi-mrc-large"
#model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
  'question': "Bình được công nhận với danh hiệu gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

# The helper calls are shown as published; if they fail, check the
# signatures in the repository's infer.py for your checkout.
inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)

print(answer)
# answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013
```
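
The start and end scores in the output above reflect how extractive QA works in general: the model scores every position as a possible answer start and as a possible answer end, and the answer is the best-scoring valid span. The sketch below shows that generic decoding step on made-up scores. It is not the repository's `extract_answer`; the name `best_span`, the `max_answer_len` parameter, and the score values are hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_answer_len, or None for empty input."""
    if len(start_scores) != len(end_scores):
        raise ValueError("start and end scores must have the same length")
    best = None
    for i, start_score in enumerate(start_scores):
        last = min(len(end_scores), i + max_answer_len)
        for j in range(i, last):
            score = start_score + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


# Made-up per-word scores for the second sentence of the example context.
words = ["Anh", "nhận", "chứng", "chỉ", "Google", "Developer", "Expert", "năm", "2020"]
start = [0.0, 0.1, 0.2, 0.1, 5.0, 0.3, 0.1, 0.0, 0.0]
end = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 6.0, 0.1, 0.2]

i, j, _ = best_span(start, end)
print(" ".join(words[i:j + 1]))
# Google Developer Expert
```

The `i <= j` constraint rules out spans that end before they start, and the length cap keeps the search from returning implausibly long answers.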

## About

*Built by Binh Nguyen*

[![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)

For more details, visit the project repository.

[![GitHub stars](https://img.shields.io/github/stars/nguyenvulebinh/extractive-qa-mrc?style=social)](https://github.com/nguyenvulebinh/extractive-qa-mrc)