---
language:
- en
- de
- fr
- it
- multilingual
tags:
- punctuation prediction
- punctuation
datasets: wmt/europarl
license: mit
widget:
- text: "Ho sentito che ti sei laureata il che mi fa molto piacere"
  example_title: "Italian"
- text: "Tous les matins vers quatre heures mon père ouvrait la porte de ma chambre"
  example_title: "French"
- text: "Ist das eine Frage Frau Müller"
  example_title: "German"
- text: "Yet she blushed as if with guilt when Cynthia reading her thoughts said to her one day Molly you're very glad to get rid of us are not you"
  example_title: "English"
metrics:
- f1
---

This model predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

This multilingual model was trained on the [Europarl Dataset](https://huggingface.co/datasets/wmt/europarl) provided by the [SEPP-NLG Shared Task](https://sites.google.com/view/sentence-segmentation). *Please note that this dataset consists of political speeches. Therefore, the model might perform differently on texts from other domains.*

The model restores the following punctuation markers: **"." "," "?" "-" ":"**

## Sample Code

We provide a simple Python package that allows you to process text of any length.
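
Transformer models accept only a limited number of tokens per pass, so long inputs are typically processed in overlapping windows. The sketch below illustrates that general idea in plain Python. The function name `split_into_chunks` and the `chunk_size` and `overlap` values are assumptions made for illustration; the package's actual strategy may differ.

```python
def split_into_chunks(words, chunk_size=8, overlap=2):
    """Split a list of words into overlapping windows.

    Illustrative only: chunk_size and overlap are assumed values,
    not the settings used by the package.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = "my name is clara and i live in berkeley california".split()
for chunk in split_into_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap gives each window some context from its neighbour, which is useful because punctuation at a window boundary depends on the words that follow.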

## Install

To get started, install the package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):

```bash
pip install deepmultilingualpunctuation
```

### Restore Punctuation

```python
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
```

**output**

> My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

### Predict Labels

```python
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
```

**output**

> [['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
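
Each entry in the output above is a `[word, label, score]` triple, where the label `'0'` means that no punctuation follows the word. As a minimal sketch of how such a list could be turned back into text, the hypothetical helper `join_labeled_words` below appends each predicted marker to its word. The `min_score` threshold is an assumption added for illustration and is not part of the package.

```python
# The [word, label, score] triples from the output shown above.
labeled_words = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665], ["is", "0", 0.9998579],
    ["Clara", "0", 0.6752215], ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515], ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047], ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655], ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]


def join_labeled_words(labeled_words, min_score=0.0):
    """Append each predicted marker to its word; label "0" means no marker.

    min_score is a hypothetical threshold: markers predicted with a
    lower score are dropped.
    """
    parts = []
    for word, label, score in labeled_words:
        if label != "0" and score >= min_score:
            parts.append(word + label)
        else:
            parts.append(word)
    return " ".join(parts)


print(join_labeled_words(labeled_words))
# My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
```

Raising `min_score` keeps only the markers the model is most certain about, which may be preferable when a wrong marker is worse than a missing one.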

## Results

Performance differs between punctuation markers because hyphens and colons are, in many cases, optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

| Label         | EN    | DE    | FR    | IT    |
| ------------- | ----- | ----- | ----- | ----- |
| 0             | 0.991 | 0.997 | 0.992 | 0.989 |
| .             | 0.948 | 0.961 | 0.945 | 0.942 |
| ?             | 0.890 | 0.893 | 0.871 | 0.832 |
| ,             | 0.819 | 0.945 | 0.831 | 0.798 |
| :             | 0.575 | 0.652 | 0.620 | 0.588 |
| -             | 0.425 | 0.435 | 0.431 | 0.421 |
| macro average | 0.775 | 0.814 | 0.782 | 0.762 |
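
The macro average row is the unweighted mean of the six per-label F1 scores, so rare labels such as `-` and `:` count as much as the frequent `0` label. The short check below recomputes that row from the values in the table; the variable names are illustrative.

```python
# Per-label F1 scores copied from the table, in the order: 0 . ? , : -
f1_scores = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}

# Macro average: plain mean over labels, ignoring how often each label occurs.
macro_f1 = {
    lang: round(sum(scores) / len(scores), 3)
    for lang, scores in f1_scores.items()
}
print(macro_f1)
# {'EN': 0.775, 'DE': 0.814, 'FR': 0.782, 'IT': 0.762}
```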

## Languages

### Models

| Languages                                  | Model                                                        |
| ------------------------------------------ | ------------------------------------------------------------ |
| English, Italian, French and German        | [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large) |
| English, Italian, French, German and Dutch | [oliverguhr/fullstop-punctuation-multilingual-sonar-base](https://huggingface.co/oliverguhr/fullstop-punctuation-multilingual-sonar-base) |
| Dutch                                      | [oliverguhr/fullstop-dutch-sonar-punctuation-prediction](https://huggingface.co/oliverguhr/fullstop-dutch-sonar-punctuation-prediction) |

### Community Models

| Languages                                  | Model                                                        |
| ------------------------------------------ | ------------------------------------------------------------ |
| English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian | [kredor/punctuate-all](https://huggingface.co/kredor/punctuate-all) |
| Catalan                                    | [softcatala/fullstop-catalan-punctuation-prediction](https://huggingface.co/softcatala/fullstop-catalan-punctuation-prediction) |
| Welsh                                      | [techiaith/fullstop-welsh-punctuation-prediction](https://huggingface.co/techiaith/fullstop-welsh-punctuation-prediction) |

You can use different models by setting the model parameter:

```python
model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
```
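
When an application handles several languages, the model name can be looked up before the model is created. The following is a minimal sketch: the `MODEL_BY_LANGUAGE` mapping and the `model_for_language` helper are hypothetical names, the keys are assumed to be ISO 639-1 language codes, and the model names are taken from the tables above.

```python
# Model names come from the tables above; the mapping itself is illustrative.
MODEL_BY_LANGUAGE = {
    "en": "oliverguhr/fullstop-punctuation-multilang-large",
    "de": "oliverguhr/fullstop-punctuation-multilang-large",
    "fr": "oliverguhr/fullstop-punctuation-multilang-large",
    "it": "oliverguhr/fullstop-punctuation-multilang-large",
    "nl": "oliverguhr/fullstop-dutch-sonar-punctuation-prediction",
    "ca": "softcatala/fullstop-catalan-punctuation-prediction",
    "cy": "techiaith/fullstop-welsh-punctuation-prediction",
}


def model_for_language(code):
    """Return the model name listed for a language code (hypothetical helper)."""
    try:
        return MODEL_BY_LANGUAGE[code.lower()]
    except KeyError:
        raise ValueError(f"no model listed for language {code!r}") from None


print(model_for_language("nl"))
# oliverguhr/fullstop-dutch-sonar-punctuation-prediction
```

The returned name can then be passed to the constructor, for example `PunctuationModel(model=model_for_language("nl"))`.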

## Where do I find the code and can I train my own model?

Yes, you can! For the complete code of the research project, take a look at [this repository](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction).

There is also a guide on [how to fine-tune this model for your data or language](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction/blob/main/other_languages/readme.md).

## References

```
@article{guhr-EtAl:2021:fullstop,
  title     = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month     = {June},
  year      = {2021},
  address   = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
```