---
tags:
- translation
license: cc-by-4.0
---

### opus-mt-en-de

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)

## Model Details
Model Description:
- Developed by: Language Technology Research Group at the University of Helsinki
- Model Type: Translation
- Language(s):
  - Source Language: English
  - Target Language: German
- License: CC-BY-4.0
- Resources for more information:
  - [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)

## Uses

#### Direct Use

This model can be used for translation and text-to-text generation.

## Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

Further details about the dataset for this model can be found in the OPUS readme: [en-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-de/README.md)

## Training

#### Training Data
##### Preprocessing
* pre-processing: normalization + SentencePiece
* dataset: [opus](https://github.com/Helsinki-NLP/Opus-MT)
* download original weights: [opus-2020-02-26.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-de/opus-2020-02-26.zip)
* test set translations: [opus-2020-02-26.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-de/opus-2020-02-26.test.txt)

## Evaluation

#### Results

* test set scores: [opus-2020-02-26.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-de/opus-2020-02-26.eval.txt)

#### Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.en.de | 23.5 | 0.540 |
| news-test2008.en.de | 23.5 | 0.529 |
| newstest2009.en.de | 22.3 | 0.530 |
| newstest2010.en.de | 24.9 | 0.544 |
| newstest2011.en.de | 22.5 | 0.524 |
| newstest2012.en.de | 23.0 | 0.525 |
| newstest2013.en.de | 26.9 | 0.553 |
| newstest2015-ende.en.de | 31.1 | 0.594 |
| newstest2016-ende.en.de | 37.0 | 0.636 |
| newstest2017-ende.en.de | 29.9 | 0.586 |
| newstest2018-ende.en.de | 45.2 | 0.690 |
| newstest2019-ende.en.de | 40.9 | 0.654 |
| Tatoeba.en.de | 47.3 | 0.664 |
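The chr-F column reports a character n-gram F-score on a 0 to 1 scale. To give a feel for what that number measures, here is a simplified, illustrative sketch in plain Python. It is not the scoring script that produced the table above: the function names are hypothetical, and the n-gram order of 6 and `beta` of 2 are assumed because they are commonly used chrF settings, not because this card states them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-1 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta = 2, recall is weighted more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

An exact match scores 1.0 and two strings with no characters in common score 0.0. For reproducing published scores, an established evaluation tool should be used instead of this sketch.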

## Citation Information

```bibtex
@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} -- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}
```

## How to Get Started With the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the translation model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
```
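Loading the tokenizer and model does not yet translate anything. The following is a minimal sketch of one way to run a translation with them, assuming PyTorch and `sentencepiece` are installed alongside `transformers`. The helper names `translate` and `load_and_translate` and the `max_new_tokens=128` limit are illustrative choices, not part of this model card.

```python
def translate(texts, tokenizer, model, max_new_tokens=128):
    """Translate a list of English sentences into German."""
    # Tokenize the inputs as one padded batch of PyTorch tensors.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Generate target-language token IDs with the model's default decoding settings.
    generated = model.generate(**batch, max_new_tokens=max_new_tokens)
    # Convert token IDs back to text, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def load_and_translate(texts, model_name="Helsinki-NLP/opus-mt-en-de"):
    """Load the checkpoint from the Hub, then translate `texts`."""
    # Imported lazily so that defining these helpers does not require transformers.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return translate(texts, tokenizer, model)
```

Calling `load_and_translate(["How are you today?"])` downloads the checkpoint on first use and returns a list with one German string per input sentence. Passing an already loaded `tokenizer` and `model` to `translate` avoids reloading them for every call.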