---
language:
- multilingual
- ar
- bg
- ca
- cs
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- hu
- hy
- id
- it
- ja
- ka
- ko
- ku
- lt
- lv
- mk
- mn
- mr
- ms
- my
- nb
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- th
- tr
- uk
- ur
- vi
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language_bcp47:
- fr-ca
- pt-br
- zh-cn
- zh-tw
pipeline_tag: sentence-similarity
---

# sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(sentences)  # one 384-dimensional vector per sentence
print(embeddings)
```

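Embeddings like these are usually compared with cosine similarity, which is what makes them useful for semantic search and clustering. The following is a minimal, dependency-free sketch of that comparison. The helper names `cosine_similarity` and `rank_by_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the library; in practice you would pass rows of the `embeddings` array instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, best match first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings (real ones have 384 dimensions).
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
print(ranking)  # corpus index 1 ranks first: same direction as the query
```

Cosine similarity ignores vector length, so `[2.0, 0.0, 0.0]` scores as a perfect match for the query here. For tensors, sentence-transformers also provides `util.cos_sim`, which computes the same quantity in batch.
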
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

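To see why the attention mask matters in `mean_pooling`, the sketch below repeats the same arithmetic on plain Python lists for a single sentence. It is an illustration only: the function name `masked_mean` and the toy 2-dimensional token vectors are assumptions made for this example, whereas the real model produces 384-dimensional token embeddings as tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for d in range(dim):
            totals[d] += vector[d] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]: the padding position is ignored
print(naive)   # the padding vector drags the unmasked average far away
```

With `padding=True`, shorter sentences in a batch are filled with padding tokens, so averaging without the mask would mix those positions into the sentence embedding.
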
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

This model was trained by [sentence-transformers](https://www.sbert.net/).

If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```