---
language:
- multilingual
- ar
- bg
- ca
- cs
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- hu
- hy
- id
- it
- ja
- ka
- ko
- ku
- lt
- lv
- mk
- mn
- mr
- ms
- my
- nb
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- th
- tr
- uk
- ur
- vi
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- text-embeddings-inference
language_bcp47:
- fr-ca
- pt-br
- zh-cn
- zh-tw
pipeline_tag: sentence-similarity
---

# sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.


## Usage (Sentence-Transformers)

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
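
Embeddings from this model are typically compared with cosine similarity for semantic search or clustering. As a minimal sketch (not part of the original card) using the `util.cos_sim` helper that ships with sentence-transformers, a cross-lingual paraphrase pair should score high:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# An English/German paraphrase pair; the model is multilingual, so translations
# of the same sentence should land close together in the embedding space
embeddings = model.encode(["This is an example sentence", "Dies ist ein Beispielsatz"])

# Cosine similarity lies in [-1, 1]; paraphrases typically score far above unrelated pairs
print(util.cos_sim(embeddings[0], embeddings[1]))
```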


## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
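
To compare the pooled embeddings, a common follow-up (an illustrative sketch, not part of the original snippet) is to L2-normalize them so that dot products equal cosine similarities; this continues from the `sentence_embeddings` tensor computed above:

```python
import torch.nn.functional as F

# L2-normalize each row so the dot product of two rows is their cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T

print("Pairwise cosine similarities:")
print(cosine_scores)
```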


## Usage (Text Embeddings Inference (TEI))

[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing-fast inference solution for text embedding models.

- CPU:
```bash
docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id sentence-transformers/paraphrase-multilingual-mpnet-base-v2 --pooling mean --dtype float16
```

- NVIDIA GPU:
```bash
docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id sentence-transformers/paraphrase-multilingual-mpnet-base-v2 --pooling mean --dtype float16
```

Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
```bash
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        "input": "This is an example sentence"
    }'
```
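
Because the endpoint is OpenAI-compatible, any HTTP client works as well. A minimal Python sketch (assuming a TEI container from above is listening on `localhost:8080`; the `requests` call is illustrative, not part of the original card):

```python
import requests

# POST to the OpenAI-compatible /v1/embeddings endpoint of the local TEI server
response = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        "input": "This is an example sentence",
    },
)
response.raise_for_status()

# The response follows the OpenAI format: the vector sits at data[0].embedding
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # 768
```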

See the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) for the full set of endpoints and parameters.


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
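
The `max_seq_length` of 128 means longer inputs are silently truncated to 128 tokens before pooling. If you need to check or change this in sentence-transformers, the `max_seq_length` attribute controls it (a hedged sketch; the model was trained on short paraphrase pairs, so quality on much longer texts is not guaranteed):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Inputs longer than this many tokens are truncated before encoding
print(model.max_seq_length)  # 128

# The cap can be raised toward the XLM-RoBERTa position-embedding limit (512),
# but the model was tuned on short paraphrases, so longer inputs are untested here
model.max_seq_length = 256
```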

## Citing & Authors

This model was trained by [sentence-transformers](https://www.sbert.net/).

If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```