---
language: en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- text-embeddings-inference
datasets:
- s2orc
- flax-sentence-embeddings/stackexchange_xml
- ms_marco
- gooaq
- yahoo_answers_topics
- code_search_net
- search_qa
- eli5
- snli
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- embedding-data/sentence-compression
- embedding-data/flickr30k-captions
- embedding-data/altlex
- embedding-data/simple-wiki
- embedding-data/QQP
- embedding-data/SPECTER
- embedding-data/PAQ_pairs
- embedding-data/WikiAnswers
pipeline_tag: sentence-similarity
---

# all-mpnet-base-v2
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
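
The embeddings can then be compared directly for semantic search. Below is a minimal, illustrative sketch using the library's `util.cos_sim` helper (the query and corpus sentences are made up for this example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Illustrative query and candidate sentences
query_embedding = model.encode("How do I bake bread?", convert_to_tensor=True)
corpus_embeddings = model.encode(
    ["Steps for baking a simple loaf", "The stock market fell today"],
    convert_to_tensor=True,
)

# Cosine similarity between the query and each candidate; higher = more similar
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores)
```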

## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Usage (Text Embeddings Inference (TEI))

[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing fast inference solution for text embedding models.

- CPU:
```bash
docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id sentence-transformers/all-mpnet-base-v2 --pooling mean --dtype float16
```

- NVIDIA GPU:
```bash
docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id sentence-transformers/all-mpnet-base-v2 --pooling mean --dtype float16
```

Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
```bash
curl http://localhost:8080/v1/embeddings \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "sentence-transformers/all-mpnet-base-v2",
        "input": ["This is an example sentence", "Each sentence is converted"]
    }'
```
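
The same endpoint can also be called from Python. Here is a minimal sketch using `requests`, assuming a TEI container started with one of the commands above is listening on `localhost:8080`:

```python
import requests

# Assumes a TEI container (see the docker commands above) is serving on localhost:8080
response = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "sentence-transformers/all-mpnet-base-v2",
        "input": ["This is an example sentence", "Each sentence is converted"],
    },
)
response.raise_for_status()

# OpenAI-compatible response: one embedding object per input string
embeddings = [item["embedding"] for item in response.json()["data"]]
print(len(embeddings), len(embeddings[0]))  # 2 embeddings, 768 dimensions each
```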

You can also consult the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) for the full list of endpoints.

------

## Background

The project aims to train sentence embedding models on very large, sentence-level datasets using a self-supervised
contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

We developed this model during the
[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
organized by Hugging Face, as part of the project
[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

## Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 384 word pieces is truncated.
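
With the sentence-transformers library, this limit is exposed as `max_seq_length`; a quick, illustrative check (lowering the limit is optional and shown only as an example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print(model.max_seq_length)  # 384: longer inputs are truncated to this many word pieces

# Optionally lower the limit, e.g. when only encoding short sentences
model.max_seq_length = 128
```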
## Training procedure

### Pre-training

We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.

### Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between each possible sentence pair in the batch.
We then apply the cross-entropy loss by comparing with the true pairs.

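A minimal sketch of this in-batch-negatives objective (the similarity scale and batch size below are illustrative; the exact setup is in `train_script.py`):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities: for row i, the true
    positive sits in column i; all other in-batch pairs act as negatives."""
    anchor_emb = F.normalize(anchor_emb, p=2, dim=1)
    positive_emb = F.normalize(positive_emb, p=2, dim=1)
    scores = anchor_emb @ positive_emb.T * scale                 # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # true pairs on the diagonal
    return F.cross_entropy(scores, labels)

# Illustrative call with random stand-in embeddings
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```
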
#### Hyperparameters

We trained our model on a TPU v3-8 for 100k steps with a batch size of 1024 (128 per TPU core).
We used a learning-rate warmup over the first 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
a 2e-5 learning rate. The full training script is available in this repository as `train_script.py`.

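As a sketch of that optimizer setup (assuming a linear decay schedule for illustration; the authoritative configuration is in `train_script.py`):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # stand-in for the encoder being fine-tuned

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,        # learning-rate warmup over the first 500 steps
    num_training_steps=100_000,  # 100k training steps in total
)
```
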
#### Training data

We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.

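A minimal sketch of such probability-weighted sampling (the dataset names and weights here are hypothetical; the real per-dataset weights live in `data_config.json`):

```python
import random

# Hypothetical weights; the real per-dataset weights are in data_config.json
dataset_weights = {"reddit": 0.6, "s2orc": 0.25, "stackexchange": 0.15}

def sample_batch_source(rng=random):
    """Pick which dataset the next training batch is drawn from."""
    names, weights = zip(*dataset_weights.items())
    return rng.choices(names, weights=weights, k=1)[0]

print(sample_batch_source())
```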

| Dataset | Paper | Number of training tuples |
|--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
| [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
| [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
| [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
| [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
| [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
| [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
| [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
| [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
| [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | - | 304,525 |
| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | - | 250,519 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | - | 250,460 |
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
| [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
| [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
| [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
| **Total** | | **1,170,060,424** |