---
license: apache-2.0
language:
- en
- ar
- zh
- nl
- fr
- de
- hi
- in
- it
- ja
- pt
- ru
- es
- vi
- multilingual
datasets:
- unicamp-dl/mmarco
base_model:
- nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- transformers
---
# Cross-Encoder for multilingual MS Marco

This model was trained on the [MMARCO](https://hf.co/unicamp-dl/mmarco) dataset, a machine-translated version of MS MARCO created with Google Translate. The dataset was translated to 14 languages. In our experiments, we observed that the model also performs well for other languages.

As a base model, we used the [multilingual MiniLMv2](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model.

The model can be used for Information Retrieval: given a query, score the query together with every candidate passage (e.g. retrieved with ElasticSearch), then sort the passages by score in decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco).

## Usage with SentenceTransformers

Usage is straightforward once [SentenceTransformers](https://www.sbert.net/) is installed. You can then use the pre-trained model like this:

```python
from sentence_transformers import CrossEncoder

# Replace 'model_name' with the identifier of this model
model = CrossEncoder('model_name')
# Each input is a (query, passage) pair; one score is returned per pair
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])
```

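The retrieve-and-sort step described in the introduction can be sketched as follows. This is a minimal sketch rather than part of the library: `rerank` and `toy_score_fn` are hypothetical names introduced here, and the toy scorer only stands in for a real model so that the snippet runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the passages best-first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    # Higher score means more relevant, so sort in decreasing order
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_score_fn(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


if __name__ == "__main__":
    candidates = ["New York museum", "Berlin has a large population"]
    for passage, score in rerank("berlin population", candidates, toy_score_fn):
        print(f"{score:.1f}\t{passage}")
```

With a loaded model, `rerank(query, passages, model.predict)` would rank by the Cross-Encoder scores instead, assuming `model.predict` accepts a list of (query, passage) pairs as in the example above.
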
## Usage with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace 'model_name' with the identifier of this model
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# The query is repeated so that it is paired with each candidate passage
queries = ['How many people live in Berlin?', 'How many people live in Berlin?']
passages = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```
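
The printed logits are unbounded relevance scores. If a value between 0 and 1 is more convenient, one option is to pass them through a sigmoid, which is monotonic and therefore leaves the ranking unchanged. The sketch below assumes the model emits a single logit per pair, so that `scores.tolist()` has the form `[[x1], [x2], ...]`; `logits_to_unit_scores` is a hypothetical helper name, and the resulting values should not be read as calibrated probabilities.

```python
import math
from typing import List, Sequence


def stable_sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_unit_scores(logits: Sequence[Sequence[float]]) -> List[float]:
    """Flatten [[x1], [x2], ...] and map each logit into the range 0 to 1."""
    return [stable_sigmoid(row[0]) for row in logits]


if __name__ == "__main__":
    # Made-up logits for illustration, not actual model output
    example_logits = [[4.0], [-4.0], [0.0]]
    print(logits_to_unit_scores(example_logits))
```

With the code above, this could be called as `logits_to_unit_scores(scores.tolist())`.
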