---
license: apache-2.0
language:
- en
- ar
- zh
- nl
- fr
- de
- hi
- in
- it
- ja
- pt
- ru
- es
- vi
- multilingual
datasets:
- unicamp-dl/mmarco
base_model:
- nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- transformers
---
# Cross-Encoder for multilingual MS Marco

This model was trained on the [MMARCO](https://hf.co/unicamp-dl/mmarco) dataset, a machine-translated version of MS MARCO that was translated into 14 languages using Google Translate. In our experiments, we observed that the model also performs well for other languages.

As a base model, we used the [multilingual MiniLMv2](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model.

The model can be used for Information Retrieval: given a query, encode the query with all possible passages (e.g. retrieved with ElasticSearch), then sort the passages by score in decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco).
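
The re-ranking step described above can be sketched as follows. This is a minimal illustration, not the library's own implementation: `rerank` and `toy_score_fn` are hypothetical names introduced here, and the toy scorer (a simple word-overlap count) only stands in for the cross-encoder so that the snippet runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return passages by decreasing score."""
    # Pair the same query with each candidate passage.
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    # Highest score first.
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_score_fn(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


passages = [
    "New York City is famous for museums",
    "Berlin has many people who live there",
]
ranked = rerank("How many people live in Berlin", passages, toy_score_fn)
```

In real use, the `predict` method of a loaded `CrossEncoder` could be passed as `score_fn` in place of the toy scorer, since it accepts a list of (query, passage) pairs and returns one score per pair.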

## Usage with SentenceTransformers

Usage is straightforward once [SentenceTransformers](https://www.sbert.net/) is installed. You can then use the pre-trained model like this:

```python
from sentence_transformers import CrossEncoder

# Replace 'model_name' with the identifier of this model.
model = CrossEncoder('model_name')
# Each input is a (query, passage) pair; predict returns one score per pair.
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])
```

## Usage with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# The first list holds the queries, the second the passages paired with them.
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```
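
To turn the printed logits into a ranking, one option is to flatten them and sort the pair indices. The sketch below assumes each pair yields a single relevance logit (a tensor of shape `(num_pairs, 1)`), which this card does not state explicitly. `rank_indices` is a hypothetical helper, and the numbers are illustrative values, not real model output.

```python
from typing import List, Sequence


def rank_indices(logit_rows: Sequence[Sequence[float]]) -> List[int]:
    """Return pair indices ordered from highest to lowest logit.

    Assumes each row holds exactly one relevance logit, i.e. the
    nested-list form of a tensor with shape (num_pairs, 1).
    """
    flat = [row[0] for row in logit_rows]
    return sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)


# Illustrative values only, not real model output.
example_logits = [[-1.5], [3.0], [0.25]]
order = rank_indices(example_logits)
```

With the snippet above, `scores.tolist()` would supply `logit_rows`, and the returned indices point back into the list of passages given to the tokenizer.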