---
license: apache-2.0
language:
- en
- ar
- zh
- nl
- fr
- de
- hi
- in
- it
- ja
- pt
- ru
- es
- vi
- multilingual
datasets:
- unicamp-dl/mmarco
base_model:
- nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- transformers
---
# Cross-Encoder for multilingual MS Marco

This model was trained on the [MMARCO](https://hf.co/unicamp-dl/mmarco) dataset, a machine-translated version of MS MARCO created with Google Translate and covering 14 languages. In our experiments, we observed that the model also performs well for other languages.

As a base model, we used the [multilingual MiniLMv2](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model.

The model can be used for Information Retrieval: given a query, encode the query together with each candidate passage (e.g. retrieved with ElasticSearch) to obtain a relevance score, then sort the passages by score in decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco)
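
The re-ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training or evaluation code: `rerank` and `toy_score` are hypothetical names, and `toy_score` is a simple word-overlap stand-in so that the snippet runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by score, highest first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


passages = [
    "New York City is famous for the Metropolitan Museum of Art.",
    "Berlin has a population of 3,520,031 registered inhabitants.",
]
ranked = rerank("How many people live in Berlin", passages, toy_score)
for passage, score in ranked:
    print(f"{score:.1f}\t{passage}")
```

With the real model, `score_fn` could be a thin wrapper such as `lambda pairs: model.predict(pairs)`, where `model` is a loaded `CrossEncoder`.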

## Usage with SentenceTransformers

Usage is straightforward once [SentenceTransformers](https://www.sbert.net/) is installed. You can then use the pre-trained model like this:
```python
from sentence_transformers import CrossEncoder

# Replace 'model_name' with the identifier of this model.
model = CrossEncoder('model_name')
# Each tuple is a (query, passage) pair; predict returns one score per pair.
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])
```

## Usage with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace 'model_name' with the identifier of this model.
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# The first list holds the queries, the second the passages; they are paired element-wise.
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")

model.eval()
# Disable gradient tracking for inference.
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```
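
The logits printed above contain one row per (query, passage) pair. Here is a small sketch of turning them into a ranking, assuming the model emits a single relevance logit per pair and that the tensor has been converted to a nested list with `scores.tolist()`. The numbers are made up for illustration, and `rank_from_logits` is a hypothetical helper, not part of any library.

```python
from typing import List


def rank_from_logits(logits: List[List[float]]) -> List[int]:
    """Return pair indices ordered from most to least relevant.

    Assumes each row holds exactly one relevance logit (shape [batch, 1]).
    """
    flat = [row[0] for row in logits]
    return sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)


# Made-up values standing in for `scores.tolist()`.
example_logits = [[-4.2], [8.7], [0.3]]
order = rank_from_logits(example_logits)
print(order)
```

The returned indices can be used to reorder the passage list that was passed to the tokenizer.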