---
pipeline_tag: sentence-similarity
language:
- en
tags:
- linktransformer
- sentence-transformers
- sentence-similarity
- tabular-classification

---

# {MODEL_NAME}

This is a [LinkTransformer](https://linktransformer.github.io/) model. At its core, it is a [sentence-transformers](https://www.SBERT.net) model; LinkTransformer simply wraps around that class.
It is designed for quick and easy record linkage (entity matching) through the LinkTransformer package, which supports tasks such as clustering, deduplication, linking, aggregation, and more.
It can also be used for any sentence-similarity task within the sentence-transformers framework itself.
The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Take a look at the [sentence-transformers](https://www.sbert.net/index.html) documentation if you want to use this model beyond what we support in our applications.
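Since the model outputs fixed-size dense vectors, semantic search reduces to cosine similarity between a query embedding and the corpus embeddings. A minimal NumPy sketch, using tiny hypothetical 4-dimensional vectors in place of the model's 768-dimensional embeddings:

```python
import numpy as np

def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return the indices of the top_k corpus vectors by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity of each row with the query
    order = np.argsort(-scores)[:top_k]  # best matches first
    return order, scores[order]

# Stand-ins for the embeddings the model would produce for three corpus sentences.
corpus = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
idx, scores = semantic_search(query, corpus)
print(idx)  # → [0 1]
```

In practice you would obtain `query` and `corpus` by encoding text with this model and keep the rest of the pipeline unchanged.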


This model was fine-tuned from the base model multi-qa-mpnet-base-dot-v1 and is pretrained for the language: en.


It was trained on a dataset of company aliases from Wikidata using the LinkTransformer framework.
It was trained for 100 epochs, with the remaining defaults given in the repo's LinkTransformer config file, LT_training_config.json.


## Usage (LinkTransformer)

Using this model becomes easy when you have [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) installed:

```
pip install -U linktransformer
```

Then you can use the model like this:

```python
import linktransformer as lt
import pandas as pd

# Load the two dataframes that you want to link, e.g. two dataframes with
# company names that are written differently.
df1 = pd.read_csv("data/df1.csv")  # left dataframe, with a key column such as CompanyName
df2 = pd.read_csv("data/df2.csv")  # right dataframe, with a key column such as CompanyName

# Merge the two dataframes on the key column.
df_merged = lt.merge(df1, df2, on="CompanyName", how="inner")

# Done! The merged dataframe has a "score" column containing the
# similarity score between the two company names.
```
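The `score` column makes it easy to post-filter the merge to confident matches. A sketch on a mock result (the column names and the 0.8 threshold here are illustrative, not package defaults):

```python
import pandas as pd

# Mock of a merged result; in practice this comes from lt.merge as above.
df_merged = pd.DataFrame({
    "CompanyName_1": ["Acme Corp", "Globex"],
    "CompanyName_2": ["ACME Corporation", "Initech"],
    "score": [0.93, 0.41],
})

# Keep only matches above a chosen confidence threshold.
confident = df_merged[df_merged["score"] >= 0.8]
print(len(confident))  # → 1
```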


## Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone by simply adding a pooling layer. Any other transformer on Hugging Face can also be used by specifying the option add_pooling_layer=True.
The model was trained using SupCon loss.
Usage can be found in the package docs.
The training config can be found in the repo under the name LT_training_config.json.
To replicate the training, you can download the file and specify its path in the config_path argument of the training function. You can also override the config by specifying the training_args argument.
Here is an example:


```python
import linktransformer as lt

# Consider the example in the paper: a dataset of Mexican products and their
# tariff codes from 1947 and 1948, where we want to train a model to link the
# two sets of tariff codes.
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},
)

```
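For intuition, the SupCon objective mentioned above (supervised contrastive loss, Khosla et al. 2020) pulls embeddings with the same label together and pushes those with different labels apart. A minimal NumPy sketch of the loss on L2-normalized embeddings (a simplified illustration, not the package's actual `SupConLoss_wandb` implementation):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over a batch of embeddings z (one per row)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = np.exp(z @ z.T / tau)                        # exp(cosine / temperature)
    n = len(z)
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        denom = sim[i].sum() - sim[i, i]               # sum over all a != i
        losses.append(-np.mean([np.log(sim[i, p] / denom) for p in positives]))
    return float(np.mean(losses))

z = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])
# Matching aliases (same label) yield a lower loss than mismatched labels.
print(supcon_loss(z, [0, 0, 1, 1]) < supcon_loss(z, [0, 1, 0, 1]))  # → True
```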


You can also use this package for deduplication (clustering a dataframe on the supplied key column). Merging a fine class (like product) to a coarse class (like an HS code) is also possible.
Read our paper and the documentation for more!
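Conceptually, deduplication groups rows whose embeddings exceed a cosine-similarity threshold. A greedy toy sketch (the package handles this for you; the 0.9 threshold and 2-d vectors are illustrative only):

```python
import numpy as np

def dedup_clusters(embeddings, threshold=0.9):
    """Greedily assign each row to the cluster of the first row it closely matches."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                        # pairwise cosine similarities
    cluster = [-1] * len(z)
    next_id = 0
    for i in range(len(z)):
        if cluster[i] == -1:             # row i starts a new cluster
            cluster[i] = next_id
            next_id += 1
            for j in range(i + 1, len(z)):
                if cluster[j] == -1 and sim[i, j] >= threshold:
                    cluster[j] = cluster[i]
    return cluster

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(dedup_clusters(emb))  # → [0, 0, 1]
```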



## Evaluation Results

<!--- Describe how your model was evaluated -->

You can evaluate the model using the [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) package's inference functions.
We have provided a few datasets in the package for you to try out. We plan to host more datasets on Hugging Face and our website (coming soon) for you to explore.


## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 2087 with parameters:
```
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`linktransformer.modified_sbert.losses.SupConLoss_wandb`

Parameters of the fit()-Method:
```
{
    "epochs": 100,
    "evaluation_steps": 1044,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 208700,
    "weight_decay": 0.01
}
```
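Note that 208700 warmup steps equals the full run (2087 batches per epoch × 100 epochs), so under this config the learning rate is still ramping up linearly when training ends. A sketch of the WarmupLinear shape (a generic linear-warmup/linear-decay schedule, not sentence-transformers' exact code):

```python
def warmup_linear_lr(step, warmup_steps, total_steps, peak_lr=2e-05):
    """Linear ramp from 0 to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 2087 * 100  # 208700 optimizer steps in this run
# Halfway through training, the lr has only reached half of its peak.
print(warmup_linear_lr(total // 2, warmup_steps=208700, total_steps=total))  # → 1e-05
```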




## Full Model Architecture
```
LinkTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```

## Citing & Authors

```
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```