---
pipeline_tag: sentence-similarity
language:
- en
tags:
- linktransformer
- sentence-transformers
- sentence-similarity
- tabular-classification

---

# {MODEL_NAME}

This is a [LinkTransformer](https://linktransformer.github.io/) model. At its core, it is a [sentence-transformers](https://www.SBERT.net) model; LinkTransformer simply wraps around that class.
It is designed for quick and easy record linkage (entity matching) through the LinkTransformer package, which supports tasks such as clustering, deduplication, linking, aggregation, and more.
It can also be used for any sentence-similarity task within the sentence-transformers framework itself.
The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Take a look at the [sentence-transformers](https://www.sbert.net/index.html) documentation if you want to use this model beyond what we support in our applications.
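Since the model outputs fixed-size dense vectors, semantic search reduces to cosine similarity between a query embedding and the corpus embeddings. A minimal NumPy sketch, using tiny hypothetical 4-dimensional vectors in place of the model's 768-dimensional embeddings:

```python
import numpy as np

def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return the indices of the top_k corpus vectors by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity of each row with the query
    order = np.argsort(-scores)[:top_k]  # best matches first
    return order, scores[order]

# Stand-ins for the embeddings the model would produce for three corpus sentences.
corpus = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
idx, scores = semantic_search(query, corpus)
print(idx)  # → [0 1]
```

In practice you would obtain `query` and `corpus` by encoding text with this model and keep the rest of the pipeline unchanged.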


This model was fine-tuned from the base model multi-qa-mpnet-base-dot-v1 and is pretrained for the language: en.


It was trained on a dataset of company aliases from Wikidata using the LinkTransformer framework.
It was trained for 100 epochs, with the remaining defaults given in the repo's LinkTransformer config file, LT_training_config.json.


## Usage (LinkTransformer)

Using this model becomes easy when you have [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) installed:

```
pip install -U linktransformer
```

Then you can use the model like this:

```python
import linktransformer as lt
import pandas as pd

# Load the two dataframes that you want to link, e.g. two dataframes with
# company names that are written differently.
df1 = pd.read_csv("data/df1.csv")  # left dataframe, with a key column such as CompanyName
df2 = pd.read_csv("data/df2.csv")  # right dataframe, with a key column such as CompanyName

# Merge the two dataframes on the key column.
df_merged = lt.merge(df1, df2, on="CompanyName", how="inner")

# Done! The merged dataframe has a "score" column containing the
# similarity score between the two company names.
```
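The `score` column makes it easy to post-filter the merge to confident matches. A sketch on a mock result (the column names and the 0.8 threshold here are illustrative, not package defaults):

```python
import pandas as pd

# Mock of a merged result; in practice this comes from lt.merge as above.
df_merged = pd.DataFrame({
    "CompanyName_1": ["Acme Corp", "Globex"],
    "CompanyName_2": ["ACME Corporation", "Initech"],
    "score": [0.93, 0.41],
})

# Keep only matches above a chosen confidence threshold.
confident = df_merged[df_merged["score"] >= 0.8]
print(len(confident))  # → 1
```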


## Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone by simply adding a pooling layer. Any other transformer on Hugging Face can also be used by specifying the option add_pooling_layer=True.
The model was trained using SupCon loss.
Usage can be found in the package docs.
The training config can be found in the repo under the name LT_training_config.json.
To replicate the training, you can download the file and specify its path in the config_path argument of the training function. You can also override the config by specifying the training_args argument.
Here is an example:


```python
import linktransformer as lt

# Consider the example in the paper: a dataset of Mexican products and their
# tariff codes from 1947 and 1948, where we want to train a model to link the
# two sets of tariff codes.
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},
)

```
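For intuition, the SupCon objective mentioned above (supervised contrastive loss, Khosla et al. 2020) pulls embeddings with the same label together and pushes those with different labels apart. A minimal NumPy sketch of the loss on L2-normalized embeddings (a simplified illustration, not the package's actual `SupConLoss_wandb` implementation):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over a batch of embeddings z (one per row)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = np.exp(z @ z.T / tau)                        # exp(cosine / temperature)
    n = len(z)
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        denom = sim[i].sum() - sim[i, i]               # sum over all a != i
        losses.append(-np.mean([np.log(sim[i, p] / denom) for p in positives]))
    return float(np.mean(losses))

z = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])
# Matching aliases (same label) yield a lower loss than mismatched labels.
print(supcon_loss(z, [0, 0, 1, 1]) < supcon_loss(z, [0, 1, 0, 1]))  # → True
```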


You can also use this package for deduplication (clustering a dataframe on the supplied key column). Merging a fine class (like product) to a coarse class (like an HS code) is also possible.
Read our paper and the documentation for more!
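Conceptually, deduplication groups rows whose embeddings exceed a cosine-similarity threshold. A greedy toy sketch (the package handles this for you; the 0.9 threshold and 2-d vectors are illustrative only):

```python
import numpy as np

def dedup_clusters(embeddings, threshold=0.9):
    """Greedily assign each row to the cluster of the first row it closely matches."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                        # pairwise cosine similarities
    cluster = [-1] * len(z)
    next_id = 0
    for i in range(len(z)):
        if cluster[i] == -1:             # row i starts a new cluster
            cluster[i] = next_id
            next_id += 1
            for j in range(i + 1, len(z)):
                if cluster[j] == -1 and sim[i, j] >= threshold:
                    cluster[j] = cluster[i]
    return cluster

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(dedup_clusters(emb))  # → [0, 0, 1]
```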



## Evaluation Results

<!--- Describe how your model was evaluated -->

You can evaluate the model using the [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) package's inference functions.
We have provided a few datasets in the package for you to try out. We plan to host more datasets on Hugging Face and our website (coming soon) for you to explore.


## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 2087 with parameters:
```
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`linktransformer.modified_sbert.losses.SupConLoss_wandb`

Parameters of the fit()-Method:
```
{
    "epochs": 100,
    "evaluation_steps": 1044,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 208700,
    "weight_decay": 0.01
}
```
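Note that 208700 warmup steps equals the full run (2087 batches per epoch × 100 epochs), so under this config the learning rate is still ramping up linearly when training ends. A sketch of the WarmupLinear shape (a generic linear-warmup/linear-decay schedule, not sentence-transformers' exact code):

```python
def warmup_linear_lr(step, warmup_steps, total_steps, peak_lr=2e-05):
    """Linear ramp from 0 to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 2087 * 100  # 208700 optimizer steps in this run
# Halfway through training, the lr has only reached half of its peak.
print(warmup_linear_lr(total // 2, warmup_steps=208700, total_steps=total))  # → 1e-05
```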




## Full Model Architecture
```
LinkTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```

## Citing & Authors

```
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```