---
base_model:
- intfloat/e5-small-v2
license: cc-by-4.0
pipeline_tag: tabular-classification
---

<p align="center">
  <img src="https://raw.githubusercontent.com/alanarazi7/TabSTAR/main/figures/tabstar_logo.png" alt="TabSTAR Logo" width="50%">
</p>

---

## Install

To fit a pretrained TabSTAR model to your own dataset, install the package:

```bash
pip install tabstar
```

---

## Quickstart Example

```python
from importlib.resources import files
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
y = x.pop('Genre_is_Drama')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
# For regression tasks, replace `TabSTARClassifier` with `TabSTARRegressor`.
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
y_pred = tabstar.predict(x_test)
print(classification_report(y_test, y_pred))
```

---

# 📚 TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

**Repository:** [alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)

**Paper:** [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations](https://arxiv.org/abs/2505.18125)

**License:** MIT © Alan Arazi et al.

---

## Abstract

> While deep learning has achieved remarkable success across many domains, it
> has historically underperformed on tabular learning tasks, which remain
> dominated by gradient boosting decision trees (GBDTs). However, recent
> advancements are paving the way for Tabular Foundation Models, which can
> leverage real-world knowledge and generalize across diverse datasets,
> particularly when the data contains free-text. Although incorporating language
> model capabilities into tabular tasks has been explored, most existing methods
> utilize static, target-agnostic textual representations, limiting their
> effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
> Semantically Target-Aware Representations. TabSTAR is designed to enable
> transfer learning on tabular data with textual features, with an architecture
> free of dataset-specific parameters. It unfreezes a pretrained text encoder and
> takes as input target tokens, which provide the model with the context needed
> to learn task-specific embeddings. TabSTAR achieves state-of-the-art
> performance for both medium- and large-sized datasets across known benchmarks
> of classification tasks with text features, and its pretraining phase exhibits
> scaling laws in the number of datasets, offering a pathway for further
> performance improvements.