---
base_model:
- intfloat/e5-small-v2
license: cc-by-4.0
pipeline_tag: tabular-classification
---

<p align="center">
<img src="https://raw.githubusercontent.com/alanarazi7/TabSTAR/main/figures/tabstar_logo.png" alt="TabSTAR Logo" width="50%">
</p>

---

## Install

To fit a pretrained TabSTAR model to your own dataset, install the package:

```bash
pip install tabstar
```

---

## Quickstart Example

```python
from importlib.resources import files
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
y = x.pop('Genre_is_Drama')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
# For regression tasks, replace `TabSTARClassifier` with `TabSTARRegressor`.
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
y_pred = tabstar.predict(x_test)
print(classification_report(y_test, y_pred))
```
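Note that `train_test_split` draws a different random 90/10 split on every run. For a reproducible, class-balanced evaluation you can pass `stratify` and `random_state` — a minimal sketch using a small synthetic frame (the `Title` and `Runtime` columns here are illustrative stand-ins, not the actual IMDB schema):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bundled IMDB data: a text column and a
# numeric column, with a perfectly balanced binary target.
x = pd.DataFrame({"Title": [f"Movie {i}" for i in range(20)],
                  "Runtime": range(60, 80)})
y = pd.Series([0, 1] * 10, name="Genre_is_Drama")

# stratify=y keeps the class ratio identical in train and test;
# random_state makes the split reproducible across runs.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, stratify=y, random_state=42)

print(len(x_train), len(x_test))  # 18 2
print(y_test.mean())              # 0.5 — class balance preserved
```

The stratified split matters most for small or imbalanced datasets, where an unlucky random split can leave the test set with a skewed class ratio.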

---

# 📚 TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

**Repository:** [alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)

**Paper:** [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations](https://arxiv.org/abs/2505.18125)

**License:** MIT © Alan Arazi et al.

---

## Abstract

> While deep learning has achieved remarkable success across many domains, it
> has historically underperformed on tabular learning tasks, which remain
> dominated by gradient boosting decision trees (GBDTs). However, recent
> advancements are paving the way for Tabular Foundation Models, which can
> leverage real-world knowledge and generalize across diverse datasets,
> particularly when the data contains free-text. Although incorporating language
> model capabilities into tabular tasks has been explored, most existing methods
> utilize static, target-agnostic textual representations, limiting their
> effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
> Semantically Target-Aware Representations. TabSTAR is designed to enable
> transfer learning on tabular data with textual features, with an architecture
> free of dataset-specific parameters. It unfreezes a pretrained text encoder and
> takes as input target tokens, which provide the model with the context needed
> to learn task-specific embeddings. TabSTAR achieves state-of-the-art
> performance for both medium- and large-sized datasets across known benchmarks
> of classification tasks with text features, and its pretraining phase exhibits
> scaling laws in the number of datasets, offering a pathway for further
> performance improvements.