---
license: apache-2.0
library_name: synthefy-nori
pipeline_tag: tabular-regression
tags:
- tabular
- tabular-regression
- tabular-foundation-model
- in-context-learning
- synthetic-data
- pytorch
---

<p align="center">
<img src="synthefy_nori_banner.png" alt="Nori" width="100%">
</p>

# Nori

**Nori** is a tabular foundation model for **regression** via in-context
learning (ICL). Given a few labeled rows as context, it predicts on new query rows in a
**single forward pass**, with no task-specific training or fine-tuning. The model is
trained **entirely on synthetic data**.

- **Documentation:** https://docs.synthefy.com/nori/
- **Repository:** https://github.com/Synthefy/synthefy-nori
- **Library:** `pip install synthefy-nori`
- **Checkpoint:** `nori.pt` (this repo)
- **Parameters:** ~5.9M
- **License:** Apache-2.0

## Results

Mean and median R² of the base model across 96 regression tasks from three
public benchmark suites (single H200, up to 50K context rows per dataset):

| Suite | Datasets | Mean R² | Median R² |
|-------|---------:|--------:|----------:|
| TabArena | 13 | 0.8117 | 0.8757 |
| TALENT | 72 | 0.7569 | 0.8802 |
| OpenML | 11 | 0.6373 | 0.5856 |
| **Overall** | **96** | **0.7506** | **0.8702** |
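
As a consistency check, the overall mean should equal the dataset-weighted average of the per-suite means, because it is taken over all 96 tasks. A short sketch using only the figures in the table above (the overall median cannot be recovered this way, since medians do not aggregate across groups):

```python
# Per-suite dataset counts and mean R², copied from the table above.
suites = {
    "TabArena": (13, 0.8117),
    "TALENT": (72, 0.7569),
    "OpenML": (11, 0.6373),
}

total_datasets = sum(n for n, _ in suites.values())
weighted_mean = sum(n * mean for n, mean in suites.values()) / total_datasets

print(total_datasets)           # 96
print(round(weighted_mean, 4))  # 0.7506
```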

Tables with many rows (large N, long context), which are common in TabArena, are the
current focus of the large-table training stages. These numbers are reproducible
end-to-end with one command; see
[Reproducing these numbers](https://github.com/Synthefy/synthefy-nori#reproducing-these-numbers).

> **Thinking** is an inference-time reasoning extension that improves these
> numbers further. Details are forthcoming.

## Usage

```bash
pip install synthefy-nori
```

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from synthefy_nori import NoriRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = NoriRegressor()        # downloads these weights from the Hub on first use
model.fit(X_train, y_train)    # "fit" just stores the labeled rows as context
pred = model.predict(X_test)   # predictions in a single forward pass, no training
```
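
The Results table reports R², so a natural next step is to score `pred` against `y_test`. The sketch below writes the metric out by hand so the formula is visible; `r2` is a hypothetical helper name used only for illustration, and the toy lists stand in for the real arrays.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = [float(v) for v in y_true]
    y_pred = [float(v) for v in y_pred]
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values standing in for y_test and pred from the example above.
y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                # 1.0: perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0: always predicting the mean
```

In practice `sklearn.metrics.r2_score(y_test, pred)` computes the same quantity. A negative value means the predictions are worse than a constant prediction of the mean.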

The model uses a GPU when one is available and falls back to CPU otherwise. A one-shot
helper skips the estimator object entirely:

```python
from synthefy_nori import predict

# X_train, y_train, and X_test are the arrays from the example above.
pred = predict(X_train, y_train, X_test, task="regression")
```

`predict` follows the `TabPFNRegressor.predict` contract: pass `output_type="mean"`
(default), `"median"`, or `"mode"` to choose the point estimate drawn from the model's
predictive distribution.
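
To make the three point estimates concrete, the sketch below derives them from a grid of predicted quantiles for a single query row. It assumes evenly spaced quantile levels, and `point_estimates` is a hypothetical helper; it illustrates the statistics involved and is not a description of how the library computes them.

```python
def point_estimates(quantiles):
    """Derive mean, median, and mode from sorted quantile values.

    Assumes quantiles[k] is the predicted value at level (k + 1) / (K + 1),
    i.e. evenly spaced levels. This spacing is an illustrative assumption.
    """
    k = len(quantiles)
    # Mean: averaging evenly spaced quantiles approximates the integral of
    # the quantile function over (0, 1), which is the expectation.
    mean = sum(quantiles) / k
    # Median: the quantile at level 0.5 (the middle element when K is odd).
    median = quantiles[k // 2]
    # Mode: each gap between adjacent quantiles carries the same probability
    # mass, so density is highest where the gap is narrowest.
    gaps = [quantiles[i + 1] - quantiles[i] for i in range(k - 1)]
    i_min = min(range(k - 1), key=gaps.__getitem__)
    mode = 0.5 * (quantiles[i_min] + quantiles[i_min + 1])
    return mean, median, mode


# A right-skewed toy distribution described by 9 quantiles (levels 0.1 to 0.9).
q = [1.0, 1.2, 1.3, 1.5, 1.8, 2.3, 3.0, 4.5, 8.0]
mean, median, mode = point_estimates(q)
print(mean, median, mode)
```

For this right-skewed example the three estimates are ordered mode < median < mean, which is why the choice of `output_type` can matter on targets with long tails.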

To run from a local checkpoint instead of the Hub default, pass a path:

```python
model = NoriRegressor(model_path="path/to/checkpoint.pt")
```

This checkpoint is **public**: the first inference call downloads and caches it
automatically, with no token and no access request. A Hugging Face token (read scope)
is only worth setting if you hit anonymous download rate limits. Provide it via
`export HF_TOKEN=hf_...`, `hf auth login`, or `NoriRegressor(token="hf_...")`.

## How it works

### Architecture

The model is a **FeaturesTransformer (~5.9M parameters)** that alternates two kinds of
attention:

- **Feature attention** learns relationships between columns.
- **Sample attention** learns relationships between rows (context and query).

Predictions rely on **in-context learning**: they condition on the labeled context
rows, with no gradient updates at inference.
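
The alternation between the two attention axes can be sketched in a few lines of NumPy. This is a deliberately stripped-down illustration, not the model's code: it uses a single head with no learned projections, normalization, or MLP, and it assumes that rows attend only to the labeled context rows in the sample-attention step, which the text above does not state. `attention` and `alternating_block` are hypothetical names.

```python
import numpy as np


def attention(q, k, v):
    """Single-head scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def alternating_block(x, n_context):
    """x has shape (rows, feature_groups, dim); the first n_context rows are labeled."""
    # Feature attention: within each row, feature groups attend to each other.
    x = x + attention(x, x, x)
    # Sample attention: within each feature group, every row attends to the
    # context rows only (an assumption), so query rows never see one another.
    xt = np.swapaxes(x, 0, 1)        # (feature_groups, rows, dim)
    ctx = xt[:, :n_context]
    xt = xt + attention(xt, ctx, ctx)
    return np.swapaxes(xt, 0, 1)     # back to (rows, feature_groups, dim)


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3, 8))       # 4 context rows followed by 2 query rows
out = alternating_block(x, n_context=4)
print(out.shape)                     # (6, 3, 8)
```

Under the context-only assumption, changing one query row leaves the outputs for every other row untouched, which is what lets many query rows be predicted independently in one pass.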

Key config: 16 transformer layers, embed_dim 128, hidden 384, 2 heads, the **v2-lite**
block (SwiGLU + RMSNorm + pre-norm), features grouped in pairs (`features_per_group=2`),
with **column-specific y-aware** feature attention. Features are encoded with RBF
embeddings; missing values are handled natively via learned mask embeddings. The
regression head predicts a full distribution over 999 quantiles (pinball loss).
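
For readers unfamiliar with the pinball (quantile) loss mentioned above, here is a minimal sketch of the standard definition. The evenly spaced level grid (0.001 to 0.999 for 999 quantiles) is an assumption, and both function names are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss for one prediction at quantile level tau."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball_loss(y_true, quantile_preds):
    """Average pinball loss over an evenly spaced quantile grid.

    Assumes quantile_preds[k] targets level (k + 1) / (K + 1); with K = 999
    that gives levels 0.001, 0.002, ..., 0.999 (an assumption, not confirmed).
    """
    k_total = len(quantile_preds)
    losses = [
        pinball_loss(y_true, q, (k + 1) / (k_total + 1))
        for k, q in enumerate(quantile_preds)
    ]
    return sum(losses) / k_total


# Under-predicting a high quantile costs more than over-predicting it.
print(round(pinball_loss(10.0, 8.0, tau=0.9), 3))   # 1.8 (under-prediction by 2)
print(round(pinball_loss(10.0, 12.0, tau=0.9), 3))  # 0.2 (over-prediction by 2)
```

The asymmetry is what pushes each output toward its own quantile level: the loss at level `tau` is minimized in expectation when a fraction `tau` of the targets fall below the prediction.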

### Synthetic data

The model never sees real data during training. Its capability comes from a diverse
synthetic data generator covering real-world tabular regimes:

- **Structural Causal Models (SCM)**: hierarchical DAGs with 8 edge-function types
  (MLP, decision tree, piecewise-linear, polynomial, periodic, RBF, log/exp, conv1d).
- **Regression priors**: 9 target families (dense/sparse linear, GAM, interactions,
  random MLP, random tree, radial/RBF, Fourier features, chained trigonometric).
- **Realism augmentations**: discretized features, noise features, correlated blocks,
  structural missingness, label noise.
- **Learnability filter**: an ExtraTrees signal-quality filter rejects unlearnable
  datasets so training compute is spent on learnable tasks.
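
As a rough illustration of the SCM idea only, the toy sampler below draws a random DAG, assigns a random edge function to each edge, and reads off one node as the regression target. Everything here is a simplifying assumption: it uses three edge functions instead of eight, a flat rather than hierarchical graph, and hypothetical names throughout. It is not the project's generator.

```python
import math
import random

# Three toy edge functions; the generator described above lists eight types.
EDGE_FUNCTIONS = [
    lambda v: 0.5 * v + 0.1 * v * v,    # polynomial
    lambda v: math.sin(2.0 * v),        # periodic
    lambda v: v if v > 0 else 0.2 * v,  # piecewise-linear
]


def sample_scm_dataset(n_rows, n_nodes, seed=0):
    """Sample a table from a random DAG; the last node serves as the target."""
    rng = random.Random(seed)
    # Nodes are indexed in topological order, so parents always have a lower index.
    parents = {}
    for j in range(n_nodes):
        candidates = list(range(j))
        n_par = rng.randint(0, min(2, len(candidates)))
        parents[j] = [(p, rng.choice(EDGE_FUNCTIONS)) for p in rng.sample(candidates, n_par)]

    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            # Root nodes are pure noise; child nodes add a small noise term.
            noise = rng.gauss(0.0, 1.0 if not parents[j] else 0.1)
            values.append(sum(f(values[p]) for p, f in parents[j]) + noise)
        rows.append(values)

    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


X, y = sample_scm_dataset(n_rows=100, n_nodes=6)
print(len(X), len(X[0]), len(y))  # 100 5 100
```

In this toy version the target node can end up with no parents, in which case `y` is pure noise. That is the kind of dataset a learnability filter like the one described above would be expected to reject.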

Training runs entirely on synthetic data and **trains to completion**: there is no
real-data validation in the loop, so no benchmark data is needed to train and no eval
signal influences checkpoint selection. See the
[training guide](https://github.com/Synthefy/synthefy-nori/blob/main/docs/training.md)
for the full curriculum recipe.

## Intended use & limitations

- **Intended for** small-to-medium tabular regression where in-context learning is
  attractive (no per-task training).
- **Limitations:** the current gap versus the best baselines is concentrated on
  **large-N / long-context** TabArena datasets, and dense O(N²) sample attention bounds
  the practical context size. Very large tables are the focus of the large-table
  training stages.
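
To give a sense of scale for the O(N²) term, the sketch below computes the memory needed to hold one fully materialized N × N attention score matrix. The 2-byte (half-precision) element size is an assumption, and the figure is per matrix: it would be multiplied by the number of heads and feature groups processed at once. Fused attention kernels can avoid materializing the matrix, so treat this as an upper-bound intuition rather than a measurement of this model.

```python
def attention_matrix_gib(n_rows, bytes_per_value=2):
    """Memory for one materialized N x N attention score matrix, in GiB."""
    return n_rows * n_rows * bytes_per_value / 2**30


for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} rows -> {attention_matrix_gib(n):.3f} GiB")
```

Doubling the number of rows quadruples the result, which is why the 50K-row setting used in the benchmarks sits close to the practical ceiling for dense attention.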

## Citation

```bibtex
@software{synthefy_nori_2026,
  title = {Nori: A Tabular Foundation Model Trained on Synthetic Data},
  author = {Synthefy},
  year = {2026},
  url = {https://github.com/Synthefy/synthefy-nori}
}
```

## License

Apache-2.0. See
[LICENSE](https://github.com/Synthefy/synthefy-nori/blob/main/LICENSE) and
[NOTICE](https://github.com/Synthefy/synthefy-nori/blob/main/NOTICE).