---
license: apache-2.0
library_name: synthefy-nori
pipeline_tag: tabular-regression
tags:
- tabular
- tabular-regression
- tabular-foundation-model
- in-context-learning
- synthetic-data
- pytorch
---

<p align="center">
<img src="synthefy_nori_banner.png" alt="Nori" width="100%">
</p>

# Nori

**Nori** is a tabular foundation model for **regression** via in-context
learning (ICL). Given a few labeled rows as context, it predicts on new query rows in a
**single forward pass**, with no task-specific training or fine-tuning. The model is
trained **entirely on synthetic data**.

- **Documentation:** https://docs.synthefy.com/nori/
- **Repository:** https://github.com/Synthefy/synthefy-nori
- **Library:** `pip install synthefy-nori`
- **Checkpoint:** `nori.pt` (this repo)
- **Parameters:** ~5.9M
- **License:** Apache-2.0

## Results

Mean and median R² of the base model across 96 regression tasks from three
public benchmark suites (single H200, up to 50K context rows per dataset):

| Suite | Datasets | Mean R² | Median R² |
|-------|---------:|--------:|----------:|
| TabArena | 13 | 0.8117 | 0.8757 |
| TALENT | 72 | 0.7569 | 0.8802 |
| OpenML | 11 | 0.6373 | 0.5856 |
| **Overall** | **96** | **0.7506** | **0.8702** |

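As a quick consistency check, the **Overall** mean in the table matches the dataset-weighted average of the three suite means. The snippet below is only that arithmetic on the reported figures, with illustrative variable names. Medians do not combine this way, so the overall median cannot be re-derived from the per-suite rows.

```python
# (number of datasets, mean R²) per suite, copied from the table above
suites = {
    "TabArena": (13, 0.8117),
    "TALENT": (72, 0.7569),
    "OpenML": (11, 0.6373),
}

total_datasets = sum(n for n, _ in suites.values())
# Weight each suite mean by how many datasets it covers
overall_mean = sum(n * mean for n, mean in suites.values()) / total_datasets

print(total_datasets, round(overall_mean, 4))  # 96 0.7506
```
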
Large-N / long-context tables (common in TabArena) are the current focus of the
large-table training stages. These numbers are reproducible end-to-end with one
command; see [Reproducing these numbers](https://github.com/Synthefy/synthefy-nori#reproducing-these-numbers).

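For reference, R² here is taken to be the standard coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal pure-Python sketch follows; `r_squared` is an illustrative name, and the benchmark harness may compute the metric differently (for example through `sklearn.metrics.r2_score`).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(y_true, y_pred), 4))  # 0.9486
```

A score of 1 is a perfect fit, 0 matches always predicting the mean of `y_true`, and the score goes negative for predictions worse than that. Because R² is unbounded below, a few hard datasets can pull a suite mean well under its median.
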
> **Thinking** is an inference-time reasoning extension that improves these
> numbers further. Details are forthcoming.

## Usage

```bash
pip install synthefy-nori
```

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from synthefy_nori import NoriRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = NoriRegressor()        # downloads these weights from the Hub on first use
model.fit(X_train, y_train)    # "fit" just stores the labeled rows as context
pred = model.predict(X_test)   # predictions in a single forward pass, no training
```

Nori uses a GPU when one is available and falls back to CPU. A one-shot helper skips the
estimator object entirely:

```python
from synthefy_nori import predict
pred = predict(X_train, y_train, X_test, task="regression")
```

`predict` follows the `TabPFNRegressor.predict` contract: pass `output_type="mean"`
(default), `"median"`, or `"mode"` to choose the point estimate drawn from the model's
predictive distribution.

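To make the three `output_type` options concrete, here is a minimal sketch of how a point estimate could be read off a grid of predicted quantiles (the Architecture section states that the regression head outputs quantiles). This is an illustration, not the library's implementation: `point_estimate` is a hypothetical helper, the quantile levels are assumed to be evenly spaced, and the mode rule (midpoint of the narrowest gap between neighbouring quantiles, where density is highest) is one simple choice among several.

```python
import statistics


def point_estimate(quantiles, output_type="mean"):
    """Collapse predicted quantile values (at evenly spaced levels) to one number."""
    qs = sorted(quantiles)
    if output_type == "mean":
        # Averaging evenly spaced quantiles approximates the integral of the
        # quantile function over (0, 1), which is the distribution mean.
        return sum(qs) / len(qs)
    if output_type == "median":
        # With an odd number of evenly spaced levels, the middle value is the
        # 0.5 quantile.
        return statistics.median(qs)
    if output_type == "mode":
        # Density is highest where adjacent quantiles sit closest together.
        gaps = [(qs[i + 1] - qs[i], i) for i in range(len(qs) - 1)]
        _, i = min(gaps)
        return 0.5 * (qs[i] + qs[i + 1])
    raise ValueError(f"unknown output_type: {output_type!r}")


# A right-skewed toy distribution: the three estimates disagree.
toy_quantiles = [1.0, 2.0, 2.4, 3.0, 10.0]
for kind in ("mean", "median", "mode"):
    print(kind, point_estimate(toy_quantiles, kind))
```
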
To run from a local checkpoint instead of the Hub default, pass a path:

```python
model = NoriRegressor(model_path="path/to/checkpoint.pt")
```

This checkpoint is **public**: the first inference call downloads and caches it
automatically, with no token and no access request. A Hugging Face token (read scope)
is only worth setting if you hit anonymous download rate limits. Provide it via
`export HF_TOKEN=hf_...`, `hf auth login`, or `NoriRegressor(token="hf_...")`.

## How it works

### Architecture

A **FeaturesTransformer (~5.9M parameters)** that alternates two kinds of attention:

- **Feature attention** learns relationships between columns.
- **Sample attention** learns relationships between rows (context and query).

Prediction is by **in-context learning**: outputs condition on the labeled context rows,
with no gradient updates at inference.

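The alternation can be illustrated with a deliberately tiny, pure-Python toy. It is not the Nori implementation: it uses a single head with identity projections, omits residuals, norms, and MLPs, and assumes that every row attends only to the context rows during sample attention. All function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)])
    return out


def feature_attention(cells):
    # cells[row][col] is an embedding vector; mix information across the
    # columns of each row independently.
    return [attend(row, row) for row in cells]


def sample_attention(cells, n_context):
    # Mix information across rows within each column. Assumption: keys and
    # values come from the first n_context (labeled) rows only.
    n_rows, n_cols = len(cells), len(cells[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        column = [cells[r][c] for r in range(n_rows)]
        mixed = attend(column, column[:n_context])
        for r in range(n_rows):
            out[r][c] = mixed[r]
    return out


def toy_block(cells, n_context):
    """One alternation: feature attention, then sample attention."""
    return sample_attention(feature_attention(cells), n_context)


# 2 context rows + 1 query row, 2 columns, embedding dim 2
cells = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[2.0, -1.0], [0.3, 0.7]],
]
out = toy_block(cells, n_context=2)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
```
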
Key config: 16 transformer layers, `embed_dim` 128, hidden size 384, 2 heads, the **v2-lite**
block (SwiGLU + RMSNorm + pre-norm), features grouped in pairs (`features_per_group=2`),
with **column-specific y-aware** feature attention. Features are encoded with RBF
embeddings; missing values are handled natively via learned mask embeddings. The
regression head predicts a full distribution over 999 quantiles (pinball loss).

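For readers unfamiliar with it, the pinball (quantile) loss for level τ penalizes under-prediction with weight τ and over-prediction with weight 1 − τ, so its minimizer is the τ-quantile. The sketch below shows the standard formula averaged over a 999-level grid. The evenly spaced levels τ_k = k/1000 and the function names are assumptions for illustration; the source does not state how Nori spaces its levels or reduces the loss.

```python
N_QUANTILES = 999
# Assumed grid: 0.001, 0.002, ..., 0.999
TAUS = [k / (N_QUANTILES + 1) for k in range(1, N_QUANTILES + 1)]


def pinball_loss(y_true, q_pred, tau):
    """Standard pinball loss for one target and one predicted tau-quantile."""
    diff = y_true - q_pred
    # diff > 0: prediction too low, weighted by tau
    # diff < 0: prediction too high, weighted by (1 - tau)
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball_loss(y_true, quantile_preds, taus=TAUS):
    """Average the pinball loss over all quantile levels for one target."""
    return sum(pinball_loss(y_true, q, t) for q, t in zip(quantile_preds, taus)) / len(taus)


# Asymmetry at tau = 0.9: being 2 units too low costs 9x more than 2 units too high.
print(pinball_loss(3.0, 1.0, 0.9), pinball_loss(1.0, 3.0, 0.9))
```
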
### Synthetic data

The model never sees real data during training. Its capability comes from a diverse
synthetic data generator covering real-world tabular regimes:

- **Structural Causal Models (SCM)**: hierarchical DAGs with 8 edge-function types
  (MLP, decision tree, piecewise-linear, polynomial, periodic, RBF, log/exp, conv1d).
- **Regression priors**: 9 target families (dense/sparse linear, GAM, interactions,
  random MLP, random tree, radial/RBF, Fourier features, chained trigonometric).
- **Realism augmentations**: discretized features, noise features, correlated blocks,
  structural missingness, label noise.
- **Learnability filter**: an ExtraTrees signal-quality filter rejects unlearnable
  datasets so training compute is spent on learnable tasks.

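To give a feel for the SCM component, here is a heavily simplified, standard-library sketch of sampling one synthetic regression table from a random DAG. It is not the Nori generator: it uses four scalar stand-ins for the edge-function types, at most two parents per node, completely-at-random missingness instead of structural missingness, and no learnability filter. All names and default values are hypothetical.

```python
import math
import random

# Scalar stand-ins for a few of the edge-function types listed above
EDGE_FUNCTIONS = {
    "polynomial": lambda x: 0.5 * x * x - x,
    "periodic": lambda x: math.sin(2.0 * x),
    "piecewise_linear": lambda x: x if x > 0 else 0.2 * x,
    "rbf": lambda x: math.exp(-x * x),
}


def sample_scm_dataset(n_rows, n_nodes, seed=0, label_noise=0.1, missing_rate=0.05):
    """Sample (X, y) from a random DAG; the last node is the regression target."""
    assert n_nodes >= 2
    rng = random.Random(seed)
    target = n_nodes - 1

    # Node j may only have parents i < j, so the graph is acyclic by construction.
    parents = {}
    for j in range(n_nodes):
        min_parents = 1 if j == target else 0  # keep the target dependent on something
        k = rng.randint(min(min_parents, j), min(2, j))
        chosen = rng.sample(range(j), k)
        parents[j] = [(i, rng.choice(list(EDGE_FUNCTIONS))) for i in chosen]

    X, y = [], []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            if parents[j]:
                v = sum(EDGE_FUNCTIONS[name](values[i]) for i, name in parents[j])
                v += 0.1 * rng.gauss(0.0, 1.0)  # small structural noise
            else:
                v = rng.gauss(0.0, 1.0)  # root node: exogenous noise
            values.append(v)
        # Realism augmentations: random missing cells and noisy labels
        row = [math.nan if rng.random() < missing_rate else v for v in values[:target]]
        X.append(row)
        y.append(values[target] + label_noise * rng.gauss(0.0, 1.0))
    return X, y


X_syn, y_syn = sample_scm_dataset(n_rows=100, n_nodes=6, seed=42)
print(len(X_syn), len(X_syn[0]), len(y_syn))  # 100 5 100
```
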
Training runs entirely on synthetic data and **trains to completion**: there is no
real-data validation in the loop, so no benchmark data is needed to train and no eval
signal influences checkpoint selection. See the
[training guide](https://github.com/Synthefy/synthefy-nori/blob/main/docs/training.md)
for the full curriculum recipe.

## Intended use & limitations

- **Intended for** small-to-medium tabular regression where in-context learning is
  attractive (no per-task training).
- **Limitations:** the current gap vs. the best baselines is on **large-N / long-context**
  TabArena datasets; dense O(N²) sample attention bounds practical context size. Very large
  tables are the focus of the large-table training stages.

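A rough back-of-the-envelope view of the O(N²) limit: dense attention over N rows produces N × N scores per head, per layer, per attended sequence. The snippet below just tabulates that growth. The 2-bytes-per-score figure is an assumption (16-bit scores), and whether the score matrix is ever materialized depends on the attention kernel, which the source does not specify.

```python
BYTES_PER_SCORE = 2  # assumption: 16-bit attention scores


def dense_score_count(n_rows: int) -> int:
    """Number of pairwise scores in one dense attention map over n_rows."""
    return n_rows * n_rows


for n_rows in (1_000, 10_000, 50_000):
    scores = dense_score_count(n_rows)
    gib = scores * BYTES_PER_SCORE / 2**30
    print(f"{n_rows:>6} rows -> {scores:.2e} scores, ~{gib:.2f} GiB if materialized")
```

At 50,000 rows that is 2.5 billion scores, roughly 4.7 GiB at 2 bytes each for a single attention map. Fused kernels can avoid storing the map, but the compute still grows quadratically.
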
## Citation

```bibtex
@software{synthefy_nori_2026,
  title = {Nori: A Tabular Foundation Model Trained on Synthetic Data},
  author = {Synthefy},
  year = {2026},
  url = {https://github.com/Synthefy/synthefy-nori}
}
```

## License

Apache-2.0. See
[LICENSE](https://github.com/Synthefy/synthefy-nori/blob/main/LICENSE) and
[NOTICE](https://github.com/Synthefy/synthefy-nori/blob/main/NOTICE).
