---
license: mit
tags:
- finance
- stock-prediction
- lightgbm
- backtesting
- quantitative-trading
- time-series
library_name: lightgbm
pipeline_tag: tabular-regression
---

# Stock Prediction AI — Regime-Aware LightGBM

> I reverse-engineered a hedge fund's trading strategy and open-sourced the model.

A LightGBM regressor that predicts next-day log returns for 150 US stocks, paired with a one-line regime rule that decides whether to act on the prediction. Trained on a MacBook M1. No cloud GPU, no paid data beyond a single FMP API subscription.

**Author**: [@jc_builds](https://twitter.com/jc_builds)

---

![equity curves vs buy-and-hold vs SPY across 4 market regimes](images/equity_curves.png)

## TL;DR

| Regime           | Strategy vs B&H   | Strategy vs SPY    |
|------------------|-------------------|--------------------|
| 2010 bull        | −0.1%             | **+10.9%** ✓       |
| 2018 bear        | **+11.9%** ✓      | **+18.1%** ✓       |
| 2020 COVID       | **+22.0%** ✓      | **+38.8%** ✓       |
| 2022 bear        | −0.0%             | −3.8%              |
| Live 2025–26     | −3.2%             | −6.2%              |
| **5-split avg**  | **+6.1%**         | **+11.6%**         |

The model beats buy-and-hold in crashes (2018, 2020) and is roughly flat in bulls and grinding bears (2010, 2022, live). **It is a crash shield, not a stock picker.**

---

## How it works

![pipeline](images/pipeline.png)

The whole system is five boring parts:

1. **Universe** — 150 large-cap US equities, 20 years of daily bars from FMP.
2. **Features** — 47 engineered features per stock: multi-horizon returns, rolling vol, drawdowns, moving averages, SPY regime context, cross-sectional ranks. All strictly known at day `t` to predict day `t+1`.
3. **Model** — a single LightGBM regressor trained to predict next-day log return. MSE loss. Early stopping on a 10% chronological validation tail.
4. **Decision** — a per-stock long-or-cash gate, threshold tuned on validation to maximize excess-return-vs-buy-and-hold.
5. **Backtest** — walk-forward, out-of-sample, 5 bps per position change.

No shorts. No leverage. No options. No intraday.
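
Steps 4 and 5 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in the repository: the function names (`long_or_cash_return`, `buy_and_hold_return`, `tune_threshold`), the candidate grid, and the toy numbers are all made up here. It also assumes the position starts in cash, cash earns zero, and the 5 bps cost can be approximated by subtracting it in log-return space.

```python
import math

COST_PER_FLIP = 0.0005  # 5 bps per position change, as in the backtest description


def long_or_cash_return(preds, realized, threshold, cost=COST_PER_FLIP):
    """Total simple return of a long-or-cash gate on one stock.

    preds[i] is the predicted log return for day i, realized[i] the actual one.
    Assumption: we start in cash, so the first entry also pays the cost.
    """
    total_log = 0.0
    prev_pos = 0
    for pred, ret in zip(preds, realized):
        pos = 1 if pred > threshold else 0
        if pos != prev_pos:
            total_log -= cost  # approximation: cost charged as a log return
        total_log += pos * ret
        prev_pos = pos
    return math.exp(total_log) - 1.0


def buy_and_hold_return(realized):
    """Simple return from holding the stock every day."""
    return math.exp(sum(realized)) - 1.0


def tune_threshold(preds, realized, candidates):
    """Pick the candidate with the best excess return vs buy-and-hold."""
    bh = buy_and_hold_return(realized)
    return max(candidates, key=lambda t: long_or_cash_return(preds, realized, t) - bh)


# Toy validation data, invented purely for illustration
val_preds = [0.002, -0.004, -0.006, 0.001, 0.003]
val_realized = [0.010, -0.020, -0.030, 0.005, 0.015]
best = tune_threshold(val_preds, val_realized, candidates=[-0.005, -0.003, 0.0025])
```

On this toy data the middle candidate wins because it sidesteps both down days while staying long on all three up days. The threshold would then be frozen and applied to the out-of-sample test year.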

---

## The one rule that moved the needle

![regime rule diagram](images/regime_rule.png)

The raw model was already decent, but the thing that actually moved metrics was a **one-line regime gate**:

```python
# Pseudocode — see src/strategy.py for the real version
bull = spy_price > spy_200_day_moving_average
threshold = bull_threshold if bull else bear_threshold  # bull: -0.003, bear: tuned
position = 1 if pred > threshold else 0
```

- In **bull markets** (SPY above 200-day MA), demand a strongly bearish signal before going to cash. Otherwise stay long and earn the bull.
- In **bear markets**, use a stricter bar: go long only if the model genuinely expects up.

That one `if` statement lifted the 2018 split from flat to **+11.85% vs buy-and-hold**.
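
For readers who want something they can paste and run, here is a self-contained sketch of the same gate. It is not the version in `src/strategy.py`: the function name `regime_position`, the choice to include today's close in the 200-day average, and the example bear threshold of `0.001` are assumptions made for illustration. Only the bull threshold of `-0.003` comes from the pseudocode above.

```python
from statistics import fmean

BULL_THRESHOLD = -0.003  # from the pseudocode above
MA_WINDOW = 200


def regime_position(pred, spy_closes, bear_threshold, bull_threshold=BULL_THRESHOLD):
    """Return 1 (long) or 0 (cash) for one stock on one day.

    spy_closes: SPY closing prices up to and including today, oldest first.
    bear_threshold: tuned per split in the original; the caller must supply it.
    """
    if len(spy_closes) < MA_WINDOW:
        raise ValueError(f"need at least {MA_WINDOW} SPY closes")
    ma_200 = fmean(spy_closes[-MA_WINDOW:])  # assumption: window includes today
    bull = spy_closes[-1] > ma_200
    threshold = bull_threshold if bull else bear_threshold
    return 1 if pred > threshold else 0


# Synthetic SPY paths, invented for illustration
rising = [100 + 0.1 * i for i in range(250)]   # last close above its 200-day mean
falling = [200 - 0.1 * i for i in range(250)]  # last close below its 200-day mean

mildly_negative = -0.001
in_bull = regime_position(mildly_negative, rising, bear_threshold=0.001)
in_bear = regime_position(mildly_negative, falling, bear_threshold=0.001)
```

The same mildly negative prediction keeps the position long in the rising path and moves it to cash in the falling one, which is the asymmetry the two bullets describe.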

---

## Results across 5 walk-forward regimes

![excess return vs SPY by regime](images/excess_vs_spy.png)

The training is a classic walk-forward: for each test year `Y`, train on all data from 2005 through `Y-1`, never touching `Y`. For the live split, training ran through 2025-04-22 and testing ran forward from there.
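
The split logic can be sketched as follows. This is an assumption-laden illustration rather than the repository's code: `walk_forward_split` is a hypothetical helper, and carving the 10% early-stopping tail out of the end of the training window is one plausible reading of the description above. The dates are monthly placeholders, not real trading days.

```python
from datetime import date


def walk_forward_split(dates, test_year, first_train_year=2005, val_frac=0.10):
    """Split sorted dates into train / validation / test for one test year.

    The fitting window covers first_train_year..test_year-1. Validation is the
    chronological tail (last val_frac) of that window. Test is test_year only.
    """
    fit = [d for d in dates if first_train_year <= d.year < test_year]
    test = [d for d in dates if d.year == test_year]
    n_val = max(1, int(len(fit) * val_frac))
    return fit[:-n_val], fit[-n_val:], test


# Placeholder calendar: first day of each month, 2005 through 2010
dates = [date(y, m, 1) for y in range(2005, 2011) for m in range(1, 13)]
train, val, test = walk_forward_split(dates, test_year=2010)
```

Because the three pieces are strictly ordered in time, nothing from the test year can influence either the fitted trees or the early-stopping decision.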

Where it won (crashes):

- **2018 bear** — volatile Q4 crash. Strategy +18.05% vs SPY, beat B&H on 128 of 130 stocks.
- **2020 COVID** — V-shaped crash and rally. Strategy +38.83% vs SPY, beat B&H on 113 of 138 stocks.

Where it barely won or lost (bulls and grinding bears):

- **2010 bull** — +10.90% vs SPY (the 150-stock basket happened to beat SPY significantly in 2010, and the strategy just matched the basket minus costs).
- **2022 bear** — −3.78% vs SPY. LightGBM's early-stopping triggered at iteration 1; the 2005–2021 training distribution simply didn't predict 2022's rate-hike bear well. Effectively always-long, matched the basket minus costs.
- **Live 2025–26** — −6.16% vs SPY. A strong bull year. Same ceiling problem as 2010 — a long-or-cash strategy's upper bound in a straight-up bull is basically B&H minus costs.

**Average across 675 stock-years tested: +11.57% excess vs SPY, +6.11% excess vs per-stock buy-and-hold.**
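
As a quick arithmetic check, the excess-vs-SPY headline matches an equal-weighted mean of the five per-split figures quoted above. That suggests, though the card does not say so explicitly, that the average is taken per split rather than weighted by the number of stocks in each split.

```python
# Per-split excess return vs SPY, in percent, copied from the bullets above
excess_vs_spy = {
    "2010 bull": 10.90,
    "2018 bear": 18.05,
    "2020 COVID": 38.83,
    "2022 bear": -3.78,
    "Live 2025-26": -6.16,
}

avg = sum(excess_vs_spy.values()) / len(excess_vs_spy)
print(f"{avg:.2f}%")  # prints 11.57%
```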

---

## What worked vs what failed

![worked vs failed](images/worked_vs_failed.png)

Ten model versions are listed below (counting v4b). Seven regressed, one shipped as the champion, and v9 ships separately. The champion isn't clever — it's the simplest thing that works plus one regime rule.

| Version | Idea | Outcome |
|---------|------|---------|
| v1 | Plain LightGBM, MSE, val-tuned threshold | baseline that works |
| v2 | Asymmetric y-weighted loss | **regressed** — early-stopped at iter 1 |
| v3 | Magnitude-weighted training | **regressed** — averaged away crash calls |
| **v4** | **v1 + regime-aware threshold** | **champion — shipped** |
| v4b | v4 with fixed threshold | slight regression |
| v5 | 5-seed ensemble | **regressed** — averaged away high-conviction crash picks |
| v6 | P(big-down) classifier | **regressed** — target too rare |
| v7 | MLX MLP on Apple GPU | **regressed** — calibration collapsed |
| v8 | v4 trained on 2015+ only | **regressed** — best_iter 120 → 5, lost regularization |
| v9 | Multi-modal top-10 rotation | ships separately, feast-or-famine profile |

---

## What I actually learned

![learnings](images/learnings.png)

1. **The model only helps in crashes.** In a bull market, just buying and holding wins. Any long-or-cash strategy is capped at B&H minus transaction costs.
2. **A one-line rule beat every clever loss function.** I spent weeks on asymmetric losses, magnitude weighting, and custom objectives. A single "is SPY above its 200-day average?" gate outperformed all of them.
3. **More training data usually helps — but not always.** Twenty years of training usually beat ten. One version (v8) flipped it. You have to test instead of guess.
4. **Leave the losing year in the README.** The 2025 live run lost 6% vs SPY. Hiding it would make every other number untrustworthy.

---

## Artifacts

This repository contains:

```
stockprediction-ai/
├── README.md        this file
├── config.json      hyperparameters, feature columns, regime thresholds, splits
├── results.json     per-split metrics (strat final, B&H final, excess)
├── boosters/
│   ├── 2010.txt     LightGBM text-format booster, trained on 2005–2009
│   ├── 2018.txt     trained on 2005–2017
│   ├── 2020.txt     trained on 2005–2019
│   ├── 2022.txt     trained on 2005–2021
│   └── live.txt     trained on 2005-01-01 through 2025-04-22
└── images/          charts used in this card
```

Boosters are plain LightGBM text files — load them with any LightGBM runtime, no Python required.

---

## Using the live booster

```python
import json

import lightgbm as lgb
import pandas as pd
from huggingface_hub import hf_hub_download

REPO = "jc-builds/stockprediction-ai"

# 1. Grab the live booster and config
booster_path = hf_hub_download(REPO, "boosters/live.txt")
config_path = hf_hub_download(REPO, "config.json")

booster = lgb.Booster(model_file=booster_path)
with open(config_path) as f:
    config = json.load(f)
feature_cols = config["feature_columns"]

# 2. Build a feature row for one stock on one trading day
# (see github.com/jc-builds/stockprediction/blob/main/src/features.py
# for the full 47-feature builder — the code is open-source).
# build_features_for_today is a placeholder for that builder; it is not defined here.
features: pd.DataFrame = build_features_for_today(symbol="AAPL")
x = features[feature_cols].astype(float).values

# 3. Predict next-day log return
pred = booster.predict(x)[0]

# 4. Apply the regime rule
# spy_close_today and spy_200_day_ma_today are placeholders: supply today's SPY
# close and its trailing 200-day average from your own data source.
bull = spy_close_today > spy_200_day_ma_today
thr_bull = config["regime_rule"]["bull_threshold"]  # -0.003
thr_bear = config["splits"]["live"]["bear_threshold"]
threshold = thr_bull if bull else thr_bear
position = 1 if pred > threshold else 0  # 1=long, 0=cash
```

---

## Limitations — read before using

- **Not investment advice.** The live strategy finished about 6% behind SPY in a year SPY made 32%. Treat this as an educational artifact.
- **Survivorship bias** in the 150-stock universe. The chosen tickers are all still listed in 2025-26; anything that went bankrupt in 2008-10 isn't in the dataset.
- **Transaction cost model is linear.** 5 bps per flip. Real slippage is bigger on small caps and during panics.
- **No tax modeling.** Wash sales, short-term gains — ignored.
- **The regime rule is leaky in subtle ways.** SPY's 200-day MA uses the same history the booster is trained on. In practice this isn't lookahead (the rule only reads today's price vs its trailing mean), but any regime rule is a second optimization over the test set in disguise.
- **The 2022 case is an honest miss.** The 2005–2021 training distribution did not generalize. `best_iter=1` on v4 means the model "gave up" and returned a near-constant prediction. Any real deployment would need to detect this (e.g., refuse to trade when val loss is flat) rather than fall through to always-long.
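
One way the last point could be handled is a small guard that runs before any position is taken. The sketch below is hypothetical and not part of the repository: the names `model_looks_degenerate` and `gated_position`, both cut-off values, and the choice to return `None` for "no signal" are illustrative assumptions. The `best_iteration` argument is meant to be the value LightGBM reports after training with early stopping.

```python
from statistics import pstdev


def model_looks_degenerate(best_iteration, val_preds, min_iterations=5, min_pred_std=1e-6):
    """Heuristic guard: True if the booster probably learned nothing useful.

    Both cut-offs are illustrative assumptions, not values from the repository.
    """
    too_few_trees = best_iteration < min_iterations
    flat_predictions = pstdev(val_preds) < min_pred_std
    return too_few_trees or flat_predictions


def gated_position(position, best_iteration, val_preds):
    """Return None (no signal) instead of a position when the model is degenerate."""
    if model_looks_degenerate(best_iteration, val_preds):
        return None
    return position


# A 2022-style failure: early stopping at iteration 1, constant predictions
stalled = gated_position(1, best_iteration=1, val_preds=[0.0001] * 10)

# A healthy-looking model: many trees, predictions that actually vary
healthy = gated_position(1, best_iteration=120, val_preds=[0.001, -0.002, 0.003])
```

Returning `None` forces the caller to decide explicitly what "refuse to trade" means (stay in cash, hold the existing position, or skip the split), instead of silently defaulting to always-long.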

---

## Reproduce

```bash
git clone https://github.com/jc-builds/stockprediction
cd stockprediction
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Set FMP_API_KEY in .env, then:
PYTHONPATH=src python src/run.py --model v4
```

---

## Citation

If this was useful in your own work or teaching, please cite:

```bibtex
@misc{jcbuilds_stockprediction_2026,
  author = {Jared Cassoutt (@jc_builds)},
  title = {Stock Prediction AI: Regime-Aware LightGBM},
  year = {2026},
  howpublished = {\url{https://huggingface.co/jc-builds/stockprediction-ai}},
}
```

## License

MIT. Do whatever you want, but don't sue me when you lose money.