
# PQC Training Data Transparency

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![Merkle SHA3-256](https://img.shields.io/badge/Merkle-SHA3--256-green)
![ML-DSA](https://img.shields.io/badge/ML--DSA-FIPS%20204-green)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**Cryptographic transparency for AI training data.** Build an SHA3-256 Merkle tree over every record in your training set, sign the root with **ML-DSA** (FIPS 204), and publish it. Anyone who holds a single document can later receive an `O(log n)` inclusion proof showing that the record was in the training set — without revealing any of the other records. The audit trail survives the transition to post-quantum cryptography, so commitments made today remain verifiable in 2035 and beyond.


## The Problem


AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: *what exactly was used to train this model?* Model creators today have no cryptographic answer.

- "Prove this document was NOT in your training set": requires revealing the entire training set (impossible for proprietary or licensed data).
- "Prove your model wasn't trained on PII": requires deleting, then proving a negative.
- "Which records were used for fine-tune v2 vs v3?": no binding commitment exists, so claims are unfalsifiable.

And the few audit trails that do exist are typically RSA- or ECDSA-signed. A cryptographically relevant quantum computer breaks those signatures, and the entire audit chain collapses retroactively. Training data provenance has a 15-20 year shelf life; the crypto under it must survive that long.


## The Solution


Commit once, prove selectively:

- Hash every record into a leaf: `SHA3-256(SHA3-256(content) || "|" || canonical_json(metadata))`.
- Build an SHA3-256 Merkle tree over the leaves.
- Wrap the root in a `TrainingCommitment` (dataset name, version, record count, timestamps, licenses, tags).
- Sign the canonical commitment with **ML-DSA** at model-release time.
- Publish the commitment anywhere: on-chain, in a transparency log, on quantamrkt.com, or stapled to the model card.

Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (`log₂(n)` sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.
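As a rough illustration of the commit-then-prove flow, the standalone sketch below builds a SHA3-256 Merkle tree with `hashlib`, produces an inclusion proof, and verifies it. It is not the library's implementation: the helper names (`build_levels`, `inclusion_proof`, `verify_inclusion`) and the exact byte layout of the `0x00`/`0x01` prefixes are assumptions made for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def leaf_node(leaf: bytes) -> bytes:
    # 0x00 prefix marks a leaf (assumed layout for this sketch).
    return _h(b"\x00" + leaf)


def inner_node(left: bytes, right: bytes) -> bytes:
    # 0x01 prefix marks an interior node (assumed layout for this sketch).
    return _h(b"\x01" + left + right)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, bottom first; the last level holds the root."""
    if not leaves:
        raise ValueError("cannot build a tree with no leaves")
    levels = [[leaf_node(x) for x in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = [inner_node(cur[i], cur[i + 1]) for i in range(0, len(cur) - 1, 2)]
        if len(cur) % 2:
            nxt.append(cur[-1])  # unpaired node is promoted unchanged
        levels.append(nxt)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, str]]:
    """Collect (sibling_hash, side) pairs from the leaf level up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # a promoted node has no sibling at this level
            proof.append((level[sibling], "L" if sibling < index else "R"))
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
    node = leaf_node(leaf)
    for sibling, side in proof:
        node = inner_node(sibling, node) if side == "L" else inner_node(node, sibling)
    return node == root


leaves = [f"record-{i}".encode() for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = inclusion_proof(levels, 2)
print(verify_inclusion(leaves[2], proof, root))
print(verify_inclusion(b"never-in-training", proof, root))
```

The verifier needs only the record, the proof, and the root; it never sees the other leaves. In the real flow the root would additionally be checked against the ML-DSA-signed commitment.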


## Installation

```bash
pip install pqc-training-data-transparency
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start


### Build and sign a commitment

```python
from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, DataRecord,
)

identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

# your_documents: an iterable of bytes objects, one per training record.
corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))

# Publish commitment.to_json(); this is the public audit artifact.
```

### Prove a single record is in the training set

```python
# Continues from the snippet above (builder, corpus, commitment).
# CommitmentVerifier is assumed to be exported alongside CommitmentBuilder.
from pqc_training_data import CommitmentVerifier

# Auditor holds only one specific record + the public commitment.
proof = builder.tree.inclusion_proof(index=42)
result = CommitmentVerifier.verify(corpus[42], proof, commitment)

assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True
```

### Detect a forged inclusion claim

```python
forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0)  # hijack a real slot

result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified  # rejected
# result.error: "record leaf_hash ... does not match proof ..."
```

## Architecture

```
Training Pipeline (creator)                       Audit Path (third party)
--------------------------                        ------------------------
                                                              |
records = [doc1, doc2, ..., docN]                             |
    |                                                         |
    |  1. leaf_hash = SHA3-256(                               |
    |       SHA3-256(content) || canonical_json(metadata))    |
    v                                                         |
[leaf_1, leaf_2, ..., leaf_N]                                 |
    |                                                         |
    |  2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)        |
    v                                                         |
  ROOT                                                        |
    |                                                         |
    |  3. wrap in TrainingCommitment                          |
    |     (id, dataset, version, created_at, ...)             |
    |                                                         |
    |  4. ML-DSA.sign(canonical(commitment))                  |
    v                                                         |
SIGNED COMMITMENT --> published (on-chain, log, model card)   |
                                                              |
                                                              |  5. request
                                                              |     inclusion
                                                              |     proof for
                                                              |     record R
                                                              v
                                         InclusionProof (leaf, siblings, dirs, root)
                                                              |
                                                              |  6. verify:
                                                              |     ML-DSA(commitment) OK?
                                                              |     leaf_hash(R) == proof.leaf?
                                                              |     walk siblings -> root?
                                                              |     proof.root == commitment.root?
                                                              v
                                                     VerificationResult
                                                    (fully_verified T/F)
```

## Threat Model

| Threat | Handled | Notes |
|---|---|---|
| **Forged inclusion claim** (attacker claims doc X is in the set) | Yes | Verifier recomputes `leaf_hash(X)` and compares it to the proof's leaf; a mismatched leaf, or a walk that does not reach the signed root, is rejected. |
| **Tampered commitment fields** (attacker edits `dataset_name`, `record_count`, `root`) | Yes | Canonical bytes change, so the ML-DSA signature no longer verifies. |
| **Tampered inclusion proof** (attacker flips a sibling hash) | Yes | Root recomputation diverges from the signed root. |
| **Quantum forgery in 2035+** (CRQC forges the audit trail retroactively) | Yes | ML-DSA is a FIPS 204 post-quantum signature scheme; it is not known to be broken by Shor's or Grover's algorithm. |
| **Proving NON-inclusion** (prove a record was *not* in training) | No | Requires a sorted-tree / Verkle construction. Future work. |
| **Revealing private training data** | Yes (by design) | The commitment contains only the root; proofs reveal at most `⌈log₂(n)⌉` sibling hashes, never other records. The creator decides what to reveal. |
| **Selective disclosure of metadata fields** | No | A record's metadata is hashed into its leaf as a whole, so disclosure is all-or-nothing. Commit to individual fields separately if you need partial reveals. |
| **Re-publication of old commitment** (attacker re-uses prior root for a new model release) | Partial | `commitment_id` + `dataset_version` + `created_at` are all signed; enforce freshness by policy. |

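The "enforce freshness by policy" note could look something like the sketch below, which a verifier would run only after the ML-DSA signature has checked out. The function name `is_fresh`, the dictionary view of the commitment, the ISO 8601 format of `created_at`, and the 30-day window are all assumptions made for illustration; they are not part of the library.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(
    commitment: dict,
    expected_name: str,
    expected_version: str,
    release_time: datetime,
    max_age: timedelta = timedelta(days=30),  # hypothetical policy window
) -> bool:
    """Reject a commitment that was minted for a different dataset or long ago."""
    if commitment.get("dataset_name") != expected_name:
        return False
    if commitment.get("dataset_version") != expected_version:
        return False
    # Assumes created_at is an ISO 8601 string with an explicit UTC offset.
    created = datetime.fromisoformat(commitment["created_at"])
    age = release_time - created
    return timedelta(0) <= age <= max_age


commitment_fields = {
    "dataset_name": "model-v1-train",
    "dataset_version": "1.0.0",
    "created_at": "2025-01-10T00:00:00+00:00",
}
release = datetime(2025, 1, 20, tzinfo=timezone.utc)
print(is_fresh(commitment_fields, "model-v1-train", "1.0.0", release))
```

Because these fields are covered by the signature, an attacker cannot edit them to make an old commitment pass the check.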

## API Reference


### `DataRecord`

Frozen dataclass. One training example.

| Field / Method | Description |
|---|---|
| `content: bytes` | Raw record payload (doc text, image bytes, serialized row, ...). |
| `metadata: dict` | Arbitrary metadata; participates in the leaf hash. |
| `canonical_bytes()` | Deterministic `SHA3-256(content) \|\| "\|" \|\| canonical_json(metadata)`. |
| `leaf_hash() -> RecordHash` | SHA3-256 of canonical bytes: the Merkle leaf value. |
| `to_dict()` | Safe serialization. **Does not include raw content.** |

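To make the leaf construction concrete, here is a minimal standalone sketch of the `canonical_bytes()` / `leaf_hash()` recipe described in the table. The JSON canonicalization (sorted keys, compact separators, UTF-8) and the use of the raw 32-byte content digest rather than its hex form are assumptions; the library may differ in these details, in which case hashes from this sketch will not match its output.

```python
import hashlib
import json


def canonical_json(metadata: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def canonical_bytes(content: bytes, metadata: dict) -> bytes:
    # SHA3-256(content) || "|" || canonical_json(metadata)
    return hashlib.sha3_256(content).digest() + b"|" + canonical_json(metadata)


def leaf_hash(content: bytes, metadata: dict) -> str:
    # SHA3-256 over the canonical bytes, returned as hex.
    return hashlib.sha3_256(canonical_bytes(content, metadata)).hexdigest()


a = leaf_hash(b"hello", {"source": "internal", "id": 1})
b = leaf_hash(b"hello", {"id": 1, "source": "internal"})  # same metadata, other key order
c = leaf_hash(b"hello", {"source": "internal", "id": 2})
print(a == b, a == c)
```

Hashing the content first gives the canonical bytes a fixed-length prefix, so the boundary between content and metadata is unambiguous, and large payloads never need to be concatenated with their metadata in memory.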

### `MerkleTree`

SHA3-256 Merkle tree with RFC 6962-style odd-node promotion.

| Method | Description |
|---|---|
| `add(leaf_hash)` / `add_many(leaves)` | Append leaves. |
| `root() -> str` | Hex Merkle root. Raises `EmptyTreeError` for empty trees. |
| `inclusion_proof(index) -> InclusionProof` | `O(log n)` proof for leaf at `index`. |
| `MerkleTree.verify_inclusion(proof) -> bool` | Static verification, independent of tree state. |

### `InclusionProof`

Frozen dataclass carried from prover to verifier.

| Field | Description |
|---|---|
| `leaf_hash` | Hex of the leaf being proven. |
| `index`, `tree_size` | Position and total size at time of proof. |
| `root` | Hex root the prover claims. |
| `siblings`, `directions` | At most `⌈log₂(n)⌉` sibling hashes plus `'L'`/`'R'` flags. |

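For a sense of scale, the short calculation below gives the upper bound on proof size for a few corpus sizes. It assumes 32-byte SHA3-256 digests and counts only the sibling hashes, not the direction flags or any serialization overhead; `max_siblings` is a helper name made up for this example.

```python
HASH_BYTES = 32  # SHA3-256 digest size in bytes


def max_siblings(tree_size: int) -> int:
    """Upper bound on sibling hashes in one proof: ceil(log2(tree_size))."""
    if tree_size < 1:
        raise ValueError("tree must contain at least one leaf")
    return (tree_size - 1).bit_length()


for n in (1_000, 1_000_000, 1_000_000_000):
    k = max_siblings(n)
    print(f"{n:>13,} records -> at most {k} siblings, {k * HASH_BYTES} bytes")
```

Even for a billion records the proof carries at most 30 hashes, or 960 bytes. With odd-node promotion, leaves near the right edge of the tree can have shorter proofs.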

### `TrainingCommitment`

The signed audit artifact.

| Field | Description |
|---|---|
| `commitment_id` | `urn:pqc-td:<uuid>`. |
| `dataset_name`, `dataset_version`, `description` | Human-readable identification. |
| `record_count`, `root` | Cryptographic binding to the tree. |
| `created_at`, `licenses`, `tags`, `extra` | Provenance metadata, all signed. |
| `signer_did`, `algorithm`, `signature`, `public_key`, `signed_at` | ML-DSA signature block (populated by `CommitmentSigner.sign`). |
| `to_json()` / `from_json()` | Network-safe round-trip. |
| `canonical_bytes()` | Deterministic JSON covered by the signature. |

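The idea behind `canonical_bytes()` can be sketched as follows: serialize the signed fields deterministically so that signer and verifier hash exactly the same bytes. Which fields are excluded from the signed view, and the JSON settings used, are assumptions here; `canonical_commitment_bytes` is a hypothetical helper, not the library's method.

```python
import json

# Assumed: the signature block itself is left out of the bytes being signed.
SIGNATURE_FIELDS = {"signer_did", "algorithm", "signature", "public_key", "signed_at"}


def canonical_commitment_bytes(commitment: dict) -> bytes:
    signed_view = {k: v for k, v in commitment.items() if k not in SIGNATURE_FIELDS}
    return json.dumps(signed_view, sort_keys=True, separators=(",", ":")).encode("utf-8")


original = {
    "dataset_name": "model-v1-train",
    "dataset_version": "1.0.0",
    "record_count": 3,
    "root": "ab" * 32,  # placeholder hex root
}
reordered = dict(reversed(list(original.items())))
tampered = {**original, "record_count": 4}

print(canonical_commitment_bytes(original) == canonical_commitment_bytes(reordered))
print(canonical_commitment_bytes(original) == canonical_commitment_bytes(tampered))
```

Key order does not affect the output, but any change to a signed field does, which is why editing `record_count` or `root` after signing invalidates the signature.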

### `CommitmentBuilder`

Accumulates records and emits an unsigned `TrainingCommitment`.

| Method | Description |
|---|---|
| `CommitmentBuilder(dataset_name, dataset_version)` | Start a build. |
| `add_record(record)` / `add_records(records)` | Queue records. |
| `add_leaf_hash_hex(hex)` | Direct-add when the caller has pre-hashed the data. |
| `build(description="") -> TrainingCommitment` | Produce unsigned commitment. |
| `.tree` | Underlying `MerkleTree`; use it to generate inclusion proofs later. |

### `CommitmentSigner`

ML-DSA sign + verify.

| Method | Description |
|---|---|
| `CommitmentSigner(identity)` | Wrap a QuantumShield `AgentIdentity`. |
| `sign(commitment) -> TrainingCommitment` | Populate signature fields. |
| `CommitmentSigner.verify(commitment) -> bool` | Static: verify signature against embedded public key. |

### `CommitmentVerifier` + `VerificationResult`

End-to-end check of (record, proof, commitment).

| Method | Description |
|---|---|
| `CommitmentVerifier.verify(record, proof, commitment)` | Returns a `VerificationResult`. |
| `CommitmentVerifier.verify_or_raise(...)` | Raises `CommitmentVerificationError` on any failure. |

`VerificationResult` fields: `signature_valid`, `proof_valid`, `leaf_matches_record`, `commitment_id`, `record_leaf_hash`, `claimed_root`, `error`, and the `fully_verified` property.

### Exceptions

| Exception | When |
|---|---|
| `TrainingDataError` | Base class. |
| `EmptyTreeError` | Tree operation requires at least one leaf. |
| `InclusionProofError` | Malformed or unverifiable proof. |
| `CommitmentVerificationError` | Raised by `verify_or_raise` on failure. |
| `IndexOutOfRangeError` | Leaf index outside `[0, size)`. |

## Why PQC for Training Data


Training data provenance is a 15-to-20-year commitment:

- Regulatory discovery can ask about training data *decades* after the model was released.
- Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
- Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.

A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer (CRQC) exists. An adversary with a CRQC can recover the signing key from the public key, sign a fabricated root, and produce valid-looking inclusion proofs against it, collapsing the entire audit trail.


ML-DSA (FIPS 204) is not broken by Shor's algorithm. Commitments minted today remain verifiable through the post-quantum transition.


## Examples


See the `examples/` directory:

- **`commit_corpus.py`**: build a signed commitment over a small training corpus.
- **`prove_inclusion.py`**: produce and verify an `O(log n)` inclusion proof.
- **`detect_false_inclusion_claim.py`**: demonstrate rejection of a forged "my data was in training" claim.

Run them:

```bash
python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related


Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing**: sister tool for signing RAG corpus chunks with ML-DSA.
- **PQC Content Provenance**: signed manifests for content authenticity.
- **PQC MCP Transport**: signed JSON-RPC transport for Model Context Protocol.

## License


Apache License 2.0. See [LICENSE](LICENSE).
