# PQC Training Data Transparency

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![Merkle SHA3-256](https://img.shields.io/badge/Merkle-SHA3--256-green)
![ML-DSA](https://img.shields.io/badge/ML--DSA-FIPS%20204-green)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**Cryptographic transparency for AI training data.** Build a SHA3-256 Merkle tree over every record in your training set, sign the root with **ML-DSA** (FIPS 204), and publish it. Anyone who holds a single document can later be given an `O(log n)` inclusion proof showing that the record was in the training set, without revealing any of the other records. Because the signature scheme is post-quantum, commitments made today are designed to remain verifiable in 2035 and beyond.

## The Problem

AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: *what exactly was used to train this model?* Model creators today have no cryptographic answer.

- "Prove this document was NOT in your training set": requires revealing the entire training set (impossible for proprietary or licensed data).
- "Prove your model wasn't trained on PII": requires deleting, then proving a negative.
- "Which records were used for fine-tune v2 vs v3?": no binding commitment exists, so claims are unfalsifiable.

The few audit trails that do exist are typically RSA- or ECDSA-signed. A cryptographically relevant quantum computer (CRQC) would break those signatures, and the entire audit chain would collapse retroactively. Training data provenance has a 15-20 year shelf life; the cryptography under it must survive that long.

## The Solution

Commit once, prove selectively:

- Hash every record into a leaf: `SHA3-256(SHA3-256(content) || "|" || canonical_json(metadata))`.
- Build a SHA3-256 Merkle tree over the leaves.
- Wrap the root in a `TrainingCommitment` (dataset name, version, record count, timestamps, licenses, tags).
- Sign the canonical commitment with **ML-DSA** at model-release time.
- Publish the commitment anywhere: on-chain, in a transparency log, on quantamrkt.com, or stapled to the model card.

Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (at most `⌈log₂(n)⌉` sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.
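
The leaf-hashing step can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: it follows the `canonical_bytes()` layout given in the API reference, and it assumes that "canonical JSON" means sorted keys with compact separators and that the inner content digest is used as raw bytes. The helper names `canonical_json` and `leaf_hash` are illustrative.

```python
import hashlib
import json


def canonical_json(metadata: dict) -> bytes:
    """Deterministic JSON: sorted keys, compact separators (an assumed canonical form)."""
    return json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode("utf-8")


def leaf_hash(content: bytes, metadata: dict) -> str:
    """Hex SHA3-256 over SHA3-256(content) || "|" || canonical_json(metadata)."""
    content_digest = hashlib.sha3_256(content).digest()  # raw bytes (assumption)
    return hashlib.sha3_256(content_digest + b"|" + canonical_json(metadata)).hexdigest()


a = leaf_hash(b"hello", {"source": "internal", "id": 1})
b = leaf_hash(b"hello", {"id": 1, "source": "internal"})  # same metadata, different key order
c = leaf_hash(b"hello", {"source": "internal", "id": 2})  # one metadata value changed

print(a == b)  # True: key order does not affect the leaf
print(a == c)  # False: any metadata change produces a different leaf
```

Because metadata participates in the hash, an auditor needs the exact metadata as well as the content to reproduce a leaf.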

## Installation

```bash
pip install pqc-training-data-transparency
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Build and sign a commitment

```python
from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, DataRecord,
)

# Placeholder corpus: replace with your own iterable of raw record bytes.
your_documents = [f"document {i}".encode() for i in range(100)]

identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))

# Publish commitment.to_json(); this is the public audit artifact.
```

### Prove a single record is in the training set

```python
from pqc_training_data import CommitmentVerifier

# Creator side: generate the proof for one record (here, index 42) from the tree.
proof = builder.tree.inclusion_proof(index=42)

# Auditor side: holds only that record, the proof, and the public commitment.
result = CommitmentVerifier.verify(corpus[42], proof, commitment)

assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True
```

### Detect a forged inclusion claim

```python
forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0)  # hijack a real slot

result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified  # rejected
# result.error: "record leaf_hash ... does not match proof ..."
```

## Architecture

```
Training Pipeline (creator)                           Audit Path (third party)
---------------------------                           ------------------------
                                                                  |
records = [doc1, doc2, ..., docN]                                 |
   |                                                              |
   | 1. leaf_hash = SHA3-256(                                     |
   |      SHA3-256(content) || "|" || canonical_json(metadata))   |
   v                                                              |
[leaf_1, leaf_2, ..., leaf_N]                                     |
   |                                                              |
   | 2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)              |
   v                                                              |
  ROOT                                                            |
   |                                                              |
   | 3. wrap in TrainingCommitment                                |
   |      (id, dataset, version, created_at, ...)                 |
   |                                                              |
   | 4. ML-DSA.sign(canonical(commitment))                        |
   v                                                              |
SIGNED COMMITMENT --> published (on-chain, log, model card)       |
                                                                  |
                                                                  | 5. request
                                                                  |    inclusion
                                                                  |    proof for
                                                                  |    record R
                                                                  v
                                             InclusionProof (leaf, siblings, dirs, root)
                                                                  |
                                                                  | 6. verify:
                                                                  |    ML-DSA(commitment) OK?
                                                                  |    leaf_hash(R) == proof.leaf?
                                                                  |    walk siblings -> root?
                                                                  |    proof.root == commitment.root?
                                                                  v
                                                         VerificationResult
                                                        (fully_verified T/F)
```
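
Steps 2, 5, and 6 of the diagram can be sketched as follows. This is a standalone illustration rather than the package's `MerkleTree`: it assumes leaves are re-hashed with a `0x00` prefix and interior nodes with a `0x01` prefix, and that a lone last node is promoted unchanged to the next level. The function names are illustrative.

```python
import hashlib

LEAF_PREFIX = b"\x00"  # assumed domain-separation byte for leaves
NODE_PREFIX = b"\x01"  # assumed domain-separation byte for interior nodes


def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def build_levels(leaves: list) -> list:
    """Return every tree level, bottom first. A lone last node is promoted unchanged."""
    if not leaves:
        raise ValueError("tree needs at least one leaf")
    level = [h(LEAF_PREFIX + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        nxt = [h(NODE_PREFIX + level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node out: promote to the next level
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels: list, index: int) -> list:
    """Collect (sibling, direction) pairs from leaf to root; 'L' means the sibling sits on the left."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            path.append((level[sibling], "L" if sibling < index else "R"))
        # Otherwise this node was promoted without a sibling; nothing to record.
        index //= 2
    return path


def verify(leaf: bytes, path: list, root: bytes) -> bool:
    """Recompute the root from one leaf and its sibling path, then compare."""
    node = h(LEAF_PREFIX + leaf)
    for sibling, direction in path:
        node = h(NODE_PREFIX + sibling + node) if direction == "L" else h(NODE_PREFIX + node + sibling)
    return node == root


leaves = [f"leaf-{i}".encode() for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = inclusion_proof(levels, 3)

print(verify(leaves[3], proof, root))        # True: genuine record
print(verify(b"never-in-set", proof, root))  # False: forged record reusing a real proof
```

The verifier needs only the leaf, the sibling path, and the signed root, which is why verification can be independent of the tree state.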

## Threat Model

| Threat | Handled | Notes |
|---|---|---|
| **Forged inclusion claim** (attacker claims doc X is in the set) | Yes | Verifier recomputes `leaf_hash(X)` and compares it to the proof; the leaf mismatches or the walk to the root fails. |
| **Tampered commitment signature** (attacker edits dataset_name, record_count, root) | Yes | Canonical bytes change, so the ML-DSA signature no longer verifies. |
| **Tampered inclusion proof** (attacker flips a sibling hash) | Yes | Root recomputation diverges from the signed root. |
| **Quantum forgery in 2035+** (CRQC forges the audit trail retroactively) | Yes | ML-DSA is the FIPS 204 post-quantum signature standard; it is not known to be broken by Shor's or Grover's algorithm. |
| **Proving NON-inclusion** (prove a record was *not* in training) | No | Requires a sorted-tree / Verkle construction. Future work. |
| **Revealing private training data** | Yes (by design) | Commitment contains only the root; proofs reveal at most `⌈log₂(n)⌉` sibling hashes, never other records. The creator decides what to reveal. |
| **Selective disclosure of metadata fields** | No | A record's metadata is fully inside its leaf, so hashing over `metadata` is all-or-nothing. Commit to individual fields separately inside the leaf if you need partial reveals. |
| **Re-publication of old commitment** (attacker re-uses prior root for a new model release) | Partial | `commitment_id` + `dataset_version` + `created_at` are all signed; enforce freshness by policy. |
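
The "enforce freshness by policy" note in the last row is left to the verifier. One possible policy, sketched under assumptions: `created_at` is an ISO 8601 timestamp with a UTC offset, the verifier keeps a set of commitment IDs it has already accepted, and the check runs only after the signature has verified. The function `is_fresh`, its parameters, and the placeholder IDs are hypothetical, not part of the package.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(commitment_id: str, created_at: str, seen_ids: set,
             max_age: timedelta, now: datetime) -> bool:
    """Hypothetical policy: reject replayed IDs and commitments outside the age window."""
    if commitment_id in seen_ids:
        return False  # this commitment was already accepted for an earlier release
    created = datetime.fromisoformat(created_at)
    return timedelta(0) <= now - created <= max_age


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
seen = {"urn:pqc-td:old"}  # placeholder IDs; real ones have the form urn:pqc-td:<uuid>
week = timedelta(days=7)

print(is_fresh("urn:pqc-td:new", "2025-05-30T12:00:00+00:00", seen, week, now))   # True
print(is_fresh("urn:pqc-td:new2", "2024-01-01T00:00:00+00:00", seen, week, now))  # False: too old
print(is_fresh("urn:pqc-td:old", "2025-05-30T12:00:00+00:00", seen, week, now))   # False: replayed
```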

## API Reference

### `DataRecord`

Frozen dataclass. One training example.

| Field / Method | Description |
|---|---|
| `content: bytes` | Raw record payload (doc text, image bytes, serialized row, ...). |
| `metadata: dict` | Arbitrary metadata; participates in the leaf hash. |
| `canonical_bytes()` | Deterministic `SHA3-256(content) \|\| "\|" \|\| canonical_json(metadata)`. |
| `leaf_hash() -> RecordHash` | SHA3-256 of the canonical bytes; this is the Merkle leaf value. |
| `to_dict()` | Safe serialization. **Does not include raw content.** |

### `MerkleTree`

SHA3-256 Merkle tree with RFC 6962-style odd-node promotion.

| Method | Description |
|---|---|
| `add(leaf_hash)` / `add_many(leaves)` | Append leaves. |
| `root() -> str` | Hex Merkle root. Raises `EmptyTreeError` for empty trees. |
| `inclusion_proof(index) -> InclusionProof` | `O(log n)` proof for leaf at `index`. |
| `MerkleTree.verify_inclusion(proof) -> bool` | Static verification, independent of tree state. |

### `InclusionProof`

Frozen dataclass carried from prover to verifier.

| Field | Description |
|---|---|
| `leaf_hash` | Hex of the leaf being proven. |
| `index`, `tree_size` | Position and total size at time of proof. |
| `root` | Hex root the prover claims. |
| `siblings`, `directions` | Up to `⌈log₂(n)⌉` sibling hashes, each with an `'L'`/`'R'` direction flag. |
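
To make the proof size concrete, the arithmetic below bounds the number of siblings and their raw size for a few corpus sizes. It assumes 32-byte SHA3-256 digests and counts only the sibling hashes, not the other proof fields; `proof_size` is an illustrative helper.

```python
HASH_BYTES = 32  # SHA3-256 digest size in bytes


def proof_size(n_records: int) -> tuple:
    """Upper bound on sibling count and raw sibling bytes for a tree with n_records leaves."""
    siblings = (n_records - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    return siblings, siblings * HASH_BYTES


for n in (1_000, 1_000_000, 1_000_000_000):
    siblings, size = proof_size(n)
    print(f"{n:,} records -> at most {siblings} siblings, {size} bytes")
```

A billion-record corpus therefore needs at most 30 sibling hashes, or 960 bytes before hex encoding, per inclusion proof.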

### `TrainingCommitment`

The signed audit artifact.

| Field | Description |
|---|---|
| `commitment_id` | `urn:pqc-td:<uuid>`. |
| `dataset_name`, `dataset_version`, `description` | Human-readable identification. |
| `record_count`, `root` | Cryptographic binding to the tree. |
| `created_at`, `licenses`, `tags`, `extra` | Provenance metadata, all signed. |
| `signer_did`, `algorithm`, `signature`, `public_key`, `signed_at` | ML-DSA signature block (populated by `CommitmentSigner.sign`). |
| `to_json()` / `from_json()` | Network-safe round-trip. |
| `canonical_bytes()` | Deterministic JSON covered by the signature. |
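
A signature can only cover bytes that do not contain the signature itself, so the canonical form has to leave out at least part of the signature block. The sketch below shows one way that could work. It assumes all five signature-block fields are excluded and that the rest is serialized as sorted, compact JSON; the package's actual layout may differ, and the field values are placeholders.

```python
import json

# Assumption: none of the signature-block fields are covered by the signature.
SIGNATURE_FIELDS = {"signer_did", "algorithm", "signature", "public_key", "signed_at"}


def canonical_bytes(commitment: dict) -> bytes:
    """Serialize every field outside the signature block in a deterministic order."""
    body = {k: v for k, v in commitment.items() if k not in SIGNATURE_FIELDS}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


unsigned = {
    "commitment_id": "urn:pqc-td:example",  # placeholder, not a real UUID
    "dataset_name": "model-v1-train",
    "dataset_version": "1.0.0",
    "record_count": 100,
    "root": "ab" * 32,  # placeholder root
}
signed = {**unsigned, "algorithm": "ML-DSA", "signature": "placeholder"}
tampered = {**signed, "record_count": 101}

print(canonical_bytes(signed) == canonical_bytes(unsigned))  # True: signing leaves the covered bytes unchanged
print(canonical_bytes(tampered) == canonical_bytes(signed))  # False: an edited field changes the signed bytes
```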

### `CommitmentBuilder`

Accumulates records and emits an unsigned `TrainingCommitment`.

| Method | Description |
|---|---|
| `CommitmentBuilder(dataset_name, dataset_version)` | Start a build. |
| `add_record(record)` / `add_records(records)` | Queue records. |
| `add_leaf_hash_hex(hex)` | Add a leaf directly when the caller has already hashed the data. |
| `build(description="") -> TrainingCommitment` | Produce unsigned commitment. |
| `.tree` | Underlying `MerkleTree`; use it to generate inclusion proofs later. |

### `CommitmentSigner`

ML-DSA sign + verify.

| Method | Description |
|---|---|
| `CommitmentSigner(identity)` | Wrap a QuantumShield `AgentIdentity`. |
| `sign(commitment) -> TrainingCommitment` | Populate signature fields. |
| `CommitmentSigner.verify(commitment) -> bool` | Static; verifies the signature against the embedded public key. |

### `CommitmentVerifier` + `VerificationResult`

End-to-end check of (record, proof, commitment).

| Method | Description |
|---|---|
| `CommitmentVerifier.verify(record, proof, commitment)` | Returns a `VerificationResult`. |
| `CommitmentVerifier.verify_or_raise(...)` | Raises `CommitmentVerificationError` on any failure. |

`VerificationResult` fields: `signature_valid`, `proof_valid`, `leaf_matches_record`, `commitment_id`, `record_leaf_hash`, `claimed_root`, `error`, and the `fully_verified` property.

### Exceptions

| Exception | When |
|---|---|
| `TrainingDataError` | Base class. |
| `EmptyTreeError` | Tree operation requires at least one leaf. |
| `InclusionProofError` | Malformed or unverifiable proof. |
| `CommitmentVerificationError` | Raised by `verify_or_raise` on failure. |
| `IndexOutOfRangeError` | Leaf index outside `[0, size)`. |

## Why PQC for Training Data

Training data provenance is a 15-to-20-year commitment:

- Regulatory discovery can ask about training data *decades* after the model was released.
- Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
- Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.

A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer (CRQC) exists. An adversary with a CRQC can retroactively forge a "signed commitment" over any root of their choosing and then produce matching "inclusion proofs" against it, collapsing the entire audit trail.

ML-DSA (FIPS 204) is built on lattice problems that Shor's algorithm does not solve, so commitments minted today are designed to remain verifiable through the post-quantum transition.

## Examples

See the `examples/` directory:

- **`commit_corpus.py`**: build a signed commitment over a small training corpus.
- **`prove_inclusion.py`**: produce and verify an `O(log n)` inclusion proof.
- **`detect_false_inclusion_claim.py`**: demonstrate rejection of a forged "my data was in training" claim.

Run them:

```bash
python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing**: sister tool for signing RAG corpus chunks with ML-DSA.
- **PQC Content Provenance**: signed manifests for content authenticity.
- **PQC MCP Transport**: signed JSON-RPC transport for Model Context Protocol.

## License

Apache License 2.0. See [LICENSE](LICENSE).