# PQC Training Data Transparency

**Cryptographic transparency for AI training data.** Build an SHA3-256 Merkle tree over every record in your training set, sign the root with **ML-DSA** (FIPS 204), and publish it. Anyone who holds a single document can later receive an `O(log n)` inclusion proof showing that the record was in the training set, without revealing any of the other records. The audit trail survives the transition to post-quantum cryptography, so commitments made today remain verifiable in 2035 and beyond.

## The Problem

AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: *what exactly was used to train this model?* Model creators today have no cryptographic answer.

- "Prove this document was NOT in your training set": requires revealing the entire training set (impossible for proprietary or licensed data).
- "Prove your model wasn't trained on PII": requires deleting, then proving a negative.
- "Which records were used for fine-tune v2 vs v3?": no binding commitment exists, so claims are unfalsifiable.

The few audit trails that do exist are typically signed with RSA or ECDSA. A cryptographically relevant quantum computer (CRQC) would break those signatures, letting an attacker forge the entire audit chain retroactively. Training data provenance has a 15-20 year shelf life; the cryptography under it must survive that long.

## The Solution

Commit once, prove selectively:

- Hash every record into a leaf: `SHA3-256(SHA3-256(content) || "|" || canonical_json(metadata))`.
- Build an SHA3-256 Merkle tree over the leaves.
- Wrap the root in a `TrainingCommitment` (dataset name, version, record count, timestamps, licenses, tags).
- Sign the canonical commitment with **ML-DSA** at model-release time.
- Publish the commitment anywhere: on-chain, in a transparency log, on quantamrkt.com, or stapled to the model card.

Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (at most `⌈log₂(n)⌉` sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.

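The leaf-hashing step can be sketched with the standard library alone. The layout follows the `canonical_bytes()` description in the API Reference (`SHA3-256(content)`, a `"|"` separator, then canonical JSON of the metadata). The exact JSON canonicalization and the use of the raw (not hex) content digest are assumptions made for illustration, and `canonical_json` and `leaf_hash` here are hypothetical helpers, not the package's functions.

```python
import hashlib
import json


def canonical_json(metadata: dict) -> bytes:
    # Assumption: sorted keys plus compact separators give a deterministic
    # encoding, so logically equal metadata always hashes the same way.
    return json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode("utf-8")


def leaf_hash(content: bytes, metadata: dict) -> str:
    # Hash the content first so the preimage starts with a fixed-size digest,
    # then bind the metadata to it behind a "|" separator.
    content_digest = hashlib.sha3_256(content).digest()  # assumption: raw bytes, not hex
    preimage = content_digest + b"|" + canonical_json(metadata)
    return hashlib.sha3_256(preimage).hexdigest()


if __name__ == "__main__":
    # Key order in the metadata dict does not affect the leaf.
    a = leaf_hash(b"example document", {"source": "internal", "id": 1})
    b = leaf_hash(b"example document", {"id": 1, "source": "internal"})
    print(a == b)
```

Because the metadata is part of the preimage, changing any metadata field produces a different leaf, which is why metadata disclosure is all-or-nothing.
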
## Installation

```bash
pip install pqc-training-data-transparency
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Build and sign a commitment

```python
from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, DataRecord,
)

# Create the signing identity for the model creator.
identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

# `your_documents` is a placeholder: any iterable of raw document bytes.
corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))

# Publish commitment.to_json(); this is the public audit artifact.
```

### Prove a single record is in the training set

```python
# Assumed to be exported from the same package as the classes above.
from pqc_training_data import CommitmentVerifier

# Creator side: generate the proof for record 42 from the retained tree.
proof = builder.tree.inclusion_proof(index=42)

# Auditor side: needs only the record, the proof, and the public commitment.
result = CommitmentVerifier.verify(corpus[42], proof, commitment)

assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True
```

### Detect a forged inclusion claim

```python
# Continues from the snippets above (`builder`, `commitment`, and the imports).
forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0)  # hijack a real slot

result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified  # rejected
# result.error: "record leaf_hash ... does not match proof ..."
```

## Architecture

```
  Training Pipeline (creator)                           Audit Path (third party)
  --------------------------                            ------------------------
                                                                    |
  records = [doc1, doc2, ..., docN]                                 |
    |                                                               |
    |  1. leaf_hash = SHA3-256(                                     |
    |       SHA3-256(content) || "|" || canonical_json(metadata))   |
    v                                                               |
  [leaf_1, leaf_2, ..., leaf_N]                                     |
    |                                                               |
    |  2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)              |
    v                                                               |
   ROOT                                                             |
    |                                                               |
    |  3. wrap in TrainingCommitment                                |
    |       (id, dataset, version, created_at, ...)                 |
    |                                                               |
    |  4. ML-DSA.sign(canonical(commitment))                        |
    v                                                               |
  SIGNED COMMITMENT --> published (on-chain, log, model card)       |
                                                                    |
                                                                    |  5. request
                                                                    |     inclusion
                                                                    |     proof for
                                                                    |     record R
                                                                    v
                                               InclusionProof (leaf, siblings, dirs, root)
                                                                    |
                                                                    |  6. verify:
                                                                    |     ML-DSA(commitment) OK?
                                                                    |     leaf_hash(R) == proof.leaf?
                                                                    |     walk siblings -> root?
                                                                    |     proof.root == commitment.root?
                                                                    v
                                                           VerificationResult
                                                           (fully_verified T/F)
```

## Threat Model

| Threat | Handled | Notes |
|---|---|---|
| **Forged inclusion claim** (attacker claims doc X is in the set) | Yes | Verifier recomputes `leaf_hash(X)` and compares it to the proof's leaf; either that comparison fails or the walk does not reach the signed root. |
| **Tampered commitment** (attacker edits `dataset_name`, `record_count`, or `root`) | Yes | The canonical bytes change, so the ML-DSA signature no longer verifies. |
| **Tampered inclusion proof** (attacker flips a sibling hash) | Yes | The recomputed root diverges from the signed root. |
| **Quantum forgery in 2035+** (CRQC forges the audit trail retroactively) | Yes | ML-DSA is the FIPS 204 post-quantum signature standard; no known quantum attack, including Shor's and Grover's algorithms, breaks it. |
| **Proving NON-inclusion** (prove a record was *not* in training) | No | Requires a sorted-tree / Verkle construction. Future work. |
| **Revealing private training data** | Not revealed (by design) | The commitment contains only the root; each proof reveals at most `⌈log₂(n)⌉` sibling hashes, never other records. The creator decides what to reveal. |
| **Selective disclosure of metadata fields** | No | A record's metadata sits entirely inside its leaf, so hashing over `metadata` is all-or-nothing. If you need partial reveals, carve the relevant fields out into separate parts of the leaf. |
| **Re-publication of old commitment** (attacker re-uses prior root for a new model release) | Partial | `commitment_id`, `dataset_version`, and `created_at` are all signed; enforce freshness by policy. |

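The "enforce freshness by policy" note in the last row is left to the verifier. One possible policy, sketched below, rejects a commitment whose `commitment_id` has been seen before or whose `created_at` falls outside an accepted window. The function name, the ISO 8601 timestamp format, and the in-memory `seen_ids` set are all assumptions for illustration; the check only makes sense after the ML-DSA signature has been verified, since the fields are otherwise untrusted.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def accept_commitment(
    commitment_id: str,
    created_at: str,
    seen_ids: set,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Hypothetical freshness policy for an already signature-verified commitment."""
    now = now or datetime.now(timezone.utc)
    if commitment_id in seen_ids:
        return False  # same commitment presented again
    created = datetime.fromisoformat(created_at)  # assumption: ISO 8601 string
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    if created > now or now - created > max_age:
        return False  # dated in the future, or too old for this release
    seen_ids.add(commitment_id)
    return True
```

A production verifier would persist the seen identifiers and choose `max_age` to match its release process.
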
## API Reference

### `DataRecord`

Frozen dataclass. One training example.

| Field / Method | Description |
|---|---|
| `content: bytes` | Raw record payload (doc text, image bytes, serialized row, ...). |
| `metadata: dict` | Arbitrary metadata; participates in the leaf hash. |
| `canonical_bytes()` | Deterministic `SHA3-256(content) \|\| "\|" \|\| canonical_json(metadata)`. |
| `leaf_hash() -> RecordHash` | SHA3-256 of the canonical bytes; this is the Merkle leaf value. |
| `to_dict()` | Safe serialization. **Does not include raw content.** |

### `MerkleTree`

SHA3-256 Merkle tree with RFC 6962-style odd-node promotion.

| Method | Description |
|---|---|
| `add(leaf_hash)` / `add_many(leaves)` | Append leaves. |
| `root() -> str` | Hex Merkle root. Raises `EmptyTreeError` for empty trees. |
| `inclusion_proof(index) -> InclusionProof` | `O(log n)` proof for leaf at `index`. |
| `MerkleTree.verify_inclusion(proof) -> bool` | Static verification, independent of tree state. |

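To make the `O(log n)` proof size concrete, the worst-case number of sibling hashes is `⌈log₂(n)⌉`, and each SHA3-256 digest is 32 bytes. The helpers below are hypothetical and only do the arithmetic; they count raw digest bytes and ignore any serialization overhead (hex encoding, direction flags, JSON framing).

```python
def max_proof_siblings(tree_size: int) -> int:
    """Upper bound on sibling hashes in one proof: ceil(log2(tree_size))."""
    if tree_size < 1:
        raise ValueError("tree_size must be at least 1")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) without float error.
    return (tree_size - 1).bit_length()


def max_proof_bytes(tree_size: int, digest_bytes: int = 32) -> int:
    """Raw digest bytes in a worst-case proof (SHA3-256 digests are 32 bytes)."""
    return max_proof_siblings(tree_size) * digest_bytes


if __name__ == "__main__":
    for n in (1, 1_000, 1_000_000, 1_000_000_000):
        print(n, max_proof_siblings(n), max_proof_bytes(n))
```

A training set of one billion records therefore needs at most 30 sibling hashes, or 960 bytes of digests, per proof.
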
### `InclusionProof`

Frozen dataclass carried from prover to verifier.

| Field | Description |
|---|---|
| `leaf_hash` | Hex of the leaf being proven. |
| `index`, `tree_size` | Position and total size at time of proof. |
| `root` | Hex root the prover claims. |
| `siblings`, `directions` | Up to `⌈log₂(n)⌉` sibling hashes plus `'L'`/`'R'` flags. |

### `TrainingCommitment`

The signed audit artifact.

| Field | Description |
|---|---|
| `commitment_id` | `urn:pqc-td:<uuid>`. |
| `dataset_name`, `dataset_version`, `description` | Human-readable identification. |
| `record_count`, `root` | Cryptographic binding to the tree. |
| `created_at`, `licenses`, `tags`, `extra` | Provenance metadata; all signed. |
| `signer_did`, `algorithm`, `signature`, `public_key`, `signed_at` | ML-DSA signature block (populated by `CommitmentSigner.sign`). |
| `to_json()` / `from_json()` | Network-safe round-trip. |
| `canonical_bytes()` | Deterministic JSON covered by the signature. |

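The signature only protects what `canonical_bytes()` covers, so it helps to see why the encoding must be deterministic. The sketch below serializes a commitment as sorted, compact JSON and leaves out the signature block. Which fields the real implementation excludes is not stated in this README, so the `SIGNATURE_FIELDS` set and this `canonical_bytes` function are assumptions for illustration.

```python
import json

# Assumption: the fields populated by signing are not part of the signed bytes.
SIGNATURE_FIELDS = {"signer_did", "algorithm", "signature", "public_key", "signed_at"}


def canonical_bytes(commitment: dict) -> bytes:
    """Deterministic JSON over every field except the signature block."""
    signed_view = {k: v for k, v in commitment.items() if k not in SIGNATURE_FIELDS}
    return json.dumps(signed_view, sort_keys=True, separators=(",", ":")).encode("utf-8")


if __name__ == "__main__":
    unsigned = {
        "commitment_id": "urn:pqc-td:example",  # placeholder identifier
        "dataset_name": "model-v1-train",
        "dataset_version": "1.0.0",
        "record_count": 3,
        "root": "ab" * 32,  # placeholder root
    }
    tampered = dict(unsigned, record_count=4)
    # Any edit to a signed field changes the bytes the signature was made over.
    print(canonical_bytes(unsigned) != canonical_bytes(tampered))
```

Attaching a signature afterwards leaves these bytes unchanged, which is what lets a verifier recompute them from the published commitment.
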
### `CommitmentBuilder`

Accumulates records and emits an unsigned `TrainingCommitment`.

| Method | Description |
|---|---|
| `CommitmentBuilder(dataset_name, dataset_version)` | Start a build. |
| `add_record(record)` / `add_records(records)` | Queue records. |
| `add_leaf_hash_hex(hex)` | Direct-add when caller pre-hashed the data. |
| `build(description="") -> TrainingCommitment` | Produce unsigned commitment. |
| `.tree` | Underlying `MerkleTree`; use it to generate inclusion proofs later. |

### `CommitmentSigner`

ML-DSA sign + verify.

| Method | Description |
|---|---|
| `CommitmentSigner(identity)` | Wrap a QuantumShield `AgentIdentity`. |
| `sign(commitment) -> TrainingCommitment` | Populate signature fields. |
| `CommitmentSigner.verify(commitment) -> bool` | Static; verifies the signature against the embedded public key. |

### `CommitmentVerifier` + `VerificationResult`

End-to-end check of (record, proof, commitment).

| Method | Description |
|---|---|
| `CommitmentVerifier.verify(record, proof, commitment)` | Returns a `VerificationResult`. |
| `CommitmentVerifier.verify_or_raise(...)` | Raises `CommitmentVerificationError` on any failure. |

`VerificationResult` fields: `signature_valid`, `proof_valid`, `leaf_matches_record`, `commitment_id`, `record_leaf_hash`, `claimed_root`, `error`, and the `fully_verified` property.

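How the three flags might combine into `fully_verified` can be sketched with a small dataclass. This is an assumption about the semantics (a plain conjunction, consistent with the Quick Start output where all three flags are `True`), and `ResultSketch` is a hypothetical stand-in with only a subset of the fields, not the package's `VerificationResult`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultSketch:
    signature_valid: bool       # ML-DSA signature over the commitment checks out
    proof_valid: bool           # sibling walk reaches the committed root
    leaf_matches_record: bool   # recomputed leaf hash equals the proof's leaf
    error: Optional[str] = None

    @property
    def fully_verified(self) -> bool:
        # Assumption: every individual check must pass.
        return self.signature_valid and self.proof_valid and self.leaf_matches_record


if __name__ == "__main__":
    forged = ResultSketch(True, True, False, error="record leaf_hash does not match proof")
    print(forged.fully_verified)
```

Keeping the flags separate lets a caller report which check failed instead of a bare pass/fail.
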
### Exceptions

| Exception | When |
|---|---|
| `TrainingDataError` | Base class. |
| `EmptyTreeError` | Tree operation requires at least one leaf. |
| `InclusionProofError` | Malformed or unverifiable proof. |
| `CommitmentVerificationError` | Raised by `verify_or_raise` on failure. |
| `IndexOutOfRangeError` | Leaf index outside `[0, size)`. |

## Why PQC for Training Data

Training data provenance is a 15-to-20-year commitment:

- Regulatory discovery can ask about training data *decades* after the model was released.
- Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
- Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.

A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer (CRQC) exists. An adversary with a CRQC can recover the signing key from the public key and retroactively forge arbitrary "signed commitments" with matching "inclusion proofs", collapsing the entire audit trail.

ML-DSA (FIPS 204) is not broken by Shor's algorithm. Commitments minted today remain verifiable through the post-quantum transition.

## Examples

See the `examples/` directory:

- **`commit_corpus.py`**: build a signed commitment over a small training corpus.
- **`prove_inclusion.py`**: produce and verify an `O(log n)` inclusion proof.
- **`detect_false_inclusion_claim.py`**: demonstrate rejection of a forged "my data was in training" claim.

Run them:

```bash
python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing**: sister tool for signing RAG corpus chunks with ML-DSA.
- **PQC Content Provenance**: signed manifests for content authenticity.
- **PQC MCP Transport**: signed JSON-RPC transport for Model Context Protocol.

## License

Apache License 2.0. See [LICENSE](LICENSE).