# PQC Signed AI Content Provenance

**C2PA for AI outputs, signed with ML-DSA.** Every piece of AI-generated content (text, image, audio) gets a signed provenance manifest that cryptographically attests *which model* produced it, *when*, *from what prompt*, and *under what licensing terms*. Unlike classical C2PA, signatures use **ML-DSA (FIPS 204)**, so they are designed to survive the quantum transition: audit trails signed today should remain verifiable 20+ years from now, even against a future quantum adversary.

## The Problem

Classical C2PA manifests rely on ECDSA / RSA signatures. A sufficiently large quantum computer running Shor's algorithm breaks both. That means every AI-generated article, diagnostic, or trading recommendation you sign today becomes **retroactively forgeable** once CRQCs (cryptographically-relevant quantum computers) arrive. Industries with long audit horizons (healthcare: 10-30 years, finance: 7+ years, legal discovery: indefinite) cannot rely on a classical signature for provenance.

## The Solution

Every AI output is wrapped in a signed **ContentManifest**:

- SHA3-256 content hash binds the manifest to the exact bytes produced.
- **ModelAttribution** names the model, version, and Shield Registry manifest hash.
- **GenerationContext** records prompt hash, parameters, and timestamp.
- **Assertions**: pluggable C2PA-style claims (AI-generated, training summary, usage license).
- **ML-DSA signature** over the canonical digest, by the model's AgentIdentity DID.
- **Provenance chain** links derivations (AI draft -> human edit -> final) so every change has an auditable signer.

At any future date, a verifier recomputes the content hash, re-runs ML-DSA verify on the canonical manifest bytes, and walks the chain. Tampering at any layer is detected.

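To make the content-hash and canonical-digest bullets concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation: the `content_hash` and `canonical_bytes` helpers, the sorted-key JSON encoding, and the three manifest fields are assumptions chosen for illustration, and the ML-DSA signing step is left out entirely.

```python
import hashlib
import json


def content_hash(content: bytes) -> str:
    """Hex SHA3-256 digest of the exact content bytes."""
    return hashlib.sha3_256(content).hexdigest()


def canonical_bytes(manifest: dict) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace.

    Assumption: the real library may canonicalize differently.
    """
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


content = b"AI-generated press release about tool #4."
manifest = {
    "content_hash": content_hash(content),
    "content_type": "text/plain",
    "model_name": "Llama-3-8B-Instruct",
}

# The digest a signer would sign (the signature itself is omitted here).
digest = hashlib.sha3_256(canonical_bytes(manifest)).hexdigest()

# Changing the content breaks the hash binding.
tampered_content_ok = content_hash(content + b"!") == manifest["content_hash"]

# Changing any manifest field changes the digest the signature covers.
tampered_manifest = {**manifest, "model_name": "Some-Other-Model"}
tampered_digest = hashlib.sha3_256(canonical_bytes(tampered_manifest)).hexdigest()

print(tampered_content_ok)        # False
print(tampered_digest == digest)  # False
```

Sorting keys and fixing the separators makes the encoding independent of dict insertion order, which is what lets a verifier rebuild the same bytes long after signing.
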
## Installation

```bash
pip install pqc-content-provenance
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Sign an AI output

```python
from quantumshield import AgentIdentity
from pqc_content_provenance import (
    AIGeneratedAssertion,
    ContentManifest,
    GenerationContext,
    ManifestSigner,
    ModelAttribution,
    UsageAssertion,
    embed_manifest,
)

identity = AgentIdentity.create("llama-3-signer")
signer = ManifestSigner(identity)

content = b"AI-generated press release about tool #4."

manifest = ContentManifest.create(
    content=content,
    content_type="text/plain",
    model_attribution=ModelAttribution(
        model_did=identity.did,
        model_name="Llama-3-8B-Instruct",
        model_version="1.0",
        registry_url="https://quantamrkt.com/models/meta-llama-Llama-3-8B-Instruct",
    ),
    generation_context=GenerationContext(
        prompt_hash="ab" * 32,
        parameters={"temperature": 0.7},
        generated_at="2026-04-20T12:00:00Z",
    ),
    assertions=[
        AIGeneratedAssertion(model_name="Llama-3-8B-Instruct", model_version="1.0"),
        UsageAssertion(license="cc-by-4.0", commercial_use=True, attribution_required=True),
    ],
)

signed = signer.sign(manifest)
envelope = embed_manifest(content, signed, mode="sidecar")

# Persist envelope alongside the content -- e.g. output.txt + output.txt.c2pa.json
```

### Verify an AI output

```python
from pqc_content_provenance import extract_manifest, ManifestSigner

manifest, content = extract_manifest(envelope, mode="sidecar")
result = ManifestSigner.verify(manifest, content)

if not result.valid:
    raise RuntimeError(f"provenance check failed: {result.error}")

print(f"valid output from {result.signer_did}")
```

## Architecture

```
AI Model                    Publisher                      Consumer / Auditor
--------                    ---------                      ------------------
   |                            |                                   |
   |  1. generate output        |                                   |
   |                            |                                   |
   |  2. ContentManifest.create:                                    |
   |     - SHA3-256 content hash                                    |
   |     - model attribution (from Shield Registry)                 |
   |     - generation context (prompt, params, time)                |
   |     - assertions (AI-generated, usage, training)               |
   |                            |                                   |
   |  3. ManifestSigner.sign:                                       |
   |     - canonical JSON -> SHA3-256                               |
   |     - ML-DSA signature with AgentIdentity                      |
   |                            |                                   |
   |  4. embed_manifest ------->|  5. store content + sidecar       |
   |     (sidecar or inline)    |     in CMS / DB / S3              |
   |                            |                                   |
                                |  6. deliver envelope ------------>|
                                |                                   |
                                                                    |  7. extract_manifest
                                                                    |  8. ManifestSigner.verify:
                                                                    |     - recompute content hash
                                                                    |     - ML-DSA verify canonical
                                                                    |     - walk ProvenanceChain
                                                                    |
                                                                    |  9. reject on any mismatch
```

## Threat Model

| Threat | Mitigation |
|---|---|
| **Forged attribution** (claim output came from model X when it didn't) | Manifest ML-DSA signature only verifies against model X's AgentIdentity public key. |
| **Content tampering** (text/image modified after signing) | Recomputed SHA3-256 no longer matches `manifest.content_hash`. |
| **Manifest tampering** (edit claimed model/prompt/license) | ML-DSA signature over canonical bytes breaks as soon as any field changes. |
| **Lost chain of custody** (edits with no signer record) | `ProvenanceChain` enforces `previous_manifest_id` links; each link has its own signer. |
| **Re-used signature across outputs** | Signature is over the canonical bytes of this specific manifest, which includes `content_hash` and `manifest_id`. |
| **Unknown / unregistered assertion** | `ASSERTION_REGISTRY` rejects unknown labels with `UnknownAssertionError`. |
| **Quantum adversary (Shor's algorithm)** | ML-DSA (FIPS 204) is not broken by known quantum attacks. |
| **Long audit horizon** (10-30 year retention) | Post-quantum signatures remain verifiable past classical crypto's expiry. |

## Assertions

Pluggable facts attached to a manifest. Each is a dataclass with a `label` that matches a C2PA-style namespace.

### `AIGeneratedAssertion` (`c2pa.ai_generated`)

| Field | Description |
|---|---|
| `model_name`, `model_version`, `model_did` | Which model produced the content |
| `generator_type` | `text` / `image` / `audio` / `video` / `multimodal` |
| `human_edited` | Was it post-edited by a human? |
| `generation_params` | Temperature, top_p, seed, etc. |

### `TrainingAssertion` (`c2pa.training`)

| Field | Description |
|---|---|
| `dataset_name`, `dataset_root_hash` | Source training set + Merkle root |
| `fine_tune_dataset`, `fine_tune_root_hash` | Optional fine-tune set |
| `pii_filtered`, `copyright_cleared` | Compliance flags |
| `licenses` | SPDX identifiers, e.g. `["cc-by-4.0", "apache-2.0"]` |

### `UsageAssertion` (`c2pa.usage`)

| Field | Description |
|---|---|
| `license` | SPDX identifier or custom string |
| `commercial_use`, `attribution_required` | Rights flags |
| `attribution_text` | Required credit text |
| `jurisdictions` | Country codes where valid |
| `expiry` | ISO-8601 expiry or empty |

Register your own assertion subclass by adding it to `ASSERTION_REGISTRY` with its `label`.

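The label-keyed registry pattern can be sketched as follows. This is a self-contained stand-in, not the library's code: `REGISTRY`, `register`, `parse_assertion`, and the `ReviewAssertion` class with its `example.review` label are all hypothetical names, and the local `UnknownAssertionError` only mimics the exception listed under Exceptions.

```python
from dataclasses import dataclass
from typing import ClassVar


class UnknownAssertionError(Exception):
    """Stand-in for the library exception of the same name."""


# Stand-in registry: maps an assertion label to the class that parses it.
REGISTRY: dict[str, type] = {}


def register(cls):
    """Class decorator that files the class under its `label`."""
    REGISTRY[cls.label] = cls
    return cls


@register
@dataclass
class ReviewAssertion:
    # Hypothetical custom assertion, not part of the library.
    label: ClassVar[str] = "example.review"
    reviewer: str
    approved: bool


def parse_assertion(payload: dict):
    """Build the registered class for payload['label'], or reject it."""
    label = payload.get("label")
    cls = REGISTRY.get(label)
    if cls is None:
        raise UnknownAssertionError(f"unregistered assertion label: {label!r}")
    fields = {key: value for key, value in payload.items() if key != "label"}
    return cls(**fields)


parsed = parse_assertion(
    {"label": "example.review", "reviewer": "alice", "approved": True}
)
print(parsed.reviewer, parsed.approved)
```

Rejecting unknown labels at parse time, rather than silently ignoring them, means a verifier never accepts a claim it cannot interpret.
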
## Chain of Custody

Every derivation (AI draft -> human edit -> legal review) produces a new manifest that references the previous one via `previous_manifest_id`. The `ProvenanceChain` verifies:

1. Each manifest's ML-DSA signature.
2. Each manifest's `previous_manifest_id` matches the prior link's `manifest_id`.
3. The whole chain round-trips through `to_dicts()` / `from_dicts()` without loss.

```python
# Assumes ProvenanceChain is exported at the package top level,
# like the names imported in the Quick Start.
from pqc_content_provenance import ProvenanceChain

# Each *_signed value is a manifest already signed by its own identity.
chain = ProvenanceChain()
chain.add(ai_draft_signed)      # signed by model identity
chain.add(human_edit_signed)    # signed by editor identity, prev = ai_draft.manifest_id
chain.add(legal_review_signed)  # signed by legal identity, prev = human_edit.manifest_id

ok, errors = chain.verify_chain()
```

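The back-reference check (item 2 above) can be illustrated with a small stand-in. The `Link` dataclass and `check_links` function are hypothetical, signature verification is deliberately omitted, and the convention that the first link carries `previous_manifest_id=None` is an assumption rather than documented library behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    # Stand-in for a signed manifest; only the fields the link check needs.
    manifest_id: str
    previous_manifest_id: Optional[str]
    signer_did: str


def check_links(links: list[Link]) -> tuple[bool, list[str]]:
    """Return (ok, errors) for the back-reference check only."""
    errors = []
    for i, link in enumerate(links):
        # Assumption: the first link has no predecessor.
        expected = links[i - 1].manifest_id if i > 0 else None
        if link.previous_manifest_id != expected:
            errors.append(
                f"link {i} ({link.manifest_id}): previous_manifest_id "
                f"{link.previous_manifest_id!r} != {expected!r}"
            )
    return (not errors, errors)


draft = Link("m-1", None, "did:example:model")
edit = Link("m-2", "m-1", "did:example:editor")
review = Link("m-3", "m-1", "did:example:legal")  # wrong: skips the edit

print(check_links([draft, edit]))             # (True, [])
print(check_links([draft, edit, review])[0])  # False
```

Collecting every error instead of stopping at the first one mirrors the `(ok, errors)` shape returned by `verify_chain()` and gives an auditor the full list of broken links in one pass.
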
## API Reference

### `ContentManifest`

| Method | Description |
|---|---|
| `ContentManifest.create(content, content_type, model_attribution, generation_context, assertions=..., previous_manifest_id=...)` | Build an unsigned manifest |
| `ContentManifest.compute_content_hash(bytes)` | Static SHA3-256 helper |
| `canonical_bytes()` | Deterministic bytes used for signing |
| `to_dict()` / `to_json()` / `from_dict()` / `from_json()` | JSON-safe round-trip |

### `ModelAttribution` / `GenerationContext`

Plain dataclasses holding model identity + generation context. Fully JSON-round-trippable.

### `ManifestSigner`

| Method | Description |
|---|---|
| `ManifestSigner(identity)` | Bind a signer to an `AgentIdentity` |
| `sign(manifest)` | In-place sign; returns manifest |
| `sign_and_raise_on_mismatch(manifest, content)` | Defensive: re-check content hash before signing |
| `ManifestSigner.verify(manifest, content=None)` | Static; returns `VerificationResult` |

### `VerificationResult`

Frozen dataclass. Fields: `valid`, `manifest_id`, `signer_did`, `algorithm`, `content_hash_match`, `signature_match`, `error`.

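One way a caller might use these fields is to turn a failed result into a specific reason. The sketch below runs against a local stand-in: the `Result` dataclass mirrors the field names listed above, but its types, the `failure_reason` helper, and the `"ML-DSA-65"` algorithm string are assumptions, not the library's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    # Stand-in mirroring the documented field names; the types are assumed.
    valid: bool
    manifest_id: str
    signer_did: str
    algorithm: str
    content_hash_match: bool
    signature_match: bool
    error: Optional[str] = None


def failure_reason(result: Result) -> Optional[str]:
    """Map a result onto a short reason string, or None when it is valid."""
    if result.valid:
        return None
    if not result.content_hash_match:
        return "content was modified after signing"
    if not result.signature_match:
        return "manifest was modified or signed by a different key"
    return result.error or "unknown verification failure"


good = Result(True, "m-1", "did:example:model", "ML-DSA-65", True, True)
bad = Result(False, "m-1", "did:example:model", "ML-DSA-65", False, True, "hash mismatch")

print(failure_reason(good))  # None
print(failure_reason(bad))   # content was modified after signing
```

Checking `content_hash_match` before `signature_match` separates "the bytes changed" from "the manifest changed", two cases an audit log usually wants to record differently.
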
### `ProvenanceChain` / `ProvenanceLink`

| Method | Description |
|---|---|
| `add(manifest)` | Append link; raises `ChainBrokenError` on bad `previous_manifest_id` |
| `verify_chain()` | Returns `(ok, errors)`; verifies every signature and every link |
| `to_dicts()` / `from_dicts(items)` | JSON-safe round-trip |

### `embed_manifest` / `extract_manifest`

| Mode | Description |
|---|---|
| `sidecar` | JSON envelope containing manifest + base64 content. Save to `.c2pa.json`. |
| `text-header` | Inline marker block prepended to text content. |

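The idea behind the `sidecar` mode, a JSON document carrying the manifest next to base64-encoded content, can be sketched with the standard library. The `to_sidecar` / `from_sidecar` helpers and the `manifest` / `content_b64` key names are hypothetical; the envelope layout produced by `embed_manifest` may differ.

```python
import base64
import json


def to_sidecar(content: bytes, manifest: dict) -> str:
    """Pack a manifest and base64-encoded content into one JSON document."""
    return json.dumps(
        {
            "manifest": manifest,
            "content_b64": base64.b64encode(content).decode("ascii"),
        },
        sort_keys=True,
    )


def from_sidecar(envelope: str) -> tuple[dict, bytes]:
    """Inverse of to_sidecar: recover the manifest dict and the raw bytes."""
    doc = json.loads(envelope)
    return doc["manifest"], base64.b64decode(doc["content_b64"])


raw = b"\x89PNG\r\n not valid UTF-8: \xff\xfe"
sidecar = to_sidecar(raw, {"manifest_id": "m-1", "content_type": "image/png"})
manifest_out, content_out = from_sidecar(sidecar)

print(content_out == raw)  # True
```

Base64 keeps arbitrary bytes (images, audio) intact inside JSON, which is why a sidecar works for any content type while an inline header only suits text.
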
### Exceptions

| Exception | When |
|---|---|
| `ProvenanceError` | Base class |
| `InvalidManifestError` | Malformed manifest / missing fields / bad JSON |
| `SignatureVerificationError` | Base for signature check failures |
| `ContentHashMismatchError` | Content bytes don't match manifest's claimed hash |
| `ChainBrokenError` | Provenance chain link mismatch |
| `UnknownAssertionError` | Assertion label not in `ASSERTION_REGISTRY` |

## Examples

See the `examples/` directory:

- **`sign_llm_output.py`**: end-to-end flow in which an agent signs AI text, embeds it into a sidecar, extracts it, and verifies it.
- **`detect_tampered_output.py`**: shows that modifying the content bytes after signing is detected.
- **`provenance_chain.py`**: AI draft -> human-edited derivation; each link signed by a different identity.

Run them:

```bash
python examples/sign_llm_output.py
python examples/detect_tampered_output.py
python examples/provenance_chain.py
```

## Why PQC Matters for Provenance

Provenance is fundamentally an **audit-trail** technology: its whole value is being verifiable *later*. "Later" for healthcare is decades; for financial audits, years; for legal discovery, possibly forever. Classical signatures are exposed to retroactive forgery, the signature analogue of **Harvest-Now-Decrypt-Later (HNDL)**: an adversary who collects today's public keys and signed outputs can, once quantum-capable, recover the signing keys and produce fake manifests that are indistinguishable from ones signed in the past. ML-DSA (FIPS 204) is believed to resist this attack. Signing AI outputs with PQC today is what lets tomorrow's auditors keep trusting yesterday's provenance.

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing**: sister tool for signing RAG pipeline chunks with ML-DSA.
- **PQC MCP Transport**: sister tool for PQC-secured Model Context Protocol transports.

## License

Apache License 2.0. See [LICENSE](LICENSE).