# PQC Signed AI Content Provenance

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![ML-DSA](https://img.shields.io/badge/ML--DSA-FIPS%20204-green)
![C2PA-Compatible](https://img.shields.io/badge/C2PA-Compatible-purple)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**C2PA for AI outputs, signed with ML-DSA.** Every piece of AI-generated content (text, image, audio) gets a signed provenance manifest that cryptographically proves *which model* produced it, *when*, *from what prompt*, and *under what licensing terms*. Unlike classical C2PA, signatures use **ML-DSA (FIPS 204)**, so they are designed to survive the quantum transition: audit trails signed today are meant to remain verifiable 20+ years from now, even against a future quantum adversary.

## The Problem

Classical C2PA manifests rely on ECDSA / RSA signatures. A sufficiently large quantum computer running Shor's algorithm breaks both. That means every AI-generated article, diagnostic, or trading recommendation you sign today becomes **retroactively forgeable** once CRQCs (cryptographically-relevant quantum computers) arrive. Industries with long audit horizons (healthcare: 10-30 years, finance: 7+ years, legal discovery: indefinite) cannot rely on a classical signature for provenance.

## The Solution

Every AI output is wrapped in a signed **ContentManifest**:

- **SHA3-256 content hash** binds the manifest to the exact bytes produced.
- **ModelAttribution** names the model, version, and Shield Registry manifest hash.
- **GenerationContext** records prompt hash, parameters, and timestamp.
- **Assertions**: pluggable C2PA-style claims (AI-generated, training summary, usage license).
- **ML-DSA signature** over the canonical digest, by the model's AgentIdentity DID.
- **Provenance chain** links derivations (AI draft -> human edit -> final) so every change has an auditable signer.

At any future date, a verifier recomputes the content hash, re-runs ML-DSA verify on the canonical manifest bytes, and walks the chain. Tampering at any layer is detected.
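
The first of those steps, recomputing the content hash, needs only the standard library. The sketch below assumes the manifest stores the hash as a lowercase hex string; `content_hash` and `content_matches` are hypothetical helper names, not part of the package API, and the ML-DSA check is not shown.

```python
import hashlib
import hmac


def content_hash(content: bytes) -> str:
    """Return the SHA3-256 digest of the content as lowercase hex."""
    return hashlib.sha3_256(content).hexdigest()


def content_matches(content: bytes, claimed_hex: str) -> bool:
    """Compare the recomputed hash with the claimed one in constant time."""
    return hmac.compare_digest(content_hash(content), claimed_hex.lower())


original = b"AI-generated press release about tool #4."
claimed = content_hash(original)  # what a manifest would record at signing time

print(content_matches(original, claimed))                # True
print(content_matches(original + b" (edited)", claimed)) # False
```

Any change to the bytes, even a single appended character, yields a different digest, which is why the hash check alone is enough to detect content tampering.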

## Installation

```bash
pip install pqc-content-provenance
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Sign an AI output

```python
from quantumshield import AgentIdentity
from pqc_content_provenance import (
    AIGeneratedAssertion,
    ContentManifest,
    GenerationContext,
    ManifestSigner,
    ModelAttribution,
    UsageAssertion,
    embed_manifest,
)

identity = AgentIdentity.create("llama-3-signer")
signer = ManifestSigner(identity)

content = b"AI-generated press release about tool #4."

manifest = ContentManifest.create(
    content=content,
    content_type="text/plain",
    model_attribution=ModelAttribution(
        model_did=identity.did,
        model_name="Llama-3-8B-Instruct",
        model_version="1.0",
        registry_url="https://quantamrkt.com/models/meta-llama-Llama-3-8B-Instruct",
    ),
    generation_context=GenerationContext(
        prompt_hash="ab" * 32,
        parameters={"temperature": 0.7},
        generated_at="2026-04-20T12:00:00Z",
    ),
    assertions=[
        AIGeneratedAssertion(model_name="Llama-3-8B-Instruct", model_version="1.0"),
        UsageAssertion(license="cc-by-4.0", commercial_use=True, attribution_required=True),
    ],
)

signed = signer.sign(manifest)
envelope = embed_manifest(content, signed, mode="sidecar")

# Persist envelope alongside the content -- e.g. output.txt + output.txt.c2pa.json
```

### Verify an AI output

```python
from pqc_content_provenance import extract_manifest, ManifestSigner

manifest, content = extract_manifest(envelope, mode="sidecar")
result = ManifestSigner.verify(manifest, content)

if not result.valid:
    raise RuntimeError(f"provenance check failed: {result.error}")

print(f"valid output from {result.signer_did}")
```

## Architecture

```
AI Model                      Publisher                       Consumer / Auditor
--------                      ---------                       ------------------
|                             |                               |
|  1. generate output         |                               |
|                             |                               |
|  2. ContentManifest.create:                                 |
|     - SHA3-256 content hash                                 |
|     - model attribution (from Shield Registry)              |
|     - generation context (prompt, params, time)             |
|     - assertions (AI-generated, usage, training)            |
|                             |                               |
|  3. ManifestSigner.sign:                                    |
|     - canonical JSON -> SHA3-256                            |
|     - ML-DSA signature with AgentIdentity                   |
|                             |                               |
|  4. embed_manifest -------->|  5. store content + sidecar   |
|     (sidecar or inline)     |     in CMS / DB / S3          |
|                             |                               |
                              |  6. deliver envelope -------->|
                              |                               |
                                                              |  7. extract_manifest
                                                              |  8. ManifestSigner.verify:
                                                              |     - recompute content hash
                                                              |     - ML-DSA verify canonical
                                                              |     - walk ProvenanceChain
                                                              |
                                                              |  9. reject on any mismatch
```
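
Step 3 hashes a canonical JSON form of the manifest. One common way to do that is sketched below, assuming canonicalization means sorted keys and no insignificant whitespace; the package's actual `canonical_bytes()` may use a different scheme, and the manifest fields here are placeholders.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically: sorted keys, compact separators, UTF-8."""
    text = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_digest(manifest: dict) -> str:
    """SHA3-256 over the canonical bytes; this digest is what would be signed."""
    return hashlib.sha3_256(canonical_bytes(manifest)).hexdigest()


a = {"manifest_id": "m-1", "content_type": "text/plain", "license": "cc-by-4.0"}
b = {"license": "cc-by-4.0", "content_type": "text/plain", "manifest_id": "m-1"}
tampered = dict(a, license="proprietary")

print(canonical_digest(a) == canonical_digest(b))         # True: key order does not matter
print(canonical_digest(a) == canonical_digest(tampered))  # False: one field changed
```

Because signer and verifier both derive the same bytes from the same fields, a signature over this digest stops verifying as soon as any field is edited.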

## Threat Model

| Threat | Mitigation |
|---|---|
| **Forged attribution** (claim output came from model X when it didn't) | Manifest ML-DSA signature only verifies against model X's AgentIdentity public key. |
| **Content tampering** (text/image modified after signing) | Recomputed SHA3-256 no longer matches `manifest.content_hash`. |
| **Manifest tampering** (edit claimed model/prompt/license) | ML-DSA signature over canonical bytes breaks as soon as any field changes. |
| **Lost chain of custody** (edits with no signer record) | `ProvenanceChain` enforces `previous_manifest_id` links; each link has its own signer. |
| **Re-used signature across outputs** | Signature is over the canonical bytes of this specific manifest, which includes `content_hash` and `manifest_id`. |
| **Unknown / unregistered assertion** | `ASSERTION_REGISTRY` rejects unknown labels with `UnknownAssertionError`. |
| **Quantum adversary (Shor's algorithm)** | ML-DSA (FIPS 204) is not broken by known quantum attacks. |
| **Long audit horizon** (10-30 year retention) | Post-quantum signatures remain verifiable past classical crypto's expiry. |

## Assertions

Pluggable facts attached to a manifest. Each is a dataclass with a `label` that matches a C2PA-style namespace.

### `AIGeneratedAssertion` — `c2pa.ai_generated`

| Field | Description |
|---|---|
| `model_name`, `model_version`, `model_did` | Which model produced the content |
| `generator_type` | `text` / `image` / `audio` / `video` / `multimodal` |
| `human_edited` | Was it post-edited by a human? |
| `generation_params` | Temperature, top_p, seed, etc. |

### `TrainingAssertion` — `c2pa.training`

| Field | Description |
|---|---|
| `dataset_name`, `dataset_root_hash` | Source training set + Merkle root |
| `fine_tune_dataset`, `fine_tune_root_hash` | Optional fine-tune set |
| `pii_filtered`, `copyright_cleared` | Compliance flags |
| `licenses` | SPDX identifiers, e.g. `["cc-by-4.0", "apache-2.0"]` |
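
A Merkle root such as `dataset_root_hash` commits to every record in a dataset with a single digest. The sketch below shows one simple pairwise construction over SHA3-256; it is an illustration under stated assumptions (leaf/node prefixes, odd nodes promoted unchanged) and not necessarily the scheme this package or its registry uses. `merkle_root` is a hypothetical helper name.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def merkle_root(records: list[bytes]) -> str:
    """Return a hex Merkle root over the records, in the order given."""
    if not records:
        return _h(b"").hex()
    # Prefix leaves and inner nodes differently so a leaf can never be
    # confused with the concatenation of two child hashes.
    level = [_h(b"\x00" + record) for record in records]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0].hex()


root = merkle_root([b"record-1", b"record-2", b"record-3"])
print(root)
```

Changing, reordering, or dropping any record changes the root, so the assertion pins the exact training set without embedding it.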

### `UsageAssertion` — `c2pa.usage`

| Field | Description |
|---|---|
| `license` | SPDX identifier or custom string |
| `commercial_use`, `attribution_required` | Rights flags |
| `attribution_text` | Required credit text |
| `jurisdictions` | Country codes where valid |
| `expiry` | ISO-8601 expiry or empty |

Register your own assertion subclass by adding it to `ASSERTION_REGISTRY` with its `label`.
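
The registry pattern can be sketched in isolation as follows. Everything here is a self-contained stand-in: the `Assertion` base class, `WatermarkAssertion`, its `example.watermark` label, and the `register` / `assertion_from_dict` helpers are hypothetical, and the real package's base class and registration mechanics may differ.

```python
from dataclasses import dataclass
from typing import ClassVar


class UnknownAssertionError(Exception):
    """Stand-in for the package's exception of the same name."""


@dataclass
class Assertion:
    label: ClassVar[str] = ""


@dataclass
class WatermarkAssertion(Assertion):
    label: ClassVar[str] = "example.watermark"  # hypothetical label
    scheme: str = ""
    strength: float = 0.0


# Maps each label to the class that knows how to parse it.
ASSERTION_REGISTRY: dict[str, type[Assertion]] = {}


def register(cls: type[Assertion]) -> type[Assertion]:
    ASSERTION_REGISTRY[cls.label] = cls
    return cls


def assertion_from_dict(data: dict) -> Assertion:
    """Look up the class by label and reject labels that were never registered."""
    label = data.get("label")
    cls = ASSERTION_REGISTRY.get(label)
    if cls is None:
        raise UnknownAssertionError(f"unknown assertion label: {label!r}")
    fields = {key: value for key, value in data.items() if key != "label"}
    return cls(**fields)


register(WatermarkAssertion)
parsed = assertion_from_dict({"label": "example.watermark", "scheme": "dct", "strength": 0.5})
print(parsed)
```

Rejecting unregistered labels at parse time means a verifier never silently ignores a claim it does not understand.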

## Chain of Custody

Every derivation (AI draft -> human edit -> legal review) produces a new manifest that references the previous one via `previous_manifest_id`. `ProvenanceChain` verifies:

1. Each manifest's ML-DSA signature.
2. Each manifest's `previous_manifest_id` matches the prior link's `manifest_id`.
3. The whole chain round-trips through `to_dicts()` / `from_dicts()` without loss.

```python
chain = ProvenanceChain()
chain.add(ai_draft_signed)      # signed by model identity
chain.add(human_edit_signed)    # signed by editor identity, prev = ai_draft.manifest_id
chain.add(legal_review_signed)  # signed by legal identity, prev = human_edit.manifest_id

ok, errors = chain.verify_chain()
```
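
The link check itself (rule 2 above) is simple enough to sketch without the library. The `Link` dataclass and `check_links` function below are hypothetical stand-ins, and the sketch assumes the first link carries `previous_manifest_id=None`; signature verification is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    manifest_id: str
    previous_manifest_id: Optional[str]
    signer_did: str


def check_links(links: list[Link]) -> tuple[bool, list[str]]:
    """Return (ok, errors): each link must point at the manifest before it."""
    errors = []
    for i, link in enumerate(links):
        expected = links[i - 1].manifest_id if i > 0 else None
        if link.previous_manifest_id != expected:
            errors.append(
                f"link {i} ({link.manifest_id}): expected "
                f"previous_manifest_id={expected!r}, got {link.previous_manifest_id!r}"
            )
    return (not errors, errors)


draft = Link("m-1", None, "did:example:model")
edit = Link("m-2", "m-1", "did:example:editor")
review = Link("m-3", "m-1", "did:example:legal")  # points at the wrong parent

print(check_links([draft, edit]))
print(check_links([draft, edit, review]))
```

Collecting every error rather than stopping at the first mirrors the `(ok, errors)` shape returned by `verify_chain()`.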

## API Reference

### `ContentManifest`

| Method | Description |
|---|---|
| `ContentManifest.create(content, content_type, attribution, context, assertions=..., previous_manifest_id=...)` | Build an unsigned manifest |
| `ContentManifest.compute_content_hash(bytes)` | Static SHA3-256 helper |
| `canonical_bytes()` | Deterministic bytes used for signing |
| `to_dict()` / `to_json()` / `from_dict()` / `from_json()` | JSON-safe round-trip |

### `ModelAttribution` / `GenerationContext`

Plain dataclasses holding model identity + generation context. Fully JSON-round-trippable.

### `ManifestSigner`

| Method | Description |
|---|---|
| `ManifestSigner(identity)` | Bind a signer to an `AgentIdentity` |
| `sign(manifest)` | In-place sign; returns manifest |
| `sign_and_raise_on_mismatch(manifest, content)` | Defensive: re-check content hash before signing |
| `ManifestSigner.verify(manifest, content=None)` | Static; returns `VerificationResult` |

### `VerificationResult`

Frozen dataclass. Fields: `valid`, `manifest_id`, `signer_did`, `algorithm`, `content_hash_match`, `signature_match`, `error`.
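
As a rough picture of what such a frozen result looks like, here is a stand-in with the same field names. The field types, the `None` default for `error`, and the `"ML-DSA-65"` algorithm string are assumptions for illustration; the package's real definition may differ.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class VerificationResult:
    valid: bool
    manifest_id: str
    signer_did: str
    algorithm: str
    content_hash_match: bool
    signature_match: bool
    error: Optional[str] = None


result = VerificationResult(
    valid=False,
    manifest_id="m-1",
    signer_did="did:example:signer",
    algorithm="ML-DSA-65",  # placeholder parameter-set name
    content_hash_match=False,
    signature_match=True,
    error="content hash mismatch",
)

# A frozen dataclass rejects attribute assignment, so a failed check
# cannot be flipped to "valid" after the fact.
try:
    result.valid = True
except FrozenInstanceError:
    print("result is immutable")
```

Keeping `content_hash_match` and `signature_match` separate lets a caller tell edited content apart from a forged or corrupted signature.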

### `ProvenanceChain` / `ProvenanceLink`

| Method | Description |
|---|---|
| `add(manifest)` | Append link; raises `ChainBrokenError` on bad `previous_manifest_id` |
| `verify_chain()` | Returns `(ok, errors)`; verifies every signature and every link |
| `to_dicts()` / `from_dicts(items)` | JSON-safe round-trip |

### `embed_manifest` / `extract_manifest`

| Mode | Description |
|---|---|
| `sidecar` | JSON envelope containing manifest + base64 content. Save to `.c2pa.json`. |
| `text-header` | Inline marker block prepended to text content. |
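
To make the `sidecar` mode concrete, the sketch below builds and reads a JSON envelope holding a manifest plus base64-encoded content. The envelope keys (`manifest`, `content_b64`) and the helper names `make_sidecar` / `read_sidecar` are hypothetical; the actual layout produced by `embed_manifest` may differ.

```python
import base64
import json


def make_sidecar(content: bytes, manifest: dict) -> str:
    """Wrap the manifest and base64-encoded content in one JSON document."""
    envelope = {
        "manifest": manifest,
        "content_b64": base64.b64encode(content).decode("ascii"),
    }
    return json.dumps(envelope, indent=2, sort_keys=True)


def read_sidecar(envelope_json: str) -> tuple[dict, bytes]:
    """Inverse of make_sidecar: return (manifest, content bytes)."""
    envelope = json.loads(envelope_json)
    content = base64.b64decode(envelope["content_b64"], validate=True)
    return envelope["manifest"], content


content = b"AI-generated press release about tool #4."
manifest = {"manifest_id": "m-1", "content_type": "text/plain"}

sidecar = make_sidecar(content, manifest)
restored_manifest, restored_content = read_sidecar(sidecar)
print(restored_content == content and restored_manifest == manifest)  # True
```

Base64 keeps arbitrary bytes (images, audio) safe inside JSON, at the cost of roughly a one-third size increase over the raw content.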

### Exceptions

| Exception | When |
|---|---|
| `ProvenanceError` | Base class |
| `InvalidManifestError` | Malformed manifest / missing fields / bad JSON |
| `SignatureVerificationError` | Base for signature check failures |
| `ContentHashMismatchError` | Content bytes don't match manifest's claimed hash |
| `ChainBrokenError` | Provenance chain link mismatch |
| `UnknownAssertionError` | Assertion label not in `ASSERTION_REGISTRY` |

## Examples

See the `examples/` directory:

- **`sign_llm_output.py`**: end-to-end flow in which an agent signs AI text, embeds it into a sidecar, extracts it, and verifies it.
- **`detect_tampered_output.py`**: shows that modifying the content bytes after signing is detected.
- **`provenance_chain.py`**: AI draft -> human-edited derivation; each link signed by a different identity.

Run them:

```bash
python examples/sign_llm_output.py
python examples/detect_tampered_output.py
python examples/provenance_chain.py
```

## Why PQC Matters for Provenance

Provenance is fundamentally an **audit-trail** technology: its whole value lies in being verifiable *later*. "Later" means decades for healthcare, years for financial audits, and possibly forever for legal discovery. Classical signatures face a retroactive-forgery risk analogous to **Harvest-Now-Decrypt-Later (HNDL)**: an adversary who collects today's signed outputs and public keys could, once quantum-capable, recover the signing keys and produce fake manifests that are indistinguishable from ones signed in the past. ML-DSA (FIPS 204) is believed to resist this attack. Signing AI outputs with PQC today is how tomorrow's auditors can keep trusting yesterday's provenance.

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing**: sister tool for signing RAG pipeline chunks with ML-DSA.
- **PQC MCP Transport**: sister tool for PQC-secured Model Context Protocol transports.

## License

Apache License 2.0. See [LICENSE](LICENSE).
