# PQC AI MBOM

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![ML-DSA-65](https://img.shields.io/badge/ML--DSA--65-FIPS%20204-green)
![SPDX](https://img.shields.io/badge/SPDX--2.3-Compatible-purple)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**Bill of Materials for AI models, signed with post-quantum cryptography.** Enumerate every component that went into a model — base architecture, pretraining data, fine-tuning data, RLHF feedback, tokenizer, quantization method, evaluation benchmarks, safety classifiers — hash each one with SHA3-256, commit the whole set to a Merkle-style root, and sign the root with ML-DSA (FIPS 204). The result is a machine-verifiable provenance artifact whose signature will still be valid when a cryptographically relevant quantum computer arrives in 10-15 years — which matters, because federal AI procurement audits already require 15+ year record retention.

## The Problem

There is no standard, tamper-evident way to declare what an AI model is made of. Model cards are freeform Markdown. Hugging Face repos are a filesystem. SBOM formats like SPDX and CycloneDX were built for software libraries, not datasets, RLHF feedback, or quantization recipes. When a regulator (or your own security team) asks "prove this model wasn't trained on the leaked dataset," the answer is usually an email thread.

Even when providers *do* publish lineage, every signature you see today is RSA or ECDSA — both broken by Shor's algorithm. An AI MBOM signed in 2026 with RSA-2048 will not be verifiable as authentic in 2041. Auditors and procurement officers who keep records for a 15-year retention window will be looking at signatures that a quantum adversary can forge.
## The Solution

`pqc-mbom` is a Python library for producing, signing, and verifying **Model Bill of Materials** documents:

- Each component has a stable id, a type, a SHA3-256 content hash, supplier, author, license, and an arbitrary property bag.
- The MBOM commits to `components_root_hash = SHA3-256(sorted component hashes)`.
- The canonical JSON of the MBOM is signed with **ML-DSA** via `quantumshield`.
- SPDX-2.3 interop: `to_spdx_json` / `from_spdx_json` so the output drops into existing SBOM pipelines.
- Diffing: `diff_mboms(old, new)` surfaces added / removed / changed components — the minimum surface area an auditor needs to sign off on a fine-tune.
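
The root commitment can be reproduced with nothing but the standard library. A minimal sketch of the idea (the library's exact byte layout is not specified here; this version assumes the sorted hex digests are concatenated before hashing):

```python
import hashlib

def component_hash(content: bytes) -> str:
    # Per-component content hash: SHA3-256 hex digest.
    return hashlib.sha3_256(content).hexdigest()

def components_root_hash(hashes: list[str]) -> str:
    # Commit to the whole set by hashing the sorted hex digests.
    # (Illustrative; pqc-mbom's actual concatenation scheme may differ.)
    joined = "".join(sorted(hashes)).encode("ascii")
    return hashlib.sha3_256(joined).hexdigest()

hashes = [component_hash(b"weights"), component_hash(b"tokenizer")]
root = components_root_hash(hashes)

# Sorting makes the commitment order-independent:
assert components_root_hash(list(reversed(hashes))) == root
```

Because the root depends on every component hash, adding, removing, or altering any single component produces a different root.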

## Installation

```bash
pip install pqc-mbom
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

```python
from quantumshield.identity.agent import AgentIdentity
from pqc_mbom import MBOMBuilder, MBOMSigner, MBOMVerifier

identity = AgentIdentity.create("llama-release-pipeline")

mbom = (
    MBOMBuilder("Llama-3-8B-Instruct", "1.0.0", supplier="Meta")
    .set_description("Llama 3 8B instruction-tuned.")
    .add_base_architecture("Llama-3", version="3.0", content_hash="a" * 64)
    .add_tokenizer("llama3-tokenizer", content_hash="b" * 64)
    .add_training_data("pretraining-mix", content_hash="c" * 64, content_size=15 * 10**12)
    .add_fine_tuning_data("instruct-sft-v1", content_hash="d" * 64)
    .add_rlhf_data("preference-pairs-v1", content_hash="e" * 64)
    .add_weights("model.safetensors", content_hash="f" * 64, content_size=16_060_522_240)
    .add_quantization("no-quant-fp16")
    .add_evaluation("mmlu-5shot", content_hash="1" * 64)
    .build()
)

MBOMSigner(identity).sign(mbom)      # fills signer_did / algorithm / signature / public_key
result = MBOMVerifier.verify(mbom)   # VerificationResult(valid=True, ...)
assert result.valid

# Persist
with open("llama3-8b.mbom.json", "w") as f:
    f.write(mbom.to_json())

# List components by type
from pqc_mbom import ComponentType
for c in mbom.components_by_type(ComponentType.TRAINING_DATA):
    print(c.name, c.content_hash[:16], c.content_size)
```

## Architecture

```
          +---------------------------+
          | MBOMBuilder (fluent API)  |
          +-------------+-------------+
                        |
                        v
+-----------------------+------------------------+
|                      MBOM                      |
|  mbom_id, schema_version, model_name/version   |
|  components: [ModelComponent, ...]             |
|  components_root_hash = SHA3-256(sorted hashes)|
+-----------+-------------------+----------------+
            |                   |
            v                   v
  +---------+--------+   +------+--------+
  |    MBOMSigner    |   | to_spdx_json  |<----> SPDX-2.3
  |  ML-DSA sign()   |   | from_spdx     |       interop
  +---------+--------+   +---------------+
            |
            v
 +----------+----------+
 |  Signed MBOM JSON   |      +-----------------+
 |  (transport / CDN)  |----->|  MBOMVerifier   |
 +---------------------+      |  ML-DSA verify  |
                              |  root recompute |
                              +--------+--------+
                                       |
                                       v
                               VerificationResult
```

## Component Types

| Type                   | Meaning                                                      |
| ---------------------- | ------------------------------------------------------------ |
| `base-architecture`    | Model architecture definition (e.g. Llama-3 decoder layout)  |
| `weights`              | Serialized model weights (safetensors, GGUF, pth)            |
| `training-data`        | Raw pretraining dataset                                      |
| `fine-tuning-data`     | SFT / instruction / domain adaptation data                   |
| `rlhf-data`            | Human preference pairs / feedback data                       |
| `evaluation-benchmark` | Benchmark corpus used for reported eval numbers              |
| `tokenizer`            | Tokenizer vocab + merges / BPE / SentencePiece artifacts     |
| `quantization-method`  | Quantization recipe (int8 SmoothQuant, GPTQ, AWQ, etc.)      |
| `code`                 | Training / inference code revision                           |
| `config`               | JSON / YAML config files                                     |
| `adapter`              | LoRA / QLoRA adapter weights                                 |
| `safety-model`         | Content filter / classifier (e.g. Llama-Guard)               |
| `other`                | Anything else worth enumerating                              |

Thirteen types cover the standard model lifecycle. Any arbitrary metadata lives in the per-component `properties: dict[str, str]` bag.

## Cryptography

| Layer             | Algorithm                       | Notes                                                     |
| ----------------- | ------------------------------- | --------------------------------------------------------- |
| Content hashing   | **SHA3-256**                    | Per component and over sorted component hashes            |
| Canonical form    | JSON with `sort_keys=True`      | Deterministic byte-level input to the signer              |
| Signature         | **ML-DSA-65** (FIPS 204)        | Via `quantumshield`; ML-DSA-44 / 87 also supported        |
| Identity          | `did:pqaid:...` (AgentIdentity) | Stable, rotatable signer identity                         |
| Fallback (no oqs) | Ed25519                         | Transitional only; install `quantumshield[pqc]` for real PQC signatures |

The MBOM signature commits to the canonical bytes of the document *including* `components_root_hash`. `MBOMVerifier.verify` both (a) checks the ML-DSA signature and (b) recomputes the root from scratch, so any tampering with a component, the component list, or the stored root is caught.
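
Canonical-form signing is easy to reason about with plain JSON. A sketch of the mechanism (the field names here are illustrative, not the library's actual schema):

```python
import json

doc = {"model_name": "demo", "model_version": "1.0.0",
       "components_root_hash": "a" * 64}

def canonical_bytes(d: dict) -> bytes:
    # Deterministic serialization: sorted keys, compact separators.
    # These bytes are what the signature commits to.
    return json.dumps(d, sort_keys=True, separators=(",", ":")).encode("utf-8")

signed_bytes = canonical_bytes(doc)

# Changing any field, including the stored root, changes the signed bytes,
# so the original signature over them no longer verifies.
tampered = dict(doc, components_root_hash="b" * 64)
assert canonical_bytes(tampered) != signed_bytes

# Key insertion order does not matter for the canonical form.
reordered = {"components_root_hash": "a" * 64,
             "model_version": "1.0.0", "model_name": "demo"}
assert canonical_bytes(reordered) == signed_bytes
```

Sorted keys are what make the serialization deterministic: two semantically equal documents always produce byte-identical signing input.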

## Threat Model

| Threat                                                    | Caught by                                              |
| --------------------------------------------------------- | ------------------------------------------------------ |
| Forged MBOM (attacker publishes an MBOM they didn't make) | ML-DSA signature fails under attacker's key + trust policy rejects unknown `signer_did` |
| Tampered component (flip a byte in a component entry)     | Recomputed component hash + recomputed root mismatch   |
| Dataset swap (same component_id, new content_hash)        | Canonical bytes change -> signature invalid; `diff_mboms` reports it as `changed` |
| Component insertion / removal after signing               | `components_root_hash` changes -> signature invalid    |
| Stale signature (published MBOM whose signer rotated)     | `signer_did` + `signed_at` let you enforce a key-freshness policy |
| Post-quantum forgery (record now, forge later)            | ML-DSA rests on lattice problems that Shor's algorithm does not break |

Trust anchoring (which DIDs are authoritative for a given model supplier) is policy, not cryptography. `pqc-mbom` gives you the cryptographic primitive; your verification layer decides whose signatures to honor.
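
A trust layer on top of verification can be as simple as an allowlist of signer DIDs per supplier. A hypothetical sketch (the `TRUSTED_SIGNERS` table and `is_trusted` helper are illustrations, not pqc-mbom API):

```python
# Hypothetical policy layer; nothing here is part of pqc-mbom itself.
# DIDs below are made up for illustration.
TRUSTED_SIGNERS = {
    "Meta": {"did:pqaid:meta-release-2026"},
    "ExampleCorp": {"did:pqaid:examplecorp-ci"},
}

def is_trusted(supplier: str, signer_did: str) -> bool:
    # Cryptographic validity is necessary but not sufficient:
    # the signer must also be on the supplier's allowlist.
    return signer_did in TRUSTED_SIGNERS.get(supplier, set())

assert is_trusted("Meta", "did:pqaid:meta-release-2026")
assert not is_trusted("Meta", "did:pqaid:attacker")
assert not is_trusted("UnknownCorp", "did:pqaid:meta-release-2026")
```

In a real deployment this check would run after `MBOMVerifier.verify`, using the `signer_did` from the verification result.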

## Why PQC for AI MBOMs

Federal AI procurement guidance (NIST AI 600-1, OMB M-24-10) pushes retention windows of 10-15 years for AI provenance records. Commercial contracts covering model-derived IP often run longer. Anything signed with RSA or ECDSA today is a ticking clock: once a cryptographically relevant quantum computer exists, every stored signature can be forged retroactively.

If you're publishing an AI MBOM in 2026 that needs to be verifiable in 2041, you either sign it post-quantum now or you re-sign every artifact every time a new cryptosystem becomes standard. The first option is dramatically cheaper and is what FIPS 204 exists to enable.

## SPDX Compatibility

```python
from pqc_mbom import to_spdx_json, from_spdx_json

blob = to_spdx_json(mbom)      # SPDX-2.3 JSON document
mbom2 = from_spdx_json(blob)   # roundtrip back
```

Each `ModelComponent` becomes an SPDX `Package`. AI-specific metadata (component_type, MBOM signature, license extras, arbitrary properties) is preserved as structured `annotations` with `pqc-mbom:*` keys. Any SPDX 2.3 consumer — Dependency-Track, OSV, Anchore, the SPDX CLI — can ingest the output as a normal SBOM and will simply ignore the AI extensions. Round-tripping through `from_spdx_json` recovers the full MBOM.
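
The annotation mechanism is plain SPDX 2.3: the AI-specific fields ride along as annotation payloads that standard consumers ignore. An illustrative shape under assumed field names (the exact layout pqc-mbom emits may differ):

```python
import json

# One ModelComponent rendered as an SPDX 2.3 Package.
# Field layout is illustrative; only the pqc-mbom:* key convention
# is taken from the documentation above.
package = {
    "SPDXID": "SPDXRef-Package-pretraining-mix",
    "name": "pretraining-mix",
    "checksums": [{"algorithm": "SHA3-256", "checksumValue": "c" * 64}],
    "annotations": [
        {
            "annotationType": "OTHER",
            "annotator": "Tool: pqc-mbom",
            "comment": json.dumps({"pqc-mbom:component_type": "training-data"}),
        }
    ],
}

# A plain SPDX consumer reads the package fields and skips the comment
# payload; a pqc-mbom round-trip parses the pqc-mbom:* keys back out.
extras = json.loads(package["annotations"][0]["comment"])
assert extras["pqc-mbom:component_type"] == "training-data"
```

SPDX 2.3 permits SHA3-256 as a checksum algorithm, which is why the content hashes survive the conversion without re-hashing.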

## API Reference

```python
# Components
ModelComponent(component_id, component_type, name, version, content_hash,
               content_size, supplier, author, external_url, license,
               references, properties)
ModelComponent.hash_content(bytes) -> str      # SHA3-256 hex
ModelComponent.canonical_bytes() -> bytes
ModelComponent.hash() -> str                   # canonical SHA3-256
ModelComponent.to_dict() / from_dict()

ComponentType.{BASE_ARCHITECTURE, WEIGHTS, TRAINING_DATA, FINE_TUNING_DATA,
               RLHF_DATA, EVALUATION_BENCHMARK, TOKENIZER, QUANTIZATION_METHOD,
               CODE, CONFIG, ADAPTER, SAFETY_MODEL, OTHER}
LicenseInfo(spdx_id, name, url, commercial_use, attribution_required)
ComponentReference(component_id, relationship)

# MBOM
MBOM.create(model_name, model_version, supplier, description, components)
MBOM.recompute_root() -> str
MBOM.get_component(component_id) -> ModelComponent   # raises MissingComponentError
MBOM.components_by_type(ctype) -> list[ModelComponent]
MBOM.canonical_bytes() -> bytes
MBOM.to_dict() / to_json() / from_dict() / from_json()

MBOMBuilder(model_name, model_version, supplier)
    .set_description(str)
    .add_component(ModelComponent)
    .add_base_architecture(name, version, content_hash, **kwargs)
    .add_weights(name, content_hash, content_size, **kwargs)
    .add_training_data(name, content_hash, content_size, **kwargs)
    .add_fine_tuning_data(name, content_hash, **kwargs)
    .add_rlhf_data(name, content_hash, **kwargs)
    .add_tokenizer(name, content_hash, **kwargs)
    .add_quantization(name, **kwargs)
    .add_evaluation(name, content_hash, **kwargs)
    .build() -> MBOM

# Signing / verification
MBOMSigner(identity).sign(mbom) -> MBOM
MBOMVerifier.verify(mbom) -> VerificationResult
MBOMVerifier.verify_or_raise(mbom) -> VerificationResult   # raises SignatureVerificationError

VerificationResult(signature_valid, root_hash_valid, mbom_id, signer_did,
                   algorithm, error)
VerificationResult.valid   # signature_valid and root_hash_valid

# SPDX
to_spdx_json(mbom, *, indent=2) -> str
from_spdx_json(blob) -> MBOM   # raises SPDXConversionError

# Diff
diff_mboms(old, new) -> MBOMDiff
MBOMDiff.{added, removed, changed, is_empty}
```

## Exceptions

```
MBOMError
├── InvalidMBOMError
├── SignatureVerificationError
├── ComponentError
│   └── MissingComponentError
└── SPDXConversionError
```

## Examples

| File                              | Shows                                                      |
| --------------------------------- | ---------------------------------------------------------- |
| `examples/build_llama_mbom.py`    | End-to-end: build a realistic Llama-3-8B MBOM, sign, verify |
| `examples/detect_dataset_swap.py` | Diff two versions, catch a training-data swap attempt      |
| `examples/mbom_to_spdx.py`        | Export an MBOM to SPDX-2.3 JSON and round-trip it back     |

Run them with:

```bash
python examples/build_llama_mbom.py
python examples/detect_dataset_swap.py
python examples/mbom_to_spdx.py
```

## License

Apache 2.0 — see [LICENSE](./LICENSE).