# PQC AI MBOM

**Bill of Materials for AI models, signed with post-quantum cryptography.** Enumerate every component that went into a model — base architecture, pretraining data, fine-tuning data, RLHF feedback, tokenizer, quantization method, evaluation benchmarks, safety classifiers — hash each one with SHA3-256, commit the whole set to a Merkle-style root, and sign the root with ML-DSA (FIPS 204). The result is a machine-verifiable provenance artifact whose signature will still be valid when a cryptographically relevant quantum computer arrives in 10-15 years — which matters, because federal AI procurement audits already require 15+ year record retention.
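
The hash-and-commit scheme can be sketched with the standard library alone. This is an illustrative model of the idea, not the library's internal serialization:

```python
import hashlib

def sha3_256_hex(data: bytes) -> str:
    """SHA3-256 hex digest of raw component content."""
    return hashlib.sha3_256(data).hexdigest()

def components_root(component_hashes: list[str]) -> str:
    """Commit to a set of component hashes: sort, join, hash once more.

    Sorting makes the root independent of enumeration order, so two
    parties listing the same components always agree on the root.
    """
    joined = "".join(sorted(component_hashes)).encode()
    return hashlib.sha3_256(joined).hexdigest()

# The same components in a different order produce the same root.
h1 = sha3_256_hex(b"tokenizer bytes")
h2 = sha3_256_hex(b"weights bytes")
assert components_root([h1, h2]) == components_root([h2, h1])
```

Signing that single 64-hex-character root (rather than each component separately) is what lets one ML-DSA signature cover the entire component set.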

## The Problem

There is no standard, tamper-evident way to declare what an AI model is made of. Model cards are freeform Markdown. Hugging Face repos are a filesystem. SBOM tools like SPDX and CycloneDX were built for software libraries, not datasets, RLHF feedback, or quantization recipes. When a regulator (or your own security team) asks "prove this model wasn't trained on the leaked dataset," the answer is usually an email thread.

Even when providers *do* publish lineage, every signature you see today is RSA or ECDSA — both broken by Shor's algorithm. An AI MBOM signed in 2026 with RSA-2048 will not be verifiable as authentic in 2041. Auditors and procurement officers who keep records for a 15-year retention window will be looking at signatures that a quantum adversary can forge.

## The Solution

`pqc-mbom` is a Python library for producing, signing, and verifying **Model Bill of Materials** documents:

- Each component has a stable id, a type, a SHA3-256 content hash, supplier, author, license, and arbitrary property bag.
- The MBOM commits to `components_root_hash = SHA3-256(sorted component hashes)`.
- The canonical JSON of the MBOM is signed with **ML-DSA** via `quantumshield`.
- SPDX-2.3 interop: `to_spdx_json` / `from_spdx_json` so the output drops into existing SBOM pipelines.
- Diffing: `diff_mboms(old, new)` surfaces added / removed / changed components — the minimum surface area an auditor needs to sign off on a fine-tune.
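
The diff semantics are easy to model: treat each MBOM as a mapping of component id to content hash and compare the key sets. A conceptual sketch, not `diff_mboms` itself:

```python
def diff_component_maps(old: dict[str, str], new: dict[str, str]):
    """Classify components as added, removed, or changed between two MBOMs.

    Keys are component ids; values are SHA3-256 content hashes.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Present in both, but the content hash moved: the auditor's red flag.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed

old = {"weights": "f" * 64, "sft-data": "d" * 64}
new = {"weights": "f" * 64, "sft-data": "9" * 64, "llama-guard": "a" * 64}
# A dataset swap shows up as changed; a new safety model as added.
added, removed, changed = diff_component_maps(old, new)
```

A fine-tune review then reduces to inspecting three short lists instead of re-auditing the whole model.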

## Installation

```bash
pip install pqc-mbom
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

```python
from quantumshield.identity.agent import AgentIdentity
from pqc_mbom import MBOMBuilder, MBOMSigner, MBOMVerifier

identity = AgentIdentity.create("llama-release-pipeline")

mbom = (
    MBOMBuilder("Llama-3-8B-Instruct", "1.0.0", supplier="Meta")
    .set_description("Llama 3 8B instruction-tuned.")
    .add_base_architecture("Llama-3", version="3.0", content_hash="a" * 64)
    .add_tokenizer("llama3-tokenizer", content_hash="b" * 64)
    .add_training_data("pretraining-mix", content_hash="c" * 64, content_size=15 * 10**12)
    .add_fine_tuning_data("instruct-sft-v1", content_hash="d" * 64)
    .add_rlhf_data("preference-pairs-v1", content_hash="e" * 64)
    .add_weights("model.safetensors", content_hash="f" * 64, content_size=16_060_522_240)
    .add_quantization("no-quant-fp16")
    .add_evaluation("mmlu-5shot", content_hash="1" * 64)
    .build()
)

MBOMSigner(identity).sign(mbom)     # fills signer_did / algorithm / signature / public_key
result = MBOMVerifier.verify(mbom)  # VerificationResult(valid=True, ...)
assert result.valid

# Persist
with open("llama3-8b.mbom.json", "w") as f:
    f.write(mbom.to_json())

# List components by type
from pqc_mbom import ComponentType
for c in mbom.components_by_type(ComponentType.TRAINING_DATA):
    print(c.name, c.content_hash[:16], c.content_size)
```

## Architecture

```
         +---------------------------+
         | MBOMBuilder (fluent API)  |
         +-------------+-------------+
                       |
                       v
+------------------------+------------------------+
|                       MBOM                      |
|  mbom_id, schema_version, model_name/version    |
|  components: [ModelComponent, ...]              |
|  components_root_hash = SHA3-256(sorted hashes) |
+-----------+-------------------+-----------------+
            |                   |
            v                   v
  +---------+--------+   +------+--------+
  | MBOMSigner       |   |  to_spdx_json |<----> SPDX-2.3
  | ML-DSA sign()    |   |  from_spdx    |       interop
  +---------+--------+   +---------------+
            |
            v
 +----------+----------+
 | Signed MBOM JSON    |      +-----------------+
 | (transport / CDN)   |----->| MBOMVerifier    |
 +---------------------+      | ML-DSA verify   |
                              | root recompute  |
                              +--------+--------+
                                       |
                                       v
                              VerificationResult
```

## Component Types

| Type                    | Meaning                                                       |
| ----------------------- | ------------------------------------------------------------- |
| `base-architecture`     | Model architecture definition (e.g. Llama-3 decoder layout)   |
| `weights`               | Serialized model weights (safetensors, GGUF, pth)             |
| `training-data`         | Raw pretraining dataset                                       |
| `fine-tuning-data`      | SFT / instruction / domain adaptation data                    |
| `rlhf-data`             | Human preference pairs / feedback data                        |
| `evaluation-benchmark`  | Benchmark corpus used for reported eval numbers               |
| `tokenizer`             | Tokenizer vocab + merges / BPE / SentencePiece artifacts      |
| `quantization-method`   | Quantization recipe (int8 SmoothQuant, GPTQ, AWQ, etc.)       |
| `code`                  | Training / inference code revision                            |
| `config`                | JSON / YAML config files                                      |
| `adapter`               | LoRA / QLoRA adapter weights                                  |
| `safety-model`          | Content filter / classifier (e.g. Llama-Guard)                |
| `other`                 | Anything else worth enumerating                               |

Thirteen types cover the standard model lifecycle. Arbitrary extra metadata lives in the per-component `properties: dict[str, str]` bag.
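
To produce a `content_hash` for a large artifact (a weights file, a dataset shard) without loading it into memory, hash it in streaming chunks. A stdlib sketch — the helper name `hash_file_sha3_256` is this example's, not part of the pqc-mbom API, but SHA3-256 of the raw bytes matches what `ModelComponent.hash_content` computes over the same content:

```python
import hashlib
import tempfile
from pathlib import Path

def hash_file_sha3_256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA3-256 so multi-GB weights never sit in RAM."""
    h = hashlib.sha3_256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Chunked hashing yields the same digest as hashing all bytes at once.
payload = b"pretend these are 16 GB of safetensors weights"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest = hash_file_sha3_256(Path(tmp.name))
assert digest == hashlib.sha3_256(payload).hexdigest()
```

The resulting hex digest is what you pass as `content_hash` to the builder methods.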

## Cryptography

| Layer             | Algorithm                       | Notes                                                     |
| ----------------- | ------------------------------- | --------------------------------------------------------- |
| Content hashing   | **SHA3-256**                    | Per component and over sorted component hashes            |
| Canonical form    | JSON with `sort_keys=True`      | Deterministic byte-level input to the signer              |
| Signature         | **ML-DSA-65** (FIPS 204)        | Via `quantumshield` — ML-DSA-44 / 87 also supported       |
| Identity          | `did:pqaid:...` (AgentIdentity) | Stable, rotatable signer identity                         |
| Fallback (no oqs) | Ed25519                         | Transitional only — install `quantumshield[pqc]` for real |

The MBOM signature commits to the canonical bytes of the document *including* `components_root_hash`. `MBOMVerifier.verify` both (a) checks the ML-DSA signature and (b) recomputes the root from scratch, so any tamper with a component, the component list, or the stored root is caught.
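
Canonicalization is what makes the signature stable: semantically identical documents must serialize to identical bytes. With Python's `json` module that is a `sort_keys` dump with fixed separators — a sketch of the idea; pqc-mbom's exact separators and encoding are whatever its `canonical_bytes` implementation defines:

```python
import json

def canonical_bytes(doc: dict) -> bytes:
    # Sorted keys + fixed separators => byte-identical output
    # regardless of the dict's insertion order.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")

a = {"model_name": "Llama-3-8B", "schema_version": "1.0"}
b = {"schema_version": "1.0", "model_name": "Llama-3-8B"}
assert canonical_bytes(a) == canonical_bytes(b)
```

Without this step, two JSON serializations of the same MBOM could produce different bytes and therefore different (mutually unverifiable) signatures.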

## Threat Model

| Threat | Caught by |
| ----------------------------------------------------- | ------------------------------------------------------ |
| Forged MBOM (attacker publishes an MBOM they didn't make) | ML-DSA signature fails under attacker's key + trust-policy rejects unknown signer_did |
| Tampered component (flip a byte in a component entry) | Recomputed component hash + recomputed root mismatch |
| Dataset swap (same component_id, new content_hash) | Canonical bytes change -> signature invalid; `diff_mboms` reports it as `changed` |
| Component insertion / removal after signing | `components_root_hash` changes -> signature invalid |
| Stale signature (published MBOM whose signer rotated) | `signer_did` + `signed_at` let you enforce key-freshness policy |
| Retroactive quantum forgery (record-now / forge-later) | ML-DSA is resistant to Shor's algorithm |

Trust anchoring (which DIDs are authoritative for a given model supplier) is policy, not cryptography. `pqc-mbom` gives you the cryptographic primitive; your verification layer decides whose signatures to honor.
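
A minimal trust policy layered on top of verification might look like the following. The allow-list, and the idea of keying it by supplier, are this sketch's assumptions, not part of pqc-mbom:

```python
# Hypothetical allow-list: which signer DIDs are authoritative per supplier.
# The DID value below is a placeholder, not a real identity.
TRUSTED_SIGNERS: dict[str, set[str]] = {
    "Meta": {"did:pqaid:example-llama-release"},
}

def accept(supplier: str, signer_did: str, signature_valid: bool) -> bool:
    """Accept an MBOM only if the signature verifies AND the signer is
    on the supplier's allow-list. Cryptography proves who signed;
    policy decides whether that signer matters."""
    return signature_valid and signer_did in TRUSTED_SIGNERS.get(supplier, set())
```

In practice `signature_valid` would come from `MBOMVerifier.verify(mbom).valid` and `signer_did` from the signed document.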

## Why PQC for AI MBOMs

Federal AI procurement guidance (NIST AI 600-1, OMB M-24-10) pushes retention windows of 10-15 years for AI provenance records. Commercial contracts covering model-derived IP often run longer. Anything signed with RSA or ECDSA today is a ticking clock: once a cryptographically relevant quantum computer exists, every stored signature can be forged retroactively.

If you're publishing an AI MBOM in 2026 that needs to be verifiable in 2041, you either sign it post-quantum now or you re-sign every artifact every time a new cryptosystem becomes standard. The first option is dramatically cheaper and is what FIPS 204 exists to enable.

## SPDX Compatibility

```python
from pqc_mbom import to_spdx_json, from_spdx_json

blob = to_spdx_json(mbom)     # SPDX-2.3 JSON document
mbom2 = from_spdx_json(blob)  # roundtrip back
```

Each `ModelComponent` becomes an SPDX `Package`. AI-specific metadata (component_type, MBOM signature, license extras, arbitrary properties) is preserved as structured `annotations` with `pqc-mbom:*` keys. Any SPDX 2.3 consumer — Dependency-Track, OSV, Anchore, the SPDX CLI — can ingest the output as a normal SBOM and will simply ignore the AI extensions. Round-tripping through `from_spdx_json` recovers the full MBOM.

## API Reference

```python
# Components
ModelComponent(component_id, component_type, name, version, content_hash,
               content_size, supplier, author, external_url, license,
               references, properties)
ModelComponent.hash_content(bytes) -> str   # SHA3-256 hex
ModelComponent.canonical_bytes() -> bytes
ModelComponent.hash() -> str                # canonical SHA3-256
ModelComponent.to_dict() / from_dict()

ComponentType.{BASE_ARCHITECTURE, WEIGHTS, TRAINING_DATA, FINE_TUNING_DATA,
               RLHF_DATA, EVALUATION_BENCHMARK, TOKENIZER, QUANTIZATION_METHOD,
               CODE, CONFIG, ADAPTER, SAFETY_MODEL, OTHER}
LicenseInfo(spdx_id, name, url, commercial_use, attribution_required)
ComponentReference(component_id, relationship)

# MBOM
MBOM.create(model_name, model_version, supplier, description, components)
MBOM.recompute_root() -> str
MBOM.get_component(component_id) -> ModelComponent   # raises MissingComponentError
MBOM.components_by_type(ctype) -> list[ModelComponent]
MBOM.canonical_bytes() -> bytes
MBOM.to_dict() / to_json() / from_dict() / from_json()

MBOMBuilder(model_name, model_version, supplier)
    .set_description(str)
    .add_component(ModelComponent)
    .add_base_architecture(name, version, content_hash, **kwargs)
    .add_weights(name, content_hash, content_size, **kwargs)
    .add_training_data(name, content_hash, content_size, **kwargs)
    .add_fine_tuning_data(name, content_hash, **kwargs)
    .add_rlhf_data(name, content_hash, **kwargs)
    .add_tokenizer(name, content_hash, **kwargs)
    .add_quantization(name, **kwargs)
    .add_evaluation(name, content_hash, **kwargs)
    .build() -> MBOM

# Signing / verification
MBOMSigner(identity).sign(mbom) -> MBOM
MBOMVerifier.verify(mbom) -> VerificationResult
MBOMVerifier.verify_or_raise(mbom) -> VerificationResult   # raises SignatureVerificationError

VerificationResult(signature_valid, root_hash_valid, mbom_id, signer_did,
                   algorithm, error)
VerificationResult.valid   # signature_valid and root_hash_valid

# SPDX
to_spdx_json(mbom, *, indent=2) -> str
from_spdx_json(blob) -> MBOM   # raises SPDXConversionError

# Diff
diff_mboms(old, new) -> MBOMDiff
MBOMDiff.{added, removed, changed, is_empty}
```

## Exceptions

```
MBOMError
├── InvalidMBOMError
├── SignatureVerificationError
├── ComponentError
│   └── MissingComponentError
└── SPDXConversionError
```

## Examples

| File                                 | Shows                                                      |
| ------------------------------------ | ---------------------------------------------------------- |
| `examples/build_llama_mbom.py`       | End-to-end: build realistic Llama-3-8B MBOM, sign, verify  |
| `examples/detect_dataset_swap.py`    | Diff two versions, catch a training-data swap attempt      |
| `examples/mbom_to_spdx.py`           | Export an MBOM to SPDX-2.3 JSON and round-trip it back     |

Run them with:

```bash
python examples/build_llama_mbom.py
python examples/detect_dataset_swap.py
python examples/mbom_to_spdx.py
```

## License

Apache 2.0 — see [LICENSE](./LICENSE).
| 256 | |