# PQC RAG Signing

**Sigstore for RAG chunks.** Sign every chunk in your Retrieval-Augmented Generation pipeline with **ML-DSA** (FIPS 204) at ingestion time, then cryptographically verify each chunk at retrieval time before it ever reaches your LLM. Prevents vector database poisoning, supply-chain tampering, and silent chunk substitution attacks, even against adversaries with write access to your vector DB. Every signature is post-quantum secure.

## The Problem

Enterprise RAG pipelines typically have no integrity guarantees. Once a chunk lands in a vector database, there is nothing that cryptographically proves it came from the expected ingestion pipeline. An attacker with write access to the vector DB (insider threat, compromised credentials, or a misconfigured index) can inject malicious chunks that look exactly like legitimate ones. The LLM cannot tell the difference, so it grounds its response on poisoned context.

## The Solution

Every chunk is wrapped in a signed envelope at ingestion:

- Canonical SHA3-256 of `(text + metadata + nonce)`: a deterministic content hash.
- ML-DSA signature over the content hash, by a known signer DID.
- Per-corpus Merkle-style manifest that commits to the entire set of chunks.
- Allow-list of trusted signers enforced at retrieval.

At retrieval time, any tampering (a flipped bit, a swapped chunk, an injected row) is detected before the LLM sees the content.
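
To make the content-hash step concrete, here is a minimal sketch of a canonical SHA3-256 hash using only the standard library. It assumes a sorted-key JSON encoding as the canonical form; the package's actual canonicalization is not documented here and may differ, and `canonical_content_hash` is a hypothetical name used only for illustration.

```python
import hashlib
import json


def canonical_content_hash(text: str, metadata: dict, nonce: str) -> str:
    """Hash (text, metadata, nonce) with SHA3-256 over a canonical JSON encoding."""
    # Sorted keys and fixed separators make the byte encoding deterministic,
    # so the same logical input always yields the same digest.
    payload = json.dumps(
        {"text": text, "metadata": metadata, "nonce": nonce},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


meta = {"source": "handbook.pdf", "chunk_index": 0, "total_chunks": 2}
original = canonical_content_hash("PQC is required for all new systems.", meta, "a1b2c3")
reordered = canonical_content_hash(
    "PQC is required for all new systems.",
    {"total_chunks": 2, "chunk_index": 0, "source": "handbook.pdf"},
    "a1b2c3",
)
tampered = canonical_content_hash("PQC is optional for all new systems.", meta, "a1b2c3")

print(original == reordered)  # key order does not affect the digest
print(original == tampered)   # any edit to the text changes the digest
```

Because the signature covers this digest, a change to the text, the metadata, or the nonce invalidates the signature without the verifier needing the original document.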

## Installation

```bash
pip install pqc-rag-signing
```

Vector-DB extras:

```bash
pip install "pqc-rag-signing[chroma]"
pip install "pqc-rag-signing[pinecone]"
pip install "pqc-rag-signing[qdrant]"
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Ingest: sign a corpus

```python
from quantumshield import AgentIdentity
from pqc_rag_signing import Corpus

identity = AgentIdentity.create("my-rag-ingest")

corpus = Corpus(name="company-handbook-v1", identity=identity)
corpus.add_document("handbook.pdf", chunks=[
    "PQC is required for all new systems.",
    "ML-DSA-87 is the preferred signature algorithm.",
])

signed_chunks = corpus.sign_all()
manifest = corpus.build_manifest()

# Store signed_chunks in your vector DB (persist chunk.to_dict() as metadata)
# Persist manifest.to_json() to S3 / disk / git-managed config
```

### Retrieve: verify before the LLM

```python
from pqc_rag_signing import RetrievalVerifier

# `identity` is the signer created in the ingest step.
verifier = RetrievalVerifier(
    trusted_signers={identity.did},  # only these DIDs are accepted
    strict=True,
)

# `vector_db`, `query`, `query_embedding`, and `your_llm` are placeholders for your own stack.
retrieved_chunks = vector_db.query(query_embedding, top_k=5)  # your DB
result = verifier.verify_retrieved(retrieved_chunks)

if not result.all_verified:
    raise RuntimeError(f"{result.failed_count} chunks failed verification!")

# Only cryptographically verified text ever reaches the LLM
safe_context = "\n\n".join(result.verified_texts())
llm_response = your_llm.generate(prompt=query, context=safe_context)
```

## Architecture

```
Ingest Pipeline               Vector DB                       Retrieval
---------------               ---------                       ---------
  |                               |                               |
  | 1. chunk text                 |                               |
  |                               |                               |
  | 2. sign each chunk            |                               |
  |    (ML-DSA over SHA3-256)     |                               |
  |                               |                               |
  | 3. build corpus manifest      |                               |
  |    (Merkle root + signature)  |                               |
  |                               |                               |
  | 4. upsert SignedChunks ------>|                               |
  |                               |                               |
                                  |                               |
                                  | 5. query (embedding) <--------|
                                  |                               |
                                  | 6. retrieve SignedChunks ---->|
                                                                  |
                                                                  | 7. verify_retrieved():
                                                                  |    - recompute content hash
                                                                  |    - verify ML-DSA signature
                                                                  |    - check trusted-signer allow-list
                                                                  |
                                                                  | 8. ONLY verified text
                                                                  |    passed to LLM
```
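
Step 3 commits to the whole corpus through a Merkle-style root. The exact tree construction used by the package is not specified here, so the following is only a generic sketch: it assumes a binary tree over the chunks' hex content hashes, sorted leaves, leaf/node domain-separation prefixes, and promotion of an unpaired node. `merkle_root` is a hypothetical helper, not part of the package API.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def merkle_root(content_hashes: list[str]) -> str:
    """Return a hex Merkle root over hex-encoded content hashes."""
    if not content_hashes:
        return _h(b"").hex()
    # Sort so the root depends on the set of chunks, not on insertion order.
    # Prefix bytes keep leaf hashes and interior-node hashes in separate domains.
    level = [_h(b"\x00" + bytes.fromhex(h)) for h in sorted(content_hashes)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0].hex()


hashes = [hashlib.sha3_256(t.encode()).hexdigest() for t in ("chunk A", "chunk B", "chunk C")]
root = merkle_root(hashes)

# Dropping, adding, or altering any chunk produces a different root.
print(root == merkle_root(list(reversed(hashes))))  # same set, same root
print(root == merkle_root(hashes[:2]))              # missing chunk, different root
```

Signing the root once lets a verifier detect a missing or extra chunk without checking a separate signature over the full chunk list.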

## Threat Model

| Threat | Mitigation |
|---|---|
| **Vector DB poisoning** (attacker inserts malicious chunks) | Chunks signed by an untrusted DID are rejected at retrieval. |
| **Chunk tampering** (attacker modifies text in place) | Recomputed content hash no longer matches the signed hash. |
| **Metadata tampering** (attacker changes source/index) | Metadata is part of the signed hash input. |
| **Chunk substitution** (swap chunk A for chunk B, both signed) | Manifest verification detects missing or extra chunks in the corpus. |
| **MITM between vector DB and LLM** | All verification is done by the RAG app; no trust in the transport. |
| **Quantum adversary (Shor's algorithm)** | ML-DSA (FIPS 204) is not broken by known quantum attacks. |
| **Replay of old corpus** | Manifests carry `corpus_id` + `created_at`; reject stale manifests by policy. |
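
The replay mitigation leaves the staleness policy to the application. One possible policy is sketched below, assuming the manifest has been parsed into a dict with `corpus_id` and an ISO-8601 `created_at`. The function name `is_manifest_acceptable` and the `max_age` parameter are hypothetical, and the check is only meaningful after the manifest signature itself has been verified.

```python
from datetime import datetime, timedelta, timezone


def is_manifest_acceptable(manifest, expected_corpus_id, max_age, now=None):
    """Accept a manifest only if it names the expected corpus and is recent enough."""
    now = now or datetime.now(timezone.utc)
    if manifest.get("corpus_id") != expected_corpus_id:
        return False
    created_at = datetime.fromisoformat(manifest["created_at"])
    if created_at.tzinfo is None:
        # Treat naive timestamps as UTC (an assumption of this sketch).
        created_at = created_at.replace(tzinfo=timezone.utc)
    age = now - created_at
    # Reject manifests dated in the future as well as ones older than max_age.
    return timedelta(0) <= age <= max_age


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
fresh = {"corpus_id": "handbook-v1", "created_at": "2025-01-30T12:00:00+00:00"}
stale = {"corpus_id": "handbook-v1", "created_at": "2024-06-01T00:00:00+00:00"}

print(is_manifest_acceptable(fresh, "handbook-v1", timedelta(days=7), now=now))
print(is_manifest_acceptable(stale, "handbook-v1", timedelta(days=7), now=now))
```

A fixed age window is the simplest option; deployments that republish manifests on a schedule may prefer to pin the most recent known manifest instead.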

## API Reference

### `ChunkMetadata`

Frozen dataclass describing where a chunk came from.

| Field | Description |
|---|---|
| `source` | Source document identifier (filename, URL, etc.) |
| `chunk_index` | Zero-based position within source |
| `total_chunks` | Total chunks in source |
| `start_offset` / `end_offset` | Character offsets in original document |
| `extra` | Arbitrary user-supplied metadata (preserved through signing) |

### `SignedChunk`

| Field | Description |
|---|---|
| `chunk_id` | Unique id (`chunk-<hex>`) |
| `text` | Content used for embedding |
| `metadata` | `ChunkMetadata` |
| `content_hash` | SHA3-256 of canonical `(text, metadata, nonce)` |
| `signer_did`, `public_key`, `algorithm` | Signer identity + algorithm |
| `signature` | Hex ML-DSA signature over `content_hash` |
| `signed_at` | ISO-8601 timestamp |
| `corpus_id` | Optional corpus binding |
| `nonce` | Per-chunk random nonce |

| Method | Description |
|---|---|
| `compute_content_hash(text, metadata, nonce)` | Deterministic canonical hash (static) |
| `to_dict()` / `from_dict()` | JSON-safe round-trip for vector DB metadata |

### `ChunkSigner`

| Method | Description |
|---|---|
| `sign_chunk(text, metadata, chunk_id=None)` | Sign one chunk |
| `sign_chunks(texts, source)` | Batch-sign chunks from one document |
| `verify_chunk(chunk)` | Static; returns `VerificationResult` |
| `verify_chunks(chunks)` | Static; batch verification |

### `VerificationResult`

Frozen dataclass with `valid`, `chunk_id`, `signer_did`, `algorithm`, `error`. Call `.raise_if_invalid()` to convert to an exception.

### `Corpus` + `CorpusManifest`

| Method | Description |
|---|---|
| `Corpus(name, identity, corpus_id=None)` | Start a new corpus build |
| `add_document(source, chunks)` | Queue a document for signing |
| `sign_all()` | Sign all queued chunks |
| `build_manifest(chunks=None)` | Build a signed Merkle-style manifest |
| `verify_manifest(manifest)` | Static; verify the manifest signature and root |
| `verify_chunks_against_manifest(chunks, manifest)` | Static; check every chunk is committed |

### `RetrievalVerifier` + `RetrievalResult`

| Method | Description |
|---|---|
| `RetrievalVerifier(trusted_signers=None, strict=True)` | Build a verifier with optional allow-list |
| `verify_retrieved(chunks)` | Verify batch, return `RetrievalResult` |
| `verify_or_raise(chunks)` | Raise `TamperedChunkError` on any failure |

`RetrievalResult` fields: `total`, `verified`, `failed`, `all_verified`, `verified_count`, `failed_count`, `verified_texts()`.

### `RAGAuditLog` + `RAGAuditEntry`

Append-only in-memory audit trail. `log_sign`, `log_verify`, `log_retrieval`, `entries(...)`, `export_json()`.

### Exceptions

| Exception | When |
|---|---|
| `RAGSigningError` | Base class |
| `ChunkVerificationError` | Any signature check failure |
| `TamperedChunkError` | Content hash does not match |
| `UnsignedChunkError` | Expected signed chunk, got raw text |
| `CorpusIntegrityError` | Manifest mismatch |
| `KeyMismatchError` | Signer DID differs from expected |

## Vector DB Integration

Any vector database that allows arbitrary metadata per record is compatible. Store `SignedChunk.to_dict()` as metadata alongside the embedding, and rebuild the `SignedChunk` at retrieval:

```python
from pqc_rag_signing import SignedChunk

# On ingest:
metadata_blob = signed_chunk.to_dict()
vector_db.upsert(id=signed_chunk.chunk_id,
                 vector=embedding,
                 metadata=metadata_blob)

# On retrieve:
hits = vector_db.query(vector=query_embedding, top_k=5)
signed = [SignedChunk.from_dict(h["metadata"]) for h in hits]
result = verifier.verify_retrieved(signed)
```

The reference `InMemoryAdapter` (in `pqc_rag_signing.adapters`) and the abstract `VectorStoreAdapter` base class show the shape of a real adapter; use them as templates for Chroma, Pinecone, Qdrant, Weaviate, pgvector, and friends.
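
For trying the ingest and retrieve calls without a real database, a toy store with the same call shapes (`upsert(id=, vector=, metadata=)` and `query(vector=, top_k=)`) can stand in for `vector_db`. The `ToyVectorStore` class below is a hypothetical stand-in written for this illustration; it is not the package's `InMemoryAdapter` or `VectorStoreAdapter`, whose interfaces are not shown in this README.

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory store: brute-force cosine similarity over stored vectors."""

    def __init__(self):
        self._records = {}

    def upsert(self, id, vector, metadata):
        # Insert or overwrite the record for this id.
        self._records[id] = {"id": id, "vector": list(vector), "metadata": dict(metadata)}

    def query(self, vector, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._records.values(),
            key=lambda rec: cosine(vector, rec["vector"]),
            reverse=True,
        )
        return ranked[:top_k]


vector_db = ToyVectorStore()
vector_db.upsert(id="chunk-01", vector=[1.0, 0.0], metadata={"text": "PQC is required."})
vector_db.upsert(id="chunk-02", vector=[0.0, 1.0], metadata={"text": "ML-DSA-87 is preferred."})

hits = vector_db.query(vector=[0.9, 0.1], top_k=1)
print(hits[0]["id"])  # chunk-01
```

In a real pipeline the `metadata` argument would be the dict produced by `SignedChunk.to_dict()`, so that each hit can be rebuilt and verified before use.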

## Examples

See the `examples/` directory:

- **`simple_ingest.py`**: sign a two-document corpus and build a manifest.
- **`retrieve_and_verify.py`**: full retrieve + verify round-trip with an audit log.
- **`poisoning_attack_demo.py`**: demonstrates detection of a vector-DB poisoning attack.

Run them:

```bash
python examples/simple_ingest.py
python examples/retrieve_and_verify.py
python examples/poisoning_attack_demo.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC MCP Transport**: sister tool for signing Model Context Protocol JSON-RPC messages.

## License

Apache License 2.0. See [LICENSE](LICENSE).