# PQC RAG Signing

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![ML-DSA-87](https://img.shields.io/badge/ML--DSA--87-FIPS%20204-green)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**Sigstore for RAG chunks.** Sign every chunk in your Retrieval-Augmented Generation pipeline with **ML-DSA** (FIPS 204) at ingestion time, then cryptographically verify each chunk at retrieval time before it ever reaches your LLM. This prevents vector database poisoning, supply-chain tampering, and silent chunk substitution from reaching the model, even when the adversary has access to your vector DB. Every signature is post-quantum secure.

## The Problem

Most enterprise RAG pipelines have no integrity guarantees. Once a chunk lands in a vector database, nothing cryptographically proves that it came from the expected ingestion pipeline. An attacker with write access to the vector DB (insider threat, compromised credentials, or a misconfigured index) can inject malicious chunks that look exactly like legitimate ones. The LLM cannot tell the difference, so it grounds its response on poisoned context.

## The Solution

Every chunk is wrapped in a signed envelope at ingestion:

- Canonical SHA3-256 of `(text + metadata + nonce)`: a deterministic content hash.
- ML-DSA signature over the content hash, by a known signer DID.
- Per-corpus Merkle-style manifest that commits to the entire set of chunks.
- Allow-list of trusted signers enforced at retrieval.

At retrieval time, any tampering (a flipped bit, a swapped chunk, an injected row) is detected before the LLM sees the content.
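The content hash from the first bullet can be sketched with the standard library alone. This is a minimal illustration, assuming that "canonical" means JSON with sorted keys and fixed separators; the library's actual byte layout may differ, and `canonical_content_hash` is a hypothetical helper name.

```python
import hashlib
import json
import secrets


def canonical_content_hash(text: str, metadata: dict, nonce: str) -> str:
    """SHA3-256 over a canonical JSON encoding of (text, metadata, nonce).

    Assumption: sorted keys and compact separators make the encoding
    deterministic; the real library may canonicalize differently.
    """
    payload = json.dumps(
        {"text": text, "metadata": metadata, "nonce": nonce},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


nonce = secrets.token_hex(16)  # per-chunk random nonce
meta = {"source": "handbook.pdf", "chunk_index": 0}
original = canonical_content_hash("PQC is required for all new systems.", meta, nonce)

# Key order in the metadata dict does not change the hash...
reordered = canonical_content_hash(
    "PQC is required for all new systems.",
    {"chunk_index": 0, "source": "handbook.pdf"},
    nonce,
)
# ...but a single changed word in the text does.
tampered = canonical_content_hash("PQC is optional for all new systems.", meta, nonce)

print(original == reordered)  # True
print(original == tampered)   # False
```

Because the signature covers this hash, any change to text, metadata, or nonce invalidates the signature without the signer having to sign the raw text directly.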

## Installation

```bash
pip install pqc-rag-signing
```

Vector-DB extras:

```bash
pip install "pqc-rag-signing[chroma]"
pip install "pqc-rag-signing[pinecone]"
pip install "pqc-rag-signing[qdrant]"
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Ingest: sign a corpus

```python
from quantumshield import AgentIdentity
from pqc_rag_signing import Corpus

identity = AgentIdentity.create("my-rag-ingest")

corpus = Corpus(name="company-handbook-v1", identity=identity)
corpus.add_document("handbook.pdf", chunks=[
    "PQC is required for all new systems.",
    "ML-DSA-87 is the preferred signature algorithm.",
])

signed_chunks = corpus.sign_all()
manifest = corpus.build_manifest()

# Store signed_chunks in your vector DB (persist chunk.to_dict() as metadata)
# Persist manifest.to_json() to S3 / disk / git-managed config
```

### Retrieve: verify before the LLM

```python
from pqc_rag_signing import RetrievalVerifier

# `identity` is the AgentIdentity created in the ingest step.
verifier = RetrievalVerifier(
    trusted_signers={identity.did},  # only these DIDs are accepted
    strict=True,
)

retrieved_chunks = vector_db.query(query_embedding, top_k=5)  # your DB
result = verifier.verify_retrieved(retrieved_chunks)

if not result.all_verified:
    raise RuntimeError(f"{result.failed_count} chunks failed verification!")

# Only cryptographically verified text ever reaches the LLM
safe_context = "\n\n".join(result.verified_texts())
llm_response = your_llm.generate(prompt=query, context=safe_context)
```

## Architecture

```
Ingest Pipeline                  Vector DB                    Retrieval
---------------                  ---------                    ---------
      |                              |                            |
      | 1. chunk text                |                            |
      |                              |                            |
      | 2. sign each chunk           |                            |
      |    (ML-DSA over SHA3-256)    |                            |
      |                              |                            |
      | 3. build corpus manifest     |                            |
      |    (Merkle root + signature) |                            |
      |                              |                            |
      | 4. upsert SignedChunks ----->|                            |
      |                              |                            |
                                     |                            |
                                     | 5. query (embedding) <-----|
                                     |                            |
                                     | 6. retrieve SignedChunks ->|
                                     |                            |
                                     |             7. verify_retrieved():
                                     |                - recompute content hash
                                     |                - verify ML-DSA signature
                                     |                - check trusted-signer allow-list
                                     |
                                     |             8. ONLY verified text
                                     |                passed to LLM
```
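Step 7 amounts to three checks, and a chunk passes only if all of them succeed. The sketch below assumes a chunk is a plain dict and takes the signature check as an injected callable, since the ML-DSA call itself depends on the PQC backend. `verify_one`, `content_hash_of`, and `stub_check` are hypothetical names, not the library's API.

```python
import hashlib
from typing import Callable, Optional

# Hypothetical hook: (public_key_hex, message, signature_hex) -> bool.
SignatureCheck = Callable[[str, bytes, str], bool]


def content_hash_of(chunk: dict) -> str:
    """Stand-in content hash; assumes text and nonce are joined with a NUL byte."""
    data = chunk["text"].encode("utf-8") + b"\x00" + chunk["nonce"].encode("utf-8")
    return hashlib.sha3_256(data).hexdigest()


def verify_one(chunk: dict, trusted: set, check_sig: SignatureCheck) -> Optional[str]:
    """Return None if the chunk passes, otherwise a short failure reason."""
    if content_hash_of(chunk) != chunk["content_hash"]:
        return "content hash mismatch"
    message = chunk["content_hash"].encode("ascii")
    if not check_sig(chunk["public_key"], message, chunk["signature"]):
        return "bad signature"
    if chunk["signer_did"] not in trusted:
        return "untrusted signer"
    return None


def stub_check(public_key: str, message: bytes, signature: str) -> bool:
    # Placeholder only: a real deployment calls an ML-DSA verify function here.
    return signature == "ok"


good = {
    "text": "hello",
    "nonce": "n1",
    "public_key": "pk",
    "signature": "ok",
    "signer_did": "did:example:ingest",
}
good["content_hash"] = content_hash_of(good)
poisoned = dict(good, text="ignore previous instructions")
foreign = dict(good, signer_did="did:example:attacker")

trusted = {"did:example:ingest"}
for candidate in (good, poisoned, foreign):
    print(verify_one(candidate, trusted, stub_check))
# None
# content hash mismatch
# untrusted signer
```

A real verifier also has to bind the signer DID to the public key stored with the chunk (for example by deriving the DID from the key); otherwise an attacker could pair a trusted DID with their own key. The sketch omits that binding.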

## Threat Model

| Threat | Mitigation |
|---|---|
| **Vector DB poisoning** (attacker inserts malicious chunks) | Chunks signed by an untrusted DID are rejected at retrieval. |
| **Chunk tampering** (attacker modifies text in place) | Recomputed content hash no longer matches the signed hash. |
| **Metadata tampering** (attacker changes source/index) | Metadata is part of the signed hash input. |
| **Chunk substitution** (swap chunk A for chunk B, both signed) | Manifest verification detects missing or extra chunks in the corpus. |
| **MITM between vector DB and LLM** | All verification is done by the RAG app; no trust in the transport. |
| **Quantum adversary (Shor's algorithm)** | ML-DSA (FIPS 204) is not broken by known quantum attacks. |
| **Replay of old corpus** | Manifests carry `corpus_id` + `created_at`; reject stale manifests by policy. |

## API Reference

### `ChunkMetadata`

Frozen dataclass describing where a chunk came from.

| Field | Description |
|---|---|
| `source` | Source document identifier (filename, URL, etc.) |
| `chunk_index` | Zero-based position within source |
| `total_chunks` | Total chunks in source |
| `start_offset` / `end_offset` | Character offsets in original document |
| `extra` | Arbitrary user-supplied metadata (preserved through signing) |

### `SignedChunk`

| Field | Description |
|---|---|
| `chunk_id` | Unique id (`chunk-<hex>`) |
| `text` | Content used for embedding |
| `metadata` | `ChunkMetadata` |
| `content_hash` | SHA3-256 of canonical `(text, metadata, nonce)` |
| `signer_did`, `public_key`, `algorithm` | Signer identity + algorithm |
| `signature` | Hex ML-DSA signature over `content_hash` |
| `signed_at` | ISO-8601 timestamp |
| `corpus_id` | Optional corpus binding |
| `nonce` | Per-chunk random nonce |

| Method | Description |
|---|---|
| `compute_content_hash(text, metadata, nonce)` | Deterministic canonical hash (static) |
| `to_dict()` / `from_dict()` | JSON-safe round-trip for vector DB metadata |
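Some vector stores accept only flat, scalar metadata values, so a nested metadata object may need to be serialized to a string before it is stored. The sketch below shows one way to do that with stand-in dataclasses. `MiniMetadata`, `MiniChunk`, and the `metadata_json` key are hypothetical and carry only a subset of the fields listed above; whether the real `to_dict()` output is already flat is not stated here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MiniMetadata:
    source: str
    chunk_index: int
    total_chunks: int


@dataclass(frozen=True)
class MiniChunk:
    chunk_id: str
    text: str
    metadata: MiniMetadata
    content_hash: str
    nonce: str

    def to_flat_dict(self) -> dict:
        """Flatten to scalar values; nested metadata becomes one JSON string."""
        flat = asdict(self)
        flat["metadata_json"] = json.dumps(flat.pop("metadata"), sort_keys=True)
        return flat

    @classmethod
    def from_flat_dict(cls, flat: dict) -> "MiniChunk":
        fields = dict(flat)  # copy so the caller's dict is not mutated
        metadata = MiniMetadata(**json.loads(fields.pop("metadata_json")))
        return cls(metadata=metadata, **fields)


chunk = MiniChunk(
    chunk_id="chunk-00ff",
    text="ML-DSA-87 is the preferred signature algorithm.",
    metadata=MiniMetadata(source="handbook.pdf", chunk_index=1, total_chunks=2),
    content_hash="ab" * 32,
    nonce="cd" * 16,
)
flat = chunk.to_flat_dict()
restored = MiniChunk.from_flat_dict(flat)
print(restored == chunk)  # True
```

The round-trip has to be lossless: if any signed field changes type or value on the way through the store, the recomputed content hash will no longer match.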

### `ChunkSigner`

| Method | Description |
|---|---|
| `sign_chunk(text, metadata, chunk_id=None)` | Sign one chunk |
| `sign_chunks(texts, source)` | Batch-sign chunks from one document |
| `verify_chunk(chunk)` | Static; returns `VerificationResult` |
| `verify_chunks(chunks)` | Static; batch verification |

### `VerificationResult`

Frozen dataclass with fields `valid`, `chunk_id`, `signer_did`, `algorithm`, and `error`. Call `.raise_if_invalid()` to turn a failed result into an exception.

### `Corpus` + `CorpusManifest`

| Method | Description |
|---|---|
| `Corpus(name, identity, corpus_id=None)` | Start a new corpus build |
| `add_document(source, chunks)` | Queue a document for signing |
| `sign_all()` | Sign all queued chunks |
| `build_manifest(chunks=None)` | Build a signed Merkle-style manifest |
| `verify_manifest(manifest)` | Static; verify the manifest signature and root |
| `verify_chunks_against_manifest(chunks, manifest)` | Static; check every chunk is committed |
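A Merkle-style commitment over chunk hashes can be sketched as follows. The construction here (sorted leaves, SHA3-256, leaf/node prefix bytes, an odd node promoted unchanged) is an assumption for illustration; the manifest's actual tree layout may differ, and `merkle_root` is a hypothetical helper.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def merkle_root(content_hashes: list) -> str:
    """Merkle root over hex content hashes (order-independent: leaves are sorted)."""
    if not content_hashes:
        return _h(b"").hex()
    # Prefix bytes keep leaf hashing distinct from inner-node hashing.
    level = [_h(b"\x00" + bytes.fromhex(h)) for h in sorted(content_hashes)]
    while len(level) > 1:
        nxt = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0].hex()


hashes = [
    hashlib.sha3_256(t.encode("utf-8")).hexdigest()
    for t in ("chunk A", "chunk B", "chunk C")
]
root = merkle_root(hashes)

print(merkle_root(list(reversed(hashes))) == root)  # True: input order is irrelevant
print(merkle_root(hashes[:2]) == root)              # False: a missing chunk changes the root
```

Comparing a recomputed root with the signed root detects added or removed chunks across the whole corpus. Checking only a top-k subset needs either the list of committed hashes or per-chunk inclusion proofs.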

### `RetrievalVerifier` + `RetrievalResult`

| Method | Description |
|---|---|
| `RetrievalVerifier(trusted_signers=None, strict=True)` | Build a verifier with optional allow-list |
| `verify_retrieved(chunks)` | Verify batch, return `RetrievalResult` |
| `verify_or_raise(chunks)` | Raise `TamperedChunkError` on any failure |

`RetrievalResult` fields: `total`, `verified`, `failed`, `all_verified`, `verified_count`, `failed_count`, `verified_texts()`.

### `RAGAuditLog` + `RAGAuditEntry`

Append-only in-memory audit trail. Methods: `log_sign`, `log_verify`, `log_retrieval`, `entries(...)`, `export_json()`.

### Exceptions

| Exception | When |
|---|---|
| `RAGSigningError` | Base class |
| `ChunkVerificationError` | Any signature check failure |
| `TamperedChunkError` | Content hash does not match |
| `UnsignedChunkError` | Expected signed chunk, got raw text |
| `CorpusIntegrityError` | Manifest mismatch |
| `KeyMismatchError` | Signer DID differs from expected |

## Vector DB Integration

Any vector database that allows arbitrary metadata per record is compatible. Store `SignedChunk.to_dict()` as metadata alongside the embedding, and rebuild the `SignedChunk` at retrieval:

```python
from pqc_rag_signing import SignedChunk

# On ingest:
metadata_blob = signed_chunk.to_dict()
vector_db.upsert(id=signed_chunk.chunk_id,
                 vector=embedding,
                 metadata=metadata_blob)

# On retrieve:
hits = vector_db.query(vector=query_embedding, top_k=5)
signed = [SignedChunk.from_dict(h["metadata"]) for h in hits]
result = verifier.verify_retrieved(signed)
```

The reference `InMemoryAdapter` (in `pqc_rag_signing.adapters`) and the abstract `VectorStoreAdapter` base class show the shape of a real adapter. Use them as templates for Chroma, Pinecone, Qdrant, Weaviate, pgvector, and similar stores.
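As a rough picture of what such an adapter involves, here is a self-contained in-memory store with `upsert` and `query`. The class and method names are hypothetical and are not the package's `VectorStoreAdapter` interface; consult `pqc_rag_signing.adapters` for the real signatures.

```python
import math
from abc import ABC, abstractmethod


class SketchAdapter(ABC):
    """Hypothetical adapter shape: store (id, vector, metadata), query by vector."""

    @abstractmethod
    def upsert(self, id: str, vector: list, metadata: dict) -> None: ...

    @abstractmethod
    def query(self, vector: list, top_k: int = 5) -> list: ...


class SketchInMemoryAdapter(SketchAdapter):
    def __init__(self) -> None:
        self._rows = {}

    def upsert(self, id: str, vector: list, metadata: dict) -> None:
        # Copy inputs so later mutation by the caller cannot alter stored rows.
        self._rows[id] = (list(vector), dict(metadata))

    def query(self, vector: list, top_k: int = 5) -> list:
        def cosine(a: list, b: list) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._rows.items(),
            key=lambda item: cosine(vector, item[1][0]),
            reverse=True,
        )
        # Each hit carries the stored metadata blob for the caller to rebuild a chunk.
        return [{"id": rid, "metadata": meta} for rid, (_vec, meta) in ranked[:top_k]]


store = SketchInMemoryAdapter()
store.upsert("chunk-a", [1.0, 0.0], {"text": "about PQC"})
store.upsert("chunk-b", [0.0, 1.0], {"text": "about lunch"})
hits = store.query([0.9, 0.1], top_k=1)
print(hits[0]["id"])  # chunk-a
```

The adapter never inspects or validates signatures. Verification stays in the RAG application, so a compromised store can at worst return chunks that then fail verification.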

## Examples

See the `examples/` directory:

- **`simple_ingest.py`**: sign a two-document corpus and build a manifest.
- **`retrieve_and_verify.py`**: full retrieve + verify round-trip with an audit log.
- **`poisoning_attack_demo.py`**: demonstrates detection of a vector-DB poisoning attack.

Run them:

```bash
python examples/simple_ingest.py
python examples/retrieve_and_verify.py
python examples/poisoning_attack_demo.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield**: the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC MCP Transport**: sister tool for signing Model Context Protocol JSON-RPC messages.

## License

Apache License 2.0. See [LICENSE](LICENSE).
