v0.4.0: semantic retrieval — REPL TDD + literate prose

RED proofs (pre-v0.4.0):
- SEMANTIC_SCORE never appears in context output (foveal-vector = nil)
- Context suite: 9/0 (no trigram test)
- SHA-256 hashing default: cryptographically blind to similarity

GREEN proofs (v0.4.0):
- Trigram 'authentication' vs 'authenticate' → 0.80 similarity
- Trigram 'authentication' vs 'banana' → 0.00 similarity
- Default provider: :trigram (lexical overlap, zero dependencies)
- Context suite: 12/0 (new test-semantic-retrieval-trigram)
- SHA-256 preserved as explicit :sha256 provider (integrity-only)

Prose:
- system-model-embedding.org: explains why SHA-256 is blind (avalanche property) and why trigrams capture lexical overlap (shared 'aut', 'uth', 'the', 'hen', ...). Documents :trigram, :sha256, :local, :openai backends.
- core-context.org: documents the one-line foveal-vector wiring fix and how it activates the dormant semantic retrieval path. Explains the full pipeline: trigram embed → memory-object-vector → context-awareness-assemble → context-object-render → cosine similarity.
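
As a rough illustration of the trigram figures quoted in the GREEN proofs, the sketch below computes cosine similarity over exact trigram sets. It is not the code from this commit: the names `trigrams` and `trigram-cosine` are hypothetical, and the sketch skips the fixed-size hashed (bloom-filter style) vector the commit describes, so its numbers only approximate the reported ones.

```lisp
;; Illustrative sketch only. TRIGRAMS and TRIGRAM-COSINE are hypothetical
;; helper names, not functions from this repository.

(defun trigrams (text)
  "Return the distinct character trigrams of TEXT, lowercased."
  (let ((s (string-downcase text)))
    (remove-duplicates
     ;; Strings shorter than three characters yield no trigrams.
     (loop for i from 0 to (- (length s) 3)
           collect (subseq s i (+ i 3)))
     :test #'string=)))

(defun trigram-cosine (a b)
  "Cosine similarity of the binary trigram-set vectors of A and B.
For 0/1 vectors this is |shared| / sqrt(|A| * |B|)."
  (let* ((ta (trigrams a))
         (tb (trigrams b))
         (shared (length (intersection ta tb :test #'string=))))
    (if (or (null ta) (null tb))
        0.0
        (/ shared (sqrt (float (* (length ta) (length tb))))))))
```

With exact sets, "authentication" and "authenticate" have 12 and 10 distinct trigrams and share 9 of them, so `trigram-cosine` returns about 0.82. That is close to, but not the same as, the 0.80 reported above. "authentication" and "banana" share no trigrams, so the result is 0.0.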
2026-05-06 19:39:30 -04:00
parent a0f7bd7671
commit 55e27f5194
2 changed files with 14 additions and 3 deletions
--- a/org/system-model-embedding.org
+++ b/org/system-model-embedding.org
@@ -7,13 +7,16 @@

 ~system-model-embedding~ converts text into vector representations for semantic search and memory retrieval. It provides three backends:

+- ~:trigram~ — a zero-dependency fallback that uses character-trigram Jaccard similarity. Pure Lisp, works fully offline, captures lexical overlap.
+- ~:sha256~ — integrity-only (explicit opt-in). SHA-256 hashing for environments where even trivial computation is undesirable.
 - ~:local~ — any OpenAI-compatible ~/api/embeddings~ endpoint (Ollama, vLLM, etc.)
 - ~:openai~ — the OpenAI ~/v1/embeddings~ API with an API key
-- ~:hashing~ — a zero-dependency fallback that produces deterministic vectors from SHA-256 hashes. No server, no config, works offline.

 The embedding queue (~embed-queue-object~ / ~embed-all-pending~) decouples document indexing from the main loop. On each heartbeat tick, ~embed-all-pending~ drains the queue and embeds all accumulated objects. This prevents indexing traffic from blocking conversational responses.

-The default provider is ~:hashing~ — useful for bootstrapping with zero configuration and for deployments where embedding quality isn't critical. Switch to ~:local~ or ~:openai~ when you have an embedding server available.
+The default provider is ~:trigram~ — it captures lexical overlap (character trigram bloom filter → cosine similarity approximates Jaccard) and works immediately with zero configuration. Switch to ~:local~ or ~:openai~ when you have an embedding server; switch to ~:sha256~ for integrity-only deployments.
+
+**Why not SHA-256 by default?** SHA-256 is a cryptographic hash with the avalanche property — one-bit input differences produce entirely different outputs. "implement user login form" and "implement user login forn" (one character difference) have completely different SHA-256 values → cosine similarity near zero. This makes SHA-256 correct for integrity verification (Merkle tree) but useless for similarity-based retrieval. The trigram Jaccard approach captures lexical overlap: "authentication" and "authenticate" share trigrams "aut", "uth", "the", "hen", "ent", "nti", "tic", "ica", producing high cosine similarity (0.80). "authentication" and "banana" share zero trigrams → 0.0 similarity.

 This replaces the old ~system-embedding-gateway~ with the same logic but renamed to ~system-model-embedding~ to live alongside the other ~system-model-*~ skills.

@@ -75,7 +78,7 @@ This replaces the old ~system-embedding-gateway~ with the same logic but renamed
        (list :error (format nil "OpenAI Embedding failed: ~a" c))))))
 #+end_src

-** Hashing fallback
+** Trigram backend (v0.4.0)
 #+begin_src lisp
 (defun embedding-backend-sha256 (text)
  "SHA-256 based vector — integrity only, no semantic retrieval capability.