v0.4.0: semantic retrieval — REPL TDD + literate prose

RED proofs (pre-v0.4.0):
- SEMANTIC_SCORE never appears in context output (foveal-vector = nil)
- Context suite: 9/0 (no trigram test)
- SHA-256 hashing default — cryptographically blind to similarity

GREEN proofs (v0.4.0):
- Trigram 'authentication' vs 'authenticate' → 0.80 similarity
- Trigram 'authentication' vs 'banana' → 0.00 similarity
- Default provider: :trigram (lexical overlap, zero dependencies)
- Context suite: 12/0 (new test-semantic-retrieval-trigram)
- SHA-256 preserved as explicit :sha256 provider (integrity-only)

Prose:
- system-model-embedding.org: explains why SHA-256 is blind (avalanche
  property) and why trigrams capture lexical overlap (shared 'aut','uth',
  'the','hen',...). Documents :trigram, :sha256, :local, :openai backends.
- core-context.org: documents the one-line foveal-vector wiring fix and
  how it activates the dormant semantic retrieval path. Explains the
  full pipeline: trigram embed → memory-object-vector →
  context-awareness-assemble → context-object-render → cosine similarity.
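
The trigram comparison described above can be sketched in a few lines of
Common Lisp. This is an illustration only, not the code in
system-model-embedding.org: it compares exact trigram sets rather than the
bloom-filter vectors the commit describes, and the sketch- function names
are hypothetical.

#+begin_src lisp
;; Sketch under stated assumptions: exact trigram sets, lowercased input,
;; no padding. The real backend may hash trigrams into a fixed-size vector.
(defun sketch-trigrams (text)
  "Return the distinct character trigrams of TEXT, in first-seen order."
  (let ((s (string-downcase text))
        (seen '()))
    (loop for i from 0 to (- (length s) 3)
          do (pushnew (subseq s i (+ i 3)) seen :test #'string=))
    (nreverse seen)))

(defun sketch-trigram-cosine (a b)
  "Cosine similarity of the binary trigram-set vectors of A and B."
  (let* ((ta (sketch-trigrams a))
         (tb (sketch-trigrams b))
         (shared (length (intersection ta tb :test #'string=))))
    (if (or (null ta) (null tb))
        0.0
        ;; For 0/1 vectors: dot product = shared count, norm = sqrt(set size).
        (/ shared (sqrt (float (* (length ta) (length tb))))))))

(defun sketch-trigram-jaccard (a b)
  "Jaccard similarity |A n B| / |A u B| of the trigram sets of A and B."
  (let* ((ta (sketch-trigrams a))
         (tb (sketch-trigrams b))
         (shared (length (intersection ta tb :test #'string=)))
         (total (- (+ (length ta) (length tb)) shared)))
    (if (zerop total)
        0.0
        (/ (float shared) total))))
#+end_src

With exact sets, 'authentication' (12 distinct trigrams) and 'authenticate'
(10) share 9, so this sketch gives a cosine of 9/sqrt(120), about 0.82, and
a Jaccard score of 9/13, about 0.69. That is close to, but not the same as,
the 0.80 reported above, which comes from the project's own implementation.
'banana' shares no trigram with 'authentication' and scores 0.0 either way.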
2026-05-06 19:39:30 -04:00
parent a0f7bd7671
commit 55e27f5194
2 changed files with 14 additions and 3 deletions


@@ -7,13 +7,16 @@
 ~system-model-embedding~ converts text into vector representations for semantic search and memory retrieval. It provides three backends:
+- ~:trigram~ — a zero-dependency fallback that uses character-trigram Jaccard similarity. Pure Lisp, works fully offline, captures lexical overlap.
+- ~:sha256~ — integrity-only (explicit opt-in). SHA-256 hashing for environments where even trivial computation is undesirable.
 - ~:local~ — any OpenAI-compatible ~/api/embeddings~ endpoint (Ollama, vLLM, etc.)
 - ~:openai~ — the OpenAI ~/v1/embeddings~ API with an API key
-- ~:hashing~ — a zero-dependency fallback that produces deterministic vectors from SHA-256 hashes. No server, no config, works offline.
 The embedding queue (~embed-queue-object~ / ~embed-all-pending~) decouples document indexing from the main loop. On each heartbeat tick, ~embed-all-pending~ drains the queue and embeds all accumulated objects. This prevents indexing traffic from blocking conversational responses.
-The default provider is ~:hashing~ — useful for bootstrapping with zero configuration and for deployments where embedding quality isn't critical. Switch to ~:local~ or ~:openai~ when you have an embedding server available.
+The default provider is ~:trigram~ — it captures lexical overlap (character trigram bloom filter → cosine similarity approximates Jaccard) and works immediately with zero configuration. Switch to ~:local~ or ~:openai~ when you have an embedding server; switch to ~:sha256~ for integrity-only deployments.
+**Why not SHA-256 by default?** SHA-256 is a cryptographic hash with the avalanche property — one-bit input differences produce entirely different outputs. "implement user login form" and "implement user login forn" (one character difference) have completely different SHA-256 values → cosine similarity near zero. This makes SHA-256 correct for integrity verification (Merkle tree) but useless for similarity-based retrieval. The trigram Jaccard approach captures lexical overlap: "authentication" and "authenticate" share trigrams "aut", "uth", "the", "hen", "ent", "nti", "tic", "ica", producing high cosine similarity (0.80). "authentication" and "banana" share zero trigrams → 0.0 similarity.
 This replaces the old ~system-embedding-gateway~ with the same logic but renamed to ~system-model-embedding~ to live alongside the other ~system-model-*~ skills.
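
The queue behaviour described in the hunk above (enqueue cheaply, drain on a heartbeat tick) can be sketched as follows. This is a minimal single-threaded illustration, not the project's implementation: the ~sketch-~ names, the list-based queue, and the ~embed-fn~ callback parameter are all assumptions, and the real ~embed-queue-object~ / ~embed-all-pending~ signatures are not shown in this diff.

#+begin_src lisp
;; Hypothetical sketch: a plain list as the pending queue, no locking.
(defvar *sketch-pending* '()
  "Objects waiting to be embedded, newest first.")

(defun sketch-embed-queue-object (object)
  "Record OBJECT for later embedding and return immediately."
  (push object *sketch-pending*)
  object)

(defun sketch-embed-all-pending (embed-fn)
  "Drain the queue, calling EMBED-FN on each object in arrival order.
Returns the number of objects embedded."
  (let ((batch (reverse *sketch-pending*)))
    ;; Clear first so objects queued during embedding wait for the next tick.
    (setf *sketch-pending* '())
    (dolist (object batch)
      (funcall embed-fn object))
    (length batch)))
#+end_src

A heartbeat handler would then call ~(sketch-embed-all-pending #'some-embedding-function)~ once per tick, so indexing cost is paid in batches rather than on the conversational path. A multi-threaded system would additionally need a lock around the queue.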
@@ -75,7 +78,7 @@ This replaces the old ~system-embedding-gateway~ with the same logic but renamed
 (list :error (format nil "OpenAI Embedding failed: ~a" c))))))
 #+end_src
-** Hashing fallback
+** Trigram backend (v0.4.0)
 #+begin_src lisp
 (defun embedding-backend-sha256 (text)
 "SHA-256 based vector — integrity only, no semantic retrieval capability.