v0.4.0: semantic retrieval — REPL TDD + literate prose
Some checks failed
Deploy (Gitea) / deploy (push) Failing after 3s
RED proofs (pre-v0.4.0):
- SEMANTIC_SCORE never appears in context output (foveal-vector = nil)
- Context suite: 9/0 (no trigram test)
- SHA-256 hashing default — cryptographically blind to similarity

GREEN proofs (v0.4.0):
- Trigram 'authentication' vs 'authenticate' → 0.80 similarity
- Trigram 'authentication' vs 'banana' → 0.00 similarity
- Default provider: :trigram (lexical overlap, zero dependencies)
- Context suite: 12/0 (new test-semantic-retrieval-trigram)
- SHA-256 preserved as explicit :sha256 provider (integrity-only)

Prose:
- system-model-embedding.org: explains why SHA-256 is blind (avalanche property) and why trigrams capture lexical overlap (shared 'aut', 'uth', 'the', 'hen', ...). Documents :trigram, :sha256, :local, :openai backends.
- core-context.org: documents the one-line foveal-vector wiring fix and how it activates the dormant semantic retrieval path. Explains the full pipeline: trigram embed → memory-object-vector → context-awareness-assemble → context-object-render → cosine similarity.
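The GREEN proofs above rest on character-trigram overlap. As a rough illustration only, the following Common Lisp sketch computes exact set-based Jaccard and cosine scores over distinct trigrams. The function names (trigram-set, trigram-jaccard, trigram-cosine) are hypothetical and are not taken from system-model-embedding. The commit's own vector encoding is not shown here, so this sketch is not expected to reproduce the reported 0.80 exactly.

```lisp
;; Hypothetical helpers for illustration; not the commit's implementation.

(defun trigram-set (text)
  "Return a list of the distinct character trigrams in TEXT, lowercased."
  (let ((s (string-downcase text))
        (grams '()))
    ;; Strings shorter than 3 characters yield no trigrams.
    (loop for i from 0 to (- (length s) 3)
          do (pushnew (subseq s i (+ i 3)) grams :test #'string=))
    grams))

(defun shared-trigram-count (ga gb)
  "Number of trigrams present in both duplicate-free lists GA and GB."
  (length (intersection ga gb :test #'string=)))

(defun trigram-jaccard (a b)
  "Exact Jaccard similarity (shared / union) over trigram sets, as a rational."
  (let* ((ga (trigram-set a))
         (gb (trigram-set b))
         (shared (shared-trigram-count ga gb))
         (union-size (- (+ (length ga) (length gb)) shared)))
    (if (zerop union-size) 0 (/ shared union-size))))

(defun trigram-cosine (a b)
  "Cosine similarity of the binary trigram-presence vectors of A and B."
  (let ((ga (trigram-set a))
        (gb (trigram-set b)))
    (if (or (null ga) (null gb))
        0.0
        (/ (shared-trigram-count ga gb)
           (sqrt (* (length ga) (length gb)))))))
```

Under these assumptions, 'authentication' and 'authenticate' share 9 of 13 distinct trigrams (Jaccard 9/13, set cosine about 0.82), while 'authentication' and 'banana' share none and score 0.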
@@ -7,13 +7,16 @@
 ~system-model-embedding~ converts text into vector representations for semantic search and memory retrieval. It provides three backends:

+- ~:trigram~ — a zero-dependency fallback that uses character-trigram Jaccard similarity. Pure Lisp, works fully offline, captures lexical overlap.
+- ~:sha256~ — integrity-only (explicit opt-in). SHA-256 hashing for environments where even trivial computation is undesirable.
 - ~:local~ — any OpenAI-compatible ~/api/embeddings~ endpoint (Ollama, vLLM, etc.)
 - ~:openai~ — the OpenAI ~/v1/embeddings~ API with an API key
-- ~:hashing~ — a zero-dependency fallback that produces deterministic vectors from SHA-256 hashes. No server, no config, works offline.

 The embedding queue (~embed-queue-object~ / ~embed-all-pending~) decouples document indexing from the main loop. On each heartbeat tick, ~embed-all-pending~ drains the queue and embeds all accumulated objects. This prevents indexing traffic from blocking conversational responses.

-The default provider is ~:hashing~ — useful for bootstrapping with zero configuration and for deployments where embedding quality isn't critical. Switch to ~:local~ or ~:openai~ when you have an embedding server available.
+The default provider is ~:trigram~ — it captures lexical overlap (character trigram bloom filter → cosine similarity approximates Jaccard) and works immediately with zero configuration. Switch to ~:local~ or ~:openai~ when you have an embedding server; switch to ~:sha256~ for integrity-only deployments.
+
+**Why not SHA-256 by default?** SHA-256 is a cryptographic hash with the avalanche property — one-bit input differences produce entirely different outputs. "implement user login form" and "implement user login forn" (one character difference) have completely different SHA-256 values → cosine similarity near zero. This makes SHA-256 correct for integrity verification (Merkle tree) but useless for similarity-based retrieval. The trigram Jaccard approach captures lexical overlap: "authentication" and "authenticate" share trigrams "aut", "uth", "the", "hen", "ent", "nti", "tic", "ica", producing high cosine similarity (0.80). "authentication" and "banana" share zero trigrams → 0.0 similarity.

 This replaces the old ~system-embedding-gateway~ with the same logic but renamed to ~system-model-embedding~ to live alongside the other ~system-model-*~ skills.
@@ -75,7 +78,7 @@ This replaces the old ~system-embedding-gateway~ with the same logic but renamed
 (list :error (format nil "OpenAI Embedding failed: ~a" c))))))
 #+end_src

-** Hashing fallback
+** Trigram backend (v0.4.0)
 #+begin_src lisp
 (defun embedding-backend-sha256 (text)
 "SHA-256 based vector — integrity only, no semantic retrieval capability.