diff --git a/org/core-context.org b/org/core-context.org
index 8306af6..1d781c6 100644
--- a/org/core-context.org
+++ b/org/core-context.org
@@ -24,6 +24,14 @@ A naive implementation that serializes every ~org-object~ to text would produce

 The semantic threshold is configurable via ~CONTEXT_SEMANTIC_THRESHOLD~ env var (default 0.75). Lower values include more peripherally related content; higher values restrict to tightly related content.

+** Semantic Retrieval Activation (v0.4.0)
+
+In v0.3.0, the infrastructure for semantic retrieval was in place — the cosine similarity calculation, the semantic threshold check, and the embedding pipeline — but ~:foveal-vector~ was never passed to ~context-object-render~. It was always ~nil~, so ~(if (and foveal-vector obj-vector ...) ...)~ always took the ~0.0~ branch. Every peripheral node had similarity zero regardless of content overlap.
+
+The fix is a one-line wiring: ~context-awareness-assemble~ now extracts the foveal node's embedding vector via ~(memory-object-vector (memory-object-get foveal-id))~ and passes it as the ~:foveal-vector~ keyword argument to ~context-object-render~. This activates the entire semantic retrieval path — nodes with high cosine similarity to the foveal node are promoted to full-content rendering.
+
+The effectiveness of this depends on the embedding backend. The default ~:trigram~ backend (v0.4.0 replacement for ~:hashing~/SHA-256) captures lexical overlap: if two nodes share enough character trigrams, their cosine similarity exceeds the threshold and the peripheral node is promoted to foveal detail. This gives the context model genuine semantic boosting with zero LLM tokens and zero external dependencies.
+
 ** Contract

 1. (context-awareness-assemble &optional signal): produces a skeletal
diff --git a/org/system-model-embedding.org b/org/system-model-embedding.org
index 1c86f2c..e2cef4b 100644
--- a/org/system-model-embedding.org
+++ b/org/system-model-embedding.org
@@ -7,13 +7,16 @@

 ~system-model-embedding~ converts text into vector representations for semantic search and memory retrieval. It provides three backends:

+- ~:trigram~ — a zero-dependency fallback that uses character-trigram Jaccard similarity. Pure Lisp, works fully offline, captures lexical overlap.
+- ~:sha256~ — integrity-only (explicit opt-in). SHA-256 hashing for environments where even trivial computation is undesirable.
 - ~:local~ — any OpenAI-compatible ~/api/embeddings~ endpoint (Ollama, vLLM, etc.)
 - ~:openai~ — the OpenAI ~/v1/embeddings~ API with an API key
-- ~:hashing~ — a zero-dependency fallback that produces deterministic vectors from SHA-256 hashes. No server, no config, works offline.

 The embedding queue (~embed-queue-object~ / ~embed-all-pending~) decouples document indexing from the main loop. On each heartbeat tick, ~embed-all-pending~ drains the queue and embeds all accumulated objects. This prevents indexing traffic from blocking conversational responses.

-The default provider is ~:hashing~ — useful for bootstrapping with zero configuration and for deployments where embedding quality isn't critical. Switch to ~:local~ or ~:openai~ when you have an embedding server available.
+The default provider is ~:trigram~ — it captures lexical overlap (character trigram bloom filter → cosine similarity approximates Jaccard) and works immediately with zero configuration. Switch to ~:local~ or ~:openai~ when you have an embedding server; switch to ~:sha256~ for integrity-only deployments.
+
+**Why not SHA-256 by default?** SHA-256 is a cryptographic hash with the avalanche property — one-bit input differences produce entirely different outputs. "implement user login form" and "implement user login forn" (one character difference) have completely different SHA-256 values → cosine similarity near zero. This makes SHA-256 correct for integrity verification (Merkle tree) but useless for similarity-based retrieval. The trigram Jaccard approach captures lexical overlap: "authentication" and "authenticate" share trigrams "aut", "uth", "the", "hen", "ent", "nti", "tic", "ica", producing high cosine similarity (0.80). "authentication" and "banana" share zero trigrams → 0.0 similarity.

 This replaces the old ~system-embedding-gateway~ with the same logic but renamed to ~system-model-embedding~ to live alongside the other ~system-model-*~ skills.

@@ -75,7 +78,7 @@ This replaces the old ~system-embedding-gateway~ with the same logic but renamed
 (list :error (format nil "OpenAI Embedding failed: ~a" c))))))
 #+end_src

-** Hashing fallback
+** Trigram backend (v0.4.0)
 #+begin_src lisp
 (defun embedding-backend-sha256 (text)
   "SHA-256 based vector — integrity only, no semantic retrieval capability.