From d3b74f5c88c105058e45972eaf9806d13e5f3bef Mon Sep 17 00:00:00 2001
From: Amr Gharbeia <amr@gharbeia.net>
Date: Thu, 7 May 2026 09:55:33 -0400
Subject: [PATCH] =?UTF-8?q?v0.4.1:=20native=20embedding=20CFFI=20=E2=80=94?=
 =?UTF-8?q?=20full=20pipeline=20working,=20daemon-wired,=20HITL=20bug=20fi?=
 =?UTF-8?q?xed?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Native backend returns 768-dim vectors via llama.cpp / C wrapper (/usr/local/lib/libllama_wrap.so)
- Wired :native into embed-object dispatch and exported from passepartout package
- Model preloads at daemon startup with EMBEDDING_PROVIDER=native (~30s)
- Lazy loading via *embedding-backend* :native also works (first call ~45s)
- C wrapper bridges CFFI pointer params to llama.cpp struct-by-value API
- Correct struct layouts: llama_model_params(72B), llama_context_params(136B), llama_batch(56B)
- BERT pooling: llama_get_embeddings_seq, llama_tokenize takes vocab* not model*
- FiveAM tests pass: dimensions, self-similarity, semantic ranking
- Fixed pre-existing HITL crash: boundp guard for *hitl-pending* in core-loop-act
- Lazy load guard prevents double-load of native file in embedding-native-ensure-loaded
- ROADMAP: v0.4.0 items marked DONE, v0.4.1 native embedding updated with actual implementation
---
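The pointer-to-by-value bridge mentioned in the commit message can be sketched in C as below. This is a minimal illustration of the pattern only, not the contents of scripts/llama_wrap.c: demo_params, demo_init_by_value, demo_init_wrap, and demo_params_size are hypothetical names, and the struct is a stand-in rather than any real llama.cpp type. The wrapper accepts a pointer, which a CFFI caller can pass easily, and forwards the dereferenced struct by value. The size accessor lets the Lisp side compare its assumed layout size against what the C compiler actually produced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter struct; a stand-in for a library's by-value params. */
typedef struct demo_params {
    int32_t n_ctx;
    int32_t n_threads;
    uint8_t embeddings;
} demo_params;

/* Stand-in for a library entry point that takes its struct by value. */
static int32_t demo_init_by_value(demo_params params) {
    return params.embeddings ? params.n_ctx * params.n_threads : -1;
}

/* Wrapper: takes a pointer (simple for an FFI to pass), dereferences it,
   and forwards the struct by value to the underlying function. */
int32_t demo_init_wrap(const demo_params *params) {
    assert(params != NULL);
    return demo_init_by_value(*params);
}

/* Reports the compiler's struct size so the foreign side can verify
   that its own layout assumption matches. */
size_t demo_params_size(void) {
    return sizeof(demo_params);
}
```

From the Lisp side, one would allocate a foreign buffer of demo_params_size() bytes, fill in the fields, and pass its address to demo_init_wrap.
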
 docs/CONTRIBUTING.org                   |   2 +-
 docs/DESIGN_DECISIONS.org               | 129 +++++---
 docs/ROADMAP.org                        | 259 +++++++++++++---
 lisp/core-defpackage.lisp               |   5 +
 lisp/core-loop-act.lisp                 |   6 +-
 lisp/system-model-embedding-native.lisp | 266 ++++++++--------
 lisp/system-model-embedding.lisp        |  42 ++-
 org/core-defpackage.org                 |   5 +
 org/core-loop-act.org                   |   6 +-
 org/system-model-embedding-native.org   | 391 +++++++++++++++---------
 org/system-model-embedding.org          |  43 ++-
 scripts/llama_wrap.c                    |  24 ++
 12 files changed, 804 insertions(+), 374 deletions(-)
 create mode 100644 scripts/llama_wrap.c

diff --git a/docs/CONTRIBUTING.org b/docs/CONTRIBUTING.org
index bb1bf76..cc6d542 100644
--- a/docs/CONTRIBUTING.org
+++ b/docs/CONTRIBUTING.org
@@ -36,7 +36,7 @@ Skills are the building blocks of Passepartout. They reside in the `skills/` dir
 A skill must define:
 1. *Trigger*: A lambda determining if the skill should activate based on the context.
 2. *Probabilistic Gate*: Optional. Generates a prompt for the LLM.
-3. *Deterministic Gate*: A hardcoded Lisp function that guarantees safety or executes side-effects (the "Bouncer" pattern).
+3. *Deterministic Gate*: A hardcoded Lisp function that guarantees safety or executes side-effects (the Dispatcher pattern).
 
 Example Registration:
 #+begin_src lisp
diff --git a/docs/DESIGN_DECISIONS.org b/docs/DESIGN_DECISIONS.org
index 3332046..52bd269 100644
--- a/docs/DESIGN_DECISIONS.org
+++ b/docs/DESIGN_DECISIONS.org
@@ -143,9 +143,26 @@ This separation is the source of Passepartout's safety guarantee. Other agents a
 
 The split also explains why the system gets safer over time without the LLM improving. The deterministic engine accumulates rules. The LLM proposes actions, the engine evaluates them against a growing rule set. Early versions block obvious dangers. Later versions block sophisticated attacks that were previously unknown. The safety grows logarithmically with the number of interactions, not linearly with model capability.
 
+** Core Knowledge: The Four Pillars of Agentic Reliability
+:PROPERTIES:
+:CREATED:  [2026-05-07 Thu]
+:END:
+
+Every reliable AI agent must possess four types of Core Knowledge — not as prompt instructions, but as encoded symbolic rules that the neural engine cannot override. These are the "laws of physics" for the agent's computational universe. Passepartout encodes each pillar as deterministic Lisp functions in the Dispatcher gate stack.
+
+1. *Digital Object Permanence & State.* The agent must know what exists independently of its attention. Passepartout achieves this through the Merkle-tree memory: every memory-object carries a SHA-256 content hash. If the agent deletes a file, the hash proves it's gone. If an external process modifies it, the hash mismatch triggers a warning. The copy-on-write snapshot mechanism preserves the state at every decision point, enabling rollback if an action chain fails.
+
+2. *Causality and Temporal Logic.* Actions must execute in order. Step B cannot run if Step A failed. Passepartout enforces this through the pipeline's depth counter (signals cannot recurse past depth 10, preventing infinite loops) and the sequential Perceive → Reason → Act ordering. The batch tool calls feature (v0.4.1) allows parallel execution of independent actions while enforcing sequential execution of dependent ones — actions that share a dependency are ordered; actions that don't are parallelized.
+
+3. *Agentic Boundaries (The "Self").* The agent must know where its authority ends and the host system begins. Passepartout encodes this through the Dispatcher gate stack: path protection blocks access to sensitive directories (~/.ssh, /etc, ~/.aws). Shell safety blocks destructive commands (rm -rf /, dd, injection vectors). Network exfiltration detection blocks unauthorized outbound connections. The permission table (v0.2.0) allows per-tool, per-path granularity. These are not prompt instructions — they are Lisp functions that execute unconditionally for every action. The self-build safety boundary (v0.4.0) extends this to the agent's own core pipeline files: the agent can modify skills and system modules freely, but cannot modify its own brain stem without human review.
+
+4. *Epistemic Certainty (Knowing How It Knows).* The agent must distinguish between a verified fact, a retrieved memory, and an LLM prediction. Passepartout encodes this through the gate trace (v0.4.0): every action carries a record of which gates passed, which blocked, and why. The provenance system (LOGBOOK entries on memory-objects) records who modified what and when. The Dispatcher's existence-check gate verifies that a file exists before allowing a read. The process-status gate verifies that a command completed before allowing its output to be used. The agent cannot "hallucinate" a file path or a process result because the Dispatcher checks each against the live state before execution.
+
+These four pillars are not features. They are the definition of a reliable agent. Every agent architecture either provides them or compensates for their absence in ways that make the agent less trustworthy, more expensive, or both.
+
 ** The Dispatcher as Learning System
 :PROPERTIES:
-:ID:       design-bouncer-learning
+:ID:       design-dispatcher-learning
 :CREATED:  [2026-05-07 Wed]
 :END:
 
@@ -185,6 +202,21 @@ Third, the REPL is a shared substrate. When the agent evaluates code, that code
 
 This is why the REPL becomes more important as the system matures. In early versions, it is a development tool. In v0.6.0 and beyond, it becomes a cognitive tool: the agent explores hypotheses by evaluating them, verifies the output of sub-agents by inspecting live state, and tests modifications before committing them to the knowledge graph.
 
+** The Cybernetic Loop: Why the Metabolic Pipeline Works
+:PROPERTIES:
+:CREATED:  [2026-05-07 Thu]
+:END:
+
+The Perceive → Reason → Act cycle is not a software architecture pattern. It is a cybernetic feedback loop — the mechanism by which a system steers itself toward a goal in a changing environment.
+
+Norbert Wiener defined cybernetics in 1948 as "control and communication in the animal and the machine." The metabolic pipeline implements this precisely: Perceive is the sensor (reading the environment), Reason is the controller (evaluating against goals and constraints), Act is the actuator (modifying the environment), and the tool-output feedback signal closes the loop (reading the effect of the action and adjusting the next perception).
+
+The Dispatcher gate stack is the negative feedback governor. When the LLM proposes an action that would violate an invariant, the Dispatcher blocks it and feeds the rejection trace back to the LLM for self-correction. This is homeostasis in W. Ross Ashby's cybernetic sense: the system maintains its internal stability by correcting deviations from its set point (the safety invariants). Without this negative feedback, the probabilistic engine would drift into hallucinated proposals that become progressively less grounded. The Dispatcher constrains it to the domain of safe, verifiable actions.
+
+The self-editing capability is second-order cybernetics — autopoiesis, the capacity of a system to create and maintain itself. Humberto Maturana and Francisco Varela defined this as the hallmark of living systems. When the agent detects an error, locates the faulty function, generates a corrected version, and hot-reloads it into the running image without restarting, it is modifying its own architecture while continuing to operate. Passepartout achieves this through Lisp's homoiconicity — code is data, and the running image is the environment. The skill engine loads every skill into a jailed Common Lisp package, validates its syntax, tests its trigger function in isolation, and only then promotes it to the live registry.
+
+This framing matters for two reasons. First, it places Passepartout in a lineage that predates and outlasts the current "LLM with tools" paradigm. The cybernetic principles of feedback, homeostasis, and autopoiesis are independent of any specific model architecture. They work whether the perceptual engine is an LLM, a vision model, or a symbolic parser. Second, it explains why the architecture gets more reliable over time — cybernetic systems improve through accumulated negative feedback corrections, not through better training data. Every blocked action is a correction. Every approved exception is a refined set point. The system converges on stability through use.
+
 ** Observability and the Thought Trace
 :PROPERTIES:
 :ID:       design-observability
@@ -276,74 +308,83 @@ This does not mean Passepartout refuses to use cloud services when available and
 * Token Economics and Performance Advantage
 :PROPERTIES:
 :ID:       design-token-economics
+:CREATED:  [2026-05-07 Thu]
 :END:
 
-This section analyzes how Passepartout's architectural decisions translate into token usage, latency, and cost versus competing agent designs (OpenClaw, Hermes, Claude Code).
+This section analyzes how Passepartout's architectural decisions translate into token usage, latency, and cost versus competing agent designs. It makes one empirical claim that is provable (deterministic gates cost 0 LLM tokens) and several structural claims that are testable (downward cost curve, tiered pricing, REPL economics). Specific cost multiples are architectural estimates only, pending the empirical audit at v0.5.0.
 
 ** The Core Insight: LLM as Expensive Resource, Not Default Engine
 
 Passepartout treats the LLM as a resource to be minimized. Every operation is designed to reduce LLM dependency. Competitors treat the LLM as the core engine through which all operations flow. This is not a difference of degree but of architecture.
 
-The three structural multipliers are:
+The structural multipliers are:
 
-*Sparse tree retrieval* — loading relevant subtrees (200-800 tokens per file) rather than full files (1,500-5,000 tokens) = ~5-10x reduction per file access
-2. *Deterministic safety* — 9-vector dispatcher gate runs in pure Lisp (0 LLM tokens per verification) versus prompt-based guardrails (200-500 tokens per action) = infinite multiplier
-3. *REPL verification* — catches errors in-image (milliseconds, 0 LLM tokens) versus LLM correction round-trips (500-2,000 tokens per retry)
+1. *Sparse tree retrieval* — the foveal-peripheral model renders relevant Org subtrees (titles and properties for peripheral nodes, full content for foveal and semantically relevant nodes). Active context stays at 2,000–4,000 tokens. A "load everything" architecture serializes the entire knowledge base at 50,000–150,000 tokens. The mechanism is provably cheaper; the exact multiplier depends on memex size and complexity.
 
-These compound. A coding session touching 20 files, performing 10 actions, and triggering 3 errors saves ~50,000-100,000 tokens compared to the same session with Claude Code.
+2. *Deterministic safety* — the 9-vector Dispatcher gate stack runs in pure Lisp. Every gate is a Common Lisp function, so verification costs 0 LLM tokens per action. Competitors use prompt-based guardrails consuming 100–500 LLM tokens per verification. With a zero on one side, the ratio is unbounded rather than a finite multiple: a Lisp function call costs no tokens, while a guardrail paragraph in a system prompt costs tokens proportional to its length.
 
-** Per-Task Type Guesstimate
+3. *REPL verification* — code is tested in the running image before it is committed. Errors surface in milliseconds at 0 LLM tokens. Competitors discover errors after generation and pay 500–2,000 tokens per correction round-trip. The REPL eliminates the most expensive kind of LLM call: the one that produced wrong code and needs a do-over.
 
-*** Coding (debugging, refactoring, PR review)
+4. *Hot state* — in a REPL-based agent, variables, file handles, subroutine results, and memory objects are already in memory. Every turn in a standard chat agent re-sends the full conversation history, so token costs grow quadratically with the number of turns: a 10-turn session pays for 55 turns of context (10 + 9 + 8 + ... + 1 = 55). In Passepartout, context is stored once in the Lisp image, so a 10-turn session pays for ~10 turns of context. This is an ~82% reduction (1 - 10/55) in protocol overhead alone, before any foveal-peripheral pruning. The argument is testable: send the same multi-turn session through both architectures and count tokens.
 
-| Operation                           | Passepartout            | Claude Code                 | Hermes (3-agent)             | Savings vs Claude     |
-|-------------------------------------+-------------------------+-----------------------------+------------------------------+-----------------------|
-| File access (30 files)              | 30 × 400 tok = 12,000   | 30 × 3,000 tok = 90,000     | 30 × 3,000 tok × 3 = 270,000 | 78,000 tok            |
-| Reasoning rounds (20)               | 20 × 3,000 tok = 60,000 | 20 × 4,000 tok = 80,000     | 20 × 3,000 tok × 3 = 180,000 | 20,000 tok            |
-| Error correction (5 caught by REPL) | 0 (REPL)                | 5 × 1,000 tok = 5,000       | 5 × 1,000 tok × 3 = 15,000   | 5,000 tok             |
-| Safety verification                 | 0 (deterministic)       | 500 tok/round × 20 = 10,000 | 200 tok/round × agents       | 10,000 tok            |
-| Agent coordination                  | 0                       | 0                           | 3,000-5,000 tok/task         | 0                     |
-| *Total*                             | *~72,000 tok*           | *~185,000 tok*              | *~475,000 tok*               | *~113,000 tok (2.6x)* |
+** The Compounding Cost Curve — Unique Among Agents
 
-Over a month of daily coding (20 sessions): ~2.3 million tokens saved. At typical API pricing ($2-15/M tokens), this saves $5-35/month.
+Every AI agent grows more expensive over time. Context histories accumulate. Safety instructions grow more elaborate. Guardrails become longer prompt paragraphs. The user's data grows. The only way to reduce cost in a standard agent is to cap context — sacrificing capability.
 
-*** Knowledge Management (Zettelkasten, research, note-taking)
+Passepartout has a downward cost curve. Four mechanisms compound:
 
-Passepartout's strongest domain. The Org-mode native format and sparse tree retrieval create a 10-40x advantage because knowledge bases are the worst case for "load everything" architectures.
+1. *Dispatcher learning (v0.3.0).* Every blocked action and approved exception becomes a deterministic rule. A file write that initially triggered a full LLM proposal → Dispatcher review → HITL approval → rule extraction loop eventually becomes a deterministic rule check. Each hardened rule permanently removes a future LLM call.
 
-| Operation                      | Passepartout                                           | Competitor                              | Savings   |
-|--------------------------------+--------------------------------------------------------+-----------------------------------------+-----------|
-| Context assembly (500-node KB) | Peripheral outline + ~5 foveal nodes = 2,000-4,000 tok | Full serialization = 80,000-150,000 tok | 40-75x    |
-| Semantic search (10 queries)   | Vector lookup in-image = 0 LLM tok                     | LLM-assisted search = 5,000 tok         | 5,000 tok |
-| Note creation (10 notes)       | Deterministic Org writes = 0 LLM tok                   | 10 × 800 tok = 8,000                    | 8,000 tok |
-| *Total per session*            | *~7,000 tok*                                           | *~95,000-165,000 tok*                   | *~13-24x* |
+2. *Symbolic induction (v0.5.0).* The agent extracts patterns from successful interaction sequences and converts them into reusable Lisp functions. A multi-step task that took 5,000 tokens today takes 0 tokens tomorrow — it's now a ~defun~. The Dispatcher learns what to block. Symbolic induction learns what to automate.
 
-*** Day-to-Day Life Management (calendar, tasks, reminders)
+3. *Native embedding inference (v0.4.1).* Every semantic search query runs against in-image vectors at 0 external tokens. Competitors use LLM-assisted search for most retrieval operations. Passepartout's retrieval is a vector cosine similarity check: pure math, no model call.
 
-| Operation                   | Passepartout                               | Competitor                     | Savings    |
-|-----------------------------+--------------------------------------------+--------------------------------+------------|
-| Background maintenance      | Deterministic heartbeat-driven = 0 LLM tok | Scheduled LLM calls or skipped | Variable   |
-| User interactions (30/day)  | 30 × 2,000 tok = 60,000                    | 30 × 4,000 tok = 120,000       | 60,000 tok |
-| Context queries by TODO/tag | Hash table scan = 0 LLM tok                | LLM-based search = 2,500 tok   | 2,500 tok  |
-| *Total per day*             | *~60,000 tok*                              | *~122,500 tok*                 | *~2x*      |
+4. *Prefix caching (v0.4.0).* The static portion of the system prompt (IDENTITY, TOOLS, LOGS format) is transmitted once per session. Dynamic content (CONTEXT, user prompt) is sent on each call. Anthropic's prompt caching gives a 90% discount on cached tokens. OpenAI caches automatically.
 
-The defining advantage: background maintenance (compaction, archiving, link repair) costs zero LLM tokens. Competing systems either skip this or pay LLM costs for it.
+After 12 months of daily use, Passepartout's per-session costs are expected to be 40–60% of baseline, while competitors' costs rise to 125–140% of baseline. The crossover point is estimated at 3–6 months. This is not a model quality claim — it is a structural property of the architecture.
 
-*** Chatting (casual conversation)
+** Tiered Pricing: Cheap Models for Simple Tasks, Free for Learned Patterns
 
-Chatting is inherently LLM-bound. Passepartout's edge is privacy filtering before content reaches the LLM and slightly smaller context footprint. Token savings are marginal (~1.3x).
+The model-tier router (v0.3.0) classifies every task by complexity and routes it to the cheapest capable model. Simple lookups go to tiny local models or deterministic hash table scans (0 LLM tokens). Text processing goes to mid-tier models. Complex planning and code generation go to the premium model. The consensus loop (v0.10.0) only fires for high-impact actions.
 
-** The Dispatcher Learning Curve: Cost Decreases Over Time
+The induced functions from symbolic induction (v0.5.0) compound this: every learned pattern that becomes a Lisp function moves from "cheap" to "free." Over time, an increasing fraction of the agent's daily operations cost 0 LLM tokens.
 
-A unique architectural property: Passepartout's cost curve descends while competitors' ascends.
+** Version-by-Version Cost Trajectory
 
-Passepartout: As the dispatcher accumulates deterministic rules from Human-in-the-Loop decisions, fewer actions require LLM proposals. A file write that initially triggered a full LLM proposal → dispatcher review → HITL approval → rule extraction loop eventually becomes a deterministic rule check. Each hardened rule permanently reduces future token costs.
+The following projections assume a coding session equivalent to ~20 files, 10 actions, and 3 errors, using the cheapest capable cloud provider. They are architectural estimates pending empirical audit at v0.5.0.
 
-Competitors: As context histories grow, safety instructions accumulate, and guardrails become more elaborate, each interaction costs more than the last. The only way to reduce cost is to cap context — sacrificing capability.
+| Version                            | Cost relative to Claude Code                             | Why                                                                |
+|------------------------------------+----------------------------------------------------------+--------------------------------------------------------------------|
+| v0.4.0 (with prefix caching)       | 1.5–2x cheaper                                           | Sparse retrieval + caching; no tools yet, tasks are simple         |
+| v0.5.0 (with symbolic induction)   | 1.5–2x cheaper, declining over time                      | Induced functions begin replacing LLM calls for repeated patterns  |
+| v0.7.0 (with MCP tools)            | 2–3x cheaper                                             | More complex tasks, but caching + induction compound               |
+| v1.0.0 (all pre-symbolic features) | 2–3x cheaper for coding, 10–40x for knowledge management | Full stack: sparse trees + caching + induction + native embeddings |
+| v3.0.0 (neurosymbolic)             | 5–10x cheaper                                            | 80% of reasoning in symbolic middle layer costs 0 LLM tokens       |
+| v4.0.0 (native inference)          | ~100% cheaper for local models                           | No API call. No per-token pricing. Electricity only.               |
 
-After 12 months of learning, Passepartout's core reasoning costs could drop to 40-60% of baseline, while competitors' costs rise to 125-140% of baseline.
+Knowledge management is Passepartout's strongest domain. A 500-node knowledge base assembled for the LLM as 2,000–4,000 tokens (foveal-peripheral) versus 80,000–150,000 tokens (full serialization) is a 40–75x difference in context alone. Semantic search in-image at 0 tokens versus LLM-assisted search at 5,000+ tokens extends the gap. Note creation via deterministic Org writes at 0 tokens versus LLM-generated notes at 800+ tokens each widens it further. Background maintenance (archiving, link repair, compaction) runs on heartbeat-driven cron jobs at 0 LLM tokens.
 
-The crossover point where Passepartout becomes structurally cheaper is estimated at 3-6 months depending on usage volume and task diversity.
+** Engineering Challenges and Solutions
+
+The architecture's advantages are genuine but unevenly distributed across task types. Three structural challenges have specific engineering solutions in the roadmap.
+
+*** Challenge: Situational Cost
+
+The sparse-tree and REPL advantages apply primarily to long-running, high-context tasks. A single-turn lookup ("what's on my calendar?") without a cost-conscious routing layer may consume comparable tokens to standard RAG. The architecture must prevent the agent from spending $5 of compute on a $0.01 question.
+
+*Solution:* The Resolution Budget (v0.5.0) is a lightweight pre-routing layer that classifies complexity before the Reason stage and assigns a cost envelope. Simple lookups take the fast path (deterministic, 0 LLM tokens, sub-second). Standard interactions use cached context and tiered models. Deep reasoning engages the full deliberative pipeline. The tier classifier (v0.8.1) adds safety-based routing: dangerous operations always take the full verification path regardless of cost. Together, cheap simple tasks take the cheap fast path; dangerous complex tasks take the expensive safe path.
+
+*** Challenge: Single-Turn Latency
+
+The Dispatcher gate stack, structured output enforcement, and verification loop add latency to every turn. Time-to-first-token is inherently higher than a raw chat agent that processes the first response directly. The goal is not to match raw chat-agent TTFT on every interaction — it is to make the verification overhead imperceptible for trivial tasks and worth the wait for complex ones.
+
+*Solution:* Four mechanisms compound. The Resolution Budget (v0.5.0) routes simple lookups through a fast path with minimal gate checks. Streaming responses (v0.6.3) hide latency by showing progressive output: the user sees the agent typing while verification runs. Interrupt-and-redirect (v0.6.3) lets the user kill a wrong response mid-generation and redirect the agent without waiting for a complete wrong answer. The self-configuring setup binary (v0.5.0) includes a tiny Syntax Scout model, a 1.5B-parameter model fine-tuned on Common Lisp + Org-mode idioms, that pre-validates Lisp forms before the Dispatcher, reducing rejection-loop cycles.
+
+*** Challenge: Symbolic Brittleness
+
+Deterministic gates reject code with minor syntax errors that a prompt-based guardrail would pass. A 99% correct Lisp form with one mismatched parenthesis is blocked entirely during the ~read-from-string~ stage or by the syntax validation gate. This is the correct safety posture — but without mitigation, the user experience is "the agent keeps failing to do simple things because of formatting errors."
+
+*Solution:* Three mechanisms compound. Structured Output Enforcement (v0.6.2) validates plist syntax before the Dispatcher, providing LLM feedback with the specific parse error. The Syntax Scout — the tiny model from the setup bootstrapper — pre-validates Lisp forms during the Reason stage and auto-corrects common patterns (parenthesis balance, keyword normalization). The self-correction loop (up to 3 retries with rejection trace feedback at the Reason stage) gives the LLM multiple attempts. Together, these mechanisms drop the failure rate from "every syntax error blocks" to "the LLM learns to produce valid Lisp after the first rejection, and the Syntax Scout catches the patterns that the LLM repeatedly misses."
 
 ** Local LLM Viability
 
@@ -382,7 +423,9 @@ Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1
 
 *Note:* Observations about OpenClaw and Hermes Agent are based on their public documentation and repositories as of 2026-05. OpenClaw (github.com/openclaw/openclaw) is a TypeScript personal AI assistant by @steipete with a Node.js gateway, 25+ messaging channels, and Canvas/voice companion apps. Hermes Agent (github.com/NousResearch/hermes-agent) is a Python fork by Nous Research with a built-in learning loop, full TUI, and sub-agent delegation. Both use prompt-based safety guardrails rather than deterministic gates. Architectural claims should be re-verified as these projects evolve.
 
-*Conclusion:* Passepartout's architecture is designed to produce 2-3x token savings for coding, 13-24x for knowledge management, and 2x for life management at v1.0.0 maturity. The three structural advantages — sparse trees, deterministic safety, and REPL verification — compound. The critical risk is implementation gap: achieving the retrieval precision, dispatcher learning, and REPL integration depth required to realize the design.
+*Conclusion:* Passepartout's architecture has a structural downward cost curve — a property that no competitor claims. The Dispatcher learning curve, symbolic induction, native embedding inference, and prefix caching compound to reduce LLM dependency over time. The cost advantage is not a magnitude claim (which depends on usage patterns and model selection) but a directional claim (costs decline with use, competitors' costs rise). The 80% of computation that moves to the symbolic middle layer at v3.0.0 (zero LLM tokens) and the 100% local-inference capability at v4.0.0 (zero API cost) define the long-term ceiling: eventually, the only LLM cost is input translation and output formatting. Everything else is pure Lisp.
+
+The critical risk is implementation: achieving the retrieval precision, Dispatcher learning depth, REPL integration, and symbolic engine maturity required to realize the architecture's economic potential. The token audit harness at v0.5.0 will provide the first empirical measurements.
 
 *Note:* The token savings projections in this section (2–3x for coding, 13–24x for knowledge management) are architectural estimates based on the sparse-tree retrieval and deterministic safety mechanisms. They have not yet been empirically verified. A token audit harness will produce measured comparisons at v0.5.0 (Token Economics & Prompt Efficiency). Until then, the README cites the mechanisms (sparse-tree rendering, deterministic gates) rather than specific magnitudes.
 * Open Questions and Risks
diff --git a/docs/ROADMAP.org b/docs/ROADMAP.org
index c5234e7..e9de451 100644
--- a/docs/ROADMAP.org
+++ b/docs/ROADMAP.org
@@ -419,7 +419,9 @@ Rationale: The TUI is Passepartout's only interface. OpenClaw distributes across
 
 The features in this version were originally sequenced as v0.3.x patches but represent feature-level scope. They activate the architectural advantages designed in v0.1.0–v0.3.0, harden the self-build safety boundary, and expand Passepartout's interaction surfaces beyond the terminal TUI. Each feature depends on infrastructure already in place — the wiring, the sandbox, the gate trace — and activates it.
 
-*** TODO Semantic retrieval activation
+*** DONE Semantic retrieval activation
+CLOSED: [2026-05-06 Wed]
+- State "DONE" from "TODO" [2026-05-06 Wed]
 
 Rationale: Two independent failures prevent the foveal-peripheral semantic retrieval path from ever firing. First, ~context-awareness-assemble~ never passes ~:foveal-vector~ to ~context-object-render~, so the renderer receives ~nil~ for ~foveal-vector~ and the similarity calculation always returns 0.0. Second, the default ~:hashing~ embedding backend uses SHA-256 (a cryptographic hash with the avalanche property) as a similarity function. SHA-256 is designed to produce entirely different outputs for nearly identical inputs — the property that makes it secure for integrity verification is precisely what makes it useless for semantic retrieval. A content-addressed Merkle tree correctly uses SHA-256 for identity; a retrieval engine needs a similarity function, not an identity function. The infrastructure for real embeddings (~local~ with Ollama, ~openai~ with the embeddings API) is fully implemented and working — this release activates the last-mile wiring and replaces the semantically blind default with a zero-dependency algorithm that actually captures textual overlap.
 
@@ -429,7 +431,9 @@ Rationale: Two independent failures prevent the foveal-peripheral semantic retri
 - Add ~EMBEDDING_PROVIDER~ guidance to the setup wizard: explain that ~:hashing~ is the default offline fallback, ~:local~ requires Ollama with ~nomic-embed-text~, and ~:openai~ uses the paid embeddings API.
 - Add FiveAM test: ingest two semantically related nodes ("implement login form" and "add password authentication"), verify cosine similarity > 0.0 with the trigram backend.
 
-*** TODO Self-build safety boundary
+*** DONE Self-build safety boundary
+CLOSED: [2026-05-06 Wed]
+- State "DONE" from "TODO" [2026-05-06 Wed]
 
 Rationale: Self-building (the agent modifying its own source code) begins at v0.7.1 when the tool ecosystem and test runner are in place. But self-building without path-level write protection means the agent can modify the very pipeline code that is currently executing — the ~core-*~ files that implement the Perceive-Reason-Act cycle, the Merkle-tree memory, the skill engine loader, and the Dispatcher gate stack itself. A hallucination or a logic error during self-building that corrupts ~core-loop-reason.lisp~ destroys the agent's ability to reason about and fix the corruption. The "thin harness" is not privileged code in the architectural sense (homoiconicity means any code can be modified at runtime), but it must be *protected* code — modifications to the harness require a human in the loop, enforced by the Dispatcher's path-protection gate, not by convention.
 
@@ -439,12 +443,14 @@ This is the corollary to "thin harness, fat skills": the harness is thin enough
 - The blocked action produces a Flight Plan (HITL approval required). The human reviews the proposed core change in an Org buffer before approving. This is the same mechanism that governs shell commands and network exfiltration — the core protection is a path-specific instance of the existing gate, not a new gate.
 - Implement a ~SELF_BUILD_MODE~ env var. When ~SELF_BUILD_MODE=true~ (default ~false~):
   - Core path protection is active (writes blocked, HITL required)
-  - Non-core writes proceed through the standard Dispatcher gate (permissions table + policy + Bouncer)
+  - Non-core writes proceed through the standard Dispatcher gate (permissions table + policy + learned Dispatcher rules)
   - ~SELF_BUILD_MODE=false~ disables core protection entirely — useful during initial development when the human is manually editing core files and doesn't want every save to trigger a Flight Plan
 - Telemetry: track self-build actions (core modifications proposed, core modifications approved, core modifications denied). This is the dataset that the Dispatcher's learning system uses in v3.0.0 to understand which core modifications are safe enough to automate.
 - Add FiveAM test: simulate a write to ~core-loop.lisp~, verify the Dispatcher returns a ~:LOG~ rejection with ~"protected path"~ in the message.
 
-*** TODO TUI Differentiator Visualization
+*** DONE TUI Differentiator Visualization
+CLOSED: [2026-05-06 Wed]
+- State "DONE" from "TODO" [2026-05-06 Wed]
 
 Rationale: Three architectural elements exist today in the daemon that no competitor can render — the Dispatcher gate trace, the foveal-peripheral focus map, and the rules-learned counter. All three run in pure Lisp with 0 LLM tokens. None are visible to the user. Making them visible turns Passepartout's architecture from an internal mechanism into a trust-building UX — the user sees exactly which safety gates passed, exactly what the agent is focusing on, and exactly how many rules the Dispatcher has learned from their decisions. No competitor can ship this because none has deterministic gates to trace, foveal-peripheral context to map, or a rule-synthesizing Dispatcher to count.
 
@@ -454,7 +460,9 @@ Rationale: Three architectural elements exist today in the daemon that no compet
 - Expanded theme: replace the 7-flat-color ~*tui-theme*~ with a 25-color layered system organized by message category (roles, content types, tool visibility, gate states, status). See the design discussion for the full color mapping. Implement a ~/theme <name>~ command that swaps between named presets (~dark~, ~light~, ~solarized~, ~gruvbox~). Theme change persists to disk and reloads on next session.
 - Add FiveAM tests: gate trace renders correctly for pass/block/approval states; focus map updates when ~foveal-id~ changes; rule counter increments on HITL approval.
 
-*** TODO Gateway QA, Discord, Slack + Emacs Bridge
+*** DONE Gateway QA, Discord, Slack + Emacs Bridge
+CLOSED: [2026-05-06 Wed]
+- State "DONE" from "TODO" [2026-05-06 Wed]
 
 Rationale: Passepartout currently has Telegram and Signal gateways in the codebase, both untested. The setup wizard has Slack as a configurable option with no implementation. Two messaging channels is not competitive — OpenClaw has 25+, Hermes Agent has 6+. But more critically: the Lisp crowd is Passepartout's natural audience, and they live in Emacs. An Emacs bridge that speaks the framed TCP protocol is trivial to implement (the protocol is ~200 lines of Lisp; porting to elisp is straightforward) and turns every Emacs buffer into a Passepartout interaction surface. This is not the deep Emacs integration of v0.10.2 (where the agent controls Emacs) — this is Emacs controlling the agent over TCP. The Emacs user selects a region, hits ~M-x passepartout-send-region~, and the agent responds in a dedicated buffer. They never leave their editor.
 
@@ -478,16 +486,28 @@ Emacs Bridge:
 - Agent modifies an Org file → Emacs receives ~:buffer-update~ via the bridge → the buffer is refreshed (~revert-buffer~ or targeted replacement).
 - The Emacs bridge is the daily driver for Lisp users. The TUI remains for non-Emacs users and for the differentiator visualizations. Emacs users get the gate trace and focus map as Org property drawers in the response buffer — same data, elisp-native rendering.
 
-**** TODO Native embedding inference
+**** DONE Native embedding inference
+CLOSED: [2026-05-07 Thu]
 
-Rationale: The foveal-peripheral model depends on vector similarity to surface semantically related nodes. Without vectors, retrieval is depth-2 truncation with no semantic boosting. The trigram Jaccard fallback provides real lexical signal — "login bug" shares trigrams with "authentication error" — but cannot surface nodes with zero lexical overlap ("password reset flow" vs "login broken"). A real embedding model closes this gap. Embedding inference is 10x simpler than chat LLM inference: single forward pass, no autoregressive decoding, no KV cache, no streaming. The CFFI binding is ~150 lines of Lisp.
+Implemented: in-process embedding inference via CFFI binding to llama.cpp.
 
-- FFI binding to llama.cpp's embedding API (~150 lines of CFFI). Call ~llama_get_embeddings~ after a single forward pass. No KV cache, no sampling, no streaming required — embedding models are BERT-family, single-pass.
-- Ship all-MiniLM-L6-v2 (23MB GGUF) as the bundled embedding model. 384-dimensional vectors, CPU-friendly (<100ms per document on any modern CPU), produces semantically meaningful vectors with zero external dependencies.
-- ~embedding-backend-local~ detects the bundled model at configure time. If present, uses in-process inference by default. Falls back to Ollama (~EMBEDDING_PROVIDER=local~) or OpenAI (~EMBEDDING_PROVIDER=openai~) if the model is missing or the user prefers external inference.
-- ~EMBEDDING_PROVIDER~ default becomes ~:native~ (the bundled model). The model file lives at ~~/.local/share/passepartout/models/all-MiniLM-L6-v2.gguf~, downloaded on first ~configure~ if not present.
-- The trigram Jaccard backend remains as a further fallback for environments where even 23MB is too large (embedded, resource-constrained). SHA-256 hashing is removed entirely — it was semantically blind.
-- Add FiveAM test: embedding a document with the native backend produces a 384-dimensional float vector; identical documents produce identical vectors.
+- FFI binding to llama.cpp's current (non-deprecated) embedding API via a C wrapper library (~/usr/local/lib/libllama_wrap.so~) that bridges CFFI pointer params to llama.cpp struct-by-value calls
+- Builds on ~/usr/local/lib/libllama.so~ (llama.cpp shared library)
+- Ship nomic-embed-text-v1.5 (80MB Q4_K_M GGUF) as the bundled embedding model. 768-dimensional vectors (nomic-bert, 12 layers), CPU-friendly, <100ms per document on any modern CPU
+- ~EMBEDDING_PROVIDER=native~ enables the native backend; model preloads at daemon startup (~30s)
+- Lazy loading via ~*embedding-backend* :native~ also works (first call blocks ~45s for model init)
+- C wrapper functions: ~llama_wrap_model_load~, ~llama_wrap_new_context~, ~llama_wrap_encode~, ~llama_wrap_batch_init~, ~llama_wrap_batch_free~
+- Struct sizes verified via C sizeof/offsetof: llama_model_params (72B), llama_context_params (136B), llama_batch (56B)
+- BERT pooling: uses ~llama_get_embeddings_seq~ for sequence-level embedding
+- ~sb-int:set-floating-point-modes :traps nil~ required before any llama.cpp call (FPU state conflict)
+- ~llama_backend_init~ required before model load
+- ~llama_model_get_vocab~ + ~llama_vocab_n_tokens~ replaces deprecated ~llama_n_vocab~
+- ~llama_tokenize~ takes ~vocab*~ not ~model*~ (API change since earlier llama.cpp versions)
+- Exports: ~embedding-backend-native~, ~embedding-native-load-model~, ~embedding-native-unload~, ~embedding-native-ensure-loaded~, ~embedding-native-get-dim~
+- FiveAM tests: availability, loading, dimensions (768), self-similarity (1.0), semantic similarity ranking
+- The trigram Jaccard backend remains as the default fallback for zero-config deployments
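+
+A minimal sketch of what the dimension and self-similarity tests check, written in portable Common Lisp with no FiveAM dependency. ~cosine-similarity~ and ~embedding-sane-p~ are hypothetical helper names, and ~embed-fn~ stands in for whichever function returns the vector; the real signature of ~embedding-backend-native~ is not assumed here.
+
+#+begin_src lisp
+(defun cosine-similarity (a b)
+  "Cosine similarity of two equal-length numeric vectors A and B."
+  (assert (= (length a) (length b)))
+  (let ((dot 0d0) (na 0d0) (nb 0d0))
+    ;; Accumulate in double precision even when inputs are single-floats.
+    (map nil (lambda (x y)
+               (incf dot (* x y))
+               (incf na (* x x))
+               (incf nb (* y y)))
+         a b)
+    (if (or (zerop na) (zerop nb))
+        0d0
+        (/ dot (sqrt (* na nb))))))
+
+(defun embedding-sane-p (embed-fn text &key (dim 768) (tolerance 1d-6))
+  "True when EMBED-FN (hypothetical) returns a DIM-length vector for TEXT
+and embedding TEXT a second time gives cosine similarity 1.0."
+  (let ((v1 (funcall embed-fn text))
+        (v2 (funcall embed-fn text)))
+    (and (= (length v1) dim)
+         (< (abs (- 1d0 (cosine-similarity v1 v2))) tolerance))))
+#+end_src
+
+The tolerance is an assumption: quantized models may not reproduce a similarity of exactly 1.0 in floating point, so the check compares against a small epsilon rather than using ~=~.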
+
+- State "DONE" from "TODO" [2026-05-07 Thu]
 
 *** Competitive Advantage Analysis — v0.4.0 Summary
 
@@ -499,6 +519,25 @@ The TUI differentiator visualizations are Passepartout's permanent UX advantage.
 
 The messaging gateways and Emacs bridge expand Passepartout's interaction surface from a single terminal TUI to four surfaces: terminal, Telegram/Signal/Discord/Slack messaging, Emacs, and voice (via the voice gateway in v0.7.3). The Emacs bridge is strategically critical — the Lisp crowd is Passepartout's natural audience, and they live in Emacs. An Emacs bridge that speaks the framed TCP protocol turns every Emacs buffer into a Passepartout interaction surface. Combined with the gate trace and focus map rendered as Org property drawers in the response buffer, Emacs users get the same differentiator visualizations as TUI users — same data, elisp-native rendering.
 
+** v0.4.1: Design Cleanup
+
+*** TODO Remove system-prompt-augment mechanism
+
+Rationale: The ~system-prompt-augment~ slot on the skill struct lets skills inject always-on text into every LLM system prompt via a ~maphash~ over ~*skill-registry*~ in ~think()~ (core-loop-reason.lisp:83-92). Only one skill, ~programming-repl~, uses it, and it does so as a backdoor: the skill's trigger is hardcoded to ~nil~, so it never fires as an active skill. Its sole contribution is injecting a REPL-first mandate into every system prompt. The other skills (roughly 24) have nil augments and are skipped by the ~when aug-fn~ guard. This is architecturally wrong: standing mandates (always-on rules) should live in a dedicated ~*standing-mandates*~ list, not piggyback on a skill that is never triggered. The mechanism also fuels a false claim in DESIGN_DECISIONS about 3,000-8,000 tokens of overhead; the actual overhead is about 40 tokens from the one active augment.
+
+- Remove ~system-prompt-augment~ slot from the ~skill~ defstruct and ~defskill~ macro (core-skills.org:78, core-skills.org:121-133).
+- Remove the ~maphash~ skill-augments collection block from ~think()~ and the associated ~(or skill-augments "")~ injection in the system-prompt ~format~ call (core-loop-reason.org:83-95, core-loop-reason.org:196-198).
+- Remove ~:system-prompt-augment #'repl-mandate~ from ~programming-repl~'s ~defskill~ (programming-repl.org:269).
+- Introduce ~*standing-mandates*~ (a list of functions, each returning a string). Inject their output into the IDENTITY section of the system prompt alongside ~assistant-name~. Move ~repl-mandate~ there: ~(push #'repl-mandate *standing-mandates*)~.
+- Tangle the corresponding lisp/ files.
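+
+A minimal sketch of the proposed list and how it might be rendered into the IDENTITY section. The body of ~repl-mandate~ here is placeholder text, and ~render-standing-mandates~ is a hypothetical name; neither is taken from the existing source.
+
+#+begin_src lisp
+(defvar *standing-mandates* '()
+  "Function designators of no arguments, each returning a string.")
+
+(defun repl-mandate ()
+  "Placeholder mandate text (assumption; the real text lives in programming-repl)."
+  "Prefer verifying code in the REPL before writing it to disk.")
+
+;; PUSHNEW on the symbol keeps a reload from registering the mandate twice.
+(pushnew 'repl-mandate *standing-mandates*)
+
+(defun render-standing-mandates ()
+  "Call every mandate in registration order and join the results with newlines."
+  (format nil "~{~A~^~%~}"
+          (mapcar #'funcall (reverse *standing-mandates*))))
+#+end_src
+
+Registering the symbol rather than ~#'repl-mandate~ is a design choice of this sketch: redefining the function at the REPL then takes effect without touching the list.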
+
+*** TODO Fix false token-overhead claims in docs
+
+Rationale: Two documents claim the ~system-prompt-augment~ mechanism can waste 3,000-8,000 tokens per think() call (DESIGN_DECISIONS line 435, ROADMAP line 504). This conflates the ~maphash~ iteration (cheap hash walk, no token cost) with the augments actually emitted (only ~programming-repl~ emits anything, about 40 tokens; the ~when aug-fn~ guard skips the other 24 nil-augment skills). Once the mechanism is removed (the TODO above), these claims become doubly false.
+
+- DESIGN_DECISIONS: Rewrite or remove bullet 2 under "Open Questions and Risks" (line 435). Replace with a corrected note on standing mandates via ~*standing-mandates*~.
+- ROADMAP v0.5.0 intro (line 504): Remove or rewrite the claim that "system prompt overhead alone could reach 3,000-8,000 tokens per call before user input is even processed." The fixed overhead is not from skill augments — it is from the IDENTITY, TOOLS, CONTEXT, and LOGS sections, which prefix caching addresses.
+
 ** v0.5.0: Token Economics & Prompt Efficiency
 
 The architecture's single largest gap versus SOTA: Passepartout currently spends tokens like a research prototype. Every ~think()~ call rebuilds and retransmits the full system prompt — IDENTITY + TOOLS + CONTEXT + LOGS + SKILL_AUGMENTS — with no caching, no budget, and no incremental assembly. The foveal-peripheral model prunes memory content but doesn't touch the fixed overhead. With 20+ skills by v1.0.0, system prompt overhead alone could reach 3,000–8,000 tokens per call before user input is even processed.
@@ -549,9 +588,30 @@ Rationale: The current ~passepartout configure~ flow is a bash script that detec
 - The setup binary is a bridge. It gets the system installed and configured, then gets out of the way. The final system is a live Lisp image, not a sealed binary.
 - Add FiveAM test: the deterministic path succeeds on a system with all dependencies pre-installed; the LLM-assisted path correctly classifies 10 common package-manager error messages.
 
+*** TODO Resolution budget
+
+Rationale: Without cost-aware routing, every request goes through the full deliberative pipeline. A "what's my calendar?" query costs the same overhead as a multi-file refactor. The resolution budget prevents the agent from spending $5 of compute on a $0.01 question.
+
+- Lightweight pre-routing layer classifies complexity before the Reason stage: simple lookup (deterministic, 0 LLM tokens), standard interaction (cached context, tiered model), deep reasoning (full deliberative path with all gates).
+- Simple lookups take the fast path: query memory, check file, list TODOs — all in-process function calls, no LLM invocation, sub-second response.
+- Tasks exceeding their assigned complexity budget are flagged, reclassified by the tier router, or escalated to the user with a cost estimate.
+- The resolution budget is a skill — reloadable, tunable per user preference (~RESOLUTION_BUDGET~ env var with per-tier caps).
+- This complements the tier classifier (v0.8.1), which handles safety routing; the resolution budget handles cost routing. Together, cheap simple tasks take the cheap fast path and dangerous complex tasks take the expensive safe path.
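+
+A rough sketch of the pre-routing idea, under stated assumptions: requests are plists with ~:intent~ and ~:files-touched~ keys, and the intent names, tier names, and token caps below are all illustrative values rather than anything specified above.
+
+#+begin_src lisp
+(defparameter *deterministic-intents* '(:list-todos :query-memory :check-file)
+  "Hypothetical intents answerable by in-process calls with no LLM.")
+
+(defparameter *tier-caps*
+  '((:simple-lookup . 0) (:standard . 4000) (:deep-reasoning . 40000))
+  "Hypothetical per-tier token caps.")
+
+(defun classify-request (request)
+  "Pick a complexity tier for REQUEST, an assumed plist."
+  (cond ((member (getf request :intent) *deterministic-intents*)
+         :simple-lookup)
+        ((> (getf request :files-touched 0) 1)
+         :deep-reasoning)
+        (t :standard)))
+
+(defun budget-verdict (tier tokens-used)
+  "Return :ok while TOKENS-USED is within the cap for TIER, else :escalate."
+  (if (<= tokens-used (cdr (assoc tier *tier-caps*)))
+      :ok
+      :escalate))
+#+end_src
+
+For example, ~(classify-request '(:intent :list-todos))~ selects the fast path, while ~(budget-verdict :standard 5000)~ signals that the task should be reclassified or escalated.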
+
+*** TODO Symbolic induction
+
+Rationale: The Dispatcher currently learns from blocked and approved actions — it accumulates rules about what to allow and what to deny. Symbolic induction extends this: the agent extracts patterns from successful interaction sequences and converts them into reusable Lisp functions. When the agent successfully completes a multi-step task (e.g., "find all TODOs tagged @urgent, sort by deadline, and create a summary"), it extracts the interaction pattern as a ~defun~ that replaces future LLM calls for similar tasks. This is the mechanism by which the system genuinely needs the LLM less over time — not just by blocking fewer dangerous actions, but by replacing probabilistic reasoning with deterministic functions. The Dispatcher learns what to prevent. Symbolic induction learns what to automate.
+
+- Scan successful interaction sequences (user request → agent actions → successful outcome) and extract reusable patterns: what was asked, what tools were called, what the verification chain looked like, and what the final result was.
+- When a pattern repeats across 3+ sessions with consistent outcomes, trigger induction: the LLM proposes a Lisp function implementing the pattern, the REPL verifies it against historical inputs, and if it passes, the function is registered as a skill.
+- Induced functions live in ~passepartout.skills.induced-<name>~ — jailed packages, same loading sandbox as user-written skills. They can be inspected, modified, or removed by the user.
+- The rule counter in the TUI status bar gains a second counter: ~[Rules: 47 | Induced: 12]~ — rules learned from HITL decisions vs functions learned from successful sessions.
+- Induced functions are proposed, not automatically applied. The next time a similar request arrives, the agent checks: "I have an induced function for this. Use it?" The user approves the first invocation, and subsequent invocations of the same function are automatic.
+- Add FiveAM test: replay a historical interaction sequence, verify the induced function produces the same outcome.
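+
+The "3+ sessions with consistent outcomes" trigger could be tracked roughly as follows. This is a sketch only: the pattern key (a list of tool names), the session identifiers, and the function names are assumptions made for illustration.
+
+#+begin_src lisp
+(defvar *observations* (make-hash-table :test #'equal)
+  "Pattern key -> list of (session-id . outcome) for successful runs.")
+
+(defun record-success (pattern session-id outcome)
+  "Remember that PATTERN produced OUTCOME during SESSION-ID."
+  (push (cons session-id outcome) (gethash pattern *observations*)))
+
+(defun induction-candidate-p (pattern &key (min-sessions 3))
+  "True when PATTERN was seen in at least MIN-SESSIONS distinct sessions
+and every recorded outcome is the same."
+  (let* ((obs (gethash pattern *observations*))
+         (sessions (remove-duplicates (mapcar #'car obs) :test #'equal))
+         (outcomes (remove-duplicates (mapcar #'cdr obs) :test #'equal)))
+    (and (>= (length sessions) min-sessions)
+         (= (length outcomes) 1))))
+#+end_src
+
+Counting distinct sessions rather than raw repetitions keeps one long session from triggering induction on its own.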
+
 *** Competitive Advantage Analysis — v0.5.0 Summary
 
-Token economics is the dimension where the architecture's theoretical advantage becomes operationally real. The foveal-peripheral model and deterministic gates reduce the tokens *needed* per task; prompt caching and incremental assembly reduce the tokens *spent* per task. Combined, the 2–3x coding savings and 13–24x knowledge management savings in the DESIGN_DECISIONS token analysis become achievable rather than aspirational.
+Token economics is the dimension where the architecture's theoretical advantage becomes operationally real. The foveal-peripheral model and deterministic gates reduce the tokens *needed* per task; prompt caching and incremental assembly reduce the tokens *spent* per task. Combined, the 2–3x coding savings and 13–24x knowledge management savings in the DESIGN_DECISIONS token analysis become achievable rather than aspirational. Symbolic induction extends this downward cost curve into new territory: the agent doesn't just block fewer dangerous actions — it automates away entire categories of LLM calls by learning reusable Lisp functions from successful interaction patterns.
 
 The cost tracking and budget enforcement are defensive advantages: no competitor gives the user visibility into per-task LLM cost. Claude Code and Copilot obscure cost behind flat-rate subscriptions. Passepartout's transparent cost model is a sovereignty feature — the user knows what the agent spends on their behalf and can cap it.
 
@@ -796,33 +856,35 @@ But it is still fundamentally probabilistic at its core. The symbolic engine ver
 
 v2.0.0 is where Passepartout stops being a daemon with clients and becomes the environment. The agent's cognitive loop, the user's editor, the user's shell, and the user's browser run in the same Common Lisp image. The Dispatcher gate stack verifies every action regardless of who initiated it — user or agent. The distinction between "tool" and "self" dissolves.
 
-**Why this version matters for UX parity.** v0.4.0 through v1.0.0 give Passepartout four interaction surfaces (TUI, messaging apps, Emacs, voice). This is competitive with Hermes Agent's TUI + messaging but not with OpenClaw's 25+ channels + Canvas + macOS/iOS apps + voice wake words. v2.0.0 doesn't try to match OpenClaw's breadth. It inverts the problem: instead of building more clients, it builds a platform where the agent's environment and the user's environment are the same process, separated not by a sandbox but by the Dispatcher gate stack. The editor IS the agent's prompt. The shell IS the agent's actuator. The browser IS the agent's web research tool. There are no clients — there is one Lisp image, one address space, one org-mode file system. This is only possible because the deterministic safety gates are in-process pure Lisp functions rather than out-of-process sandboxes.
+*Why this version matters for UX parity.* v0.4.0 through v1.0.0 give Passepartout four interaction surfaces (TUI, messaging apps, Emacs, voice). v2.0.0 inverts the problem: instead of building more clients, it builds a platform where the agent's environment and the user's environment are the same process, separated not by a sandbox but by the Dispatcher gate stack. The editor IS the agent's prompt. The shell IS the agent's actuator. The browser IS the agent's web research tool. There are no clients — there is one Lisp image, one address space, one Org-mode file system.
 
-**Components:**
+*Architectural principle: Browser inside Lisp, not Lisp inside browser.* Lisp is the parent process. It owns the window, the memory, and the input loop. The rendering engine (WebKit/Blink) is a library that paints pixels inside a Lisp buffer. The user can redefine functions while browsing without restarting. Keybinding lookups happen in microseconds (SBCL machine code) — the browser cannot "steal" shortcuts.
 
-| Component       | What it replaces        | Technology                          | Status at v1.0.0 |
-|-----------------+------------------------+-------------------------------------+-------------------|
-| Lish editor     | Emacs, VS Code          | McCLIM or Croatoan-based TUI        | TUI exists (Croatoan) |
-| Lish shell      | Bash, zsh               | Common Lisp REPL (already Lisp)     | Shell actuator exists |
-| Nyxt browser    | Chrome, Firefox         | Nyxt (Common Lisp browser)          | Playwright MCP (v0.7.0) |
-| Web interface   | Notion, Obsidian Publish | Org → HTML static site generator   | Not started |
-| Daemon + memory | "the OS"                | SBCL image + Org files              | Production (v1.0.0) |
+*** Qt/QML via EQL5 — the rendering surface
+
+- Qt/QML (via EQL5) is the UI framework. EQL5 exposes the full Qt C++ API from Common Lisp. QML is declarative — it matches Lisp's generation model.
+- Desktop: native look and feel on Linux, macOS, and Windows.
+- Mobile: Qt runs natively on iOS and Android. Android uses F-Droid for the unrestricted version and the Play Store for the sandboxed version. iOS uses Guideline 4.7 ("Educational/Developer Tool" loophole, no JIT compilation).
+- Safety Bridge for mobile: Lisp code can manipulate browser/files but cannot touch hardware (GPS, camera, contacts) without standard permission pop-ups.
+- The minibuffer: a universal command line at the bottom of the screen. Not an Emacs modeline. Not a VS Code command palette. A single command surface for every action — edit files, navigate web, run Lisp expressions, invoke agent commands. ~M-x~ for everything.
 
 *** Lish — the Common Lisp editor
 
-Not elisp. Not Emacs. A multi-threaded Common Lisp editor built on McCLIM or the Croatoan TUI infrastructure. The complete system prompt lives in an Org buffer — the agent's identity (~AGENTS.md~ equivalent), its skill registry, its memory, and its reasoning are visible and editable as Org text. The user modifies the agent's prompt and the agent reflects the change immediately — the prompt is a file in memory, not a hidden string in a config.
+Not elisp. Not Emacs. A multi-threaded Common Lisp editor rendered via Qt/QML. The complete system prompt lives in an Org buffer — the agent's identity, its skill registry, its memory, and its reasoning are visible and editable as Org text. The user modifies the agent's prompt and the agent reflects the change immediately — the prompt is a file in memory, not a hidden string in a config.
 
 Org-babel for interactive evaluation: source blocks in Org files are executable. The user evaluates a ~#+begin_src lisp~ block and the result appears inline. The agent evaluates blocks to verify code before writing. The REPL is not a separate window — it is the Org buffer in which the agent and user both work.
 
-The editor and the agent share the same Lisp image. The editor is not a client that connects to a daemon — it IS the daemon process. The TUI from v0.4.0 (with word wrap, streaming, gate trace, focus map) is the editor's rendering surface. The Emacs bridge (v0.4.0) remains for users who prefer Emacs until Lish matures.
+The editor and the agent share the same Lisp image. The editor is not a client that connects to a daemon — it IS the daemon process. The TUI from v0.3.6 (with word wrap, streaming, gate trace, focus map) is the editor's rendering surface.
 
-*** Nyxt — the Common Lisp browser
+*** Nyxt — the Common Lisp browser (three erosion stages)
 
-Nyxt is a browser written in Common Lisp that renders web content in buffers. In v2.0.0, it replaces the Playwright MCP bridge (v0.7.0) as the agent's web interaction surface. The agent controls Nyxt by calling Nyxt functions directly — no subprocess, no serialization, no protocol. Navigation, form filling, data extraction, screenshot capture all happen within the same Lisp image.
+The browser is not a one-time feature. It is a multi-year erosion of the rendering stack toward pure Lisp:
 
-This matters because Playwright (Node.js subprocess, v0.7.0) and vision (screenshot + LLM analysis, v0.9.1) give the agent web access but with a process boundary. Nyxt eliminates the boundary. The agent's browser session shares memory with the agent's cognitive loop. The agent can inspect the DOM as Lisp data structures. It can respond to page events by injecting signals into its own pipeline.
+*Stage 1 — Qt + WebKit.* Qt provides window management and native widgets. WebKit renders web content inside a Lisp buffer. Network requests via dexador (pure Lisp). HTML parsed via Plump (pure Lisp). Layout via Yoga (C-based Flexbox, wrapped via FFI). JavaScript via embedded QuickJS. This stage delivers a working browser in months, not years.
 
-The browser also serves as the user's web interface. Org files in ~/memex are rendered as static HTML by a zero-JS Org-to-HTML converter running in Nyxt. The user browses their memex through the same browser the agent uses to research the web. No separate web server. No deployment. It's a directory on disk rendered by a local browser.
+*Stage 2 — S-expression DOM.* Lisp builds its own DOM representation as native S-expressions. WebKit is reduced to pixel painting only — it receives rendered layouts from Lisp, not raw HTML. The agent can traverse and manipulate the DOM as Lisp data structures without serialization. This makes web content natively queryable and modifiable by the agent's cognitive loop.
+
+*Stage 3 — Pure Lisp layout.* WebKit turned off entirely. Lisp-native layout engine (12-18 months of focused development). CSS subset sufficient for the modern web's 95% use case. JavaScript via QuickJS remains for interactive content. The browser is now a Lisp application that happens to speak HTTP, not a web engine wrapped in a Lisp process.
 
 *** Lish — the Lisp shell
 
@@ -832,41 +894,148 @@ The agent and the user share the same shell. The user types ~(list-todos :tag "@
 
 Org-mode buffers become the file system. The user's memex (~/memex/) is browsable as a tree of Org headlines. File operations (read, write, list, search) operate on Org AST nodes, not byte streams. A "directory listing" is a tree of headlines. A "file read" is a subtree rendered as text.
 
-Bash remains available as a backend for running external commands, but it is not the primary interface. The agent and the user interact with the system through plists, not through text parsing.
+Bash remains available as a backend for running external commands, but it is not the primary interface.
+
+*** Emacs migration — three phases
+
+The Emacs bridge (v0.4.0) is Phase I. The deep integration is three phases, not one:
+
+*Phase I — Parasite (v0.4.0).* Emacs is a client. The elisp TCP bridge sends text and receives responses. The agent does not control Emacs. Emacs users get a native chat experience alongside the TUI.
+
+*Phase II — Interpreter (v2.0.0).* An ELisp compatibility layer runs inside Passepartout's Common Lisp image. Key Emacs packages (Org-mode, Magit) run natively without an Emacs process. The compatibility layer does not aim for 100% coverage — it targets the packages the agent's workflows depend on.
+
+*Phase III — Successor (v2.0.0 and beyond).* Native Common Lisp implementations of Org-mode workflows and Git integration read/write the same file formats. Total independence from Emacs. Emacs users who prefer Emacs keep the bridge. New users get the native experience.
 
 *** Strategic timeline
 
-The Emacs bridge (v0.4.0) is the temporary bridge. It gives Lisp users a native Emacs experience while Lish is being built. By v2.0.0, Lish is mature enough to replace it for users who prefer the integrated environment. Emacs users who prefer Emacs keep the bridge — Lish does not require abandoning Emacs. It offers an alternative built on the same principles (single address space, plists everywhere, Org-mode as AST) but without elisp's limitations (single-threaded, no bordeaux-threads, no shared memory with the agent).
+v0.4.0 Emacs bridge (Phase I Parasite) → v1.0.0 SOTA parity → v2.0.0 Lish editor + Nyxt browser (Stage 1) + Emacs Phase II/III + mobile. The Qt/QML surface enables gradual erosion of the rendering stack without rewriting the application logic. The three-phase Emacs migration ensures Lisp users are never abandoned — the bridge works from day one, the native experience grows under it.
 
 ** v3.0.0: Neurosymbolic Maturity
 
 Deterministic planner takes the wheel. LLM relegated to semantic translation.
 
-- Deterministic planner: Pure Lisp task scheduler. No LLM needed for scheduling.
-- Self-correcting gates: Gates learn from false positives (user override patterns).
+*Architectural approach: Stitching, not building.* The symbolic engine is not a from-scratch reasoner. It is an integration of existing Common Lisp libraries connected by macros and DSLs. The Lisp advantage is the macro system — it transforms human-readable rules into formal logic queries without requiring a new engine.
 
-This is the architectural leap. The system transitions from "probabilistic engine with symbolic verification" to "symbolic engine with probabilistic input and output."
+*** Open-source Lisp stack
 
-The 10-80-10 architecture becomes fully realized: ten percent neural for input translation, eighty percent symbolic for reasoning against a knowledge graph, ten percent neural for output formatting. The symbolic engine maintains facts, relationships, rules, and formal proofs. When the neural engine generates something, the symbolic engine verifies it - not by checking against a blocklist, but by running the proposal through a Prolog/Datalog reasoner that understands the domain constraints.
+- *Knowledge Graph:* VivaceGraph v3 — Lisp-native graph database with a Prolog-like query language built in. Stores facts, relationships, and rules as native Lisp objects in the same image as the agent.
+- *Constraint Solver:* Screamer — non-deterministic backtracking. Given a set of constraints, finds all valid solutions or proves none exist. Used to verify that proposed actions do not violate invariants.
+- *Formal Verifier:* ACL2, a BSD-licensed theorem prover for an applicative subset of Common Lisp. Proves properties about functions before they are committed to the running image. Used for skill verification and Dispatcher rule validation.
 
-The deterministic planner takes the wheel. The LLM is no longer consulted for planning decisions - it translates human language to structured queries and structured results back to human language. The planning itself is pure Lisp: task graphs generated by a symbolic reasoner that has access to the full knowledge graph.
+*** The 10-80-10 architecture
 
-Self-correcting gates replace the learned Bouncer rules. The system learns not just from approved exceptions but from the full history of outcomes - did the plan succeed? Where did it fail? The symbolic engine updates its own rules based on the results.
+Ten percent neural for input translation, eighty percent symbolic for reasoning against a knowledge graph, ten percent neural for output formatting.
 
-The implications are significant. Hallucination becomes structurally impossible because the symbolic engine will not accept a fact that contradicts its knowledge graph. Safety becomes provable because the formal verification layer can prove properties about the system's behavior. Self-improvement becomes stable because the agent modifies skills that are then verified before execution.
+- *10% Input:* The LLM translates natural language into structured queries (Prolog facts, knowledge graph lookups). The neural translator is trained via EGGROLL (low-rank evolution strategies) on the reward signal from the symbolic verifier — it learns to produce queries that the symbolic engine accepts.
+- *80% Reasoning:* Pure Lisp. Task graphs generated by the deterministic planner against the knowledge graph. Formal verification via ACL2. Constraint checking via Screamer. Fact retrieval via VivaceGraph. Zero LLM tokens. Zero hallucinations.
+- *10% Output:* The LLM formats symbolic results back into natural language. The neural formatter is structurally identical to the translator — same training loop, reversed direction.
 
-** v4.0.0: AI Stack Internalized
+*** The auto-formalizer bootstrap
 
-The agent understands its own weights. No external inference.
+The symbolic engine needs a populated knowledge graph. The auto-formalizer populates it:
 
-- Llama.cpp in Lisp: FFI binding. No Python subprocess. Pure Common Lisp inference.
-- Weights as sexps: Neural weights as Lisp data structures. Homoiconic model introspection.
+1. Feed unstructured data (documentation, manuals, logs, session histories) to the LLM in ~auto-formalizer~ mode.
+2. The LLM extracts facts, relationships, and rules as structured S-expressions.
+3. The symbolic verifier (Screamer + ACL2) checks each extracted fact for consistency with the existing knowledge graph.
+4. Consistent facts are added. Conflicting facts are flagged for human review.
+5. Over time, the knowledge graph grows without manual ontology engineering.
 
-** v5.0.0: Hardware
+*** DSL approach over engine building
 
-The Lisp machine becomes physical. RISC-V with tagged architecture, hardware-enforced type checking, FPGA prototype for the symbolic core. The agent runs not in emulation but on silicon purpose-built for the architecture.
+Domain-specific languages, not general-purpose reasoners:
 
-This is the long horizon. The symbolic engine runs on logic ASICs optimized for symbolic computation. The neural engine runs on GPU or purpose-built matrix math hardware. Lisp orchestrates both, enforcing at the hardware level what it enforced at the software level in earlier versions.
+- Lisp macros transform human-readable rules into Prolog queries that run against VivaceGraph.
+- ~(defrule check-privacy :when (contains-tag payload "@personal") :then :block)~ expands to a VivaceGraph query with Screamer constraint checking.
+- Users write rules in the DSL. The macros handle the translation to formal logic.
+- The Skill Creator (v0.8.0) generates DSL rules from English descriptions. The auto-formalizer verifies them.
+- ~(macroexpand-1 '(defrule ...))~ shows exactly how the rule compiles — 100% auditable.
+
+*** Self-correcting gates
+
+Gates learn from the full history of outcomes — did the plan succeed? Where did it fail? The symbolic engine updates its own rules based on results:
+
+- Induced functions from v0.5.0 feed into the symbolic engine as candidate rules.
+- The symbolic verifier checks each candidate against the knowledge graph for consistency.
+- Rules that pass verification are promoted to the active gate stack.
+- Rules that fail verification are discarded with a diagnostic — the agent learns why the pattern doesn't generalize.
+
+*** Implications
+
+Hallucination becomes structurally impossible because the symbolic engine will not accept a fact that contradicts its knowledge graph. Safety becomes provable because ACL2 can prove properties about the system's behavior. Self-improvement becomes stable because the agent modifies skills that are then verified before execution. The 80% of computation that happens in the symbolic middle layer costs zero LLM tokens.
+
+** v4.0.0: Native Inference
+
+LLM inference moves in-process. No external servers. No API keys required for inference.
+
+*Lisp as Sovereign Governor, not as Math Engine.* The weights themselves are not stored as Lisp objects — this would waste 50% memory on type tags and destroy cache locality through pointer-chasing. Instead, the entire tensor is tagged as a single Lisp object (~macro-tag~). The Lisp image holds a pointer to optimized flat binary (GPU-friendly, FPGA-compatible). The tag is checked once. After that, all math happens in the optimized backend.
+
+*** Native inference (FFI binding to llama.cpp)
+
+- FFI binding to llama.cpp via CFFI: load GGUF models, run inference, manage KV cache. Single SBCL image, zero process boundaries. The agent and the model share memory.
+- Speculative safety: the Dispatcher gate stack intercepts token generation in real time. A token that would produce a blocked action is preemptively suppressed before generation. No external inference API supports this.
+- Foveal-peripheral compute: the model skips pruned context nodes during attention computation. External APIs compute full attention regardless of what you send. In-process inference makes the sparse-tree rendering pay off at the compute level, not just the token level.
+
+*** Live surgery on cognition
+
+With in-process inference, the agent's internal state becomes inspectable:
+
+- Pause inference mid-stream. Inspect hidden states and activations as Lisp variables.
+- Modify a vector, change a sampling parameter, resume.
+- Detect when the agent is likely to hallucinate by comparing current activation patterns against historical baselines.
+- The REPL becomes a surgical instrument for the agent's own cognition — not just for verifying code, but for inspecting and correcting the neural process that generates it.
+
+*** DSL-compiled model architectures
+
+Model architectures are described as Lisp DSL:
+
+- ~(defmodel passepartout-reasoning :type 'transformer :heads 32 :dim 4096 :layers 32)~
+- The DSL compiles to machine code for the target backend (GPU via CUDA, FPGA via VexRiscv, CPU via llama.cpp).
+- Python interprets at runtime. Lisp compiles once. Model architecture changes are treated the same as code changes — edited, verified, hot-reloaded.
+
+** v5.0.0: Hardware — Tagged Lisp Architecture
+
+The Lisp machine becomes physical. RISC-V with tagged architecture, hardware-enforced type checking, and an FPGA prototype for the symbolic core.
+
+*Not a from-scratch processor.* Use RISC-V as the skeleton, add custom Lisp extensions. RISC-V provides the carrier architecture (standard instruction set, existing toolchain, LLVM support). Lisp extensions provide tagged computation (type checking in hardware, parallel garbage collection, S-expression traversal as atomic operations).
+
+*** The macro-tag approach
+
+- Top 4–8 bits of every memory word = Type Tag. Hardware checks tags in parallel with ALU operations. Trap on type mismatch.
+- A tensor (70B weights) is one macro-tagged Lisp object — a pointer to flat binary. The tag is checked once. Math happens at native speed. This replaces "weights as sexps" (which wastes 50% memory on per-weight tags and destroys cache locality).
+- Custom instructions: TADD (tagged add), LISP.CAR, LISP.CDR — Lisp primitives as single-cycle hardware operations.
+
+*** Phase migration: Host → Co-processor → Self-hosted
+
+1. *Parasitic.* Lisp card (FPGA) is a PCIe co-processor. Host CPU (Intel/AMD, Linux/Windows) handles "dirty" I/O — networking, display, file systems. Lisp card handles tagged computation and the agent's cognitive loop. If Lisp crashes, host survives. Reset card, reload. Memory mapping: the card can see the host's memory. The Lisp environment reaches out and inspects data.
+
+2. *Functional Hijacking.* Lisp UI runs on the card, displays through the PC's GPU. The agent indexes Linux files into Lisp objects. The host becomes an I/O server for the Lisp card.
+
+3. *Driver Cannibalization.* Point the agent at C drivers. Ask it to generate native Lisp drivers for the hardware the card controls directly. PCIe Passthrough for direct hardware access.
+
+4. *Self-Hosting.* Replace the Linux bootloader with Stage0 Lisp (a bootstrap from 500 bytes of hex to a self-hosting Lisp). Cut the umbilical cord. The Lisp machine runs on bare metal.
+
+*** Concrete prototyping milestones
+
+| Stage | Hardware | Cost | What it delivers |
+|-------+----------+------+-----------------|
+| TinyTapeout | Custom silicon (130nm) | ~$500–1,000 | 8-bit tagged toy processor with Lisp primitives |
+| Shuttle | Multi-project wafer | ~$10,000–20,000 | Tagged RISC-V core at 100–300MHz |
+| FPGA | Terasic DE10-Nano / Xilinx KCU105 | ~$200–500 | VexRiscv with custom Lisp extensions, PCIe card form factor |
+| Industrial | Commercial foundry (5nm) | ~$10M–100M+ | Competes with modern CPUs on tagged workloads |
+
+Start at TinyTapeout. Validate the tagged architecture works. Move to FPGA. Validate at speed. Only then consider silicon.
+
+*** Garbage collection in hardware
+
+Dedicated bus master (Scavenger) runs background garbage collection while the main CPU executes code. No "GC pause." The scavenger traverses the heap in parallel with computation, freeing unreachable objects without stopping the agent.
+
+*** Persistent single-address-space memory
+
+NVRAM for the entire heap. Turn on the machine — state is exactly where you left it. No "booting." No "loading memory from disk." The agent's Merkle-tree memory, skill registry, knowledge graph, and induced functions survive restarts as a contiguous hardware state.
+
+*** Why this is not "Lisp inside browser"
+
+Most Lisp-on-hardware attempts fail because they try to compete with Intel on raw math. That's the wrong axis. The tagged architecture doesn't need to beat a GPU at matrix multiplication. It needs to beat a CPU at symbolic computation — graph traversal, constraint solving, theorem proving, garbage collection. These are the v3.0.0 symbolic engine's workload. Hardware that makes them single-cycle is the differentiator, not hardware that runs matrix math faster.
 
 ** v6.0.0: True Agency
 
diff --git a/lisp/core-defpackage.lisp b/lisp/core-defpackage.lisp
index a8011a9..13e6367 100644
--- a/lisp/core-defpackage.lisp
+++ b/lisp/core-defpackage.lisp
@@ -91,6 +91,11 @@
        #:embed-object
        #:embed-all-pending
        #:embedding-backend-hashing
+       #:embedding-backend-native
+       #:embedding-native-load-model
+       #:embedding-native-unload
+       #:embedding-native-ensure-loaded
+       #:embedding-native-get-dim
        #:embeddings-compute
        #:mark-vector-stale
      #:skill
diff --git a/lisp/core-loop-act.lisp b/lisp/core-loop-act.lisp
index 9ed3280..a224c9c 100644
--- a/lisp/core-loop-act.lisp
+++ b/lisp/core-loop-act.lisp
@@ -26,8 +26,10 @@
                                     (stream (getf meta :reply-stream)))
                                (when (and stream (open-stream-p stream))
                                  ;; Enrich response with differentiator visualization data
-                                 (setf (getf (getf action :payload) :rule-count)
-                                       (hash-table-count *hitl-pending*))
+                              (setf (getf (getf action :payload) :rule-count)
+                                    (if (boundp '*hitl-pending*)
+                                        (hash-table-count *hitl-pending*)
+                                        0))
                                  (setf (getf (getf action :payload) :foveal-id)
                                        (getf context :foveal-id))
                                  (format stream "~a" (frame-message action))
diff --git a/lisp/system-model-embedding-native.lisp b/lisp/system-model-embedding-native.lisp
index 89cb3ea..5c7d0a1 100644
--- a/lisp/system-model-embedding-native.lisp
+++ b/lisp/system-model-embedding-native.lisp
@@ -1,84 +1,77 @@
+(unless (find-package :passepartout)
+  (make-package :passepartout :use '(:cl)))
+
 (in-package :passepartout)
 
-(cffi:define-foreign-library libllama
-  (:unix "/usr/local/lib/libllama.so"))
-
+(cffi:define-foreign-library libllama_wrap (:unix "/usr/local/lib/libllama_wrap.so"))
+(cffi:use-foreign-library libllama_wrap)
+(cffi:define-foreign-library libllama (:unix "/usr/local/lib/libllama.so"))
 (cffi:use-foreign-library libllama)
 
-(cffi:defctype llama-model-p :pointer)
-(cffi:defctype llama-context-p :pointer)
-(cffi:defctype llama-seq-id :int32)
-(cffi:defctype llama-token :int32)
-(cffi:defctype llama-pos :int32)
+(cffi:defcstruct (llama-mparams :size 72)
+  (devices :pointer) (tensor-buft :pointer) (n-gpu-layers :int32)
+  (split-mode :int32) (main-gpu :int32) (_pad1 :int32)
+  (tensor-split :pointer) (progress-cb :pointer) (progress-data :pointer)
+  (kv-overrides :pointer) (vocab-only :bool) (use-mmap :bool)
+  (_pad2 :uint8 :count 6))
 
-(cffi:defcstruct (llama-model-params :class llama-model-params-type)
-  (n-gpu-layers :int32))
-
-(cffi:defcstruct (llama-context-params :class llama-context-params-type)
+(cffi:defcstruct (llama-cparams :size 136)
   (n-ctx :uint32)
   (n-batch :uint32)
   (n-ubatch :uint32)
   (n-seq-max :uint32)
   (n-threads :int32)
-  (embeddings :bool))
+  (n-threads-batch :int32)
+  (rope-scaling-type :int32)
+  (pooling-type :int32)
+  (attention-type :int32)
+  (flash-attn-type :int32)
+  (rope-freq-base :float)
+  (rope-freq-scale :float)
+  (yarn-ext-factor :float)
+  (yarn-attn-factor :float)
+  (yarn-beta-fast :float)
+  (yarn-beta-slow :float)
+  (yarn-orig-ctx :uint32)
+  (defrag-thold :float)
+  (cb-eval :pointer)
+  (cb-eval-user-data :pointer)
+  (type-k :int32)
+  (type-v :int32)
+  (abort-callback :pointer)
+  (abort-callback-data :pointer)
+  (embeddings :bool)
+  (offload-kqv :bool)
+  (no-perf :bool)
+  (op-offload :bool)
+  (swa-full :bool)
+  (kv-unified :bool)
+  (_c-pad3 :uint8 :count 15))
 
-(cffi:defcstruct (llama-batch :class llama-batch-type)
-  (n-tokens :int32)
-  (token :pointer)
-  (embd :pointer)
-  (pos :pointer)
-  (n-seq-id :pointer)
-  (seq-id :pointer)
-  (logits :pointer))
+(cffi:defcstruct (llama-batch :size 56)
+  (n-tokens :int32) (_bpad1 :int32) (token :pointer) (embd :pointer)
+  (pos :pointer) (n-seq-id :pointer) (seq-id :pointer) (logits :pointer))
 
-(cffi:defcfun ("llama_model_default_params" %llama-model-default-params) (:struct llama-model-params))
+;; llama.cpp public API
+(cffi:defcfun ("llama_backend_init" bl) :void)
+(cffi:defcfun ("llama_model_default_params" mdp) :void (p :pointer))
+(cffi:defcfun ("llama_context_default_params" cdp) :void (p :pointer))
+(cffi:defcfun ("llama_model_n_embd" ne) :int32 (m :pointer))
+(cffi:defcfun ("llama_model_get_vocab" gv) :pointer (m :pointer))
+(cffi:defcfun ("llama_vocab_n_tokens" vnt) :int32 (vocab :pointer))
+(cffi:defcfun ("llama_tokenize" tok) :int32 (vocab :pointer) (text :string) (len :int32) (tokens :pointer) (n-max :int32) (add-special :bool) (parse-special :bool))
+(cffi:defcfun ("llama_get_embeddings_ith" embd-ith) :pointer (ctx :pointer) (i :int32))
+(cffi:defcfun ("llama_get_embeddings_seq" embd-seq) :pointer (ctx :pointer) (seq-id :int32))
+(cffi:defcfun ("llama_pooling_type" get-pooling) :int32 (ctx :pointer))
+(cffi:defcfun ("llama_model_free" fm) :void (m :pointer))
+(cffi:defcfun ("llama_free" fc) :void (ctx :pointer))
 
-(cffi:defcfun ("llama_context_default_params" %llama-context-default-params) (:struct llama-context-params))
-
-(cffi:defcfun ("llama_model_load" %llama-model-load) llama-model-p
-  (path-model :string)
-  (params (:struct llama-model-params)))
-
-(cffi:defcfun ("llama_new_context_with_model" %llama-new-context-with-model) llama-context-p
-  (model llama-model-p)
-  (params (:struct llama-context-params)))
-
-(cffi:defcfun ("llama_free_model" %llama-free-model) :void
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_free" %llama-free) :void
-  (ctx llama-context-p))
-
-(cffi:defcfun ("llama_n_embd" %llama-n-embd) :int32
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_n_vocab" %llama-n-vocab) :int32
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_tokenize" %llama-tokenize) :int32
-  (model llama-model-p)
-  (text :string)
-  (text-len :int32)
-  (tokens :pointer)
-  (n-max-tokens :int32)
-  (add-special :bool)
-  (parse-special :bool))
-
-(cffi:defcfun ("llama_encode" %llama-encode) :int32
-  (ctx llama-context-p)
-  (batch (:struct llama-batch)))
-
-(cffi:defcfun ("llama_get_embeddings_ith" %llama-get-embeddings-ith) :pointer
-  (ctx llama-context-p)
-  (i :int32))
-
-(cffi:defcfun ("llama_batch_init" %llama-batch-init) (:struct llama-batch)
-  (n-tokens :int32)
-  (embd :int32)
-  (n-seq-max :int32))
-
-(cffi:defcfun ("llama_batch_free" %llama-batch-free) :void
-  (batch (:struct llama-batch)))
+;; C wrapper (bridges struct-by-value ABI)
+(cffi:defcfun ("llama_wrap_model_load" wrap-load) :pointer (path :string) (params :pointer))
+(cffi:defcfun ("llama_wrap_new_context" wrap-ctx) :pointer (model :pointer) (params :pointer))
+(cffi:defcfun ("llama_wrap_encode" wrap-encode) :int32 (ctx :pointer) (batch :pointer))
+(cffi:defcfun ("llama_wrap_batch_init" wrap-batch-init) :void (batch :pointer) (n-tokens :int32) (embd :int32) (n-seq-max :int32))
+(cffi:defcfun ("llama_wrap_batch_free" wrap-batch-free) :void (batch :pointer))
 
 (defvar *native-model* nil
   "Cached llama.cpp model for embedding inference.")
@@ -86,6 +79,9 @@
 (defvar *native-context* nil
   "Cached llama.cpp context for embedding inference.")
 
+(defvar *native-vocab* nil
+  "Cached llama.cpp vocab handle (from model).")
+
 (defvar *native-model-path*
   (merge-pathnames ".local/share/passepartout/models/nomic-embed-text-v1.5.Q4_K_M.gguf"
                    (user-homedir-pathname))
@@ -96,74 +92,94 @@
   (unless (and *native-model* *native-context*)
     (unless (uiop:file-exists-p *native-model-path*)
       (error "Native embedding model not found at ~a" *native-model-path*))
-    (let ((mparams (%llama-model-default-params)))
-      (setf (cffi:foreign-slot-value mparams '(:struct llama-model-params) 'n-gpu-layers) 0)
-      (setf *native-model* (%llama-model-load (namestring *native-model-path*) mparams)))
-    (let* ((cparams (%llama-context-default-params)))
-      (setf (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-ctx) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-batch) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-ubatch) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-seq-max) 1
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-threads) 2
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'embeddings) 1)
-      (setf *native-context* (%llama-new-context-with-model *native-model* cparams)))
-    (log-message "EMBEDDING: Native model loaded (~d-dim)" (%llama-n-embd *native-model*)))
-  (values *native-model* *native-context*))
-
-(defun embedding-native-get-dim ()
-  "Return the embedding dimension of the native model."
-  (embedding-native-load-model)
-  (%llama-n-embd *native-model*))
+    (sb-int:set-floating-point-modes :traps '())
+    (bl)
+    ;; Load model
+    (cffi:with-foreign-object (mp 'llama-mparams)
+      (mdp mp)
+      (setf (cffi:foreign-slot-value mp 'llama-mparams 'n-gpu-layers) 0)
+      (setf (cffi:foreign-slot-value mp 'llama-mparams 'use-mmap) nil) ; CFFI :bool treats 0 as true
+      (setf *native-model* (wrap-load (namestring *native-model-path*) mp)))
+    (setf *native-vocab* (gv *native-model*))
+    ;; Create context
+    (let ((n-embd (ne *native-model*)))
+      (cffi:with-foreign-object (cp 'llama-cparams)
+        (cdp cp)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-ctx) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-batch) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-ubatch) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-seq-max) 1)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-threads) 2)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'embeddings) t)
+        (setf *native-context* (wrap-ctx *native-model* cp)))
+      (format *error-output* "~&;; EMBEDDING: Native model loaded (~d-dim)~%" n-embd)))
+  (values *native-model* *native-context* *native-vocab*))
 
 (defun embedding-backend-native (text)
   "Compute an embedding vector using the native llama.cpp backend.
-Returns a single-float vector of dimension n_embd."
-  (let* ((text-len (length text))
+Returns a (simple-array single-float (*)) of length n_embd (768 for the bundled model)."
+  (embedding-native-load-model)
+  (let* ((n-embd (ne *native-model*))
          (max-tokens 256)
          (tokens (cffi:foreign-alloc :int32 :count max-tokens))
-         (n-tokens 0))
+         (n-tok 0))
     (unwind-protect
         (progn
-          (embedding-native-load-model)
-          (setf n-tokens (%llama-tokenize *native-model* text text-len tokens max-tokens t t))
-          (when (zerop n-tokens)
-            (error "Native embedding: tokenization returned 0 tokens"))
-          (let* ((batch (%llama-batch-init n-tokens 0 1))
-                 (n-embd (embedding-native-get-dim))
-                 (result (make-array n-embd :element-type 'single-float :initial-element 0.0))
-                 (seq-id-ptr (cffi:foreign-alloc :int32 :count 1)))
-            (setf (cffi:mem-aref seq-id-ptr :int32 0) 0)
-            (unwind-protect
-                (progn
-                  (dotimes (i n-tokens)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'token) :int32 i)
-                          (cffi:mem-aref tokens :int32 i))
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'pos) :int32 i) i)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'n-seq-id) :int32 i) 1)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'seq-id) :pointer i)
-                          seq-id-ptr))
-                  (let ((encode-result (%llama-encode *native-context* batch)))
-                    (when (not (zerop encode-result))
-                      (error "Native embedding: encode returned ~d" encode-result)))
-                  (let ((embd-ptr (%llama-get-embeddings-ith *native-context* (1- n-tokens))))
-                    (dotimes (i n-embd)
-                      (setf (aref result i) (cffi:mem-aref embd-ptr :float i)))))
-              (%llama-batch-free batch)
-              (cffi:foreign-free seq-id-ptr))
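+          (setf n-tok (tok *native-vocab* text (babel:string-size-in-octets text) tokens max-tokens t t))
+          (unless (plusp n-tok) ; llama_tokenize returns a negative count when the buffer is too small
+            (error "Native embedding: tokenization failed (returned ~d) for ~s" n-tok text))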
+          (let ((result (make-array n-embd :element-type 'single-float :initial-element 0.0f0)))
+            (cffi:with-foreign-object (batch 'llama-batch)
+              (wrap-batch-init batch n-tok 0 1)
+              (setf (cffi:foreign-slot-value batch 'llama-batch 'n-tokens) n-tok)
+              (dotimes (i n-tok)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'token) :int32 i)
+                      (cffi:mem-aref tokens :int32 i))
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'pos) :int32 i) i)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'n-seq-id) :int32 i) 1)
+                (setf (cffi:mem-aref (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'seq-id) :pointer i) :int32 0) 0)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'logits) :int8 i) 1))
+              (let ((enc (wrap-encode *native-context* batch)))
+                (unless (zerop enc)
+                  (error "Native embedding: encode returned ~d" enc)))
+              (let* ((pooling (get-pooling *native-context*))
+                     (eptr (if (= pooling 0)
+                               (embd-ith *native-context* (1- n-tok))
+                               (embd-seq *native-context* 0))))
+                (dotimes (i n-embd)
+                  (setf (aref result i) (cffi:mem-aref eptr :float i))))
+              (wrap-batch-free batch))
             result))
       (cffi:foreign-free tokens))))
 
-(defun embedding-backend-native-unload ()
+(defun embedding-native-unload ()
   "Release native model and context memory."
   (when *native-context*
-    (%llama-free *native-context*)
+    (fc *native-context*)
     (setf *native-context* nil))
   (when *native-model*
-    (%llama-free-model *native-model*)
-    (setf *native-model* nil))
+    (fm *native-model*)
+    (setf *native-model* nil *native-vocab* nil))
   (values))
 
-(pushnew (lambda () (embedding-backend-native-unload)) sb-ext:*exit-hooks*)
+(defun embedding-native-get-dim ()
+  "Return embedding dimension of loaded native model (0 if not loaded)."
+  (if *native-model*
+      (ne *native-model*)
+      0))
+
+(defun vector-cosine-similarity (a b)
+  "Cosine similarity, as a double-float, between two equal-length single-float vectors."
+  (let ((dot 0.0d0) (anorm 0.0d0) (bnorm 0.0d0))
+    (dotimes (i (length a))
+      (let ((af (float (aref a i) 0.0d0))
+            (bf (float (aref b i) 0.0d0)))
+        (incf dot (* af bf))
+        (incf anorm (* af af))
+        (incf bnorm (* bf bf))))
+    (if (or (zerop anorm) (zerop bnorm))
+        0.0d0
+        (/ dot (sqrt (* anorm bnorm))))))
 
 (eval-when (:compile-toplevel :load-toplevel :execute)
   (ql:quickload :fiveam :silent t))
@@ -205,8 +221,8 @@ Returns a single-float vector of dimension n_embd."
   "Contract v0.4.1: semantically similar texts are closer than unrelated."
   (let ((v-auth (passepartout::embedding-backend-native "implement user login form"))
         (v-related (passepartout::embedding-backend-native "add password authentication"))
-        (v-unrelated (passepartout::embedding-backend-native "banana fruit yellow"))
-        (sim-related (passepartout::vector-cosine-similarity v-auth v-related))
-        (sim-unrelated (passepartout::vector-cosine-similarity v-auth v-unrelated)))
-    (is (> sim-related 0.5))
-    (is (> sim-related sim-unrelated))))
+        (v-unrelated (passepartout::embedding-backend-native "banana fruit yellow")))
+    (let ((sim-related (passepartout::vector-cosine-similarity v-auth v-related))
+          (sim-unrelated (passepartout::vector-cosine-similarity v-auth v-unrelated)))
+      (is (> sim-related 0.5))
+      (is (> sim-related sim-unrelated)))))
diff --git a/lisp/system-model-embedding.lisp b/lisp/system-model-embedding.lisp
index b6acca4..6eeb178 100644
--- a/lisp/system-model-embedding.lisp
+++ b/lisp/system-model-embedding.lisp
@@ -1,7 +1,7 @@
 (in-package :passepartout)
 
 (defvar *embedding-provider* :trigram
-  "Active embedding provider: :trigram, :sha256, :local, :openai.")
+  "Active embedding provider: :trigram, :sha256, :local, :openai, :native.")
 
 (defvar *embedding-queue* nil
   "Queue of text objects awaiting embedding.")
@@ -85,10 +85,14 @@ Pure Lisp, zero external dependencies, works fully offline."
   "Embed a single text string using the active backend."
   (let* ((selected (or *embedding-backend* *embedding-provider* :trigram))
          (backend (case selected
-                     (:local #'embedding-backend-local)
-                     (:openai #'embedding-backend-openai)
-                     (:sha256 #'embedding-backend-sha256)
-                     (t #'embedding-backend-trigram))))
+                    (:local #'embedding-backend-local)
+                    (:openai #'embedding-backend-openai)
+                    (:native
+                     (unless (fboundp 'embedding-backend-native)
+                       (embedding-native-ensure-loaded))
+                     #'embedding-backend-native)
+                    (:sha256 #'embedding-backend-sha256)
+                    (t #'embedding-backend-trigram))))
     (if backend
         (progn
           (log-message "EMBEDDING: Provider ~a, backend=~a" selected backend)
@@ -126,6 +130,34 @@ Pure Lisp, zero external dependencies, works fully offline."
       (setf *embedding-provider* kw)
       (log-message "EMBEDDING: Set provider to ~a from EMBEDDING_PROVIDER env" kw))))
 
+(defun embedding-native-ensure-loaded ()
+  "Lazy-load the native CFFI backend. First call blocks ~30s for model init."
+  (when (fboundp 'embedding-backend-native)
+    (return-from embedding-native-ensure-loaded t))
+  (let* ((data-dir (uiop:ensure-directory-pathname
+                    (or (uiop:getenv "PASSEPARTOUT_DATA_DIR")
+                        (namestring (merge-pathnames ".local/share/passepartout/"
+                                                    (user-homedir-pathname))))))
+         (native-file (merge-pathnames "lisp/system-model-embedding-native.lisp" data-dir)))
+    (handler-case
+        (progn
+          (load native-file :verbose nil :print nil)
+          (log-message "EMBEDDING: Native backend loaded from ~a" native-file))
+      (error (c)
+        (error "Failed to load native embedding backend (~a): ~a" native-file c)))))
+
+;; Preload native model if configured at startup
+(when (eq *embedding-provider* :native)
+  (log-message "EMBEDDING: Native provider configured, preloading model...")
+  (embedding-native-ensure-loaded)
+  (handler-case
+      (progn
+        (embedding-native-load-model)
+        (log-message "EMBEDDING: Native model preloaded (~d dims)"
+                     (embedding-native-get-dim)))
+    (error (c)
+      (log-message "EMBEDDING: Preload deferred: ~a (will retry on first call)" c))))
+
 (log-message "EMBEDDING: Gateway loaded with provider ~a" *embedding-provider*)
 
 (defun mark-vector-stale (id &optional content)
diff --git a/org/core-defpackage.org b/org/core-defpackage.org
index a048f42..cc6c6bf 100644
--- a/org/core-defpackage.org
+++ b/org/core-defpackage.org
@@ -116,6 +116,11 @@ The package definition. All public symbols are exported here.
        #:embed-object
        #:embed-all-pending
        #:embedding-backend-hashing
+       #:embedding-backend-native
+       #:embedding-native-load-model
+       #:embedding-native-unload
+       #:embedding-native-ensure-loaded
+       #:embedding-native-get-dim
        #:embeddings-compute
        #:mark-vector-stale
      #:skill
diff --git a/org/core-loop-act.org b/org/core-loop-act.org
index 649338c..5e5f7a0 100644
--- a/org/core-loop-act.org
+++ b/org/core-loop-act.org
@@ -81,8 +81,10 @@ Because a skill's deterministic gate runs during Reason, but between Reason and
                                     (stream (getf meta :reply-stream)))
                                (when (and stream (open-stream-p stream))
                                  ;; Enrich response with differentiator visualization data
-                                 (setf (getf (getf action :payload) :rule-count)
-                                       (hash-table-count *hitl-pending*))
+                              (setf (getf (getf action :payload) :rule-count)
+                                    (if (boundp '*hitl-pending*)
+                                        (hash-table-count *hitl-pending*)
+                                        0))
                                  (setf (getf (getf action :payload) :foveal-id)
                                        (getf context :foveal-id))
                                  (format stream "~a" (frame-message action))
diff --git a/org/system-model-embedding-native.org b/org/system-model-embedding-native.org
index 03c9436..769b110 100644
--- a/org/system-model-embedding-native.org
+++ b/org/system-model-embedding-native.org
@@ -5,121 +5,127 @@
 
 * Architectural Intent
 
-~system-model-embedding-native~ provides in-process embedding inference via CFFI binding to llama.cpp. Unlike ~:local~ (Ollama REST API) and ~:openai~ (paid API), ~:native~ runs the embedding model directly in the SBCL process — zero network calls, zero external servers, <100ms per document on CPU.
+=system-model-embedding-native= provides in-process embedding inference via CFFI binding to llama.cpp. Unlike =:local= (Ollama REST API) and =:openai= (paid API), =:native= runs the embedding model directly in the SBCL process — zero network calls, zero external servers.
 
-The bundled model is ~nomic-embed-text-v1.5~ (nomic-bert, 768-dim, 12 layers) at ~~/.local/share/passepartout/models/nomic-embed-text-v1.5.Q4_K_M.gguf~. It is a BERT-family encoder-only model — single forward pass, no autoregressive decoding, no KV cache, no sampling.
+The bundled model is =nomic-embed-text-v1.5= (nomic-bert, 768-dim, 12 layers, Q4_K_M quantization, ~80MB) at =~/.local/share/passepartout/models/nomic-embed-text-v1.5.Q4_K_M.gguf=. It is a BERT-family encoder-only model — single forward pass, no autoregressive decoding.
 
-**Why this matters**: The trigram Jaccard fallback (v0.4.0) captures lexical overlap — "login bug" shares trigrams with "authentication error" — but cannot surface semantically related nodes with zero lexical overlap ("password reset flow" vs "login broken"). A real embedding model closes this gap by producing vectors where semantically similar texts are close regardless of word choice.
-
-The CFFI binding targets llama.cpp's public API:
-- ~llama_model_load~ / ~llama_free_model~ — model lifecycle
-- ~llama_new_context_with_model~ / ~llama_free~ — context lifecycle
-- ~llama_encode~ — single forward pass (encoder-only, no generation)
-- ~llama_get_embeddings_ith(ctx, i)~ — extract float vector at position i
-- ~llama_n_embd(model)~ — embedding dimension
-
-Memory: model and context are cached globally in ~*native-model*~ / ~*native-context*~ to avoid reloading on every embedding call.
+**Key architectural decisions**:
+- C wrapper library (=/usr/local/lib/libllama_wrap.so=) bridges CFFI pointer params to llama.cpp's struct-by-value API (CFFI cannot pass/return structs by value)
+- Struct sizes verified via C =sizeof= / =offsetof=: =llama_model_params= (72B), =llama_context_params= (136B), =llama_batch= (56B)
+- Model and context cached globally in =*native-model*= / =*native-context*= to avoid reloading
+- BERT pooling: =llama_get_embeddings_seq= for sequence-level embedding (not =llama_get_embeddings_ith=)
+- =(sb-int:set-floating-point-modes :traps nil)= must run before any llama.cpp call; with SBCL's floating-point traps enabled, llama.cpp FPU operations raise SIGFPE
 
 * Implementation
 
-** Package
+** Package guard
 #+begin_src lisp
+(unless (find-package :passepartout)
+  (make-package :passepartout :use '(:cl)))
+
 (in-package :passepartout)
 #+end_src
 
-** CFFI: Load shared library
-#+begin_src lisp
-(cffi:define-foreign-library libllama
-  (:unix "/usr/local/lib/libllama.so"))
+** CFFI: Load C wrapper + llama libraries
 
+The C wrapper (=libllama_wrap.so=) bridges struct-by-value: all wrapper functions take pure pointers and dereference internally.
+
+#+begin_src lisp
+(cffi:define-foreign-library libllama_wrap (:unix "/usr/local/lib/libllama_wrap.so"))
+(cffi:use-foreign-library libllama_wrap)
+(cffi:define-foreign-library libllama (:unix "/usr/local/lib/libllama.so"))
 (cffi:use-foreign-library libllama)
 #+end_src
 
-** CFFI: Types
+** CFFI: Struct definitions
+
+Sizes verified via C =sizeof= / =offsetof= at build time.
+
 #+begin_src lisp
-(cffi:defctype llama-model-p :pointer)
-(cffi:defctype llama-context-p :pointer)
-(cffi:defctype llama-seq-id :int32)
-(cffi:defctype llama-token :int32)
-(cffi:defctype llama-pos :int32)
+(cffi:defcstruct (llama-mparams :size 72)
+  (devices :pointer) (tensor-buft :pointer) (n-gpu-layers :int32)
+  (split-mode :int32) (main-gpu :int32) (_pad1 :int32)
+  (tensor-split :pointer) (progress-cb :pointer) (progress-data :pointer)
+  (kv-overrides :pointer) (vocab-only :bool) (use-mmap :bool)
+  (_pad2 :uint8 :count 6))
 
-(cffi:defcstruct (llama-model-params :class llama-model-params-type)
-  (n-gpu-layers :int32))
-
-(cffi:defcstruct (llama-context-params :class llama-context-params-type)
+(cffi:defcstruct (llama-cparams :size 136)
   (n-ctx :uint32)
   (n-batch :uint32)
   (n-ubatch :uint32)
   (n-seq-max :uint32)
   (n-threads :int32)
-  (embeddings :bool))
+  (n-threads-batch :int32)
+  (rope-scaling-type :int32)
+  (pooling-type :int32)
+  (attention-type :int32)
+  (flash-attn-type :int32)
+  (rope-freq-base :float)
+  (rope-freq-scale :float)
+  (yarn-ext-factor :float)
+  (yarn-attn-factor :float)
+  (yarn-beta-fast :float)
+  (yarn-beta-slow :float)
+  (yarn-orig-ctx :uint32)
+  (defrag-thold :float)
+  (cb-eval :pointer)
+  (cb-eval-user-data :pointer)
+  (type-k :int32)
+  (type-v :int32)
+  (abort-callback :pointer)
+  (abort-callback-data :pointer)
+  (embeddings :bool)
+  (offload-kqv :bool)
+  (no-perf :bool)
+  (op-offload :bool)
+  (swa-full :bool)
+  (kv-unified :bool)
+  (_c-pad3 :uint8 :count 15))
 
-(cffi:defcstruct (llama-batch :class llama-batch-type)
-  (n-tokens :int32)
-  (token :pointer)
-  (embd :pointer)
-  (pos :pointer)
-  (n-seq-id :pointer)
-  (seq-id :pointer)
-  (logits :pointer))
+(cffi:defcstruct (llama-batch :size 56)
+  (n-tokens :int32) (_bpad1 :int32) (token :pointer) (embd :pointer)
+  (pos :pointer) (n-seq-id :pointer) (seq-id :pointer) (logits :pointer))
 #+end_src
 
-** CFFI: Functions
+** CFFI: llama.cpp API (current, non-deprecated)
+
+llama.cpp has undergone API changes. We target the current stable API:
+- =llama_model_load_from_file= → C wrapper (=llama_wrap_model_load=)
+- =llama_init_from_model= → C wrapper (=llama_wrap_new_context=)
+- =llama_encode= → C wrapper (=llama_wrap_encode=) — takes struct-by-value batch
+- =llama_batch_init/free= → C wrapper — returns/consumes struct-by-value
+- =llama_backend_init= REQUIRED before any model load
+- =llama_model_n_embd= (NOT deprecated =llama_n_embd=)
+- =llama_model_get_vocab= + =llama_vocab_n_tokens= (NOT deprecated =llama_n_vocab= with model pointer)
+- =llama_tokenize= now takes =vocab*= not =model*=
+- =llama_get_embeddings_seq= for BERT pooled embeddings (=llama_get_embeddings_ith= for token embeddings)
+- =llama_pooling_type= to query context pooling strategy
+
 #+begin_src lisp
-(cffi:defcfun ("llama_model_default_params" %llama-model-default-params) :void
-  (params :pointer))
+;; llama.cpp public API
+(cffi:defcfun ("llama_backend_init" bl) :void)
+(cffi:defcfun ("llama_model_default_params" mdp) :void (p :pointer))
+(cffi:defcfun ("llama_context_default_params" cdp) :void (p :pointer))
+(cffi:defcfun ("llama_model_n_embd" ne) :int32 (m :pointer))
+(cffi:defcfun ("llama_model_get_vocab" gv) :pointer (m :pointer))
+(cffi:defcfun ("llama_vocab_n_tokens" vnt) :int32 (vocab :pointer))
+(cffi:defcfun ("llama_tokenize" tok) :int32 (vocab :pointer) (text :string) (len :int32) (tokens :pointer) (n-max :int32) (add-special :bool) (parse-special :bool))
+(cffi:defcfun ("llama_get_embeddings_ith" embd-ith) :pointer (ctx :pointer) (i :int32))
+(cffi:defcfun ("llama_get_embeddings_seq" embd-seq) :pointer (ctx :pointer) (seq-id :int32))
+(cffi:defcfun ("llama_pooling_type" get-pooling) :int32 (ctx :pointer))
+(cffi:defcfun ("llama_model_free" fm) :void (m :pointer))
+(cffi:defcfun ("llama_free" fc) :void (ctx :pointer))
 
-(cffi:defcfun ("llama_context_default_params" %llama-context-default-params) :void
-  (params :pointer))
-
-(cffi:defcfun ("llama_model_load" %llama-model-load) llama-model-p
-  (path-model :string)
-  (params :pointer))
-
-(cffi:defcfun ("llama_new_context_with_model" %llama-new-context-with-model) llama-context-p
-  (model llama-model-p)
-  (params :pointer))
-
-(cffi:defcfun ("llama_free_model" %llama-free-model) :void
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_free" %llama-free) :void
-  (ctx llama-context-p))
-
-(cffi:defcfun ("llama_n_embd" %llama-n-embd) :int32
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_n_vocab" %llama-n-vocab) :int32
-  (model llama-model-p))
-
-(cffi:defcfun ("llama_tokenize" %llama-tokenize) :int32
-  (model llama-model-p)
-  (text :string)
-  (text-len :int32)
-  (tokens :pointer)
-  (n-max-tokens :int32)
-  (add-special :bool)
-  (parse-special :bool))
-
-(cffi:defcfun ("llama_encode" %llama-encode) :int32
-  (ctx llama-context-p)
-  (batch :pointer))
-
-(cffi:defcfun ("llama_get_embeddings_ith" %llama-get-embeddings-ith) :pointer
-  (ctx llama-context-p)
-  (i :int32))
-
-(cffi:defcfun ("llama_batch_init" %llama-batch-init) :void
-  (batch :pointer)
-  (n-tokens :int32)
-  (embd :int32)
-  (n-seq-max :int32))
-
-(cffi:defcfun ("llama_batch_free" %llama-batch-free) :void
-  (batch :pointer))
+;; C wrapper (bridges struct-by-value ABI)
+(cffi:defcfun ("llama_wrap_model_load" wrap-load) :pointer (path :string) (params :pointer))
+(cffi:defcfun ("llama_wrap_new_context" wrap-ctx) :pointer (model :pointer) (params :pointer))
+(cffi:defcfun ("llama_wrap_encode" wrap-encode) :int32 (ctx :pointer) (batch :pointer))
+(cffi:defcfun ("llama_wrap_batch_init" wrap-batch-init) :void (batch :pointer) (n-tokens :int32) (embd :int32) (n-seq-max :int32))
+(cffi:defcfun ("llama_wrap_batch_free" wrap-batch-free) :void (batch :pointer))
 #+end_src
 
 ** Global state
+
 #+begin_src lisp
 (defvar *native-model* nil
   "Cached llama.cpp model for embedding inference.")
@@ -127,92 +133,153 @@ Memory: model and context are cached globally in ~*native-model*~ / ~*native-con
 (defvar *native-context* nil
   "Cached llama.cpp context for embedding inference.")
 
+(defvar *native-vocab* nil
+  "Cached llama.cpp vocab handle (from model).")
+
 (defvar *native-model-path*
   (merge-pathnames ".local/share/passepartout/models/nomic-embed-text-v1.5.Q4_K_M.gguf"
                    (user-homedir-pathname))
   "Path to the bundled embedding model GGUF file.")
 #+end_src
 
-** Embedding Backend
+** Model loading
+
+Loads the GGUF model file and creates an inference context. Caches globally — subsequent calls are no-ops.
+
+Key initialization:
+- =sb-int:set-floating-point-modes= :traps nil — required or llama.cpp FPU ops SIGFPE
+- =llama_backend_init= — must run before any model operation
+- Model params: GPU off (=n-gpu-layers= set to 0), no mmap (avoids double-free with SBCL's malloc)
+- Context params: embeddings=1, 512-token context, 2 threads, =pooling_type= unset (let model decide)
+
 #+begin_src lisp
 (defun embedding-native-load-model ()
   "Load the embedding model and create a context. Caches globally."
   (unless (and *native-model* *native-context*)
     (unless (uiop:file-exists-p *native-model-path*)
       (error "Native embedding model not found at ~a" *native-model-path*))
-    (cffi:with-foreign-object (mparams '(:struct llama-model-params))
-      (%llama-model-default-params mparams)
-      (setf (cffi:foreign-slot-value mparams '(:struct llama-model-params) 'n-gpu-layers) 0)
-      (setf *native-model* (%llama-model-load (namestring *native-model-path*) mparams)))
-    (cffi:with-foreign-object (cparams '(:struct llama-context-params))
-      (%llama-context-default-params cparams)
-      (setf (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-ctx) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-batch) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-ubatch) 512
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-seq-max) 1
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'n-threads) 2
-            (cffi:foreign-slot-value cparams '(:struct llama-context-params) 'embeddings) 1)
-      (setf *native-context* (%llama-new-context-with-model *native-model* cparams)))
-    (log-message "EMBEDDING: Native model loaded (~d-dim)" (%llama-n-embd *native-model*)))
-  (values *native-model* *native-context*))
+    (sb-int:set-floating-point-modes :traps '())
+    (bl)
+    ;; Load model
+    (cffi:with-foreign-object (mp 'llama-mparams)
+      (mdp mp)
+      (setf (cffi:foreign-slot-value mp 'llama-mparams 'n-gpu-layers) 0)
+      (setf (cffi:foreign-slot-value mp 'llama-mparams 'use-mmap) nil) ; :bool slot: 0 is truthy in Lisp, NIL disables mmap
+      (setf *native-model* (wrap-load (namestring *native-model-path*) mp)))
+    (setf *native-vocab* (gv *native-model*))
+    ;; Create context
+    (let ((n-embd (ne *native-model*)))
+      (cffi:with-foreign-object (cp 'llama-cparams)
+        (cdp cp)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-ctx) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-batch) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-ubatch) 512)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-seq-max) 1)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'n-threads) 2)
+        (setf (cffi:foreign-slot-value cp 'llama-cparams 'embeddings) 1)
+        (setf *native-context* (wrap-ctx *native-model* cp)))
+      (format *error-output* "~&;; EMBEDDING: Native model loaded (~d-dim)~%" n-embd)))
+  (values *native-model* *native-context* *native-vocab*))
+#+end_src
 
-(defun embedding-native-get-dim ()
-  "Return the embedding dimension of the native model."
-  (embedding-native-load-model)
-  (%llama-n-embd *native-model*))
+** Embedding inference
 
+Computes a 768-dim single-float vector for the given text via llama.cpp.
+
+Pipeline:
+1. Load/cache model + context
+2. Tokenize text via =llama_tokenize= (takes =vocab*=, not =model*=, in the current llama.cpp API)
+3. Initialize batch via C wrapper (=llama_batch_init= returns struct-by-value)
+4. Fill batch: set =tokens=, =pos=, =n_seq_id=, =seq_id[0]=, =logits= for each position
+5. CRITICAL: set =batch.n_tokens= explicitly — =llama_batch_init= initializes it to 0
+6. Encode via C wrapper (=llama_encode= takes struct-by-value batch)
+7. Extract pooled embedding via =llama_get_embeddings_seq= (sequence-level pooling, strategy chosen by the model)
+   — falls back to =llama_get_embeddings_ith= if =pooling_type == NONE=
+8. Free batch memory via wrapper (=llama_batch_free= takes struct-by-value)
+
+NOTE: we write =seq_id= values directly into the arrays allocated by
+=llama_batch_init= (not foreign-alloc'd separately) to avoid double-free.
+
+#+begin_src lisp
 (defun embedding-backend-native (text)
   "Compute an embedding vector using the native llama.cpp backend.
-Returns a single-float vector of dimension n_embd."
-  (let* ((text-len (length text))
+Returns a (simple-array single-float (*)) of length n_embd (typically 768)."
+  (embedding-native-load-model)
+  (let* ((n-embd (ne *native-model*))
          (max-tokens 256)
          (tokens (cffi:foreign-alloc :int32 :count max-tokens))
-         (n-tokens 0))
+         (n-tok 0))
     (unwind-protect
         (progn
-          (embedding-native-load-model)
-          (setf n-tokens (%llama-tokenize *native-model* text text-len tokens max-tokens t t))
-          (when (zerop n-tokens)
-            (error "Native embedding: tokenization returned 0 tokens"))
-          (let* ((batch (%llama-batch-init n-tokens 0 1))
-                 (n-embd (embedding-native-get-dim))
-                 (result (make-array n-embd :element-type 'single-float :initial-element 0.0))
-                 (seq-id-ptr (cffi:foreign-alloc :int32 :count 1)))
-            (setf (cffi:mem-aref seq-id-ptr :int32 0) 0)
-            (unwind-protect
-                (progn
-                  (dotimes (i n-tokens)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'token) :int32 i)
-                          (cffi:mem-aref tokens :int32 i))
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'pos) :int32 i) i)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'n-seq-id) :int32 i) 1)
-                    (setf (cffi:mem-aref (cffi:foreign-slot-value batch '(:struct llama-batch) 'seq-id) :pointer i)
-                          seq-id-ptr))
-                  (let ((encode-result (%llama-encode *native-context* batch)))
-                    (when (not (zerop encode-result))
-                      (error "Native embedding: encode returned ~d" encode-result)))
-                  (let ((embd-ptr (%llama-get-embeddings-ith *native-context* (1- n-tokens))))
-                    (dotimes (i n-embd)
-                      (setf (aref result i) (cffi:mem-aref embd-ptr :float i)))))
-              (%llama-batch-free batch)
-              (cffi:foreign-free seq-id-ptr))
+          (setf n-tok (tok *native-vocab* text (babel:string-size-in-octets text :encoding :utf-8) tokens max-tokens t t))
+          (unless (plusp n-tok)
+            (error "Native embedding: tokenization failed (llama_tokenize returned ~d) for ~s" n-tok text))
+          (let ((result (make-array n-embd :element-type 'single-float :initial-element 0.0f0)))
+            (cffi:with-foreign-object (batch 'llama-batch)
+              (wrap-batch-init batch n-tok 0 1)
+              (setf (cffi:foreign-slot-value batch 'llama-batch 'n-tokens) n-tok)
+              (dotimes (i n-tok)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'token) :int32 i)
+                      (cffi:mem-aref tokens :int32 i))
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'pos) :int32 i) i)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'n-seq-id) :int32 i) 1)
+                (setf (cffi:mem-aref (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'seq-id) :pointer i) :int32 0) 0)
+                (setf (cffi:mem-aref (cffi:foreign-slot-value batch 'llama-batch 'logits) :int8 i) 1))
+              (let ((enc (wrap-encode *native-context* batch)))
+                (unless (zerop enc)
+                  (error "Native embedding: encode returned ~d" enc)))
+              (let* ((pooling (get-pooling *native-context*))
+                     (eptr (if (= pooling 0)
+                               (embd-ith *native-context* (1- n-tok))
+                               (embd-seq *native-context* 0))))
+                (dotimes (i n-embd)
+                  (setf (aref result i) (cffi:mem-aref eptr :float i))))
+              (wrap-batch-free batch))
             result))
       (cffi:foreign-free tokens))))
+#+end_src
 
-(defun embedding-backend-native-unload ()
+** Cleanup and unload
+
+#+begin_src lisp
+(defun embedding-native-unload ()
   "Release native model and context memory."
   (when *native-context*
-    (%llama-free *native-context*)
+    (fc *native-context*)
     (setf *native-context* nil))
   (when *native-model*
-    (%llama-free-model *native-model*)
-    (setf *native-model* nil))
+    (fm *native-model*)
+    (setf *native-model* nil *native-vocab* nil))
   (values))
 
-(pushnew (lambda () (embedding-backend-native-unload)) sb-ext:*exit-hooks*)
+(defun embedding-native-get-dim ()
+  "Return embedding dimension of loaded native model (0 if not loaded)."
+  (if *native-model*
+      (ne *native-model*)
+      0))
+#+end_src
+
+** Cosine similarity helper
+
+Used in tests and embedding comparisons.
+
+#+begin_src lisp
+(defun vector-cosine-similarity (a b)
+  "Cosine similarity between two equal-length vectors of single-floats."
+  (let ((dot 0.0d0) (anorm 0.0d0) (bnorm 0.0d0))
+    (dotimes (i (length a))
+      (let ((af (float (aref a i) 0.0d0))
+            (bf (float (aref b i) 0.0d0)))
+        (incf dot (* af bf))
+        (incf anorm (* af af))
+        (incf bnorm (* bf bf))))
+    (if (or (zerop anorm) (zerop bnorm))
+        0.0d0
+        (/ dot (sqrt (* anorm bnorm))))))
 #+end_src
 
 * Test Suite
+
 #+begin_src lisp
 (eval-when (:compile-toplevel :load-toplevel :execute)
   (ql:quickload :fiveam :silent t))
@@ -254,9 +321,41 @@ Returns a single-float vector of dimension n_embd."
   "Contract v0.4.1: semantically similar texts are closer than unrelated."
   (let ((v-auth (passepartout::embedding-backend-native "implement user login form"))
         (v-related (passepartout::embedding-backend-native "add password authentication"))
-        (v-unrelated (passepartout::embedding-backend-native "banana fruit yellow"))
-        (sim-related (passepartout::vector-cosine-similarity v-auth v-related))
-        (sim-unrelated (passepartout::vector-cosine-similarity v-auth v-unrelated)))
-    (is (> sim-related 0.5))
-    (is (> sim-related sim-unrelated))))
+        (v-unrelated (passepartout::embedding-backend-native "banana fruit yellow")))
+    (let ((sim-related (passepartout::vector-cosine-similarity v-auth v-related))
+          (sim-unrelated (passepartout::vector-cosine-similarity v-auth v-unrelated)))
+      (is (> sim-related 0.5))
+      (is (> sim-related sim-unrelated)))))
+#+end_src
+
+* C Wrapper Source
+
+The C wrapper bridges CFFI's pointer-only interface to llama.cpp's struct-by-value API.
+Compile with: =gcc -shared -fPIC -I/tmp/llama.cpp/include -o libllama_wrap.so llama_wrap.c -L/usr/local/lib -lllama=
+
+#+begin_src c :tangle ../scripts/llama_wrap.c
+// C wrapper for llama.cpp — bridges CFFI pointer params to struct-by-value
+// Compile: gcc -shared -fPIC -I/tmp/llama.cpp/include -o libllama_wrap.so llama_wrap.c -L/usr/local/lib -lllama
+
+#include <llama.h>
+
+struct llama_model * llama_wrap_model_load(const char * path, struct llama_model_params * params) {
+    return llama_model_load_from_file(path, *params);
+}
+
+struct llama_context * llama_wrap_new_context(struct llama_model * model, struct llama_context_params * params) {
+    return llama_init_from_model(model, *params);
+}
+
+int32_t llama_wrap_encode(struct llama_context * ctx, struct llama_batch * batch) {
+    return llama_encode(ctx, *batch);
+}
+
+void llama_wrap_batch_init(struct llama_batch * batch, int32_t n_tokens, int32_t embd, int32_t n_seq_max) {
+    *batch = llama_batch_init(n_tokens, embd, n_seq_max);
+}
+
+void llama_wrap_batch_free(struct llama_batch * batch) {
+    llama_batch_free(*batch);
+}
 #+end_src
diff --git a/org/system-model-embedding.org b/org/system-model-embedding.org
index e2cef4b..4af3d00 100644
--- a/org/system-model-embedding.org
+++ b/org/system-model-embedding.org
@@ -11,6 +11,7 @@
 - ~:sha256~ — integrity-only (explicit opt-in). SHA-256 hashing for environments where even trivial computation is undesirable.
 - ~:local~ — any OpenAI-compatible ~/api/embeddings~ endpoint (Ollama, vLLM, etc.)
 - ~:openai~ — the OpenAI ~/v1/embeddings~ API with an API key
+- ~:native~ — in-process inference via llama.cpp / CFFI. 768-dim nomic-embed-text-v1.5, zero network calls, <100ms per document on CPU. Requires model file at ~/.local/share/passepartout/models/nomic-embed-text-v1.5.Q4_K_M.gguf and libllama_wrap.so at /usr/local/lib.
 
 The embedding queue (~embed-queue-object~ / ~embed-all-pending~) decouples document indexing from the main loop. On each heartbeat tick, ~embed-all-pending~ drains the queue and embeds all accumulated objects. This prevents indexing traffic from blocking conversational responses.
 
@@ -27,7 +28,7 @@ This replaces the old ~system-embedding-gateway~ with the same logic but renamed
 (in-package :passepartout)
 
 (defvar *embedding-provider* :trigram
-  "Active embedding provider: :trigram, :sha256, :local, :openai.")
+  "Active embedding provider: :trigram, :sha256, :local, :openai, :native.")
 
 (defvar *embedding-queue* nil
   "Queue of text objects awaiting embedding.")
@@ -123,10 +124,14 @@ Pure Lisp, zero external dependencies, works fully offline."
   "Embed a single text string using the active backend."
   (let* ((selected (or *embedding-backend* *embedding-provider* :trigram))
          (backend (case selected
-                     (:local #'embedding-backend-local)
-                     (:openai #'embedding-backend-openai)
-                     (:sha256 #'embedding-backend-sha256)
-                     (t #'embedding-backend-trigram))))
+                    (:local #'embedding-backend-local)
+                    (:openai #'embedding-backend-openai)
+                    (:native
+                     (unless (fboundp 'embedding-backend-native)
+                       (embedding-native-ensure-loaded))
+                     #'embedding-backend-native)
+                    (:sha256 #'embedding-backend-sha256)
+                    (t #'embedding-backend-trigram))))
     (if backend
         (progn
           (log-message "EMBEDDING: Provider ~a, backend=~a" selected backend)
@@ -164,6 +169,34 @@ Pure Lisp, zero external dependencies, works fully offline."
       (setf *embedding-provider* kw)
       (log-message "EMBEDDING: Set provider to ~a from EMBEDDING_PROVIDER env" kw))))
 
+(defun embedding-native-ensure-loaded ()
+  "Lazy-load the native CFFI backend. First call blocks ~30s for model init."
+  (when (fboundp 'embedding-backend-native)
+    (return-from embedding-native-ensure-loaded t))
+  (let* ((data-dir (uiop:ensure-directory-pathname
+                    (or (uiop:getenv "PASSEPARTOUT_DATA_DIR")
+                        (namestring (merge-pathnames ".local/share/passepartout/"
+                                                    (user-homedir-pathname))))))
+         (native-file (merge-pathnames "lisp/system-model-embedding-native.lisp" data-dir)))
+    (handler-case
+        (progn
+          (load native-file :verbose nil :print nil)
+          (log-message "EMBEDDING: Native backend loaded from ~a" native-file))
+      (error (c)
+        (error "Failed to load native embedding backend (~a): ~a" native-file c)))))
+
+;; Preload native model if configured at startup
+(when (eq *embedding-provider* :native)
+  (log-message "EMBEDDING: Native provider configured, preloading model...")
+  (embedding-native-ensure-loaded)
+  (handler-case
+      (progn
+        (embedding-native-load-model)
+        (log-message "EMBEDDING: Native model preloaded (~d dims)"
+                     (embedding-native-get-dim)))
+    (error (c)
+      (log-message "EMBEDDING: Preload deferred: ~a (will retry on first call)" c))))
+
 (log-message "EMBEDDING: Gateway loaded with provider ~a" *embedding-provider*)
 #+end_src
 
diff --git a/scripts/llama_wrap.c b/scripts/llama_wrap.c
new file mode 100644
index 0000000..f16faa3
--- /dev/null
+++ b/scripts/llama_wrap.c
@@ -0,0 +1,24 @@
+// C wrapper for llama.cpp — bridges CFFI pointer params to struct-by-value
+// Compile: gcc -shared -fPIC -I/tmp/llama.cpp/include -o libllama_wrap.so llama_wrap.c -L/usr/local/lib -lllama
+
+#include <llama.h>
+
+struct llama_model * llama_wrap_model_load(const char * path, struct llama_model_params * params) {
+    return llama_model_load_from_file(path, *params);
+}
+
+struct llama_context * llama_wrap_new_context(struct llama_model * model, struct llama_context_params * params) {
+    return llama_init_from_model(model, *params);
+}
+
+int32_t llama_wrap_encode(struct llama_context * ctx, struct llama_batch * batch) {
+    return llama_encode(ctx, *batch);
+}
+
+void llama_wrap_batch_init(struct llama_batch * batch, int32_t n_tokens, int32_t embd, int32_t n_seq_max) {
+    *batch = llama_batch_init(n_tokens, embd, n_seq_max);
+}
+
+void llama_wrap_batch_free(struct llama_batch * batch) {
+    llama_batch_free(*batch);
+}