fix: TUI agent-responds uses text-match not unicode arrow

tmux capture-pane strips the ⬇ (U+2B07) character, so grepping for it always returns empty. Switch to case-insensitive text matching on the actual LLM response content (hello, hi, greeting, hey).

Also:
- Reorder tests: cascade-parsing, eval-command, and status-bar run first to warm the TUI; agent-responds runs last.
- Increase the handshake timeout to 60s.
- Increase the agent poll timeout to 90s.

TUI integration: 5/5 pass (was 4/5, with a false negative on agent-responds).
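
The text-match check described in this message can be sketched roughly as follows. This is an illustration only, not the test code from this commit: the function name `agent_responded` is hypothetical, and the word-boundary anchoring and the optional plural on "greeting" are assumptions added so that a word like "this" does not count as "hi".

```python
import re

# Greeting words taken from the commit message. The \b anchors and the
# optional trailing "s" on "greeting" are assumptions of this sketch.
GREETING_RE = re.compile(r"\b(hello|hi|hey|greetings?)\b", re.IGNORECASE)


def agent_responded(pane_text: str) -> bool:
    """Return True if the captured pane text contains a greeting word."""
    return GREETING_RE.search(pane_text) is not None
```

The function takes the pane contents as a plain string, so it can be exercised without a running tmux session.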
2026-05-06 09:07:16 -04:00
parent 750918527d
commit 7c9cc629a1
5 changed files with 45 additions and 41 deletions
--- a/docs/DESIGN_DECISIONS.org
+++ b/docs/DESIGN_DECISIONS.org
@@ -279,7 +279,7 @@ The three structural multipliers are:

 These compound. A coding session touching 20 files, performing 10 actions, and triggering 3 errors saves ~50,000-100,000 tokens compared to the same session with Claude Code.

-** Per-Task Type Guesstemates
+** Per-Task Type Guesstimate

 *** Coding (debugging, refactoring, PR review)

@@ -354,18 +354,6 @@ KV cache memory scales with context length:

 Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1 GB. A 7-8B model on an RTX 3060 Ti (8 GB VRAM) or MacBook (16 GB unified memory) is a practical daily driver with Passepartout. Competitors at full context require 16-32 GB VRAM or cloud APIs.

-** Open Questions and Risks
-
-1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
-
-2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
-
-3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
-
-4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
-
-5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.
-
 ** Comparison Summary

 | Metric                      | Passepartout        | Claude Code             | Hermes                       | OpenClaw              |
@@ -380,3 +368,16 @@ Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1
 | Min VRAM for local          | 4-6 GB              | 16-32 GB                | 24-48 GB                     | 8-16 GB               |

 *Conclusion:* Passepartout's architecture is designed to produce 2-3x token savings for coding, 13-24x for knowledge management, and 2x for life management at v1.0.0 maturity. The three structural advantages — sparse trees, deterministic safety, and REPL verification — compound. The critical risk is implementation gap: achieving the retrieval precision, dispatcher learning, and REPL integration depth required to realize the design.
+* Open Questions and Risks
+
+1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
+
+2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
+
+3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
+
+4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
+
+5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.
+
+
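
The multipliers quoted in item 4 of "Open Questions and Risks" can be checked with a small model. The sketch below assumes that each proposal is rejected independently with the same probability and that every attempt costs roughly the same number of tokens; neither assumption is stated in the document, and `expected_attempts` is a hypothetical helper name. Under that model the expected number of LLM calls per action is a truncated geometric series, 1 + p + p^2 + ..., with one term per allowed attempt.

```python
def expected_attempts(p_reject: float, max_attempts: int) -> float:
    """Expected LLM calls per action.

    Attempt k (0-based) only happens if the previous k attempts were all
    rejected, which has probability p_reject ** k.
    """
    return sum(p_reject ** k for k in range(max_attempts))


# Three attempts in total (one initial try plus two retries).
three_total_30 = expected_attempts(0.3, 3)
three_total_50 = expected_attempts(0.5, 3)

# Four attempts in total (one initial try plus three retries).
four_total_30 = expected_attempts(0.3, 4)
four_total_50 = expected_attempts(0.5, 4)
```

With three attempts in total, the model gives 1.39x at 30% rejection and 1.75x at 50%, matching the quoted figures. If "up to 3 retries" means an initial attempt plus three retries, the same model gives about 1.42x and 1.88x, so the quoted multipliers appear to count three attempts overall.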