fix: TUI agent-responds uses text-match not unicode arrow
Some checks failed
Deploy (Gitea) / deploy (push) Failing after 2s

tmux capture-pane strips the ⬇ (U+2B07) character; grep on it always
returns empty. Switch to case-insensitive text matching on actual
LLM response content (hello, hi, greeting, hey).

Also: reorder tests (cascade-parsing, eval-command, status-bar first
to warm the TUI; agent-responds last), increase handshake timeout
to 60s, increase agent poll timeout to 90s.

TUI integration: 5/5 pass (was 4/5 with false-negative on agent-responds)
This commit is contained in:
2026-05-06 09:07:16 -04:00
parent 750918527d
commit 7c9cc629a1
5 changed files with 45 additions and 41 deletions

View File

@@ -279,7 +279,7 @@ The three structural multipliers are:
These compound. A coding session touching 20 files, performing 10 actions, and triggering 3 errors saves ~50,000-100,000 tokens compared to the same session with Claude Code.
** Per-Task Type Guesstemates
** Per-Task Type Guesstimate
*** Coding (debugging, refactoring, PR review)
@@ -354,18 +354,6 @@ KV cache memory scales with context length:
Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1 GB. A 7-8B model on an RTX 3060 Ti (8 GB VRAM) or MacBook (16 GB unified memory) is a practical daily driver with Passepartout. Competitors at full context require 16-32 GB VRAM or cloud APIs.
** Open Questions and Risks
1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.
** Comparison Summary
| Metric | Passepartout | Claude Code | Hermes | OpenClaw |
@@ -380,3 +368,16 @@ Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1
| Min VRAM for local | 4-6 GB | 16-32 GB | 24-48 GB | 8-16 GB |
*Conclusion:* Passepartout's architecture is designed to produce 2-3x token savings for coding, 13-24x for knowledge management, and 2x for life management at v1.0.0 maturity. The three structural advantages — sparse trees, deterministic safety, and REPL verification — compound. The critical risk is implementation gap: achieving the retrieval precision, dispatcher learning, and REPL integration depth required to realize the design.
* Open Questions and Risks
1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.