fix: TUI agent-responds uses text-match not unicode arrow
Some checks failed
Deploy (Gitea) / deploy (push) Failing after 2s
Some checks failed
Deploy (Gitea) / deploy (push) Failing after 2s
tmux capture-pane strips the ⬇ (U+2B07) character; grep on it always returns empty. Switch to case-insensitive text matching on actual LLM response content (hello, hi, greeting, hey). Also: reorder tests (cascade-parsing, eval-command, status-bar first to warm the TUI; agent-responds last), increase handshake timeout to 60s, increase agent poll timeout to 90s. TUI integration: 5/5 pass (was 4/5 with false-negative on agent-responds)
This commit is contained in:
@@ -279,7 +279,7 @@ The three structural multipliers are:
|
||||
|
||||
These compound. A coding session touching 20 files, performing 10 actions, and triggering 3 errors saves ~50,000-100,000 tokens compared to the same session with Claude Code.
|
||||
|
||||
** Per-Task Type Guesstemates
|
||||
** Per-Task Type Guesstimate
|
||||
|
||||
*** Coding (debugging, refactoring, PR review)
|
||||
|
||||
@@ -354,18 +354,6 @@ KV cache memory scales with context length:
|
||||
|
||||
Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1 GB. A 7-8B model on an RTX 3060 Ti (8 GB VRAM) or MacBook (16 GB unified memory) is a practical daily driver with Passepartout. Competitors at full context require 16-32 GB VRAM or cloud APIs.
|
||||
|
||||
** Open Questions and Risks
|
||||
|
||||
1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
|
||||
|
||||
2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
|
||||
|
||||
3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
|
||||
|
||||
4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
|
||||
|
||||
5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.
|
||||
|
||||
** Comparison Summary
|
||||
|
||||
| Metric | Passepartout | Claude Code | Hermes | OpenClaw |
|
||||
@@ -380,3 +368,16 @@ Passepartout at 4K effective context: ~67 MB KV cache. Competitor at 128K: ~2.1
|
||||
| Min VRAM for local | 4-6 GB | 16-32 GB | 24-48 GB | 8-16 GB |
|
||||
|
||||
*Conclusion:* Passepartout's architecture is designed to produce 2-3x token savings for coding, 13-24x for knowledge management, and 2x for life management at v1.0.0 maturity. The three structural advantages — sparse trees, deterministic safety, and REPL verification — compound. The critical risk is implementation gap: achieving the retrieval precision, dispatcher learning, and REPL integration depth required to realize the design.
|
||||
* Open Questions and Risks
|
||||
|
||||
1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
|
||||
|
||||
2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
|
||||
|
||||
3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.
|
||||
|
||||
4. *The 3-retry dispatcher loop.* When the dispatcher rejects a proposal, the rejection trace feeds back to the LLM for self-correction (up to 3 retries). If the dispatcher rejects 30% of proposals, the effective token multiplier is 1.39x per action. At 50% rejection (plausible during early use), it is 1.75x. This penalty decreases as the dispatcher accumulates rules.
|
||||
|
||||
5. *Competitor evolution.* Sparse retrieval is not patentable. Claude Code, Copilot, and others will implement similar mechanisms. The architectural advantage is real but finite in duration. The deterministic safety gate is the harder-to-replicate differentiator.
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user