tests: TUI integration + cascade parsing — precise LLM diagnostics

- TUI agent-responds: hardened to detect and FAIL on cascade/exhausted
  responses (previously a separate WARN-only test that let real
  cascade failures slip through)
- New TUI cascade-parsing test: /eval *provider-cascade* on screen,
  checks for clean keywords (no cl-dotenv quote artifacts)
- Pre-warm step: sbcl --eval '(ql:quickload :passepartout/tui)'
  before launching tmux, cuts TUI startup from ~120s to ~10s
- Removed test_agent_not_cascade_failure (absorbed into agent-responds)
- New integration test: test-provider-cascade-parsing verifies
  PROVIDER_CASCADE entries are keywords without quotes, matching
  registered backends — catches the exact cl-dotenv quote bug
- Fixed stop-daemon ghost symbol (removed export) and paren bug
- Contract section updated with numbered Phase 2/3 items
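
A minimal sketch of the kind of check test-provider-cascade-parsing describes, assuming PROVIDER_CASCADE arrives as a comma-separated string that may still carry quote characters. The function names, the input format, and the backend names :alpha and :beta are hypothetical, not Passepartout's actual code.

```lisp
;; Hypothetical sketch: names and the comma-separated format are assumptions.
(defun split-on-comma (string)
  "Split STRING on commas, returning a list of substrings."
  (loop with start = 0
        for pos = (position #\, string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun parse-provider-cascade (raw)
  "Parse RAW, a comma-separated provider list, into keywords, dropping stray quotes."
  (loop for entry in (split-on-comma raw)
        for clean = (string-trim '(#\Space #\Tab #\" #\') entry)
        unless (string= clean "")
          collect (intern (string-upcase clean) :keyword)))

(defun cascade-entry-clean-p (entry registered-backends)
  "True when ENTRY is a quote-free keyword that names a registered backend."
  (and (keywordp entry)
       (not (find-if (lambda (ch) (member ch '(#\" #\')))
                     (symbol-name entry)))
       (member entry registered-backends)
       t))
```

With this shape, a value that kept its surrounding quotes still parses to plain keywords, and an entry such as :|"ALPHA| fails the check instead of silently missing every registered backend.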
2026-05-06 08:56:07 -04:00
parent 9362c56678
commit 750918527d
4 changed files with 180 additions and 107 deletions


@@ -2,7 +2,9 @@
This document captures the rationale behind key architectural choices. It is not a specification - it is a thinking medium for future architects and contributors who need to understand why the system is built this way, not just how.
* One single agent
* Design
** One single agent
:PROPERTIES:
:ID: design-multi-agent-default
:END:
@@ -23,7 +25,7 @@ But the default assumption that complex reasoning tasks are best solved by multi
Passepartout is single-agent by default not from limitation but from conviction: for reasoning-heavy work where coherence matters, a unified memory space and single decision-making locus are architectural assets, not constraints.
* The Unified Memory Argument
** The Unified Memory Argument
:PROPERTIES:
:ID: design-unified-memory
:END:
@@ -42,7 +44,7 @@ Context window limits are largely a symptom of lazy architecture. The default ap
The unified memory argument is not that infinite context is free. It is that with proper architecture, effective infinite context is achievable without the synchronization and fragmentation costs of multi-agent systems.
* Org-Mode as Unified AST
** Org-Mode as Unified AST
:PROPERTIES:
:ID: design-org-unified-ast
:END:
@@ -75,7 +77,7 @@ The unified format is what makes the memory architecture work. The agent's memor
This is what "sovereignty" means in technical terms: the user owns the data in a format they can access, and the agent operates on the data in the same format they own.
* Homoiconicity as Foundation
** Homoiconicity as Foundation
:PROPERTIES:
:ID: design-homoiconicity
:END:
@@ -118,7 +120,7 @@ Six decades later, neural networks have arrived at the problem from a different
Lisp's time may finally have come. Not as a replacement for neural networks, but as the governor that makes them safe - the symbolic engine that verifies what the neural engine proposes, the homoiconic substrate that allows the system to inspect, modify, and improve its own reasoning. The machine that was designed for AI in 1958 may be the exact machine needed for AI in 2026 and beyond.
* The Probabilistic-Deterministic Split
** The Probabilistic-Deterministic Split
:PROPERTIES:
:ID: design-probabilistic-deterministic
:END:
@@ -137,7 +139,7 @@ This separation is the source of Passepartout's safety guarantee. Other agents a
The split also explains why the system gets safer over time without the LLM improving. The deterministic engine accumulates rules. The LLM proposes actions, the engine evaluates them against a growing rule set. Early versions block obvious dangers. Later versions block sophisticated attacks that were previously unknown. The safety grows logarithmically with the number of interactions, not linearly with model capability.
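
The propose-then-evaluate loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the rule representation, the action plist layout (:kind, :command), and every name here are hypothetical.

```lisp
;; Hypothetical sketch: rule and action representations are assumptions.
(defparameter *rules* '()
  "Accumulated deterministic rules, each a cons of (name . predicate).")

(defun add-rule (name predicate)
  "Register PREDICATE, which returns true when an action must be blocked."
  (push (cons name predicate) *rules*))

(defun evaluate-proposal (action)
  "Return :ALLOW, or :BLOCK plus the names of the violated rules, for ACTION."
  (let ((violated (loop for (name . predicate) in *rules*
                        when (funcall predicate action)
                          collect name)))
    (if violated
        (values :block violated)
        (values :allow nil))))

;; One example rule: block shell actions containing a recursive delete.
(add-rule :no-recursive-delete
          (lambda (action)
            (and (eq (getf action :kind) :shell)
                 (search "rm -rf" (getf action :command)))))
```

The rule set only grows through add-rule, so the evaluator gets stricter over time while the function that proposes actions stays untouched.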
* The Dispatcher as Learning System
** The Dispatcher as Learning System
:PROPERTIES:
:ID: design-bouncer-learning
:END:
@@ -156,7 +158,7 @@ The Dispatcher becomes, over time, not a guard that blocks bad actions but a rea
This is the bootstrap. The system begins dependent on human judgment because it has no basis for judgment of its own. Through accumulated decisions, it constructs a model of what is permitted and why. That model is the foundation for the deterministic symbolic engine that in v3.0.0 takes over the reasoning that the Dispatcher learned to perform.
* The REPL as Cognitive Substrate
** The REPL as Cognitive Substrate
:PROPERTIES:
:ID: design-repl-cognition
:END:
@@ -177,7 +179,22 @@ Third, the REPL is a shared substrate. When the agent evaluates code, that code
This is why the REPL becomes more important as the system matures. In early versions, it is a development tool. In v0.6.0 and beyond, it becomes a cognitive tool: the agent explores hypotheses by evaluating them, verifies the output of sub-agents by inspecting live state, and tests modifications before committing them to the knowledge graph.
* Literate Programming as Discipline
** Observability and the Thought Trace
:PROPERTIES:
:ID: design-observability
:END:
When a human asks why the system made a decision, the answer must be findable. In most AI systems, the reasoning is ephemeral - it exists in the model's activations and disappears when the session ends. In Passepartout, every significant cognitive event is written to an Org buffer as it happens.
The thought trace is the agent's journal, written in parallel with its reasoning. When the probabilistic engine generates a proposal, the trace records the input, the prompt, and the raw output. When the deterministic engine evaluates it, the trace records which rules were checked, which passed, which failed, and why. When an action is executed, the trace records the timestamp, the user who approved it (if human-in-the-loop), and the outcome.
This is not logging in the traditional sense. Logs are forensically useful but are written in a machine format optimized for storage, not for human reading. The thought trace is written in Org-mode: headlines for major events, property drawers for structured data, tags for categorization. The human can open the trace in a text editor and navigate it like any other Org file. They can search for a specific decision, filter by time range, find all actions blocked by a specific rule, or see the complete trajectory of a multi-step task.
The trace becomes the foundation for the Dispatcher's learning. Every blocked action is in the trace. Every approved exception is in the trace. The human-in-the-loop decisions are in the trace. The system does not need to reconstruct what happened - it reads what happened from the trace it wrote.
Without observability, the system is a black box that happens to produce correct outputs sometimes. With observability, the system is auditable. The human can see why a decision was made, identify where the reasoning failed, and course-correct the system or their own behavior accordingly.
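
As a rough illustration of a trace entry with a headline, tags, and a property drawer, the sketch below writes one Org entry to a stream. The function name, keyword arguments, and property keys are assumptions made for the example, not the trace format Passepartout actually uses.

```lisp
;; Hypothetical sketch: entry layout and names are assumptions.
(defun write-trace-entry (stream &key title tags properties)
  "Write one Org headline with TAGS and a property drawer to STREAM.
PROPERTIES is a plist such as (:rule \"no-recursive-delete\")."
  ;; The tag list is rendered as :tag1:tag2: only when TAGS is non-empty.
  (format stream "* ~a~@[ :~{~a~^:~}:~]~%" title tags)
  (format stream ":PROPERTIES:~%")
  (loop for (key value) on properties by #'cddr
        do (format stream ":~a: ~a~%" (string-upcase (string key)) value))
  (format stream ":END:~%"))
```

Because the output is plain Org text, the same entry can be read back by the agent or opened by a human without a separate log parser.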
** Literate Programming as Discipline
:PROPERTIES:
:ID: design-literate-programming
:END:
@@ -198,7 +215,7 @@ Together, these constraints create a development experience that is slower in th
The literate programming discipline is not about producing documentation. It is about producing code whose correctness has been verified by the act of explaining it.
* The Evaluation Harness
** The Evaluation Harness
:PROPERTIES:
:ID: design-evaluation-harness
:END:
@@ -213,22 +230,7 @@ Beyond SWE-bench, the harness includes chaos testing. The system is subjected to
The harness also supports regression testing on the skill set. Every skill is tested against a suite of known inputs and expected outputs. When a modification is proposed to any skill - whether through manual editing or the agent's own self-modification - the test suite runs first. A skill that fails its tests is rejected before it can propagate to the running image. This is not a convenience - it is the mechanism by which self-modification remains safe. The agent can propose changes, but the harness verifies them before the changes take effect.
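
The gate described above, where a proposed skill must pass its suite before it reaches the running image, might look like the sketch below. It assumes a skill is a one-argument function and a test case is an (input . expected) pair; the registry and all names are hypothetical.

```lisp
;; Hypothetical sketch: skill and test-case representations are assumptions.
(defun skill-passes-suite-p (skill-fn test-cases)
  "True when SKILL-FN maps every input in TEST-CASES to its expected output.
Each test case is a cons of (input . expected); a signaled error counts as failure."
  (every (lambda (test-case)
           (handler-case
               (equal (funcall skill-fn (car test-case)) (cdr test-case))
             (error () nil)))
         test-cases))

(defun accept-skill-change (registry name candidate-fn test-cases)
  "Install CANDIDATE-FN under NAME in REGISTRY only if it passes TEST-CASES."
  (if (skill-passes-suite-p candidate-fn test-cases)
      (progn (setf (gethash name registry) candidate-fn)
             :accepted)
      :rejected))
```

A rejected candidate never touches the registry, so the previously installed version of the skill keeps running.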
* Observability and the Thought Trace
:PROPERTIES:
:ID: design-observability
:END:
When a human asks why the system made a decision, the answer must be findable. In most AI systems, the reasoning is ephemeral - it exists in the model's activations and disappears when the session ends. In Passepartout, every significant cognitive event is written to an Org buffer as it happens.
The thought trace is the agent's journal, written in parallel with its reasoning. When the probabilistic engine generates a proposal, the trace records the input, the prompt, and the raw output. When the deterministic engine evaluates it, the trace records which rules were checked, which passed, which failed, and why. When an action is executed, the trace records the timestamp, the user who approved it (if human-in-the-loop), and the outcome.
This is not logging in the traditional sense. Logs are forensically useful but are written in a machine format optimized for storage, not for human reading. The thought trace is written in Org-mode: headlines for major events, property drawers for structured data, tags for categorization. The human can open the trace in Emacs and navigate it like any other Org file. They can search for a specific decision, filter by time range, find all actions blocked by a specific rule, or see the complete trajectory of a multi-step task.
The trace becomes the foundation for the Dispatcher's learning. Every blocked action is in the trace. Every approved exception is in the trace. The human-in-the-loop decisions are in the trace. The system does not need to reconstruct what happened - it reads what happened from the trace it wrote.
Without observability, the system is a black box that happens to produce correct outputs sometimes. With observability, the system is auditable. The human can see why a decision was made, identify where the reasoning failed, and course-correct the system or its own behavior accordingly.
* The MCP Strategy
** The MCP Strategy
:PROPERTIES:
:ID: design-mcp-strategy
:END:
@@ -243,7 +245,7 @@ The alternative is to build MCP wrappers in Python or TypeScript and bridge to L
Passepartout's native client is smaller, faster, and more maintainable. The MCP client is a skill, not a core component. It can be reloaded, replaced, or removed without restarting the agent. The agent can add new MCP tool integrations by loading new skills, not by deploying new infrastructure.
* Local-First Architecture
** Local-First Architecture
:PROPERTIES:
:ID: design-local-first
:END:
@@ -277,7 +279,7 @@ The three structural multipliers are:
These compound. A coding session touching 20 files, performing 10 actions, and triggering 3 errors saves ~50,000-100,000 tokens compared to the same session with Claude Code.
** Per-Task Type Analysis
** Per-Task Type Guesstimates
*** Coding (debugging, refactoring, PR review)