passepartout: v0.4.1 Design Cleanup

- Remove system-prompt-augment mechanism, introduce *standing-mandates*
- Fix false token-overhead claims in DESIGN_DECISIONS + ROADMAP
- Update security vector count 9→10 across all docs and dispatcher docstring
- Rewrite README with agent section, soften aspirational claims
- Register 10 cognitive tools in programming-tools.org with test suite
- Enforce NO-HARDCODED-CONSTANTS in .env.example
- ROADMAP: mark v0.3.x patches DONE, add LOGBOOKs, mark releases
- AGENTS.md: rewrite compact (180 to 50 lines), move refs to CONTRIBUTING
- Normalize org tangle directives to file-level PROPERTY inheritance
2026-05-07 16:44:59 -04:00
parent d3b74f5c88
commit 639bc348d9
25 changed files with 1555 additions and 144 deletions


@@ -77,8 +77,9 @@ Every action the LLM proposes passes through a stack of deterministic gates befo
| 600 | security-permissions | Tool permission table (allow/ask/deny per tool) |
| 600 | security-vault | Credential storage integrity |
| 500 | security-policy | Requires :explanation on every action |
| 150 | security-dispatcher | 9-vector safety: secrets, paths, shell, lisp, network, |
| | (the Dispatcher) | privacy, high-impact approval |
| 150 | security-dispatcher | 11-check safety: lisp, secret path, self-build, |
| | (the Dispatcher) | content exposure, vault, privacy tags, privacy text, |
| | | shell safety, network exfil, high-impact approval |
| 95 | security-validator | Protocol schema validation |
| 100 | system-archivist | Scribe and Gardener maintenance on heartbeat |
| 80 | system-event-orchestrator | Cron job dispatch on heartbeat |


@@ -6,57 +6,111 @@
* Philosophy
Passepartout is built on a "Zero-Bloat" mandate. The core kernel is mathematically pure, pushing all peripheral logic, API integrations, and routing to hot-reloadable "Skills".
* TDD Discipline (Red-Green-Refactor)
* Development Workflow
All code changes MUST follow this cycle:
The full development cycle is described in ~AGENTS.md~. In summary:
1. *Write a failing test* — capture the desired behavior as a FiveAM test
in a =* Test Suite= section within the relevant =.org= file
2. *Prove it fails* — run =sbcl --eval "(asdf:test-system :passepartout)"=
and confirm the new test fails (RED) before writing implementation
3. *Write the code* — modify the implementation in the same =.org= file
4. *Prove it passes* — run the test suite again, confirm GREEN
5. *Reflect* — ensure the test and code are both in the =.org= literate source
1. *Think in org* — write reasoning and goals in the .org file
2. *Write contract* — define each function's behavior in a ~** Contract~ section
3. *TDD from contract* — each contract item becomes a ~fiveam:test~; prove RED then GREEN
4. *Reflect in org* — ensure implementation is in .org source
5. *Update literate prose* — explain the code: what, why, how it connects
For *existing code* that lacks tests: write a characterization test that
captures current behavior as the spec. Then refactor.
* Literate Programming
No test may be committed without proof it was first run to failure.
~.org~ files in ~org/~ are the source of truth. ~lisp/~ files are generated by ~org-babel-tangle~.
* Literate Granularity
We strictly adhere to Literate Programming using Org-mode.
- *Never* edit `.lisp` files in `src/` directly.
- Modify the corresponding `.org` files in the `literate/` or `skills/` directories.
- Run `org-babel-tangle` to generate the source code.
- Every architectural decision, constraint, and implementation detail must be documented alongside the code in the `.org` file.
- Never edit =lisp/= files directly — always modify the corresponding =org/= file
- All ~#+begin_src lisp~ blocks in a file inherit their tangle destination from the file-level ~#+PROPERTY: header-args:lisp :tangle ../lisp/FILE.lisp~
- Every architectural decision, constraint, and implementation detail must be documented alongside the code
* Contracts and Tests
Every code change starts with a contract and a failing test. Write a ~** Contract~ section listing each function's behavior, then create a ~fiveam:test~ in the ~* Test Suite~ section for each contract item.
To run tests for a specific file:
#+begin_src bash
sbcl --noinform \
--eval '(load (merge-pathnames "quicklisp/setup.lisp" (user-homedir-pathname)))' \
--eval '(ql:quickload :passepartout :silent t)' \
--eval '(load "lisp/FILE.lisp")' \
--eval '(fiveam:run (intern "SUITE-NAME" :passepartout-TESTS))' --quit
#+end_src
No test may be committed without proof it was first run to failure (RED).
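A contract item and its tests might look like the following. This is a minimal sketch, assuming FiveAM is already loaded (for example via ~(ql:quickload :fiveam)~); the function ~word-count~, its contract, and the suite name are hypothetical and exist only to illustrate the RED-then-GREEN cycle.

#+begin_src lisp
;; Hypothetical contract item:
;;   (word-count string) returns the number of whitespace-separated
;;   words in STRING; the empty string yields 0.
(defun word-count (string)
  "Count runs of non-whitespace characters in STRING."
  (let ((count 0)
        (in-word nil))
    (loop for ch across string
          do (if (member ch '(#\Space #\Tab #\Newline))
                 (setf in-word nil)
                 (unless in-word
                   (incf count)
                   (setf in-word t))))
    count))

;; One FiveAM test per contract clause (suite name is illustrative).
(fiveam:def-suite contract-sketch-suite)
(fiveam:in-suite contract-sketch-suite)

(fiveam:test word-count-empty
  (fiveam:is (= 0 (word-count ""))))

(fiveam:test word-count-basic
  (fiveam:is (= 3 (word-count "think in org"))))

;; Run with: (fiveam:run! 'contract-sketch-suite)
#+end_src

To observe RED first, evaluate the tests before ~word-count~ is defined (or with a stub that returns ~nil~), confirm they fail, then add the implementation and rerun.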
* Skill Creation Standard
Skills are the building blocks of Passepartout. They reside in the `skills/` directory.
A skill must define:
1. *Trigger*: A lambda determining if the skill should activate based on the context.
2. *Probabilistic Gate*: Optional. Generates a prompt for the LLM.
3. *Deterministic Gate*: A hardcoded Lisp function that guarantees safety or executes side-effects (the Dispatcher pattern).
A skill is a =.org= file in =org/= that defines:
Example Registration:
1. *Contract* — what the skill guarantees
2. *Implementation* — the code, tangled to ~lisp/~ via ~#+PROPERTY: header-args:lisp~
3. *Skill Registration* — a ~defskill~ form with ~:priority~, ~:trigger~, ~:probabilistic~ / ~:deterministic~
4. *Test Suite* — ~fiveam:test~ forms verifying the contract
Example:
#+begin_src lisp
(defskill :skill-example
(defskill :passepartout-example
:priority 100
:trigger (lambda (ctx) ...)
:probabilistic nil
:probabilistic (lambda (ctx) ...)
:deterministic (lambda (action ctx) ...))
#+end_src
* The Unified Envelope (Communication Protocol)
All inter-process communication occurs via the Unified Envelope. Do not use legacy specific types like `:CHAT`.
- Always use semantic types: `:REQUEST`, `:EVENT`, `:RESPONSE`, `:STATUS`, `:LOG`.
- Include routing metadata in the `:META` block (e.g., `(:SOURCE :TUI)`).
- Ensure generated `:REQUEST` messages include a mandatory `:TARGET` field.
* Project Structure
* Pull Request Process
1. Choose an Org file and write a failing test in its =* Test Suite= section.
2. Tangle and run to confirm RED (the test fails).
3. Write the implementation in the same Org file, tangle, run to confirm GREEN.
4. Ensure your working tree is clean.
5. Run the full test suite: =sbcl --eval "(asdf:test-system :passepartout)"=.
6. Submit a PR outlining the architectural intent and the specific Literate changes.
| Directory                      | Purpose                                       |
|--------------------------------+-----------------------------------------------|
| =org/=                         | Literate source files (edit these)            |
| =lisp/=                        | Tangled .lisp output (never edit)             |
| =docs/=                        | ROADMAP, ARCHITECTURE, DESIGN_DECISIONS, etc. |
| =scripts/=                     | Build and utility scripts                     |
| =~/.local/share/passepartout/= | XDG data dir (deployed lisp files)            |
| =~/.config/passepartout/=      | Config (.env)                                 |
* Key Libraries
| Library | Purpose |
|------------------+----------------------------------|
| Croatoan | TUI (terminal UI) |
| usocket | TCP sockets (daemon protocol) |
| bordeaux-threads | Threading (reader thread) |
| dexador | HTTP client (LLM API calls) |
| cl-ppcre | Regex (search-files, dispatcher) |
| ironclad | SHA-256 (Merkle hashing) |
| hunchentoot | HTTP server |
| cl-json | JSON encoding/decoding |
* Protocol
All inter-process communication uses the Unified Envelope protocol over TCP (port 9105). Message types: ~:REQUEST~, ~:EVENT~, ~:RESPONSE~, ~:STATUS~, ~:LOG~. Each message includes a ~:META~ block with routing metadata.
* Pre-Commit Hook
Validates staged org files by tangling + structural-checking:
#+begin_src bash
ln -sf ../../scripts/pre-commit-repl-check .git/hooks/pre-commit
#+end_src
Runs automatically on ~git commit~.
* Testing Tools
** TUI REPL (~/eval~)
The TUI has a built-in command for live evaluation:
- ~/eval (+ 1 2)~ → result displayed in chat window
- ~/eval (add-msg :system "test")~ → inject a test message
** Tmux (TUI integration testing)
#+begin_src bash
tmux new-session -d -s test "passepartout tui 2>&1 | tee /tmp/tui.log"
tmux send-keys -t test "hello world" Enter
tmux capture-pane -t test -p -S -200
tmux kill-session -t test
#+end_src
** Swank (Emacs REPL for TUI)
1. Start TUI: ~passepartout tui~
2. In Emacs: ~M-x slime-connect RET 127.0.0.1 RET 4006~
3. ~C-M-x~ any form from =org/gateway-tui.org= → evaluates in live TUI process
4. Configure port: ~export TUI_SWANK_PORT=4009~ (default: 4006)


@@ -321,7 +321,7 @@ The structural multipliers are:
1. *Sparse tree retrieval* — the foveal-peripheral model renders relevant Org subtrees (titles and properties for peripheral nodes, full content for foveal and semantically relevant nodes). Active context stays at 2,000–4,000 tokens. A "load everything" architecture serializes the entire knowledge base at 50,000–150,000 tokens. The mechanism is provably cheaper; the exact multiplier depends on memex size and complexity.
2. *Deterministic safety* — the 9-vector Dispatcher gate stack runs in pure Lisp. Every gate is a Common Lisp function. Verification costs 0 LLM tokens per action. Competitors use prompt-based guardrails consuming 100–500 LLM tokens per verification. This multiplier is mathematically infinite — a Lisp function call costs no tokens, a guardrail paragraph in a system prompt costs tokens proportional to its length.
2. *Deterministic safety* — the 10-vector Dispatcher gate stack runs in pure Lisp. Every gate is a Common Lisp function. Verification costs 0 LLM tokens per action. Competitors use prompt-based guardrails consuming 100–500 LLM tokens per verification. This multiplier is mathematically infinite — a Lisp function call costs no tokens, a guardrail paragraph in a system prompt costs tokens proportional to its length.
3. *REPL verification* — code is tested in the running image before it is committed. Errors surface in milliseconds at 0 LLM tokens. Competitors discover errors after generation and pay 500–2,000 tokens per correction round-trip. The REPL eliminates the most expensive kind of LLM call: the one that produced wrong code and needs a do-over.
@@ -432,7 +432,7 @@ The critical risk is implementation: achieving the retrieval precision, Dispatch
1. *Retrieval accuracy is the bottleneck.* If sparse tree retrieval loads the wrong subtree (low-similarity but causally relevant), the LLM makes unfixable errors. The architecture assumes embedding quality is "good enough" — this is untested at scale.
2. *System prompt overhead can consume savings.* Every =think= cycle iterates all registered skills and calls every =system-prompt-augment= function. With 20+ skills, a trivial interaction could carry 3,000-8,000 tokens of overhead before user input is even processed. This overhead is flat per-call, so it disproportionately affects short interactions.
2. *System prompt overhead can consume savings.* Every =think= cycle builds the full system prompt from IDENTITY + TOOLS + CONTEXT + LOGS. With the foveal-peripheral context model growing over time and the tool belt expanding with skills, the fixed overhead is non-trivial. However, it is driven by context and tool descriptions, not by the ~*standing-mandates*~ list (which contributes ~40 tokens when a single mandate fires, and 0 otherwise). Prefix caching (v0.5.0) is the primary mitigation for this overhead.
3. *Model size vs context quality.* A 3.8B model with perfect context cannot match a 70B model on complex multi-file refactors regardless of context quality. Model size independently determines reasoning depth. The minimum viable model is likely 7-13B parameters for engineering work.


@@ -34,8 +34,8 @@ On release:
3. If a ~CHANGELOG.md~ is needed for packaging tools, auto-generate it from ROADMAP DONE items
** v0.1.0: The Autonomous Foundation — RELEASED 2026-04-20
:PROPERTIES:
:RETROSPECTIVE: [2026-05-07 Wed]
:LOGBOOK:
- State "DONE" from "TODO" [2026-04-20 Mon 19:05]
:END:
The secure, auditable Lisp kernel. All core infrastructure in place.
@@ -129,8 +129,8 @@ The actuator registry pattern makes MCP tools (v0.7.0) possible — they registe
The test infrastructure established in v0.1.0 becomes the TDD runner (v0.7.1) and the SWE-bench harness (v0.9.0).
** v0.2.0: Interactive Refinement — RELEASED 2026-04-29
:PROPERTIES:
:RETROSPECTIVE: [2026-05-07 Wed]
:LOGBOOK:
- State "DONE" from "TODO" [2026-04-29 Wed 20:17]
:END:
The "Brain" meets the "Machine." Standardization and professionalization of the user interface and environment.
@@ -190,12 +190,14 @@ The setup wizard established the "works out of the box" constraint that the gate
Copy-on-write snapshots (deep-copying the memory hash table on every write) gave the pipeline crash recovery. The snapshot mechanism is the root of MVCC concurrency (v0.6.1).
** v0.3.0: Event Orchestration + HITL — DONE, UNRELEASED
** v0.3.0: Event Orchestration + HITL — RELEASED 2026-05-06
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-06 Wed 15:50]
:END:
Unified control plane, Human-in-the-Loop state management, and backfill remediation
for stubs and gaps from v0.1.0/v0.2.0. All features are implemented but not yet
published. The security hardening patches (v0.3.1–0.3.3) will ship as follow-up
point releases before v0.4.0 feature work begins.
for stubs and gaps from v0.1.0/v0.2.0. Security hardening followed as
v0.3.1–v0.3.3 point releases.
*** DONE Secret Exposure Gate, Shell Safety, Lisp Validation
:PROPERTIES:
@@ -384,7 +386,14 @@ CLOSED: [2026-05-03 Sun]
The Dispatcher's role has evolved beyond security guard. It is the seed of the deterministic engine — it learns to execute procedures without invoking the neural net.
*** v0.3.1 — TODO Parser RCE elimination
*** DONE Parser RCE elimination
:PROPERTIES:
:ID: id-v031-parser-rce
:CREATED: [2026-05-06 Wed]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-06 Wed 16:38]
:END:
Rationale: SBCL's default ~*read-eval*~ value is ~t~, enabling the ~#.~ reader macro to execute arbitrary Lisp forms during parsing. Three code paths in the current codebase process untrusted input with ~read-from-string~ or ~read~ without binding ~*read-eval*~ to ~nil~. Each represents a remote code execution vector that bypasses all deterministic safety gates — the Dispatcher's shell safety check, path protection, secret scanning, and network exfiltration detection never execute because the malicious form is evaluated during parsing, before the action plist is even constructed.
@@ -393,7 +402,14 @@ Rationale: SBCL's default ~*read-eval* accessor is ~t~, enabling the ~#.~ reader
- Wrap ~read-from-string~ in ~action-system-execute~ (core-loop-act.lisp:62) with ~(let ((*read-eval* nil)) ...)~ — the ~:system :eval~ path executes untrusted payload code. Explicitly assert that this path requires the Dispatcher's approval gate.
- Add FiveAM test: inject ~"(#.(shell \"echo pwned\"))"~ into the ~think()~ pipeline and assert no shell execution occurs.
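The binding described above can be sketched in portable Common Lisp. The helper name ~safe-read-from-string~ is hypothetical and not taken from the codebase; the roadmap item only specifies binding ~*read-eval*~ to ~nil~ around the existing reads.

#+begin_src lisp
;; Hypothetical helper: reads one form with #. evaluation disabled.
(defun safe-read-from-string (string)
  "Read one form from STRING with *READ-EVAL* bound to NIL.
Returns (values form t) on success and (values nil nil) when the
reader rejects the input (for example on #. or unbalanced parens)."
  (handler-case
      (let ((*read-eval* nil))
        ;; With *READ-EVAL* NIL the #. reader macro signals a
        ;; READER-ERROR instead of evaluating the embedded form.
        (values (read-from-string string) t))
    (error ()
      (values nil nil))))
#+end_src

Under this sketch, ~(safe-read-from-string "(:tool \"shell\")")~ yields the plist and ~t~, while an input containing ~#.~ yields ~nil~ and ~nil~ without evaluating anything. Disabling ~#.~ does not stop the reader from interning symbols, so it is one layer of defense and not a complete input sanitizer.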
*** v0.3.2 — TODO Shell safety & actuator sandboxing
*** DONE Shell safety & actuator sandboxing
:PROPERTIES:
:ID: id-v032-shell-sandbox
:CREATED: [2026-05-06 Wed]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-06 Wed 16:46]
:END:
Rationale: The ~:system :eval~ actuator path is currently unchecked by the Dispatcher's approval gate — only ~:shell~ and ~:tool "shell"~ trigger HITL. The shell actuator wraps commands through double ~bash -c~ nesting (~system-actuator-shell.lisp:10~), where Lisp's ~format~ with ~s~ produces S-expression-safe strings, not shell-safe strings. A command containing quotes or substitution characters can break out. Additionally, skill files loaded via ~skill-initialize-all~ execute arbitrary Lisp in jailed packages — a skill file containing ~(uiop:run-program "dangerous")~ executes immediately on load before any gate can inspect it.
@@ -402,7 +418,14 @@ Rationale: The ~:system :eval~ actuator path is currently unchecked by the Dispa
- Add skill sandbox mode for ~skill-initialize-all~: load each skill's code into a temporary jailed package, run the registered trigger function in isolation, verify it imports no restricted symbols (from CL package: ~run-program~, ~shell~, ~run-shell-command~), then promote to the live registry on pass.
- Add FiveAM test: register a skill containing ~(uiop:run-program "echo test")~ in the body and verify the sandbox blocks its promotion.
*** v0.3.3 — TODO TUI Critical Fixes
*** DONE TUI Critical Fixes
:PROPERTIES:
:ID: id-v033-tui-fixes
:CREATED: [2026-05-06 Wed]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-06 Wed 17:59]
:END:
Rationale: The TUI is Passepartout's only interface. OpenClaw distributes across 25+ messaging channels with voice, Canvas, and macOS/iOS apps. Hermes Agent ships multiline editing, slash-command autocomplete, conversation history, interrupt-and-redirect, and streaming tool output in its TUI. Passepartout's Croatoan TUI must carry the product alone, and it currently lacks word wrap, cursor movement, resize handling, connection-loss feedback, a quit command, and persistent history. None of these fixes require daemon changes — they are pure client-side Croatoan work that closes the gap from "proof of concept" to "daily driver."
@@ -415,7 +438,10 @@ Rationale: The TUI is Passepartout's only interface. OpenClaw distributes across
- Message list storage: replace the O(n²) ~(nth i msgs)~ list indexing with a simple adjustable vector. ~add-msg~ appends; ~view-chat~ iterates with ~aref~. The vector is resized as needed. Same API surface, 100x speedup on message-heavy sessions.
- Add FiveAM tests: word-wrap produces correct line count for a 200-character string at 80-column width; cursor left/right wraps at buffer boundaries; SIGWINCH preserves message state; ~/quit~ saves and restores history.
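The adjustable-vector storage from the message list item could look roughly like this. It is a sketch under assumptions: the variable name ~*messages*~, the plist shape of a message, and the iteration helper ~map-messages~ are illustrative; only ~add-msg~ and the use of ~aref~ come from the text above.

#+begin_src lisp
;; Assumed storage: an adjustable vector with a fill pointer.
(defvar *messages*
  (make-array 0 :adjustable t :fill-pointer 0)
  "Chat messages, oldest first.")

(defun add-msg (role text)
  "Append a message and return its index in *MESSAGES*."
  ;; VECTOR-PUSH-EXTEND grows the vector when it is full.
  (vector-push-extend (list :role role :text text) *messages*))

(defun map-messages (fn)
  "Call FN on each message in order, using constant-time AREF access."
  (loop for i from 0 below (length *messages*)
        do (funcall fn (aref *messages* i))))
#+end_src

Each ~aref~ is constant time, whereas ~(nth i msgs)~ walks ~i~ cons cells, which is what makes an indexed loop over a list quadratic.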
** v0.4.0: Production Hardening
** v0.4.0: Production Hardening — RELEASED 2026-05-06
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-06 Wed 20:56]
:END:
The features in this version were originally sequenced as v0.3.x patches but represent feature-level scope. They activate the architectural advantages designed in v0.1.0–v0.3.0, harden the self-build safety boundary, and expand Passepartout's interaction surfaces beyond the terminal TUI. Each feature depends on infrastructure already in place — the wiring, the sandbox, the gate trace — and activates it.
@@ -454,7 +480,7 @@ CLOSED: [2026-05-06 Tue]
Rationale: Three architectural elements exist today in the daemon that no competitor can render — the Dispatcher gate trace, the foveal-peripheral focus map, and the rules-learned counter. All three run in pure Lisp with 0 LLM tokens. None are visible to the user. Making them visible turns Passepartout's architecture from an internal mechanism into a trust-building UX — the user sees exactly which safety gates passed, exactly what the agent is focusing on, and exactly how many rules the Dispatcher has learned from their decisions. No competitor can ship this because none has deterministic gates to trace, foveal-peripheral context to map, or a rule-synthesizing Dispatcher to count.
- Gate trace per action: extend the daemon's response plist to include ~:gate-trace~ — a list of ~(:gate <name> :result <:passed | :blocked | :approval>)~ entries produced by ~cognitive-verify~. The TUI renders each entry as a colored line below the corresponding agent message: green ~✓ Dispatcher: path allowed~, red ~✗ Dispatcher: blocked (shell safety)~, yellow ~→ HITL required: /approve HITL-ab12~. Gate trace lines are dim and collapsible (press Tab on a message to toggle trace visibility). This turns the invisible nine-vector safety gate into the user's primary trust mechanism.
- Gate trace per action: extend the daemon's response plist to include ~:gate-trace~ — a list of ~(:gate <name> :result <:passed | :blocked | :approval>)~ entries produced by ~cognitive-verify~. The TUI renders each entry as a colored line below the corresponding agent message: green ~✓ Dispatcher: path allowed~, red ~✗ Dispatcher: blocked (shell safety)~, yellow ~→ HITL required: /approve HITL-ab12~. Gate trace lines are dim and collapsible (press Tab on a message to toggle trace visibility). This turns the invisible ten-vector safety gate into the user's primary trust mechanism.
- Focus map in status bar: add a second status bar line showing ~[Focus: core-loop.lisp:think()] [Scope: passepartout] [3 related nodes]~. The daemon already tracks ~foveal-id~ and ~*scope-resolver*~ in the signal plist; the TUI reads these from the most recent response and renders them. Related node count comes from the number of objects with cosine similarity ≥ threshold in the last context assembly. This shows the user *what the agent is looking at* — the single biggest trust gap in AI agents.
- Rule counter in status bar: ~[Rules: 47]~. The Dispatcher's ~*hitl-pending*~ hash table and approved/disallowed memory-object entries provide the count — every HITL decision that produces a rule increments it. The TUI reads the count from a new daemon response field ~:rule-count~. The user watches the counter tick up as they teach the agent their preferences.
- Expanded theme: replace the 7-flat-color ~*tui-theme*~ with a 25-color layered system organized by message category (roles, content types, tool visibility, gate states, status). See the design discussion for the full color mapping. Implement a ~/theme <name>~ command that swaps between named presets (~dark~, ~light~, ~solarized~, ~gruvbox~). Theme change persists to disk and reloads on next session.
@@ -521,7 +547,14 @@ The messaging gateways and Emacs bridge expand Passepartout's interaction surfac
** v0.4.1: Design Cleanup
*** TODO Remove system-prompt-augment mechanism
*** DONE Remove system-prompt-augment mechanism
:PROPERTIES:
:ID: id-v041-augment-removal
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 13:13]
:END:
Rationale: The ~system-prompt-augment~ slot on the skill struct enables skills to inject always-on text into every LLM system prompt via a ~maphash~ over ~*skill-registry*~ in ~think()~ (core-loop-reason.lisp:83-92). Only one skill uses it — ~programming-repl~ — and it does so as a backdoor: the skill's trigger is hardcoded to ~nil~, so it never fires as an active skill. Its sole contribution is injecting a REPL-first mandate into every system prompt. The other ~24 skills have nil augments and are skipped by the ~when aug-fn~ guard. This is architecturally wrong: standing mandates (always-on rules) should live in a dedicated ~*standing-mandates*~ list, not piggyback on a skill that is never triggered. The mechanism also fuels a false claim in DESIGN_DECISIONS about 3,000-8,000 tokens of overhead — the actual overhead is ~40 tokens from the one active augment.
@@ -531,16 +564,164 @@ Rationale: The ~system-prompt-augment~ slot on the skill struct enables skills t
- Introduce ~*standing-mandates*~ (a list of function → string generators). Inject them into the IDENTITY section of the system prompt alongside ~assistant-name~. Move ~repl-mandate~ there: ~(push #'repl-mandate *standing-mandates*)~.
- Tangle the corresponding lisp/ files.
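A minimal sketch of the ~*standing-mandates*~ mechanism follows. It assumes each entry is a zero-argument function returning a string or ~nil~; the mandate wording and the helper ~render-standing-mandates~ are placeholders, not the project's actual text or API.

#+begin_src lisp
(defvar *standing-mandates* '()
  "List of zero-argument functions; each returns a mandate string or NIL.")

(defun repl-mandate ()
  ;; Placeholder wording; the real mandate text lives in the skill source.
  "Verify code in the REPL before committing it.")

(defun render-standing-mandates ()
  "Concatenate every non-NIL mandate, one per line, in registration order."
  (with-output-to-string (out)
    ;; PUSH prepends, so reverse to recover registration order.
    (dolist (fn (reverse *standing-mandates*))
      (let ((text (funcall fn)))
        (when text
          (write-line text out))))))

(push #'repl-mandate *standing-mandates*)
#+end_src

With no mandates registered, or when every generator returns ~nil~, the rendered string is empty, which matches the stated goal of contributing no tokens when nothing fires.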
*** TODO Fix false token-overhead claims in docs
*** DONE Fix false token-overhead claims in docs
:PROPERTIES:
:ID: id-v041-doc-fix
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 13:13]
:END:
Rationale: Two documents claim the ~system-prompt-augment~ mechanism can waste 3,000-8,000 tokens per think() call (DESIGN_DECISIONS line 435, ROADMAP line 504). This conflates the ~maphash~ iteration (cheap hash walk, no token cost) with the augments actually emitted (only ~programming-repl~ emits ~40 tokens; the ~when aug-fn~ guard skips the other 24 nil-augment skills). Once issue #1 above is resolved (removing the mechanism), these claims become doubly false.
- DESIGN_DECISIONS: Rewrite or remove bullet 2 under "Open Questions and Risks" (line 435). Replace with a corrected note on standing mandates via ~*standing-mandates*~.
- ROADMAP v0.5.0 intro (line 504): Remove or rewrite the claim that "system prompt overhead alone could reach 3,000-8,000 tokens per call before user input is even processed." The fixed overhead is not from skill augments — it is from the IDENTITY, TOOLS, CONTEXT, and LOGS sections, which prefix caching addresses.
*** DONE Update security vector count 9→10 in docs
:PROPERTIES:
:ID: id-v041-vector-count
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 14:40]
:END:
Rationale: The current dispatcher runs 10 deterministic checks (11 counting the warning-only REPL lint), but the README, ARCHITECTURE.org, and the ~dispatcher-check~ docstring all say 9. The actual count: 0=REPL-lint (warn only), 1=lisp-validation, 2=secret-path, 2b=self-build-core, 3=secret-content, 4=vault-secrets, 5=privacy-tags, 6=privacy-text, 7=shell-safety, 8=network-exfil, 8b=high-impact-approval. Ten blocking/approval checks. The vector 2b (self-build safety) and the new count must be reflected accurately in all documentation.
- Update README.org "What Makes Passepartout Different" → "nine" becomes "ten".
- Update docs/ARCHITECTURE.org Dispatcher Gate Stack table — add self-build entry.
- Update security-dispatcher.lisp:196 docstring to list all 11 vectors.
*** DONE Rewrite README — add "What is an agent?" section, revise claims
:PROPERTIES:
:ID: id-v041-readme-rewrite
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 14:40]
:END:
Rationale: The current README opens with competitive claims (downward cost curve, 2-3x fewer tokens) that are architecturally sound but not yet measured in the implementation. A non-engineer reader doesn't know what an AI agent is or why they'd want one. The README should lead with a short "What is an agent?" section (3-4 sentences, Wikipedia link), then "What Makes It Different" (safety, org-mode, offline — things that actually work today), then honest status of what's implemented vs planned.
- Add "What is an AI Agent?" section at top: 3-4 sentences + link to [[https://en.wikipedia.org/wiki/Software_agent][Software agent]].
- Move competitive cost/speed claims to docs/DESIGN_DECISIONS.org.
- Revise "The more you use it, the cheaper it gets" to reflect current state — architectural aspiration, not measured implementation yet.
- The Current Capabilities table and Quick Start sections stay intact.
*** DONE Register cognitive tools — 10 tools for codebase operations
:PROPERTIES:
:ID: id-v041-cognitive-tools
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 14:40]
:END:
Rationale: The ~def-cognitive-tool~ macro and ~*cognitive-tool-registry*~ are fully implemented but the registry is empty. The LLM sees "No tools registered" in its tool belt prompt. The agent can chat and run shell commands, but cannot search codebases, find files, eval code, run tests, or manipulate Org files. Ten cognitive tools bridge this gap and are prerequisites for the TDD workflow, org-mode additions, and evaluation harness in v0.5.0.
- New skill: ~programming-tools.org~ (~programming-tools.lisp~).
- Register 10 tools via ~def-cognitive-tool~:
1. ~search-files~ — regex search in file contents (uses ~cl-ppcre:scan~). Parameters: ~pattern~, ~path~ (dir), ~include~ (glob filter).
2. ~find-files~ — glob file matching (uses SBCL ~directory~). Parameters: ~pattern~, ~path~.
3. ~read-file~ — read file contents (uses ~uiop:read-file-string~). Parameters: ~filepath~.
4. ~write-file~ — write content to file. Parameters: ~filepath~, ~content~.
5. ~list-directory~ — list directory contents. Parameters: ~path~, ~pattern~ (optional).
6. ~run-shell~ — execute shell command (through existing shell actuator). Parameters: ~cmd~.
7. ~eval-form~ — evaluate Lisp expression in running image. Parameters: ~code~, ~package~ (optional).
8. ~run-tests~ — run FiveAM tests. Parameters: ~test-name~ (optional, nil runs all).
9. ~org-find-headline~ — find Org headline by ID or title. Parameters: ~id~ or ~title~, ~filepath~ (optional, searches memory store if not given).
10. ~org-modify-file~ — surgical text replacement in Org file (reuses existing ~org-modify~). Parameters: ~filepath~, ~old-text~, ~new-text~.
- Descriptive names rather than Unix command names — the LLM reads these in a prompt, not a terminal.
- Each tool is ~20-60 lines. ~search-files~ iterates directory, reads files, scans lines.
- FiveAM tests: each tool gets a test verifying operation on a temp directory.
*** DONE Enforce NO-HARDCODED-CONSTANTS programming standard
:PROPERTIES:
:ID: id-v041-no-hardcoded
:CREATED: [2026-05-07 Thu]
:END:
:LOGBOOK:
- State "DONE" from "TODO" [2026-05-07 Thu 14:40]
:END:
Rationale: Currently, several configurable values are hardcoded in source: the Dispatcher's rule threshold (not yet configurable), similarity thresholds, timeouts, shell max output. The user should control behavior through ~.env~, not by editing source code. This is rule #6 in the ~programming-standards.org~ skill. Each new TODO that introduces a configurable value must add it to ~.env.example~ with a documented default.
- Add ~DISPATCHER_RULE_THRESHOLD=3~ to ~.env.example~ (number of HITL approvals before a pattern becomes a permanent rule).
- Add ~RULES_FILE="$HOME/memex/system/rules.org"~ to ~.env.example~.
- Scan existing source for hardcoded configurable values — add to ~.env.example~ where missing.
- Any new TODO in v0.4.2+ that introduces a configurable value MUST include its ~.env.example~ entry.
** v0.4.2: Structured Output (LLM → JSON → plist)
The current ~think()~ function asks the LLM to produce raw S-expression plists. Four pieces of defensive infrastructure (~handler-case~ around ~read-from-string~, ~markdown-strip~, ~plist-keywords-normalize~, the RCE guard test) exist because LLMs cannot reliably produce balanced, keyword-prefixed plists. The fix: use the LLM API's native function calling / tool-use feature. The LLM always returns guaranteed-valid JSON. Convert to plist deterministically at the boundary.
*** TODO Implement function-calling / tool-use API in provider requests
:PROPERTIES:
:ID: id-v042-function-calling
:CREATED: [2026-05-07 Thu]
:END:
Rationale: Every major provider API (OpenAI, Anthropic, Groq, DeepSeek, OpenRouter) supports function calling. The LLM is sent tool definitions as JSON Schema. It returns ~tool_calls~ with guaranteed-valid JSON arguments. This eliminates the fragile ~read-from-string~ plist parsing entirely — the probabilistic layer speaks JSON (what it was trained on), the deterministic layer speaks plists (what the code controls). Conversion happens at a narrow, well-defined boundary.
- Modify ~provider-openai-request~ in ~system-model-provider.lisp~: add optional ~:tools~ parameter. When tools are provided, include ~"tools": [...]~ and ~"tool_choice": "auto"~ in the request body.
- Parse ~tool_calls~ from the API response: extract ~function.name~ and ~function.arguments~ (guaranteed valid JSON).
- Return a new result shape: ~(:status :success :tool-calls ((:name "shell" :arguments (:cmd "echo hello"))))~ alongside or instead of ~:content~.
- For providers that don't support function calling (local Ollama): keep ~:content~ path as fallback. LLM can still return raw text.
- FiveAM test: send a request with a mock tool definition, verify the response shape.
*** TODO Wire structured tool calls into ~think()~ — JSON→plist at boundary
:PROPERTIES:
:ID: id-v042-wire-tool-calls
:CREATED: [2026-05-07 Thu]
:END:
Rationale: Once the provider layer returns structured ~tool-calls~, the ~think()~ function must convert them to the internal plist format that ~cognitive-verify~ and ~loop-gate-act~ expect. This is a one-way, deterministic conversion at the architectural boundary.
- Add ~json-alist-to-plist~ helper in ~core-loop-reason.lisp~ or ~core-utils.lisp~: convert JSON alist (from ~cl-json:decode-json-from-string~) to keyword-prefixed plist. String keys → keywords. Nested objects recurse. JSON null → ~nil~. ~25 lines.
- In ~think()~ after ~backend-cascade-call~: if result contains ~:tool-calls~, convert each tool call's ~:arguments~ JSON to plist via ~json-alist-to-plist~, wrap in ~(:TYPE :REQUEST :PAYLOAD (:TOOL <name> :ARGS <plist> :EXPLANATION "..."))~.
- Keep the existing ~read-from-string~ path as fallback for providers that return raw text (local Ollama, streaming).
- The ~read-from-string~ path remains guarded by ~*read-eval* nil~ from v0.3.1.
- FiveAM test: JSON ~{"action":"shell","cmd":"echo hello"}~ → plist ~(:ACTION "shell" :CMD "echo hello")~ round-trip verified.
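The boundary conversion above could be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the decoder yields alists keyed by strings or keywords, and the helper names ~json-key-to-keyword~, ~json-object-p~, and ~tool-call-to-request~ are hypothetical. Only ~json-alist-to-plist~ is named in the plan.
#+begin_src lisp
(defun json-key-to-keyword (key)
  "Turn a JSON key (string or symbol) into an upper-case keyword."
  (intern (string-upcase (substitute #\- #\_ (string key))) :keyword))

(defun json-object-p (value)
  "Heuristic: a proper list whose entries are all (key . value) pairs.
An array of arrays that start with a string is indistinguishable from an
object in this representation; that ambiguity is accepted in this sketch."
  (and (consp value)
       (null (cdr (last value)))
       (every (lambda (entry)
                (and (consp entry)
                     (or (stringp (car entry)) (keywordp (car entry)))))
              value)))

(defun json-alist-to-plist (value)
  "Recursively convert a decoded JSON value into plist form."
  (cond ((json-object-p value)
         (loop for (key . val) in value
               collect (json-key-to-keyword key)
               collect (json-alist-to-plist val)))
        ;; JSON arrays stay lists; JSON null (NIL) stays NIL.
        ((listp value) (mapcar #'json-alist-to-plist value))
        (t value)))

(defun tool-call-to-request (name arguments &optional (explanation ""))
  "Wrap one decoded tool call in the internal request envelope."
  (list :type :request
        :payload (list :tool name
                       :args (json-alist-to-plist arguments)
                       :explanation explanation)))
#+end_src
Under these assumptions, ~(json-alist-to-plist '(("action" . "shell") ("cmd" . "echo hello")))~ yields ~(:ACTION "shell" :CMD "echo hello")~, which matches the round trip named in the FiveAM test.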
** v0.4.3: Shell Sandboxing & Safety Classification
The current shell safety is regex-based pattern matching — a fast pre-filter that catches obvious attacks but cannot contain sophisticated or encoded payloads. This version adds actual sandbox isolation (bubblewrap Linux namespaces) as the enforcement layer, and introduces severity classification so the rule learning system in v0.5.0 can apply different thresholds to catastrophic vs harmless operations.
*** TODO Add ~bwrap~ sandbox to shell actuator
:PROPERTIES:
:ID: id-v043-bwrap-sandbox
:CREATED: [2026-05-07 Thu]
:END:
Rationale: Regex-based shell safety catches obvious patterns (~rm -rf /~, ~dd if=~, ~mkfs.~) but is fundamentally bypassable with encoding (~base64 -d | bash~), indirection (~find / -exec rm {} \;~), or interpreter-based execution (~python3 -c "import os; os.system(...)"~). Bubblewrap (~bwrap~) is a 200KB unprivileged sandbox binary available on all modern Linux distributions. It creates transient Linux namespaces without root, without Docker, without daemon processes. Combined with the regex pre-filter, it provides defense-in-depth: the regex catches obvious attacks fast (no sandbox spawn), the sandbox contains sophisticated ones.
- In ~actuator-shell-execute~ (~system-actuator-shell.lisp~): detect if ~bwrap~ binary is available (~which bwrap~).
- If available: wrap command in =bwrap --ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /bin /bin --ro-bind /etc /etc --bind ~/memex ~/memex --bind /tmp /tmp --unshare-net --unshare-ipc timeout ...=.
- ~--unshare-net~: no network access within sandbox. Makes regex-based network exfiltration check redundant for sandboxed commands.
- ~--unshare-ipc~: no shared memory, no semaphore injection.
- If ~bwrap~ is unavailable: log a warning, fall back to current behavior (regex-only safety).
- The regex checks remain as a fast pre-filter — they run before spawning the sandbox.
- FiveAM test: command that reads ~/etc/shadow~ inside sandbox fails with permission error; same command in unsandboxed fallback is at least caught by path protection.
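As a rough sketch of the wrapping step, the argument vector could be assembled like this. The function names, the ~writable-dir~ and ~timeout-seconds~ parameters, their defaults, and the trailing ~sh -c~ are assumptions made for illustration; the plan leaves the part after ~timeout~ unspecified. The shell does not expand =~= inside an argv list, so the writable directory is passed as an absolute path.
#+begin_src lisp
(defparameter *sandbox-read-only-dirs* '("/usr" "/lib" "/bin" "/etc")
  "Directories bound read-only inside the sandbox, as listed in the plan.")

(defun bwrap-argv (command &key (writable-dir "/home/user/memex")
                                (timeout-seconds 30))
  "Return the argv list that runs COMMAND under bwrap.
WRITABLE-DIR and TIMEOUT-SECONDS are hypothetical parameters."
  (append (list "bwrap")
          (loop for dir in *sandbox-read-only-dirs*
                append (list "--ro-bind" dir dir))
          (list "--bind" writable-dir writable-dir
                "--bind" "/tmp" "/tmp"
                "--unshare-net"          ; no network inside the sandbox
                "--unshare-ipc"          ; no shared IPC namespace
                "timeout" (format nil "~D" timeout-seconds)
                "sh" "-c" command)))

(defun sandboxed-argv (command bwrap-available-p &rest options)
  "Use the sandbox when bwrap exists; otherwise fall back to a plain shell."
  (if bwrap-available-p
      (apply #'bwrap-argv command options)
      (list "sh" "-c" command)))
#+end_src
Building a list of arguments rather than one concatenated string avoids a second round of shell quoting around the sandbox flags; only the final ~command~ string is interpreted by ~sh~.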
*** TODO Shell safety severity classification system
:PROPERTIES:
:ID: id-v043-severity-classification
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The current shell safety check treats all dangerous patterns equally: ~rm -rf /~ gets the same treatment as a backtick injection in ~echo~. But not all shell operations carry the same risk. A severity classification system enables the rule learning engine (v0.5.0) to apply different thresholds: catastrophic operations always require HITL approval regardless of approval count, dangerous and moderate operations graduate to allowed after N approvals, and harmless operations are allowed by default.
- Define four severity tiers as plist keywords: ~:catastrophic~ (mkfs, dd to devices, rm -rf /, shred /dev/), ~:dangerous~ (chmod -R /, writes outside =~/memex=, curl to unwhitelisted domains), ~:moderate~ (npm install, pip install, git push, writes within =~/memex=), ~:harmless~ (echo, ls, cat, find without exec, grep).
- Extend ~*dispatcher-shell-blocked*~ entries from simple ~(NAME REGEX)~ to ~(NAME REGEX :SEVERITY <tier>)~.
- Extend ~dispatcher-check-shell-safety~ to return the severity alongside the matched pattern name.
- ~:catastrophic~ severity always triggers HITL approval, regardless of rule count. ~:harmless~ operations are allowed by default (skip HITL and rule learning).
- The severity classification is the foundation that ~dispatcher-learn~ (v0.5.0) builds on — learning only applies to ~:dangerous~ and ~:moderate~ tiers.
- FiveAM test: ~echo hello~ returns ~:harmless~ severity and passes through; ~mkfs.ext4 /dev/sda~ returns ~:catastrophic~ and always requires HITL approval.
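The extended entry shape ~(NAME REGEX :SEVERITY <tier>)~ might look like the sketch below. To stay self-contained it uses plain substring matching via ~search~ as a stand-in for the regex engine, so patterns such as ~chmod -R /~ over-match here. The table contents and the names ~*shell-severity-table*~, ~classify-shell-command~, and ~learnable-severity-p~ are hypothetical.
#+begin_src lisp
(defparameter *shell-severity-table*
  '((:mkfs        "mkfs"        :severity :catastrophic)
    (:shred-dev   "shred /dev/" :severity :catastrophic)
    (:chmod-root  "chmod -R /"  :severity :dangerous)
    (:git-push    "git push"    :severity :moderate)
    (:pip-install "pip install" :severity :moderate))
  "Illustrative entries of the form (NAME PATTERN :SEVERITY TIER).")

(defun classify-shell-command (command)
  "Return (values SEVERITY PATTERN-NAME) for the first matching entry.
Commands that match nothing are treated as :harmless in this sketch."
  (loop for (name needle . properties) in *shell-severity-table*
        when (search needle command)
          return (values (getf properties :severity) name)
        finally (return (values :harmless nil))))

(defun learnable-severity-p (severity)
  "Rule learning applies only to the two middle tiers."
  (and (member severity '(:dangerous :moderate)) t))
#+end_src
Returning the pattern name as a second value keeps existing callers that only read the first value working while giving the learning step something to key on.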
** v0.5.0: Token Economics & Prompt Efficiency
The architecture's single largest gap versus SOTA: Passepartout currently spends tokens like a research prototype. Every ~think()~ call rebuilds and retransmits the full system prompt (IDENTITY + TOOLS + CONTEXT + LOGS + SKILL_AUGMENTS) with no caching, no budget, and no incremental assembly. The foveal-peripheral model prunes memory content but doesn't touch the fixed overhead. With 20+ skills by v1.0.0, system prompt overhead alone could reach 3,000-8,000 tokens per call before user input is even processed.
The architecture's single largest gap versus SOTA: Passepartout currently spends tokens like a research prototype. Every ~think()~ call rebuilds and retransmits the full system prompt (IDENTITY + TOOLS + CONTEXT + LOGS) with no caching, no budget, and no incremental assembly. The foveal-peripheral model prunes memory content but doesn't touch the fixed overhead of IDENTITY, TOOLS, and LOGS sections, which together dominate the system prompt size. Standing mandates (~*standing-mandates*~) contribute negligible overhead (about 40 tokens when the single active mandate fires).
Competitors (Claude Code, OpenClaw, Copilot) all implement some form of prefix caching — Anthropic's API gives 90% discount on cached tokens, OpenAI caches automatically. Passepartout's prompt structure is already naturally cacheable: IDENTITY, TOOLS, and LOGS format are static across calls. This version turns that structural property into a cost advantage.
@@ -552,7 +733,7 @@ Competitors (Claude Code, OpenClaw, Copilot) all implement some form of prefix c
- Use for three purposes: context budget enforcement (reject assembly if over limit), cost estimation (tokens × provider price), and prompt optimization (measure which sections of the system prompt consume the most budget).
*** TODO Prompt prefix caching
- Split the system prompt into a static prefix (IDENTITY string, TOOLS section, LOGS format header) and a dynamic suffix (CONTEXT render, current log entries, skill augments, user prompt).
- Split the system prompt into a static prefix (IDENTITY string, TOOLS section, LOGS format header) and a dynamic suffix (CONTEXT render, current log entries, standing mandates, user prompt).
- Track a hash of the static prefix; only retransmit when it changes (skill load/unload, identity config change). On cache hit, send the cached prefix with the dynamic suffix appended.
- Implement the Anthropic prompt-caching header protocol for providers that support it (claude-3-* models, up to 90% discount on cached tokens). For OpenAI, the automatic caching layer handles prefix detection without explicit headers.
- Log cache hit/miss rate to telemetry for cost tracking.
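The hash-tracking step could be sketched as below. This assumes the static prefix is a single string and uses ~sxhash~ only because it is built in; it is not collision-resistant, and a real implementation would more plausibly reuse the SHA-256 hashing already provided through Ironclad. All names here are hypothetical.
#+begin_src lisp
(defvar *static-prefix-hash* nil
  "Hash of the last transmitted static prefix, or NIL before the first call.")

(defun build-static-prefix (identity tools logs-header)
  "Concatenate the three static sections in a fixed order."
  (format nil "~A~%~A~%~A" identity tools logs-header))

(defun prefix-cache-check (prefix)
  "Return :hit when PREFIX matches the last one seen.
Otherwise remember its hash and return :miss."
  (let ((hash (sxhash prefix)))
    (cond ((eql hash *static-prefix-hash*) :hit)
          (t (setf *static-prefix-hash* hash)
             :miss))))
#+end_src
The first call with a given prefix returns ~:miss~, repeated calls return ~:hit~, and a skill load that changes the TOOLS section produces a new ~:miss~. Those two outcomes are what the telemetry counter would record.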
@@ -609,6 +790,185 @@ Rationale: The Dispatcher currently learns from blocked and approved actions —
- Induced functions are proposed, not automatically applied. The next time a similar request arrives, the agent checks: "I have an induced function for this. Use it?" The user approves the first invocation, and subsequent invocations of the same function are automatic.
- Add FiveAM test: replay a historical interaction sequence, verify the induced function produces the same outcome.
*** TODO TDD workflow skill — language-agnostic test runner
:PROPERTIES:
:ID: id-v050-programming-tdd
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The REPL-TDD-Literate workflow described in AGENTS.md lives entirely outside the agent's cognitive loop. The agent should be able to write tests, run them, observe red/green, and iterate — without the user manually managing the cycle. This is the Lisp advantage made operational: redefine a function, re-run a single test, get results in <100ms. Claude Code cannot do this — it has no REPL. The skill is language-agnostic: it dispatches to the REPL skill for Lisp, shells out to ~pytest~ for Python, ~go test~ for Go, etc.
- New skill: ~programming-tdd.org~. Depends on REPL skill for Lisp, falls back to shell for other languages.
- Cognitive tools: ~deftest~ (define a test), ~run-test~ (run a specific test), ~list-tests~ (list all defined tests).
- ~run-test~ dispatches on ~:language~ parameter:
- ~:lisp~ → ~(fiveam:run 'test-name)~ via REPL eval
- ~:python~ → shell ~python3 -m pytest test_file.py::test_name~
- ~:go~ → shell ~go test -run TestName ./...~
- ~:rust~ → shell ~cargo test test_name~
- ~:default~ → shell command template from env ~TEST_RUNNER_<LANG>~
- The TDD loop: write test → ~run-test~ (expect RED) → write implementation → ~run-test~ (expect GREEN) → report.
- ~#+DEPENDS_ON: org-skill-utils-repl~ for Lisp TDD; no dependency for other languages (shell fallback).
- FiveAM tests: ~run-test~ on a known-failing test returns RED status; ~run-test~ on a known-passing test returns GREEN.
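The ~:language~ dispatch could be sketched as a pure function that only decides what to run and leaves execution to the REPL skill or the shell actuator. The function name, the returned plist shape, and the ~:getenv~ parameter (a stand-in for a real environment lookup such as ~uiop:getenv~) are assumptions for illustration.
#+begin_src lisp
(defun test-runner-command (language test-name
                            &key file (getenv (constantly nil)))
  "Return a plist describing how to run TEST-NAME for LANGUAGE.
GETENV is a function from variable name to string or NIL."
  (flet ((shell (command) (list :via :shell :command command)))
    (case language
      (:lisp (list :via :repl
                   :form (format nil "(fiveam:run '~A)" test-name)))
      (:python (shell (format nil "python3 -m pytest ~A::~A" file test-name)))
      (:go (shell (format nil "go test -run ~A ./..." test-name)))
      (:rust (shell (format nil "cargo test ~A" test-name)))
      ;; Any other language falls back to TEST_RUNNER_<LANG>.
      (t (let* ((variable (format nil "TEST_RUNNER_~A"
                                  (string-upcase (symbol-name language))))
                (template (funcall getenv variable)))
           (if template
               (shell (format nil "~A ~A" template test-name))
               (error "No test runner for ~S; set ~A" language variable)))))))
#+end_src
Keeping the dispatch free of side effects means the RED/GREEN tests for ~run-test~ can check the chosen command without spawning any process.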
*** TODO Expand literate programming skill — persist after TDD
:PROPERTIES:
:ID: id-v050-literate-persist
:CREATED: [2026-05-07 Thu]
:END:
Rationale: After the TDD loop confirms green, the agent must persist the working code into its Org source file and tangle to ~.lisp~. Currently ~self-improve-edit~ can do surgical text replacement but doesn't integrate with the TDD confirmation step. The literate skill should provide a ~persist-verified-block~ tool that takes TDD-confirmed code and writes it to the appropriate ~#+begin_src lisp~ block.
- Add ~persist-verified-block~ cognitive tool: accepts ~filepath~, ~block-name~, ~code~, ~test-result~. Only writes if ~test-result~ is GREEN.
- Verifies the written Org file passes ~literate-block-balance-check~ before tangling.
- Tangles via existing ~org-tangle-file~.
- FiveAM test: persist a verified block, verify it appears in the tangled ~.lisp~ file, verify the Org file passes balance check.
*** TODO Org-mode productivity additions — agenda, clock, checklist, table
:PROPERTIES:
:ID: id-v050-org-additions
:CREATED: [2026-05-07 Thu]
:END:
Rationale: Passepartout bets on Org-mode as the universal format for human and machine. But current Org support is thin: headlines, tags, property drawers, source blocks. Missing are the features that make Org a productivity tool: agenda views, clock-in/out, checklists, tables. Adding these turns the agent from a chat partner into a productivity assistant — it can answer "what should I work on today?" with 0 LLM tokens.
- Extend ~programming-org.lisp~ (~programming-org.org~) with five new functions:
  1. ~org-agenda-today~: walk memory (or file tree) for headlines with ~SCHEDULED~ ≤ today or ~DEADLINE~ within N days. Returns list of memory-objects. About 60 lines.
  2. ~org-clock-in~ / ~org-clock-out~: set ~:CLOCK-START~ property; on clock-out, compute duration, append to ~:LOGBOOK:~ drawer. About 80 lines.
  3. ~org-checklist-toggle~: parse ~- [ ]~ / ~- [X]~ checkboxes in headline content, toggle state, return completed/total count. About 50 lines.
  4. ~org-table-parse~ / ~org-table-render~: parse ~| a | b |~ tables into list-of-lists, render back. About 70 lines.
  5. ~org-agenda-view~: compose agenda + clock state + TODO headlines into single Org-formatted string. Used by ~/agenda~ TUI command. About 50 lines.
- ~org-agenda-today~ and ~org-agenda-view~ operate on memory store (zero file I/O, zero LLM tokens).
- FiveAM test for each function.
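Of the five functions, the table pair is the easiest to illustrate in isolation. The sketch below is one possible shape, not the planned implementation: it skips separator rows such as ~|---+---|~, does no column alignment on render, and the helper names ~split-on~ and ~org-table-parse-row~ are hypothetical.
#+begin_src lisp
(defun split-on (char string)
  "Split STRING at every CHAR, keeping empty fields."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun org-table-parse-row (line)
  "Parse one table row of the form | a | b | into a list of cell strings."
  (let ((cells (split-on #\| (string-trim " " line))))
    ;; The leading and trailing pipes produce empty first and last fields.
    (mapcar (lambda (cell) (string-trim " " cell))
            (butlast (rest cells)))))

(defun org-table-parse (text)
  "Parse an Org table into a list of rows, skipping separator rows."
  (loop for line in (split-on #\Newline text)
        for trimmed = (string-trim " " line)
        when (and (> (length trimmed) 1)
                  (char= (char trimmed 0) #\|)
                  (char/= (char trimmed 1) #\-))
          collect (org-table-parse-row trimmed)))

(defun org-table-render (rows)
  "Render a list of rows back into Org table syntax, one row per line."
  (format nil "~{| ~{~A~^ | ~} |~^~%~}" rows))
#+end_src
Parsing and then rendering a table without separator rows returns the same cell contents, which gives the FiveAM test a simple round-trip property to check.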
*** TODO Vault encryption — Ironclad AES + PBKDF2
:PROPERTIES:
:ID: id-v050-vault-encryption
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The vault (~*VAULT-MEMORY*~) stores API keys and credentials in plaintext in a hash table. ~VAULT-MASK-STRING~ always returns ~"[MASKED]"~ ignoring input — it's a stub. Ironclad is already a dependency (used for SHA-256 in Merkle hashing) and provides AES-256-GCM, ChaCha20, and PBKDF2. Encryption makes the vault go from security theater to actual security.
- Add ~vault-encrypt~ / ~vault-decrypt~ using Ironclad AES-256-GCM. Master key derived via PBKDF2 from ~VAULT_MASTER_PASSPHRASE~ env var or =~/.config/passepartout/.key= file.
- Store ciphertext instead of plaintext in ~*VAULT-MEMORY*~.
- ~VAULT-MASK-STRING~ actually masks (replaces all chars with ~*~, preserving length).
- ~dispatcher-vault-scan~ searches plaintext after decrypt (still catches leaks before they reach the LLM).
- FiveAM test: round-trip encrypt/decrypt; wrong passphrase fails; masked string has same length as original.
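The masking and leak-scan behavior can be sketched without any cryptography. The encryption itself is deliberately left out here because it depends on Ironclad's API. The lowercase function names and the ~replace-all~ helper are hypothetical, and ~vault-redact~ only approximates what ~dispatcher-vault-scan~ is described as doing.
#+begin_src lisp
(defun vault-mask-string (secret)
  "Return a string of asterisks with the same length as SECRET."
  (make-string (length secret) :initial-element #\*))

(defun replace-all (text needle replacement)
  "Replace every occurrence of NEEDLE in TEXT with REPLACEMENT."
  (with-output-to-string (out)
    (loop with start = 0
          for pos = (search needle text :start2 start)
          do (write-string text out :start start :end (or pos (length text)))
          while pos
          do (write-string replacement out)
             (setf start (+ pos (length needle))))))

(defun vault-redact (text secrets)
  "Mask each decrypted secret wherever it appears in TEXT."
  (dolist (secret secrets text)
    ;; An empty secret would match everywhere, so skip it.
    (when (plusp (length secret))
      (setf text (replace-all text secret (vault-mask-string secret))))))
#+end_src
Preserving length in the mask lets the user see that a value of plausible size was redacted, at the cost of revealing how long the secret is.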
*** TODO Deterministic gate growth — ~dispatcher-learn~ + ~rules.org~
:PROPERTIES:
:ID: id-v050-dispatcher-learn
:CREATED: [2026-05-07 Thu]
:END:
Rationale: This is the "cheaper over time" claim made operational. Every HITL approval or denial becomes data. After N approvals of the same pattern, it becomes a permanent deterministic rule. The LLM no longer asks permission. 0 LLM tokens spent on what used to be a human decision. The user watches the rule counter tick up as they teach the agent.
- ~dispatcher-learn~ function in ~security-dispatcher.lisp~: called from ~hitl-approve~ and ~hitl-deny~. Extracts pattern (~:tool~ + ~:filepath~ glob + ~:cmd~ pattern). Tracks count per pattern in memory store.
- When count reaches ~DISPATCHER_RULE_THRESHOLD~ (from ~.env~, default 3), writes a rule to ~RULES_FILE~ (=~/memex/system/rules.org=).
- Each rule is an Org headline with ~:EXPLANATION:~ property explaining what the rule does and why it was created.
- ~dispatcher-check~ consults ~RULES_FILE~ before its blocking vectors — allowed rules pass through, blocked rules are denied.
- Rules are loaded from ~rules.org~ at daemon startup (survive restarts).
- ~dispatcher-severity-allowed-p~: uses severity classification from v0.4.3 — ~:catastrophic~ always HITL regardless of rule count. ~:harmless~ always allowed.
- Severity thresholds: ~:dangerous~ = 5 approvals, ~:moderate~ = 3 approvals (configurable via ~.env~).
- ~DISPATCHER_RULE_THRESHOLD~ and ~RULES_FILE~ env vars already added in v0.4.1's NO-HARDCODED-CONSTANTS TODO.
- ~DISPATCHER_SEVERITY_DANGEROUS_THRESHOLD~ and ~DISPATCHER_SEVERITY_MODERATE_THRESHOLD~ in ~.env.example~.
- FiveAM test: approve same pattern 3 times → rule appears in ~rules.org~ → pattern passes through ~dispatcher-check~ without approval.
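The counting and graduation logic might be sketched as follows. This is an in-memory illustration only: the real plan keeps counts in the memory store and writes rules to ~rules.org~. The names ~*severity-thresholds*~, ~*approval-counts*~, and ~record-approval~, and the returned status keywords, are hypothetical.
#+begin_src lisp
(defparameter *severity-thresholds* '(:dangerous 5 :moderate 3)
  "Approvals needed before a pattern graduates, per severity tier.")

(defvar *approval-counts* (make-hash-table :test #'equal)
  "Approval count per pattern; patterns are compared with EQUAL.")

(defun record-approval (pattern severity)
  "Record one HITL approval and report what should happen next."
  (case severity
    (:catastrophic :always-hitl)      ; never graduates
    (:harmless :always-allowed)       ; never needs learning
    (t (let ((count (incf (gethash pattern *approval-counts* 0)))
             (threshold (getf *severity-thresholds* severity 3)))
         (if (>= count threshold)
             :promote-to-rule
             :keep-asking)))))
#+end_src
With the default thresholds, the third approval of a ~:moderate~ pattern returns ~:promote-to-rule~, while a ~:dangerous~ pattern needs five.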
*** TODO Rule visibility — TUI ~/rules~ commands
:PROPERTIES:
:ID: id-v050-rule-visibility
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The user must know what rules the Dispatcher has learned and must be able to undo bad learning. The rules live in =~/memex/system/rules.org= (editable in any text editor), but the TUI should provide live access.
- TUI commands:
- ~/rules~ — list all rules sorted by recency (most recent first). Shows pattern, decision (allowed/blocked), severity, approval count, explanation.
- ~/rules blocked~ — show only blocked patterns.
- ~/rules allowed~ — show only allowed patterns.
- ~/rule delete <id>~ — remove a rule (undoes the learning). Deletes the headline from ~rules.org~.
- ~/rule allow <id>~ — flip a blocked rule to allowed (user overrides the learning).
- On rule creation, daemon sends ~:rule-created~ event. TUI adds system message: =[Rules: 47 → 48] New rule: shell commands targeting ~/memex/projects/* are now allowed. /rule delete rule-48 to undo.=
- Rules are visible in the TUI status bar via the rule counter (already implemented in v0.4.0 gate trace).
- FiveAM test: ~/rules~ returns expected rules; ~/rule delete~ removes a rule and it no longer passes through ~dispatcher-check~.
*** TODO Merkle learning — memory-find-similar, outcome recording
:PROPERTIES:
:ID: id-v050-merkle-learning
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The Merkle tree provides content-addressed storage. Combined with embedding vectors (populated at ingest time since v0.4.0), it can answer "what happened the last 3 times I asked something like this?" This is retrieval-augmented generation from the user's own history — the agent learns what approaches succeeded and failed, not from the LLM's training data but from the user's actual sessions.
- ~memory-find-similar~ in ~core-memory.lisp~: given a vector, return N memory objects with highest cosine similarity. Uses ~memory-object-vector~ (already populated via ~ingest-ast~ → ~embeddings-compute~ since v0.4.0). About 30 lines.
- ~memory-outcome-record~: store an outcome (success/failure plist) against a signal. Keyed by Merkle hash of the signal. About 25 lines.
- ~memory-find-outcomes~: given a signal (current context), find similar past signals and their outcomes. Uses ~memory-find-similar~ on the signal's foveal vector. Returns ranked list of past approaches with success/failure labels. About 40 lines.
- Outcome data feeds into ~context-assemble-global-awareness~: when the foveal node has similar past interactions, include them in the context as "Historical: last 3 times you asked this, approach X succeeded, Y failed."
- FiveAM test: record 3 outcomes for similar signals, verify ~memory-find-outcomes~ returns them ranked by similarity.
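The similarity ranking at the core of ~memory-find-similar~ can be sketched with plain vectors. This assumes candidates are given as an alist of ~(id . vector)~ pairs rather than memory objects, and the names ~cosine-similarity~ and ~find-similar~ are hypothetical stand-ins.
#+begin_src lisp
(defun cosine-similarity (a b)
  "Cosine of the angle between vectors A and B; 0.0 if either is all zeros."
  (let ((dot 0.0) (norm-a 0.0) (norm-b 0.0))
    (map nil (lambda (x y)
               (incf dot (* x y))
               (incf norm-a (* x x))
               (incf norm-b (* y y)))
         a b)
    (if (or (zerop norm-a) (zerop norm-b))
        0.0
        (/ dot (sqrt (* norm-a norm-b))))))

(defun find-similar (query candidates n)
  "Return up to N (id . score) pairs from CANDIDATES, best match first.
CANDIDATES is an alist of (id . vector)."
  (let ((scored (mapcar (lambda (candidate)
                          (cons (car candidate)
                                (cosine-similarity query (cdr candidate))))
                        candidates)))
    (subseq (sort scored #'> :key #'cdr)
            0 (min n (length scored)))))
#+end_src
A linear scan like this is quadratic in nothing and needs no index, which is usually adequate for a personal memory store; an approximate index would only matter at much larger scale.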
*** TODO Merkle learning documentation in Design Decisions
:PROPERTIES:
:ID: id-v050-merkle-docs
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The Merkle tree was designed for integrity, not learning. Its second life as a learning substrate — content-addressed history + vector similarity → retrospective knowledge — deserves architectural documentation explaining the data flow, the similarity gating, and how it feeds the "cheaper over time" thesis.
- New section in ~docs/DESIGN_DECISIONS.org~: "The Merkle Tree as Learning Substrate."
- Explain: Merkle hash → content identity. Memory-object-vector → content similarity. Together → "find what worked last time."
- Include data flow diagram (ASCII art) showing ingest → embed → query → retrieve → inform cycle.
- Distinguish from symbolic induction (v0.5.0): Merkle learning answers "what happened last time?" Symbolic induction answers "can I automate this next time?"
*** TODO Internal evaluation harness — ~deftask~, ~run-eval-suite~
:PROPERTIES:
:ID: id-v050-eval-harness
:CREATED: [2026-05-07 Thu]
:END:
Rationale: Without an evaluation harness, there is no way to know if the agent's capabilities improve or regress across releases. SWE-bench (v0.9.0) measures competitive ranking against other agents. The internal suite measures regression detection — it catches when v0.5.1 breaks something v0.5.0 could do. The suite starts with 10 tasks and grows with the codebase.
- New skill: ~system-evaluation.org~ (~system-evaluation.lisp~).
- ~deftask~ macro: define an eval task with ~:setup~ (create test environment), ~:prompt~ (what to ask the agent), ~:verify~ (function that checks the output), ~:teardown~ (cleanup). Similar to ~defskill~ but for agent capabilities, not code.
- ~run-eval-task~: inject ~:prompt~ as ~:user-input~ signal via ~stimulus-inject~, wait for completion (poll ~*memory-store*~ or signal status), run ~:verify~ on the result, return ~(:passed)~ or ~(:failed :reason ...)~.
- ~run-eval-suite~: run all registered eval tasks, produce score (pass count / total), per-task diagnostics, summary.
- ~eval-score~: return current score as a number. Logged to telemetry.
- Initial 10 tasks covering: find TODOs, create Org note, modify file, search codebase, run shell command (safe), list projects, query memory, find definition, run test, set TODO state.
- Task suite grows with codebase: every bug fix adds a regression task. Every new feature adds a capability task.
- FiveAM test: a task that should pass passes; a task that should fail fails with the expected reason.
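A minimal sketch of the harness follows, under the assumption that the agent can be abstracted as a function from prompt to output. The ~agent-fn~ parameter is a hypothetical stand-in for injecting a signal through ~stimulus-inject~ and polling for completion, and the suite result shape is likewise illustrative.
#+begin_src lisp
(defvar *eval-tasks* (make-hash-table :test #'eq)
  "Registry of eval tasks, keyed by task name.")

(defmacro deftask (name &key setup prompt verify teardown)
  "Register an eval task. SETUP and TEARDOWN are optional thunks."
  `(setf (gethash ',name *eval-tasks*)
         (list :name ',name :setup ,setup :prompt ,prompt
               :verify ,verify :teardown ,teardown)))

(defun run-eval-task (task agent-fn)
  "Run one task; return (:passed) or (:failed :reason ...)."
  (let ((setup (getf task :setup))
        (teardown (getf task :teardown)))
    (when setup (funcall setup))
    ;; Teardown runs even if the agent or the verifier signals an error.
    (unwind-protect
         (let ((output (funcall agent-fn (getf task :prompt))))
           (if (funcall (getf task :verify) output)
               (list :passed)
               (list :failed
                     :reason (format nil "verify rejected ~S" output))))
      (when teardown (funcall teardown)))))

(defun run-eval-suite (agent-fn)
  "Run every registered task and return pass and total counts."
  (let ((passed 0) (total 0))
    (maphash (lambda (name task)
               (declare (ignore name))
               (incf total)
               (when (eq (first (run-eval-task task agent-fn)) :passed)
                 (incf passed)))
             *eval-tasks*)
    (list :passed passed :total total)))
#+end_src
A task would then be declared as ~(deftask say-hi :prompt "say hi" :verify (lambda (output) (search "hi" output)))~ and the whole suite run with ~(run-eval-suite #'some-agent-function)~, where ~some-agent-function~ is whatever adapter wraps the real loop.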
*** TODO Evaluation workflow in AGENTS.md
:PROPERTIES:
:ID: id-v050-eval-agentsmd
:CREATED: [2026-05-07 Thu]
:END:
Rationale: The AGENTS.md "Development Workflow" section describes how to develop code with REPL → TDD → Literate. A parallel "Evaluation Workflow" section should describe how to verify agent capabilities with eval tasks. Together they form the full quality cycle: TDD verifies the code the agent writes, eval verifies the agent itself.
- New section in AGENTS.md: "## Evaluation Workflow (Must Follow)".
- Mirror the Development Workflow structure: define task → prove BLANK (fresh agent fails) → implement capability → prove COMPLETE → track regression.
- Include ~deftask~ example and ~run-eval-suite~ usage.
- Rule: every new cognitive tool or skill MUST include an eval task before shipping.
*** TODO TDD + Eval + Merkle learning integration into ~.env.example~
:PROPERTIES:
:ID: id-v050-env-vars
:CREATED: [2026-05-07 Thu]
:END:
Rationale: All new configurable values from v0.5.0 must be documented in ~.env.example~ per the NO-HARDCODED-CONSTANTS standard (v0.4.1). This task ensures no env var is forgotten.
- Add to ~.env.example~:
- ~DISPATCHER_RULE_THRESHOLD=3~ (if not already added in v0.4.1 cleanup)
- ~RULES_FILE="$HOME/memex/system/rules.org"~
- ~DISPATCHER_SEVERITY_DANGEROUS_THRESHOLD=5~
- ~DISPATCHER_SEVERITY_MODERATE_THRESHOLD=3~
  - ~VAULT_MASTER_PASSPHRASE=""~ (empty = prompt on startup, or read from the =~/.config/passepartout/.key= file)
- ~EVAL_TASKS_DIR="$HOME/memex/system/eval/"~
- ~EVAL_TIMEOUT=120~ (seconds before a task is considered failed)
- ~TEST_RUNNER_PYTHON="python3 -m pytest"~
- ~TEST_RUNNER_GO="go test -run"~
- ~TEST_RUNNER_RUST="cargo test"~
- Document each with a comment explaining its purpose and default.
*** Competitive Advantage Analysis — v0.5.0 Summary
Token economics is the dimension where the architecture's theoretical advantage becomes operationally real. The foveal-peripheral model and deterministic gates reduce the tokens *needed* per task; prompt caching and incremental assembly reduce the tokens *spent* per task. Combined, the 2-3x coding savings and 13-24x knowledge management savings in the DESIGN_DECISIONS token analysis become achievable rather than aspirational. Symbolic induction extends this downward cost curve into new territory: the agent doesn't just block fewer dangerous actions, it automates away entire categories of LLM calls by learning reusable Lisp functions from successful interaction patterns.
@@ -820,7 +1180,7 @@ v1.0.0 is not a feature release — it is a verification release. Every feature
| Planning | Task tree DAG with terminal states | Multi-step integration tests |
| Tool ecosystem | 15+ MCP tools + native shell + git | MCP protocol compliance tests |
| Context window | Semantic search + foveal-peripheral + caching| Token budget vs competitor audit |
| Safety | 9-vector Dispatcher + policy + permissions | Chaos testing (v0.9.0) |
| Safety | 10-vector Dispatcher + policy + permissions | Chaos testing (v0.9.0) |
| Multi-step tasks | Task trees with terminal states | SWE-bench score (v0.9.0 harness) |
| Code editing | Full file read/write via MCP + Org | SWE-bench-verified subset |
| Memory | Vector recall + Merkle integrity + MVCC | Concurrency stress test (v0.6.1) |