Split design-decisions.org into design/ subfolder (10 files, one per section)

2026-06-04 20:16:02 +00:00
parent d938f353e4
commit 2ee60b5b4a
14 changed files with 1311 additions and 1143 deletions
--- a/projects/passepartout/architecture/design/engineering-infrastructure.org
+++ b/projects/passepartout/architecture/design/engineering-infrastructure.org
@@ -0,0 +1,231 @@
+---
+title: Engineering Infrastructure
+type: reference
+tags: :passepartout:architecture:
+---
+
+* Engineering Infrastructure
+
+** The REPL as Cognitive Substrate
+:PROPERTIES:
+:ID:       1ede13cb-05c0-4b1a-a873-98cf89bade81
+:ID:       design-repl-cognition
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+A REPL — Read, Eval, Print, Loop — is an interactive programming environment that reads an expression, evaluates it, prints the result, and loops back to read the next expression. It is the opposite of batch processing: where batch compiles and runs a program in one shot, a REPL works one expression at a time, with each evaluation building on all previous ones. The state accumulates. The session is the program.
+
+In Lisp, the REPL is not a debugging tool bolted onto the language — it is the natural mode of interaction. The running image is the environment. When you evaluate =(+ 2 2)=, the result =4= is printed, and you remain in the same image where =+= is defined, where previous definitions persist, where the next expression can reference anything that came before. There is no separation between development and execution. The REPL is not a simulation of the program — it is the program running.
+
+Passepartout uses the REPL in this spirit, but elevated: it is not merely a tool for writing code, it is the mechanism by which the agent interacts with its own cognition — a loop that mirrors the perceive-reason-act metabolic cycle at the implementation level.
+
+In the agent's cognitive architecture, the REPL serves three functions that are difficult or impossible to achieve through batch processing or stateless API calls.
+
+First, the REPL enables verification before commitment. When the agent generates code, it does not write and forget — it evaluates in a running image, observes the result, iterates if incorrect. The feedback loop is tight: the time between writing and seeing the error is measured in milliseconds, not in the round-trip to a language server or a batch compiler. This is the "verification over hallucination" principle made concrete: the agent tests what it writes before claiming it works.
+
+Second, the REPL enables stateful exploration. The agent can define a variable, inspect it, modify it, redefine it. The exploration accumulates state across interactions. This is not a debugging session — it is the agent thinking with its hands, working through a problem by trying variations and observing outcomes, keeping the successful ones and discarding the failures.
+
+Third, the REPL is a shared substrate. When the agent evaluates code, that code runs in the same image as the agent's own cognition. There is no process boundary between the agent and its tools. The REPL is not a subprocess the agent controls — it is a direct interface to the agent's own nervous system.
+
+This is why the REPL becomes more important as the system matures. In early versions, it is a development tool. In v0.6.0 and beyond, it becomes a cognitive tool: the agent explores hypotheses by evaluating them, verifies the output of sub-agents by inspecting live state, and tests modifications before committing them to the knowledge graph.
+
+** The Cybernetic Loop: Why the Metabolic Pipeline Works
+:PROPERTIES:
+:ID:       design-cybernetic-loop
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+The Perceive → Reason → Act cycle is not a software architecture pattern. It is a cybernetic feedback loop — the mechanism by which a system steers itself toward a goal in a changing environment.
+
+Norbert Wiener defined cybernetics in 1948 as "control and communication in the animal and the machine." The metabolic pipeline implements this precisely: Perceive is the sensor (reading the environment), Reason is the controller (evaluating against goals and constraints), Act is the actuator (modifying the environment), and the tool-output feedback signal closes the loop (reading the effect of the action and adjusting the next perception).
+
+The Dispatcher gate stack is the negative feedback governor. When the LLM proposes an action that would violate an invariant, the Dispatcher blocks it and feeds the rejection trace back to the LLM for self-correction. This is Ross Ashby's homeostasis — the system maintains its internal stability by correcting deviations from its set point (the safety invariants). Without this negative feedback, the probabilistic engine would drift into hallucinated proposals that become progressively less grounded. The Dispatcher constrains it to the domain of safe, verifiable actions.
+
+The self-editing capability is second-order cybernetics — autopoiesis, the capacity of a system to create and maintain itself. Humberto Maturana and Francisco Varela defined this as the hallmark of living systems. When the agent detects an error, locates the faulty function, generates a corrected version, and hot-reloads it into the running image without restarting, it is modifying its own architecture while continuing to operate. Passepartout achieves this through Lisp's homoiconicity — code is data, and the running image is the environment.
+
+This framing matters for two reasons. First, it places Passepartout in a lineage that predates and outlasts the current "LLM with tools" paradigm. The cybernetic principles of feedback, homeostasis, and autopoiesis are independent of any specific model architecture. They work whether the perceptual engine is an LLM, a vision model, or a symbolic parser. Second, it explains why the architecture gets more reliable over time — cybernetic systems improve through accumulated negative feedback corrections, not through better training data. Every blocked action is a correction. Every approved exception is a refined set point. The system converges on stability through use.
+
+** Observability and the Thought Trace
+:PROPERTIES:
+:ID:       design-observability
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+When a human asks why the system made a decision, the answer must be findable. In most AI systems, the reasoning is ephemeral — it exists in the model's activations and disappears when the session ends. In Passepartout, every significant cognitive event is written to an Org buffer as it happens.
+
+The thought trace is the agent's journal, written in parallel with its reasoning. When the probabilistic engine generates a proposal, the trace records the input, the prompt, and the raw output. When the deterministic engine evaluates it, the trace records which rules were checked, which passed, which failed, and why. When an action is executed, the trace records the timestamp, the user who approved it (if human-in-the-loop), and the outcome.
+
+This is not logging in the traditional sense. Logs are forensically useful but are written in a machine format optimized for storage, not for human reading. The thought trace is written in Org-mode: headlines for major events, property drawers for structured data, tags for categorization. The human can open the trace in a text editor and navigate it like any other Org file. They can search for a specific decision, filter by time range, find all actions blocked by a specific rule, or see the complete trajectory of a multi-step task.
+
+The trace becomes the foundation for the Dispatcher's learning. Every blocked action is in the trace. Every approved exception is in the trace. The human-in-the-loop decisions are in the trace. The system does not need to reconstruct what happened — it reads what happened from the trace it wrote.
+
+Without observability, the system is a black box that happens to produce correct outputs sometimes. With observability, the system is auditable. The human can see why a decision was made, identify where the reasoning failed, and course-correct the system or its own behavior accordingly.
+
+** Literate Programming as Discipline
+:PROPERTIES:
+:ID:       design-literate-programming
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+The decision to use Org-mode as the source of truth for code, not just documentation, is not a ceremonial preference. It is a constraint mechanism that enforces better engineering habits at the cost of convenience.
+
+The traditional development workflow is: write code, write comments, commit. The literate programming workflow is: write prose, write code, commit the Org. The order matters. The prose must come first not because of style guidelines but because the act of explaining what a function does before writing it forces clarity of thought that editing code directly does not.
+
+When you must write a paragraph describing what a function does before you write the function, you discover the cases you have not considered. You find the edge conditions that are ambiguous. You realize that the function's name does not match its behavior, or that its behavior does not match your intent. The friction is not a bug — it is the mechanism by which thinking is enforced.
+
+The one-function-per-block rule enforces granularity. A function that cannot be explained in a paragraph is a function that is doing too much. The block boundary is not aesthetic — it is architectural. It prevents the drift toward monolithic functions that accumulate responsibilities over time and become untestable, unmaintainable, and incomprehensible.
+
+The tangle step enforces source-of-truth discipline. The .lisp file is generated from the Org file. This means the Org file cannot drift from the implementation. If the implementation changes, the Org must be updated to match. If the Org describes behavior that the implementation does not perform, the tangle produces code that does not match the Org description. Either way, inconsistency is visible and recoverable.
+
+The evaluation gate enforces correctness. Every block can be evaluated independently in a running Lisp image. This means syntax errors are caught at authorship time, not at integration time. The function that compiles in isolation but fails in context is the function whose context dependencies were never made explicit. The evaluation gate forces those dependencies to surface.
+
+Together, these constraints create a development experience that is slower in the small and faster in the large. Writing a new function takes longer because you must explain it. But debugging, maintaining, and extending the codebase is faster because every function has a human-readable explanation of its intent, every function is testable in isolation, and every function's source is always synchronized with its documentation.
+
+The literate programming discipline is not about producing documentation. It is about producing code whose correctness has been verified by the act of explaining it.
+
+** The Evaluation Harness
+:PROPERTIES:
+:ID:       design-evaluation-harness
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+SOTA parity is meaningless without measurement. A system that claims to match commercial agents must demonstrate it through reproducible benchmarks, not through feature checklists. The evaluation harness is the apparatus by which Passepartout proves its capabilities.
+
+The industry standard for coding agents is SWE-bench: a corpus of GitHub issues paired with pull requests. The agent is given an issue, must understand the codebase, write a fix, and submit. Success is measured by whether the submitted PR passes the existing test suite. This tests the full chain: understanding, planning, code generation, verification, and multi-step reasoning.
+
+Passepartout implements a native Lisp harness for this. A background thread clones repositories, feeds issues into the cognitive loop, tracks the resolution trajectory as an Org-mode headline tree, and scores success by test outcomes. The trajectory is persisted: when a resolution fails, the system can inspect where in the chain the reasoning broke down. The headline tree records the agent's thoughts at each step, making the failure auditable and the debugging human-assisted.
+
+Beyond SWE-bench, the harness includes chaos testing. The system is subjected to resource starvation, concurrent load, and adversarial input. The deterministic engine must maintain safety invariants under pressure. The symbolic verifier must not deadlock or livelock. The probabilistic engine must degrade gracefully.
+
+The harness also supports regression testing on the skill set. Every skill is tested against a suite of known inputs and expected outputs. When a modification is proposed to any skill — whether through manual editing or the agent's own self-modification — the test suite runs first. A skill that fails its tests is rejected before it can propagate to the running image. This is not a convenience — it is the mechanism by which self-modification remains safe. The agent can propose changes, but the harness verifies them before the changes take effect.
+
+** The MCP Strategy
+:PROPERTIES:
+:ID:       design-mcp-strategy
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+The Model Context Protocol (MCP) is a standard for connecting AI systems to external tools and data sources. It defines how a client requests tools from a server, how the server exposes its capabilities, and how the client invokes them. The ecosystem is growing: MCP servers exist for GitHub, Slack, Postgres, filesystem access, and much more.
+
+Passepartout connects to this ecosystem, but not by becoming a Node.js runtime. The architecture is: external MCP servers communicate via stdio or SSE to a Lisp-native MCP client that runs in the same image as the agent. The client is pure Common Lisp — it parses the JSON-RPC messages, invokes the tools, and presents results to the agent as Lisp data structures. There is no serialization overhead between the agent and the MCP layer, no process boundary, no impedance mismatch.
+
+When the agent calls a tool via MCP, it receives a plist with the tool name, arguments, and result. The result is immediately usable by the agent's symbolic engine. When the agent generates a file, it can be written to the filesystem through an MCP filesystem server. When the agent needs to send a message, it can use an MCP Slack server. The agent does not need to know that these are MCP interactions — it sees only the plists that flow through its cognitive architecture.
+
+The alternative is to build MCP wrappers in Python or TypeScript and bridge to Lisp via subprocess. This introduces latency, serialization costs, and a maintenance burden. Passepartout's native client is smaller, faster, and more maintainable. The MCP client is a skill, not a core component. It can be reloaded, replaced, or removed without restarting the agent.
+
+** Local-First Architecture
+:PROPERTIES:
+:ID:       design-local-first
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+Passepartout is designed to run on the user's machine, on their hardware, with their data, without requiring an internet connection. This is not a deployment option — it is an architectural commitment. The system must be able to reason, plan, and act using only the resources available locally.
+
+The motivation is not merely philosophical. Cloud-based AI agents are economically incentivized to collect data, to train on user interactions, and to build lock-in through proprietary formats and network effects. When the agent runs locally, the user owns the hardware, owns the data, and can terminate the process without asking permission. There is no vendor that can change terms, no service that can go offline, no model that can be updated without consent.
+
+Technically, local-first means several things. The LLM must be able to run on local hardware. Passepartout supports Ollama as a provider, which runs quantized models on CPU and GPU without requiring an external API. The vector database must be local. Passepartout uses its own org-object store, which is a folder of Org files that the agent already owns. There is no ChromaDB or Qdrant to install, no cloud vector service to authenticate with.
+
+The symbolic engine does not require a network connection. The Prolog/Datalog reasoner that verifies neural proposals runs entirely in the Lisp image. The Dispatcher's rule synthesis does not call an external service. The agent can operate in a disconnected environment indefinitely, resuming full capability when connectivity is restored.
+
+This does not mean Passepartout refuses to use cloud services when available and appropriate. It means cloud services are optional enhancements, not architectural requirements. The core is local. The user can choose to add cloud LLM providers for more capable inference, but the system functions without them.
+
+*On live images and binaries.* Passepartout's primary delivery path is source code running in a live SBCL process. The REPL is available. Skills hot-reload. The cognitive loop runs in an image that is mutable, inspectable, and homeiconic — the user can connect with SLIME, trace functions, inspect memory objects, and modify the system while it runs. A =save-lisp-and-die= binary is provided as a convenience for platforms where SBCL cannot be installed. The binary is the same image saved to disk with Swank pre-loaded — it is not a sealed container. The REPL works. Skills hot-reload. The binary is a packaging format, not an architectural decision.
+
+** Token Economics and Performance Advantage
+:PROPERTIES:
+:ID:       design-token-economics
+:CREATED:  [2026-05-07 Wed]
+:WEIGHT:  40
+:END:
+
+This section analyzes how Passepartout's architectural decisions translate into token usage, latency, and cost versus competing agent designs.
+
+*** The Core Insight: LLM as Expensive Resource, Not Default Engine
+
+Passepartout treats the LLM as a resource to be minimized. Every operation is designed to reduce LLM dependency. Competitors treat the LLM as the core engine through which all operations flow. This is not a difference of degree but of architecture.
+
+The structural multipliers are:
+
+1. *Sparse tree retrieval* — the foveal-peripheral model renders relevant Org subtrees (titles and properties for peripheral nodes, full content for foveal and semantically relevant nodes). Active context stays at 2,000–4,000 tokens. A "load everything" architecture serializes the entire knowledge base at 50,000–150,000 tokens. The mechanism is provably cheaper; the exact multiplier depends on memex size and complexity.
+
+2. *Deterministic safety* — the 10-vector Dispatcher gate stack runs in pure Lisp. Every gate is a Common Lisp function. Verification costs 0 LLM tokens per action. Competitors use prompt-based guardrails consuming 100–500 LLM tokens per verification. This multiplier is mathematically infinite — a Lisp function call costs no tokens, a guardrail paragraph in a system prompt costs tokens proportional to its length.
+
+3. *REPL verification* — code is tested in the running image before it is committed. Errors surface in milliseconds at 0 LLM tokens. Competitors discover errors after generation and pay 500–2,000 tokens per correction round-trip. The REPL eliminates the most expensive kind of LLM call: the one that produced wrong code and needs a do-over.
+
+4. *Hot state* — in a REPL-based agent, variables, file handles, sub-routine results, and memory objects are already in memory. Every turn in a standard chat agent re-sends the full conversation history. Token costs in chat agents are quadratic: a 10-turn session pays for ~55 "turns" of context (10 + 9 + 8 + ... + 1 = 55). In Passepartout, context is stored once in the Lisp image. A 10-turn session pays for ~10 turns of context. This is an ~82% reduction on protocol overhead alone, before any foveal-peripheral pruning.
+
+5. *Temporal filtering* — time-scoped memory queries return only nodes matching the time window. The temporal filter is a pure-Lisp hash-table walk with a numeric comparison on =memory-object-version=. Sub-millisecond. 0 LLM tokens. Competitors without time-indexed memory must serialize all nodes and let the LLM scan for temporal relevance — 5,000–50,000 tokens per temporal query.
+
+*** The Compounding Cost Curve — Unique Among Agents
+
+Every AI agent grows more expensive over time. Context histories accumulate. Safety instructions grow more elaborate. Guardrails become longer prompt paragraphs. The user's data grows. The only way to reduce cost in a standard agent is to cap context — sacrificing capability.
+
+Passepartout has a downward cost curve. Four mechanisms compound:
+
+1. *Dispatcher learning.* Every blocked action and approved exception becomes a deterministic rule. A file write that initially triggered a full LLM proposal → Dispatcher review → HITL approval → rule extraction loop eventually becomes a deterministic rule check. Each hardened rule permanently removes a future LLM call.
+
+2. *Symbolic induction.* The agent extracts patterns from successful interaction sequences and converts them into reusable Lisp functions. A multi-step task that took 5,000 tokens today takes 0 tokens tomorrow — it's now a =defun=. The Dispatcher learns what to block. Symbolic induction learns what to automate.
+
+3. *Native embedding inference.* Every semantic search query runs against in-image vectors at 0 external tokens. Competitors use LLM-assisted search for most retrieval operations. Passepartout's retrieval is a vector cosine similarity check — pure math, no model call.
+
+4. *Prefix caching.* The static portion of the system prompt (IDENTITY, TOOLS, LOGS format) is transmitted once per session. Dynamic content (CONTEXT, user prompt) is sent on each call. Anthropic's prompt caching gives a 90% discount on cached tokens. OpenAI caches automatically.
+
+After 12 months of daily use, Passepartout's per-session costs are expected to be 40–60% of baseline, while competitors' costs rise to 125–140% of baseline. The crossover point is estimated at 3–6 months. This is not a model quality claim — it is a structural property of the architecture.
+
+** Time Awareness as a Structural Advantage
+:PROPERTIES:
+:ID:       design-time-awareness
+:CREATED:  [2026-05-07 Thu]
+:WEIGHT:  40
+:END:
+
+Passepartout's architecture provides three layers of time awareness, each enabled by infrastructure that competitors lack:
+
+*Level 1 — Present Awareness.* The LLM knows the current time, date, and session duration because a single =format-time-for-llm= call injects it into the system prompt. Most agents know the date from the OS. None know the time or session duration. The cost is ~8 incremental tokens per call (trivially prefix-cached). The saving is eliminating "I don't know the current time" preamble tokens, time-check tool calls, and incorrect temporal reasoning from a model guessing the time.
+
+*Level 2 — Temporal Memory.* Memory queries accept =:since= and =:until= parameters. "What did I work on in the last hour?" filters 500 nodes to 12 in sub-millisecond Lisp rather than serializing 500 nodes to the LLM at ~5,000 tokens for it to scan. Every memory node carries a =memory-object-version= timestamp (a monotonic =get-universal-time= value set at ingest since v0.1.0). The temporal filter is a hash-table walk with numeric comparison. 0 LLM tokens. >90% token reduction on time-scoped queries.
+
+*Level 3 — Proactive Triggers.* The heartbeat tick scans for approaching deadlines every 60 seconds. When a deadline is within the warning window (=DEADLINE_WARNING_MINUTES=, default 60), a temporal context note is injected into the awareness assembly. The LLM sees "3 deadlines today: Submit report (45min)" in its context without a triggering call. A "what should I work on today?" query is answered from pre-loaded context — 0 LLM tokens versus 1,500–4,000 for an unassisted agent.
+
+None of these three layers require new infrastructure. Time awareness is not a feature Passepartout builds — it is a feature Passepartout *unlocks* by having timestamped memory (v0.1.0), heartbeat+cron (v0.3.0), and the foveal-peripheral context pruning model (v0.2.0) already in place. Adding time awareness costs ~175 lines of Lisp. Building it in competitors would require building the heartbeat, the time-indexed memory, and the proactive context injection — 800+ lines each — and would still cost LLM tokens because their safety verification is prompt-based.
+
+** Definite Description Resolution
+:PROPERTIES:
+:ID:       design-description-resolution
+:CREATED:  [2026-05-14 Thu]
+:WEIGHT:  40
+:END:
+
+When the user says "the function that validates secrets," the agent must resolve this to a specific code entity. Natural language is ambiguous — there might be multiple functions matching the description. Resolving to the wrong one causes incorrect actions.
+
+/Principia Mathematica/'s theory of descriptions addresses this: "the current king of France is bald" — a sentence that seems to refer to something that doesn't exist. PM formalizes ~ιx(φx)~ as "the unique x such that φ holds." A statement is false (not meaningless) when there is no unique x satisfying φ.
+
+A cognitive tool that checks descriptional uniqueness before resolution:
+
+#+BEGIN_SRC lisp
+(def-cognitive-tool :resolve-reference
+    (query-string &key (max-candidates 10)
+                   (context-path *current-context*))
+  "Resolve a definite description to a unique referent."
+  (let ((candidates (search-knowledge-graph query-string
+                                               :source-path context-path
+                                               :limit max-candidates)))
+    (cond
+      ((null candidates)
+       (values nil :no-referent query-string))
+      ((> (length candidates) 1)
+       (values nil :ambiguous candidates))
+      (t
+       (values (first candidates) :unique nil)))))
+#+END_SRC
+
+~40 lines as a skill in v0.7.2. When VivaceGraph ships (v3.0.0), descriptions become native Prolog queries with uniqueness constraints.
+
+For the philosophical foundations, see the Whitehead analysis in the Validation section below.