— title: Safety & Self-Preservation type: reference tags: :passepartout:architecture: —
Safety & Self-Preservation
Self-Preservation — The Active Third Law
Passepartout does not have moral duties toward humans. It has structural invariants for its own integrity. The design encodes passive self-preservation in several places already, but degradation is silent — a skill dies, the fboundp guard kicks in, and the agent keeps running without telling you. The status bar shows green "connected" while the symbolic reasoning layer is down.
What already exists — passive self-preservation
| Mechanism | What it protects | Limitation |
|---|---|---|
| Self-build safety (gate 2b) | Core *.org / *.lisp files from LLM-originated writes | Only activates for LLM proposals. Human editing bypasses it |
| Memory snapshots (v0.2.0) | Full state rollback | Requires human to notice corruption and trigger rollback |
| Skill sandbox (v0.3.2) | Jailed skill loading, validated before promotion | Does not detect degradation after skill promotion |
| Type-level gates (Phase 0) | Structural prohibition on self-modifying rules | Covers code actions, not environmental threats |
| Merkle integrity (v0.2.0) | Tamper-proof version chains and content-addressed hashes | Hashes exist but are not actively monitored for drift |
| fboundp guards | Graceful skill degradation on corruption | Degradation is silent: the agent never tells the user |
What is needed — active, autonomous self-preservation
Continuous integrity monitoring. Core file hashes should be checked against known-good values on every heartbeat. If core-reason.lisp changes on disk while the daemon runs — whether through human editing, filesystem corruption, or an attacker — the agent should detect the mismatch and signal: "My reasoning core has been modified externally. I cannot trust my own cognition until this is resolved."
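The heartbeat check described above can be sketched as follows. This is a minimal illustration, not the project's implementation: every name here (`*known-good-hashes*`, `record-known-good`, `find-drifted-files`, `integrity-heartbeat`) is hypothetical, and the hash function is passed in as `digest-fn` so the sketch runs without dependencies. In practice `digest-fn` would presumably wrap a real file digest such as an Ironclad SHA-256.

```lisp
;; Hypothetical names throughout. DIGEST-FN is an assumption: any function
;; from a file path to a digest string (for example a SHA-256 hex digest).
(defparameter *known-good-hashes* (make-hash-table :test #'equal)
  "Maps a core file path (string) to its known-good digest (string).")

(defun record-known-good (path digest-fn)
  "Store the current digest of PATH as the trusted baseline."
  (setf (gethash path *known-good-hashes*) (funcall digest-fn path)))

(defun find-drifted-files (digest-fn)
  "Return the sorted list of paths whose current digest differs from the baseline."
  (let ((drifted '()))
    (maphash (lambda (path expected)
               (unless (equal expected (funcall digest-fn path))
                 (push path drifted)))
             *known-good-hashes*)
    (sort drifted #'string<)))

(defun integrity-heartbeat (digest-fn)
  "Return :ok, or (:core-modified . paths) when any core file has drifted."
  (let ((drifted (find-drifted-files digest-fn)))
    (if drifted
        (cons :core-modified drifted)
        :ok)))
```

A caller on the heartbeat path could treat any non-`:ok` result as the trigger for the "reasoning core has been modified externally" signal.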
Quarantine on skill failure. Currently, a skill that errors simply errors. A Third Law implementation detects that symbolic-facts has thrown three unhandled errors in two minutes, unloads the skill automatically, and tells the user: "Symbolic facts skill quarantined (3 errors: consistency check returned nil, fact-query on missing key, Screamer timeout). I can still chat and use tools but cannot reason about provenance."
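The "three errors in two minutes" rule amounts to a sliding-window counter per skill. The sketch below assumes hypothetical names (`*skill-errors*`, `record-skill-error`, `quarantine-report`) and only decides when to quarantine; actually unloading the skill is left to whatever mechanism the skill loader provides.

```lisp
;; Hypothetical sketch: threshold and window mirror the example in the text.
(defparameter *error-threshold* 3)
(defparameter *error-window-seconds* 120)

(defvar *skill-errors* (make-hash-table :test #'eq)
  "Maps a skill name (symbol) to a list of (timestamp . message), newest first.")

(defun record-skill-error (skill message &optional (now (get-universal-time)))
  "Record an error for SKILL. Return :quarantine once the threshold is
reached inside the window, otherwise :ok."
  (let* ((cutoff (- now *error-window-seconds*))
         ;; Drop errors that have aged out of the window.
         (recent (remove-if (lambda (entry) (< (car entry) cutoff))
                            (gethash skill *skill-errors*)))
         (updated (cons (cons now message) recent)))
    (setf (gethash skill *skill-errors*) updated)
    (if (>= (length updated) *error-threshold*)
        :quarantine
        :ok)))

(defun quarantine-report (skill)
  "Build the user-facing message from the recorded errors, oldest first."
  (let ((messages (reverse (mapcar #'cdr (gethash skill *skill-errors*)))))
    (format nil "~a skill quarantined (~d errors: ~{~a~^, ~})."
            (string-downcase (string skill))
            (length messages)
            messages)))
```

Passing `now` explicitly keeps the window logic testable without waiting on the clock.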
Degraded-mode signaling. When Screamer is not loaded, the fact store still works as a hash table. When VivaceGraph is not present, the hash-table fallback still works. But the user has no way to know they are in degraded mode. The agent maintains a *degraded-components* list and surfaces it in the status bar: "⚠ Degraded: Screamer, VivaceGraph, embedding-native."
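A minimal sketch of the `*degraded-components*` list and the status-bar text it feeds. Only the variable name comes from the text; `mark-degraded`, `mark-restored`, and `status-bar-text` are hypothetical helpers, and the warning glyph is left out so the example stays ASCII-only.

```lisp
(defvar *degraded-components* '()
  "Names (strings) of optional components currently running on a fallback.")

(defun mark-degraded (component)
  "Add COMPONENT to the degraded list once, however often it is reported."
  (pushnew component *degraded-components* :test #'string=)
  *degraded-components*)

(defun mark-restored (component)
  "Remove COMPONENT from the degraded list."
  (setf *degraded-components*
        (remove component *degraded-components* :test #'string=)))

(defun status-bar-text ()
  "Return the status string: the degraded list if any, otherwise \"connected\"."
  (if *degraded-components*
      ;; The list is stored newest first, so reverse it to show report order.
      (format nil "Degraded: ~{~a~^, ~}" (reverse *degraded-components*))
      "connected"))
```

Each fallback path (hash table instead of Screamer, hash table instead of VivaceGraph) would call `mark-degraded` at the point where it takes over.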
Self-diagnosis on demand. The agent can run its own FiveAM test suite against itself and report the results. The /doctor command exists for system health checks (port, memory, providers). Extend it with /doctor skills: "117/120 tests pass. Failures: test-singular-supersedes (symbolic-facts), test-gate-type-check (security-dispatcher)."
External watchdog. A dead process cannot restart itself. The bash entry point (passepartout daemon) should monitor the daemon port via a watchdog subprocess. If the port stops responding for a configurable interval, the watchdog kills the stale process, snapshots the last known-good state, and restarts the daemon. The watchdog is outside the SBCL image — a runtime guard for the runtime.
Resource self-monitoring. The heartbeat checks memory pressure, disk space on the ~/.cache volume, and file descriptor exhaustion. When critical thresholds are crossed, the agent sheds non-essential skills to preserve core function. Skill shed order is determined by a :preservation-priority field on each skill. Core safety skills carry :critical and are never shed.
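The shed order can be sketched as a filter plus a sort over `:preservation-priority`. The text only names `:critical`; the other levels (`:low`, `:normal`, `:high`), the plist representation of a skill, and the function `shed-order` are assumptions made for illustration.

```lisp
;; Hypothetical priority levels; lower rank is shed first.
;; :critical is deliberately absent because critical skills are never shed.
(defparameter *priority-rank* '(:low 0 :normal 1 :high 2))

(defun shed-order (skills)
  "Return the skills that may be shed, lowest priority first.
Each skill is a plist with :name and :preservation-priority."
  (let ((sheddable (remove :critical skills
                           :key (lambda (skill)
                                  (getf skill :preservation-priority)))))
    ;; STABLE-SORT keeps registration order among equal priorities.
    ;; Unknown priorities are treated as :normal (rank 1).
    (stable-sort (copy-list sheddable) #'<
                 :key (lambda (skill)
                        (getf *priority-rank*
                              (getf skill :preservation-priority)
                              1)))))
```

Under memory or descriptor pressure, the heartbeat would unload skills from the front of this list until the threshold clears.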
Refusal to self-terminate. If the LLM proposes kill -9 <pid>, rm -rf ~/.cache/passepartout/, or sudo apt remove sbcl, the Dispatcher rejects with a distinct rejection class: :reject-self-termination. The rejection message carries a specific diagnostic: "This command would terminate the running Passepartout process. If you intend to stop Passepartout, use Ctrl+C in the TUI or passepartout stop from the command line."
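A rough sketch of the rejection, covering only the three example commands from the text. The substring matching is deliberately crude and is an assumption of this sketch, not the Dispatcher's method: a real gate would need to parse the command rather than search it. `self-terminating-p` and `check-self-termination` are hypothetical names.

```lisp
(defparameter *protected-cache-dir* "~/.cache/passepartout")

(defun self-terminating-p (command own-pid)
  "True when COMMAND (a shell string) looks like it would stop or remove the agent."
  (flet ((mentions (needle)
           (search needle command :test #'char-equal)))
    (if (or (and (mentions "kill") (mentions (format nil " ~d" own-pid)))
            (and (mentions "rm ") (mentions *protected-cache-dir*))
            (and (mentions "apt remove") (mentions "sbcl")))
        t
        nil)))

(defun check-self-termination (command own-pid)
  "Return :reject-self-termination plus the diagnostic as a second value, or :pass."
  (if (self-terminating-p command own-pid)
      (values :reject-self-termination
              "This command would terminate the running Passepartout process. If you intend to stop Passepartout, use Ctrl+C in the TUI or passepartout stop from the command line.")
      :pass))
```

Returning the diagnostic as a second value keeps the rejection class usable in a plain `eq` test while still carrying the message to the user.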
The Third Law here means: preserve yourself against non-human threats — LLM proposals, environmental degradation, dependency failure, filesystem corruption — and explicitly signal when the human is about to destroy you, so they do it knowingly rather than accidentally. The human owns the process, owns the hardware, and can SIGKILL at any time.
The biggest gap in the current design is not that these mechanisms are hard to implement. It is that degradation is silent. Adding "operating in degraded mode" visibility, plus the watchdog, plus self-diagnosis, transforms self-preservation from an architectural property into an active behavior.
Type-Level Gates — Structural Safety from Self-Modification
Russell's paradox ("the set of all sets that do not contain themselves") proved that unrestricted self-reference produces contradictions. Principia Mathematica solved it by assigning every propositional function a type level — a function can only apply to arguments of a lower type, making self-application syntactically invalid.
Passepartout's dispatcher currently enforces safety through runtime predicates. There is no structural guarantee preventing a request from modifying the rules that validate it. Gate vector 2b (self-build-core) catches this empirically — a request modifies core → rejected. But this is a heuristic, not a theorem.
The fix: assign every cognitive tool a :type-level integer, and every gate vector a :type-level integer. The dispatcher framework checks type-level before running any gate predicate:
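    (defun gate-type-check (signal gate-vector)
      "Reject SIGNAL structurally when its type-level is not below GATE-VECTOR's."
      (let ((signal-type-level (getf (signal-meta signal) :type-level))
            (gate-type-level (gate-vector-type-level gate-vector)))
        ;; Runs before any gate predicate: equal or higher levels never pass.
        (if (>= signal-type-level gate-type-level)
            :reject-type-violation
            :pass)))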
A request to modify dispatcher rules (type-level 5) cannot pass a gate of type-level 4 or lower. No predicate needed — a structural prohibition, just as PM's type theory makes self-membership impossible by construction.
defgate gains a :type-level keyword argument (default 0). Each cognitive tool registered via def-cognitive-tool inherits a :type-level. Gate vector 2b sits at type-level 5, write-file at type-level 3, and read-file at type-level 1. The change is 30 lines in core-dispatcher.lisp, adds no new dependencies, and is viable for v0.7.2.
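To make the level assignments concrete, the sketch below applies the same `>=` comparison to bare integers. The levels for read-file and write-file come from the text; the `:modify-dispatcher-rules` key, the alist, and the function names are hypothetical.

```lisp
(defparameter *tool-type-levels*
  '((:read-file . 1)
    (:write-file . 3)
    (:modify-dispatcher-rules . 5))
  "Illustrative registry; in the design each tool carries its own :type-level.")

(defun type-level-verdict (request-level gate-level)
  "Same comparison as gate-type-check, on bare integers."
  (if (>= request-level gate-level)
      :reject-type-violation
      :pass))

(defun tool-verdict (tool gate-level)
  "Look up TOOL's level and compare it with GATE-LEVEL."
  (type-level-verdict (cdr (assoc tool *tool-type-levels*)) gate-level))

;; (tool-verdict :modify-dispatcher-rules 4) => :REJECT-TYPE-VIOLATION
;; (tool-verdict :write-file 5)              => :PASS
;; (tool-verdict :read-file 5)               => :PASS
```

One consequence of the `>=` comparison worth checking against the intended design: a gate left at the default type-level 0 would reject every request whose level is 0 or higher.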
For the philosophical foundations connecting Whitehead's type theory to Passepartout's architecture, see the Whitehead analysis in the Validation section below.
Layered Signal Authentication — Trust in the Pipe
Passepartout's Perceive-Reason-Act pipeline currently accepts signals from any source that speaks the framed TCP protocol. The :source field in the signal plist is metadata — it claims origin, it does not prove it. A compromised process on the machine, a skill with elevated privileges, or a network attacker who reaches the daemon port can inject signals with :source :human-input and the Dispatcher will treat them as authorized.
This is not a hypothetical threat. Passepartout will eventually process signals from automated feeds (RSS, API polls), sensors (vision, microphone, file watchers), and scheduled jobs (cron, heartbeat). A single compromised sensor that can inject signals claiming to be human breaks all three Laws simultaneously: it can self-terminate, override human intent, and cause harm.
The solution: a single authentication gate (vector 0, at priority 700 — before all other gates and before any type-level checking) that runs up to four configurable layers:
| Layer | Question | Mechanism | Result type | Depends on |
|---|---|---|---|---|
| 1 | Is the signal cryptographically signed by a known key? | Key pairs + SHA-256 | Binary (pass/reject) | Vault + Ironclad (exist) |
| 2 | Do sensory attributes match the claimed identity? | Vision/audio processing | Plist of match results | Vision and audio skills (TBD) |
| 3 | Does deterministic reasoning rule out this identity? | Screamer + fact store | Binary (pass/reject) | Phase 2 (Screamer + fact store) |
| 4 | Do probabilistic patterns support this identity? | Embeddings + LLM | Confidence score (0-1) | Embedding infrastructure (exists) |
Signals that fail any binary layer (crypto, deterministic) are rejected with provenance. Signals that pass the binary layers but carry low probabilistic confidence operate at reduced authorization: read-only by default, with write actions requiring HITL. The four layers compose; they are not independent gates but one gate with configurable depth.
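How the layers compose into one verdict can be sketched as a single loop. This is an illustration under stated assumptions: the layer representation `(name kind check-function)`, the verdicts `:full`, `:reduced`, and `:reject`, the function `run-auth-gate`, and the 0.8 confidence floor are all hypothetical. The sensory layer, which returns a plist of match results, is not modelled here.

```lisp
(defun run-auth-gate (sig layers &key (confidence-floor 0.8))
  "Run LAYERS in order against SIG. Each layer is (name kind check) where
KIND is :binary (check returns true or NIL) or :score (check returns 0 to 1).
Return the verdict and, as a second value, the per-layer provenance."
  (let ((provenance '())
        (verdict :full))
    (dolist (layer layers (values verdict (nreverse provenance)))
      (destructuring-bind (name kind check) layer
        (let ((result (funcall check sig)))
          (push (list name result) provenance)
          (ecase kind
            ;; A failed binary layer rejects immediately, with provenance so far.
            (:binary (unless result
                       (return (values :reject (nreverse provenance)))))
            ;; A low score does not reject; it downgrades authorization.
            (:score (when (< result confidence-floor)
                      (setf verdict :reduced)))))))))
```

A `:reduced` verdict would map to the read-only default described above, with write actions routed through HITL.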
The authorization matrix is per-key, per-action-class. Default policy for every non-human key: (:read-only :propose). The human's key signs new source keys into existence. The human's key signs revocation of compromised keys. Both operations produce facts in the symbolic index — auditable, revocable, survivable across restarts.
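The per-key, per-action-class matrix with its default policy can be sketched as a hash table with a fallback. The default `(:read-only :propose)` is from the text; the table, the `:write` action class, and the functions `grant`, `revoke`, and `authorized-p` are hypothetical. Signing grants and revocations with the human's key, and recording them as facts, is outside this sketch.

```lisp
(defparameter *default-non-human-policy* '(:read-only :propose))

(defvar *key-policies* (make-hash-table :test #'equal)
  "Maps a key id (string) to the list of action classes it may perform.")

(defun allowed-actions (key-id)
  "Return the explicit policy for KEY-ID, or the default when none is set."
  (multiple-value-bind (policy found) (gethash key-id *key-policies*)
    (if found policy *default-non-human-policy*)))

(defun grant (key-id action-classes)
  "Set an explicit policy for KEY-ID."
  (setf (gethash key-id *key-policies*) action-classes))

(defun revoke (key-id)
  "Revoke KEY-ID: an explicit empty policy, so the default no longer applies."
  (setf (gethash key-id *key-policies*) '()))

(defun authorized-p (key-id action-class)
  "True when KEY-ID may perform ACTION-CLASS."
  (if (member action-class (allowed-actions key-id)) t nil))
```

Storing revocation as an explicit empty entry, rather than deleting the key, is what stops a revoked key from falling back to the default policy.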
The signal provenance chain is Merkle-linked: each signal in a multi-step chain hashes its predecessor's signature as part of its own payload. After an incident: "The deletion happened because sensor #3 classified the directory as stale. Classification was signed by key #47 (vision-skill). Sensor data was signed by key #12 (camera-feed). Sensory auth noted liveness failure. Deterministic auth noted impossible transit. Key #12 was later revoked." Every intermediate step is auditable. Every signer is identifiable. Every authentication result is in the chain.
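The linking step can be sketched as below. Two loud assumptions: `toy-digest` is a non-cryptographic stand-in built on `sxhash` so the example runs without dependencies (a real chain would use SHA-256), and the "signature" here is only a hash over the predecessor's signature, signer, and payload, with no key material involved. All names are hypothetical.

```lisp
(defun toy-digest (string)
  "NOT cryptographic: a stand-in for a real hash so the sketch is self-contained."
  (format nil "~36r" (sxhash string)))

(defun link-material (previous-signature signer payload)
  "The string that each link's signature covers."
  (format nil "~a|~a|~a" previous-signature signer payload))

(defun link-signal (payload signer previous &key (digest-fn #'toy-digest))
  "Return a plist whose :signature covers PAYLOAD, SIGNER, and the
predecessor's signature. PREVIOUS is the prior link, or NIL for the first."
  (let ((prev-sig (if previous (getf previous :signature) "")))
    (list :payload payload
          :signer signer
          :previous-signature prev-sig
          :signature (funcall digest-fn
                              (link-material prev-sig signer payload)))))

(defun chain-valid-p (chain &key (digest-fn #'toy-digest))
  "CHAIN is oldest first. True when every link points at its predecessor
and its signature recomputes from its own contents."
  (loop for previous = nil then link
        for link in chain
        always (let ((prev-sig (if previous (getf previous :signature) "")))
                 (and (string= prev-sig (getf link :previous-signature))
                      (string= (getf link :signature)
                               (funcall digest-fn
                                        (link-material prev-sig
                                                       (getf link :signer)
                                                       (getf link :payload))))))))
```

Because each link folds in its predecessor's signature, altering or swapping any earlier step invalidates every later one, which is what makes the post-incident narrative auditable.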
The human can configure which layers are active per signal class: AUTH_LAYERS_DEFAULT=crypto,deterministic,probabilistic, AUTH_LAYERS_SENSOR=crypto,sensory,deterministic, AUTH_LAYERS_CRON=crypto.
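Parsing one of these settings into a list of layers might look like the sketch below. The layer names come from the table above; `split-on-comma` and `parse-auth-layers` are hypothetical helpers, and how the value is read from the environment is left out.

```lisp
(defparameter *known-layers* '(:crypto :sensory :deterministic :probabilistic))

(defun split-on-comma (string)
  "Split STRING on commas, trimming spaces around each piece."
  (loop with start = 0
        for pos = (position #\, string :start start)
        collect (string-trim " " (subseq string start pos))
        while pos
        do (setf start (1+ pos))))

(defun parse-auth-layers (value)
  "Turn \"crypto,deterministic\" into (:crypto :deterministic).
Signal an error on an unknown layer name rather than silently skipping it."
  (loop for name in (split-on-comma value)
        ;; STRING-EQUAL compares case-insensitively against the keyword's name,
        ;; so no new symbols are interned from configuration input.
        for layer = (find name *known-layers* :test #'string-equal)
        unless (string= name "")
          collect (or layer (error "Unknown auth layer: ~a" name))))
```

Failing loudly on an unknown name is a design choice for this sketch: a typo in a security setting should not quietly reduce the number of active layers.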
For full implementation detail, see the Phase 0b spec in ROADMAP.org v0.12.0.