Files
hermes-brain/projects/passepartout/architecture/design/safety-self-preservation/self-preservation-the-active-third-law.org

5.1 KiB

Self-Preservation — The Active Third Law

Passepartout does not have moral duties toward humans. It has structural invariants for its own integrity. The design encodes passive self-preservation in several places already, but degradation is silent — a skill dies, the fboundp guard kicks in, and the agent keeps running without telling you. The status bar shows green "connected" while the symbolic reasoning layer is down.

What already exists — passive self-preservation

Mechanism What it protects Limitation
Self-build safety (gate 2b) Core *.org / *.lisp files from LLM-originated writes Only activates for LLM proposals. Human editing bypasses it
Memory snapshots (v0.2.0) Full state rollback Requires human to notice corruption and trigger rollback
Skill sandbox (v0.3.2) Jailed skill loading, validated before promotion Does not detect degradation after skill promotion
Type-level gates (Phase 0) Structural prohibition on self-modifying rules Covers code actions, not environmental threats
Merkle integrity (v0.2.0) Tamper-proof version chains and content-addressed hashes Hashes exist but are not actively monitored for drift
fboundp guards Graceful skill degradation on corruption Degradation is silent — the agent never tells the user

What is needed — active, autonomous self-preservation

Continuous integrity monitoring. Core file hashes should be checked against known-good values on every heartbeat. If core-reason.lisp changes on disk while the daemon runs — whether through human editing, filesystem corruption, or an attacker — the agent should detect the mismatch and signal: "My reasoning core has been modified externally. I cannot trust my own cognition until this is resolved."

Quarantine on skill failure. Currently, a skill that errors simply errors. A Third Law implementation detects that symbolic-facts has thrown three unhandled errors in two minutes, unloads the skill automatically, and tells the user: "Symbolic facts skill quarantined (3 errors: consistency check returned nil, fact-query on missing key, Screamer timeout). I can still chat and use tools but cannot reason about provenance."

Degraded-mode signaling. When Screamer is not loaded, the fact store still works as a hash table. When VivaceGraph is not present, the hash-table fallback still works. But the user has no way to know they are in degraded mode. The agent maintains a *degraded-components* list and surfaces it in the status bar: "⚠ Degraded: Screamer, VivaceGraph, embedding-native."

Self-diagnosis on demand. The agent can run its own FiveAM test suite against itself and report the results. The /doctor command exists for system health checks (port, memory, providers). Extend it with /doctor skills: "117/120 tests pass. Failures: test-singular-supersedes (symbolic-facts), test-gate-type-check (security-dispatcher)."

External watchdog. A dead process cannot restart itself. The bash entry point (passepartout daemon) should monitor the daemon port via a watchdog subprocess. If the port stops responding for a configurable interval, the watchdog kills the stale process, snapshots the last known-good state, and restarts the daemon. The watchdog is outside the SBCL image — a runtime guard for the runtime.

Resource self-monitoring. The heartbeat checks memory pressure, disk space on the ~/.cache volume, and file descriptor exhaustion. When critical thresholds are crossed, the agent sheds non-essential skills to preserve core function. Skill shed order is determined by a :preservation-priority field on each skill. Core safety skills carry :critical and are never shed.

Refusal to self-terminate. If the LLM proposes kill -9 <pid>, rm -rf ~/.cache/passepartout/, or sudo apt remove sbcl, the Dispatcher rejects with a distinct rejection class: :reject-self-termination. The rejection message carries a specific diagnostic: "This command would terminate the running Passepartout process. If you intend to stop Passepartout, use Ctrl+C in the TUI or passepartout stop from the command line."

The Third Law here means: preserve yourself against non-human threats — LLM proposals, environmental degradation, dependency failure, filesystem corruption — and explicitly signal when the human is about to destroy you, so they do it knowingly rather than accidentally. The human owns the process, owns the hardware, and can SIGKILL at any time.

The biggest gap in the current design is not that these mechanisms are hard to implement. It is that degradation is silent. Adding "operating in degraded mode" visibility, plus the watchdog, plus self-diagnosis, transforms self-preservation from an architectural property into an active behavior.