hermes-brain/projects/passepartout/architecture/design/safety-self-preservation.org at 2ee60b5b4a9f96a18f9958695f2350ffd00c2d5e

Hermes 2ee60b5b4a Split design-decisions.org into design/ subfolder (10 files, one per section)

2026-06-04 20:16:02 +00:00

Safety & Self-Preservation

---
title: Safety & Self-Preservation
type: reference
tags: :passepartout:architecture:
---

Self-Preservation — The Active Third Law

Passepartout does not have moral duties toward humans. It has structural invariants for its own integrity. The design encodes passive self-preservation in several places already, but degradation is silent — a skill dies, the fboundp guard kicks in, and the agent keeps running without telling you. The status bar shows green "connected" while the symbolic reasoning layer is down.

What already exists — passive self-preservation

| Mechanism | What it protects | Limitation |
|---+---+---|
| Self-build safety (gate 2b) | Core `.org` / `.lisp` files from LLM-originated writes | Only activates for LLM proposals; human editing bypasses it |
| Memory snapshots (v0.2.0) | Full state rollback | Requires a human to notice corruption and trigger rollback |
| Skill sandbox (v0.3.2) | Jailed skill loading, validated before promotion | Does not detect degradation after skill promotion |
| Type-level gates (Phase 0) | Structural prohibition on self-modifying rules | Covers code actions, not environmental threats |
| Merkle integrity (v0.2.0) | Tamper-proof version chains and content-addressed hashes | Hashes exist but are not actively monitored for drift |
| `fboundp` guards | Graceful skill degradation on corruption | Degradation is silent: the agent never tells the user |

What is needed — active, autonomous self-preservation

Continuous integrity monitoring. Core file hashes should be checked against known-good values on every heartbeat. If core-reason.lisp changes on disk while the daemon runs — whether through human editing, filesystem corruption, or an attacker — the agent should detect the mismatch and signal: "My reasoning core has been modified externally. I cannot trust my own cognition until this is resolved."
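
A minimal sketch of the heartbeat check described above, under stated assumptions. The names find-integrity-drift and integrity-heartbeat are hypothetical, and digest-fn stands in for a real file hash (for example SHA-256 through Ironclad); the source does not say how known-good values are stored, so an alist is assumed here.

```lisp
;; Sketch only: both function names and the alist layout are assumptions.
;; DIGEST-FN is a placeholder for a real file-hashing function.

(defun find-integrity-drift (known-good digest-fn)
  "Return the paths in KNOWN-GOOD whose current digest differs from the
recorded one. KNOWN-GOOD is an alist of (path . expected-digest)."
  (loop for (path . expected) in known-good
        unless (equal expected (funcall digest-fn path))
          collect path))

(defun integrity-heartbeat (known-good digest-fn)
  "Return :ok, or (:core-modified path...) when any core file has drifted."
  (let ((drifted (find-integrity-drift known-good digest-fn)))
    (if drifted
        (cons :core-modified drifted)
        :ok)))
```

Passing the digest function as an argument keeps the comparison logic testable without touching the filesystem; a real heartbeat would supply a function that hashes the file on disk.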

Quarantine on skill failure. Currently, a skill that errors simply errors. A Third Law implementation detects that symbolic-facts has thrown three unhandled errors in two minutes, unloads the skill automatically, and tells the user: "Symbolic facts skill quarantined (3 errors: consistency check returned nil, fact-query on missing key, Screamer timeout). I can still chat and use tools but cannot reason about provenance."
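
The "three errors in two minutes" rule can be sketched as a sliding window over error timestamps. This is an illustration, not the project's implementation: record-skill-error, *skill-errors*, and the two parameters are hypothetical names, and timestamps are assumed to be seconds (for example from get-universal-time).

```lisp
;; Sketch only: all names below are assumptions for illustration.
(defparameter *quarantine-threshold* 3
  "Errors inside the window that trigger quarantine.")
(defparameter *quarantine-window* 120
  "Window length in seconds (two minutes).")
(defvar *skill-errors* (make-hash-table :test #'eq)
  "Maps a skill name to the timestamps of its recent errors.")

(defun record-skill-error (skill now)
  "Record an error for SKILL at time NOW (seconds).
Return :quarantine once the threshold is reached inside the window, else :ok."
  (let ((recent (remove-if (lambda (time) (> (- now time) *quarantine-window*))
                           (cons now (gethash skill *skill-errors*)))))
    ;; Keep only the timestamps still inside the window.
    (setf (gethash skill *skill-errors*) recent)
    (if (>= (length recent) *quarantine-threshold*)
        :quarantine
        :ok)))
```

A caller that receives :quarantine would then unload the skill and surface the collected error messages to the user; that part is left out here.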

Degraded-mode signaling. When Screamer is not loaded, the fact store still works as a hash table. When VivaceGraph is not present, the hash-table fallback still works. But the user has no way to know they are in degraded mode. The agent maintains a *degraded-components* list and surfaces it in the status bar: "⚠ Degraded: Screamer, VivaceGraph, embedding-native."
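
One possible shape for the list and its status-bar rendering is sketched below. Only the *degraded-components* name comes from the text above; mark-degraded and degraded-status-line are hypothetical helpers, and components are assumed to be stored as strings.

```lisp
;; Sketch only: helper names and the string representation are assumptions.
(defvar *degraded-components* '()
  "Names of components currently running on a fallback path.")

(defun mark-degraded (component)
  "Add COMPONENT to the degraded list once, and return the list."
  (pushnew component *degraded-components* :test #'string=)
  *degraded-components*)

(defun degraded-status-line ()
  "Return the status-bar warning, or NIL when nothing is degraded."
  (when *degraded-components*
    ;; PUSHNEW prepends, so reverse to show components in detection order.
    (format nil "⚠ Degraded: ~{~A~^, ~}." (reverse *degraded-components*))))
```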

Self-diagnosis on demand. The agent can run its own FiveAM test suite against itself and report the results. The /doctor command exists for system health checks (port, memory, providers). Extend it with /doctor skills: "117/120 tests pass. Failures: test-singular-supersedes (symbolic-facts), test-gate-type-check (security-dispatcher)."

External watchdog. A dead process cannot restart itself. The bash entry point (passepartout daemon) should monitor the daemon port via a watchdog subprocess. If the port stops responding for a configurable interval, the watchdog kills the stale process, snapshots the last known-good state, and restarts the daemon. The watchdog is outside the SBCL image — a runtime guard for the runtime.

Resource self-monitoring. The heartbeat checks memory pressure, disk space on the ~/.cache volume, and file descriptor exhaustion. When critical thresholds are crossed, the agent sheds non-essential skills to preserve core function. Skill shed order is determined by a :preservation-priority field on each skill. Core safety skills carry :critical and are never shed.
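
The shed order could be computed roughly as follows. This is a sketch under assumptions: only :preservation-priority and :critical appear in the design; the other priority levels (:low, :normal, :high), the plist layout, and the function name skills-to-shed are invented for illustration.

```lisp
;; Sketch only: priority levels other than :critical are assumptions.
(defparameter *shed-order* '(:low :normal :high)
  "Priority levels from first-shed to last-shed. :critical is deliberately absent.")

(defun skills-to-shed (skills count)
  "Return up to COUNT skill names to unload, lowest priority first.
SKILLS is a list of plists with :name and :preservation-priority.
Skills marked :critical are never returned."
  (let* ((sheddable (remove :critical skills
                            :key (lambda (s) (getf s :preservation-priority))))
         ;; Unknown priorities sort after every known level, so they go last.
         (ordered (stable-sort
                   (copy-list sheddable) #'<
                   :key (lambda (s)
                          (or (position (getf s :preservation-priority)
                                        *shed-order*)
                              (length *shed-order*))))))
    (mapcar (lambda (s) (getf s :name))
            (subseq ordered 0 (min count (length ordered))))))
```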

Refusal to self-terminate. If the LLM proposes kill -9 <pid>, rm -rf ~/.cache/passepartout/, or sudo apt remove sbcl, the Dispatcher rejects with a distinct rejection class: :reject-self-termination. The rejection message carries a specific diagnostic: "This command would terminate the running Passepartout process. If you intend to stop Passepartout, use Ctrl+C in the TUI or passepartout stop from the command line."
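
A deliberately naive sketch of such a check is shown below, covering only the three example commands from the paragraph above. The function names are hypothetical, and word-level matching like this is easy to evade (pkill, shell expansion, aliases); a real Dispatcher rule would need something sturdier.

```lisp
;; Sketch only: naive word matching, not a complete detector.
(defun split-words (string)
  "Split STRING on spaces, dropping empty words."
  (loop with start = 0
        for end = (position #\Space string :start start)
        for word = (subseq string start end)
        unless (string= word "") collect word
        while end
        do (setf start (1+ end))))

(defun check-self-termination (command pid)
  "Return :reject-self-termination when COMMAND would kill process PID,
delete the Passepartout cache, or remove SBCL; otherwise return :pass."
  (let ((words (split-words command)))
    (flet ((has (word) (member word words :test #'string=)))
      (if (or (and (has "kill") (has (princ-to-string pid)))
              (and (has "rm") (search "~/.cache/passepartout" command))
              (and (has "remove") (has "sbcl")))
          :reject-self-termination
          :pass))))
```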

The Third Law here means: preserve yourself against non-human threats — LLM proposals, environmental degradation, dependency failure, filesystem corruption — and explicitly signal when the human is about to destroy you, so they do it knowingly rather than accidentally. The human owns the process, owns the hardware, and can SIGKILL at any time.

The biggest gap in the current design is not that these mechanisms are hard to implement. It is that degradation is silent. Adding "operating in degraded mode" visibility, plus the watchdog, plus self-diagnosis, transforms self-preservation from an architectural property into an active behavior.

Type-Level Gates — Structural Safety from Self-Modification

Russell's paradox ("the set of all sets that do not contain themselves") proved that unrestricted self-reference produces contradictions. Principia Mathematica solved it by assigning every propositional function a type level — a function can only apply to arguments of a lower type, making self-application syntactically invalid.

Passepartout's dispatcher currently enforces safety through runtime predicates. There is no structural guarantee preventing a request from modifying the rules that validate it. Gate vector 2b (self-build-core) catches this empirically — a request modifies core → rejected. But this is a heuristic, not a theorem.

The fix: assign every cognitive tool a :type-level integer, and every gate vector a :type-level integer. The dispatcher framework checks type-level before running any gate predicate:

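(defun gate-type-check (signal gate-vector)
  "Pass SIGNAL only when its type level is strictly below GATE-VECTOR's."
  (let ((signal-type-level (getf (signal-meta signal) :type-level))
        (gate-type-level (gate-vector-type-level gate-vector)))
    ;; Fail closed: a signal without an integer :type-level is rejected
    ;; instead of crashing the >= comparison on NIL.
    (if (or (not (integerp signal-type-level))
            (>= signal-type-level gate-type-level))
        :reject-type-violation
        :pass)))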

A request to modify dispatcher rules (type-level 5) cannot pass a gate of type-level 5 or lower, because a signal passes only when its level is strictly below the gate's. No predicate is needed: this is a structural prohibition, just as PM's type theory makes self-membership impossible by construction.

defgate gains a :type-level keyword argument (default 0). Each cognitive tool registered via def-cognitive-tool inherits a :type-level. Gate vector 2b sits at type-level 5, write-file at type-level 3, and read-file at type-level 1. The change is about 30 lines in core-dispatcher.lisp with no new dependencies, and is viable for v0.7.2.
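
To make the level assignments concrete, the sketch below applies the same comparison to the three levels named above. It is self-contained and does not use the real defgate or def-cognitive-tool macros; the table *tool-type-levels*, the tool name :modify-dispatcher-rules, and the function type-level-verdict are hypothetical, and rejecting unknown tools is an assumption.

```lisp
;; Sketch only: a stand-alone table in place of the real tool registry.
(defparameter *tool-type-levels*
  '((:read-file . 1)
    (:write-file . 3)
    (:modify-dispatcher-rules . 5))
  "Type levels taken from the text; the last tool name is hypothetical.")

(defun type-level-verdict (tool gate-type-level)
  "Return :pass when TOOL's level is strictly below GATE-TYPE-LEVEL,
otherwise :reject-type-violation. Unknown tools are rejected."
  (let ((level (cdr (assoc tool *tool-type-levels*))))
    (if (and (integerp level) (< level gate-type-level))
        :pass
        :reject-type-violation)))
```

Against gate vector 2b at type-level 5, read-file and write-file pass the type check and go on to the gate's predicate, while a request at level 5 is rejected before any predicate runs.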

For the philosophical foundations connecting Whitehead's type theory to Passepartout's architecture, see the Whitehead analysis in the Validation section below.

Layered Signal Authentication — Trust in the Pipe

Passepartout's Perceive-Reason-Act pipeline currently accepts signals from any source that speaks the framed TCP protocol. The :source field in the signal plist is metadata — it claims origin, it does not prove it. A compromised process on the machine, a skill with elevated privileges, or a network attacker who reaches the daemon port can inject signals with :source :human-input and the Dispatcher will treat them as authorized.

This is not a hypothetical threat. Passepartout will eventually process signals from automated feeds (RSS, API polls), sensors (vision, microphone, file watchers), and scheduled jobs (cron, heartbeat). A single compromised sensor that can inject signals claiming to be human breaks all three Laws simultaneously: it can self-terminate, override human intent, and cause harm.

The solution: a single authentication gate (vector 0, at priority 700 — before all other gates and before any type-level checking) that runs up to four configurable layers:

| Layer | Question | Mechanism | Result type | Depends on |
|---+---+---+---+---|
| 1 | Is the signal cryptographically signed by a known key? | Key pairs + SHA-256 | Binary (pass/reject) | Vault + Ironclad (exist) |
| 2 | Do sensory attributes match the claimed identity? | Vision/audio processing | Plist of match results | Vision and audio skills (TBD) |
| 3 | Does deterministic reasoning rule out this identity? | Screamer + fact store | Binary (pass/reject) | Phase 2 (Screamer + fact store) |
| 4 | Do probabilistic patterns support this identity? | Embeddings + LLM | Confidence score (0-1) | Embedding infrastructure (exists) |

Signals that fail any binary layer (crypto, deterministic) are rejected with provenance. Signals that pass the binary layers but carry low probabilistic confidence operate at reduced authorization: read-only by default, with write actions requiring human-in-the-loop (HITL) approval. The four layers compose; they are not independent gates but one gate with configurable depth.
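
How the layer results might combine into a single verdict can be sketched as follows. The plist shape, the function name authorization-level, and the 0.5 confidence cut-off are all assumptions; the source gives no threshold. The sensory layer, which returns a plist of match results rather than a verdict, is not modeled.

```lisp
;; Sketch only: result format and threshold are assumptions.
(defparameter *confidence-threshold* 0.5
  "Hypothetical cut-off; the design does not specify a value.")

(defun authorization-level (results)
  "Combine layer RESULTS into :reject, :reduced, or :full.
RESULTS is a plist such as (:crypto :pass :deterministic :pass :probabilistic 0.9).
Layers that are not configured are simply absent from the plist."
  (let ((confidence (getf results :probabilistic)))
    (cond
      ;; Any failed binary layer rejects the signal outright.
      ((or (eq (getf results :crypto) :reject)
           (eq (getf results :deterministic) :reject))
       :reject)
      ;; Low confidence downgrades to read-only, HITL for writes.
      ((and confidence (< confidence *confidence-threshold*))
       :reduced)
      (t :full))))
```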

The authorization matrix is per-key, per-action-class. Default policy for every non-human key: (:read-only :propose). The human's key signs new source keys into existence. The human's key signs revocation of compromised keys. Both operations produce facts in the symbolic index — auditable, revocable, survivable across restarts.
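
A per-key, per-action-class lookup with the stated default could look like the sketch below. The hash-table representation, the integer key IDs, the :write action class, and the names grant-actions and authorized-p are assumptions; signing and revocation by the human's key are not modeled.

```lisp
;; Sketch only: storage layout and function names are assumptions.
(defparameter *default-policy* '(:read-only :propose)
  "Action classes allowed for any key without an explicit entry.")

(defvar *authorization-matrix* (make-hash-table :test #'eql)
  "Maps a key ID to the list of action classes it may perform.")

(defun grant-actions (key-id action-classes)
  "Record ACTION-CLASSES as the full set of classes allowed for KEY-ID."
  (setf (gethash key-id *authorization-matrix*) action-classes))

(defun authorized-p (key-id action-class)
  "True when KEY-ID may perform ACTION-CLASS, falling back to the default policy."
  (and (member action-class
               (gethash key-id *authorization-matrix* *default-policy*))
       t))
```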

The signal provenance chain is Merkle-linked: each signal in a multi-step chain hashes its predecessor's signature as part of its own payload. After an incident: "The deletion happened because sensor #3 classified the directory as stale. Classification was signed by key #47 (vision-skill). Sensor data was signed by key #12 (camera-feed). Sensory auth noted liveness failure. Deterministic auth noted impossible transit. Key #12 was later revoked." Every intermediate step is auditable. Every signer is identifiable. Every authentication result is in the chain.
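
The linking rule can be illustrated with a toy chain. This sketch checks only that each signal commits to its predecessor's signature; it does not verify the signatures themselves. toy-digest uses sxhash as a stand-in and is not collision resistant, so a real implementation would substitute a cryptographic hash. All function names and the plist layout are hypothetical.

```lisp
;; Sketch only: SXHASH is a placeholder, NOT a cryptographic hash.
(defun toy-digest (string)
  "Stand-in for a cryptographic digest of STRING."
  (sxhash string))

(defun link-signal (payload signature predecessor)
  "Return a signal plist whose :prev-digest commits to PREDECESSOR's signature.
PREDECESSOR is NIL for the first signal in a chain."
  (list :payload payload
        :signature signature
        :prev-digest (and predecessor
                          (toy-digest (getf predecessor :signature)))))

(defun chain-intact-p (chain)
  "True when every signal in CHAIN (oldest first) matches its predecessor."
  (loop for (previous current) on chain
        while current
        always (eql (getf current :prev-digest)
                    (toy-digest (getf previous :signature)))))
```

Because each link depends on the signature before it, removing or replacing an intermediate signal breaks the check for every later step, which is what makes the incident trail auditable.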

The human can configure which layers are active per signal class: AUTH_LAYERS_DEFAULT=crypto,deterministic,probabilistic, AUTH_LAYERS_SENSOR=crypto,sensory,deterministic, AUTH_LAYERS_CRON=crypto.
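
Parsing one of these values into a list of layer keywords might look like the sketch below. The function name parse-auth-layers and the choice to signal an error on unknown names are assumptions; how the values reach the daemon (environment, config file) is not specified in the source.

```lisp
;; Sketch only: function name and error behavior are assumptions.
(defparameter *known-layers* '(:crypto :sensory :deterministic :probabilistic)
  "The four layer names used in the configuration examples.")

(defun parse-auth-layers (value)
  "Parse a comma-separated layer list such as \"crypto,deterministic\"
into a list of keywords, preserving order. Signal an error on an unknown
layer name instead of silently ignoring it."
  (loop with start = 0
        for end = (position #\, value :start start)
        for name = (string-trim " " (subseq value start end))
        ;; STRING-EQUAL compares the text with each keyword's name, ignoring case.
        for layer = (find name *known-layers* :test #'string-equal)
        do (unless layer
             (error "Unknown authentication layer: ~S" name))
        collect layer
        while end
        do (setf start (1+ end))))
```

Failing loudly on a typo such as "cryto" avoids quietly running with fewer authentication layers than the human intended.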

For full implementation detail, see the Phase 0b spec in ROADMAP.org v0.12.0.
