hermes-brain/projects/passepartout/architecture/design/the-symbolic-engine.org at 2ee60b5b4a9f96a18f9958695f2350ffd00c2d5e

Hermes 2ee60b5b4a Split design-decisions.org into design/ subfolder (10 files, one per section)

2026-06-04 20:16:02 +00:00

The Symbolic Engine

---
title: The Symbolic Engine
type: reference
tags: :passepartout:architecture:
---

The Five Architecture Options

The symbolic engine must relate to the human memex. The relationship is not obvious because knowledge lives in two incompatible forms: natural language prose (what the human reads and writes) and formal facts (what the symbolic engine reasons about). The translation between them is lossy by nature. The architecture is defined by how it handles that lossiness.

Option 1: The Auto-Formalizer

A separate knowledge graph stores symbolic facts. The LLM populates it by extracting triples from unstructured data. The KG becomes co-authoritative with the human prose.

This is the simplest to implement but inherits the dual-representation problem in its most acute form. The KG and the prose can disagree, and the architecture provides no mechanism for resolving disagreements. It also stores knowledge twice — once in the user's Org files, once in the KG — with no guarantee that they stay synchronized.

Option 2: Two Intentionally Separate Memexes

The human memex contains prose: thoughts, diaries, decisions, documentation. The symbolic memex contains formal facts: constraints, rules, relationships, deductions. The archivist bridges between them but does not try to keep them synchronized. They are allowed to diverge because they serve different purposes.

This is philosophically honest — it admits that no lossless translation between natural language and formal logic is possible. But it forces the user to reason about two separate knowledge stores.

Option 3: Tangled Fact Blocks in Org Files

A new block type — #+begin_src knowledge — would contain symbolic facts in a formal language. The tangle mechanism would load these facts into the symbolic engine's in-memory store, just as it loads Lisp code into the SBCL image.

This is aesthetically appealing because it unifies the format. One toolchain, one version control system, one Merkle tree. But the block language itself IS the knowledge representation language, and that language is the ontology we have not yet defined.

Option 4: One Memex, Two Indices

The prose remains in human language in Org files. The prose is always the ground truth. Two indices sit on top of the prose as derived views:

- The neural index uses vector embeddings to enable semantic search. The LLM navigates the prose through embedding space, retrieving relevant headings.
- The symbolic index stores formal assertions about what the prose says (predicates, relations, constraints), each grounded to a specific heading or block in the Org file.

Each index serves its own side of the machine. They do not need to understand each other's representations. They only need to agree on which heading or block they are referring to. Because the prose is always the ground truth, the symbolic index can be thrown away and rebuilt from scratch if it becomes corrupted or stale. No information is lost — only the extracted assertions.
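The rebuildability claim can be illustrated with a minimal sketch. Python is used here only as neutral notation (the system itself is Lisp). The heading IDs, the `extract_assertions` function, and its toy extraction rule are hypothetical stand-ins, not part of Passepartout.

```python
# Hypothetical sketch: prose is the ground truth; the symbolic index is derived.
from typing import Callable, Dict, List, Tuple

Assertion = Tuple[str, str, str, str]   # (entity, relation, value, grounding)

# Prose keyed by heading ID (an assumed addressing scheme).
PROSE: Dict[str, str] = {
    "h-001": "The .env file holds API keys.",
    "h-002": "Nabokov lectured on Kafka.",
}

def extract_assertions(heading_id: str, text: str) -> List[Assertion]:
    """Toy deterministic extractor standing in for the real derivation step."""
    if ".env" in text:
        return [(".env", "member-of-class", "secret-files", heading_id)]
    return []

def rebuild_index(prose: Dict[str, str],
                  extractor: Callable[[str, str], List[Assertion]]) -> List[Assertion]:
    """Derive the whole symbolic index from the prose, from scratch."""
    index: List[Assertion] = []
    for heading_id in sorted(prose):
        index.extend(extractor(heading_id, prose[heading_id]))
    return index

index = rebuild_index(PROSE, extract_assertions)
index.clear()                                     # simulate a corrupted or stale index
index = rebuild_index(PROSE, extract_assertions)  # rebuilt from prose alone
print(index)
```

Every assertion carries the heading ID it was derived from, which is the only thing the two indices would need to agree on.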

Option 5: Ephemeral Symbolic Facts

No persistence, no serialization format, no knowledge graph stored on disk. VivaceGraph exists in memory during the session. Screamer derives facts from the prose as needed. When the session ends, the facts are discarded and re-derived on the next start.

This punts the ontological design problem entirely. You never have to decide on a serialization format because you never serialize. The cost is compute (re-derivation on every restart) and the inability to accumulate facts across sessions. But it is the correct first step — a way to learn what kinds of facts are actually useful before committing to a storage format.

The Chosen Path: Option 4, Starting with Option 5

The one-memex-two-indices architecture (Option 4) is the correct long-term architecture. The prose is the ground truth. The symbolic index is a derived view that can be rebuilt. The neural index handles what the symbolic index cannot — semantic search, fuzzy matching, associative leaps.

But committing to a persistence format before knowing what facts are useful is premature. The practical path starts with Option 5 (ephemeral facts) as the Phase 1-4 implementation, then graduates to Option 4 with VivaceGraph persistence in Phase 5 when the fact language has been battle-tested through months of gate outcomes, Screamer deductions, and LLM proposals.

Why the dual index is permanent, not transitional

In the coding domain, there is an aspiration that the symbolic index could eventually capture enough of the prose's propositional content to become a complete representation — the "flip" where the symbolic engine reverses the flow. But for the broader memex (literature, poetry, personal reflection, daily logs), completeness is neither possible nor desirable. You cannot formalize what makes a poem beautiful. You cannot extract a triple that captures the emotional weight of a diary entry. The neural index will always be the gateway to the full richness of the prose. The symbolic index handles what can be mechanically verified: citations, entities, temporal order, contradictions, provenance. The division of labor between the two indices is permanent because the domains they serve are fundamentally different kinds of knowledge.

Ephemeral First, Persistent Later

The architecture note's Option 5 (ephemeral facts, no disk persistence) is the correct first implementation. Three reasons:

1. The fact language is unproven. Triples with provenance and grounding are a hypothesis. They may be too simple for some domains and too complex for others. Committing to a serialization format before knowing what is useful is premature.
2. The ontology is emergent. Categories are created on first use. What proves useful stays; what doesn't fades. A persistent format would need a migration story every time the category structure changes. Ephemeral storage avoids this entirely: the facts are re-derived on each session start using the current (evolved) ontology.
3. Rebuildability is the safety net. Because all facts have a :grounding to an Org heading, and gate-outcome facts are regenerated from the gate stack on every load, the entire symbolic index can be thrown away and rebuilt from scratch. The cost is compute, not data. This is the practical realization of "the prose is always the ground truth."

The transition to persistence (Phase 5: VivaceGraph) happens when two conditions are met: the fact language has stabilized through use, and the accumulated deductions across sessions provide value that justifies the serialization cost.

The Gate-to-Fact Bootstrap — Extracting the First Ontology from Code

The Dispatcher gate stack already encodes an implicit ontology. Every gate vector asserts the existence of a category of things:

- Gate vector 2 asserts there exists a class of files called secrets.
- Gate vector 7 asserts there exists a class of commands called destructive.
- Gate vector 8 asserts there exists a class of domains called trusted.
- The self-build boundary asserts there exists a class of files called core-harness and a class called skills.

These claims are currently expressed as code — Lisp functions that pattern-match against file paths, shell commands, and URLs. They are not facts the symbolic engine can query, derive from, or check for consistency. But they can be made explicit.

The bootstrap makes every gate a set of initial symbolic facts: (:file ".env" :member-of-class :secret-files :source gate-vector-2), (:command "rm -rf /" :classified-as :catastrophic :source gate-vector-7), (:domain "api.telegram.org" :classified-as :trusted :source gate-vector-8).

This produces 50-70 entity classes directly from the existing gate stack, without any new infrastructure:

| Source                         | Count | Example categories                                      |
|--------------------------------+-------+---------------------------------------------------------|
| `dispatcher-protected-paths`   | 11    | :secret-config-file, :ssh-key-file, :gpg-key-file       |
| `dispatcher-shell-blocked`     | 8     | :catastrophic-command, :injection-pattern               |
| `dispatcher-network-whitelist` | 2     | :trusted-domain, :untrusted-domain                      |
| Self-build boundary            | 2     | :core-harness-file, :skill-file                         |
| Privacy tags                   | 3     | :private-content, :financial-content                    |
| Permission table               | 3     | :read-only-tool, :write-tool, :eval-tool                |
| Cognitive tools                | 6     | :code-search-tool, :file-io-tool, :shell-tool           |
| Relations (all gates)          | ~15   | :member-of-class, :classified-as, :depends-on           |
| Qualities                      | ~8    | :catastrophic, :dangerous, :moderate, :harmless         |
| Provenance sources             | 4     | :gate-outcome, :human-authored, :deduced, :llm-proposed |

This is the seed. It gives Screamer a domain to reason about immediately, without any LLM involvement. It proves the pattern — code becomes facts, facts enable reasoning — at the cost of approximately 30 lines of Lisp.
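The bootstrap pattern (code becomes facts) can be sketched in a few lines. This is an illustration in Python rather than the Lisp the project uses, and the `GATE_TABLES` layout, the `id_rsa` entry, and the function names are assumptions made for the example; the real Dispatcher tables may look quite different.

```python
# Hypothetical gate tables: source -> (kind, relation, value, patterns).
# The real Dispatcher variables and their contents may differ.
GATE_TABLES = {
    "gate-vector-2": ("file", "member-of-class", "secret-files", [".env", "id_rsa"]),
    "gate-vector-7": ("command", "classified-as", "catastrophic", ["rm -rf /"]),
    "gate-vector-8": ("domain", "classified-as", "trusted", ["api.telegram.org"]),
}

def bootstrap_facts(tables):
    """Turn every gate pattern into an explicit fact with gate provenance."""
    facts = []
    for source, (kind, relation, value, patterns) in tables.items():
        for pattern in patterns:
            facts.append({
                "kind": kind, "entity": pattern,
                "relation": relation, "value": value,
                "provenance": "gate-outcome", "source": source,
            })
    return facts

def query(facts, relation, value):
    """Return the entities holding the given relation/value, sorted."""
    return sorted(f["entity"] for f in facts
                  if f["relation"] == relation and f["value"] == value)

facts = bootstrap_facts(GATE_TABLES)
print(query(facts, "member-of-class", "secret-files"))  # ['.env', 'id_rsa']
```

Once the patterns are facts, the question "which files are secret?" becomes a query over data instead of a walk through pattern-matching code.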

The LLM as Proposer — Verified Extraction

The LLM cannot be trusted to populate the symbolic index directly. Its outputs are sampled, not proven. A probabilistic extraction feeding a deterministic engine defeats the purpose of being deterministic.

But the LLM is still useful. It can surface facts that are obvious to a human reader of prose but would take the symbolic engine many deduction steps to reach independently. The solution is to demote the LLM from extractor to proposer:

1. The archivist reads a prose heading.
2. The LLM proposes candidate triples.
3. Screamer checks each triple for consistency against the existing fact store.
4. Only consistent triples are admitted to the symbolic index, flagged with :provenance :llm-proposed and grounded to the source heading.

The LLM might hallucinate facts that don't correspond to the prose. It might extract facts that contradict existing knowledge. It might produce syntactically malformed triples. None of these failures contaminate the symbolic index because proposals are not admitted automatically. The admission gate (Screamer) is deterministic.

This is the core architecture pattern. Everything else — the entity classes, the deduction engine, the persistence layer — follows from this single design decision: the LLM proposes; the symbolic engine decides whether to accept.
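A minimal sketch of the proposer pattern, written in Python for brevity. The consistency check here is a deliberately trivial stand-in for Screamer (it only catches malformed triples and direct contradictions on relations assumed to be single-valued), and every name in it is hypothetical.

```python
# Sketch of "the LLM proposes, a deterministic gate admits". All names are hypothetical.
SINGULAR = {"classified-as"}   # relations assumed to allow only one value

def well_formed(triple):
    """A proposal must be a 3-tuple of non-empty strings."""
    return (isinstance(triple, tuple) and len(triple) == 3
            and all(isinstance(x, str) and x for x in triple))

def admit(store, proposals, grounding):
    """Admit only well-formed proposals that do not contradict the store."""
    admitted, rejected = [], []
    for triple in proposals:
        if not well_formed(triple):
            rejected.append((triple, "malformed"))
            continue
        entity, relation, value = triple
        existing = store.get((entity, relation), set())
        if relation in SINGULAR and existing and value not in existing:
            rejected.append((triple, "contradicts existing fact"))
            continue
        store.setdefault((entity, relation), set()).add(value)
        admitted.append({"triple": triple, "provenance": "llm-proposed",
                         "grounding": grounding})
    return admitted, rejected

store = {("rm -rf /", "classified-as"): {"catastrophic"}}
proposals = [                                      # stand-in for LLM output
    ("nabokov", "lectures-on", "kafka"),
    ("rm -rf /", "classified-as", "harmless"),     # contradicts the store
    ("kafka", "wrote"),                            # malformed
]
admitted, rejected = admit(store, proposals, grounding="h-042")
```

The rejected proposals never touch `store`, so a bad extraction costs a wasted proposal rather than a corrupted index.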

Cardinality Policies — Singular, Dual, and Plural Facts

Classical logic requires consistency. A contradiction implies everything (ex contradictione quodlibet). Screamer, as a constraint solver, also requires consistency — a contradictory constraint set has no solutions. But the symbolic engine operates across domains where the meaning of contradiction is fundamentally different. The correct question is not "is this consistent?" but "what cardinality of truth does this domain support?"

Time is not a policy. It is a universal dimension that applies equally to every fact, regardless of cardinality. All facts carry :timestamp and :parent-id fields. Every fact has a version history. Every fact lives in a Merkle chain that captures how it changed. The cardinality policy only governs what happens at a given logical moment when two values coexist for the same entity and relation.

Policy :singular — One Active Value, One Version Chain

The active set contains exactly one value for (:entity :relation) at a time. When a new value is asserted for the same pair, the old value is not rejected. It is superseded: moved into the version history, linked to the new leaf by :parent-id, and retained permanently. The active value is the leaf of the Merkle chain.

"I used to think rm -rf / was safe. Now I know it is catastrophic." Both facts exist. Both are true — the first at 2024-06-01, the second at 2025-03-15. The chain captures the evolution. The :singular policy means there is one truth now, not that there was only ever one truth.

Use for: security classifications, file system state, gate rules, code correctness, deterministic safety constraints — domains that converge on one answer, evolving over time.

Policy :dual — Exactly Two Values, in Explicit Tension

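The active set contains exactly two values for (:entity :relation). Both are simultaneously true. Both carry independent version histories. A third value is never admitted into the pair, because the domain is binary by nature; asserting one prompts the user to promote the fact or demote an interpretation.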

Some contradictions are productive precisely because they are binary. Thesis and antithesis. Love and resentment. Wave and particle. A poem's two incompatible readings. The symbolic index holds both, cross-referenced as complementary rather than conflicting. The user is not asked to resolve the tension. The tension is the fact.

The system can reason about cardinality transitions: a :dual fact that has one interpretation superseded should collapse to :singular. A :dual that has a third interpretation asserted should prompt the user: "Promote to :plural or demote one interpretation?"

Use for: productive binary tensions, complementary opposites, dialectical pairs, any domain where two answers are both true and their tension is meaningful.

Policy :plural — N Active Values, Open Set

The active set contains any number of values for (:entity :relation). Each value has independent provenance and its own version history. Queries return all active values with provenance display. Contradictions are flagged as cross-references between values — information, not error.

A :plural fact where all but one value are superseded should collapse to :singular. A :plural fact where the set reduces to two active values — and the remaining two are complementary — should collapse to :dual.

Use for: literary interpretation, scientific hypotheses, personal beliefs held at different times (when tension is multi-faceted rather than binary), multi-source factual disagreement, open-ended exploration.
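The collapse rules for :dual and :plural facts reduce to a small decision function. The sketch below is Python used as pseudocode for the Lisp system; the function name and the `complementary` flag (which something else would have to determine) are assumptions.

```python
def collapse_target(policy, active_count, complementary=False):
    """Return the policy a fact should hold after its active set shrinks."""
    if policy == "dual" and active_count == 1:
        return "singular"          # one interpretation was superseded
    if policy == "plural":
        if active_count == 1:
            return "singular"      # all but one value superseded
        if active_count == 2 and complementary:
            return "dual"          # the remaining two are in productive tension
    return policy                  # no transition applies
```

For example, `collapse_target("plural", 2, complementary=True)` yields `"dual"`, while the same call without the flag leaves the fact `"plural"`.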

Policy Assignment

The policy is assigned when a category is defined. New categories default to :plural (safe — never loses information). Core security categories are explicitly :singular. The gate stack's bootstrapped facts are :singular because they describe the actual filesystem, which is physically singular. Categories for dialectical or complementary domains are explicitly :dual.

The Screamer admission gate applies the cardinality policy at the active set:

- :singular + same value, later timestamp → supersede old, chain new as leaf.
- :singular + different value, same timestamp → reject (contradiction). Human resolves.
- :singular + different value, later timestamp → supersede old, chain new as leaf. History preserved.
- :dual + first value → admit; second value → admit, cross-reference as complementary; third value → prompt.
- :plural + any value → admit. Active count transitions trigger collapse checks.
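The admission rules above can be sketched as one function over a per-pair slot. This is a Python illustration, not the Screamer implementation; `Slot`, `assert_value`, and the `"ignored"` outcome (for cases the rules do not cover, such as an earlier timestamp) are assumptions. Timestamps are ISO date strings, which compare correctly as text.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    """Active values plus superseded history for one (entity, relation) pair."""
    policy: str                                   # "singular" | "dual" | "plural"
    active: list = field(default_factory=list)    # [(value, timestamp)]
    history: list = field(default_factory=list)   # superseded versions

def assert_value(slot, value, timestamp):
    """Apply the cardinality policy to one incoming value."""
    if slot.policy == "singular":
        if not slot.active:
            slot.active.append((value, timestamp))
            return "admitted"
        old_value, old_ts = slot.active[0]
        if timestamp > old_ts:
            slot.history.append(slot.active[0])   # history is preserved
            slot.active[0] = (value, timestamp)
            return "superseded"
        if value != old_value and timestamp == old_ts:
            return "rejected"                     # contradiction; human resolves
        return "ignored"                          # assumption: not covered by the rules
    if slot.policy == "dual":
        if len(slot.active) < 2:
            slot.active.append((value, timestamp))
            return "admitted"
        return "prompt"                           # promote to plural or demote one?
    slot.active.append((value, timestamp))        # plural: open set
    return "admitted"

slot = Slot("singular")
results = [assert_value(slot, "safe", "2024-06-01"),
           assert_value(slot, "catastrophic", "2025-03-15"),
           assert_value(slot, "harmless", "2025-03-15")]
print(results)  # ['admitted', 'superseded', 'rejected']
```

After the three calls the slot holds one active value (`catastrophic`) and one historical value (`safe`), matching the `rm -rf /` example earlier in this section.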

Why This Matters for the Broader Memex

In the coding domain, contradiction is rare, resolvable, and usually temporal (a rule changed). In the broader memex, contradiction is the product, not the error. Your poetry analysis contradicts your last diary entry. Your reading of Pale Fire changed between 2023 and 2025. Wikidata says Mount Everest is 8848m; DBpedia says 8849m. You love this person AND you resent them.

The symbolic engine's job is not to decide which is right. It is to surface the tension with provenance — "these three sources disagree; here is the chain for each" for plural facts, or "you hold these two positions in tension" for dual facts, or "you believed X until Tuesday, then Y" for singular facts that evolved. The cardinality policy names the structure of the tension. The Merkle chain provides the history of each position.

How Categories Grow — The Organic Ontology

Whitehead and Russell's Principia Mathematica took over 300 pages to lay its logical foundations before it could reach the proposition that one plus one equals two. Every category introduced carried a burden of justification. Every inference rule had to be demonstrated sound. This is the classical approach to ontology: define everything upfront, exhaustively, formally.

Passepartout cannot afford this and does not need it. Its domain is bounded (software engineering, personal knowledge, literary engagement, daily life) and its ontology grows from the system's own operation:

- The gate stack seeds the ontology. Every gate vector is an implicit claim about a category of things. The bootstrap makes these claims explicit. The seed is 50-70 entity classes with no human authoring required, mechanically extracted from existing code.
- New gate vectors add categories directly. As the Dispatcher grows (new shell patterns, new path protections, new tool classifications), the ontology grows with it. Every new pattern becomes a fact on skill load.
- Screamer generalizes from gate outcomes. After 37 shell commands are blocked as destructive, Screamer extracts structural commonalities: "commands writing to block devices," "commands recursively deleting outside the workspace." These become new subcategories that didn't exist in the original gate patterns. The ontology deepens through observation.
- The archivist proposes from prose. The archivist reads a diary entry about a book: "Nabokov's lectures on Kafka." The LLM proposes (:entity :nabokov :relation :lectures-on :value :kafka). Screamer checks consistency and admits the triple. The categories :author, :lectures-on, and :subject didn't exist before; they are created on first use. This is the primary growth mechanism for the broader memex.
- The human declares explicitly. The human writes a declarative fact directly into the symbolic index. No extraction step. No LLM involvement. The fact is admitted with :provenance :human-authored, the highest trust level.
- Temporal patterns crystallize into categories. Every Sunday the memex gets a retrospective heading. Every Monday a planning heading. The time-awareness system observes the periodicity and proposes :weekly-retrospective and :weekly-planning as fact types. Screamer verifies.
- Cross-domain overlap produces parent categories. Screamer notices that :secret-files (from the gate stack) and :private-content (from privacy tags) share members: .env is both a secret file and private content. It proposes :sensitive-material as a parent with both as children. Taxonomy building happens automatically through overlap detection.
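The overlap-detection mechanism in the last item can be sketched as a pairwise intersection over class memberships. This is a Python illustration with hypothetical names and sample members; it only finds candidate sibling pairs, and naming the parent (for instance :sensitive-material) is left to whatever proposes the category.

```python
from itertools import combinations

def propose_parents(classes, min_shared=1):
    """Propose a shared parent for every pair of classes with common members."""
    proposals = []
    for (a, members_a), (b, members_b) in combinations(sorted(classes.items()), 2):
        shared = members_a & members_b
        if len(shared) >= min_shared:
            proposals.append({"children": (a, b), "shared": sorted(shared)})
    return proposals

classes = {                     # hypothetical memberships
    "secret-files": {".env", "id_rsa"},
    "private-content": {".env", "diary.org"},
    "trusted-domain": {"api.telegram.org"},
}
proposals = propose_parents(classes)
print(proposals)
```

Raising `min_shared` is one possible way to keep a single coincidental member from triggering a new parent category.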

Growth is self-limiting by design

Not every conceivable category is added. The system prunes through use:

- New categories are admitted only through Screamer's consistency check. A category that contradicts an existing classification is rejected.
- A category that never gets queried costs nothing (a hash table entry) but produces no value. It fades from use naturally.
- Overly fine-grained categories are rejected because they are redundant with the wildcard pattern that already covers them.
- Overly broad categories that subsume meaningful distinctions produce contradictions when Screamer tries to apply existing rules. Rejected.

The system converges on a useful granularity through use, not through upfront design. The gate stack provides the seed. Gate outcomes, prose extraction, deduction, and human authoring grow the shoots. Screamer prunes contradictions. The ontology is a garden, not a building.

Ontology Versioning — How Worldviews Change Without Losing Perspective

Ontology refactoring is not a schema migration. It is a worldview change. When you split :secret-file into :crypto-secret and :plaintext-secret, you are not renaming columns. You are reclassifying what a file is — and every Screamer deduction that crossed the old category boundary now means something different under the new distinction.

The system preserves all worldviews. It does not overwrite the past with the present.

The category hierarchy is itself a Merkle tree. Every entity class definition carries a hash of its superclasses, its cardinality policy, its associated relations, and its description. The aggregate hash of all active class definitions is the :ontology-version — a Merkle root of the current worldview.

Every fact — every triple, every deduction, every gate outcome — stores its :ontology-version at the time of assertion. This is a single field, 64 hex characters. The cost is negligible. The implication is profound.
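One way to compute such a version hash, sketched in Python. The class-definition fields follow the list in the text, but the JSON canonicalization, the flat (non-tree) aggregation, and the function names are assumptions made to keep the example short.

```python
import hashlib
import json

def class_hash(definition):
    """Hash one class definition over a canonical JSON encoding of its fields."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def ontology_version(classes):
    """Aggregate hash over all active class definitions, independent of dict order."""
    leaves = sorted(class_hash({"name": name, **d}) for name, d in classes.items())
    return hashlib.sha256("".join(leaves).encode("ascii")).hexdigest()

def define(superclasses, description):
    """Helper that builds a definition with the fields named in the text."""
    return {"superclasses": superclasses, "policy": "singular",
            "relations": ["member-of-class"], "description": description}

before = {"secret-file": define([], "Files holding secrets")}
after = dict(before)
after["crypto-secret"] = define(["secret-file"], "Key material")
after["plaintext-secret"] = define(["secret-file"], "Readable secrets")

v_before = ontology_version(before)
v_after = ontology_version(after)
```

Splitting :secret-file into two subclasses yields a different 64-character version, so any fact stamped with `v_before` can be recognized as asserted under the older worldview.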

When categories change, the system does not run a batch UPDATE. It re-verifies:

1. A new category hierarchy produces a new :ontology-version hash.
2. Facts carrying the old hash are flagged for re-verification.
3. On heartbeat or manual trigger, Screamer re-evaluates each flagged fact against the new category definitions. The old justification chain is preserved alongside the new outcome.
4. Each re-evaluated fact receives a status: :survived (still valid), :incoherent (premises don't translate, flagged for human review), or :reclassified (valid but under a different classification).
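The re-verification pass can be sketched as follows. This is Python standing in for the Lisp system, and the lookup against `known_classes` and `reclassify_map` is a crude placeholder for Screamer's actual re-evaluation; all names and sample facts are hypothetical.

```python
def reverify(facts, old_version, new_version, known_classes, reclassify_map):
    """Re-evaluate facts asserted under old_version against the new ontology."""
    for fact in facts:
        if fact["ontology_version"] != old_version:
            continue                              # not flagged
        cls = fact["value"]
        if cls in known_classes:
            status, new_value = "survived", cls
        elif cls in reclassify_map:
            status, new_value = "reclassified", reclassify_map[cls]
        else:
            status, new_value = "incoherent", None   # needs human review
        # Append the new outcome; the original assertion is left untouched.
        fact["reverifications"] = fact.get("reverifications", []) + [
            {"under": new_version, "status": status, "value": new_value}]
    return facts

facts = [
    {"entity": ".env", "value": "secret-file", "ontology_version": "v1"},
    {"entity": "id_rsa", "value": "ssh-key-file", "ontology_version": "v1"},
    {"entity": "notes.org", "value": "misc-file", "ontology_version": "v1"},
    {"entity": "token.txt", "value": "plaintext-secret", "ontology_version": "v2"},
]
reverify(facts, "v1", "v2",
         known_classes={"ssh-key-file", "plaintext-secret", "crypto-secret"},
         reclassify_map={"secret-file": "plaintext-secret"})
statuses = [f.get("reverifications", [{}])[-1].get("status") for f in facts]
```

Because outcomes are appended rather than written over the fact, the value asserted under the old worldview stays queryable after the refactoring.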

The fact-query function accepts an optional :ontology-version parameter. Queries default to the current worldview (:active). Specifying a version returns facts as they were under that worldview. The system can answer questions that no other knowledge tool can: "What did I believe about secrets before I refined my security model?" "How has my reading of Pale Fire evolved across three frameworks?" "Which deductions survived my last ontology refactoring?"

This is not querying a fact. It is querying the history of your own thinking — the fact that you changed your mind, the date you did, the reasoning that held and the reasoning that didn't.

The "Awakening" — Sufficiency Criterion

The symbolic index begins its life as a lossy construct. The initial extraction from prose — LLM proposals verified by Screamer — is built from an uncertain foundation. Some facts are correct. Some are missing. Some are wrong.

But the symbolic engine accumulates non-lossy facts through three independent mechanisms:

1. Gate outcomes: every gate rejection is a fact. No LLM involved. These accumulate at the rate of user interactions.
2. Screamer deductions: new facts derived from existing facts. No LLM involved. These accumulate whenever the fact store crosses a density threshold.
3. Human authoring: the human explicitly declares facts. No LLM involved.

At some point, the non-lossy facts constitute a sufficient foundation that the symbolic engine can reverse the flow: instead of the LLM extracting facts from prose, the symbolic engine reads prose through its own lens — its now-substantial ontology of categories, rules, and constraints — and asserts facts in its own language. The extraction mechanism ceases to be probabilistic and becomes deterministic.

The sufficiency criterion makes this operational: (/ (count-provenance :gate-outcome :human-authored :deduced) total-facts). When this ratio exceeds a configurable threshold (SUFFICIENCY_THRESHOLD, default 0.7), the system considers its foundation sufficient. The archivist switches from "LLM proposes, Screamer verifies" to "Screamer queries existing facts, applies to the new prose, and deduces new facts directly."

The flip is visible to the user: "Symbolic index: 847 facts (73% non-lossy, 12% LLM-proposed, 15% Wikidata). Sufficient foundation: YES."
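The ratio is simple enough to sketch directly. The Python below mirrors the Lisp expression above; the per-provenance counts are made up so that they add up to the 847-fact example, and only the 0.7 default comes from the text.

```python
from collections import Counter

NON_LOSSY = {"gate-outcome", "human-authored", "deduced"}
SUFFICIENCY_THRESHOLD = 0.7    # default named in the text

def sufficiency(provenances, threshold=SUFFICIENCY_THRESHOLD):
    """Return (non-lossy ratio, whether it exceeds the threshold)."""
    counts = Counter(provenances)
    total = sum(counts.values())
    if total == 0:
        return 0.0, False      # an empty store is never sufficient
    ratio = sum(counts[p] for p in NON_LOSSY) / total
    return ratio, ratio > threshold

# Illustrative split of 847 facts (the individual counts are invented).
provenances = (["gate-outcome"] * 400 + ["deduced"] * 150 + ["human-authored"] * 68
               + ["llm-proposed"] * 102 + ["wikidata"] * 127)
ratio, sufficient = sufficiency(provenances)
print(f"{len(provenances)} facts, {ratio:.0%} non-lossy, sufficient: {sufficient}")
```

With these counts the non-lossy share is 618 of 847, about 73%, which clears the 0.7 threshold.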

The flip does not mean "complete." In the broader memex, completeness is neither possible nor desirable. The awakening means "deterministic enough to be trustworthy," not "comprehensive enough to be self-sufficient." The neural index remains the gateway to the full richness of prose. The symbolic index handles what can be mechanically verified. The boundary is permanent.

Merkle DAG for Version History

Every fact is versioned. Every (:entity :relation) pair forms its own independent chain in a Merkle DAG. This is not new infrastructure — it is a new occupant of Passepartout's existing Merkle-tree memory system (v0.2.0).

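When a fact supersedes its predecessor, the new version's hash is computed as SHA-256(value || provenance || timestamp || parent-hash || grounding). The parent-hash pointer forms the chain. Tampering with any version changes its hash, which breaks every downstream reference. The history is therefore tamper-evident by construction: it cannot be silently rewritten, although detecting a rewrite still requires a trusted copy of the chain head or snapshot root.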

Facts about (.env :member-of-class) form one chain. Facts about (:nabokov :wrote) form another. They evolve independently. They share no ancestry. This is a DAG, not a single list — inserting a fact is O(1) per chain. Changing a fact about .env does not require rehashing the literary index.
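A per-pair hash chain of this kind can be sketched in Python with `hashlib`. The length prefix on each field is an addition made here so that field boundaries are unambiguous (the text specifies plain concatenation), and the function names and sample values are hypothetical.

```python
import hashlib

def fact_hash(value, provenance, timestamp, parent_hash, grounding):
    """SHA-256 over the five fields, each length-prefixed (an assumption)."""
    h = hashlib.sha256()
    for part in (value, provenance, timestamp, parent_hash or "", grounding):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

def append_version(chain, value, provenance, timestamp, grounding):
    """Append a new leaf whose hash covers its parent's hash."""
    parent = chain[-1]["hash"] if chain else None
    chain.append({"value": value, "provenance": provenance, "timestamp": timestamp,
                  "grounding": grounding, "parent": parent,
                  "hash": fact_hash(value, provenance, timestamp, parent, grounding)})
    return chain

def verify(chain):
    """Recompute every hash and parent link from the root forward."""
    parent = None
    for v in chain:
        expected = fact_hash(v["value"], v["provenance"], v["timestamp"],
                             parent, v["grounding"])
        if v["parent"] != parent or v["hash"] != expected:
            return False
        parent = v["hash"]
    return True

chain = []   # the chain for one (entity, relation) pair
append_version(chain, "safe", "human-authored", "2024-06-01", "h-007")
append_version(chain, "catastrophic", "gate-outcome", "2025-03-15", "h-019")
ok_before = verify(chain)
chain[0]["value"] = "harmless"       # tamper with an old version
ok_after = verify(chain)
```

Each (entity, relation) pair would own its own list like `chain`, so appending to one never touches the hashes of another.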

:dual and :plural facts cross-reference each other via edges (:complements, :contradicts) but these are semantic relationships, not parent chains. Each value has its own ancestor chain. The cross-reference edges form a web; the parent chains form a spine.

Passepartout already snapshots the Merkle root over all memory objects. Adding the fact store to the snapshot is a registration, not a new mechanism. Rolling back the snapshot restores the entire fact state — all chains, all cross-references, all cardinalities — to that point in time.

Abstract Fact Store Interface — Modular by Design

The fact store is accessed through an abstract API. The Merkle DAG (or any future backing store) is an implementation behind this interface, not a dependency that code throughout the system calls directly.

fact-assert    :: fact → store → (:admitted | :rejected | :flagged)
fact-query     :: (entity &key relation policy ontology-version) → active-value-or-values
fact-history   :: (entity relation) → ordered chain of versioned facts
fact-snapshot  :: () → root-hash
fact-rollback  :: root-hash → store

Implementations behind the interface:

- Phase 1-4: ephemeral hash table with :timestamp and :parent-id pointers. No cryptographic hashing. No persistence.
- Phase 5: VivaceGraph + Merkle memory-object wrapper. Content-addressed, persistent, tamper-evident.

Future implementations that satisfy the same interface — an append-only write-ahead log, an immutable B-tree, a content-addressed triple store — can replace the backing store without changing any consumer. The archivist, Screamer, ACL2, and the planner call fact-assert and fact-query, not Merkle struct accessors or VivaceGraph traversal syntax.
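The interface-plus-backend split can be sketched with an abstract base class and one ephemeral implementation. This is a Python illustration of the shape only: it implements :singular semantics alone, returns an integer snapshot ID where the real interface returns a root hash, and every class and method name is an assumption.

```python
from abc import ABC, abstractmethod
import copy

class FactStore(ABC):
    """Consumers depend on these five operations, never on the backend."""
    @abstractmethod
    def fact_assert(self, entity, relation, value, timestamp): ...
    @abstractmethod
    def fact_query(self, entity, relation): ...
    @abstractmethod
    def fact_history(self, entity, relation): ...
    @abstractmethod
    def fact_snapshot(self): ...
    @abstractmethod
    def fact_rollback(self, snapshot_id): ...

class EphemeralStore(FactStore):
    """Phase 1-4 style backend: a dict with parent pointers, no hashing, no disk."""
    def __init__(self):
        self._chains = {}       # (entity, relation) -> versions, newest last
        self._snapshots = []

    def fact_assert(self, entity, relation, value, timestamp):
        chain = self._chains.setdefault((entity, relation), [])
        parent_id = len(chain) - 1 if chain else None
        chain.append({"value": value, "timestamp": timestamp, "parent_id": parent_id})
        return "admitted"

    def fact_query(self, entity, relation):
        chain = self._chains.get((entity, relation))
        return chain[-1]["value"] if chain else None

    def fact_history(self, entity, relation):
        return list(self._chains.get((entity, relation), []))

    def fact_snapshot(self):
        self._snapshots.append(copy.deepcopy(self._chains))
        return len(self._snapshots) - 1

    def fact_rollback(self, snapshot_id):
        self._chains = copy.deepcopy(self._snapshots[snapshot_id])

store = EphemeralStore()
store.fact_assert(".env", "member-of-class", "secret-files", "2024-06-01")
snap = store.fact_snapshot()
store.fact_assert(".env", "member-of-class", "private-content", "2025-03-15")
```

A persistent backend would subclass `FactStore` the same way, which is what would let one test suite run against both implementations.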

This is not speculative modularity. The two-implementation migration (Phase 1-4 hash table → Phase 5 VivaceGraph + Merkle) is in the roadmap. If the interface leaks implementation details, the migration breaks. The interface must be designed, tested against both backends, and committed before Phase 1 ships.

Knowledge Graph Type Hierarchy — Structural Anti-Self-Reference (v3.0.0)

The same type-theoretic principle that governs the gate stack can be applied to the knowledge graph itself. When VivaceGraph ships (v3.0.0), every entity carries a :pm-type-level metadata field. Queries cannot return entities of the same level as the querying function. Self-referential knowledge becomes structurally impossible: no entity can define its own type level.

The KG query layer enforces this at the Prolog level, not through runtime checks. This is the same idea as the type-level gates (see Safety section above), but applied to knowledge rather than actions. The dispatcher prevents self-referential actions; the KG prevents self-referential facts.
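The level rule itself can be illustrated with a small filter. Note the caveats: this Python sketch is a runtime check, whereas the design calls for enforcement inside the query layer, and restricting results to strictly lower levels (rather than only excluding the caller's own level) is an assumption borrowed from type hierarchies in general. The entity names are hypothetical.

```python
def level_query(entities, caller_level, predicate=lambda e: True):
    """Return only entities strictly below the caller's type level."""
    return [e for e in entities
            if e["pm_type_level"] < caller_level and predicate(e)]

entities = [
    {"name": ".env", "pm_type_level": 0},            # a plain entity
    {"name": "secret-files", "pm_type_level": 1},    # a class of entities
    {"name": "entity-class", "pm_type_level": 2},    # a class of classes
]
visible = [e["name"] for e in level_query(entities, caller_level=1)]
print(visible)  # ['.env']
```

A function operating at level 1 can see plain entities but cannot retrieve anything at its own level, so it has no way to return or redefine itself.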

For the full philosophical treatment, see the Whitehead analysis in the Validation section below.
