:PROPERTIES:
:CREATED:  [2026-05-27 Wed]
:ID:       be9bccc7-5adf-4d0d-8ee4-8855892189bf
:END:
#+title: Neurosymbolic Loop Architectures
#+filetags: :passepartout:neurosymbolic:research:verification:architectures:

Taxonomy of loop architectures that combine neural components (LLMs, neural program synthesis) with symbolic components (interpreters, theorem provers, constraint solvers). Each architecture is defined by its answer to a single question: *who verifies whom, and what gets verified?*

* The Six Architectures

** 1. Neural-Guided Program Synthesis (DreamCoder et al.)

Path: generate → execute → filter → train.

The neural network proposes candidate programs. The symbolic executor runs them. The neural network learns from the ones that succeeded. The symbolic side is a ground-truth interpreter (a Lisp evaluator, a lambda calculus reducer), but it only answers "does this program run and produce the expected outputs on the test cases?" It does not answer "is this program correct with respect to a specification?" The neural network learns heuristics for what tends to work, but there is no formal correctness guarantee.

- Guarantee :: empirical (the program ran on the test cases the synthesizer generated).
- Bottleneck :: search space (too many candidates, too little signal from a bare pass/fail result per candidate).
- Neural authority :: equal to symbolic (both are parts of a loss function).
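The generate → execute → filter → train path can be sketched in a few lines. This is a toy under stated assumptions, not DreamCoder's algorithm: the three-primitive DSL, the =synthesize= and =train= helpers, and the count-based "weights" standing in for a neural proposer are all hypothetical.

```python
import itertools
from collections import Counter

# Hypothetical toy DSL: each primitive maps an int to an int.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}

def run(program, x):
    """Symbolic side: execute a program (a tuple of primitive names)."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x

def synthesize(examples, weights, max_len=3):
    """Generate candidates, execute them, keep those that match every example."""
    # Proposer stand-in: enumerate primitives in order of learned weight.
    order = sorted(PRIMITIVES, key=lambda name: -weights[name])
    survivors = []
    for length in range(1, max_len + 1):
        for program in itertools.product(order, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                survivors.append(program)
    return survivors

def train(weights, survivors):
    """Training stand-in: upweight primitives that appear in surviving programs."""
    for program in survivors:
        weights.update(program)
    return weights

weights = Counter({name: 1 for name in PRIMITIVES})
examples = [(1, 4), (2, 6), (0, 2)]  # intended behaviour: (x + 1) * 2
survivors = synthesize(examples, weights)
weights = train(weights, survivors)
```

Two programs survive here, =("inc", "dbl")= and =("dbl", "inc", "inc")=. The filter cannot tell them apart, and nothing in the loop says either one is correct beyond the three examples it was run on.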

** 2. Neural Theorem Proving (GPT-f, Thor, COPRA, Magnushammer)

Path: generate tactic → check with prover kernel → feed the outcome back as search guidance.

The neural network proposes proof steps (tactics). The theorem prover kernel (Lean, Isabelle, Metamath) checks them. The neural network learns which tactics work in which proof states. The symbolic side is the final arbiter of correctness: a step counts only if the kernel accepts it, and the kernel is the trusted base of the whole system.

The constraint: the neural network never generates full programs. It proposes the next single move in a proof that a human already set up. The loop terminates when the kernel accepts.

- Guarantee :: formal at the step level (the kernel checks each tactic, but the theorem to prove was written by a human).
- Bottleneck :: tactic prediction quality (the neural net must suggest the right move in a combinatorially large search space).
- Neural authority :: none over verification (the kernel can reject any tactic), and no creative freedom either: the neural net fills narrow holes in an already-structured proof.
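A minimal sketch of the propose-and-check loop, assuming a toy goal language and a toy kernel. None of this is the API of Lean, Isabelle, or Metamath: =kernel_apply=, =propose=, and the three tactic names are hypothetical stand-ins, and the fixed ranking in =propose= replaces a learned model.

```python
def kernel_apply(goal, tactic):
    """Toy trusted kernel: return remaining subgoals, or None if the tactic is rejected."""
    kind = goal[0]
    if tactic == "split" and kind == "and":
        return [goal[1], goal[2]]
    if tactic == "trivial" and kind == "true":
        return []
    if tactic == "refl" and kind == "eq" and goal[1] == goal[2]:
        return []
    return None

def propose(goal):
    """Proposer stand-in: a fixed ranking (a real model would condition on the goal)."""
    return ["refl", "trivial", "split"]

def prove(goal, depth=8):
    """Depth-first search; returns the kernel-accepted tactic script, or None."""
    if depth == 0:
        return None
    for tactic in propose(goal):
        subgoals = kernel_apply(goal, tactic)
        if subgoals is None:
            continue  # the kernel rejected this step; try the next proposal
        script = [tactic]
        for sub in subgoals:
            sub_script = prove(sub, depth - 1)
            if sub_script is None:
                break
            script.extend(sub_script)
        else:
            return script  # every subgoal was closed
    return None

# The theorem statement is written by a human; the loop only fills in the steps.
theorem = ("and", ("eq", 2, 2), ("and", ("true",), ("eq", "a", "a")))
script = prove(theorem)
```

The proposer can rank tactics badly without any effect on soundness: a bad ranking costs search time, but only steps the kernel accepts ever enter the script.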

** 3. Differentiable Program Synthesis (τ-MAML, neural Turing machines)

Path: end-to-end differentiable computation graph.

The program is represented as a continuous computation graph. The symbolic structure is learned jointly with the parameters. There is no separation between neural and symbolic components — they are fused.

- Guarantee :: none (the "symbolic" structure is a regularizer, not a verifier).
- Bottleneck :: gradient signal (complex program structure is hard to learn through backpropagation alone).
- Neural authority :: total (there is no independent symbolic verifier).
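To make the "fused" idea concrete, here is a deliberately tiny sketch: the program is a softmax-weighted mixture of three primitives, and the choice of primitive is learned by plain gradient descent on the logits. It is an illustration under assumptions, not a neural Turing machine or any published method; the primitives, data, learning rate, and step count are all made up for the example.

```python
import math

OPS = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]  # hypothetical primitives
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]             # target behaviour: 2 * x

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_and_grad(z):
    """Mean squared error of the soft program, with its gradient w.r.t. the logits."""
    p = softmax(z)
    loss, g = 0.0, [0.0] * len(z)
    for x, y in DATA:
        outs = [op(x) for op in OPS]
        err = sum(pi * o for pi, o in zip(p, outs)) - y
        loss += err * err / len(DATA)
        for i, o in enumerate(outs):
            g[i] += 2 * err * o / len(DATA)  # dL/dp_i
    avg = sum(pi * gi for pi, gi in zip(p, g))
    grad = [pi * (gi - avg) for pi, gi in zip(p, g)]  # chain rule through softmax
    return loss, grad

z = [0.0, 0.0, 0.0]  # logits over primitives: the "program structure"
for _ in range(2000):
    loss, grad = loss_and_grad(z)
    z = [zi - 0.1 * gi for zi, gi in zip(z, grad)]
```

The logits drift toward the doubling primitive because the loss rewards it, not because anything checked it. The only signal in the loop is the gradient.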

** 4. LLM + Partial Verifier (ReAct, Reflexion, STaR, self-consistency, Tree-of-Thought)

Path: generate → check (partial) → retry or halt.

The LLM generates a response. A symbolic layer extracts structured claims. A verifier — often another LLM, sometimes a constraint solver — checks for consistency. If the verifier flags a problem, the LLM retries.

The verifier is partial. It cannot prove the response is correct. It can only detect certain classes of inconsistency (contradictions, type errors, out-of-range values). The user is asked to trust the unverified parts.

- Guarantee :: partial (some consistency checks pass; no claim of complete correctness).
- Bottleneck :: LLM cost per loop iteration, and the verifier misses anything it was not programmed to check.
- Neural authority :: can override the symbolic verifier by producing output the verifier is not designed to catch.
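The generate → check (partial) → retry path, sketched with a stubbed generator. Everything here is an assumption made for illustration: =partial_verifier= checks only type and range of a hypothetical =age= field, and =make_stub= replays canned responses in place of an LLM call.

```python
from typing import Callable, Optional

def partial_verifier(answer: dict) -> Optional[str]:
    """Checks type and range only; says nothing about whether the value is right."""
    age = answer.get("age")
    if not isinstance(age, int):
        return "age must be an integer"
    if not 0 <= age <= 150:
        return "age out of range"
    return None

def loop(generate: Callable[[Optional[str]], dict], max_retries: int = 3):
    """Retry until the verifier stops complaining or the budget runs out."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        answer = generate(feedback)
        feedback = partial_verifier(answer)
        if feedback is None:
            return answer, attempt
    return None, max_retries

def make_stub(responses):
    """LLM stand-in: replays canned responses and ignores the feedback."""
    it = iter(responses)
    return lambda feedback: next(it)

# Suppose the true age is 40. The third response is wrong but passes every check.
answer, attempts = loop(make_stub([{"age": "forty"}, {"age": 400}, {"age": 41}]))
```

The loop halts on =41= because the value is well-typed and in range. Whether it is the right value lies outside what the verifier was programmed to check, which is the sense in which the user has to trust the unverified parts.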

** 5. Neural Optimization of Symbolic Programs (DSPy)

Path: declarative program skeleton → neural optimizer searches prompt/tool-use space → loss function measures output quality → update optimizer.

The symbolic structure (the program skeleton with typed modules) is written by the human. A neural optimizer searches the space of prompts, few-shot examples, and module compositions to minimize a loss function. The goal is program-level optimization, not correctness.

- Guarantee :: empirical (the loss improved on the evaluation set).
- Bottleneck :: optimization budget (number of program variants the optimizer can try before the user runs out of patience or tokens).
- Neural authority :: symmetric with symbolic (both are searchable parameters).
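The shape of this search can be sketched without any real framework. The snippet below does not use DSPy's API; it is a hypothetical grid search over instruction strings and demo counts, scored against a small dev set by a deterministic =stub_model= that stands in for an LLM call.

```python
import itertools

INSTRUCTIONS = [
    "Answer.",
    "Answer with a single number.",
    "Think, then answer with a single number.",
]
NUM_DEMOS = [0, 1, 2]
DEVSET = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]
ANSWERS = dict(DEVSET)

def stub_model(instruction: str, num_demos: int, question: str) -> str:
    """Deterministic stand-in for an LLM call (an assumption, not a real API)."""
    value = ANSWERS[question]
    if "single number" not in instruction:
        return f"The answer is {value}."   # right value, wrong output format
    if instruction.startswith("Think") and num_demos == 0:
        return "Let me think about it."    # rambles when given no demos
    return value

def score(config) -> float:
    """Exact-match accuracy on the dev set: the quantity being optimized."""
    instruction, num_demos = config
    hits = sum(stub_model(instruction, num_demos, q) == gold for q, gold in DEVSET)
    return hits / len(DEVSET)

# The human-written skeleton is fixed; only the prompt parameters are searched.
best = max(itertools.product(INSTRUCTIONS, NUM_DEMOS), key=score)
```

A perfect score here means the chosen configuration matched three dev examples. It is evidence about that evaluation set and nothing more, which is why the guarantee is empirical.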

** 6. Passepartout: Verified Synthesis Loop

Path: specification + gates → agent synthesizes implementation → ACL2 prover verifies against spec → gate engine checks policy → counterexample feedback to agent → retry or halt.

Contrast with every other architecture:

| Property | Passepartout | All others |
|----------|-------------|------------|
| Generator scope | Full implementations from spec | Tactics (GPT-f), candidate programs (DreamCoder), responses (ReAct) |
| Verification scope | Complete functional correctness | Step-level (GPT-f), execution-only (DreamCoder), partial consistency (ReAct) |
| Verifier authority | Asymmetric: the generator cannot override the prover | None (DreamCoder), symmetric (DSPy), LLM self-evaluates (Reflexion) |
| Feedback from verifier | Counterexamples with structure | Pass/fail (most), tactic-level error (GPT-f) |
| Gate layer | Independent policy verification | None |

- Guarantee :: formal and complete (the prover checks the implementation against the full spec for all inputs).
- Bottleneck :: spec quality (the spec must be complete enough for the prover to work with).
- Neural authority :: none over verification (the prover is the final authority and cannot be overridden), but the neural component has full creative freedom in how to satisfy the spec.
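The control flow of the loop can be sketched as follows. This is a toy under loud assumptions: it does not call ACL2, and =verify= is an exhaustive check over a small finite domain, which is a stand-in for a proof and not a proof. The =gate= rule, the candidate list replacing the agent, and all function names are hypothetical.

```python
from typing import Optional

DOMAIN = range(-50, 51)  # finite stand-in for "all inputs"; a prover would not enumerate

def spec(x: int, y) -> bool:
    """Spec for absolute value: the result is non-negative and equals x or -x."""
    return y >= 0 and y in (x, -x)

def verify(impl) -> Optional[int]:
    """Prover stand-in: return a counterexample input, or None if DOMAIN has none."""
    for x in DOMAIN:
        if not spec(x, impl(x)):
            return x
    return None

def gate(source: str) -> Optional[str]:
    """Policy stand-in: a hypothetical rule that forbids dynamic imports."""
    return "policy: __import__ is forbidden" if "__import__" in source else None

# Agent stand-in: a fixed list of attempts (a real agent would read the feedback log).
CANDIDATES = [
    "lambda x: x",
    "lambda x: __import__('math').fabs(x)",
    "lambda x: x if x >= 0 else -x",
]

def synthesis_loop(candidates):
    log = []
    for source in candidates:
        impl = eval(source)  # trusted literals from CANDIDATES above
        cex = verify(impl)
        if cex is not None:
            log.append((source, f"counterexample: x={cex}, got {impl(cex)}"))
            continue
        violation = gate(source)
        if violation is not None:
            log.append((source, violation))
            continue
        return source, log   # passed both the verifier and the gate
    return None, log

accepted, log = synthesis_loop(CANDIDATES)
```

The second candidate is the interesting one: it satisfies the spec on every checked input and is still rejected, because the gate judges policy independently of correctness. Neither rejection can be overridden by the generator; it can only try again.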

* The Key Design Dimension: Asymmetric Authority

Every existing loop falls into one of three camps:

**Camp A: No independent verifier.** DreamCoder, differentiable synthesis, DSPy, and pure LLM pipelines have no component that can say "this output is wrong" with authority. The system improves empirically; it never proves correctness. This is the largest camp by publication count.

**Camp B: Constrained generator + complete verifier.** GPT-f, Thor, and COPRA have a theorem prover as the authority, but the generator is constrained to single-tactic moves within a human-written proof skeleton. The verifier covers everything the generator produces — but the generator can only produce a tiny part of the solution. The human provides the creative structure.

**Camp C: Unconstrained generator + partial verifier.** ReAct, Reflexion, and STaR give the LLM broad freedom but use a partial verifier that misses large classes of errors. The verifier's incompleteness means the neural component can always override it by producing output the verifier cannot analyze.

**Passepartout occupies a previously empty position, Camp D: unconstrained generator + complete verifier.** No previously published system gives the neural component complete creative freedom (synthesize the entire implementation) while subjecting its output to complete verification (the prover checks every path against the full spec).
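The camps reduce to two axes, which a small data model makes explicit. The encoding below is one reading of the text, not an established classification: the =Loop= fields, the =camp= function, and the =constrained_generator= values assigned to Camp A systems are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Verifier(Enum):
    NONE = "none"          # no independent verifier
    PARTIAL = "partial"    # catches some classes of error
    COMPLETE = "complete"  # checks everything the generator emits

@dataclass(frozen=True)
class Loop:
    name: str
    constrained_generator: bool
    verifier: Verifier

def camp(loop: Loop) -> str:
    if loop.verifier is Verifier.NONE:
        return "A"
    if loop.verifier is Verifier.COMPLETE:
        return "B" if loop.constrained_generator else "D"
    # Partial verifier: the text only names the unconstrained case.
    return "unnamed" if loop.constrained_generator else "C"

LOOPS = [
    Loop("DreamCoder", False, Verifier.NONE),
    Loop("DSPy", False, Verifier.NONE),
    Loop("GPT-f", True, Verifier.COMPLETE),
    Loop("ReAct", False, Verifier.PARTIAL),
    Loop("Passepartout", False, Verifier.COMPLETE),
]
by_camp = {loop.name: camp(loop) for loop in LOOPS}
```

Written this way, Camp D is simply the cell where =constrained_generator= is false and the verifier is complete.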

* Why This Position Was Empty

The reasons are historical and economic, not technical:

**The LLM came first, the verifier second.** GPT-f (2020) and Thor (2022) were constrained to the tactic level because the LLMs available at the time could not reliably produce correct code at the module level. DreamCoder (2020) used neural program synthesis, not LLMs, and its generator was too weak to implement a full protocol. The hardware needed to run an LLM powerful enough to synthesize an entire TLS implementation, plus an ACL2 prover to verify it, did not exist in the same price range until roughly 2024.

**The verifier scales differently from the generator.** ACL2 can verify a 50,000-line TLS implementation, but doing so requires the spec to be written in a form the prover can consume — which is a human investment at least as large as writing the implementation. The old cost structure (Gabriel's "correctness is expensive") made the Passepartout loop uneconomical for any practical use case. Only the cost inversion described in [[id:c2789c0f-0955-43af-8a4a-f83ba87128fd][Better is Cheaper]] makes it viable.

**The field prioritized breadth over depth.** DSPy, ReAct, and similar systems optimize for broad applicability across many domains with partial guarantees — the "worse is better" of neurosymbolic design. Passepartout optimizes for deep correctness in a narrower domain. The field has not yet asked "what if we take the complete verification path and give the generator full freedom?" because the components to ask that question did not converge until now.

* What This Means for Research

Passepartout's loop is not patentable as a combination of existing parts — the individual components (LLM, ACL2, gates) are well-known. But the *unoccupied position in the design space* is a research contribution that no paper has described because the loop has only been technically feasible for approximately the last 12-18 months.

The open questions that define the research agenda:

1. **Counterexample bandwidth.** Can the prover produce counterexamples rich enough for the LLM to learn from in a single retry? Prover counterexamples are typically concrete (a specific input where the implementation deviates from spec). The LLM must generalize from one concrete failure to a correct implementation. This is harder for the LLM than what GPT-f faces (the prover tells it exactly which tactic failed and why).

2. **Spec completeness threshold.** How incomplete can a spec be and still produce correct implementations? If the spec is vague (in natural language with formal fragments), the prover cannot check everything, and the loop degrades to ReAct-level partial verification. The research question is the minimum spec density needed to enter the complete-verification regime.

3. **Gate interaction with verification.** What happens when an implementation passes the prover but violates a gate (organizational policy)? The gate is a second symbolic layer with different authority: it checks policy, not correctness. Does the gate's failure feedback go to the agent (regenerate), the human (update policy), or both? This interaction has no existing literature because no system has two independent symbolic verification layers.

4. **Specification as bottleneck.** The field has spent decades studying how hard it is to write code. Passepartout replaces "writing code" with "writing specifications." Is specification-writing easier than implementation-writing at the same level of quality? Or does the difficulty just shift from a familiar bottleneck to an unfamiliar one?
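The spec completeness question (item 2) can be made concrete with a toy example: the same three candidate implementations are checked against a sorting spec with one clause, then with both. The clauses, candidates, and bounded input set are assumptions chosen for illustration, and the bounded check is a stand-in for a prover.

```python
from collections import Counter
from itertools import product

def is_sorted(xs, ys):
    """Clause 1: the output is in non-decreasing order."""
    return all(a <= b for a, b in zip(ys, ys[1:]))

def is_permutation(xs, ys):
    """Clause 2: the output has exactly the input's elements."""
    return Counter(xs) == Counter(ys)

CLAUSES = {"sorted": is_sorted, "permutation": is_permutation}

CANDIDATES = {
    "empty": lambda xs: [],
    "identity": lambda xs: list(xs),
    "real_sort": lambda xs: sorted(xs),
}

# All lists of length 0..3 over {0, 1, 2}: a bounded stand-in for "all inputs".
INPUTS = [list(t) for n in range(4) for t in product(range(3), repeat=n)]

def accepted(clause_names):
    """Names of candidates that satisfy every listed clause on every bounded input."""
    return sorted(
        name for name, f in CANDIDATES.items()
        if all(CLAUSES[c](xs, f(xs)) for c in clause_names for xs in INPUTS)
    )

only_sorted = accepted(["sorted"])
only_permutation = accepted(["permutation"])
full_spec = accepted(["sorted", "permutation"])
```

With either clause alone, a wrong implementation is accepted (=empty= under "sorted", =identity= under "permutation"); only the two clauses together single out =real_sort=. The verifier is equally strict in all three runs, so the difference comes entirely from spec density.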

These questions are unanswered because the architecture that makes them meaningful — unconstrained generator + complete verifier — did not exist in any published system. Passepartout is not just "another neurosymbolic loop." It occupies a cell in the design space that was previously empty. That is what makes it novel.

* References

- [[id:dddd52a7-adb8-470e-a459-614ade5f76af][Closing the Lisp Gap]] — the context in which this architecture becomes viable
- [[id:c2789c0f-0955-43af-8a4a-f83ba87128fd][Better is Cheaper]] — the cost inversion that makes this loop economical
- [[id:9af13fff-9725-542b-93b1-a555bc74ad72][Lisp Economics]] — why the old cost structure prevented this architecture from being built earlier