hermes-brain/projects/passepartout/architecture/design/engineering-infrastructure/the-evaluation-harness.org at 284c5bd324c62f7e36fcad879fa9e85d11141ec1

amr/hermes-brain

Fork 0

Files

Hermes 284c5bd324 Break design/ files into subfolders by ** heading (43 new files)

2026-06-04 20:47:17 +00:00

2.2 KiB

Raw Blame History

The Evaluation Harness

— title: The Evaluation Harness type: reference tags: :passepartout:architecture: —

The Evaluation Harness

SOTA parity is meaningless without measurement. A system that claims to match commercial agents must demonstrate it through reproducible benchmarks, not through feature checklists. The evaluation harness is the apparatus by which Passepartout proves its capabilities.

The industry standard for coding agents is SWE-bench: a corpus of GitHub issues paired with pull requests. The agent is given an issue, must understand the codebase, write a fix, and submit. Success is measured by whether the submitted PR passes the existing test suite. This tests the full chain: understanding, planning, code generation, verification, and multi-step reasoning.

Passepartout implements a native Lisp harness for this. A background thread clones repositories, feeds issues into the cognitive loop, tracks the resolution trajectory as an Org-mode headline tree, and scores success by test outcomes. The trajectory is persisted: when a resolution fails, the system can inspect where in the chain the reasoning broke down. The headline tree records the agent's thoughts at each step, making the failure auditable and the debugging human-assisted.

Beyond SWE-bench, the harness includes chaos testing. The system is subjected to resource starvation, concurrent load, and adversarial input. The deterministic engine must maintain safety invariants under pressure. The symbolic verifier must not deadlock or livelock. The probabilistic engine must degrade gracefully.

The harness also supports regression testing on the skill set. Every skill is tested against a suite of known inputs and expected outputs. When a modification is proposed to any skill — whether through manual editing or the agent's own self-modification — the test suite runs first. A skill that fails its tests is rejected before it can propagate to the running image. This is not a convenience — it is the mechanism by which self-modification remains safe. The agent can propose changes, but the harness verifies them before the changes take effect.

2.2 KiB Raw Blame History

The Evaluation Harness

2.2 KiB

Raw Blame History