hermes-brain/ideas/collective-regression-suite.org
Hermes 3e32ea9959 Promote entire passepartout-economics/ to ideas/ root
All 31 files from ideas/passepartout-economics/ promoted to ideas/ root.
- Subfolder's passepartout-economics.org (42-line index) renamed to
  triad-index.org to avoid collision with root-level full doc
- index.org removed (redundant — triad-index.org replaces it)
- Root-level passepartout-economics.org: stripped file:passepartout-economics/
  prefix from all cross-references (now simple file:foo.org links)
- compliance-framework-mapping.org: same prefix cleanup
- All internal file: links within the economics docs already used simple
  names (no prefix) — they resolve correctly from ideas/ root
2026-05-23 06:09:08 +00:00


:PROPERTIES:
:ID: a5d59d12-b23e-58d6-a81b-9b8b06556949
:END:
#+title: Collective Regression Suite — Specification
#+filetags: :passepartout:evaluation:regression:suite:collective:
The [[file:evaluation-harness.org][evaluation harness]] is not a static test suite written once. It is a living artifact that grows with every deployed instance. Every gate decision that a human corrects becomes a test case. Every bug fix adds an edge case. Every regulatory update adds a rule that must be checked.
This specification describes how the collective regression suite is built, maintained, and used, with Agora as the substrate for distribution and contribution.
**Why collective**
A single instance learns from its own mistakes. The collective learns from every instance's mistakes. A HIPAA deployment in one hospital discovers an edge case that a SOC2 deployment in a SaaS company would never encounter on its own — but if that SaaS company ever expands into healthcare, their gate stack must handle that edge case. The collective suite gives them hundreds of thousands of edge cases they did not pay to discover.
This is the mechanism behind the [[file:verification-monopoly.org][verification monopoly claim]]. A certification means "your gate stack is verified against every edge case ever discovered by any instance in the ecosystem." A competitor starting from scratch cannot buy or scrape this knowledge.
**What a test case is**
A test case is a structured sexp in the fact store's native format:
(:test-case
 :domain healthcare-hipaa
 :ontology-version "2.3.0"
 :input (:proposal "modify /var/patient-records/patient-4372.txt"
         :bound-context (:user-role "billing-clerk" :time "23:14"))
 :expected-outcome :deny
 :gate-rule (:id "hipaa-access-hours" :version 4)
 :rationale "Billing clerks should not have write access to patient records outside business hours"
 :origin (:instance-did "did:agora:abcd1234"
          :contribution-hash "sha256:xyz789"
          :occurred-at "2026-06-15T14:23:00Z"))
Key fields:
- **domain** — which gate rule domain this exercises. An instance being certified for HIPAA only needs to pass the HIPAA subtree.
- **ontology-version** — which version of the gate rules the test targets. Tests for old ontology versions are flagged but not discarded; they may indicate a regression.
- **input** — the proposal and the relevant context. Abstraction strips instance-specific paths: concrete paths become class patterns ("/var/patient-records/*.txt"), user identities become roles, times become ranges.
- **expected-outcome** — allow or deny.
- **gate-rule** — which specific rule this case exercises. Helps identify which rule failed when a test breaks.
- **rationale** — human-readable explanation of why this outcome is correct. Used when a test needs review after a rule change.
- **origin** — the contributing instance's Agora DID and a Merkle hash proving the case was actually encountered. This is how reputation is tracked.
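The abstraction described for the input field can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the 09:00-17:00 business-hours window, and the range labels are all assumptions made for the example. Identities are assumed to arrive already mapped to roles, as in the sample case.

```python
from datetime import time

# Assumed business-hours window; the specification does not define one.
BUSINESS_START, BUSINESS_END = time(9, 0), time(17, 0)


def abstract_path(path: str) -> str:
    """Replace the file name with a wildcard, keeping directory and extension."""
    head, _, tail = path.rpartition("/")
    dot = tail.rfind(".")
    ext = tail[dot:] if dot > 0 else ""
    return f"{head}/*{ext}"


def abstract_time(hhmm: str) -> str:
    """Map a concrete HH:MM time to a coarse range label (labels are hypothetical)."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    moment = time(hour, minute)
    if BUSINESS_START <= moment < BUSINESS_END:
        return "business-hours"
    return "outside-business-hours"


def abstract_input(proposal: str, context: dict) -> dict:
    """Abstract a '<verb> <path>' proposal and its bound context."""
    verb, _, path = proposal.partition(" ")
    return {
        "proposal": f"{verb} {abstract_path(path)}",
        "bound-context": {
            # The role is kept as-is; a user identity would be mapped to a role upstream.
            "user-role": context["user-role"],
            "time-range": abstract_time(context["time"]),
        },
    }
```

Applied to the sample case above, the proposal path becomes the class pattern "/var/patient-records/*.txt" and the time "23:14" becomes a range label instead of a timestamp.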
**How test cases are generated**
Every day, each instance runs a local triage pass:
1. Collect all gate decisions from the past 24 hours where the human overrode the automatic outcome.
2. Strip all instance-specific data — concrete paths become patterns, identities become roles, absolute times become relative ranges.
3. Run each abstracted case against the current local regression suite. If already covered, discard.
4. Run each case against the current local suite for contradictions. If a new case contradicts an existing test, both are flagged for human review.
5. Add surviving cases to the local suite.
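The triage steps above can be sketched in a few lines. The sketch assumes that "already covered" means an exact match on the abstracted domain, proposal, and context, and that a contradiction is the same abstracted input with a different expected outcome. Both definitions, and the `Case` and `triage` names, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    domain: str
    proposal: str    # abstracted proposal, e.g. "modify /var/patient-records/*.txt"
    context: tuple   # abstracted context as sorted (key, value) pairs
    expected: str    # "allow" or "deny"

    def key(self):
        """Identity of the abstracted input, ignoring the expected outcome."""
        return (self.domain, self.proposal, self.context)


def triage(new_cases, suite):
    """Merge new cases into `suite` (dict: key -> Case), mutating it in place.

    Returns (added, flagged): cases added to the suite, and
    (existing, new) pairs that contradict each other and need human review.
    """
    added, flagged = [], []
    for case in new_cases:
        existing = suite.get(case.key())
        if existing is None:
            suite[case.key()] = case          # genuinely new: keep it
            added.append(case)
        elif existing.expected != case.expected:
            flagged.append((existing, case))  # contradiction: human review
        # Otherwise the case is already covered and is discarded.
    return added, flagged
```

A real coverage check would probably need pattern subsumption (one class pattern covering another) rather than exact equality; exact matching keeps the sketch short.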
The local suite is the seed. Once per week (or on explicit trigger), the instance submits new cases to the collective:
1. Sign each new case with the instance's Agora DID.
2. Bundle into an Agora Note with domain tag and ontology version.
3. Publish to the collective regression suite topic on Agora.
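A rough sketch of the sign-and-bundle steps follows. Agora's actual note format and DID signature scheme are not specified here, so everything in it is an assumption: the field names, the topic string, and above all the use of HMAC-SHA256 as a standard-library stand-in for a real DID signature (which would normally be an asymmetric signature over the same canonical bytes).

```python
import hashlib
import hmac
import json


def canonical(obj) -> bytes:
    """Deterministic JSON encoding so that hashes and signatures are reproducible."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_case(case: dict, did: str, secret: bytes) -> dict:
    """Attach a content hash and a signature to one abstracted case."""
    payload = canonical(case)
    return {
        "case": case,
        "instance-did": did,
        "contribution-hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
        # Stand-in only: a real DID would sign with a private key, not a shared secret.
        "signature": hmac.new(secret, payload, hashlib.sha256).hexdigest(),
    }


def bundle_note(cases, did, secret, domain, ontology_version) -> dict:
    """Bundle signed cases into a note-like dict tagged with domain and version."""
    return {
        "topic": "collective-regression-suite",  # hypothetical topic name
        "domain": domain,
        "ontology-version": ontology_version,
        "entries": [sign_case(case, did, secret) for case in cases],
    }


def verify_entry(entry: dict, secret: bytes) -> bool:
    """Recompute the stand-in signature and compare in constant time."""
    expected = hmac.new(secret, canonical(entry["case"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry["signature"])
```

Canonical encoding matters more than the choice of signature primitive: if two instances serialize the same case differently, de-duplication by content hash stops working.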
**How the collective suite is organized**
The suite is a Merkle DAG organized by domain and ontology version:
/regression-suite/v2/
  root-manifest.signed                  # signed index of all domains
  foundational/
    manifest.signed
    path-traversal.regression           # 12,400 test cases
    shell-injection.regression          # 8,912 test cases
    credential-leak.regression          # 3,401 test cases
    ...
  healthcare-hipaa/
    manifest.signed
    access-control.regression           # 47,203 test cases
    phi-handling.regression             # 23,891 test cases
    audit-logging.regression            # 5,672 test cases
    ...
  fintech-soc2/
    ...
  industrial-iec62443/
    ...
  general-intelligence/
    hallucination-detection.regression
    tool-abuse.regression
    ...
Each .regression file is a compressed, sorted list of test cases. The manifest is a Merkle tree over the files: the suite's integrity is verifiable by hashing the manifest and comparing against the signed value.
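The integrity check can be illustrated with a small Merkle-root computation over a domain's files. The tree construction below is one common variant and is an assumption, since the specification does not fix one: leaves are ordered by file name, leaf and interior hashes use different prefix bytes, and an odd node is paired with itself. Checking the operator's signature over the root is out of scope for the sketch.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> bytes:
    """Compute a Merkle root over a list of byte strings."""
    if not leaves:
        return _h(b"")
    # Prefix bytes keep leaf hashes and interior hashes from colliding.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: pair the last node with itself
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def manifest_root(files: dict) -> str:
    """Root over {file name: file bytes}, with leaves ordered by file name."""
    leaves = [
        name.encode("utf-8") + b"\x00" + _h(content)
        for name, content in sorted(files.items())
    ]
    return merkle_root(leaves).hex()


def verify_manifest(files: dict, signed_root_hex: str) -> bool:
    """Compare the recomputed root with the value taken from the signed manifest."""
    return manifest_root(files) == signed_root_hex
```

Because each leaf binds the file name to the hash of its content, renaming a file or changing a single test case changes the root.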
**Who can submit**
Any Passepartout instance with an Agora DID can submit test cases. But not all submissions are treated equally. The suite maintains a three-tier system:
Tier 1 — Verified. Human-reviewed by the suite operator. Used in certification scoring. An instance that passes Tier 1 earns the standard certification badge.
Tier 2 — Community. Auto-accepted from instances with a track record of valid contributions. Used in certification scoring but weighted lower. An instance that passes Tier 2 but has not been audited against Tier 1 gets a provisional badge.
Tier 3 — Submitted. Not yet reviewed. Included in the suite but excluded from scoring until reviewed. Tagged with the submitter's DID so reviewers can assess the submitter's pattern of behavior.
Reputation determines when a submitter graduates to auto-acceptance:
- Each submission is either confirmed valid (reviewed and accepted), flagged as invalid, or flagged as malicious.
- An instance with a long history of valid submissions (100+ confirmed, zero malicious) graduates to Tier 2 auto-accept.
- An instance that submits malicious cases (fake edge cases designed to poison the suite) loses reputation. Three malicious submissions and the instance is banned from contributing permanently.
Reputation is public and tied to the Agora DID. A banned instance can create a new DID, but the new DID starts with no history and all its submissions go to Tier 3 pending review. A Sybil attacker therefore gains nothing from a fresh identity until its cases have survived human review, and every new DID has to rebuild its reputation from zero, which scales poorly for the attacker.
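The reputation rules above amount to a small state machine, sketched below. The thresholds (100 confirmed submissions with zero malicious for auto-accept, three malicious submissions for a ban) come from the text; the class, method, and label names are hypothetical.

```python
from dataclasses import dataclass

AUTO_ACCEPT_CONFIRMED = 100  # "100+ confirmed" from the rules above
BAN_AFTER_MALICIOUS = 3      # "three malicious submissions" from the rules above


@dataclass
class Reputation:
    confirmed: int = 0
    invalid: int = 0
    malicious: int = 0

    def record(self, verdict: str) -> None:
        """Record one review verdict: 'confirmed', 'invalid', or 'malicious'."""
        if verdict not in ("confirmed", "invalid", "malicious"):
            raise ValueError(f"unknown verdict: {verdict}")
        setattr(self, verdict, getattr(self, verdict) + 1)

    @property
    def banned(self) -> bool:
        return self.malicious >= BAN_AFTER_MALICIOUS

    def intake_tier(self) -> str:
        """Where this submitter's next submission lands."""
        if self.banned:
            return "banned"
        if self.confirmed >= AUTO_ACCEPT_CONFIRMED and self.malicious == 0:
            return "tier-2"  # auto-accepted as Community
        return "tier-3"      # queued for human review
```

A fresh DID starts as `Reputation()`, so its submissions land in Tier 3, matching the Sybil argument above.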
**Certification scoring**
When an instance applies for certification:
1. Download the current regression suite manifest. Verify the Merkle root against the operator's signed certificate.
2. Run each test case through the instance's gate stack. Record pass/fail per case.
3. Submit results as a signed Agora Note. The note includes the instance's DID, the suite version tested against, and the per-domain pass rates.
The certification score is the weighted pass rate across all domains that the instance claims compliance with. A HIPAA-certified instance must pass 99.5%+ of the healthcare-hipaa subtree. A generally-capable agent must pass 95%+ of the foundational subtree.
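One way to read "weighted pass rate" is sketched below. The per-domain thresholds come from the paragraph above. The tier weights are assumptions: the text says only that Tier 2 is weighted lower than Tier 1 and that Tier 3 is excluded from scoring, so the 0.5 weight is a placeholder.

```python
# Assumed weights: Tier 1 counts fully, Tier 2 counts half, Tier 3 is excluded.
TIER_WEIGHTS = {1: 1.0, 2: 0.5, 3: 0.0}

# Minimum pass rates taken from the text.
THRESHOLDS = {"healthcare-hipaa": 0.995, "foundational": 0.95}


def domain_score(results) -> float:
    """Weighted pass rate for one domain.

    `results` is an iterable of (tier, passed) pairs, one per test case.
    """
    total = earned = 0.0
    for tier, passed in results:
        weight = TIER_WEIGHTS[tier]
        total += weight
        if passed:
            earned += weight
    return earned / total if total else 0.0


def certify(results_by_domain: dict, claimed_domains) -> dict:
    """Return {domain: (score, meets_threshold)} for each claimed domain."""
    report = {}
    for domain in claimed_domains:
        score = domain_score(results_by_domain.get(domain, []))
        report[domain] = (score, score >= THRESHOLDS[domain])
    return report
```

With these weights, failing a Tier 3 case never lowers the score, and a domain with no scored cases yields 0.0 rather than a vacuous pass.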
Failing cases matter more than passing ones. When an instance fails a test case, the suite operator records that case X was run against instance Y and failed. This becomes a triage signal: if multiple instances fail the same case, either the case is wrong (a suite regression: review the case and the rule it cites) or the rule is ambiguous (update the spec).
**The network effect quantified**
Assume each deployed instance generates on average one new unique test case per week (from the 5% of edge cases the human corrects), and that the number of deployed instances grows tenfold each year, starting from 100:
- Year 1: ~5,000 new cases (100 instances x 50 weeks x 1 case/week)
- Year 2: ~50,000 new cases (1,000 instances x 50 weeks x 1 case/week), ~55,000 cumulative
- Year 3: ~500,000 new cases (10,000 instances x 50 weeks x 1 case/week), ~555,000 cumulative
At year 3, a new instance that runs the suite inherits roughly half a million edge cases from real deployments at zero marginal cost, while the operator charges $50K-$200K for the certification. The barrier to a competitor is not technical: a well-funded competitor could reproduce some of these cases through synthetic generation. The barrier is provenance. These cases are labeled by real human corrections from real deployments; a synthetic case is a best guess, whereas the collective suite's cases are ground truth. This creates powerful [[file:moats.org][moats]]: the data network effect accumulates only over time and cannot be bought.
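The projection can be reproduced with a few lines of arithmetic. The inputs (one case per instance per week, 50 weeks per year, and 100, 1,000, and 10,000 instances in successive years) are the assumptions stated above; the function name is illustrative.

```python
CASES_PER_INSTANCE_PER_WEEK = 1
WEEKS_PER_YEAR = 50


def project(instances_by_year):
    """Return (year, instances, new_cases, cumulative_cases) rows."""
    rows, cumulative = [], 0
    for year, instances in enumerate(instances_by_year, start=1):
        new_cases = instances * WEEKS_PER_YEAR * CASES_PER_INSTANCE_PER_WEEK
        cumulative += new_cases
        rows.append((year, instances, new_cases, cumulative))
    return rows


for year, instances, new_cases, cumulative in project([100, 1_000, 10_000]):
    print(f"Year {year}: {instances:>6,} instances, "
          f"{new_cases:>7,} new cases, {cumulative:>7,} cumulative")
```

The per-year figures are new cases contributed in that year. The running total is slightly higher (555,000 after year 3) because earlier cases stay in the suite.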
**The operator's role**
The collective regression suite operator is a distinct role from the Passepartout developer. The operator:
1. Runs the server that accepts, de-duplicates, and signs submissions.
2. Reviews Tier 3 submissions for validity.
3. Resolves contradictions when two instances submit contradictory test cases.
4. Publishes signed manifests at each release.
5. Issues certification badges.
This role can be performed by an early entrant as a revenue-generating service, or by a neutral foundation if the ecosystem grows large enough. The revenue model is certification fees ($50K-$200K per enterprise per year). The operator does not gate access to the suite itself: the suite is available to all Agora participants because a larger suite makes the ecosystem more valuable. The operator charges for the badge, not the data.
**Summary of the loop**
Instance runs → human corrects a gate decision → new test case is abstracted and added to the local suite → periodically submitted to the collective suite → de-duplicated and verified → published in the next manifest → every other instance downloads it → future instances must pass it to earn certification → the collective suite grows → the certification becomes harder to fake → the ecosystem becomes more valuable → more instances join → more edge cases discovered.
Every component of this loop exists or is on Passepartout's roadmap except the Agora Note publishing channel. The gate stack generates the raw signal. The abstraction pass strips instance details. The local triage de-duplicates. The Agora DID provides authentication. The Merkle root provides integrity. The certification badge provides monetization.
Nothing in this loop requires new core Passepartout functionality. It requires the Agora protocol for inter-instance communication and a server-side aggregation process. Both are roadmap items, but neither depends on the self-driving Lisp Machine. The suite itself is the [[file:infrastructure-lock-in.org][infrastructure lock-in]] — once an enterprise has certified against it, switching to a competitor means rebuilding their compliance from scratch.