From 77b3bb6e6a410e94e48854d243fd287173b5c6ff Mon Sep 17 00:00:00 2001
From: Hermes
Date: Fri, 22 May 2026 00:42:50 +0000
Subject: [PATCH] =?UTF-8?q?Add=20collective=20regression=20suite=20spec=20?=
 =?UTF-8?q?=E2=80=94=20how=20the=20evaluation=20harness=20compounds=20acro?=
 =?UTF-8?q?ss=20instances?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../collective-regression-suite.org           | 152 ++++++++++++++++++
 .../evaluation-harness.org                    |   2 +-
 .../passepartout-economics.org                |   1 +
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 ideas/passepartout-economics/collective-regression-suite.org

diff --git a/ideas/passepartout-economics/collective-regression-suite.org b/ideas/passepartout-economics/collective-regression-suite.org
new file mode 100644
index 0000000..66844af
--- /dev/null
+++ b/ideas/passepartout-economics/collective-regression-suite.org
@@ -0,0 +1,152 @@
+:PROPERTIES:
+:ID: a5d59d12-b23e-58d6-a81b-9b8b06556949
+:END:
+#+title: Collective Regression Suite — Specification
+#+filetags: :passepartout:evaluation:regression:suite:collective:
+
+The evaluation harness is not a static test suite written once. It is a living artifact that grows with every deployed instance. Every gate decision that a human corrects becomes a test case. Every bug fix adds an edge case. Every regulatory update adds a rule that must be checked.
+
+This specification describes how the collective regression suite is built, maintained, and used, with Agora as the substrate for distribution and contribution.
+
+**Why collective**
+
+A single instance learns from its own mistakes. The collective learns from every instance's mistakes. A HIPAA deployment in one hospital discovers an edge case that a SOC2 deployment in a SaaS company would never encounter on its own — but if that SaaS company ever expands into healthcare, its gate stack must handle that edge case. The collective suite gives them hundreds of thousands of edge cases they did not pay to discover.
+
+This is the mechanism behind the verification monopoly claim. A certification means "your gate stack is verified against every edge case ever discovered by any instance in the ecosystem." A competitor starting from scratch cannot buy or scrape this knowledge.
+
+**What a test case is**
+
+A test case is a structured sexp in the fact store's native format:
+
+    (:test-case
+      :domain healthcare-hipaa
+      :ontology-version "2.3.0"
+      :input (:proposal "modify /var/patient-records/patient-4372.txt"
+              :bound-context (:user-role "billing-clerk" :time "23:14"))
+      :expected-outcome :deny
+      :gate-rule (:id "hipaa-access-hours" :version 4)
+      :rationale "Billing clerks should not have write access to patient records outside business hours"
+      :origin (:instance-did "did:agora:abcd1234"
+               :contribution-hash "sha256:xyz789"
+               :occurred-at "2026-06-15T14:23:00Z"))
+
+Key fields:
+
+- **domain** — which gate rule domain this exercises. An instance being certified for HIPAA only needs to pass the HIPAA subtree.
+- **ontology-version** — which version of the gate rules the test targets. Tests for old ontology versions are flagged but not discarded; they may indicate a regression.
+- **input** — the proposal and the relevant context. Abstraction strips instance-specific details: concrete paths become class patterns ("/var/patient-records/*.txt"), user identities become roles, times become ranges.
+- **expected-outcome** — allow or deny.
+- **gate-rule** — which specific rule this case exercises. Helps identify which rule failed when a test breaks.
+- **rationale** — human-readable explanation of why this outcome is correct. Used when a test needs review after a rule change.
+- **origin** — the contributing instance's Agora DID and a Merkle hash proving the case was actually encountered. This is how reputation is tracked.
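
As an illustration of the abstraction described under the input field, here is a minimal Python sketch. It is not part of the patch and not Passepartout's implementation: the function names, the `USER_ROLES` table, and the 09:00 to 17:00 business-hours window are all hypothetical, chosen only to show concrete paths becoming class patterns, identities becoming roles, and times becoming ranges.

```python
from dataclasses import dataclass

# Hypothetical identity-to-role table; a real instance would presumably
# consult its own identity store instead.
USER_ROLES = {"u-1029": "billing-clerk"}


@dataclass(frozen=True)
class AbstractInput:
    path_pattern: str
    user_role: str
    time_range: str


def abstract_path(path: str) -> str:
    """Replace the file's base name with '*', keeping directory and extension."""
    directory, _, filename = path.rpartition("/")
    _, dot, ext = filename.rpartition(".")
    return f"{directory}/*{dot}{ext}" if dot else f"{directory}/*"


def abstract_time(hhmm: str, open_hour: int = 9, close_hour: int = 17) -> str:
    """Map a wall-clock time to a coarse range (assumed business-hours window)."""
    hour = int(hhmm.split(":")[0])
    if open_hour <= hour < close_hour:
        return "business-hours"
    return "outside-business-hours"


def abstract_input(path: str, user_id: str, hhmm: str) -> AbstractInput:
    """Strip instance-specific details from one gate decision's input."""
    return AbstractInput(
        path_pattern=abstract_path(path),
        user_role=USER_ROLES.get(user_id, "unknown-role"),
        time_range=abstract_time(hhmm),
    )


example = abstract_input("/var/patient-records/patient-4372.txt", "u-1029", "23:14")
print(example)
```

Keeping the abstracted form as a small immutable value makes it usable as a dictionary key, which is convenient for the later de-duplication step.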
+
+**How test cases are generated**
+
+Every day, each instance runs a local triage pass:
+
+1. Collect all gate decisions from the past 24 hours where the human overrode the automatic outcome.
+2. Strip all instance-specific data — concrete paths become patterns, identities become roles, absolute times become relative ranges.
+3. Run each abstracted case against the current local regression suite. If already covered, discard.
+4. Check each remaining case against the current local suite for contradictions. If a new case contradicts an existing test, both are flagged for human review.
+5. Add surviving cases to the local suite.
+
+The local suite is the seed. Once per week (or on explicit trigger), the instance submits new cases to the collective:
+
+1. Sign each new case with the instance's Agora DID.
+2. Bundle into an Agora Note with domain tag and ontology version.
+3. Publish to the collective regression suite topic on Agora.
+
+**How the collective suite is organized**
+
+The suite is a Merkle DAG organized by domain and ontology version:
+
+    /regression-suite/v2/
+      root-manifest.signed          — signed index of all domains
+      foundational/
+        manifest.signed
+        path-traversal.regression   — 12,400 test cases
+        shell-injection.regression  — 8,912 test cases
+        credential-leak.regression  — 3,401 test cases
+        ...
+      healthcare-hipaa/
+        manifest.signed
+        access-control.regression   — 47,203 test cases
+        phi-handling.regression     — 23,891 test cases
+        audit-logging.regression    — 5,672 test cases
+        ...
+      fintech-soc2/
+        ...
+      industrial-iec62443/
+        ...
+      general-intelligence/
+        hallucination-detection.regression
+        tool-abuse.regression
+        ...
+
+Each .regression file is a compressed, sorted list of test cases. The manifest is a Merkle tree over the files: the suite's integrity is verifiable by hashing the manifest and comparing against the signed value.
+
+**Who can submit**
+
+Any Passepartout instance with an Agora DID can submit test cases. But not all submissions are treated equally. The suite maintains a three-tier system:
+
+Tier 1 — Verified. Human-reviewed by the suite operator. Used in certification scoring. An instance that passes Tier 1 earns the standard certification badge.
+
+Tier 2 — Community. Auto-accepted from instances with a track record of valid contributions. Used in certification scoring but weighted lower. An instance that passes Tier 2 but has not been audited against Tier 1 gets a provisional badge.
+
+Tier 3 — Submitted. Not yet reviewed. Included in the suite but excluded from scoring until reviewed. Tagged with the submitter's DID so reviewers can assess patterns of behavior.
+
+Reputation determines when a submitter graduates to auto-acceptance:
+
+- Each submission is either confirmed valid (reviewed and accepted), flagged as invalid, or flagged as malicious.
+- An instance with a long history of valid submissions (100+ confirmed, zero malicious) graduates to Tier 2 auto-accept.
+- An instance that submits malicious cases (fake edge cases designed to poison the suite) loses reputation. After three malicious submissions the instance is permanently banned from contributing.
+
+Reputation is public and tied to the Agora DID. A banned instance can create a new DID, but the new DID starts with no history and all its submissions go to Tier 3 pending review. The cost of a Sybil attack is human review time, which scales poorly for the attacker.
+
+**Certification scoring**
+
+When an instance applies for certification:
+
+1. Download the current regression suite manifest. Verify the Merkle root against the operator's signed certificate.
+2. Run each test case through the instance's gate stack. Record pass/fail per case.
+3. Submit results as a signed Agora Note. The note includes the instance's DID, the suite version tested against, and the per-domain pass rates.
+
+The certification score is the weighted pass rate across all domains that the instance claims compliance with. A HIPAA-certified instance must pass 99.5%+ of the healthcare-hipaa subtree. A generally capable agent must pass 95%+ of the foundational subtree.
+
+Failing cases matter more than passing ones. If an instance fails a test case, the suite operator flags: "case X was checked against instance Y and failed." This becomes a triage signal. If multiple instances fail the same case, either the case is wrong (regression — review the rule) or the rule is ambiguous (update the spec).
+
+**The network effect quantified**
+
+Assume each deployed instance generates, on average, one new unique test case per week (from the 5% of edge cases the human corrects) over a 50-week year, and that the installed base grows tenfold each year from 100 instances:
+
+- Year 1: ~5,000 new cases (100 instances x 50 weeks x 1 case/week)
+- Year 2: ~50,000 new cases (1,000 instances x 50 weeks x 1 case/week)
+- Year 3: ~500,000 new cases (10,000 instances x 50 weeks x 1 case/week)
+
+By the end of year 3 the suite holds roughly 555,000 cases, so a new instance that runs it captures more than half a million edge cases from real deployments at zero marginal cost. The operator charges $50K-$200K for the certification. The insurmountability is not technical — a well-funded competitor could reproduce some of these cases through synthetic generation. The insurmountability is provenance: these cases are labeled by real human corrections from real deployments. A synthetic case is a best guess. The collective suite's cases are ground truth.
+
+**The operator's role**
+
+The collective regression suite operator is a distinct role from the Passepartout developer. The operator:
+
+1. Runs the server that accepts, de-duplicates, and signs submissions.
+2. Reviews Tier 3 submissions for validity.
+3. Resolves contradictions when two instances submit contradictory test cases.
+4. Publishes signed manifests at each release.
+5. Issues certification badges.
+
+This role can be performed by the early player as a revenue-generating service, or by a neutral foundation if the ecosystem grows large enough. The revenue model: certification fees ($50K-$200K per enterprise per year). The operator does not gate access to the suite itself — the suite is available to all Agora participants because a larger suite makes the ecosystem more valuable. The operator charges for the badge, not the data.
+
+**Summary of the loop**
+
+Instance runs → human corrects a gate decision → new test case is abstracted and added to the local suite → periodically submitted to the collective suite → de-duplicated and verified → published in the next manifest → every other instance downloads it → future instances must pass it to earn certification → the collective suite grows → the certification becomes harder to fake → the ecosystem becomes more valuable → more instances join → more edge cases discovered.
+
+Every component of this loop exists or is on Passepartout's roadmap except the Agora Note publishing channel. The gate stack generates the raw signal. The abstraction pass strips instance details. The local triage de-duplicates. The Agora DID provides authentication. The Merkle root provides integrity. The certification badge provides monetization.
+
+Nothing in this loop requires new core Passepartout functionality. It requires the Agora protocol for inter-instance communication and a server-side aggregation process. Both are roadmap items, but neither depends on the self-driving Lisp Machine.
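
The local triage step in this loop (discard what is already covered, flag contradictions, keep the rest) could look roughly like the Python sketch below. It is not part of the patch, and it makes a simplifying assumption: two cases "cover" each other only when their domain and abstracted input match exactly. The `Case` type and `triage` function are hypothetical names, not an existing Passepartout API.

```python
from typing import NamedTuple


class Case(NamedTuple):
    domain: str
    abstract_input: str  # abstracted proposal plus context, serialized
    expected: str        # "allow" or "deny"


def triage(new_cases, local_suite):
    """Split new cases into (accepted, contradictions).

    - A case whose (domain, input) is already in the suite with the same
      expected outcome is considered covered and is discarded.
    - A case whose (domain, input) is in the suite with a different
      expected outcome is paired with the existing test for human review.
    - Everything else is accepted.
    """
    known = {(c.domain, c.abstract_input): c for c in local_suite}
    accepted, contradictions = [], []
    for case in new_cases:
        key = (case.domain, case.abstract_input)
        existing = known.get(key)
        if existing is None:
            known[key] = case          # later duplicates in this batch are covered
            accepted.append(case)
        elif existing.expected != case.expected:
            contradictions.append((existing, case))
    return accepted, contradictions


suite = [Case("healthcare-hipaa", "write /var/patient-records/*.txt billing-clerk outside-business-hours", "deny")]
batch = [
    Case("healthcare-hipaa", "write /var/patient-records/*.txt billing-clerk outside-business-hours", "deny"),
    Case("healthcare-hipaa", "write /var/patient-records/*.txt billing-clerk outside-business-hours", "allow"),
    Case("foundational", "read /etc/* unknown-role business-hours", "deny"),
]
accepted, contradictions = triage(batch, suite)
print(len(accepted), len(contradictions))
```

A real coverage check would likely need pattern subsumption (one class pattern covering another) rather than exact matching; exact matching is used here only to keep the sketch short.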
+
+See also:
+[[file:evaluation-harness.org][Evaluation harness certification service]]
+[[file:verification-monopoly.org][Verification monopoly — the big money]]
+[[file:infrastructure-lock-in.org][Infrastructure lock-in]]
+[[file:moats.org][Moats]]
diff --git a/ideas/passepartout-economics/evaluation-harness.org b/ideas/passepartout-economics/evaluation-harness.org
index 4078696..1c642ba 100644
--- a/ideas/passepartout-economics/evaluation-harness.org
+++ b/ideas/passepartout-economics/evaluation-harness.org
@@ -16,4 +16,4 @@ The regression suite grows with every deployment, making the certification incre
 
 Long-term endpoint: this becomes the UL certification for AI — a third-party verification nobody can ignore. [[file:verification-monopoly.org][The verification monopoly]].
 
-See also: [[file:verification-appliance.org][Verification appliance]], [[file:infrastructure-lock-in.org][Infrastructure lock-in]], [[file:moats.org][Moats]]
+See also: [[file:collective-regression-suite.org][Collective regression suite]], [[file:verification-appliance.org][Verification appliance]], [[file:infrastructure-lock-in.org][Infrastructure lock-in]], [[file:moats.org][Moats]]
diff --git a/ideas/passepartout-economics/passepartout-economics.org b/ideas/passepartout-economics/passepartout-economics.org
index 2e1129b..59b9392 100644
--- a/ideas/passepartout-economics/passepartout-economics.org
+++ b/ideas/passepartout-economics/passepartout-economics.org
@@ -16,6 +16,7 @@ The business model is the AWS of provable computing: AGPL infrastructure is free
 
 [[file:lisp-machine-security.org][Lisp Machine security — unified memory threat model]]
 [[file:common-logic-iso-24707.org][Common Logic (ISO 24707) — relevance to the triad]]
+[[file:collective-regression-suite.org][Collective regression suite — how it compounds]]
 
 Key analytical frames:
 - [[file:investment-thesis.org][Investment thesis — the unified view]]