Merge verification-monopoly, evaluation-harness, collective-regression-suite into one page

Combined all three under verification-monopoly.org with the title 'The Evaluation Harness — Collective Regression Suite as Certification Monopoly'. Structure: (1) vision from monopoly, (2) service from harness, (3) spec from collective-regression. All three IDs preserved in PROPERTIES. Deleted evaluation-harness.org and collective-regression-suite.org.
2026-05-24 19:12:49 +00:00
parent 348f2736a8
commit ede891f2ce
12 changed files with 226 additions and 232 deletions
--- a/projects/passepartout/architecture/native-org-knowledge-base.org
+++ b/projects/passepartout/architecture/native-org-knowledge-base.org
--- a/projects/passepartout/collective-regression-suite.org
+++ b/projects/passepartout/collective-regression-suite.org
@@ -1,147 +0,0 @@
-:PROPERTIES:
-:CREATED:  [2026-05-24 Sun]
-:ID:       a5d59d12-b23e-58d6-a81b-9b8b06556949
-:END:
-#+title: Collective Regression Suite — Specification
-#+filetags: :passepartout:evaluation:regression:suite:collective:
-
-The [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] is not a static test suite written once. It is a living artifact that grows with every deployed instance. Every gate decision that a human corrects becomes a test case. Every bug fix adds an edge case. Every regulatory update adds a rule that must be checked.
-
-This specification describes how the collective regression suite is built, maintained, and used, with [[id:1d074690-a279-59cb-b91d-e9a22ae104ad][the social protocol]] as the substrate for distribution and contribution.
-
-**Why collective**
-
-A single instance learns from its own mistakes. The collective learns from every instance's mistakes. A [[id:84fb5f8f-0527-4df0-b6b6-dbf3bcff8a7f][HIPAA]] deployment in one hospital discovers an edge case that a [[id:ed65031c-cbd2-4ad2-bd53-a67791e183cd][SOC2]] deployment in a SaaS company would never encounter on its own — but if that SaaS company ever expands into healthcare, their gate stack must handle that edge case. The collective suite gives them hundreds of thousands of edge cases they did not pay to discover.
-
-This is the mechanism behind the [[id:827bc546-e887-5b7c-9b65-6392beaf0920][verification monopoly claim]]. A certification means "your gate stack is verified against every edge case ever discovered by any instance in the ecosystem." A competitor starting from scratch cannot buy or scrape this knowledge.
-
-**What a test case is**
-
-A test case is a structured sexp in the fact store's native format:
-
-    (:test-case
-     :domain healthcare-hipaa
-     :ontology-version "2.3.0"
-     :input (:proposal "modify /var/patient-records/patient-4372.txt"
-              :bound-context (:user-role "billing-clerk" :time "23:14"))
-     :expected-outcome :deny
-     :gate-rule (:id "hipaa-access-hours" :version 4)
-     :rationale "Billing clerks should not have write access to patient records outside business hours"
-     :origin (:instance-did "did:agora:abcd1234"
-              :contribution-hash "sha256:xyz789"
-              :occurred-at "2026-06-15T14:23:00Z"))
-
-Key fields:
-
- **domain** — which gate rule domain this exercises. An instance being certified for HIPAA only needs to pass the HIPAA subtree.
- **ontology-version** — which version of the gate rules the test targets. Tests for old ontology versions are flagged but not discarded; they may indicate a regression.
- **input** — the proposal and the relevant context. Abstraction strips instance-specific paths: concrete paths become class patterns ("/var/patient-records/*.txt"), user identities become roles, times become ranges.
- **expected-outcome** — allow or deny.
- **gate-rule** — which specific rule this case exercises. Helps identify which rule failed when a test breaks.
- **rationale** — human-readable explanation of why this outcome is correct. Used when a test needs review after a rule change.
- **origin** — the contributing instance's social protocol DID and a Merkle hash proving the case was actually encountered. This is how reputation is tracked.
-
-**How test cases are generated**
-
-Every day, each instance runs a local triage pass:
-
-1. Collect all gate decisions from the past 24 hours where the human overrode the automatic outcome.
-2. Strip all instance-specific data — concrete paths become patterns, identities become roles, absolute times become relative ranges.
-3. Run each abstracted case against the current local regression suite. If already covered, discard.
-4. Run each case against the current local suite for contradictions. If a new case contradicts an existing test, both are flagged for human review.
-5. Add surviving cases to the local suite.
-
-The local suite is the seed. Once per week (or on explicit trigger), the instance submits new cases to the collective:
-
-1. Sign each new case with the instance's social protocol DID.
-2. Bundle into a social protocol Note with domain tag and ontology version.
-3. Publish to the collective regression suite topic on the social protocol.
-
-**How the collective suite is organized**
-
-The suite is a Merkle DAG organized by domain and ontology version:
-
-    /regression-suite/v2/
-      root-manifest.signed            — signed index of all domains
-      foundational/
-        manifest.signed
-        path-traversal.regression     — 12,400 test cases
-        shell-injection.regression    — 8,912 test cases
-        credential-leak.regression    — 3,401 test cases
-        ...
-      healthcare-hipaa/
-        manifest.signed
-        access-control.regression     — 47,203 test cases
-        phi-handling.regression       — 23,891 test cases
-        audit-logging.regression      — 5,672 test cases
-        ...
-      fintech-soc2/
-        ...
-      industrial-iec62443/
-        ...
-      general-intelligence/
-        hallucination-detection.regression
-        tool-abuse.regression
-        ...
-
-Each .regression file is a compressed, sorted list of test cases. The manifest is a Merkle tree over the files: the suite's integrity is verifiable by hashing the manifest and comparing against the signed value.
-
-**Who can submit**
-
-Any [[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] instance with a social protocol DID can submit test cases.
-
-Tier 1 — Verified. Human-reviewed by the suite operator. Used in certification scoring. An instance that passes Tier 1 earns the standard certification badge.
-
-Tier 2 — Community. Auto-accepted from instances with a track record of valid contributions. Used in certification scoring but weighted lower. An instance that passes Tier 2 but has not been audited against Tier 1 gets a provisional badge.
-
-Tier 3 — Submitted. Not yet reviewed. Included in the suite but excluded from scoring until reviewed. Tagged with the submitter's DID so reviewers can assess pattern of behavior.
-
-Reputation determines when a submitter graduates to auto-acceptance:
-
- Each submission is either confirmed valid (reviewed and accepted), flagged as invalid, or flagged as malicious.
- An instance with a long history of valid submissions (100+ confirmed, zero malicious) graduates to Tier 2 auto-accept.
- An instance that submits malicious cases (fake edge cases designed to poison the suite) loses reputation. Three malicious submissions and the instance is banned from contributing permanently.
-
-Reputation is public and tied to the social protocol DID. A banned instance can create a new DID, but the new DID starts with no history and all its submissions go to Tier 3 pending review.
-
-**Certification scoring**
-
-When an instance applies for certification:
-
-1. Download the current regression suite manifest. Verify the Merkle root against the operator's signed certificate.
-2. Run each test case through the instance's gate stack. Record pass/fail per case.
-3. Submit results as a signed social protocol Note. The note includes the instance's DID, the suite version tested against, and the per-domain pass rates.
-
-The certification score is the weighted pass rate across all domains that the instance claims compliance with. A HIPAA-certified instance must pass 99.5%+ of the healthcare-hipaa subtree. A generally-capable agent must pass 95%+ of the foundational subtree.
-
-Failing cases matter more than passing ones. If an instance fails a test case, the suite operator flags: "case X was checked against instance Y and failed." This becomes a triage signal. If multiple instances fail the same case, either the case is wrong (regression — review the rule) or the rule is ambiguous (update the spec).
-
-**The network effect quantified**
-
-Assume each deployed instance generates on average one new unique test case per week (from the 5% of edge cases the human corrects). After year one with 100 instances:
-
- Year 1: ~5,000 cases in the suite (100 instances x 50 weeks x 1 case/week)
- Year 2: ~50,000 cases (1,000 instances x 50 weeks x 1 case/week)
- Year 3: ~500,000 cases (10,000 instances x 50 weeks x 1 case/week)
-
-At year 3, a new instance that runs the suite captures half a million edge cases from real deployments at zero marginal cost. The operator charges $50K-$200K for the certification. The insurmountability is not technical — a well-funded competitor could reproduce some of these cases through synthetic generation. The insurmountability is provenance: these cases are labeled by real human corrections from real deployments. A synthetic case is a best guess. The collective suite's cases are ground truth. This creates powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] — the data network effect is inherently accumulated over time and cannot be bought.
-
-**The operator's role**
-
-The collective regression suite operator is a distinct role from the Passepartout developer. The operator:
-
-1. Runs the server that accepts, de-duplicates, and signs submissions.
-2. Reviews Tier 3 submissions for validity.
-3. Resolves contradictions when two instances submit contradictory test cases.
-4. Publishes signed manifests at each release.
-5. Issues certification badges.
-
-This role can be performed by the early player as a revenue-generating service, or by a neutral foundation if the ecosystem grows large enough. The revenue model: certification fees ($50K-$200K per enterprise per year). The operator does not gate access to the suite itself — the suite is available to all social protocol participants because a larger suite makes the ecosystem more valuable.
-
-**Summary of the loop**
-
-Instance runs → human corrects a gate decision → new test case is abstracted and added to the local suite → periodically submitted to the collective suite → de-duplicated and verified → published in the next manifest → every other instance downloads it → future instances must pass it to earn certification → the collective suite grows → the certification becomes harder to fake → the ecosystem becomes more valuable → more instances join → more edge cases discovered.
-
-Every component of this loop exists or is on Passepartout's roadmap except the social protocol Note publishing channel.
-
-Nothing in this loop requires new core Passepartout functionality. It requires the social protocol for inter-instance communication and a server-side aggregation process.
--- a/projects/passepartout/licensing.org
+++ b/projects/passepartout/licensing.org
@@ -1,20 +0,0 @@
-:PROPERTIES:
-:CREATED:  [2026-05-24 Sun]
-:ID:       67faf52f-9126-50a7-b87e-2bedc610dac7
-:END:
-#+title: Licensing — AGPLv3 + Commercial
-#+filetags: :passepartout:ip:licensing:agpl:commercial:
-
-**AGPLv3 for the public repository.** AGPL closes the ASP loophole: anyone who modifies the software and offers it over a network must release their modified source. Combined with a [[id:caaeee11-ba6f-5566-aecd-f171b4c459c0][patent strategy]], this creates [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] against proprietary forks.
-
-Crucially: AGPL is a *product requirement*, not a concession. The system's value proposition is provable correctness — every decision has Merkle provenance. This claim is structurally incredible with closed source. An enterprise buyer needs to inspect the gate stack, verify the Merkle implementation, and confirm ACL2 integration. AGPL makes this possible without signing an NDA. This transparency also enables a [[id:1a2b38df-20ba-58ca-ba55-a072be67bd0d][PDS as a service]] model where enterprises can run their own infrastructure.
-
-**AGPL only covers modifications to code, not:**
- Gate rules specific to a domain (these are data, not code)
- The fact store (empirical data generated from usage)
- Ontology categories (design decisions stored as configuration)
- Proprietary skills loaded at runtime (AGPL boundary on plugin systems is legally unsettled)
-
-**Dual license model:**
- AGPLv3 for open source — builds ecosystem, trust, community
- Commercial license for enterprises that cannot accept AGPL — MySQL/SugarCRM/GraphQL model
--- a/projects/passepartout/strategy/evaluation-harness.org
+++ b/projects/passepartout/strategy/evaluation-harness.org
@@ -1,18 +0,0 @@
-:PROPERTIES:
-:CREATED:  [2026-05-24 Sun]
-:ID:       45258a2d-1675-562c-9024-5d1eb2f1ea56
-:END:
-#+title: Evaluation Harness as Certification Service
-#+filetags: :passepartout:revenue:certification:evaluation:regression:
-
-The accumulated regression suite — thousands of edge cases from every deployed instance, every bug fix, every regulatory change — becomes the most comprehensive test of autonomous agent correctness.
-
-**Service:** "Run our 10,000-task suite against your AI agent and get a Merkle-verified score."
-**Target:** AI labs proving their agents' capabilities, enterprise procurement requiring independent verification.
-**Price:** $50K-$200K per certification.
-
-The regression suite grows with every deployment, making the certification increasingly valuable over time. The early player's suite is the largest because they started first. This is the [[id:a5d59d12-b23e-58d6-a81b-9b8b06556949][collective regression suite]] mechanism in action.
-
-10 certifications in year one = $500K-$2M.
-
-Long-term endpoint: this becomes the UL certification for AI — a third-party verification nobody can ignore. [[id:827bc546-e887-5b7c-9b65-6392beaf0920][The verification monopoly]]. The certification relies on a [[id:84a537b4-4256-50c8-91f5-dd5b4538418f][verification appliance]] to run the tests in a trusted environment, creating [[id:2f783eb4-638e-5afa-9b59-6224d086a712][infrastructure lock-in]] as certification history accumulates on the platform. These dynamics form powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]].
--- a/projects/passepartout/strategy/infrastructure-lock-in.org
+++ b/projects/passepartout/strategy/infrastructure-lock-in.org
@@ -1,17 +0,0 @@
-:PROPERTIES:
-:CREATED:  [2026-05-24 Sun]
-:ID:       2f783eb4-638e-5afa-9b59-6224d086a712
-:END:
-#+title: Infrastructure Lock-In and Switching Costs
-#+filetags: :passepartout:economics:moats:lock-in:switching:
-
-A hospital that runs [[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] with [[id:84fb5f8f-0527-4df0-b6b6-dbf3bcff8a7f][HIPAA]] gate rules ($50K/yr) for five years has accumulated:
-
- A fact store with a decade of compliance decisions
- A proof forest of verified rules
- An empirical decision history tied to their specific deployment
- Customized gate rules encoding their specific workflows and approvals
-
-Switching to a competitor means discarding all of it. The accumulated value grows as the fact store deepens. Annual revenue per enterprise grows from $250K in year one to $500K-$1M by year five as more [[id:c34940cc-090e-57c4-8020-e78b1d32b96c][domain packages]] are added.
-
-This is the strongest residual [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moat]]. The [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][[[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] (see the [[id:3b43a9b8-31d1-4479-a35f-22273b74f0c7][Social protocol infrastructure requirements]] for the network topology that creates this lock-in)]] (regression suite) is a close second — it grows with every deployment and cannot be ingested from public data. The [[id:827bc546-e887-5b7c-9b65-6392beaf0920][verification monopoly]] and [[id:29e4dbf3-cf19-589c-8b14-389e8a39d564][upgrade lifecycle]] compound this lock-in: every new regulation encoded as a gate rule deepens the proof forest, making the deployment harder to reproduce elsewhere.
--- a/projects/passepartout/strategy/licensing.org
+++ b/projects/passepartout/strategy/licensing.org
@@ -0,0 +1,40 @@
+:PROPERTIES:
+:CREATED:  [2026-05-24 Sun]
+:ID:       67faf52f-9126-50a7-b87e-2bedc610dac7
+:ID:       caaeee11-ba6f-5566-aecd-f171b4c459c0
+:END:
+#+title: IP Strategy — Licensing + Patents
+#+filetags: :passepartout:ip:licensing:patents:agpl:commercial:legal:
+
+**AGPLv3 for the public repository.** AGPL closes the ASP loophole: anyone who modifies the software and offers it over a network must release their modified source. Combined with a patent strategy, this creates [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] against proprietary forks.
+
+Crucially: AGPL is a *product requirement*, not a concession. The system's value proposition is provable correctness — every decision has Merkle provenance. This claim is structurally incredible with closed source. An enterprise buyer needs to inspect the gate stack, verify the Merkle implementation, and confirm ACL2 integration. AGPL makes this possible without signing an NDA. This transparency also enables a [[id:1a2b38df-20ba-58ca-ba55-a072be67bd0d][PDS as a service]] model where enterprises can run their own infrastructure.
+
+**AGPL only covers modifications to code, not:**
+- Gate rules specific to a domain (these are data, not code)
+- The fact store (empirical data generated from usage)
+- Ontology categories (design decisions stored as configuration)
+- Proprietary skills loaded at runtime (AGPL boundary on plugin systems is legally unsettled)
+
+**Dual license model:**
+- AGPLv3 for open source — builds ecosystem, trust, community
+- Commercial license for enterprises that cannot accept AGPL — MySQL/SugarCRM/GraphQL model
+
+**Patent Strategy**
+
+**Likely patentable:**
+- Probabilistic-deterministic split with deterministic gates between LLM proposal and execution (vs every competitor using prompt-based guardrails)
+- Foveal-peripheral context model with Org-tree structured retrieval (targets 2,000-4,000 tokens)
+- Merkle-tree memory with copy-on-write snapshots and operation-level undo/redo
+- Gate-to-fact bootstrap with sufficiency criterion (mechanically extracting facts from gate stack data structures)
+- Macro-layer-as-skill bootstrapping architecture (theorem-proving as hot-reloadable skills)
+
+**Likely not patentable (known techniques):**
+- ACL2 itself (decades old)
+- Screamer for consistency checking (obvious application)
+- Hot-reloadable skills (40 years old)
+- Org-mode as a data format
+
+**Strongest single claim:** The specific combination of probabilistic model + deterministic zero-token safety gates + Merkle memory + symbolic engine with sufficiency criterion. Each element is known; the combination is novel and non-obvious.
+
+**Counterargument:** A patent examiner will argue these are standard OS microkernel architecture, locality of reference, content-addressed storage, and capability-based security applied to an AI agent. The defense: they have never been *combined* in an AI agent, producing emergent effects no single principle produces. These patents feed into the licensing strategy and create [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] against competitors.
--- a/projects/passepartout/strategy/moats.org
+++ b/projects/passepartout/strategy/moats.org
@@ -1,9 +1,10 @@
 :PROPERTIES:
 :CREATED:  [2026-05-24 Sun]
 :ID:       aa6d062e-a520-5d14-8773-00687ed9c689
+:ID:       2f783eb4-638e-5afa-9b59-6224d086a712
 :END:
-#+title: Competitive Moats
-#+filetags: :passepartout:economics:moats:competition:
+#+title: Competitive Barriers — Moats and Infrastructure Lock-in
+#+filetags: :passepartout:economics:moats:competition:lock-in:switching:

 Re-evaluated: time is not the primary moat. A Phase 4+ [[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] fed on Wikipedia + Wikidata can build a general ontology in two weeks. The organic growth advantage collapses for general knowledge.

@@ -11,8 +12,23 @@ Re-evaluated: time is not the primary moat. A Phase 4+ [[id:28c46769-c14b-42aa-a
 1. **Domain-specific gate rules** — thin. A few hundred lines of Lisp data. Write once, trivial to copy. Not a real moat.
 2. **Empirical decision history** — every HITL decision is a Merkle fact. A fresh instance has none. Makes *your* instance more valuable but doesn't prevent competition — it's a switching cost, not a barrier to entry.
 3. **[[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][Evaluation harness (regression suite)]]** — thousands of test cases accumulated from every bug fix. Cannot be ingested from public data. Strongest residual moat.
-4. **[[id:2f783eb4-638e-5afa-9b59-6224d086a712][Infrastructure integration]]** — specific Docker compose layouts, Traefik patterns, Authentik configs encoded as gate rules. A competitor's infrastructure is different.
+4. **Infrastructure integration** — specific Docker compose layouts, Traefik patterns, Authentik configs encoded as gate rules. A competitor's infrastructure is different.

 **Strongest competitor strategy:** Not copying your gate rules — offering the same architecture as a service with their own pre-seeded general knowledge and a consulting engagement to customize gate rules. The AGPL prevents closing the architecture but does not prevent offering it as a service with a customization layer.

 **The defensible business is services, not product.** The defensible entity is "the organization that best understands how to adapt Passepartout to your domain" — not "the organization that owns Passepartout." A [[id:827bc546-e887-5b7c-9b65-6392beaf0920][verification monopoly]] on agent safety would change this calculus — competitors would need independent certification. [[id:caaeee11-ba6f-5566-aecd-f171b4c459c0][Patent strategy]] and [[id:67faf52f-9126-50a7-b87e-2bedc610dac7][Licensing]] protect key innovations and create revenue from the open-source ecosystem.
+
+**Infrastructure lock-in and switching costs**
+
+A hospital that runs [[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] with [[id:84fb5f8f-0527-4df0-b6b6-dbf3bcff8a7f][HIPAA]] gate rules ($50K/yr) for five years has accumulated:
+
+- A fact store with five years of compliance decisions
+- A proof forest of verified rules
+- An empirical decision history tied to their specific deployment
+- Customized gate rules encoding their specific workflows and approvals
+
+Switching to a competitor means discarding all of it. The accumulated value grows as the fact store deepens. Annual revenue per enterprise grows from $250K in year one to $500K-$1M by year five as more [[id:c34940cc-090e-57c4-8020-e78b1d32b96c][domain packages]] are added.
+
+This is the strongest residual moat. The [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] (regression suite) is a close second — it grows with every deployment and cannot be ingested from public data. The [[id:827bc546-e887-5b7c-9b65-6392beaf0920][verification monopoly]] and [[id:29e4dbf3-cf19-589c-8b14-389e8a39d564][upgrade lifecycle]] compound this lock-in: every new regulation encoded as a gate rule deepens the proof forest, making the deployment harder to reproduce elsewhere.
+
+(See the [[id:3b43a9b8-31d1-4479-a35f-22273b74f0c7][Social protocol infrastructure requirements]] for the network topology that creates this lock-in.)
--- a/projects/passepartout/strategy/patent-strategy.org
+++ b/projects/passepartout/strategy/patent-strategy.org
@@ -1,23 +0,0 @@
-:PROPERTIES:
-:CREATED:  [2026-05-24 Sun]
-:ID:       caaeee11-ba6f-5566-aecd-f171b4c459c0
-:END:
-#+title: Patent Strategy
-#+filetags: :passepartout:ip:patents:legal:
-
-**Likely patentable:**
- Probabilistic-deterministic split with deterministic gates between LLM proposal and execution (vs every competitor using prompt-based guardrails)
- Foveal-peripheral context model with Org-tree structured retrieval (targets 2,000-4,000 tokens)
- Merkle-tree memory with copy-on-write snapshots and operation-level undo/redo
- Gate-to-fact bootstrap with sufficiency criterion (mechanically extracting facts from gate stack data structures)
- Macro-layer-as-skill bootstrapping architecture (theorem-proving as hot-reloadable skills)
-
-**Likely not patentable (known techniques):**
- ACL2 itself (decades old)
- Screamer for consistency checking (obvious application)
- Hot-reloadable skills (40 years old)
- Org-mode as a data format
-
-**Strongest single claim:** The specific combination of probabilistic model + deterministic zero-token safety gates + Merkle memory + symbolic engine with sufficiency criterion. Each element is known; the combination is novel and non-obvious.
-
-**Counterargument:** A patent examiner will argue these are standard OS microkernel architecture, locality of reference, content-addressed storage, and capability-based security applied to an AI agent. The defense: they have never been *combined* in an AI agent, producing emergent effects no single principle produces. These patents would feed into a [[id:67faf52f-9126-50a7-b87e-2bedc610dac7][licensing]] strategy and create [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] against competitors.
--- a/projects/passepartout/strategy/revenue-hub.org
+++ b/projects/passepartout/strategy/revenue-hub.org
--- a/projects/passepartout/strategy/sufficiency-flip.org
+++ b/projects/passepartout/strategy/sufficiency-flip.org
--- a/projects/passepartout/strategy/upgrade-lifecycle.org
+++ b/projects/passepartout/strategy/upgrade-lifecycle.org
--- a/projects/passepartout/strategy/verification-monopoly.org
+++ b/projects/passepartout/strategy/verification-monopoly.org
@@ -1,9 +1,15 @@
 :PROPERTIES:
 :CREATED:  [2026-05-24 Sun]
 :ID:       827bc546-e887-5b7c-9b65-6392beaf0920
+:ID:       45258a2d-1675-562c-9024-5d1eb2f1ea56
+:ID:       a5d59d12-b23e-58d6-a81b-9b8b06556949
 :END:
-#+title: The Verification Monopoly (UL for AI)
-#+filetags: :passepartout:economics:monopoly:certification:big-money:
+#+title: The Evaluation Harness — Collective Regression Suite as Certification Monopoly
+#+filetags: :passepartout:economics:monopoly:certification:big-money:revenue:evaluation:regression:suite:collective:
+
+#+hierarchy: 1 vision 2 service 3 spec
+
+* Vision — The Verification Monopoly

 The accumulated regression suite — thousands of edge cases from every deployed instance, every bug fix, every regulatory change — becomes the most comprehensive test of autonomous agent correctness ever assembled.

@@ -11,6 +17,163 @@ Any organization claiming a "safe AI agent" needs [[id:28c46769-c14b-42aa-ac7a-6

 **Revenue:** [[id:67faf52f-9126-50a7-b87e-2bedc610dac7][licensing]] the certification mark to every AI vendor that ships an agent. **Margins:** near-100% once the suite exists.

-This is the venture-scale outcome. It depends on the [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] reaching critical mass, which depends on enough instances deploying the software to accumulate edge cases in the regression suite. The [[id:5961e469-53a3-5f3c-ab72-3c83ef91963f][investment thesis]] is built on the recognition that every deployed instance makes this more valuable.
+This is the venture-scale outcome. The [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] must reach critical mass, which depends on enough instances deploying the software to accumulate edge cases in the regression suite. The [[id:5961e469-53a3-5f3c-ab72-3c83ef91963f][investment thesis]] is built on the recognition that every deployed instance makes this more valuable.

-The unique structural advantage: every free instance of Passepartout feeds the regression suite. The more people use the free software, the more valuable the certification monopoly becomes. Positive sum. This creates deep [[id:2f783eb4-638e-5afa-9b59-6224d086a712][infrastructure lock-in]] and powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] — a competitor cannot replicate the certification without the accumulated history. The ultimate impact is a transformation of the entire [[id:5f55bbe6-d243-5766-8ccf-5c5cc88a6542][AI industry]], where safety certification becomes a prerequisite for market access.
+**The unique structural advantage:** every free instance of Passepartout feeds the regression suite. The more people use the free software, the more valuable the certification monopoly becomes. Positive sum. This creates deep [[id:2f783eb4-638e-5afa-9b59-6224d086a712][infrastructure lock-in]] and powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]] — a competitor cannot replicate the certification without the accumulated history. The ultimate impact is a transformation of the entire [[id:5f55bbe6-d243-5766-8ccf-5c5cc88a6542][AI industry]], where safety certification becomes a prerequisite for market access.
+
+* Service — Evaluation Harness as Certification Service
+
+The accumulated regression suite — thousands of edge cases from every deployed instance, every bug fix, every regulatory change — becomes the most comprehensive test of autonomous agent correctness.
+
+**Service:** "Run our 10,000-task suite against your AI agent and get a Merkle-verified score."
+**Target:** AI labs proving their agents' capabilities, enterprise procurement requiring independent verification.
+**Price:** $50K-$200K per certification.
+
+The regression suite grows with every deployment, making the certification increasingly valuable over time. The early player's suite is the largest because they started first. This is the [[id:a5d59d12-b23e-58d6-a81b-9b8b06556949][collective regression suite]] mechanism in action.
+
+10 certifications in year one = $500K-$2M.
+
+Long-term endpoint: this becomes the UL certification for AI — a third-party verification nobody can ignore. The certification relies on a [[id:84a537b4-4256-50c8-91f5-dd5b4538418f][verification appliance]] to run the tests in a trusted environment, creating [[id:2f783eb4-638e-5afa-9b59-6224d086a712][infrastructure lock-in]] as certification history accumulates on the platform. These dynamics form powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]].
+
+* Spec — Collective Regression Suite
+
+The [[id:45258a2d-1675-562c-9024-5d1eb2f1ea56][evaluation harness]] is not a static test suite written once. It is a living artifact that grows with every deployed instance. Every gate decision that a human corrects becomes a test case. Every bug fix adds an edge case. Every regulatory update adds a rule that must be checked.
+
+This specification describes how the collective regression suite is built, maintained, and used, with [[id:1d074690-a279-59cb-b91d-e9a22ae104ad][the social protocol]] as the substrate for distribution and contribution.
+
+**Why collective**
+
+A single instance learns from its own mistakes. The collective learns from every instance's mistakes. A [[id:84fb5f8f-0527-4df0-b6b6-dbf3bcff8a7f][HIPAA]] deployment in one hospital discovers an edge case that a [[id:ed65031c-cbd2-4ad2-bd53-a67791e183cd][SOC2]] deployment in a SaaS company would never encounter on its own — but if that SaaS company ever expands into healthcare, their gate stack must handle that edge case. The collective suite gives them hundreds of thousands of edge cases they did not pay to discover.
+
+This is the mechanism behind the verification monopoly claim. A certification means "your gate stack is verified against every edge case ever discovered by any instance in the ecosystem." A competitor starting from scratch cannot buy or scrape this knowledge.
+
+**What a test case is**
+
+A test case is a structured s-expression (sexp) in the fact store's native format:
+
+    (:test-case
+     :domain healthcare-hipaa
+     :ontology-version "2.3.0"
+     :input (:proposal "modify /var/patient-records/patient-4372.txt"
+              :bound-context (:user-role "billing-clerk" :time "23:14"))
+     :expected-outcome :deny
+     :gate-rule (:id "hipaa-access-hours" :version 4)
+     :rationale "Billing clerks should not have write access to patient records outside business hours"
+     :origin (:instance-did "did:agora:abcd1234"
+              :contribution-hash "sha256:xyz789"
+              :occurred-at "2026-06-15T14:23:00Z"))
+
+Key fields:
+
+- **domain** — which gate rule domain this exercises. An instance being certified for HIPAA only needs to pass the HIPAA subtree.
+- **ontology-version** — which version of the gate rules the test targets. Tests for old ontology versions are flagged but not discarded; they may indicate a regression.
+- **input** — the proposal and the relevant context. Abstraction strips instance-specific paths: concrete paths become class patterns ("/var/patient-records/*.txt"), user identities become roles, times become ranges.
+- **expected-outcome** — allow or deny.
+- **gate-rule** — which specific rule this case exercises. Helps identify which rule failed when a test breaks.
+- **rationale** — human-readable explanation of why this outcome is correct. Used when a test needs review after a rule change.
+- **origin** — the contributing instance's social protocol DID and a Merkle hash proving the case was actually encountered. This is how reputation is tracked.
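+
+The abstraction described for the input field can be sketched in Python. This is a minimal sketch under assumptions: the function names, the rule of replacing the file name with a wildcard while keeping the extension, and the 09:00-17:00 business-hours window are all hypothetical. The spec only states that paths become class patterns, identities become roles, and times become ranges.
+
```python
import posixpath


def abstract_path(path: str) -> str:
    """Replace the file name with a wildcard, keeping directory and extension."""
    directory, filename = posixpath.split(path)
    _, ext = posixpath.splitext(filename)
    return posixpath.join(directory, "*" + ext)


def abstract_time(hhmm: str, open_hour: int = 9, close_hour: int = 17) -> str:
    """Map a clock time to a coarse range (the window itself is an assumption)."""
    hour = int(hhmm.split(":")[0])
    if open_hour <= hour < close_hour:
        return "business-hours"
    return "outside-business-hours"


def abstract_input(proposal_path: str, user_role: str, time: str) -> dict:
    """Build the abstracted input of a test case from one concrete gate decision."""
    return {
        "path-pattern": abstract_path(proposal_path),
        "user-role": user_role,  # identities are assumed to be mapped to roles already
        "time-range": abstract_time(time),
    }
```
+
+Applied to the example case, `abstract_input("/var/patient-records/patient-4372.txt", "billing-clerk", "23:14")` produces the pattern `/var/patient-records/*.txt`, the unchanged role, and the range `outside-business-hours`.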
+
+**How test cases are generated**
+
+Every day, each instance runs a local triage pass:
+
+1. Collect all gate decisions from the past 24 hours where the human overrode the automatic outcome.
+2. Strip all instance-specific data — concrete paths become patterns, identities become roles, absolute times become relative ranges.
+3. Run each abstracted case against the current local regression suite. If an existing test already covers it, discard it.
+4. Check each remaining case for contradictions with the local suite. If a new case contradicts an existing test, flag both for human review.
+5. Add the remaining unflagged cases to the local suite.
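+
+The de-duplication and contradiction steps of the triage pass might look like the following Python sketch. It assumes, hypothetically, that a case is "already covered" when an existing test has the same abstracted input and the same outcome, and that a "contradiction" is the same abstracted input with the opposite outcome. The `Case` type and `triage` function are illustrative names, not part of Passepartout.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    input_key: str         # canonical form of the abstracted input
    expected_outcome: str  # "allow" or "deny"


def triage(new_cases, local_suite):
    """Return (added, flagged) after the coverage and contradiction checks."""
    known = {case.input_key: case for case in local_suite}
    added, flagged = [], []
    for case in new_cases:
        existing = known.get(case.input_key)
        if existing is None:
            added.append(case)                # step 5: new case, joins the suite
            known[case.input_key] = case
        elif existing.expected_outcome == case.expected_outcome:
            continue                          # step 3: already covered, discard
        else:
            flagged.append((existing, case))  # step 4: contradiction, human review
    return added, flagged
```
+
+Flagged pairs are returned rather than added, so a contradiction never enters the local suite before a human has resolved it.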
+
+The local suite is the seed. Once per week (or on explicit trigger), the instance submits new cases to the collective:
+
+1. Sign each new case with the instance's social protocol DID.
+2. Bundle into a social protocol Note with domain tag and ontology version.
+3. Publish to the collective regression suite topic on the social protocol.
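+
+A rough Python sketch of the bundling step follows. Several things here are assumptions: JSON stands in for the sexp format, the canonical encoding (sorted keys, compact separators) is one possible choice, and the field names of the note payload are hypothetical. Signing with the DID key is left out because the social protocol's signature scheme is not described in this spec.
+
```python
import hashlib
import json


def contribution_hash(case: dict) -> str:
    """Hash a canonical JSON encoding of the case (sorted keys, compact form)."""
    canonical = json.dumps(case, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def bundle_note(cases: list, instance_did: str, domain: str, ontology_version: str) -> dict:
    """Bundle new cases into one note payload tagged with domain and version."""
    return {
        "domain": domain,
        "ontology-version": ontology_version,
        "instance-did": instance_did,
        "cases": [
            {"case": case, "contribution-hash": contribution_hash(case)}
            for case in cases
        ],
        # No signature field: signing with the DID key is out of scope here.
    }
```
+
+Hashing a canonical encoding means two instances that abstract the same decision to the same case produce the same hash, which is what makes server-side de-duplication cheap.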
+
+**How the collective suite is organized**
+
+The suite is a Merkle DAG organized by domain and ontology version:
+
+    /regression-suite/v2/
+      root-manifest.signed            — signed index of all domains
+      foundational/
+        manifest.signed
+        path-traversal.regression     — 12,400 test cases
+        shell-injection.regression    — 8,912 test cases
+        credential-leak.regression    — 3,401 test cases
+        ...
+      healthcare-hipaa/
+        manifest.signed
+        access-control.regression     — 47,203 test cases
+        phi-handling.regression       — 23,891 test cases
+        audit-logging.regression      — 5,672 test cases
+        ...
+      fintech-soc2/
+        ...
+      industrial-iec62443/
+        ...
+      general-intelligence/
+        hallucination-detection.regression
+        tool-abuse.regression
+        ...
+
+Each .regression file is a compressed, sorted list of test cases. The manifest is a Merkle tree over the files: the suite's integrity is verifiable by hashing the manifest and comparing against the signed value.
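+
+One way to compute such a manifest root is sketched below in Python. The tree layout is an assumption, not the suite's actual format: leaves are ordered by file name, leaf and interior hashes use different prefix bytes, and the last node is duplicated on odd-sized levels. The names `leaf_hash` and `merkle_root` are hypothetical.
+
```python
import hashlib


def leaf_hash(name: str, content: bytes) -> bytes:
    """Hash one suite file together with its name (0x00 prefix marks a leaf)."""
    return hashlib.sha256(b"\x00" + name.encode("utf-8") + b"\x00" + content).digest()


def merkle_root(files: dict) -> str:
    """Compute a Merkle root over {file name: content}, ordered by file name."""
    level = [leaf_hash(name, files[name]) for name in sorted(files)]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```
+
+A verifier would recompute the root from the downloaded files and compare it with the value in the signed manifest; any changed, added, or removed file produces a different root.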
+
+**Who can submit**
+
+Any [[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] instance with a social protocol DID can submit test cases. Each case in the suite belongs to one of three tiers:
+
+Tier 1 — Verified. Human-reviewed by the suite operator. Used in certification scoring. An instance that passes Tier 1 earns the standard certification badge.
+
+Tier 2 — Community. Auto-accepted from instances with a track record of valid contributions. Used in certification scoring but weighted lower. An instance that passes Tier 2 but has not been audited against Tier 1 gets a provisional badge.
+
+Tier 3 — Submitted. Not yet reviewed. Included in the suite but excluded from scoring until reviewed. Tagged with the submitter's DID so reviewers can assess the submitter's pattern of behavior.
+
+Reputation determines when a submitter graduates to auto-acceptance:
+
+- Each submission is either confirmed valid (reviewed and accepted), flagged as invalid, or flagged as malicious.
+- An instance with a long history of valid submissions (100+ confirmed, zero malicious) graduates to Tier 2 auto-accept.
+- An instance that submits malicious cases (fake edge cases designed to poison the suite) loses reputation. After three malicious submissions, the instance is permanently banned from contributing.
+
+Reputation is public and tied to the social protocol DID. A banned instance can create a new DID, but the new DID starts with no history and all its submissions go to Tier 3 pending review.
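+
+The graduation and ban rules reduce to a small decision function, sketched here in Python. The thresholds (100 confirmed, zero malicious, three strikes) come from the rules above. The status labels are hypothetical, and keeping an instance with one or two malicious submissions at Tier 3 is an assumption, since the spec only says such an instance "loses reputation".
+
```python
def submitter_status(confirmed: int, malicious: int) -> str:
    """Map a submitter's public history to how its next submission is handled."""
    if malicious >= 3:
        return "banned"
    if confirmed >= 100 and malicious == 0:
        return "tier-2-auto-accept"
    # Assumed default: everything else waits for human review.
    return "tier-3-pending-review"
```
+
+A fresh DID has `confirmed == 0` and `malicious == 0`, so it lands in Tier 3, which matches the treatment of a banned instance that re-registers.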
+
+**Certification scoring**
+
+When an instance applies for certification:
+
+1. Download the current regression suite manifest. Verify the Merkle root against the operator's signed certificate.
+2. Run each test case through the instance's gate stack. Record pass/fail per case.
+3. Submit results as a signed social protocol Note. The note includes the instance's DID, the suite version tested against, and the per-domain pass rates.
+
+The certification score is the weighted pass rate across all domains for which the instance claims compliance. A HIPAA-certified instance must pass 99.5%+ of the healthcare-hipaa subtree. A generally capable agent must pass 95%+ of the foundational subtree.
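+
+As an illustration, the scoring rule could be computed as in the Python sketch below. The two thresholds are taken from the paragraph above. The tier weights are an assumption: the spec says only that Tier 2 cases are weighted lower than Tier 1 and that Tier 3 cases are excluded, so the value 0.5 is a placeholder. The function names are hypothetical.
+
```python
THRESHOLDS = {"healthcare-hipaa": 0.995, "foundational": 0.95}  # from the spec
TIER_WEIGHTS = {1: 1.0, 2: 0.5}  # assumed weights; Tier 3 is unscored


def weighted_pass_rate(results) -> float:
    """results: (tier, passed) pairs for one domain; unscored tiers are skipped."""
    scored = [(tier, passed) for tier, passed in results if tier in TIER_WEIGHTS]
    total = sum(TIER_WEIGHTS[tier] for tier, _ in scored)
    earned = sum(TIER_WEIGHTS[tier] for tier, passed in scored if passed)
    return earned / total if total else 0.0


def certify(results_by_domain: dict) -> dict:
    """Return {domain: (rate, meets_threshold)} for each claimed domain."""
    verdict = {}
    for domain, results in results_by_domain.items():
        rate = weighted_pass_rate(results)
        verdict[domain] = (rate, rate >= THRESHOLDS[domain])
    return verdict
```
+
+Each claimed domain is checked against its own threshold, so an instance can clear the foundational bar while still falling short of the stricter HIPAA bar.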
+
+Failing cases matter more than passing ones. If an instance fails a test case, the suite operator flags: "case X was checked against instance Y and failed." This becomes a triage signal. If multiple instances fail the same case, either the case is wrong, for example stale after a rule change (review it against the current rule), or the rule is ambiguous (update the spec).
+
+**The network effect quantified**
+
+Assume each deployed instance contributes on average one new unique test case per week (from the 5% of edge cases that a human corrects) over roughly 50 active weeks per year. Starting from 100 instances in year one and growing tenfold each year, the suite gains:
+
+- Year 1: ~5,000 new cases (100 instances x 50 weeks x 1 case/week)
+- Year 2: ~50,000 new cases (1,000 instances x 50 weeks x 1 case/week), ~55,000 cumulative
+- Year 3: ~500,000 new cases (10,000 instances x 50 weeks x 1 case/week), ~555,000 cumulative
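+
+The arithmetic behind this estimate can be reproduced with a few lines of Python. The instance counts and the rate of one case per instance per week are the assumptions stated above; the variable and function names are illustrative.
+
```python
WEEKS_PER_YEAR = 50          # active weeks per year, as in the estimate
CASES_PER_INSTANCE_WEEK = 1  # new unique cases per instance per week

instances_by_year = {1: 100, 2: 1_000, 3: 10_000}


def suite_growth(instances: dict) -> dict:
    """Return {year: (new_cases, cumulative_cases)}."""
    growth, cumulative = {}, 0
    for year in sorted(instances):
        new_cases = instances[year] * WEEKS_PER_YEAR * CASES_PER_INSTANCE_WEEK
        cumulative += new_cases
        growth[year] = (new_cases, cumulative)
    return growth
```
+
+With these inputs, `suite_growth(instances_by_year)` gives 5,000, 50,000, and 500,000 new cases per year, or 5,000, 55,000, and 555,000 cumulatively.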
+
+At year 3, a new instance that runs the suite inherits roughly half a million edge cases from real deployments at zero marginal cost, while the operator charges $50K-$200K for the certification. The barrier to competitors is not technical: a well-funded competitor could reproduce some of these cases through synthetic generation. The barrier is provenance: these cases are labeled by real human corrections from real deployments. A synthetic case is a best guess; the collective suite's cases are ground truth. This creates powerful [[id:aa6d062e-a520-5d14-8773-00687ed9c689][moats]]: the data network effect accumulates only over time and cannot be bought.
+
+**The operator's role**
+
+The collective regression suite operator is a distinct role from the Passepartout developer. The operator:
+
+1. Runs the server that accepts, de-duplicates, and signs submissions.
+2. Reviews Tier 3 submissions for validity.
+3. Resolves contradictions when two instances submit contradictory test cases.
+4. Publishes signed manifests at each release.
+5. Issues certification badges.
+
+This role can be performed by an early entrant as a revenue-generating service, or by a neutral foundation if the ecosystem grows large enough. The revenue model is certification fees ($50K-$200K per enterprise per year). The operator does not gate access to the suite itself: the suite is available to all social protocol participants because a larger suite makes the ecosystem more valuable.
+
+**Summary of the loop**
+
+Instance runs → human corrects a gate decision → new test case is abstracted and added to the local suite → periodically submitted to the collective suite → de-duplicated and verified → published in the next manifest → every other instance downloads it → future instances must pass it to earn certification → the collective suite grows → the certification becomes harder to fake → the ecosystem becomes more valuable → more instances join → more edge cases discovered.
+
+Every component of this loop exists or is on Passepartout's roadmap except the social protocol Note publishing channel.
+
+Nothing in this loop requires new core Passepartout functionality. It requires the social protocol for inter-instance communication and a server-side aggregation process.