per-domain flip: knowledge types and fastest acquisition
- Sufficiency flip is per-domain, not global. Poetry never flips.
- Three knowledge types: structural (published rules), empirical (observations), performance (profiling data)
- Fastest acquisition: active sandboxed probing, contrastive queries to human (not waiting for HITL to accumulate), ontology transfer from related domains, benchmark harness
- Codified domain: flip within days (hours LLM + hours expert review)
- Uncodified learnable domain: flip within weeks (probe + real use)
- Never-flip domains: system is honest, LLM handles 100%
@@ -779,6 +779,77 @@ whether the bootstrap succeeds. A patient operator with a
destination as an operator with bottomless API credits: the
second arrives faster, but both arrive.

** The per-domain sufficiency flip

The sufficiency flip is not a single event. It happens
independently for each domain, and some domains never flip.
The flip point is determined by the kind of knowledge the
domain requires.

*** Knowledge types required for a flip

| Domain | Knowledge required | Can it flip? | How fast? |
|--------|--------------------|--------------|-----------|
| Shell safety, path rules | Structural only: the deployment's config, shell semantics | Yes | Immediately, ingested from config |
| Healthcare compliance | Structural (HIPAA text) + empirical (human reviews of edge cases) | Yes | Weeks: one pass of LLM translation + human review cycle |
| Codebase refactoring | Structural (dependency graph, API surface) + empirical (test suite, build results) + performance (latency, throughput) | Yes | Months, depending on how many build cycles the system has observed |
| Microcode optimization | Structural (RISC-V ISA, core topology) + performance (profiling data) | Yes | Weeks, after enough benchmark runs to characterize hardware |
| Poetry, creative writing | Neither structural rules nor empirical ground truth (beauty is subjective) | **Never** | N/A: the gate stack cannot verify aesthetic quality |
| Novel scientific discovery | Structural (known laws) + empirical (experiments) | Eventually | Years: requires experimental data the system must gather through instruments |

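The relationship between required knowledge types and the flip can be sketched as a small data model. This is a minimal illustration, not the project's implementation: the names `Knowledge`, `Domain`, `can_flip`, and `has_flipped` are hypothetical, and the rule "a domain with no verifiable knowledge type never flips" is an assumption read off the poetry row.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Set


class Knowledge(Enum):
    STRUCTURAL = "structural"    # published rules, configs, specs
    EMPIRICAL = "empirical"      # observations, test results, human reviews
    PERFORMANCE = "performance"  # profiling and benchmark data


@dataclass(frozen=True)
class Domain:
    name: str
    # Knowledge types the domain needs; an empty set means nothing is verifiable.
    required: FrozenSet[Knowledge]


def can_flip(domain: Domain) -> bool:
    """Assumption: a domain can flip only if some verifiable knowledge type applies."""
    return len(domain.required) > 0


def has_flipped(domain: Domain, acquired: Set[Knowledge]) -> bool:
    """Flipped once every required knowledge type has been acquired."""
    return can_flip(domain) and domain.required <= acquired


shell = Domain("shell safety", frozenset({Knowledge.STRUCTURAL}))
refactoring = Domain("codebase refactoring", frozenset(Knowledge))
poetry = Domain("poetry", frozenset())
```

Under this sketch, `has_flipped(shell, {Knowledge.STRUCTURAL})` is true as soon as the config is ingested, while `poetry` stays unflipped no matter what is acquired, because each domain is evaluated on its own.
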
*** The fastest acquisition strategy per domain

The goal: reach the flip point with the fewest calendar days
and the fewest human hours.

| Knowledge type | Acquisition strategy | Calendar time | Human time |
|----------------|----------------------|---------------|------------|
| **Structural** (published rules, configs, specs) | LLM translation of source documents + ACL2 consistency verification + one-shot human review | Hours for the LLM pass, days for human review | Days: one domain expert reviewing the output |
| **Structural** (unpublished: your deployment, your codebase) | Automated scanning: the system walks your filesystem, reads your configs, builds the dependency graph | Minutes to hours | Zero, fully automated |
| **Empirical** (what happens when X?) | **Active probing**: the system does not wait for user interactions. It probes its own environment: runs shell commands in a sandbox to verify gate rules, executes test suites to verify the dependency graph, measures what happens when it pushes boundaries. | Hours to days | Zero, automated sandboxed probing |
| **Empirical** (what does the human prefer?) | **Contrastive queries**: instead of waiting for HITL approvals to accumulate, the system asks targeted questions: "Which of these two interpretations of the regulation is correct? This one or this one?" Each question produces a fact. | Days: a batch of targeted questions answered in one session | Hours: one review session with the domain expert |
| **Performance** (latency, throughput) | **Benchmark harness**: the system runs its own workload at varying parameters, records timing data, stores it as facts with `:provenance :benchmark` | Hours, automated sweep | Zero |
| **Transfer** from related domains | **Ontology alignment**: if the system already knows compliance for GDPR, it can ask Screamer: "Which GDPR rules have the same structure as HIPAA rules? Use those as seed hypotheses, flag for human review." The transferred rules flip faster because Screamer starts from a consistent foundation, not from scratch. | Days: Screamer finds alignments automatically, human reviews the suggestions | Hours: review the cross-domain alignment suggestions |

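The benchmark harness row can be illustrated with a short sketch. It is written in Python for illustration only; the function name `benchmark_sweep`, the fact layout, and the choice to keep the best of several repeats are assumptions, with the `"provenance": "benchmark"` field standing in for the `:provenance :benchmark` tag mentioned in the table.

```python
import time
from typing import Callable, Dict, Iterable, List


def benchmark_sweep(workload: Callable[[int], object],
                    params: Iterable[int],
                    repeats: int = 3) -> List[Dict[str, object]]:
    """Run the workload at each parameter value and record timing facts."""
    facts = []
    for param in params:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(param)
            timings.append(time.perf_counter() - start)
        facts.append({
            "param": param,
            "best_seconds": min(timings),  # best-of-N damps scheduling noise
            "repeats": repeats,
            "provenance": "benchmark",     # mirrors :provenance :benchmark
        })
    return facts


# Hypothetical workload: summing a range of the given size.
facts = benchmark_sweep(lambda n: sum(range(n)), params=[1_000, 10_000, 100_000])
```

Each entry in `facts` is one performance fact per parameter value. Tagging provenance at write time lets later consumers tell measured data apart from translated rules or human answers.
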
*** The fastest path to flip any domain

1. **Ingest all published text** for the domain (laws, specs, configs)
   via LLM translation. One pass. Hours.

2. **Run the benchmark harness** to measure the system's own performance
   in the domain. One sweep. Hours.

3. **Run active sandboxed probes**: test each gate rule against
   synthetic inputs to verify it behaves as expected. Automated.

4. **Generate contrastive queries**: Screamer identifies the 5%
   of rules where the LLM's translation is most uncertain (contradicts
   a transferred rule, has no precedent, has multiple valid
   interpretations). Present these as yes/no questions to the human
   domain expert in a single session.

5. **Start serving real interactions.** Every gate outcome generates
   an empirical fact that Screamer feeds back into the rule set.
   The empirical loop tightens from the first real interaction.

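Step 4 above can be sketched as a ranking problem. This is a Python illustration under stated assumptions, not how Screamer performs the selection: the `Rule` fields, the additive `uncertainty` score, and the question wording are all hypothetical. Only the three uncertainty signals and the 5% budget come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    rule_id: str
    text: str
    contradicts_transfer: bool = False  # conflicts with a transferred rule
    no_precedent: bool = False          # nothing comparable seen before
    interpretations: int = 1            # number of valid readings

    def uncertainty(self) -> int:
        # Hypothetical score: one point per signal, plus one per extra reading.
        return (int(self.contradicts_transfer)
                + int(self.no_precedent)
                + max(0, self.interpretations - 1))


def contrastive_queries(rules: List[Rule], percent: int = 5) -> List[str]:
    """Return yes/no questions for the most uncertain `percent` of rules."""
    budget = (len(rules) * percent + 99) // 100  # integer ceiling
    uncertain = [r for r in rules if r.uncertainty() > 0]
    ranked = sorted(uncertain, key=lambda r: r.uncertainty(), reverse=True)
    return [f"Is this reading of {r.rule_id} correct? {r.text}"
            for r in ranked[:budget]]


rules = [Rule(f"rule-{i}", f"translated text {i}") for i in range(40)]
rules[3].no_precedent = True
rules[3].interpretations = 2
rules[7].contradicts_transfer = True
rules[9].contradicts_transfer = True
queries = contrastive_queries(rules)
```

With 40 rules the budget is two questions, so the expert sees the two highest-scoring rules and the third uncertain rule waits for the empirical loop in step 5. Capping the batch is what keeps the review to a single session.
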
For a codified domain (healthcare compliance, financial regulation,
industrial safety): flip within days of completing steps 1-4. The only
bottleneck is the domain expert's review session in step 4, a few
hours of human time.

For an uncodified but learnable domain (codebase refactoring):
flip within weeks of steps 2-3 (benchmark harness and sandboxed
probes) + step 5 (real interactions). No LLM translation needed: the
knowledge comes from the system probing its own environment.

For a domain that can never flip (poetry, aesthetics): the system
never reaches sufficiency. It never claims to. The TUI shows
"Symbolic index: 0 facts. This domain has no codifiable rules."
The LLM handles 100% of poetry interactions. The gate stack
only checks for safety (no shell commands, no file deletions)
and passes through everything else to the LLM. The system is
honest about its frontier.

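The never-flip behavior can be sketched as a small routing function. This is a Python illustration, not the actual gate stack: `Request`, `UNSAFE_ACTIONS`, `route`, and the non-zero status format are hypothetical names and choices. Only the zero-fact TUI message and the two safety checks are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed action labels for the two safety checks named in the text.
UNSAFE_ACTIONS = {"shell", "delete_file"}


@dataclass
class Request:
    domain: str
    text: str
    actions: List[str] = field(default_factory=list)  # side effects requested


def status_line(fact_count: int) -> str:
    """TUI line for a domain; only the zero-fact wording comes from the source."""
    if fact_count == 0:
        return "Symbolic index: 0 facts. This domain has no codifiable rules."
    return f"Symbolic index: {fact_count} facts."  # hypothetical format


def route(request: Request, fact_count: int) -> str:
    """Safety is always checked; a zero-fact domain then goes straight to the LLM."""
    if UNSAFE_ACTIONS & set(request.actions):
        return "blocked"
    if fact_count == 0:
        return "llm"
    return "symbolic"  # placeholder for domains that have flipped


sonnet = Request("poetry", "Write a sonnet about autumn.")
risky = Request("poetry", "Write a poem, then clean up.", actions=["delete_file"])
```

Here `route(sonnet, 0)` returns `"llm"` and `route(risky, 0)` returns `"blocked"`, so the safety gates stay active even in a domain where the symbolic index holds no facts.
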
Large refactoring projects (extract module, rename API, split monolith)
are the hardest test for any AI agent. Current approaches (Claude Code,
Copilot) handle them probabilistically: every step costs tokens, and