Break design/ files into subfolders by ** heading (43 new files)

This commit is contained in:
Hermes
2026-06-04 20:47:17 +00:00
parent 9ce3883005
commit 284c5bd324
59 changed files with 1436 additions and 1093 deletions

View File

@@ -0,0 +1,9 @@
:PROPERTIES:
:CREATED: [2026-05-11 Mon]
:ID: bc537cf8-5ac8-46b9-a625-1d4f678ee9fc
:END:
* Knowledge Sources
- [[file:semantic-wikipedia-as-entity-backbone.org][Semantic Wikipedia as Entity Backbone]] — The gate stack provides 50-70 entity classes — adequate for a coding agent where
- [[file:empirical-validation-momo-and-modular-ontology-engineering.org][Empirical Validation — MOMo and Modular Ontology Engineering]] — Shimizu and Hitzler (2025, /Journal of Web Semantics/) argue that LLMs can signi

View File

@@ -0,0 +1,38 @@
---
title: Empirical Validation — MOMo and Modular Ontology Engineering
type: reference
tags: :passepartout:architecture:
---
* Empirical Validation — MOMo and Modular Ontology Engineering
:PROPERTIES:
:ID: b572b3a0-0238-470f-9bf8-63d356f68fe0
:ID: design-momo
:CREATED: [2026-05-08 Fri]
:WEIGHT: 40
:END:
Shimizu and Hitzler (2025, /Journal of Web Semantics/) argue that LLMs can significantly accelerate knowledge graph and ontology engineering — modeling, extension, population, alignment, and entity disambiguation — but /only/ if ontologies are modular.
*** The central finding: modularity is the key variable
In a complex ontology alignment task, an LLM without module information detected correct mappings for 5 of 109 alignment rules — effectively useless. When the same LLM was given the module structure of the target ontology (20 named conceptual modules), it detected correct mappings for 104 of 109 rules — 95% accuracy. The variable was modularity.
For ontology population (extracting triples from text), their best results came from prompts that included a schematic representation of a /single module/ plus one extraction example. Against ground truth, this achieved approximately 90% extraction accuracy. Without module-scoped prompting, quality degraded substantially.
The mechanism: conceptual modules scope the LLM's attention to something human-sized. The paper's central claim — "by somehow limiting the scope, we achieve a more human-like approach — and one more capable of being expressed succinctly in language" — is an independent discovery of the same principle underlying Passepartout's domain-scoped Screamer checks and per-domain cardinality policies.
*** What Passepartout should adopt
*The modular prompt pattern.* The archivist should use module-scoped prompts: a schematic representation of a domain module plus a single extraction example. Instead of a generic "extract triples" prompt, the prompt should reference the relevant module(s) and include an example triple for each relation in that module. The module provides /context/; the example provides /format/. Both improve LLM extraction quality without increasing Screamer's verification burden.
*MOMo modules as ontology scaffold.* The 50-70 gate-bootstrapped entity classes are starvation for the broader memex. MOMo's micropattern library provides a ready-made scaffold — hundreds of commonsense patterns for temporal relations, spatial relations, agent-action, organizational structure, provenance, and event participation. Loading these as initial modules — with =:policy :plural= and =:provenance :external-ontology= — would give the symbolic index a structured vocabulary for domains where the gate stack has nothing to offer. Organic growth then /extends and refines/ these modules rather than inventing them from scratch.
*Cross-source validation.* The archivist can extract facts from the user's prose, extract facts from Wikidata for the same entities, and present disagreements with provenance. This is the =:plural= cardinality policy applied at extraction time.
The paper validates three design decisions already made: (1) modularity is non-negotiable — the difference between 5% and 95% accuracy; (2) the extraction pipeline is feasible — 90% population accuracy with module-scoped prompts means the archivist /can/ extract useful facts, and the remaining 10% hallucination rate is what Screamer catches; (3) knowledge graphs are positioned as anti-hallucination infrastructure — the Passepartout thesis stated in the academic literature.
References:
- Shimizu, C., & Hitzler, P. (2025). Accelerating knowledge graph and ontology engineering with large language models. /Journal of Web Semantics, 85/, 100862.
- Shimizu, C., Hammar, K., & Hitzler, P. (2023). Modular ontology modeling. /Semantic Web, 14/(3), 459489.
- Norouzi, S.S. et al. (2024). Ontology Population using LLMs. arXiv:2411.01612.

View File

@@ -0,0 +1,31 @@
---
title: Semantic Wikipedia as Entity Backbone
type: reference
tags: :passepartout:architecture:
---
* Semantic Wikipedia as Entity Backbone
:PROPERTIES:
:ID: 183253ff-a0b7-468e-9858-6cb8ae2fb246
:ID: design-wikipedia
:CREATED: [2026-05-08 Fri]
:WEIGHT: 40
:END:
The gate stack provides 50-70 entity classes — adequate for a coding agent where the domain is bounded to files, commands, and code symbols. For a general-knowledge memex, 50-70 is starvation. Your memex mentions Nabokov, /Pale Fire/, Kinbote, Zembla, paranoid reading, unreliable narrators, postmodernism, butterfly migration, chess problems, and the Russian exile experience. The gate stack knows none of these. Organic growth through prose extraction would take years just to cover the entities in one person's engagement with a single novel.
Wikidata has already done this work: approximately 2 million entity classes, over 100 million entities, a decade of human curation. By loading the neighborhood of your memex into the symbolic index (entities referenced in your prose, plus their N-hop property net from Wikidata), the entity recognition problem vanishes. The archivist doesn't need to discover Nabokov from your diary. It needs to connect your heading to the existing Wikidata entity. That is a simpler task — reference resolution, not knowledge extraction.
The LLM's role shrinks to three thin boundaries:
1. *Input translation* — natural language question to structured query. "What do I think about monorepos?" → =(fact-query :entity :monorepo :relation :opinion :source :memex)=. Formulaic, ~100 tokens, any model sufficient.
2. *Prose to candidate triple* — for personal memex entries that have no Wikidata counterpart: your opinions, your day's events, your project plans. Proposals verified by Screamer before admission. This is the only extraction path that still requires an LLM, and its scope is limited to what Wikidata cannot provide.
3. *Result to prose* — structured answer to readable sentence. "Your 2023 diary says 8848m. Wikidata (last edited Feb 2024) says 8849m. They disagree on height." The reasoning is done; the LLM wraps the plist in grammar. ~100 tokens, any model sufficient, purely cosmetic.
Everything else — the gate stack, the fact store, the constraint solver, the type hierarchy, the provenance tracking, the contradiction surfacing, the cross-domain comparison — is pure deterministic Lisp with zero LLM tokens.
The decisive simplification: without Wikidata, the archivist must /discover/ entities from prose. With Wikidata loaded, the entity graph is pre-structured. The archivist's job changes from "discover that Nabokov wrote /Pale Fire/ and lectured on Kafka" to "verify that the Nabokov referenced in heading #47 is Wikidata item Q36591."
Wikidata facts are admitted with =:provenance :wikidata= and cardinality policy =:plural=. They do not override your memex's facts. They sit alongside them. Disagreements are surfaced, not resolved.