:PROPERTIES:
:ID:       7f4e6b9a-2c1d-5e8f-9a3b-6d7c4e5f2a1b
:CREATED:  [2026-05-23 Sat]
:END:
#+title: Passepartout Native Org-Mode Knowledge Base
#+filetags: :passepartout:roadmap:knowledge:org:gbrain:

** What

[[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout]] should be able to use Org-mode files directly as its
knowledge base — no pandoc conversion, no markdown intermediary.

Currently gbrain provides vector search and entity linking over markdown
only, so org files reach it through a conversion layer
(org → pandoc → markdown → gbrain).
This loses Org-mode semantics: properties drawers become flat YAML, tag
inheritance is lost, file: links become relative markdown links, TODO
states vanish, and the tree structure (headings with content subtrees)
collapses into flat markdown headings.

** Why

Org-mode's data model is strictly richer than markdown's. A Passepartout
that can ingest, index, and query org files natively has:
- Property-based entity extraction (no separate links: frontmatter needed)
- Tag-inheritance for automatic categorization
- TODO/priority/timestamps for knowledge freshness signals
- ID-based stable cross-references (org-id) that survive file moves
- Heading-level chunking (one heading = one knowledge unit)
- The same file format for everything — no split between "authoring format"
  and "knowledge base format"

** What it replaces

The current pipeline: org file → pandoc → markdown file → gbrain import →
gbrain embed → gbrain query. This is four serial steps with a conversion
at each boundary that degrades the data model.

The target: org file → (Passepartout-native indexer) → query. Zero
conversion, zero data loss.

** Architecture sketch

A Passepartout-native knowledge module that directly ingests
ideas/*.org:

- Parser: extract each heading as a chunk. Preserve:
  - Heading path (H1 → H2 → H3) as a hierarchical path
  - Properties drawer as structured metadata
  - file: links as typed entity references
  - org-id as stable identifier
  - Tags (inherited from parent headings)
  - TODO state, priority, timestamps

- Embedder: vector-embed each heading chunk with metadata prefix

- Query: hybrid search combining vector search over heading chunks with
  full-text search over their content. Each result includes the heading
  path and sibling headings for context.

- Cross-reference graph: build a typed entity graph from:
  - file: links → typed reference
  - org-id links → stable cross-doc reference
  - Tag co-occurrence → implicit relationship
  - Same-property values → attribute-based grouping

- Dream cycle: auto-discover entities from org properties and file:
  links. Enrich thin sections. Flag sections with stale timestamps.
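
The parser step above can be sketched in a few dozen lines. This is a minimal
illustration under stated assumptions, not the planned implementation: the
=Chunk= record and the =parse_org= function are hypothetical names, only Org's
default TODO keywords (TODO and DONE) are recognized, all tags are treated as
inherited (Org's default behaviour), and timestamps and links are left in the
body text rather than extracted.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Assumption: default TODO keywords only; custom keyword sets are not handled.
HEADING_RE = re.compile(
    r"^(?P<stars>\*+)\s+"
    r"(?:(?P<todo>TODO|DONE)\s+)?"
    r"(?:\[#(?P<priority>[A-Z])\]\s+)?"
    r"(?P<title>.*?)"
    r"(?:\s+(?P<tags>:[\w@#%:]+:))?\s*$"
)
PROPERTY_RE = re.compile(r"^\s*:(?P<key>[^:\s]+):\s*(?P<value>.*)$")
FILETAGS_RE = re.compile(r"^#\+filetags:\s*(?P<tags>.*)$", re.IGNORECASE)


@dataclass
class Chunk:  # hypothetical record type: one per heading
    path: list                 # heading titles from the top level down
    level: int                 # number of leading stars
    todo: Optional[str]
    priority: Optional[str]
    tags: set                  # own tags plus tags inherited from ancestors
    properties: dict = field(default_factory=dict)
    body: list = field(default_factory=list)


def split_tags(raw):
    """Turn ':a:b:' into {'a', 'b'}."""
    return {t for t in re.split(r"[:\s]+", raw or "") if t}


def parse_org(text):
    """Split Org text into one Chunk per heading (sketch, not a full parser)."""
    chunks, stack = [], []     # stack holds the open ancestor headings
    file_tags, in_drawer = set(), False
    for line in text.splitlines():
        m = HEADING_RE.match(line)
        if m:
            level = len(m["stars"])
            # Close every open heading at the same or a deeper level.
            while stack and stack[-1].level >= level:
                stack.pop()
            parent = stack[-1] if stack else None
            inherited = parent.tags if parent else file_tags
            chunk = Chunk(
                path=(parent.path if parent else []) + [m["title"]],
                level=level,
                todo=m["todo"],
                priority=m["priority"],
                tags=split_tags(m["tags"]) | inherited,
            )
            chunks.append(chunk)
            stack.append(chunk)
            in_drawer = False
            continue
        if not stack:
            # Before the first heading: only file-level tags are collected.
            ft = FILETAGS_RE.match(line)
            if ft:
                file_tags |= split_tags(ft["tags"])
            continue
        current, stripped = stack[-1], line.strip().upper()
        if stripped == ":PROPERTIES:":
            in_drawer = True
        elif in_drawer and stripped == ":END:":
            in_drawer = False
        elif in_drawer:
            pm = PROPERTY_RE.match(line)
            if pm:
                current.properties[pm["key"]] = pm["value"].strip()
        else:
            current.body.append(line)
    return chunks
```

Because each chunk carries its full heading path and its inherited tags, a
downstream embedder or query layer would not need to re-read the file to
recover context for a single heading.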

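The cross-reference graph could start as a plain set of typed edges. The
sketch below is an assumption-laden illustration: =build_graph= and the input
shape (dicts with =id=, =tags=, and =text= keys) are hypothetical, only
=id:= and =file:= bracket links are recognized, and attribute-based grouping
on shared property values is left out.

```python
import re
from collections import defaultdict
from itertools import combinations

# Matches [[id:TARGET]], [[file:TARGET]], and the [[...][label]] forms.
LINK_RE = re.compile(
    r"\[\[(?P<kind>id|file):(?P<target>[^\]]+)\](?:\[(?P<label>[^\]]*)\])?\]"
)


def build_graph(chunks):
    """Return a set of (source, edge_type, target) triples.

    `chunks` is a hypothetical input shape: dicts with 'id', 'tags', 'text'.
    """
    edges = set()
    by_tag = defaultdict(set)
    for chunk in chunks:
        # Explicit links become typed edges ("id" or "file").
        for m in LINK_RE.finditer(chunk["text"]):
            edges.add((chunk["id"], m["kind"], m["target"]))
        for tag in chunk["tags"]:
            by_tag[tag].add(chunk["id"])
    # Tag co-occurrence: one implicit edge per pair of chunks sharing a tag.
    for tag, ids in by_tag.items():
        for a, b in combinations(sorted(ids), 2):
            edges.add((a, "tag:" + tag, b))
    return edges


example = [
    {"id": "a", "tags": {"kb", "org"},
     "text": "See [[id:b][the parser]] and [[file:notes.org]]."},
    {"id": "b", "tags": {"kb"}, "text": "No links here."},
]
graph = build_graph(example)
```

Keeping the edge type in the triple lets a query distinguish explicit links
from the weaker, implicit tag relationships.
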
** Priority

Below the gate stack and ACL2 planner (core dependencies) but above
the Lisp Machine hardware. Target: after TUI stabilization and the eval
harness, once the Screamer planner is stable enough to route queries
through the knowledge base.

The short-term bridge (current) is gbrain with nightly org→md sync.
This is adequate while the gate stack and planner are the priority.
The native org module replaces gbrain entirely once built.

Until then, the nightly pipeline uses gbrain to provide hybrid search across
the existing org files. The [[id:36e5b948-e07b-477f-9036-4dfe88254347][compliance framework mapping]] is the largest single
dataset the native module would serve, and the broader
[[id:28c46769-c14b-42aa-ac7a-69d310157f8f][Passepartout economics]] knowledge base demonstrates the value of
native org querying at scale.