feat(arch): finalize Universal Literate Note transition for all projects and skills

2026-03-31 16:14:37 -04:00
parent 1712b1e4a9
commit 70be8ab93e
79 changed files with 1606 additions and 417 deletions
--- a/projects/token-optimization/docs/plan.org
+++ b/projects/token-optimization/docs/plan.org
@@ -0,0 +1,215 @@
+#+TITLE: Token Optimization Strategy
+#+author: Amero Garcia
+#+created: [2026-03-16 Mon 14:28]
+#+DATE: 2026-03-04
+#+FILETAGS: :strategy:token:optimization:cost
+
+* Executive Summary
+
+** Goal: Minimize inference costs while maximizing capability
+
+Current approach: Single default model → Multi-tier, multi-provider strategy
+
+* Three-Tier Model Strategy
+
+** Tier 1: Fast/Cheap (80% of queries)
+- *Purpose:* Simple tasks, formatting, lookups
+- *Models:* Google Gemini Flash, Local models
+- *Cost:* $0-0.000001 per 1K tokens
+- *Speed:* Fastest
+
+** Tier 2: Balanced (18% of queries)
+- *Purpose:* Complex reasoning, code generation, analysis
+- *Models:* Gemini Pro, Claude Haiku, Llama 3 70B
+- *Cost:* $0.0001-0.003 per 1K tokens
+- *Speed:* Medium
+
+** Tier 3: High-Performance (2% of queries)
+- *Purpose:* Critical decisions, complex architecture, final review
+- *Models:* GPT-4, Claude Opus, Gemini Ultra
+- *Cost:* $0.01-0.03 per 1K tokens
+- *Speed:* Slower
+
+* Provider Analysis
+
+** Google AI Studio (Primary Recommended)
+
+| Model | Free Tier | Rate Limit | Best For |
+|-------|-----------|------------|----------|
+| Gemini 2.0 Flash | 300K tokens/day | 60 req/min | Quick tasks, coding |
+| Gemini 1.5 Flash | 300K tokens/day | 60 req/min | Fast responses |
+| Gemini 1.5 Pro | 300K tokens/day | 60 req/min | Complex tasks |
+
+*Cost: FREE (within limits)*
+
+** OpenRouter.Aggregated (Secondary)
+
+| Model | Price/1K tokens | Context | Reliability |
+|-------|-----------------|---------|-------------|
+| Qwen 3 235B | $0.0001-0.0003 | 128K | High |
+| Mistral Large | $0.002-0.006 | 128K | High |
+| Llama 4 405B | $0.0002-0.0005 | 128K | Medium |
+| Free tier models | $0 | Varies | Variable |
+
+** OpenAI (Tier 3 only)
+- GPT-4: $0.03/1K tokens (expensive)
+- GPT-4o: $0.005/1K tokens (better value)
+- Use sparingly for critical tasks only
+
+** Local Inference (Long-term goal)
+- Hardware: $1000-5000 initial investment
+- Ongoing: $0 (electricity only)
+- Models: Llama 3, Mistral, DeepSeek
+- Best for: High-volume, privacy-sensitive work
+
+* Context Optimization Strategies
+
+** 1. Context Windows by Task Type
+
+| Task Type | Optimal Context | Compression | Savings |
+|-----------|-----------------|-------------|---------|
+| Code review | 4K-8K | Truncate old files | 50% |
+| Documentation | 8K-16K | Summarize sections | 30% |
+| Research | 16K-32K | Chunk + RAG | 70% |
+| Architecture | 32K-128K | Maintain full | 0% |
+
+** 2. Conversation Pruning
+- Remove "thinking" blocks from history
+- Summarize conversation every 10 turns
+- Archive old sessions to external storage
+
+** 3. RAG vs. Full Context
+- *Rule:* < 5K tokens of context → Full
+- *Rule:* > 10K tokens of context → Use embeddings/RAG
+- *Savings:* 60-80% on large document tasks
+
+* Request Optimization
+
+** Batching Strategy
+- Group similar requests (3-5 per batch)
+- Same model, same parameters
+- Shared overhead costs
+
+** Caching Strategy
+- Cache embeddings for repeated contexts
+- Store common completions (templates)
+- Reuse code snippet suggestions
+
+** Streaming vs. Non-Stream
+- *Streaming:* Better UX, but higher token overhead
+- *Non-stream:* More efficient for programmatic use
+- *Recommendation:* Non-stream for background tasks
+
+* Smart Routing Rules
+
+** Automatic Selection Logic
+
+```
+IF task_type == "simple_lookup" OR "formatting":
+    → Gemini Flash (free)
+
+ELIF task_type == "code_generation" AND complexity < 3:
+    → Gemini Pro (free tier)
+
+ELIF task_type == "complex_reasoning" OR "architecture":
+    → Claude Sonnet or GPT-4o
+
+ELIF task_type == "final_review" OR "critical_decision":
+    → GPT-4 or Claude Opus
+```
+
+** Fallback Chain
+1. Try Gemini (free)
+2. If rate limited → OpenRouter (cheap)
+3. If quality insufficient → GPT-4o
+4. If critical failure → GPT-4
+
+* Concrete Implementation
+
+** Config Structure (openclaw.json)
+
+```json
+{
+  "models": {
+    "defaults": {
+      "primary": "google-gemini-cli/gemini-2.0-flash",
+      "fallbacks": [
+        "openrouter/qwen/qwen3-235b-a22b",
+        "google-gemini-cli/gemini-1.5-pro",
+        "openai/gpt-4o"
+      ]
+    },
+    "providers": {
+      "google-gemini-cli": {
+        "freeTier": true,
+        "dailyLimit": 300000,
+        "rateLimit": 60
+      },
+      "openrouter": {
+        "freeTierModels": ["openrouter/auto"],
+        "budgetLimit": 500
+      },
+      "openai": {
+        "budgetLimit": 200,
+        "useFor": ["critical", "architecture"]
+      }
+    }
+  }
+}
+```
+
+** Monitoring & Alerts
+
+- Track daily token usage per provider
+- Alert at 80% of free tier limits
+- Monthly budget review and adjustment
+
+* Cost Projections
+
+** Current Unknown Usage → Optimized
+
+| Scenario | Monthly Tokens | Current Cost | Optimized Cost | Savings |
+|----------|---------------|--------------|----------------|---------|
+| Light (< 1M) | 1M | $50-100 | $0-10 | 90% |
+| Medium (1-5M) | 3M | $200-500 | $20-100 | 80% |
+| Heavy (5-20M) | 10M | $1000-3000 | $200-500 | 80% |
+
+* Immediate Actions
+
+** Week 1: Setup
+- Configure Gemini as primary provider
+- Set up OpenRouter fallback
+- Implement basic usage tracking
+- Document current baseline
+
+** Week 2: Implement
+- Add smart routing logic
+- Implement context compression
+- Set up budget alerts
+- A/B test model choices
+
+** Week 3: Optimize
+- Analyze usage patterns
+- Fine-tune routing rules
+- Tune context windows
+- Document findings
+
+** Week 4: Scale
+- Full multi-provider setup
+- Implement full caching
+- Maximize free tier usage
+- Plan for paid tiers if needed
+
+* Long-term: Local Inference Path
+
+** Minimum Viable Setup
+- Hardware: RTX 4090 or Apple Silicon M3 Max
+- Software: Ollama + OpenClaw integration
+- Cost: ~$2000-4000 one-time
+- Break-even: 3-6 months vs. API costs
+
+** Full Self-Hosted
+- Hardware: Dual RTX 4090 or 2x Mac Studio
+- Models: Llama 3 70B, Mixtral 8x22B
+- Cost: ~$8000-12000
+- For: Privacy, unlimited inference, control