16 KiB
SKILL: Token Accountant Agent (Universal Literate Note) Token Management & Model Optimization Research Token Optimization - $50 Monthly Budget Token Optimization Token Optimization - Quick Start Token Optimization Strategy
- Overview
- Phase A: Demand (PRD)
- Phase B: Blueprint (PROTOCOL)
- Registration
- Documentation (Token Optimization)
- Token Management Strategy Research
- Budget: $50/Month
- Token Optimization
- Project Tasks
- Key Documents
- Current Focus
- Quick Reference for Daily Use
- Executive Summary
- Three-Tier Model Strategy
- Provider Analysis
- Context Optimization Strategies
- Request Optimization
- Smart Routing Rules
- Concrete Implementation
- Cost Projections
- Immediate Actions
- Long-term: Local Inference Path
Overview
The Token Accountant is the governor of the Neural Engine. It manages the cost, reliability, and routing of LLM providers. Its primary mission is to ensure the PSF operates at maximum intelligence with minimum marginal cost by aggressively prioritizing subsidized free models when appropriate.
Phase A: Demand (PRD)
1. Purpose
Autonomously manage the provider cascade and model selection to optimize for cost, speed, and reliability.
Phase B: Blueprint (PROTOCOL)
1. Architectural Intent
Maintain a state-aware provider cascade that routes around "pain" (failures) and dynamically selects models based on task complexity.
2. Semantic Interfaces
Routing and Pain Management
(in-package :org-agent)
(defvar *provider-pain-table* (make-hash-table :test 'equal))
(defun token-accountant-record-pain (provider)
"Marks a provider as 'pained' (failed). It will be de-prioritized."
(setf (gethash provider *provider-pain-table*) (+ (get-universal-time) 600)) ; 10 min penalty
(harness-log "ACCOUNTANT - Provider ~a de-prioritized due to failure." provider))
(defun token-accountant-get-cascade (context)
"Returns a dynamic list of providers, routing around pained ones. Uses standardized gateway keywords."
(let ((all-providers '(:openrouter :groq :gemini-api :ollama))
(healthy nil)
(pained nil)
(now (get-universal-time)))
(dolist (p all-providers)
(if (> (or (gethash p *provider-pain-table*) 0) now)
(push p pained)
(push p healthy)))
(append (nreverse healthy) (nreverse pained))))
(defun token-accountant-get-model-for-provider (provider &optional context)
"Returns the recommended model for the provider, prioritizing free/subsidized models. Updated April 2026."
(let ((complexity (ignore-errors (uiop:symbol-call :org-agent.skills.org-skill-router :router-classify-complexity context))))
(case provider
(:openrouter
(case complexity
(:REASONING "meta-llama/llama-3.3-70b-instruct:free") ; High fidelity, zero cost
(:COGNITION "qwen/qwen3.6-plus:free") ; Latest interaction, zero cost
(t "meta-llama/llama-3.2-3b-instruct:free"))) ; Ultra-fast reflex, zero cost
(:groq
(case complexity
(:REASONING "llama-3.3-70b-versatile")
(t "llama-3.1-8b-instant")))
(:gemini-api
"gemini-1.5-flash-latest")
(t nil))))
(defun token-accountant-patch-kernel ()
"Hot-patches the harness's cascade and model selector to use our dynamic logic."
(setf org-agent:*provider-cascade* #'token-accountant-get-cascade)
(setf org-agent::*model-selector-fn* #'token-accountant-get-model-for-provider))
Registration
(progn
(token-accountant-patch-kernel)
(defskill :skill-token-accountant
:priority 100
:trigger (lambda (context)
(let ((sensor (getf (getf context :payload) :sensor)))
(or (eq sensor :tool-error) (eq sensor :cost-audit))))
:neuro (lambda (context) nil)
:symbolic (lambda (action context)
(let ((p (getf (getf context :payload) :provider)))
(when p (token-accountant-record-pain p))
action))))
Documentation (Token Optimization)
research.org
Token Management Strategy Research
Initial Findings
OpenRouter Free Tier
- URL: https://openrouter.ai/collections/free-models
- Providers moving from free to paid-only models
- Belief: "Free models play crucial role in democratizing access"
Google AI Studio (Gemini)
- Free tier available
- Limits: 60 requests/minute, 300K tokens/day
- No credit card required
- Every API key gets these limits
Research Questions
- Which providers offer free or low-cost tiers?
- What are the rate limits and quotas?
- Which models are best for which use cases?
- How to optimize context windows?
- What is the cost per token breakdown?
To Research Further
| Provider | Free Tier | Paid Tier | Best For |
|---|---|---|---|
| Google Gemini | 300K tokens/day | Pay per use? | General, coding |
| OpenRouter | Varies by model | Per-request | Routing, variety |
| OpenAI | ? | ? | GPT-4 quality |
| Anthropic | ? | ? | Claude capabilities |
| Mistral | ? | ? | Open weights |
| Local | Hardware cost | Free | Privacy, control |
Token Optimization Strategies to Explore
-
Tiered Model Usage
- Simple tasks: Fast/cheap models
- Complex tasks: Stronger models
- Fallback: Lower tier if higher fails
-
Context Compression
- Summarize long contexts
- Use RAG instead of full context
- Prune old conversation
-
Caching
- Cache common responses
- Reuse embeddings
- Batch requests
-
Hybrid Approach
- Local models for simple queries
- Cloud APIs for complex tasks
- Manual review for critical outputs
X Account Access
Pending: X account access via Google login Blocker: Requires OTP from user per security rule (SOUL.md) Action needed: User provides OTP, I complete OAuth, access bookmarks
budget-50.org
Budget: $50/Month
Budget Breakdown
| Tier | Provider | Allocation | Tokens Est. | Use Case |
|---|---|---|---|---|
| FREE | Google Gemini | $0 | ~9M/month | 90% of work |
| CHEAP | OpenRouter | $20 | ~6M tokens | Fallback, complex tasks |
| PREMIUM | Claude/GPT-4o | $25 | ~500K tokens | Critical decisions |
| BUFFER | Various | $5 | Emergency | Overruns, testing |
Daily Free Allowance
- Google Gemini: 300K tokens/day = 9M/month = $0
- This covers 90-95% of expected workload
Paid Tier Allocation ($45)
-
$20 → OpenRouter (Qwen, Mistral, Llama)
- ~6M tokens at $0.003/1K
- Use when: Gemini rate limited, need different model
-
$25 → Premium models (Claude, GPT-4o)
- ~500K tokens at $0.05/1K average
- Use when: Architecture decisions, critical code review, final validation
-
$5 → Buffer
- Handle overruns
- Emergency access
- Testing new models
Hard Limits
| Provider | Monthly Cap | Alert At |
|---|---|---|
| OpenRouter | $20 | $16 (80%) |
| Premium | $25 | $20 (80%) |
| Total | $50 | $45 (90%) |
Daily Tracking
Target: Monitor consumption every session
``` IF daily_cost > $1.50: → Switch to Gemini only → Defer premium tasks
IF weekly_cost > $12: → Review usage patterns → Find optimization opportunities ```
Emergency Protocol
If approaching $50 limit before month end:
- Halt all paid API calls
- Switch to Gemini-only mode
- Queue premium tasks for next month
- Consider local inference setup
Cost-Per-Task Guidelines
| Task Type | Max Cost | Preferred Model |
|---|---|---|
| Quick lookup | $0.00 | Gemini |
| Code review | $0.01 | Gemini/OpenRouter |
| Feature design | $0.05 | OpenRouter |
| Architecture review | $0.10 | Claude/GPT-4o |
| Emergency debug | $0.20 | Best available |
Optimization Imperative
With $50/month, waste is not affordable:
- ❌ No speculative queries
- ❌ No "just curious" premium calls
- ❌ No repeated similar prompts
- ✅ Always use Gemini first
- ✅ Batch similar requests
- ✅ Cache embeddings locally
- ✅ Summarize long contexts
Monthly Review
- Compare actual vs. projected usage
- Adjust model routing rules
- Identify expensive query patterns
- Plan next month's allocation
Break-Even Analysis
At $50/month = $600/year:
- Option A: Continue APIs (flexible, managed)
-
Option B: Local inference (~$800 hardware, $0 ongoing)
- Break-even: 16 months
- Risk: Hardware failure, maintenance
Recommendation: Stick with APIs until $100+/month, then evaluate hardware.
Questions for Human Partner
- Is $50 firm or flexible in emergencies?
- What happens if we hit limit mid-critical-task?
- Preference for which premium model? (Claude vs GPT-4 vs both)
- Should I track and report costs per project?
- Any tasks that are "unlimited budget" critical?
README.org
Cost-effective LLM usage through smart routing, context compression, and multi-provider strategies.
Token Optimization
Strategy and implementation for minimizing LLM costs while maintaining quality.
Project Tasks
See the actionable tasks for this project in GTD.org > Projects > Token Optimization
Key Documents
Current Focus
- Multi-provider setup (Gemini primary, OpenRouter fallback)
- Usage tracking and budget alerts
- Smart routing by task type
- Context compression techniques
quick-start.org
Quick Reference for Daily Use
Rule of Thumb
| What you need | Use this | Cost |
|---|---|---|
| Quick answer, formatting, lookup | Gemini Flash | FREE |
| Code review, analysis | Gemini Pro | FREE |
| Complex problem solving | Claude Haiku / Qwen | $ |
| Critical architecture decision | GPT-4o | $$ |
Free Tier Limits (Daily)
| Provider | Tokens | Requests | Reset |
|---|---|---|---|
| Google AI Studio | 300,000 | 60/min | Daily |
| OpenRouter Free | Varies | Limited | - |
Current Recommendation
→ Use Google Gemini exclusively until hitting 250K tokens/day → Then add OpenRouter fallback → Only use GPT-4 for final reviews
This will reduce token costs by ~90%
Next Steps
- Configure Gemini as primary (already partially done)
- Add quota tracking
- Set alerts at 80% of free limits
- Implement tiered routing
Savings Potential: $100-500/month → $10-50/month
plan.org
Executive Summary
Goal: Minimize inference costs while maximizing capability
Current approach: Single default model → Multi-tier, multi-provider strategy
Three-Tier Model Strategy
Tier 1: Fast/Cheap (80% of queries)
- Purpose: Simple tasks, formatting, lookups
- Models: Google Gemini Flash, Local models
- Cost: $0-0.000001 per 1K tokens
- Speed: Fastest
Tier 2: Balanced (18% of queries)
- Purpose: Complex reasoning, code generation, analysis
- Models: Gemini Pro, Claude Haiku, Llama 3 70B
- Cost: $0.0001-0.003 per 1K tokens
- Speed: Medium
Tier 3: High-Performance (2% of queries)
- Purpose: Critical decisions, complex architecture, final review
- Models: GPT-4, Claude Opus, Gemini Ultra
- Cost: $0.01-0.03 per 1K tokens
- Speed: Slower
Provider Analysis
Google AI Studio (Primary Recommended)
| Model | Free Tier | Rate Limit | Best For |
|---|---|---|---|
| Gemini 2.0 Flash | 300K tokens/day | 60 req/min | Quick tasks, coding |
| Gemini 1.5 Flash | 300K tokens/day | 60 req/min | Fast responses |
| Gemini 1.5 Pro | 300K tokens/day | 60 req/min | Complex tasks |
Cost: FREE (within limits)
OpenRouter.Aggregated (Secondary)
| Model | Price/1K tokens | Context | Reliability |
|---|---|---|---|
| Qwen 3 235B | $0.0001-0.0003 | 128K | High |
| Mistral Large | $0.002-0.006 | 128K | High |
| Llama 4 405B | $0.0002-0.0005 | 128K | Medium |
| Free tier models | $0 | Varies | Variable |
OpenAI (Tier 3 only)
- GPT-4: $0.03/1K tokens (expensive)
- GPT-4o: $0.005/1K tokens (better value)
- Use sparingly for critical tasks only
Local Inference (Long-term goal)
- Hardware: $1000-5000 initial investment
- Ongoing: $0 (electricity only)
- Models: Llama 3, Mistral, DeepSeek
- Best for: High-volume, privacy-sensitive work
Context Optimization Strategies
1. Context Windows by Task Type
| Task Type | Optimal Context | Compression | Savings |
|---|---|---|---|
| Code review | 4K-8K | Truncate old files | 50% |
| Documentation | 8K-16K | Summarize sections | 30% |
| Research | 16K-32K | Chunk + RAG | 70% |
| Architecture | 32K-128K | Maintain full | 0% |
2. Conversation Pruning
- Remove "thinking" blocks from history
- Summarize conversation every 10 turns
- Archive old sessions to external storage
3. RAG vs. Full Context
- Rule: < 5K tokens of context → Full
- Rule: > 10K tokens of context → Use embeddings/RAG
- Savings: 60-80% on large document tasks
Request Optimization
Batching Strategy
- Group similar requests (3-5 per batch)
- Same model, same parameters
- Shared overhead costs
Caching Strategy
- Cache embeddings for repeated contexts
- Store common completions (templates)
- Reuse code snippet suggestions
Streaming vs. Non-Stream
- Streaming: Better UX, but higher token overhead
- Non-stream: More efficient for programmatic use
- Recommendation: Non-stream for background tasks
Smart Routing Rules
Automatic Selection Logic
``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free)
ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier)
ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o
ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ```
Fallback Chain
- Try Gemini (free)
- If rate limited → OpenRouter (cheap)
- If quality insufficient → GPT-4o
- If critical failure → GPT-4
Concrete Implementation
Config Structure (openclaw.json)
```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ```
Monitoring & Alerts
- Track daily token usage per provider
- Alert at 80% of free tier limits
- Monthly budget review and adjustment
Cost Projections
Current Unknown Usage → Optimized
| Scenario | Monthly Tokens | Current Cost | Optimized Cost | Savings |
|---|---|---|---|---|
| Light (< 1M) | 1M | $50-100 | $0-10 | 90% |
| Medium (1-5M) | 3M | $200-500 | $20-100 | 80% |
| Heavy (5-20M) | 10M | $1000-3000 | $200-500 | 80% |
Immediate Actions
Week 1: Setup
- Configure Gemini as primary provider
- Set up OpenRouter fallback
- Implement basic usage tracking
- Document current baseline
Week 2: Implement
- Add smart routing logic
- Implement context compression
- Set up budget alerts
- A/B test model choices
Week 3: Optimize
- Analyze usage patterns
- Fine-tune routing rules
- Tune context windows
- Document findings
Week 4: Scale
- Full multi-provider setup
- Implement full caching
- Maximize free tier usage
- Plan for paid tiers if needed
Long-term: Local Inference Path
Minimum Viable Setup
- Hardware: RTX 4090 or Apple Silicon M3 Max
- Software: Ollama + OpenClaw integration
- Cost: ~$2000-4000 one-time
- Break-even: 3-6 months vs. API costs
Full Self-Hosted
- Hardware: Dual RTX 4090 or 2x Mac Studio
- Models: Llama 3 70B, Mixtral 8x22B
- Cost: ~$8000-12000
- For: Privacy, unlimited inference, control