feat(arch): finalize Universal Literate Note transition for all projects and skills
This commit is contained in:
215
projects/token-optimization/docs/plan.org
Normal file
215
projects/token-optimization/docs/plan.org
Normal file
@@ -0,0 +1,215 @@
|
||||
#+TITLE: Token Optimization Strategy
|
||||
#+author: Amero Garcia
|
||||
#+created: [2026-03-16 Mon 14:28]
|
||||
#+DATE: 2026-03-04
|
||||
#+FILETAGS: :strategy:token:optimization:cost
|
||||
|
||||
* Executive Summary
|
||||
|
||||
** Goal: Minimize inference costs while maximizing capability
|
||||
|
||||
Current approach: Single default model → Multi-tier, multi-provider strategy
|
||||
|
||||
* Three-Tier Model Strategy
|
||||
|
||||
** Tier 1: Fast/Cheap (80% of queries)
|
||||
- *Purpose:* Simple tasks, formatting, lookups
|
||||
- *Models:* Google Gemini Flash, Local models
|
||||
- *Cost:* $0-0.000001 per 1K tokens
|
||||
- *Speed:* Fastest
|
||||
|
||||
** Tier 2: Balanced (18% of queries)
|
||||
- *Purpose:* Complex reasoning, code generation, analysis
|
||||
- *Models:* Gemini Pro, Claude Haiku, Llama 3 70B
|
||||
- *Cost:* $0.0001-0.003 per 1K tokens
|
||||
- *Speed:* Medium
|
||||
|
||||
** Tier 3: High-Performance (2% of queries)
|
||||
- *Purpose:* Critical decisions, complex architecture, final review
|
||||
- *Models:* GPT-4, Claude Opus, Gemini Ultra
|
||||
- *Cost:* $0.01-0.03 per 1K tokens
|
||||
- *Speed:* Slower
|
||||
|
||||
* Provider Analysis
|
||||
|
||||
** Google AI Studio (Primary Recommended)
|
||||
|
||||
| Model | Free Tier | Rate Limit | Best For |
|
||||
|-------|-----------|------------|----------|
|
||||
| Gemini 2.0 Flash | 300K tokens/day | 60 req/min | Quick tasks, coding |
|
||||
| Gemini 1.5 Flash | 300K tokens/day | 60 req/min | Fast responses |
|
||||
| Gemini 1.5 Pro | 300K tokens/day | 60 req/min | Complex tasks |
|
||||
|
||||
*Cost: FREE (within limits)*
|
||||
|
||||
** OpenRouter.Aggregated (Secondary)
|
||||
|
||||
| Model | Price/1K tokens | Context | Reliability |
|
||||
|-------|-----------------|---------|-------------|
|
||||
| Qwen 3 235B | $0.0001-0.0003 | 128K | High |
|
||||
| Mistral Large | $0.002-0.006 | 128K | High |
|
||||
| Llama 4 405B | $0.0002-0.0005 | 128K | Medium |
|
||||
| Free tier models | $0 | Varies | Variable |
|
||||
|
||||
** OpenAI (Tier 3 only)
|
||||
- GPT-4: $0.03/1K tokens (expensive)
|
||||
- GPT-4o: $0.005/1K tokens (better value)
|
||||
- Use sparingly for critical tasks only
|
||||
|
||||
** Local Inference (Long-term goal)
|
||||
- Hardware: $1000-5000 initial investment
|
||||
- Ongoing: $0 (electricity only)
|
||||
- Models: Llama 3, Mistral, DeepSeek
|
||||
- Best for: High-volume, privacy-sensitive work
|
||||
|
||||
* Context Optimization Strategies
|
||||
|
||||
** 1. Context Windows by Task Type
|
||||
|
||||
| Task Type | Optimal Context | Compression | Savings |
|
||||
|-----------|-----------------|-------------|---------|
|
||||
| Code review | 4K-8K | Truncate old files | 50% |
|
||||
| Documentation | 8K-16K | Summarize sections | 30% |
|
||||
| Research | 16K-32K | Chunk + RAG | 70% |
|
||||
| Architecture | 32K-128K | Maintain full | 0% |
|
||||
|
||||
** 2. Conversation Pruning
|
||||
- Remove "thinking" blocks from history
|
||||
- Summarize conversation every 10 turns
|
||||
- Archive old sessions to external storage
|
||||
|
||||
** 3. RAG vs. Full Context
|
||||
- *Rule:* < 5K tokens of context → Full
|
||||
- *Rule:* > 10K tokens of context → Use embeddings/RAG
|
||||
- *Savings:* 60-80% on large document tasks
|
||||
|
||||
* Request Optimization
|
||||
|
||||
** Batching Strategy
|
||||
- Group similar requests (3-5 per batch)
|
||||
- Same model, same parameters
|
||||
- Shared overhead costs
|
||||
|
||||
** Caching Strategy
|
||||
- Cache embeddings for repeated contexts
|
||||
- Store common completions (templates)
|
||||
- Reuse code snippet suggestions
|
||||
|
||||
** Streaming vs. Non-Stream
|
||||
- *Streaming:* Better UX, but higher token overhead
|
||||
- *Non-stream:* More efficient for programmatic use
|
||||
- *Recommendation:* Non-stream for background tasks
|
||||
|
||||
* Smart Routing Rules
|
||||
|
||||
** Automatic Selection Logic
|
||||
|
||||
```
|
||||
IF task_type == "simple_lookup" OR "formatting":
|
||||
→ Gemini Flash (free)
|
||||
|
||||
ELIF task_type == "code_generation" AND complexity < 3:
|
||||
→ Gemini Pro (free tier)
|
||||
|
||||
ELIF task_type == "complex_reasoning" OR "architecture":
|
||||
→ Claude Sonnet or GPT-4o
|
||||
|
||||
ELIF task_type == "final_review" OR "critical_decision":
|
||||
→ GPT-4 or Claude Opus
|
||||
```
|
||||
|
||||
** Fallback Chain
|
||||
1. Try Gemini (free)
|
||||
2. If rate limited → OpenRouter (cheap)
|
||||
3. If quality insufficient → GPT-4o
|
||||
4. If critical failure → GPT-4
|
||||
|
||||
* Concrete Implementation
|
||||
|
||||
** Config Structure (openclaw.json)
|
||||
|
||||
```json
|
||||
{
|
||||
"models": {
|
||||
"defaults": {
|
||||
"primary": "google-gemini-cli/gemini-2.0-flash",
|
||||
"fallbacks": [
|
||||
"openrouter/qwen/qwen3-235b-a22b",
|
||||
"google-gemini-cli/gemini-1.5-pro",
|
||||
"openai/gpt-4o"
|
||||
]
|
||||
},
|
||||
"providers": {
|
||||
"google-gemini-cli": {
|
||||
"freeTier": true,
|
||||
"dailyLimit": 300000,
|
||||
"rateLimit": 60
|
||||
},
|
||||
"openrouter": {
|
||||
"freeTierModels": ["openrouter/auto"],
|
||||
"budgetLimit": 500
|
||||
},
|
||||
"openai": {
|
||||
"budgetLimit": 200,
|
||||
"useFor": ["critical", "architecture"]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
** Monitoring & Alerts
|
||||
|
||||
- Track daily token usage per provider
|
||||
- Alert at 80% of free tier limits
|
||||
- Monthly budget review and adjustment
|
||||
|
||||
* Cost Projections
|
||||
|
||||
** Current Unknown Usage → Optimized
|
||||
|
||||
| Scenario | Monthly Tokens | Current Cost | Optimized Cost | Savings |
|
||||
|----------|---------------|--------------|----------------|---------|
|
||||
| Light (< 1M) | 1M | $50-100 | $0-10 | 90% |
|
||||
| Medium (1-5M) | 3M | $200-500 | $20-100 | 80% |
|
||||
| Heavy (5-20M) | 10M | $1000-3000 | $200-500 | 80% |
|
||||
|
||||
* Immediate Actions
|
||||
|
||||
** Week 1: Setup
|
||||
- Configure Gemini as primary provider
|
||||
- Set up OpenRouter fallback
|
||||
- Implement basic usage tracking
|
||||
- Document current baseline
|
||||
|
||||
** Week 2: Implement
|
||||
- Add smart routing logic
|
||||
- Implement context compression
|
||||
- Set up budget alerts
|
||||
- A/B test model choices
|
||||
|
||||
** Week 3: Optimize
|
||||
- Analyze usage patterns
|
||||
- Fine-tune routing rules
|
||||
- Tune context windows
|
||||
- Document findings
|
||||
|
||||
** Week 4: Scale
|
||||
- Full multi-provider setup
|
||||
- Implement full caching
|
||||
- Maximize free tier usage
|
||||
- Plan for paid tiers if needed
|
||||
|
||||
* Long-term: Local Inference Path
|
||||
|
||||
** Minimum Viable Setup
|
||||
- Hardware: RTX 4090 or Apple Silicon M3 Max
|
||||
- Software: Ollama + OpenClaw integration
|
||||
- Cost: ~$2000-4000 one-time
|
||||
- Break-even: 3-6 months vs. API costs
|
||||
|
||||
** Full Self-Hosted
|
||||
- Hardware: Dual RTX 4090 or 2x Mac Studio
|
||||
- Models: Llama 3 70B, Mixtral 8x22B
|
||||
- Cost: ~$8000-12000
|
||||
- For: Privacy, unlimited inference, control
|
||||
Reference in New Issue
Block a user