5.7 KiB
Token Optimization Strategy
- Executive Summary
- Three-Tier Model Strategy
- Provider Analysis
- Context Optimization Strategies
- Request Optimization
- Smart Routing Rules
- Concrete Implementation
- Cost Projections
- Immediate Actions
- Long-term: Local Inference Path
Executive Summary
Goal: Minimize inference costs while maximizing capability
Current approach: Single default model → Multi-tier, multi-provider strategy
Three-Tier Model Strategy
Tier 1: Fast/Cheap (80% of queries)
- Purpose: Simple tasks, formatting, lookups
- Models: Google Gemini Flash, Local models
- Cost: $0-0.000001 per 1K tokens
- Speed: Fastest
Tier 2: Balanced (18% of queries)
- Purpose: Complex reasoning, code generation, analysis
- Models: Gemini Pro, Claude Haiku, Llama 3 70B
- Cost: $0.0001-0.003 per 1K tokens
- Speed: Medium
Tier 3: High-Performance (2% of queries)
- Purpose: Critical decisions, complex architecture, final review
- Models: GPT-4, Claude Opus, Gemini Ultra
- Cost: $0.01-0.03 per 1K tokens
- Speed: Slower
Provider Analysis
Google AI Studio (Primary Recommended)
| Model | Free Tier | Rate Limit | Best For |
|---|---|---|---|
| Gemini 2.0 Flash | 300K tokens/day | 60 req/min | Quick tasks, coding |
| Gemini 1.5 Flash | 300K tokens/day | 60 req/min | Fast responses |
| Gemini 1.5 Pro | 300K tokens/day | 60 req/min | Complex tasks |
Cost: FREE (within limits)
OpenRouter.Aggregated (Secondary)
| Model | Price/1K tokens | Context | Reliability |
|---|---|---|---|
| Qwen 3 235B | $0.0001-0.0003 | 128K | High |
| Mistral Large | $0.002-0.006 | 128K | High |
| Llama 4 405B | $0.0002-0.0005 | 128K | Medium |
| Free tier models | $0 | Varies | Variable |
OpenAI (Tier 3 only)
- GPT-4: $0.03/1K tokens (expensive)
- GPT-4o: $0.005/1K tokens (better value)
- Use sparingly for critical tasks only
Local Inference (Long-term goal)
- Hardware: $1000-5000 initial investment
- Ongoing: $0 (electricity only)
- Models: Llama 3, Mistral, DeepSeek
- Best for: High-volume, privacy-sensitive work
Context Optimization Strategies
1. Context Windows by Task Type
| Task Type | Optimal Context | Compression | Savings |
|---|---|---|---|
| Code review | 4K-8K | Truncate old files | 50% |
| Documentation | 8K-16K | Summarize sections | 30% |
| Research | 16K-32K | Chunk + RAG | 70% |
| Architecture | 32K-128K | Maintain full | 0% |
2. Conversation Pruning
- Remove "thinking" blocks from history
- Summarize conversation every 10 turns
- Archive old sessions to external storage
3. RAG vs. Full Context
- Rule: < 5K tokens of context → Full
- Rule: > 10K tokens of context → Use embeddings/RAG
- Savings: 60-80% on large document tasks
Request Optimization
Batching Strategy
- Group similar requests (3-5 per batch)
- Same model, same parameters
- Shared overhead costs
Caching Strategy
- Cache embeddings for repeated contexts
- Store common completions (templates)
- Reuse code snippet suggestions
Streaming vs. Non-Stream
- Streaming: Better UX, but higher token overhead
- Non-stream: More efficient for programmatic use
- Recommendation: Non-stream for background tasks
Smart Routing Rules
Automatic Selection Logic
``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free)
ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier)
ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o
ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ```
Fallback Chain
- Try Gemini (free)
- If rate limited → OpenRouter (cheap)
- If quality insufficient → GPT-4o
- If critical failure → GPT-4
Concrete Implementation
Config Structure (openclaw.json)
```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ```
Monitoring & Alerts
- Track daily token usage per provider
- Alert at 80% of free tier limits
- Monthly budget review and adjustment
Cost Projections
Current Unknown Usage → Optimized
| Scenario | Monthly Tokens | Current Cost | Optimized Cost | Savings |
|---|---|---|---|---|
| Light (< 1M) | 1M | $50-100 | $0-10 | 90% |
| Medium (1-5M) | 3M | $200-500 | $20-100 | 80% |
| Heavy (5-20M) | 10M | $1000-3000 | $200-500 | 80% |
Immediate Actions
Week 1: Setup
- Configure Gemini as primary provider
- Set up OpenRouter fallback
- Implement basic usage tracking
- Document current baseline
Week 2: Implement
- Add smart routing logic
- Implement context compression
- Set up budget alerts
- A/B test model choices
Week 3: Optimize
- Analyze usage patterns
- Fine-tune routing rules
- Tune context windows
- Document findings
Week 4: Scale
- Full multi-provider setup
- Implement full caching
- Maximize free tier usage
- Plan for paid tiers if needed
Long-term: Local Inference Path
Minimum Viable Setup
- Hardware: RTX 4090 or Apple Silicon M3 Max
- Software: Ollama + OpenClaw integration
- Cost: ~$2000-4000 one-time
- Break-even: 3-6 months vs. API costs
Full Self-Hosted
- Hardware: Dual RTX 4090 or 2x Mac Studio
- Models: Llama 3 70B, Mixtral 8x22B
- Cost: ~$8000-12000
- For: Privacy, unlimited inference, control