#+TITLE: Token Optimization Strategy #+author: Amero Garcia #+created: [2026-03-16 Mon 14:28] #+DATE: 2026-03-04 #+FILETAGS: :strategy:token:optimization:cost * Executive Summary ** Goal: Minimize inference costs while maximizing capability Current approach: Single default model → Multi-tier, multi-provider strategy * Three-Tier Model Strategy ** Tier 1: Fast/Cheap (80% of queries) - *Purpose:* Simple tasks, formatting, lookups - *Models:* Google Gemini Flash, Local models - *Cost:* $0-0.000001 per 1K tokens - *Speed:* Fastest ** Tier 2: Balanced (18% of queries) - *Purpose:* Complex reasoning, code generation, analysis - *Models:* Gemini Pro, Claude Haiku, Llama 3 70B - *Cost:* $0.0001-0.003 per 1K tokens - *Speed:* Medium ** Tier 3: High-Performance (2% of queries) - *Purpose:* Critical decisions, complex architecture, final review - *Models:* GPT-4, Claude Opus, Gemini Ultra - *Cost:* $0.01-0.03 per 1K tokens - *Speed:* Slower * Provider Analysis ** Google AI Studio (Primary Recommended) | Model | Free Tier | Rate Limit | Best For | |-------|-----------|------------|----------| | Gemini 2.0 Flash | 300K tokens/day | 60 req/min | Quick tasks, coding | | Gemini 1.5 Flash | 300K tokens/day | 60 req/min | Fast responses | | Gemini 1.5 Pro | 300K tokens/day | 60 req/min | Complex tasks | *Cost: FREE (within limits)* ** OpenRouter.Aggregated (Secondary) | Model | Price/1K tokens | Context | Reliability | |-------|-----------------|---------|-------------| | Qwen 3 235B | $0.0001-0.0003 | 128K | High | | Mistral Large | $0.002-0.006 | 128K | High | | Llama 4 405B | $0.0002-0.0005 | 128K | Medium | | Free tier models | $0 | Varies | Variable | ** OpenAI (Tier 3 only) - GPT-4: $0.03/1K tokens (expensive) - GPT-4o: $0.005/1K tokens (better value) - Use sparingly for critical tasks only ** Local Inference (Long-term goal) - Hardware: $1000-5000 initial investment - Ongoing: $0 (electricity only) - Models: Llama 3, Mistral, DeepSeek - Best for: High-volume, privacy-sensitive work * Context Optimization Strategies ** 1. Context Windows by Task Type | Task Type | Optimal Context | Compression | Savings | |-----------|-----------------|-------------|---------| | Code review | 4K-8K | Truncate old files | 50% | | Documentation | 8K-16K | Summarize sections | 30% | | Research | 16K-32K | Chunk + RAG | 70% | | Architecture | 32K-128K | Maintain full | 0% | ** 2. Conversation Pruning - Remove "thinking" blocks from history - Summarize conversation every 10 turns - Archive old sessions to external storage ** 3. RAG vs. Full Context - *Rule:* < 5K tokens of context → Full - *Rule:* > 10K tokens of context → Use embeddings/RAG - *Savings:* 60-80% on large document tasks * Request Optimization ** Batching Strategy - Group similar requests (3-5 per batch) - Same model, same parameters - Shared overhead costs ** Caching Strategy - Cache embeddings for repeated contexts - Store common completions (templates) - Reuse code snippet suggestions ** Streaming vs. Non-Stream - *Streaming:* Better UX, but higher token overhead - *Non-stream:* More efficient for programmatic use - *Recommendation:* Non-stream for background tasks * Smart Routing Rules ** Automatic Selection Logic ``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free) ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier) ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ``` ** Fallback Chain 1. Try Gemini (free) 2. If rate limited → OpenRouter (cheap) 3. If quality insufficient → GPT-4o 4. If critical failure → GPT-4 * Concrete Implementation ** Config Structure (openclaw.json) ```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ``` ** Monitoring & Alerts - Track daily token usage per provider - Alert at 80% of free tier limits - Monthly budget review and adjustment * Cost Projections ** Current Unknown Usage → Optimized | Scenario | Monthly Tokens | Current Cost | Optimized Cost | Savings | |----------|---------------|--------------|----------------|---------| | Light (< 1M) | 1M | $50-100 | $0-10 | 90% | | Medium (1-5M) | 3M | $200-500 | $20-100 | 80% | | Heavy (5-20M) | 10M | $1000-3000 | $200-500 | 80% | * Immediate Actions ** Week 1: Setup - Configure Gemini as primary provider - Set up OpenRouter fallback - Implement basic usage tracking - Document current baseline ** Week 2: Implement - Add smart routing logic - Implement context compression - Set up budget alerts - A/B test model choices ** Week 3: Optimize - Analyze usage patterns - Fine-tune routing rules - Tune context windows - Document findings ** Week 4: Scale - Full multi-provider setup - Implement full caching - Maximize free tier usage - Plan for paid tiers if needed * Long-term: Local Inference Path ** Minimum Viable Setup - Hardware: RTX 4090 or Apple Silicon M3 Max - Software: Ollama + OpenClaw integration - Cost: ~$2000-4000 one-time - Break-even: 3-6 months vs. API costs ** Full Self-Hosted - Hardware: Dual RTX 4090 or 2x Mac Studio - Models: Llama 3 70B, Mixtral 8x22B - Cost: ~$8000-12000 - For: Privacy, unlimited inference, control