Files
memex/projects/token-optimization/plan.org

5.7 KiB

Token Optimization Strategy

Executive Summary

Goal: Minimize inference costs while maximizing capability

Current approach: Single default model → Multi-tier, multi-provider strategy

Three-Tier Model Strategy

Tier 1: Fast/Cheap (80% of queries)

  • Purpose: Simple tasks, formatting, lookups
  • Models: Google Gemini Flash, Local models
  • Cost: $0-0.000001 per 1K tokens
  • Speed: Fastest

Tier 2: Balanced (18% of queries)

  • Purpose: Complex reasoning, code generation, analysis
  • Models: Gemini Pro, Claude Haiku, Llama 3 70B
  • Cost: $0.0001-0.003 per 1K tokens
  • Speed: Medium

Tier 3: High-Performance (2% of queries)

  • Purpose: Critical decisions, complex architecture, final review
  • Models: GPT-4, Claude Opus, Gemini Ultra
  • Cost: $0.01-0.03 per 1K tokens
  • Speed: Slower

Provider Analysis

Google AI Studio (Primary Recommended)

Model Free Tier Rate Limit Best For
Gemini 2.0 Flash 300K tokens/day 60 req/min Quick tasks, coding
Gemini 1.5 Flash 300K tokens/day 60 req/min Fast responses
Gemini 1.5 Pro 300K tokens/day 60 req/min Complex tasks

Cost: FREE (within limits)

OpenRouter.Aggregated (Secondary)

Model Price/1K tokens Context Reliability
Qwen 3 235B $0.0001-0.0003 128K High
Mistral Large $0.002-0.006 128K High
Llama 4 405B $0.0002-0.0005 128K Medium
Free tier models $0 Varies Variable

OpenAI (Tier 3 only)

  • GPT-4: $0.03/1K tokens (expensive)
  • GPT-4o: $0.005/1K tokens (better value)
  • Use sparingly for critical tasks only

Local Inference (Long-term goal)

  • Hardware: $1000-5000 initial investment
  • Ongoing: $0 (electricity only)
  • Models: Llama 3, Mistral, DeepSeek
  • Best for: High-volume, privacy-sensitive work

Context Optimization Strategies

1. Context Windows by Task Type

Task Type Optimal Context Compression Savings
Code review 4K-8K Truncate old files 50%
Documentation 8K-16K Summarize sections 30%
Research 16K-32K Chunk + RAG 70%
Architecture 32K-128K Maintain full 0%

2. Conversation Pruning

  • Remove "thinking" blocks from history
  • Summarize conversation every 10 turns
  • Archive old sessions to external storage

3. RAG vs. Full Context

  • Rule: < 5K tokens of context → Full
  • Rule: > 10K tokens of context → Use embeddings/RAG
  • Savings: 60-80% on large document tasks

Request Optimization

Batching Strategy

  • Group similar requests (3-5 per batch)
  • Same model, same parameters
  • Shared overhead costs

Caching Strategy

  • Cache embeddings for repeated contexts
  • Store common completions (templates)
  • Reuse code snippet suggestions

Streaming vs. Non-Stream

  • Streaming: Better UX, but higher token overhead
  • Non-stream: More efficient for programmatic use
  • Recommendation: Non-stream for background tasks

Smart Routing Rules

Automatic Selection Logic

``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free)

ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier)

ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o

ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ```

Fallback Chain

  1. Try Gemini (free)
  2. If rate limited → OpenRouter (cheap)
  3. If quality insufficient → GPT-4o
  4. If critical failure → GPT-4

Concrete Implementation

Config Structure (openclaw.json)

```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ```

Monitoring & Alerts

  • Track daily token usage per provider
  • Alert at 80% of free tier limits
  • Monthly budget review and adjustment

Cost Projections

Current Unknown Usage → Optimized

Scenario Monthly Tokens Current Cost Optimized Cost Savings
Light (< 1M) 1M $50-100 $0-10 90%
Medium (1-5M) 3M $200-500 $20-100 80%
Heavy (5-20M) 10M $1000-3000 $200-500 80%

Immediate Actions

Week 1: Setup

  • Configure Gemini as primary provider
  • Set up OpenRouter fallback
  • Implement basic usage tracking
  • Document current baseline

Week 2: Implement

  • Add smart routing logic
  • Implement context compression
  • Set up budget alerts
  • A/B test model choices

Week 3: Optimize

  • Analyze usage patterns
  • Fine-tune routing rules
  • Tune context windows
  • Document findings

Week 4: Scale

  • Full multi-provider setup
  • Implement full caching
  • Maximize free tier usage
  • Plan for paid tiers if needed

Long-term: Local Inference Path

Minimum Viable Setup

  • Hardware: RTX 4090 or Apple Silicon M3 Max
  • Software: Ollama + OpenClaw integration
  • Cost: ~$2000-4000 one-time
  • Break-even: 3-6 months vs. API costs

Full Self-Hosted

  • Hardware: Dual RTX 4090 or 2x Mac Studio
  • Models: Llama 3 70B, Mixtral 8x22B
  • Cost: ~$8000-12000
  • For: Privacy, unlimited inference, control