amr/memex

Fork 0

Files

Amr Gharbeia 70be8ab93e feat(arch): finalize Universal Literate Note transition for all projects and skills

2026-03-31 16:14:37 -04:00

5.7 KiB

Raw Blame History

Token Optimization Strategy

Executive Summary
- Goal: Minimize inference costs while maximizing capability
Three-Tier Model Strategy
Provider Analysis
Context Optimization Strategies
Request Optimization
Smart Routing Rules
- Automatic Selection Logic
- Fallback Chain
Concrete Implementation
- Config Structure (openclaw.json)
- Monitoring & Alerts
Cost Projections
- Current Unknown Usage → Optimized
Immediate Actions
Long-term: Local Inference Path
- Minimum Viable Setup
- Full Self-Hosted

Executive Summary

Goal: Minimize inference costs while maximizing capability

Current approach: Single default model → Multi-tier, multi-provider strategy

Three-Tier Model Strategy

Tier 1: Fast/Cheap (80% of queries)

Purpose: Simple tasks, formatting, lookups
Models: Google Gemini Flash, Local models
Cost: $0-0.000001 per 1K tokens
Speed: Fastest

Tier 2: Balanced (18% of queries)

Purpose: Complex reasoning, code generation, analysis
Models: Gemini Pro, Claude Haiku, Llama 3 70B
Cost: $0.0001-0.003 per 1K tokens
Speed: Medium

Tier 3: High-Performance (2% of queries)

Purpose: Critical decisions, complex architecture, final review
Models: GPT-4, Claude Opus, Gemini Ultra
Cost: $0.01-0.03 per 1K tokens
Speed: Slower

Provider Analysis

Google AI Studio (Primary Recommended)

Model	Free Tier	Rate Limit	Best For
Gemini 2.0 Flash	300K tokens/day	60 req/min	Quick tasks, coding
Gemini 1.5 Flash	300K tokens/day	60 req/min	Fast responses
Gemini 1.5 Pro	300K tokens/day	60 req/min	Complex tasks

Cost: FREE (within limits)

OpenRouter.Aggregated (Secondary)

Model	Price/1K tokens	Context	Reliability
Qwen 3 235B	$0.0001-0.0003	128K	High
Mistral Large	$0.002-0.006	128K	High
Llama 4 405B	$0.0002-0.0005	128K	Medium
Free tier models	$0	Varies	Variable

OpenAI (Tier 3 only)

GPT-4: $0.03/1K tokens (expensive)
GPT-4o: $0.005/1K tokens (better value)
Use sparingly for critical tasks only

Local Inference (Long-term goal)

Hardware: $1000-5000 initial investment
Ongoing: $0 (electricity only)
Models: Llama 3, Mistral, DeepSeek
Best for: High-volume, privacy-sensitive work

Context Optimization Strategies

1. Context Windows by Task Type

Task Type	Optimal Context	Compression	Savings
Code review	4K-8K	Truncate old files	50%
Documentation	8K-16K	Summarize sections	30%
Research	16K-32K	Chunk + RAG	70%
Architecture	32K-128K	Maintain full	0%

2. Conversation Pruning

Remove "thinking" blocks from history
Summarize conversation every 10 turns
Archive old sessions to external storage

3. RAG vs. Full Context

Rule: < 5K tokens of context → Full
Rule: > 10K tokens of context → Use embeddings/RAG
Savings: 60-80% on large document tasks

Request Optimization

Batching Strategy

Group similar requests (3-5 per batch)
Same model, same parameters
Shared overhead costs

Caching Strategy

Cache embeddings for repeated contexts
Store common completions (templates)
Reuse code snippet suggestions

Streaming vs. Non-Stream

Streaming: Better UX, but higher token overhead
Non-stream: More efficient for programmatic use
Recommendation: Non-stream for background tasks

Smart Routing Rules

Automatic Selection Logic

``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free)

ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier)

ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o

ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ```

Fallback Chain

Try Gemini (free)
If rate limited → OpenRouter (cheap)
If quality insufficient → GPT-4o
If critical failure → GPT-4

Concrete Implementation

Config Structure (openclaw.json)

```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ```

Monitoring & Alerts

Track daily token usage per provider
Alert at 80% of free tier limits
Monthly budget review and adjustment

Cost Projections

Current Unknown Usage → Optimized

Scenario	Monthly Tokens	Current Cost	Optimized Cost	Savings
Light (< 1M)	1M	$50-100	$0-10	90%
Medium (1-5M)	3M	$200-500	$20-100	80%
Heavy (5-20M)	10M	$1000-3000	$200-500	80%

Immediate Actions

Week 1: Setup

Configure Gemini as primary provider
Set up OpenRouter fallback
Implement basic usage tracking
Document current baseline

Week 2: Implement

Add smart routing logic
Implement context compression
Set up budget alerts
A/B test model choices

Week 3: Optimize

Analyze usage patterns
Fine-tune routing rules
Tune context windows
Document findings

Week 4: Scale

Full multi-provider setup
Implement full caching
Maximize free tier usage
Plan for paid tiers if needed

Long-term: Local Inference Path

Minimum Viable Setup

Hardware: RTX 4090 or Apple Silicon M3 Max
Software: Ollama + OpenClaw integration
Cost: ~$2000-4000 one-time
Break-even: 3-6 months vs. API costs

Full Self-Hosted

Hardware: Dual RTX 4090 or 2x Mac Studio
Models: Llama 3 70B, Mixtral 8x22B
Cost: ~$8000-12000
For: Privacy, unlimited inference, control

5.7 KiB Raw Blame History