amr/org-agent-contrib

Fork 0

Files

Amr Gharbeia c5d3d8c5fa chore: Audit terminology (org-agent -> opencortex)

2026-04-14 16:03:37 -04:00

16 KiB

Raw Blame History

SKILL: Token Accountant Agent (Universal Literate Note) Token Management & Model Optimization Research Token Optimization - $50 Monthly Budget Token Optimization Token Optimization - Quick Start Token Optimization Strategy

Overview
Phase A: Demand (PRD)
- 1. Purpose
Phase B: Blueprint (PROTOCOL)
- 1. Architectural Intent
- 2. Semantic Interfaces
  - Routing and Pain Management
Registration
Documentation (Token Optimization)
- research.org
Token Management Strategy Research
- Initial Findings
  - OpenRouter Free Tier
  - Google AI Studio (Gemini)
- Research Questions
- To Research Further
- Token Optimization Strategies to Explore
- X Account Access
- budget-50.org
Budget: $50/Month
- Budget Breakdown
- Daily Free Allowance
- Paid Tier Allocation ($45)
- Hard Limits
- Daily Tracking
- Emergency Protocol
- Cost-Per-Task Guidelines
- Optimization Imperative
- Monthly Review
- Break-Even Analysis
- Questions for Human Partner
- README.org
Token Optimization
Project Tasks
Key Documents
Current Focus
- quick-start.org
Quick Reference for Daily Use
- Rule of Thumb
- Free Tier Limits (Daily)
- Current Recommendation
- This will reduce token costs by ~90%
- Next Steps
- Savings Potential: $100-500/month → $10-50/month
- plan.org
Executive Summary
- Goal: Minimize inference costs while maximizing capability
Three-Tier Model Strategy
- Tier 1: Fast/Cheap (80% of queries)
- Tier 2: Balanced (18% of queries)
- Tier 3: High-Performance (2% of queries)
Provider Analysis
- Google AI Studio (Primary Recommended)
- OpenRouter.Aggregated (Secondary)
- OpenAI (Tier 3 only)
- Local Inference (Long-term goal)
Context Optimization Strategies
- 1. Context Windows by Task Type
- 2. Conversation Pruning
- 3. RAG vs. Full Context
Request Optimization
- Batching Strategy
- Caching Strategy
- Streaming vs. Non-Stream
Smart Routing Rules
- Automatic Selection Logic
- Fallback Chain
Concrete Implementation
- Config Structure (openclaw.json)
- Monitoring & Alerts
Cost Projections
- Current Unknown Usage → Optimized
Immediate Actions
- Week 1: Setup
- Week 2: Implement
- Week 3: Optimize
- Week 4: Scale
Long-term: Local Inference Path
- Minimum Viable Setup
- Full Self-Hosted

Overview

The Token Accountant is the governor of the Neural Engine. It manages the cost, reliability, and routing of LLM providers. Its primary mission is to ensure the PSF operates at maximum intelligence with minimum marginal cost by aggressively prioritizing subsidized free models when appropriate.

Phase A: Demand (PRD)

1. Purpose

Autonomously manage the provider cascade and model selection to optimize for cost, speed, and reliability.

Phase B: Blueprint (PROTOCOL)

1. Architectural Intent

Maintain a state-aware provider cascade that routes around "pain" (failures) and dynamically selects models based on task complexity.

2. Semantic Interfaces

Routing and Pain Management

(in-package :opencortex)

(defvar *provider-pain-table* (make-hash-table :test 'equal))

(defun token-accountant-record-pain (provider)
  "Marks a provider as 'pained' (failed). It will be de-prioritized."
  (setf (gethash provider *provider-pain-table*) (+ (get-universal-time) 600)) ; 10 min penalty
  (harness-log "ACCOUNTANT - Provider ~a de-prioritized due to failure." provider))

(defun token-accountant-get-cascade (context)
  "Returns a dynamic list of providers, routing around pained ones. Uses standardized gateway keywords."
  (let ((all-providers '(:openrouter :groq :gemini-api :ollama))
        (healthy nil)
        (pained nil)
        (now (get-universal-time)))
    (dolist (p all-providers)
      (if (> (or (gethash p *provider-pain-table*) 0) now)
          (push p pained)
          (push p healthy)))
    (append (nreverse healthy) (nreverse pained))))

(defun token-accountant-get-model-for-provider (provider &optional context)
  "Returns the recommended model for the provider, prioritizing free/subsidized models. Updated April 2026."
  (let ((complexity (ignore-errors (uiop:symbol-call :opencortex.skills.org-skill-router :router-classify-complexity context))))
    (case provider
      (:openrouter
       (case complexity
         (:REASONING "meta-llama/llama-3.3-70b-instruct:free") ; High fidelity, zero cost
         (:COGNITION "qwen/qwen3.6-plus:free")               ; Latest interaction, zero cost
         (t "meta-llama/llama-3.2-3b-instruct:free")))       ; Ultra-fast reflex, zero cost
      (:groq
       (case complexity
         (:REASONING "llama-3.3-70b-versatile")
         (t "llama-3.1-8b-instant")))
      (:gemini-api
       "gemini-1.5-flash-latest")
      (t nil))))

(defun token-accountant-patch-kernel ()
  "Hot-patches the harness's cascade and model selector to use our dynamic logic."
  (setf opencortex:*provider-cascade* #'token-accountant-get-cascade)
  (setf opencortex::*model-selector-fn* #'token-accountant-get-model-for-provider))

Registration

(progn
  (token-accountant-patch-kernel)
  (defskill :skill-token-accountant
    :priority 100
    :trigger (lambda (context) 
               (let ((sensor (getf (getf context :payload) :sensor)))
                 (or (eq sensor :tool-error) (eq sensor :cost-audit))))
    :neuro (lambda (context) nil)
    :symbolic (lambda (action context) 
                (let ((p (getf (getf context :payload) :provider)))
                  (when p (token-accountant-record-pain p))
                  action))))

Documentation (Token Optimization)

research.org

Token Management Strategy Research

Initial Findings

OpenRouter Free Tier

URL: https://openrouter.ai/collections/free-models
Providers moving from free to paid-only models
Belief: "Free models play crucial role in democratizing access"

Google AI Studio (Gemini)

Free tier available
Limits: 60 requests/minute, 300K tokens/day
No credit card required
Every API key gets these limits

Research Questions

Which providers offer free or low-cost tiers?
What are the rate limits and quotas?
Which models are best for which use cases?
How to optimize context windows?
What is the cost per token breakdown?

To Research Further

Provider	Free Tier	Paid Tier	Best For
Google Gemini	300K tokens/day	Pay per use?	General, coding
OpenRouter	Varies by model	Per-request	Routing, variety
OpenAI	?	?	GPT-4 quality
Anthropic	?	?	Claude capabilities
Mistral	?	?	Open weights
Local	Hardware cost	Free	Privacy, control

Token Optimization Strategies to Explore

Tiered Model Usage
- Simple tasks: Fast/cheap models
- Complex tasks: Stronger models
- Fallback: Lower tier if higher fails
Context Compression
- Summarize long contexts
- Use RAG instead of full context
- Prune old conversation
Caching
- Cache common responses
- Reuse embeddings
- Batch requests
Hybrid Approach
- Local models for simple queries
- Cloud APIs for complex tasks
- Manual review for critical outputs

X Account Access

Pending: X account access via Google login Blocker: Requires OTP from user per security rule (SOUL.md) Action needed: User provides OTP, I complete OAuth, access bookmarks

budget-50.org

Budget: $50/Month

Budget Breakdown

Tier	Provider	Allocation	Tokens Est.	Use Case
FREE	Google Gemini	$0	~9M/month	90% of work
CHEAP	OpenRouter	$20	~6M tokens	Fallback, complex tasks
PREMIUM	Claude/GPT-4o	$25	~500K tokens	Critical decisions
BUFFER	Various	$5	Emergency	Overruns, testing

Daily Free Allowance

Google Gemini: 300K tokens/day = 9M/month = $0
This covers 90-95% of expected workload

Paid Tier Allocation ($45)

$20 → OpenRouter (Qwen, Mistral, Llama)
- ~6M tokens at $0.003/1K
- Use when: Gemini rate limited, need different model
$25 → Premium models (Claude, GPT-4o)
- ~500K tokens at $0.05/1K average
- Use when: Architecture decisions, critical code review, final validation
$5 → Buffer
- Handle overruns
- Emergency access
- Testing new models

Hard Limits

Provider	Monthly Cap	Alert At
OpenRouter	$20	$16 (80%)
Premium	$25	$20 (80%)
Total	$50	$45 (90%)

Daily Tracking

Target: Monitor consumption every session

``` IF daily_cost > $1.50: → Switch to Gemini only → Defer premium tasks

IF weekly_cost > $12: → Review usage patterns → Find optimization opportunities ```

Emergency Protocol

If approaching $50 limit before month end:

Halt all paid API calls
Switch to Gemini-only mode
Queue premium tasks for next month
Consider local inference setup

Cost-Per-Task Guidelines

Task Type	Max Cost	Preferred Model
Quick lookup	$0.00	Gemini
Code review	$0.01	Gemini/OpenRouter
Feature design	$0.05	OpenRouter
Architecture review	$0.10	Claude/GPT-4o
Emergency debug	$0.20	Best available

Optimization Imperative

With $50/month, waste is not affordable:

❌ No speculative queries
❌ No "just curious" premium calls
❌ No repeated similar prompts
✅ Always use Gemini first
✅ Batch similar requests
✅ Cache embeddings locally
✅ Summarize long contexts

Monthly Review

Compare actual vs. projected usage
Adjust model routing rules
Identify expensive query patterns
Plan next month's allocation

Break-Even Analysis

At $50/month = $600/year:

Option A: Continue APIs (flexible, managed)
Option B: Local inference (~$800 hardware, $0 ongoing)
- Break-even: 16 months
- Risk: Hardware failure, maintenance

Recommendation: Stick with APIs until $100+/month, then evaluate hardware.

Questions for Human Partner

Is $50 firm or flexible in emergencies?
What happens if we hit limit mid-critical-task?
Preference for which premium model? (Claude vs GPT-4 vs both)
Should I track and report costs per project?
Any tasks that are "unlimited budget" critical?

README.org

Cost-effective LLM usage through smart routing, context compression, and multi-provider strategies.

Token Optimization

Strategy and implementation for minimizing LLM costs while maintaining quality.

Project Tasks

See the actionable tasks for this project in GTD.org > Projects > Token Optimization

Key Documents

Current Focus

Multi-provider setup (Gemini primary, OpenRouter fallback)
Usage tracking and budget alerts
Smart routing by task type
Context compression techniques

quick-start.org

Quick Reference for Daily Use

Rule of Thumb

What you need	Use this	Cost
Quick answer, formatting, lookup	Gemini Flash	FREE
Code review, analysis	Gemini Pro	FREE
Complex problem solving	Claude Haiku / Qwen	$
Critical architecture decision	GPT-4o	$$

Free Tier Limits (Daily)

Provider	Tokens	Requests	Reset
Google AI Studio	300,000	60/min	Daily
OpenRouter Free	Varies	Limited	-

Current Recommendation

→ Use Google Gemini exclusively until hitting 250K tokens/day → Then add OpenRouter fallback → Only use GPT-4 for final reviews

This will reduce token costs by ~90%

Next Steps

Configure Gemini as primary (already partially done)
Add quota tracking
Set alerts at 80% of free limits
Implement tiered routing

Savings Potential: $100-500/month → $10-50/month

plan.org

Executive Summary

Goal: Minimize inference costs while maximizing capability

Current approach: Single default model → Multi-tier, multi-provider strategy

Three-Tier Model Strategy

Tier 1: Fast/Cheap (80% of queries)

Purpose: Simple tasks, formatting, lookups
Models: Google Gemini Flash, Local models
Cost: $0-0.000001 per 1K tokens
Speed: Fastest

Tier 2: Balanced (18% of queries)

Purpose: Complex reasoning, code generation, analysis
Models: Gemini Pro, Claude Haiku, Llama 3 70B
Cost: $0.0001-0.003 per 1K tokens
Speed: Medium

Tier 3: High-Performance (2% of queries)

Purpose: Critical decisions, complex architecture, final review
Models: GPT-4, Claude Opus, Gemini Ultra
Cost: $0.01-0.03 per 1K tokens
Speed: Slower

Provider Analysis

Google AI Studio (Primary Recommended)

Model	Free Tier	Rate Limit	Best For
Gemini 2.0 Flash	300K tokens/day	60 req/min	Quick tasks, coding
Gemini 1.5 Flash	300K tokens/day	60 req/min	Fast responses
Gemini 1.5 Pro	300K tokens/day	60 req/min	Complex tasks

Cost: FREE (within limits)

OpenRouter.Aggregated (Secondary)

Model	Price/1K tokens	Context	Reliability
Qwen 3 235B	$0.0001-0.0003	128K	High
Mistral Large	$0.002-0.006	128K	High
Llama 4 405B	$0.0002-0.0005	128K	Medium
Free tier models	$0	Varies	Variable

OpenAI (Tier 3 only)

GPT-4: $0.03/1K tokens (expensive)
GPT-4o: $0.005/1K tokens (better value)
Use sparingly for critical tasks only

Local Inference (Long-term goal)

Hardware: $1000-5000 initial investment
Ongoing: $0 (electricity only)
Models: Llama 3, Mistral, DeepSeek
Best for: High-volume, privacy-sensitive work

Context Optimization Strategies

1. Context Windows by Task Type

Task Type	Optimal Context	Compression	Savings
Code review	4K-8K	Truncate old files	50%
Documentation	8K-16K	Summarize sections	30%
Research	16K-32K	Chunk + RAG	70%
Architecture	32K-128K	Maintain full	0%

2. Conversation Pruning

Remove "thinking" blocks from history
Summarize conversation every 10 turns
Archive old sessions to external storage

3. RAG vs. Full Context

Rule: < 5K tokens of context → Full
Rule: > 10K tokens of context → Use embeddings/RAG
Savings: 60-80% on large document tasks

Request Optimization

Batching Strategy

Group similar requests (3-5 per batch)
Same model, same parameters
Shared overhead costs

Caching Strategy

Cache embeddings for repeated contexts
Store common completions (templates)
Reuse code snippet suggestions

Streaming vs. Non-Stream

Streaming: Better UX, but higher token overhead
Non-stream: More efficient for programmatic use
Recommendation: Non-stream for background tasks

Smart Routing Rules

Automatic Selection Logic

``` IF task_type == "simple_lookup" OR "formatting": → Gemini Flash (free)

ELIF task_type == "code_generation" AND complexity < 3: → Gemini Pro (free tier)

ELIF task_type == "complex_reasoning" OR "architecture": → Claude Sonnet or GPT-4o

ELIF task_type == "final_review" OR "critical_decision": → GPT-4 or Claude Opus ```

Fallback Chain

Try Gemini (free)
If rate limited → OpenRouter (cheap)
If quality insufficient → GPT-4o
If critical failure → GPT-4

Concrete Implementation

Config Structure (openclaw.json)

```json { "models": { "defaults": { "primary": "google-gemini-cli/gemini-2.0-flash", "fallbacks": [ "openrouter/qwen/qwen3-235b-a22b", "google-gemini-cli/gemini-1.5-pro", "openai/gpt-4o" ] }, "providers": { "google-gemini-cli": { "freeTier": true, "dailyLimit": 300000, "rateLimit": 60 }, "openrouter": { "freeTierModels": ["openrouter/auto"], "budgetLimit": 500 }, "openai": { "budgetLimit": 200, "useFor": ["critical", "architecture"] } } } } ```

Monitoring & Alerts

Track daily token usage per provider
Alert at 80% of free tier limits
Monthly budget review and adjustment

Cost Projections

Current Unknown Usage → Optimized

Scenario	Monthly Tokens	Current Cost	Optimized Cost	Savings
Light (< 1M)	1M	$50-100	$0-10	90%
Medium (1-5M)	3M	$200-500	$20-100	80%
Heavy (5-20M)	10M	$1000-3000	$200-500	80%

Immediate Actions

Week 1: Setup

Configure Gemini as primary provider
Set up OpenRouter fallback
Implement basic usage tracking
Document current baseline

Week 2: Implement

Add smart routing logic
Implement context compression
Set up budget alerts
A/B test model choices

Week 3: Optimize

Analyze usage patterns
Fine-tune routing rules
Tune context windows
Document findings

Week 4: Scale

Full multi-provider setup
Implement full caching
Maximize free tier usage
Plan for paid tiers if needed

Long-term: Local Inference Path

Minimum Viable Setup

Hardware: RTX 4090 or Apple Silicon M3 Max
Software: Ollama + OpenClaw integration
Cost: ~$2000-4000 one-time
Break-even: 3-6 months vs. API costs

Full Self-Hosted

Hardware: Dual RTX 4090 or 2x Mac Studio
Models: Llama 3 70B, Mixtral 8x22B
Cost: ~$8000-12000
For: Privacy, unlimited inference, control

16 KiB Raw Blame History