agent-harness/COST-OPTIMIZATION.md

# Cost Optimization — Getting More Work Per Dollar
> AI model subscriptions have fundamentally different billing models.
> Understanding them is the difference between $5 and $50 for the same output.
> This guide teaches you to match your work patterns to your billing model.
---
## The Two Billing Models
### Request-Based (GitHub Copilot Pro)
- **What counts:** Each API call = 1 request
- **Multipliers:** Advanced models cost more per request (e.g., Opus 4.6 = 3x)
- **Key insight:** A 2-second request costs the same as a 10-minute request
- **Budget:** Fixed monthly request pool (e.g., 300 premium requests/month)
**What's expensive:** Many small requests
**What's cheap:** Few large requests that do lots of work
### Token-Based (Anthropic Claude Pro)
- **What counts:** Input tokens + output tokens consumed
- **Windows:** Per-session (5-hour) and weekly token budgets
- **Key insight:** Context grows with every turn — turn 50 includes ALL previous turns as input
- **Danger zone:** Long conversations burn tokens quadratically, since each new turn re-sends the whole history
**What's expensive:** Large context windows, long conversations
**What's cheap:** Fresh sessions with minimal context
---
## How Context Affects Cost
### The Context Growth Problem (Token-Based)
Every message in a conversation gets re-sent as context:
```
Turn 1: Input = system prompt (2K tokens) → Total: 2K
Turn 5: Input = system prompt + 4 prior turns (8K) → Total: 8K
Turn 10: Input = system prompt + 9 prior turns (20K) → Total: 20K
Turn 20: Input = system prompt + 19 prior turns (50K) → Total: 50K
Turn 30: Input = system prompt + 29 prior turns (90K) → Total: 90K
```
By turn 30, you're burning 90K input tokens PER TURN just for context. The actual
new content might be 500 tokens, but you're paying for the full history every time.
**This is why sub-agents are token-efficient on Anthropic:**
Each sub-agent starts with ~2K tokens of context (just the task prompt), regardless
of how long your main conversation has been running.
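This growth can be sketched numerically. A minimal cost model, assuming a 2K-token system prompt and roughly 2K new tokens per turn (illustrative figures, not measured provider behavior):
```python
def conversation_input_tokens(turns, system_prompt=2_000, tokens_per_turn=2_000):
    """Cumulative input tokens for one long conversation.

    Turn n re-sends the system prompt plus all n-1 prior turns,
    so the total grows quadratically with turn count.
    """
    return sum(system_prompt + (turn - 1) * tokens_per_turn
               for turn in range(1, turns + 1))

def subagent_input_tokens(tasks, prompt_tokens=2_000):
    """Fresh sub-agents: one small task prompt each, no shared history."""
    return tasks * prompt_tokens

print(conversation_input_tokens(30))  # 930000 (~930K tokens)
print(subagent_input_tokens(30))      # 60000 (~60K tokens)
```
Under these assumptions, 30 tasks in one conversation cost roughly 15x what 30 fresh sub-agents cost, which is the same effect the turn-by-turn table above illustrates.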
### Context Doesn't Matter (Request-Based)
On Copilot, a request with 2K context costs the same as a request with 100K context.
It's still 1 request. So for request-based billing:
- Let context accumulate — it's free
- Pack more work into each request
- Don't spawn sub-agents unnecessarily (each spawn = new request)
---
## Optimal Strategies by Subscription
### GitHub Copilot Pro — Batch Everything
**Goal:** Maximize work per request
**Pattern: Fat Sub-Agents**
```
# BAD: 5 requests × 3 multiplier = 15 premium requests
sessions_spawn("Do task 1") → 1 request
sessions_spawn("Do task 2") → 1 request
sessions_spawn("Do task 3") → 1 request
sessions_spawn("Do task 4") → 1 request
sessions_spawn("Do task 5") → 1 request
# GOOD: 1 request × 3 multiplier = 3 premium requests
sessions_spawn("Do tasks 1-5 sequentially. For each:
implement, test, commit, then move to the next.")
```
**Same work, 80% cheaper.**
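The request arithmetic generalizes to any batch size. A minimal sketch, assuming the 3x premium multiplier from the example above (the multiplier value is illustrative):
```python
import math

def premium_requests(tasks: int, batch_size: int, multiplier: int = 3) -> int:
    """Premium requests consumed when `tasks` are split into spawns of
    `batch_size` tasks each, at the given model multiplier."""
    spawns = math.ceil(tasks / batch_size)
    return spawns * multiplier

print(premium_requests(5, batch_size=1))  # 15 premium requests (spawn per task)
print(premium_requests(5, batch_size=5))  # 3 premium requests (one fat spawn)
```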
**Pattern: Compound Tasks**
```
# BAD: 3 separate requests
"Review the code" → 1 request
"Fix the issues you found" → 1 request
"Update the docs" → 1 request
# GOOD: 1 compound request
"Review the code, fix any issues you find, and update the
docs to reflect the changes. Commit each fix separately."
```
**Pattern: Agent Harness with Multi-Task Iterations**
```
# Modified AGENT-INSTRUCTIONS.md for Copilot:
### 3. Pick Tasks
- Find the NEXT 3-5 unchecked tasks in IMPLEMENTATION_PLAN.md
- Complete ALL of them in this iteration
- Commit after each task (for clean git history)
- Then exit for a fresh context restart
```
**When to use Copilot models:**
- Long autonomous coding sessions (lots of tool calls = still 1 request)
- Complex multi-step tasks (planning + implementation + testing)
- Agent harness iterations (pack 3-5 tasks per iteration)
- Overnight batch work
**When NOT to use Copilot models:**
- Quick questions ("What time is it in Tokyo?")
- Simple file reads or lookups
- Anything you could do with a cheaper/free model
### Anthropic Claude Pro — Stay Lean
**Goal:** Minimize token consumption per interaction
**Pattern: Fresh Sub-Agents**
```
# GOOD on Anthropic: Each sub-agent starts with clean context
sessions_spawn("Do task 1") → ~5K tokens (fresh context)
sessions_spawn("Do task 2") → ~5K tokens (fresh context)
sessions_spawn("Do task 3") → ~5K tokens (fresh context)
Total: ~15K tokens
# BAD on Anthropic: One long conversation
Main session turn 1: "Do task 1" → 3K input tokens
Main session turn 5: "Do task 2" → 15K input tokens
Main session turn 10: "Do task 3" → 35K input tokens
Total: ~53K tokens (3.5x more!)
```
**Pattern: Minimal Context Agent Instructions**
```
# BAD: Agent reads entire spec every iteration
"Read PROJECT-SPEC.md (5000 tokens), IMPLEMENTATION_PLAN.md (2000 tokens),
DECISIONS.md (1500 tokens), and the last 20 git commits..."
# GOOD: Agent reads only what it needs
"Read IMPLEMENTATION_PLAN.md. Find the first unchecked task.
Read ONLY the relevant section of PROJECT-SPEC.md for that task.
Implement, test, commit."
```
**Pattern: Offload to Cheaper Models**
```
# Use Sonnet (cheaper) for routine work
sessions_spawn("Implement the CRUD endpoints", model: "sonnet")
# Use Opus (expensive) only for complex reasoning
sessions_spawn("Design the Monte Carlo simulation algorithm", model: "opus")
```
**When to use Anthropic models:**
- Quick interactions in main session (small context = few tokens)
- Tasks requiring strong reasoning (Opus quality)
- Sub-agent swarms (fresh context each time)
**When NOT to use Anthropic models:**
- Long main-session conversations (context grows = token burn)
- Low-complexity tasks (use a cheaper model)
- Repetitive iterations (context grows even with similar content)
---
## Model Selection Guide
### By Task Complexity
| Task | Recommended | Why |
|------|-------------|-----|
| Planning & decomposition | Opus (either provider) | Needs strong reasoning |
| Scaffolding & config | Sonnet or GPT-4.1 | Simple, deterministic |
| Feature implementation | Sonnet | Good balance |
| Complex algorithms | Opus | Deep reasoning needed |
| Bug diagnosis | Opus | Pattern recognition |
| Bug fixing | Sonnet | Usually straightforward once diagnosed |
| Documentation | Sonnet or GPT-4.1 | Writing, not complex reasoning |
| Code review | Opus | Needs to spot subtle issues |
| Test writing | Sonnet | Follows patterns from spec |
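The table above can be encoded as a simple routing default. A sketch with hypothetical category names; the `"sonnet"`/`"opus"` labels follow the `sessions_spawn` examples in this guide:
```python
# Categories that justify the expensive model; everything else stays cheap.
OPUS_TASKS = {"planning", "complex-algorithm", "bug-diagnosis", "code-review"}

def pick_model(task_category: str) -> str:
    """Default to Sonnet; upgrade only for deep-reasoning categories."""
    return "opus" if task_category in OPUS_TASKS else "sonnet"

print(pick_model("bug-diagnosis"))  # opus
print(pick_model("documentation"))  # sonnet
```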
### By Provider Optimization
| Scenario | Best Provider | Reasoning |
|----------|--------------|-----------|
| 5 tasks in agent harness | Copilot (batch 5 tasks = 1 request) | Request efficiency |
| Quick "what's the status?" | Anthropic (small context) | Token efficiency |
| Overnight autonomous loop | Copilot (fewer requests total) | Request efficiency |
| Sub-agent swarm (10 agents) | Anthropic (fresh context each) | Token efficiency |
| Long planning conversation | Copilot (context growth is free) | Request efficiency |
| One-shot code generation | Either (1 request, small context) | Similar cost |
---
## The Hybrid Strategy
Use both subscriptions strategically:
```
Morning check-in (main session): Anthropic Sonnet (small context, quick)
Planning conversation: Copilot Opus (context growth is free)
Agent harness iterations: Copilot Sonnet (batch tasks, 1 request each)
Complex debugging: Copilot Opus (1 request, deep reasoning)
Quick questions throughout the day: Anthropic Sonnet (minimal tokens)
Overnight autonomous work: Copilot Sonnet (batch tasks, few requests)
```
### Budget Allocation Example
Monthly budget:
- Copilot Pro: 300 premium requests (Opus = 3x, Sonnet = 1x)
- Anthropic Pro: Weekly token budget (resets Sundays)
**Agent harness project (20 tasks):**
```
Copilot approach:
Planning: 1 Opus request = 3 premium → 3
20 tasks ÷ 5 per batch = 4 Sonnet requests → 4
Code review: 2 Opus requests = 6 premium → 6
Total: 13 premium requests
Anthropic approach:
Planning: 1 session = ~10K tokens
20 tasks × 1 session each = 20 sessions × ~8K → 160K tokens
Code review: 2 sessions × ~15K → 30K tokens
Total: ~200K tokens (could eat a chunk of weekly budget)
```
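The totals reduce to arithmetic worth sanity-checking, using only the numbers from the example above:
```python
# Copilot: premium requests (Opus = 3x, Sonnet = 1x)
planning = 1 * 3                # 1 Opus planning request
iterations = (20 // 5) * 1      # 20 tasks, batched 5 per Sonnet request
review = 2 * 3                  # 2 Opus review requests
copilot_total = planning + iterations + review
print(copilot_total)            # 13 premium requests

# Anthropic: tokens across fresh sessions
anthropic_total = 10_000 + 20 * 8_000 + 2 * 15_000
print(anthropic_total)          # 200000 tokens
```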
For an agent harness project, **Copilot is usually cheaper** because you can batch.
For daily conversational use, **Anthropic is usually cheaper** because most interactions are short.
---
## Anti-Patterns to Avoid
### 1. The Chatty Agent (Anthropic killer)
```
Turn 1: "What should I work on?" ← Wastes a turn
Turn 2: "I'll start with the parser" ← Wastes a turn
Turn 3: "Here's my plan..." ← Wastes a turn
Turn 4: *actually starts working*
# Fix: Give clear instructions upfront so the agent works immediately
```
### 2. The Spawn Happy Pattern (Copilot killer)
```
sessions_spawn("Read the plan") ← 1 request for reading?!
sessions_spawn("Pick the next task") ← 1 request for picking?!
sessions_spawn("Implement the task") ← Finally useful
sessions_spawn("Run the tests") ← 1 request for one command?!
# Fix: One spawn that does all four steps
```
### 3. The "Let Me Check" Loop (Both killers)
```
"Check if the build passes" → agent runs build, reports back
"OK now run the tests" → agent runs tests, reports back
"OK now check the linter" → agent runs linter, reports back
# Fix: "Run build, tests, and linter. Report all results."
```
### 4. Using Opus for Everything
```
# Opus is 3x on Copilot, token-heavy on Anthropic
# Most tasks don't need it
# Fix: Default to Sonnet. Upgrade to Opus only for:
# - Planning and decomposition
# - Complex algorithm design
# - Subtle bug diagnosis
# - Architecture decisions
```
### 5. Ignoring Context Size (Anthropic killer)
```
Main session at turn 50: "Hey can you also check the weather?"
# That weather check paid 90K+ input tokens of accumulated context
# Fix: Use a sub-agent for unrelated tasks
# Or start a new session for new topics
```
---
## Monitoring Your Usage
### GitHub Copilot
- Check premium request usage at: github.com/settings/copilot
- Track requests per task in your daily memory notes
- Set alerts at 80% of monthly budget
### Anthropic Claude Pro
- Check usage at: claude.ai/settings/usage (subscription)
- API usage at: console.anthropic.com (if using API directly)
- Monitor "Current session X% used" — stop at 90%
- Weekly reset: Sundays at 11 AM ET
### OpenClaw Session Status
```
/status → Shows current model, session %, premium request %
```
### Logging Strategy
Track in your daily memory notes:
```markdown
## Model Usage — 2026-03-18
- Copilot premium requests: 12 used today (45/300 monthly)
- Anthropic session: 35% used (resets in 3 days)
- Tasks completed: 8
- Cost per task: ~1.5 premium requests average
```
---
## Quick Reference Card
```
┌─────────────────────────────────────────────────┐
│ COST OPTIMIZATION CHEAT SHEET │
├─────────────────────────────────────────────────┤
│ │
│ COPILOT PRO (request-based): │
│ ✅ Batch tasks into one request │
│ ✅ Let context grow (it's free) │
│ ✅ Long sessions with many tool calls │
│ ❌ Don't spawn many small sub-agents │
│ ❌ Don't use Opus for simple tasks (3x!) │
│ │
│ ANTHROPIC PRO (token-based): │
│ ✅ Fresh sub-agents (clean context) │
│ ✅ Short, focused interactions │
│ ✅ Use Sonnet for most work │
│ ❌ Don't let main session context grow │
│ ❌ Don't have long planning conversations │
│ │
│ GENERAL: │
│ • Sonnet for building, Opus for thinking │
│ • Batch related work, split unrelated work │
│ • Monitor usage daily, adjust weekly │
│ • When in doubt, use the cheaper model first │
│ │
└─────────────────────────────────────────────────┘
```
---
_The cheapest token is the one you don't spend. The cheapest request is the one that does five things._