# Project Specification: Agent Harness System
## 1. Project Overview
### What are we building?
The Agent Harness System is a collection of templates, scripts, and best practices for running autonomous AI-powered coding agents on complex software projects. It provides a structured framework to decompose large projects into manageable tasks, execute them iteratively with fresh agent contexts, and maintain high-quality code through mandatory testing and verification.
### Why does it matter?
Traditional AI coding assistants struggle with large, multi-step projects due to context window limitations and the need for iterative refinement. The Agent Harness addresses this by providing a "Ralph Wiggum Loop" mechanism that spawns fresh agents for each task iteration, preventing context drift while maintaining project coherence through structured documentation and git-based memory.
### Success criteria
- [ ] Agents can autonomously decompose complex project specs into testable tasks
- [ ] Fresh agent iterations prevent context overflow and stale reasoning
- [ ] Mandatory build/test cycles ensure code quality
- [ ] Git history serves as reliable inter-iteration memory
- [ ] System works with multiple AI agents (Claude, Codex, etc.)
- [ ] Clear signals for completion, stuck states, and errors
- [ ] Comprehensive documentation enables easy adoption
---
## 2. Technical Foundation
### Tech stack
- **Language:** Bash (for the loop script), Markdown (for templates)
- **Tools:** Git, shell commands, AI agent CLIs (claude, codex)
- **Build system:** N/A (templates for various project types)
- **Test framework:** Project-specific (agents run their own tests)
- **Package manager:** N/A
### Project structure
```
docs/agent-harness/
├── README.md # Quick overview and file purposes
├── AGENT-INSTRUCTIONS.md # Template for agent system prompts
├── PROJECT-SPEC.md # Template for project specifications
├── ralph-loop.sh # The loop execution script
└── EXAMPLES.md # Worked examples and best practices
```
### Build & test commands
The harness itself doesn't have build/test commands, but agents using it must define them in their PROJECT-SPEC.md.
### Coding standards
- Markdown files use consistent formatting with headers, lists, code blocks
- Bash scripts use `set -euo pipefail` for error handling
- Templates include clear placeholders and examples
- Documentation focuses on actionable, specific guidance
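As a minimal illustration of these standards, a harness script might begin like this (the `log` helper is hypothetical, not part of the shipped templates):

```shell
#!/usr/bin/env bash
# Fail fast: abort on errors (-e), unset variables (-u), and pipe failures.
set -euo pipefail

# Hypothetical helper: timestamped output keeps iteration logs greppable.
log() {
  printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*"
}

log "harness starting"
```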
---
## 3. Requirements
### Functional Requirements
#### FR-001: Project Specification Template
**Description:** A comprehensive template that captures all necessary project details for autonomous agent work.
**Acceptance criteria:**
- [ ] Covers project overview, technical foundation, requirements, data models
- [ ] Includes phasing for large projects
- [ ] Provides reference materials and anti-patterns
- [ ] Enables agents to work without human intervention
#### FR-002: Agent Instructions Template
**Description:** System prompt template that defines agent behavior, the core loop, and rules.
**Acceptance criteria:**
- [ ] Defines senior engineer role with full codebase access
- [ ] Specifies exact sequence: orient → plan → pick task → implement → verify → commit → exit
- [ ] Includes output signals for loop control (`<promise>` tags)
- [ ] Enforces one-task-per-iteration rule
#### FR-003: Ralph Wiggum Loop Script
**Description:** Bash script that orchestrates agent iterations with fresh contexts.
**Acceptance criteria:**
- [ ] Spawns fresh agent processes each iteration
- [ ] Supports planning mode and build mode
- [ ] Monitors output signals for completion/stuck/error states
- [ ] Logs all iterations for debugging
- [ ] Configurable max iterations and agent type
#### FR-004: Implementation Plan Management
**Description:** Dynamic task decomposition and tracking system.
**Acceptance criteria:**
- [ ] Agents create IMPLEMENTATION_PLAN.md from project spec
- [ ] Tasks ordered by dependency with checkboxes
- [ ] Plan updated after each completed task
- [ ] Git commits preserve plan history
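One way an agent could update the plan at commit time is sketched below. The `mark_task_done` helper and the exact checkbox format are assumptions, and GNU `sed -i` is assumed (BSD sed needs `-i ''`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: flip "- [ ] <task>" to "- [x] <task>" in the plan.
# The task string is used verbatim in the sed pattern, so it must not
# contain regex metacharacters (acceptable for an illustration).
mark_task_done() {
  local plan="$1" task="$2"
  sed -i "s/^- \[ \] ${task}\$/- [x] ${task}/" "$plan"
}

# Usage (after implementation and passing tests):
#   mark_task_done IMPLEMENTATION_PLAN.md "Add QFX parser"
#   git add -A && git commit -m "feat: add QFX parser; plan updated"
```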
#### FR-005: Quality Assurance Integration
**Description:** Mandatory build and test verification in each iteration.
**Acceptance criteria:**
- [ ] Agents run project-specific build commands
- [ ] All tests must pass before committing
- [ ] Build failures prevent progression
- [ ] Linting enforced if configured
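A sketch of that gate, assuming the build and test commands come from the project's spec (`BUILD_CMD`/`TEST_CMD` are illustrative names, not harness variables):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative defaults; a real project would set e.g.
# BUILD_CMD="npm run build" and TEST_CMD="npm test".
BUILD_CMD="${BUILD_CMD:-true}"
TEST_CMD="${TEST_CMD:-true}"

# Return non-zero unless both build and tests succeed.
# The variables are expanded unquoted on purpose so commands with
# arguments word-split as expected.
verify() {
  $BUILD_CMD || { echo "build failed; not committing" >&2; return 1; }
  $TEST_CMD  || { echo "tests failed; not committing" >&2; return 1; }
}

if verify; then
  echo "verification passed"
  # git add -A && git commit -m "<task description>"
fi
```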
### Non-Functional Requirements
#### NFR-001: Simplicity
- [ ] No complex dependencies or frameworks
- [ ] Works with standard shell and git
- [ ] Easy to copy templates into any project
- [ ] Minimal setup required
#### NFR-002: Reliability
- [ ] Fresh contexts prevent reasoning drift
- [ ] Git history provides audit trail
- [ ] Clear error signals for human intervention
- [ ] Handles agent failures gracefully
#### NFR-003: Flexibility
- [ ] Supports multiple AI agents (Claude, Codex, etc.)
- [ ] Works with various project types and tech stacks
- [ ] Configurable iteration limits and modes
- [ ] Extensible for custom workflows
---
## 4. Data Model
The Agent Harness is documentation-focused, not data-focused. The "data" is the project files themselves.
### Entities
**Entity: Project Spec**
- Overview: what/why/success criteria
- Technical foundation: stack, structure, commands
- Requirements: functional/non-functional
- Data model: project-specific entities
- Architecture: constraints, decisions
- Phasing: optional breakdown
- References: docs, examples, anti-patterns
**Entity: Implementation Plan**
- Tasks: discrete, testable, dependency-ordered
- Status: checkbox per task
- Notes: agent comments on stuck tasks
- History: git commits track plan evolution
**Entity: Agent Iteration**
- Context: fresh read of spec + plan + git log
- Task: one unchecked item from plan
- Changes: code modifications + tests
- Verification: build + test results
- Commit: descriptive message + plan update
### Relationships
- Project Spec → Implementation Plan (agent creates from spec)
- Implementation Plan → Agent Iterations (one task per iteration)
- Agent Iterations → Git Commits (each iteration commits changes)
---
## 5. API / Interface Design
The harness provides command-line interfaces:
### ralph-loop.sh Commands
```bash
./ralph-loop.sh # Build mode (default)
./ralph-loop.sh plan # Planning mode
./ralph-loop.sh --max 20 # Limit iterations
./ralph-loop.sh --agent claude # Specify agent
```
### Template Files
- PROJECT-SPEC.md: Fill with project details
- AGENT.md: Copy from AGENT-INSTRUCTIONS.md
- IMPLEMENTATION_PLAN.md: Generated by agent
### Output Signals
Agents output special tags that the loop monitors:
- `<promise>PLANNED</promise>`: Plan created
- `<promise>DONE</promise>`: All tasks complete
- `<promise>STUCK</promise>`: Needs human help
- `<promise>ERROR</promise>`: Unrecoverable error
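A loop watching for these tags might classify an iteration like the following sketch (the log-file handling is illustrative):

```shell
# Map an iteration's log to one of the signal states above.
check_signal() {
  local log_file="$1"
  if grep -q '<promise>DONE</promise>' "$log_file"; then echo DONE
  elif grep -q '<promise>PLANNED</promise>' "$log_file"; then echo PLANNED
  elif grep -q '<promise>STUCK</promise>' "$log_file"; then echo STUCK
  elif grep -q '<promise>ERROR</promise>' "$log_file"; then echo ERROR
  else echo NONE
  fi
}

# Usage: state="$(check_signal .ralph-logs/iteration-3.log)"
```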
---
## 6. Architecture Decisions
### Constraints
- MUST: Use fresh agent contexts each iteration
- MUST: One task per agent iteration
- MUST: Mandatory build/test verification
- MUST NOT: Allow context compaction or memory accumulation
- PREFER: Git as the coordination mechanism
- PREFER: Simple bash orchestration over complex frameworks
### Dependencies
- Git (version control)
- AI agent CLI (claude, codex, etc.)
- Shell environment (bash)
- Project-specific build tools (npm, etc.)
### Known Challenges
- Context window limitations of AI agents
- Maintaining coherence across iterations
- Handling agent failures or stuck states
- Balancing specificity vs flexibility in templates
---
## 7. Phasing (Optional)
The harness itself is complete in one phase, but projects using it should phase their work.
### Phase 1: Foundation
- [ ] Copy templates into project
- [ ] Fill PROJECT-SPEC.md
- [ ] Run planning mode to create IMPLEMENTATION_PLAN.md
### Phase 2: Execution
- [ ] Run build iterations until completion
- [ ] Monitor for stuck/error signals
- [ ] Intervene as needed
### Phase 3: Refinement
- [ ] Review final codebase
- [ ] Update templates based on lessons learned
- [ ] Document improvements for future use
---
## 8. Reference Materials
### External docs
- Geoffrey Huntley's Ralph Wiggum approach
- Nate Jones's task decomposition method
- Ezward's sequential PRD style
- OpenClaw sessions_spawn documentation
### Existing code to learn from
- ralph-loop.sh: Clean bash scripting with error handling
- Templates: Structured markdown with clear sections
- Examples: Real-world project specifications
### Anti-patterns
- Don't try to pass context between iterations
- Don't let agents work on multiple tasks simultaneously
- Don't skip build/test verification
- Don't use complex orchestration when bash loop suffices
- Don't make templates too rigid — they should be adapted per project
---
## All Template Files and Their Roles
### AGENT-INSTRUCTIONS.md
**Role:** System prompt template for the AI agent. Defines the senior engineer role, core workflow loop, strict rules, and output signals. Agents read this each iteration to understand their behavior.
**Key Sections:**
- Role definition and capabilities
- Core loop: orient → plan/pick → implement → verify → commit → exit
- Rules: one task per iteration, mandatory testing, no over-engineering
- Output signals: `<promise>` tags for loop control
- Context management: fresh starts with git as memory
### PROJECT-SPEC.md
**Role:** Comprehensive project definition template. The single source of truth that agents read every iteration. Captures all requirements, constraints, and context needed for autonomous work.
**Key Sections:**
- Project overview (what, why, success criteria)
- Technical foundation (stack, structure, commands)
- Detailed requirements (functional + non-functional)
- Data models and API design
- Architecture decisions and constraints
- Phasing and reference materials
### ralph-loop.sh
**Role:** Bash script implementing the Ralph Wiggum Loop mechanism. Orchestrates agent iterations, monitors completion signals, handles errors, and maintains logs.
**Key Features:**
- Fresh agent spawning each iteration
- Planning mode vs build mode
- Signal monitoring (`<promise>` tags)
- Configurable agents and iteration limits
- Comprehensive logging
### EXAMPLES.md
**Role:** Worked examples, comparisons of approaches, and best practices. Shows how to write good specs, compares different methodologies, and provides integration examples.
**Key Content:**
- Comparison of Ezward/Ralph/Nate approaches
- Complete FinPlan project spec example
- Best practices for spec writing
- OpenClaw integration examples
## The Ralph Wiggum Loop Mechanism
The Ralph Wiggum Loop is named for the Simpsons character who forgets everything immediately: each iteration starts fresh with no memory of the last. This forgetting-by-design is the core innovation:
### How It Works
1. **Fresh Context Each Time:** Every iteration spawns a completely new agent process with no accumulated context from previous runs.
2. **Read-Only Memory:** Agents rely on:
- PROJECT-SPEC.md (static requirements)
- IMPLEMENTATION_PLAN.md (current task status)
- Git log (recent changes)
- Codebase state
- Test results
3. **One Task Per Iteration:** Agents pick exactly one unchecked task, implement it completely, verify with build/tests, commit, and exit.
4. **Signal-Based Control:** Agents output `<promise>` tags that the bash loop monitors to determine next action.
5. **Git as Coordination:** Each iteration's changes are committed, creating an audit trail and allowing the next agent to see what was done.
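From the agent's side, the orientation step amounts to re-reading those sources at the start of each iteration; a rough sketch (file names per the spec, commands illustrative):

```shell
# Re-read the agent's only "memory" at the start of an iteration.
orient() {
  for f in PROJECT-SPEC.md IMPLEMENTATION_PLAN.md; do
    if [ -f "$f" ]; then cat "$f"; fi
  done
  # Recent history from previous iterations (ignored outside a git repo).
  git log --oneline -20 2>/dev/null || true
}
```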
### Benefits
- Prevents context window overflow
- Eliminates stale reasoning problems
- Enables indefinite project scaling
- Provides clear intervention points
- Maintains code quality through iteration
### Flow Diagram
```
Start Loop
├── Read PROJECT-SPEC.md
├── Run Agent with Fresh Context
├── Agent: Orient (read plan, git log)
├── Agent: Pick ONE Task
├── Agent: Implement + Verify
├── Agent: Commit + Mark Done
├── Check Output Signals
├── If DONE: Exit Success
├── If STUCK/ERROR: Exit with Warning
└── Else: Loop Again
```
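Assuming an agent CLI that reads its instructions on stdin and a `.ralph-logs/` directory, the diagram above reduces to roughly this loop. This is a sketch, not the shipped `ralph-loop.sh`:

```shell
#!/usr/bin/env bash
set -euo pipefail

AGENT_CMD="${AGENT_CMD:-claude}"   # assumed agent CLI; substitute your own
LOG_DIR="${LOG_DIR:-.ralph-logs}"

# One fresh agent process per iteration; files and git history are the
# only memory that survives between runs.
run_loop() {
  local max="${1:-50}" i log_file
  mkdir -p "$LOG_DIR"
  for i in $(seq 1 "$max"); do
    log_file="$LOG_DIR/iteration-$i.log"
    "$AGENT_CMD" < AGENT.md > "$log_file" 2>&1 || true
    if grep -q '<promise>DONE</promise>' "$log_file"; then
      echo "all tasks complete after $i iteration(s)"; return 0
    elif grep -qE '<promise>(STUCK|ERROR)</promise>' "$log_file"; then
      echo "agent needs help; see $log_file" >&2; return 1
    fi
  done
  echo "hit max iterations ($max) without DONE" >&2
  return 1
}

# Usage: AGENT_CMD=claude run_loop 50
```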
## How to Use for Autonomous Coding Workflows
### Quick Start
1. Copy templates into your project root
2. Fill out PROJECT-SPEC.md with complete project details
3. Run `./ralph-loop.sh plan` to generate IMPLEMENTATION_PLAN.md
4. Run `./ralph-loop.sh` to start autonomous building
5. Monitor progress; intervene if agent gets stuck
### Detailed Workflow
1. **Preparation:**
- Choose project directory
- Copy all 4 template files
- Customize PROJECT-SPEC.md with your requirements
- Ensure build/test commands work
2. **Planning Phase:**
- Run `./ralph-loop.sh plan`
- Agent reads spec and creates task decomposition
- Review IMPLEMENTATION_PLAN.md for completeness
3. **Build Iterations:**
- Run `./ralph-loop.sh --max 50` (or your preferred limit)
- Each iteration: fresh agent → one task → verify → commit
- Loop continues until DONE or max iterations
4. **Monitoring:**
- Check `.ralph-logs/` for iteration details
- Look for STUCK/ERROR signals requiring intervention
- Review git log for progress
5. **Intervention:**
- If stuck: update IMPLEMENTATION_PLAN.md with notes
- If error: fix the issue and restart loop
- If plan needs changes: edit and restart
### Configuration Options
- `--max N`: Limit iterations (default 50)
- `--agent claude|codex`: Choose AI agent
- `plan` mode: Just create implementation plan
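For reference, parsing those options could look like this sketch (the real script's argument handling may differ):

```shell
# Parse ralph-loop.sh style arguments into MODE/MAX/AGENT globals.
parse_args() {
  MODE=build; MAX=50; AGENT=claude   # defaults per the spec
  while [ "$#" -gt 0 ]; do
    case "$1" in
      plan)    MODE=plan ;;
      --max)   MAX="$2"; shift ;;
      --agent) AGENT="$2"; shift ;;
      *)       echo "unknown option: $1" >&2; return 1 ;;
    esac
    shift
  done
}

# Usage: parse_args "$@"
```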
## Examples and Use Cases
### Personal Finance App (FinPlan)
Complete example in EXAMPLES.md showing:
- Privacy-first local finance dashboard
- Transaction import, categorization, projections
- Monte Carlo retirement simulations
- Tech stack: TypeScript, Express, SQLite, vanilla JS
- 15+ features decomposed into phases
### Key Patterns from Examples
- **Be Specific:** Acceptance criteria like "Parse QFX files and extract: date, amount, payee, memo, type"
- **Define Tech Stack:** Don't let agents choose — specify "TypeScript, Express.js, SQLite"
- **Include Data Models:** Explicit entity definitions with constraints
- **Phase Large Projects:** Independent deployable phases
- **Anti-Patterns:** "Don't use localStorage — SQLite is source of truth"
### Use Cases
- **Complex Web Apps:** Multi-feature applications with databases
- **Libraries/Frameworks:** API design and implementation
- **Data Processing:** ETL pipelines, analysis tools
- **CLI Tools:** Command-line utilities with multiple commands
- **Prototypes to Production:** Start with working prototype, iterate to full product
## Integration with OpenClaw sessions_spawn
OpenClaw provides `sessions_spawn` for agent orchestration, offering an alternative to the bash loop.
### Basic Usage
```bash
# Planning phase
sessions_spawn --task "Read PROJECT-SPEC.md. Decompose into tasks. Write IMPLEMENTATION_PLAN.md." --model opus
# Build iterations
sessions_spawn --task "Read AGENT.md. Follow core loop: pick one task, implement, test, commit." --model sonnet
```
### Advanced Integration
- **Parallel Tasks:** Spawn multiple agents for independent tasks
- **Different Models:** Use opus for planning, sonnet for coding
- **Cron Scheduling:** Automate iterations with cron jobs
- **Channel Output:** Direct results to specific channels
### Benefits Over Bash Loop
- Model selection per task type
- Parallel execution for independent work
- Integration with OpenClaw's session management
- Richer output formatting and notifications
### When to Use Each
- **Ralph Loop:** Simple sequential projects, bash environments
- **OpenClaw:** Complex projects, parallel work, advanced features
## Best Practices for Agent-Driven Development
### Writing Project Specs
1. **Be Exhaustively Specific:** Include exact acceptance criteria, not vague requirements
2. **Define Everything:** Tech stack, directory structure, build commands, coding standards
3. **Provide Examples:** Sample data, API responses, UI mockups
4. **Phase Appropriately:** Break large projects into independent phases
5. **Document Constraints:** What the system MUST and MUST NOT do, plus preferences
6. **Include Anti-Patterns:** Lessons from previous attempts
### Agent Instructions
1. **Role Definition:** Clear capabilities and limitations
2. **Strict Rules:** One task per iteration, mandatory testing, no refactoring unrelated code
3. **Clear Signals:** Use `<promise>` tags for loop control
4. **Context Boundaries:** Fresh start each time, rely on files/git
### Loop Management
1. **Monitor Logs:** Check `.ralph-logs/` for issues
2. **Set Reasonable Limits:** --max 20-50 iterations depending on project size
3. **Plan Reviews:** Always review IMPLEMENTATION_PLAN.md after planning phase
4. **Intervention Ready:** Be prepared to help when agents get stuck
### Quality Assurance
1. **Test Everything:** Unit, integration, end-to-end tests
2. **Build Verification:** Every iteration must pass build
3. **Code Standards:** Lint, format, document consistently
4. **Manual Reviews:** Spot-check critical functionality
### Scaling Up
1. **Phase Work:** Complete foundations before features
2. **Parallel Execution:** Use OpenClaw for independent tasks
3. **Iterative Refinement:** Start with working prototype, enhance gradually
4. **Documentation Updates:** Improve templates based on lessons learned
### Common Pitfalls
- **Vague Specs:** Leads to agent confusion and poor decomposition
- **Missing Build/Test:** Code quality suffers without verification
- **Context Sharing:** Don't try to pass state between iterations
- **Over-Parallelization:** Dependencies must be respected
- **Ignoring Signals:** STUCK/ERROR states need attention
This system transforms AI coding assistants from helpful sidekicks into autonomous development partners capable of delivering complete, tested software projects.