docs(runbook): add agent harness failure modes and recovery guide

2026-03-24 22:39:05 -04:00 · 2026-03-24 22:39:05 -04:00 · 3e269a4d4c
parent d4aed475a2
commit 3e269a4d4c
1 changed files with 186 additions and 0 deletions
--- a/RUNBOOK.md
+++ b/RUNBOOK.md
@@ -0,0 +1,186 @@
+# Recipe Manager Agentic Runbook
+
+Last updated: 2026-03-24
+
+## Purpose
+Operational guide for running the Recipe Manager agent harness reliably.
+
+---
+
+## Core Execution Model
+
+- One task per iteration
+- One commit per iteration
+- TODO.md is the authoritative queue
+- Work only in:
+  `/home/paulh/.openclaw/workspace/projects/recipe-manager`
+
+---
+
+## Required Guards (Must Pass Before Coding)
+
+### Pre-flight checks
+Before any iteration starts, verify these files exist:
+- `AGENT_INSTRUCTIONS.md`
+- `TODO.md`
+
+If either file is missing, fail with:
+`STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
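The pre-flight check above could be sketched as a small shell function. This is only an illustration: `preflight_check` is a hypothetical helper name, and the runbook does not say how the real harness implements the guard.

```shell
# Pre-flight guard sketch: confirm the harness files exist before an iteration starts.
# PROJECT_ROOT is the path given in this runbook; preflight_check is a hypothetical name.
PROJECT_ROOT="/home/paulh/.openclaw/workspace/projects/recipe-manager"

preflight_check() {
  local root="${1:-$PROJECT_ROOT}"
  local f
  for f in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$root/$f" ]; then
      # Same failure message the runbook requires.
      echo "STUCK: bad working dir or missing harness files at $root"
      return 1
    fi
  done
  return 0
}
```

An iteration script might call `preflight_check || exit 1` as its first step, so that no coding starts from the wrong directory.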
+
+---
+
+## Monitoring Signals (How we know it's working)
+
+A run is healthy only when all three of the following are true:
+1. An active session matching `recipe-v1-iter*` was updated recently
+2. New git commits are landing
+3. TODO checkboxes advance
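Signals 2 and 3 can be checked from the repository alone. The sketch below assumes `TODO.md` uses `- [ ]` and `- [x]` checkboxes at the start of a line; the helper names and the default "2 hours ago" cutoff are hypothetical. Signal 1 is left out because the runbook does not say where session state is stored.

```shell
# Hypothetical helpers for health signals 2 (commits landing) and 3 (checkboxes advancing).

# Print the number of commits reachable from HEAD that are newer than a cutoff.
# The default cutoff is an assumed value, not one taken from the runbook.
recent_commit_count() {
  local repo="${1:-.}"
  local since="${2:-2 hours ago}"
  git -C "$repo" rev-list --count --since="$since" HEAD
}

# Print checked and unchecked checkbox counts for a TODO file.
todo_progress() {
  local todo="${1:-TODO.md}"
  local checked unchecked
  # grep -c exits non-zero when the count is 0, so the status is ignored on purpose.
  checked="$(grep -c '^- \[[xX]\]' "$todo" || true)"
  unchecked="$(grep -c '^- \[ \]' "$todo" || true)"
  echo "checked=$checked unchecked=$unchecked"
}
```

Comparing the `todo_progress` output between two monitor runs shows whether checkboxes are actually advancing.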
+
+---
+
+## Known Failure Modes and Fixes
+
+## 1) Wrong working directory
+### Symptom
+The agent reports that `AGENT_INSTRUCTIONS.md` or `TODO.md` is missing in `/workspace`.
+
+### Root cause
+The spawner started the agent outside the project root.
+
+### Fix
+- Force absolute project path in every task prompt
+- Add mandatory pre-flight guard
+- Relaunch fresh iteration
+
+---
+
+## 2) False “iteration already running”
+### Symptom
+Auto-iterator repeatedly prints SKIP even when no coding progress occurs.
+
+### Root cause
+The auto-iterator treated stale historical sessions as active.
+
+### Fix
+- Treat a session as active only if updated recently (freshness window)
+- Use current phase labels only (`recipe-v1-iter*`)
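The freshness-window rule in this fix can be expressed as a simple timestamp comparison. This is a sketch under assumptions: `is_fresh` is a hypothetical helper, the 1800-second default is an illustrative value (the runbook does not state the window), and how the last-updated timestamp is obtained for a session is not specified here.

```shell
# is_fresh UPDATED_EPOCH [WINDOW_SECONDS] [NOW_EPOCH]
# Succeeds only if the session was updated within the freshness window.
# The 1800-second default is an assumed value for illustration.
is_fresh() {
  local updated="$1"
  local window="${2:-1800}"
  local now="${3:-$(date +%s)}"
  [ $(( now - updated )) -le "$window" ]
}
```

An iterator could then print SKIP only when `is_fresh "$last_update"` succeeds for a session with the current phase label, and spawn a new iteration otherwise.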
+
+---
+
+## 3) Label mismatch across phases
+### Symptom
+Monitor reports wrong status or misses active runs.
+
+### Root cause
+MVP labels (`recipe-mvp-*`) used during v1 phase.
+
+### Fix
+- Update monitor + iterator to phase-specific labels
+- Standardize naming per phase:
+  - MVP: `recipe-mvp-iter*`
+  - v1: `recipe-v1-iter*`
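One way to keep the monitor and iterator in agreement is to derive the label pattern from the phase name in a single place. The function names below are hypothetical; only the two label patterns come from this runbook.

```shell
# Map a phase name to its session label pattern (patterns taken from the runbook).
label_glob_for_phase() {
  case "$1" in
    mvp) echo 'recipe-mvp-iter*' ;;
    v1)  echo 'recipe-v1-iter*' ;;
    *)   echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Succeed only if a session label belongs to the given phase.
label_matches_phase() {
  local label="$1" phase="$2" glob
  glob="$(label_glob_for_phase "$phase")" || return 1
  # An unquoted variable in a case pattern is matched as a glob.
  case "$label" in
    $glob) return 0 ;;
    *)     return 1 ;;
  esac
}
```

With this, `label_matches_phase recipe-mvp-iter3 v1` fails, so an MVP session is never counted as an active v1 run.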
+
+---
+
+## 4) Model/provider auth mismatch
+### Symptom
+Cron jobs fail with:
+- `No API key found for provider openai`
+- or Copilot cooldown rate-limit errors
+
+### Root cause
+The job uses an `openai/...` model, but no OpenAI API key is configured.
+
+### Fix
+- Use the OAuth provider model prefix instead: `openai-codex/...`
+- For this project, prefer:
+  `openai-codex/gpt-5.3-codex`
+
+---
+
+## 5) Environment capability mismatch (Docker)
+### Symptom
+Task fails with `docker: command not found`.
+
+### Root cause
+Agent runtime host lacks Docker.
+
+### Fix
+- Mark the task as a manual validation task to be run on a host with Docker
+- Continue with unblocked tasks
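A task that needs Docker could detect the missing capability up front instead of failing midway. The sketch below uses the standard `command -v` lookup; `require_tool` and the `BLOCKED:` message are hypothetical and not part of the harness.

```shell
# Succeed if a tool is on PATH; otherwise print a note and fail.
# The BLOCKED message format is an assumption for illustration.
require_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    return 0
  fi
  echo "BLOCKED: $tool not available on this host; needs manual host validation"
  return 1
}
```

For example, `require_tool docker || echo "skipping Docker task"` lets the iteration move on to an unblocked task.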
+
+---
+
+## 6) Runtime module mismatch (ESM/CommonJS)
+### Symptom
+Backend runtime error: `require is not defined`.
+
+### Root cause
+Using `require()` in ESM code path.
+
+### Fix
+- Replace `require('fs')` calls with ESM imports, e.g. `import { writeFileSync } from 'fs'`
+- Rebuild and restart the server
+
+---
+
+## Operational Controls
+
+### Pause automation
+Disable both jobs:
+- Recipe Manager Auto-Iterator
+- Recipe Manager Progress Monitor
+
+### Resume automation
+Enable both jobs, then manually kick one fresh iteration.
+
+### Manual override iteration (safe restart)
+Spawn one explicit iteration with:
+- absolute project path
+- pre-flight guard
+- one-task/one-commit rule
+
+---
+
+## Completion Definition
+
+A phase is complete when:
+1. No unchecked tasks remain in that phase section of TODO.md
+2. Latest iteration exits without STUCK/ERROR
+3. The commit and the matching TODO.md update are both present
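Condition 1 can be checked mechanically if each phase has its own section in `TODO.md`. The sketch below assumes phase sections start with a `## ` heading that contains the phase name; that layout and the `unchecked_in_phase` helper are assumptions, since the runbook does not show the structure of `TODO.md`.

```shell
# Print the number of unchecked tasks inside one phase section of a TODO file.
# Assumes each phase section begins with a "## " heading containing the phase name.
unchecked_in_phase() {
  local phase="$1" todo="${2:-TODO.md}"
  awk -v phase="$phase" '
    /^## / { in_phase = (index($0, phase) > 0) }   # entering a new section
    in_phase && /^- \[ \]/ { n++ }                 # unchecked box in the phase
    END { print n + 0 }
  ' "$todo"
}
```

A result of `0` for the active phase satisfies the first condition; the other two still need to be confirmed from the iteration output and `git log`.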
+
+---
+
+## Recommended Cadence
+
+- Auto-iterator: every 15 minutes
+- Progress monitor: every 5 minutes (high visibility mode)
+
+If the monitor is too noisy, run it every 10–15 minutes instead.
+
+---
+
+## Handoff Checklist (Before ending a session)
+
+- [ ] Confirm latest commit hash
+- [ ] Confirm active phase + next unchecked task
+- [ ] Confirm auto-iterator enabled/disabled status
+- [ ] Confirm monitor enabled/disabled status
+- [ ] Confirm no stale active-session false positives
+
+---
+
+## Quick Status Commands
+
+### Latest commit
+`git log -1 --oneline`
+
+### Next tasks
+`grep -n "^- \[ \]" TODO.md | head`
+
+### Recent progress
+`git log --oneline -5`
+
+---
+
+This runbook should be updated whenever a new failure mode appears.