docs(runbook): add agent harness failure modes and recovery guide

2026-03-24 22:39:05 -04:00 · 3e269a4d4c
parent d4aed475a2
commit 3e269a4d4c
1 changed file with 186 additions and 0 deletions
--- a/RUNBOOK.md
+++ b/RUNBOOK.md
@@ -0,0 +1,186 @@
 # Recipe Manager Agentic Runbook
 Last updated: 2026-03-24
 ## Purpose
 Operational guide for running the Recipe Manager agent harness reliably.
 ---
 ## Core Execution Model
 - One task per iteration
 - One commit per iteration
 - TODO.md is the authoritative queue
 - Work only in:
  `/home/paulh/.openclaw/workspace/projects/recipe-manager`
 ---
 ## Required Guards (Must Pass Before Coding)
 ### Pre-flight checks
 Before any iteration starts, verify these files exist:
 - `AGENT_INSTRUCTIONS.md`
 - `TODO.md`
 If missing, fail with:
 `STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
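A minimal sketch of the pre-flight guard described above, assuming it runs as a shell step before the agent starts coding. The function name `preflight` is hypothetical; the required file names and the `STUCK` message come from this runbook.

```shell
# Pre-flight guard sketch. `preflight` is a hypothetical helper name.
# Returns 0 when both harness files exist in the given directory,
# otherwise prints the STUCK message to stderr and returns 1.
preflight() {
  local project_dir="$1"
  local required
  for required in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$project_dir/$required" ]; then
      echo "STUCK: bad working dir or missing harness files at $project_dir" >&2
      return 1
    fi
  done
  return 0
}
```

It could be called as `preflight /home/paulh/.openclaw/workspace/projects/recipe-manager || exit 1` at the top of an iteration.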
 ---
 ## Monitoring Signals (How we know it's working)
 A run is healthy only when all three of the following are true:
 1. Active session updated recently (`recipe-v1-iter*`)
 2. New git commits are landing
 3. TODO checkboxes advance
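One way to observe the third signal is to record checkbox counts on each monitor run and compare them with the previous run. The sketch below assumes completed tasks are written as `- [x]` at the start of a line (the unchecked form `- [ ]` matches the grep used under Quick Status Commands); `todo_counts` is a hypothetical helper name.

```shell
# Count checked and unchecked task boxes in a TODO file.
# Assumption: tasks are top-level list items of the form "- [x] ..." or "- [ ] ...".
todo_counts() {
  local todo_file="$1"
  local done_count open_count
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  done_count="$(grep -c '^- \[[xX]\]' "$todo_file" || true)"
  open_count="$(grep -c '^- \[ \]' "$todo_file" || true)"
  echo "done=$done_count open=$open_count"
}
```

If `done` has not increased across several monitor runs while a session still looks active, the run is probably not healthy.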
 ---
 ## Known Failure Modes and Fixes
 ## 1) Wrong working directory
 ### Symptom
 The agent reports that `AGENT_INSTRUCTIONS.md` or `TODO.md` is missing in `/workspace`.
 ### Root cause
 The spawner started the agent outside the project root.
 ### Fix
 - Force absolute project path in every task prompt
 - Add mandatory pre-flight guard
 - Relaunch fresh iteration
 ---
 ## 2) False “iteration already running”
 ### Symptom
 The auto-iterator repeatedly prints SKIP even though no coding progress is occurring.
 ### Root cause
 The iterator treated stale historical sessions as active.
 ### Fix
 - Treat a session as active only if updated recently (freshness window)
 - Use current phase labels only (`recipe-v1-iter*`)
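The freshness rule can be sketched as a plain time comparison. This assumes the session's last-updated time is available as epoch seconds; how to obtain it is harness-specific and not shown here. `is_fresh` and the window value in the usage note are illustrative, not values taken from the harness.

```shell
# Freshness test sketch: a session counts as active only if it was
# updated within the last <window> seconds.
# Arguments: last-updated time, current time (both epoch seconds), window (seconds).
is_fresh() {
  local last_updated="$1" now="$2" window="$3"
  [ $(( now - last_updated )) -le "$window" ]
}
```

For example, `is_fresh "$updated_at" "$(date +%s)" 900 && echo SKIP` would skip only when the session was touched in the last 15 minutes; pick a window that suits the iterator cadence.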
 ---
 ## 3) Label mismatch across phases
 ### Symptom
 Monitor reports wrong status or misses active runs.
 ### Root cause
 The monitor and iterator were still matching MVP labels (`recipe-mvp-*`) during the v1 phase.
 ### Fix
 - Update monitor + iterator to phase-specific labels
 - Standardize naming per phase:
  - MVP: `recipe-mvp-iter*`
  - v1: `recipe-v1-iter*`
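A small sketch of phase-specific label matching, using the two label patterns listed above. The function name `label_matches_phase` and the phase keys `mvp` and `v1` are assumptions made for illustration.

```shell
# Return 0 only when a session label belongs to the given phase.
# Patterns come from the naming scheme above; the phase keys are hypothetical.
label_matches_phase() {
  local phase="$1" label="$2"
  case "$phase:$label" in
    mvp:recipe-mvp-iter*|v1:recipe-v1-iter*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Keeping the patterns in one function means the monitor and the iterator cannot drift apart when the phase changes.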
 ---
 ## 4) Model/provider auth mismatch
 ### Symptom
 Cron jobs fail with:
 - `No API key found for provider openai`
 - or Copilot cooldown rate-limit errors
 ### Root cause
 The job is configured with an `openai/...` model, but no OpenAI API key is available.
 ### Fix
 - Use the OAuth provider model prefix instead: `openai-codex/...`
 - For this project, prefer:
  `openai-codex/gpt-5.3-codex`
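A guard like the following could catch this mismatch before a cron job runs. It is only a sketch: it assumes the API key would be supplied through an `OPENAI_API_KEY` environment variable, which may not be how the harness actually stores credentials, and `check_model_auth` is a hypothetical name.

```shell
# Reject openai/ models when no API key is present in the environment.
# Assumption: the key, if any, lives in OPENAI_API_KEY.
check_model_auth() {
  local model="$1"
  case "$model" in
    openai/*)
      if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "model $model needs an OpenAI API key; use an openai-codex/ model instead" >&2
        return 1
      fi
      ;;
  esac
  return 0
}
```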
 ---
 ## 5) Environment capability mismatch (Docker)
 ### Symptom
 Task fails with `docker: command not found`.
 ### Root cause
 The agent runtime host does not have Docker installed.
 ### Fix
 - Mark the task for manual validation on a host that has Docker
 - Continue with unblocked tasks
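A capability check along these lines lets a task detect the missing tool up front instead of failing partway through. `has_command` is a hypothetical helper; `command -v` is the standard shell builtin for this lookup.

```shell
# Return 0 if the named executable is on PATH, non-zero otherwise.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

if has_command docker; then
  echo "docker available: run the task"
else
  echo "docker missing: mark task for manual host validation and continue"
fi
```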
 ---
 ## 6) Runtime module mismatch (ESM/CommonJS)
 ### Symptom
 Backend runtime error: `require is not defined`.
 ### Root cause
 `require()` is called in an ESM code path, where it is not defined.
 ### Fix
 - Replace `require('fs')` calls with ESM named imports, e.g. `import { writeFileSync } from 'fs'`
 - Rebuild and rerun the server
 ---
 ## Operational Controls
 ### Pause automation
 Disable both jobs:
 - Recipe Manager Auto-Iterator
 - Recipe Manager Progress Monitor
 ### Resume automation
 Enable both jobs, then manually kick off one fresh iteration.
 ### Manual override iteration (safe restart)
 Spawn one explicit iteration with:
 - the absolute project path
 - the pre-flight guard
 - the one-task/one-commit rule
 ---
 ## Completion Definition
 A phase is complete when:
 1. No unchecked tasks remain in that phase section of TODO.md
 2. The latest iteration exits without `STUCK` or `ERROR`
 3. Both the commit and the matching TODO update are present
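The first condition can be checked mechanically. The sketch below assumes each phase in TODO.md starts at a `## ` heading whose text contains the phase name, which may not match the real file layout; `open_tasks_in_phase` is a hypothetical helper. It does not cover conditions 2 and 3.

```shell
# Print the number of unchecked tasks under the "## " heading that
# contains the given phase name.
# Assumption: phase sections are delimited by level-2 headings.
open_tasks_in_phase() {
  local todo_file="$1" phase="$2"
  awk -v phase="$phase" '
    /^## / { in_phase = (index($0, phase) > 0) }   # entering a new section
    in_phase && /^- \[ \]/ { count++ }             # unchecked task in the phase
    END { print count + 0 }
  ' "$todo_file"
}
```

A result of `0` for the active phase means no unchecked tasks remain in that section.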
 ---
 ## Recommended Cadence
 - Auto-iterator: every 15 minutes
 - Progress monitor: every 5 minutes (high visibility mode)
 If the monitor output is too noisy, reduce it to every 10–15 minutes.
 ---
 ## Handoff Checklist (Before ending a session)
 - [ ] Confirm latest commit hash
 - [ ] Confirm active phase + next unchecked task
 - [ ] Confirm auto-iterator enabled/disabled status
 - [ ] Confirm monitor enabled/disabled status
 - [ ] Confirm no stale active-session false positives
 ---
 ## Quick Status Commands
 ### Latest commit
 `git log -1 --oneline`
 ### Next tasks
 `grep -n "^- \[ \]" TODO.md | head`
 ### Recent progress
 `git log --oneline -5`
 ---
 This runbook should be updated whenever a new failure mode appears.