# Recipe Manager Agentic Runbook
Last updated: 2026-03-24
## Purpose
Operational guide for running the Recipe Manager agent harness reliably.
---
## Core Execution Model
- One task per iteration
- One commit per iteration
- TODO.md is the authoritative queue
- Work only in:
`/home/paulh/.openclaw/workspace/projects/recipe-manager`
---
## Required Guards (Must Pass Before Coding)
### Pre-flight checks
Before any iteration starts, verify these files exist:
- `AGENT_INSTRUCTIONS.md`
- `TODO.md`
If missing, fail with:
`STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
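
The guard above could be scripted roughly as follows. This is a minimal sketch, not the harness's actual check: the function name `preflight_check` and the `PROJECT_ROOT` variable are hypothetical, while the file names and the `STUCK` message come from this runbook.

```bash
# Pre-flight guard sketch: verify the harness files exist before any coding starts.
# PROJECT_ROOT defaults to the project path used throughout this runbook.
PROJECT_ROOT="${PROJECT_ROOT:-/home/paulh/.openclaw/workspace/projects/recipe-manager}"

# Returns 0 when both harness files exist under the given root (default: PROJECT_ROOT).
# Otherwise prints the STUCK message and returns 1.
preflight_check() {
  local root="${1:-$PROJECT_ROOT}"
  local f
  for f in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$root/$f" ]; then
      echo "STUCK: bad working dir or missing harness files at $root"
      return 1
    fi
  done
  return 0
}

# Example use at the start of an iteration:
#   preflight_check || exit 1
#   cd "$PROJECT_ROOT"
```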
---
## Monitoring Signals (How we know it's working)
A run is healthy only when all three of the following are true:
1. Active session updated recently (`recipe-v1-iter*`)
2. New git commits are landing
3. TODO checkboxes advance
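
Signals 2 and 3 can be approximated from the shell. The sketch below is illustrative only: `todo_counts` and `commit_age_seconds` are hypothetical helper names, and signal 1 is left out because this runbook does not document how sessions are listed.

```bash
# Print "<checked> <unchecked>" checkbox counts for a TODO file (default: TODO.md).
# Comparing the output between two points in time shows whether checkboxes advance.
todo_counts() {
  local file="${1:-TODO.md}"
  local checked unchecked
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  checked="$(grep -c -i '^[[:space:]]*- \[x\]' "$file")" || true
  unchecked="$(grep -c '^[[:space:]]*- \[ \]' "$file")" || true
  echo "$checked $unchecked"
}

# Print the age of the latest commit in seconds (run inside the git work tree).
commit_age_seconds() {
  local last now
  last="$(git log -1 --format=%ct)" || return 1
  now="$(date +%s)"
  echo $(( now - last ))
}

# Example:
#   todo_counts TODO.md
#   commit_age_seconds
```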
---
## Known Failure Modes and Fixes
## 1) Wrong working directory
### Symptom
Agent reports that `AGENT_INSTRUCTIONS.md` or `TODO.md` is missing in `/workspace`.
### Root cause
Spawner started outside project root.
### Fix
- Force absolute project path in every task prompt
- Add mandatory pre-flight guard
- Relaunch fresh iteration
---
## 2) False “iteration already running”
### Symptom
Auto-iterator repeatedly prints SKIP even when no coding progress occurs.
### Root cause
The auto-iterator treated stale historical sessions as active.
### Fix
- Treat a session as active only if updated recently (freshness window)
- Use current phase labels only (`recipe-v1-iter*`)
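
A freshness check combined with the label filter might look like the sketch below. The 900-second window is a placeholder, since this runbook does not state the real value, and `session_is_active`, `FRESHNESS_WINDOW_SECONDS`, and `PHASE_LABEL_PREFIX` are hypothetical names. How the label and last-update time are obtained from the session store is also left as an assumption.

```bash
# Illustrative defaults; the actual freshness window is not specified in this runbook.
FRESHNESS_WINDOW_SECONDS="${FRESHNESS_WINDOW_SECONDS:-900}"
PHASE_LABEL_PREFIX="${PHASE_LABEL_PREFIX:-recipe-v1-iter}"

# Arguments: label, last-update time (epoch seconds), optional "now" (epoch seconds).
# Returns 0 only if the label belongs to the current phase AND the session was
# updated within the freshness window.
session_is_active() {
  local label="$1" updated="$2" now="${3:-$(date +%s)}"
  case "$label" in
    "$PHASE_LABEL_PREFIX"*) ;;   # current-phase label, keep checking
    *) return 1 ;;               # stale or other-phase label
  esac
  [ $(( now - updated )) -le "$FRESHNESS_WINDOW_SECONDS" ]
}

# Example:
#   session_is_active "recipe-v1-iter12" "$last_update_epoch" && echo "SKIP"
```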
---
## 3) Label mismatch across phases
### Symptom
Monitor reports wrong status or misses active runs.
### Root cause
MVP labels (`recipe-mvp-*`) used during v1 phase.
### Fix
- Update monitor + iterator to phase-specific labels
- Standardize naming per phase:
- MVP: `recipe-mvp-iter*`
- v1: `recipe-v1-iter*`
---
## 4) Model/provider auth mismatch
### Symptom
Cron jobs fail with:
- `No API key found for provider openai`
- or Copilot cooldown rate-limit errors
### Root cause
Using `openai/...` models without an OpenAI API key.
### Fix
- Use OAuth provider model prefix: `openai-codex/...`
- For this project, prefer:
`openai-codex/gpt-5.3-codex`
---
## 5) Environment capability mismatch (Docker)
### Symptom
Task fails with `docker: command not found`.
### Root cause
Agent runtime host lacks Docker.
### Fix
- Mark as manual host validation task
- Continue with unblocked tasks
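
One way to detect this before a task starts is to probe for the tool and route the task accordingly. This is a sketch under assumptions: `have_command` and `classify_task` are hypothetical helpers, and the two output labels are illustrative rather than values the harness is known to use.

```bash
# Return 0 if the named command is available on this host, 1 otherwise.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Print "run" when the required tool is present, otherwise
# "manual-host-validation" so the task can be deferred to a human.
classify_task() {
  local required_tool="$1"
  if have_command "$required_tool"; then
    echo "run"
  else
    echo "manual-host-validation"
  fi
}

# Example:
#   classify_task docker
```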
---
## 6) Runtime module mismatch (ESM/CommonJS)
### Symptom
Backend runtime error: `require is not defined`.
### Root cause
Using `require()` in ESM code path.
### Fix
- Replace `require('fs')` calls with ESM imports (for example, `import { writeFileSync } from 'fs'`)
- Build + rerun server
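
To find remaining `require()` calls before rebuilding, a simple text search may be enough. The helper name `find_require_calls` and the default `src` directory are assumptions, and the match is purely textual, so it can also flag comments or strings.

```bash
# List files under a directory that still contain "require(".
# Prints one path per line; prints nothing when no file matches.
find_require_calls() {
  local dir="${1:-src}"
  grep -rlE --include='*.js' --include='*.mjs' --include='*.ts' 'require\(' "$dir" || true
}

# Example:
#   find_require_calls src
```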
---
## Operational Controls
## Pause automation
Disable both jobs:
- Recipe Manager Auto-Iterator
- Recipe Manager Progress Monitor
## Resume automation
Enable both jobs, then manually kick one fresh iteration.
## Manual override iteration (safe restart)
Spawn one explicit iteration with:
- absolute project path
- pre-flight guard
- one-task/one-commit rule
---
## Workflow Periodic Execution (cron + systemd)
All commands assume project root:
`/home/paulh/.openclaw/workspace/projects/recipe-manager`
### Manual commands
```bash
# Resume from checkpoint (default mode)
npm run workflow:run
# Force restart from stage 1
npm run workflow:run -- --mode restart
# Scheduled run entrypoint (resume + morning report)
npm run workflow:schedule
# Health signal for automation (0=healthy, 1=failed/blocked/unknown)
npm run workflow:health-check
```
### Cron example
Run scheduler every 15 minutes, health check every 5 minutes:
```cron
*/15 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:schedule >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-schedule.log 2>&1
*/5 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:health-check >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-health.log 2>&1
```
### systemd example
Create one-shot services and timers:
`/etc/systemd/system/recipe-workflow-schedule.service`
```ini
[Unit]
Description=Recipe Manager scheduled workflow run
After=network.target
[Service]
Type=oneshot
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:schedule
```
`/etc/systemd/system/recipe-workflow-schedule.timer`
```ini
[Unit]
Description=Run Recipe Manager scheduled workflow every 15 minutes
[Timer]
OnCalendar=*:0/15
Persistent=true
[Install]
WantedBy=timers.target
```
`/etc/systemd/system/recipe-workflow-health.service`
```ini
[Unit]
Description=Recipe Manager workflow health check
After=network.target
[Service]
Type=oneshot
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:health-check
```
`/etc/systemd/system/recipe-workflow-health.timer`
```ini
[Unit]
Description=Run Recipe Manager workflow health check every 5 minutes
[Timer]
OnCalendar=*:0/5
Persistent=true
[Install]
WantedBy=timers.target
```
Enable timers:
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now recipe-workflow-schedule.timer recipe-workflow-health.timer
```
### Troubleshooting failed/blocked status
When `npm run workflow:health-check` returns exit code `1` with `{"status":"failed"}` or `{"status":"blocked"}`:
1. Check current workflow status payload:
```bash
cat status/workflow-status.json
```
2. Check recent progress log entries:
```bash
tail -n 50 status/workflow-progress.jsonl
```
3. Retry from checkpoint:
```bash
npm run workflow:run
```
4. If still blocked/failed, force a clean restart:
```bash
npm run workflow:run -- --mode restart
```
5. Re-run health check and confirm healthy output (`idle`, `running`, or `completed`):
```bash
npm run workflow:health-check
```
If the status file is missing or malformed, the health check prints `status_read_failed` and exits `1`; regenerate state with `npm run workflow:run -- --mode restart`.
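
For quick triage without npm, the mapping from status to exit code described above can be mirrored in a few lines of shell. This is not the project's health-check implementation, only a sketch: `read_status` and `status_exit_code` are hypothetical names, and it assumes `status/workflow-status.json` carries a flat top-level `"status"` string.

```bash
# Map a status string to the exit code convention described in this runbook:
# 0 for idle/running/completed, 1 for anything else (failed, blocked, unknown).
status_exit_code() {
  case "$1" in
    idle|running|completed) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the "status" field of a JSON file, or status_read_failed when the file
# is missing or has no such field. Uses sed only, so no JSON tooling is needed.
read_status() {
  local file="${1:-status/workflow-status.json}" value
  value="$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file" 2>/dev/null | head -n 1)" || true
  echo "${value:-status_read_failed}"
}

# Example:
#   s="$(read_status)"; status_exit_code "$s" || echo "unhealthy: $s"
```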
---
## Completion Definition
A phase is complete when:
1. No unchecked tasks remain in that phase section of TODO.md
2. Latest iteration exits without STUCK/ERROR
3. Commit + TODO update are present
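
Condition 1 can be checked mechanically. The sketch below assumes each phase in TODO.md sits under a Markdown heading that contains the phase name and that the section ends at the next heading of any level; that layout and the helper name `unchecked_in_phase` are assumptions, not something this runbook specifies.

```bash
# Print the number of unchecked tasks ("- [ ]") in one phase section.
# Arguments: phase name as it appears in the heading, optional TODO file path.
unchecked_in_phase() {
  local phase="$1" file="${2:-TODO.md}"
  awk -v phase="$phase" '
    # Any heading line switches the section on or off.
    /^#+ / { in_section = (index($0, phase) > 0); next }
    # Count unchecked boxes only while inside the requested section.
    in_section && /^[ \t]*- \[ \]/ { count++ }
    END { print count + 0 }
  ' "$file"
}

# Example: a result of 0 means condition 1 holds for that phase.
#   unchecked_in_phase "v1" TODO.md
```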
---
## Recommended Cadence
- Auto-iterator: every 15 minutes
- Progress monitor: every 5 minutes (high visibility mode)
If noisy, set monitor to every 10-15 minutes.
---
## Handoff Checklist (Before ending a session)
- [ ] Confirm latest commit hash
- [ ] Confirm active phase + next unchecked task
- [ ] Confirm auto-iterator enabled/disabled status
- [ ] Confirm monitor enabled/disabled status
- [ ] Confirm no stale active-session false positives
---
## Quick Status Commands
### Latest commit
`git log -1 --oneline`
### Next tasks
`grep -n "^- \[ \]" TODO.md | head`
### Recent progress
`git log --oneline -5`
---
This runbook should be updated whenever a new failure mode appears.
See also: `INCIDENT_LOG.md` for timestamped operational incidents and fixes.