# Recipe Manager Agentic Runbook

Last updated: 2026-03-24

## Purpose
Operational guide for running the Recipe Manager agent harness reliably.

---

## Core Execution Model

- One task per iteration
- One commit per iteration
- TODO.md is the authoritative queue
- Work only in:
  `/home/paulh/.openclaw/workspace/projects/recipe-manager`

---

## Required Guards (Must Pass Before Coding)

### Pre-flight checks
Before any iteration starts, verify these files exist:
- `AGENT_INSTRUCTIONS.md`
- `TODO.md`

If either file is missing, fail with:
`STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
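
A minimal sketch of this guard, assuming a POSIX-style shell. The `preflight` function name and the `PROJECT_ROOT` variable are illustrative and not part of the harness; only the file names and the `STUCK:` message come from this runbook.

```bash
#!/usr/bin/env bash
# Pre-flight guard sketch: fail fast when the harness files are missing.
# PROJECT_ROOT defaults to the project path used throughout this runbook.
PROJECT_ROOT="${PROJECT_ROOT:-/home/paulh/.openclaw/workspace/projects/recipe-manager}"

preflight() {
  local root="$1" f
  for f in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$root/$f" ]; then
      echo "STUCK: bad working dir or missing harness files at $root"
      return 1
    fi
  done
  return 0
}

# Usage (hypothetical): preflight "$PROJECT_ROOT" || exit 1
```

Checking against an absolute path, rather than the current directory, keeps the guard meaningful even when the spawner starts somewhere else.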

---

## Monitoring Signals (How we know it's working)

A run is healthy only when all three of the following are true:
1. An active session matching `recipe-v1-iter*` was updated recently
2. New git commits are landing
3. `TODO.md` checkboxes are advancing
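
Signals 2 and 3 can be polled from a shell. The sketch below assumes it runs inside the project's git checkout and that checkboxes start at the beginning of a line; the function names are hypothetical. Signal 1 depends on how the harness stores sessions and is not covered here.

```bash
#!/usr/bin/env bash
# Signal 2: seconds since the most recent commit (requires git).
last_commit_age() {
  local now last
  now=$(date +%s)
  last=$(git log -1 --format=%ct) || return 1
  echo $((now - last))
}

# Signal 3: count checked and unchecked boxes in a TODO file.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the call safe
# under "set -e" while still printing 0.
checked_count() {
  grep -c '^- \[[xX]\]' "$1" || true
}

unchecked_count() {
  grep -c '^- \[ \]' "$1" || true
}
```

Comparing `checked_count TODO.md` between two polls shows whether checkboxes are advancing; a growing `last_commit_age` suggests commits have stopped landing.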

---

## Known Failure Modes and Fixes

## 1) Wrong working directory
### Symptom
The agent reports that `AGENT_INSTRUCTIONS.md` or `TODO.md` is missing in `/workspace`.

### Root cause
Spawner started outside project root.

### Fix
- Force absolute project path in every task prompt
- Add mandatory pre-flight guard
- Relaunch fresh iteration

---

## 2) False “iteration already running”
### Symptom
Auto-iterator repeatedly prints SKIP even when no coding progress occurs.

### Root cause
The auto-iterator treated stale historical sessions as active.

### Fix
- Treat a session as active only if updated recently (freshness window)
- Use current phase labels only (`recipe-v1-iter*`)
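
The freshness rule can be sketched as a small predicate. This assumes the caller can obtain each session's label and last-update time in epoch seconds; how sessions are listed is harness-specific and not shown. The `is_active_session` name and the 1800-second default window are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical default: sessions older than 30 minutes count as stale.
FRESHNESS_WINDOW_SECONDS="${FRESHNESS_WINDOW_SECONDS:-1800}"

# is_active_session LABEL UPDATED_EPOCH [NOW_EPOCH]
# Succeeds only for current-phase labels updated inside the freshness window.
is_active_session() {
  local label="$1" updated="$2" now="${3:-$(date +%s)}"
  case "$label" in
    recipe-v1-iter*) ;;      # current phase label
    *) return 1 ;;           # ignore sessions from other phases
  esac
  [ $((now - updated)) -le "$FRESHNESS_WINDOW_SECONDS" ]
}
```

The iterator would then print SKIP only when at least one session passes this check.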

---

## 3) Label mismatch across phases
### Symptom
Monitor reports wrong status or misses active runs.

### Root cause
MVP labels (`recipe-mvp-*`) were still in use during the v1 phase.

### Fix
- Update monitor + iterator to phase-specific labels
- Standardize naming per phase:
  - MVP: `recipe-mvp-iter*`
  - v1: `recipe-v1-iter*`
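
One way to keep the monitor and the iterator in agreement is to derive the label pattern from a single phase name. This is a sketch only; the function names and the `mvp`/`v1` phase keys are illustrative.

```bash
#!/usr/bin/env bash
# Map a phase name to its session label glob.
label_glob_for_phase() {
  case "$1" in
    mvp) echo 'recipe-mvp-iter*' ;;
    v1)  echo 'recipe-v1-iter*' ;;
    *)   echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# label_matches_phase LABEL PHASE
label_matches_phase() {
  local label="$1" glob
  glob=$(label_glob_for_phase "$2") || return 1
  # $glob is left unquoted so that case treats it as a pattern.
  case "$label" in
    $glob) return 0 ;;
    *)     return 1 ;;
  esac
}
```

With both scripts calling the same helper, moving to a new phase means changing one mapping rather than two hard-coded patterns.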

---

## 4) Model/provider auth mismatch
### Symptom
Cron jobs fail with:
- `No API key found for provider openai`
- or Copilot cooldown rate-limit errors

### Root cause
`openai/...` models were used without an OpenAI API key.

### Fix
- Use OAuth provider model prefix: `openai-codex/...`
- For this project, prefer:
  `openai-codex/gpt-5.3-codex`
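
A job could validate its model reference before starting. The sketch below assumes that an OpenAI key, when present, is exposed as the `OPENAI_API_KEY` environment variable; that variable and the `check_model_ref` helper are assumptions, not confirmed parts of the harness.

```bash
#!/usr/bin/env bash
# check_model_ref MODEL
# Rejects openai/... references when no API key is available.
check_model_ref() {
  local model="$1"
  case "$model" in
    openai/*)
      if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "model '$model' needs an OpenAI API key; use openai-codex/... instead" >&2
        return 1
      fi
      ;;
  esac
  return 0
}

# Usage (hypothetical): check_model_ref "openai-codex/gpt-5.3-codex" || exit 1
```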

---

## 5) Environment capability mismatch (Docker)
### Symptom
Task fails with `docker: command not found`.

### Root cause
Agent runtime host lacks Docker.

### Fix
- Mark as manual host validation task
- Continue with unblocked tasks
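
A capability check before the task runs avoids burning an iteration on a missing tool. In this sketch, `require_or_defer` and the `DEFER:` prefix are hypothetical; `command -v` is the standard shell way to test whether a command is on `PATH`.

```bash
#!/usr/bin/env bash
# require_or_defer COMMAND
# Succeeds if COMMAND is available; otherwise prints a deferral note.
require_or_defer() {
  local cmd="$1"
  if command -v "$cmd" >/dev/null 2>&1; then
    return 0
  fi
  echo "DEFER: $cmd not available on this host; mark the task for manual host validation"
  return 1
}

# Usage (hypothetical): require_or_defer docker || exit 0   # move on to unblocked tasks
```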

---

## 6) Runtime module mismatch (ESM/CommonJS)
### Symptom
Backend runtime error: `require is not defined`.

### Root cause
Using `require()` in ESM code path.

### Fix
- Replace `require('fs')` calls with ESM imports, for example `import { writeFileSync } from 'fs'`
- Build + rerun server
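
To confirm that no `require()` calls remain before rebuilding, a search like the following may help. The file extensions and the `find_require_calls` name are assumptions; point it at whichever directory holds the backend sources.

```bash
#!/usr/bin/env bash
# find_require_calls DIR
# Prints file:line:text for each require( call in JS/TS sources under DIR,
# skipping node_modules. /dev/null is passed as an extra file so grep always
# prefixes matches with the file name.
find_require_calls() {
  find "$1" -name node_modules -prune -o \
    -type f \( -name '*.js' -o -name '*.mjs' -o -name '*.ts' \) \
    -exec grep -n 'require(' /dev/null {} + || true   # no match is not an error
}
```

An empty result suggests the ESM code path no longer depends on `require`; comments or strings containing `require(` will still show up and need a manual look.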

---

## Operational Controls

### Pause automation
Disable both jobs:
- Recipe Manager Auto-Iterator
- Recipe Manager Progress Monitor

### Resume automation
Enable both jobs, then manually kick one fresh iteration.

### Manual override iteration (safe restart)
Spawn one explicit iteration with:
- absolute project path
- pre-flight guard
- one-task/one-commit rule
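
The three requirements can be baked into the task prompt itself. The sketch below only builds the prompt text; the wording and the `build_iteration_prompt` name are illustrative, and the command that actually spawns an iteration depends on the harness and is not shown.

```bash
#!/usr/bin/env bash
# build_iteration_prompt PROJECT_ROOT
# Prints a prompt that carries the absolute path, the pre-flight guard,
# and the one-task/one-commit rule.
build_iteration_prompt() {
  local root="$1"
  cat <<EOF
Work only in: $root
Pre-flight: verify that AGENT_INSTRUCTIONS.md and TODO.md exist in $root.
If either file is missing, stop and report:
STUCK: bad working dir or missing harness files at $root
Complete exactly one task from TODO.md, then make exactly one commit.
EOF
}

# Usage (hypothetical):
# build_iteration_prompt /home/paulh/.openclaw/workspace/projects/recipe-manager
```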

---

## Workflow Periodic Execution (cron + systemd)

All commands assume project root:
`/home/paulh/.openclaw/workspace/projects/recipe-manager`

### Manual commands

```bash
# Resume from checkpoint (default mode)
npm run workflow:run

# Force restart from stage 1
npm run workflow:run -- --mode restart

# Scheduled run entrypoint (resume + morning report)
npm run workflow:schedule

# Health signal for automation (0=healthy, 1=failed/blocked/unknown)
npm run workflow:health-check
```

### Cron example

Run scheduler every 15 minutes, health check every 5 minutes:

```cron
*/15 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:schedule >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-schedule.log 2>&1
*/5 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:health-check >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-health.log 2>&1
```

### systemd example

Create one-shot services and timers:

`/etc/systemd/system/recipe-workflow-schedule.service`
```ini
[Unit]
Description=Recipe Manager scheduled workflow run
After=network.target

[Service]
Type=oneshot
# Assumption: run as the owner of the project directory instead of root
User=paulh
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:schedule
```

`/etc/systemd/system/recipe-workflow-schedule.timer`
```ini
[Unit]
Description=Run Recipe Manager scheduled workflow every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
```

`/etc/systemd/system/recipe-workflow-health.service`
```ini
[Unit]
Description=Recipe Manager workflow health check
After=network.target

[Service]
Type=oneshot
# Assumption: run as the owner of the project directory instead of root
User=paulh
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:health-check
```

`/etc/systemd/system/recipe-workflow-health.timer`
```ini
[Unit]
Description=Run Recipe Manager workflow health check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable timers:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now recipe-workflow-schedule.timer recipe-workflow-health.timer
```

### Troubleshooting failed/blocked status

When `npm run workflow:health-check` returns exit code `1` with `{"status":"failed"}` or `{"status":"blocked"}`:

1. Check current workflow status payload:
   ```bash
   cat status/workflow-status.json
   ```
2. Check recent progress log entries:
   ```bash
   tail -n 50 status/workflow-progress.jsonl
   ```
3. Retry from checkpoint:
   ```bash
   npm run workflow:run
   ```
4. If still blocked/failed, force a clean restart:
   ```bash
   npm run workflow:run -- --mode restart
   ```
5. Re-run health check and confirm healthy output (`idle`, `running`, or `completed`):
   ```bash
   npm run workflow:health-check
   ```

If the status file is missing or malformed, the health check prints `status_read_failed` and exits `1`; regenerate state with `npm run workflow:run -- --mode restart`.
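
For scripted triage, the status value can be classified the same way the steps above describe. This sketch assumes `status/workflow-status.json` carries a top-level `"status"` string like the health-check output does; the helper names are hypothetical, and the `sed` extraction is a rough stand-in for a real JSON parser.

```bash
#!/usr/bin/env bash
# status_of FILE
# Prints the value of the "status" key, or nothing if it cannot be found.
status_of() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# is_healthy_status STATUS
# Healthy values are the ones listed in this runbook: idle, running, completed.
is_healthy_status() {
  case "$1" in
    idle|running|completed) return 0 ;;
    *) return 1 ;;   # failed, blocked, empty, or anything unexpected
  esac
}

# Usage (hypothetical):
# is_healthy_status "$(status_of status/workflow-status.json)" || npm run workflow:run
```

Treating an empty or unknown value as unhealthy matches the health check's behavior of exiting `1` for failed, blocked, or unknown states.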

---

## Completion Definition

A phase is complete when:
1. No unchecked tasks remain in that phase section of TODO.md
2. Latest iteration exits without STUCK/ERROR
3. The commit and the matching TODO update are both present
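
Condition 1 can be checked mechanically. The sketch below assumes each phase is a `## ` heading in `TODO.md` and that tasks are top-level `- [ ]` items; both are assumptions about the file layout, and `unchecked_in_section` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# unchecked_in_section TODO_FILE HEADING
# Counts unchecked tasks between HEADING and the next "## " heading.
unchecked_in_section() {
  awk -v heading="$2" '
    /^## /                   { in_section = ($0 == heading) }
    in_section && /^- \[ \]/ { n++ }
    END                      { print n + 0 }
  ' "$1"
}

# Usage (hypothetical): [ "$(unchecked_in_section TODO.md "## v1")" -eq 0 ]
```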

---

## Recommended Cadence

- Auto-iterator: every 15 minutes
- Progress monitor: every 5 minutes (high visibility mode)

If the monitor output is too noisy, reduce it to every 10–15 minutes.

---

## Handoff Checklist (Before ending a session)

- [ ] Confirm latest commit hash
- [ ] Confirm active phase + next unchecked task
- [ ] Confirm auto-iterator enabled/disabled status
- [ ] Confirm monitor enabled/disabled status
- [ ] Confirm no stale active-session false positives

---

## Quick Status Commands

### Latest commit
`git log -1 --oneline`

### Next tasks
`grep -n "^- \[ \]" TODO.md | head`

### Recent progress
`git log --oneline -5`

---

This runbook should be updated whenever a new failure mode appears.

See also: `INCIDENT_LOG.md` for timestamped operational incidents and fixes.