# Incident Log — Recipe Manager Harness

Purpose: track operational failures, impact, root cause, and permanent fixes.

---

## Template

## [YYYY-MM-DD HH:MM TZ] Incident Title

- **Severity:** Low / Medium / High
- **Status:** Open / Mitigated / Resolved
- **Detected by:** Monitor / Human / Agent
- **Impact:**
  - What stopped or degraded
  - Duration
- **Symptoms:**
  - Exact error text
  - Observable behavior
- **Root cause:**
  - Why it happened
- **Immediate mitigation:**
  - What was done to restore service
- **Permanent fix:**
  - Config/code/process changes
- **Verification:**
  - How we confirmed it works
- **Prevention follow-up:**
  - Guardrails/tests added
- **Links:**
  - Commit(s):
  - Related files:
  - Session/cron IDs:

---

## Recorded Incidents

## [2026-03-24 08:00 EDT] Auto-iterator/monitor stalls due to model auth mismatch

- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human
- **Impact:**
  - Iterations stopped for ~10 hours
  - No new recipe-manager commits during outage
- **Symptoms:**
  - Cron failures: `No API key found for provider "openai"`
  - Repeated job errors with no productive iteration
- **Root cause:**
  - Cron jobs used `openai/...` model path (API-key provider) while environment was authenticated via `openai-codex` OAuth
- **Immediate mitigation:**
  - Disabled broken jobs
  - Manually spawned recovery iterations
- **Permanent fix:**
  - Cron jobs updated to `openai-codex/gpt-5.3-codex`
- **Verification:**
  - Iterations resumed and commits landed again
- **Prevention follow-up:**
  - Runbook updated with provider-prefix rule
- **Links:**
  - Related files: RUNBOOK.md

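
The provider-prefix rule from this incident could be enforced as a pre-flight check before cron jobs are enabled. The following is a minimal sketch under stated assumptions, not the harness's actual code: the job names, the `openai/example-model` path (the real path is elided in this log), and the helpers `provider_prefix` and `find_auth_mismatches` are hypothetical, and the caller is assumed to already know which providers are authenticated.

```python
def provider_prefix(model_path: str) -> str:
    """Return the provider part of a 'provider/model' path."""
    provider, sep, model = model_path.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_path!r}")
    return provider


def find_auth_mismatches(job_models: dict[str, str], authenticated: set[str]) -> dict[str, str]:
    """Map each job name to its provider when that provider has no credentials."""
    mismatches = {}
    for job, model_path in job_models.items():
        provider = provider_prefix(model_path)
        if provider not in authenticated:
            mismatches[job] = provider
    return mismatches


if __name__ == "__main__":
    # Hypothetical job names and model paths, for illustration only.
    jobs = {
        "auto-iterator": "openai/example-model",
        "monitor": "openai-codex/gpt-5.3-codex",
    }
    # Assumption: only the OAuth-backed provider is authenticated.
    print(find_auth_mismatches(jobs, {"openai-codex"}))
    # prints {'auto-iterator': 'openai'}
```

A non-empty result would be a reason to refuse to schedule the listed jobs rather than let them fail repeatedly at run time.
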
## [2026-03-24 21:40 EDT] Iteration skips due to stale session detection + wrong working dir

- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human + monitor alerts
- **Impact:**
  - Auto-iterator repeatedly skipped or produced STUCK responses
- **Symptoms:**
  - `SKIP: iteration already running` with no new commit
  - `STUCK: ... AGENT_INSTRUCTIONS.md and TODO.md missing from /workspace`
- **Root cause:**
  - Stale completed sessions counted as active
  - Iteration prompts sometimes lacked explicit project-root guard
- **Immediate mitigation:**
  - Spawned manual iteration with absolute path + pre-flight checks
- **Permanent fix:**
  - Added mandatory pre-flight guard in AGENT_INSTRUCTIONS.md
  - Updated auto-iterator to require absolute path and freshness-based active-run detection
- **Verification:**
  - New iterations completed successfully with commits:
    - `87e9181` (import test)
    - `276e03c` (import UI page/form)
    - `d4aed47` (parsed preview)
- **Prevention follow-up:**
  - Monitor updated to track `recipe-v1-iter*` labels for v1 phase
- **Links:**
  - Commit(s): `37b17f7`, `d4aed47`, `276e03c`, `87e9181`
  - Related files: AGENT_INSTRUCTIONS.md, TODO.md, RUNBOOK.md

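
The two fixes in this incident, the project-root pre-flight guard and freshness-based active-run detection, can be illustrated roughly as follows. This is a sketch under assumptions rather than the code that was committed: the function names, the `completed` flag, and the 30-minute freshness window are hypothetical. Only the two required file names come from the STUCK message above.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# File names taken from the STUCK symptom recorded in this incident.
REQUIRED_FILES = ("AGENT_INSTRUCTIONS.md", "TODO.md")


def missing_project_files(project_root: Path) -> list[str]:
    """Pre-flight guard: require an absolute root and list any missing files."""
    if not project_root.is_absolute():
        raise ValueError(f"project root must be absolute, got {project_root}")
    return [name for name in REQUIRED_FILES if not (project_root / name).is_file()]


def is_run_active(last_activity: datetime, now: datetime,
                  max_age: timedelta, completed: bool) -> bool:
    """Treat a session as active only if it is unfinished and recently active."""
    if completed:
        return False  # completed sessions must never block a new iteration
    return now - last_activity <= max_age


if __name__ == "__main__":
    missing = missing_project_files(Path.cwd())
    if missing:
        print("STUCK: missing", ", ".join(missing))

    now = datetime.now(timezone.utc)
    # Assumption: 30 minutes without activity means the session is stale.
    stale = is_run_active(now - timedelta(hours=2), now,
                          timedelta(minutes=30), completed=False)
    print("skip iteration" if stale else "start iteration")
```

Checking freshness instead of mere session existence is what keeps an old session from producing `SKIP: iteration already running` indefinitely.
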
## [2026-03-24 17:55 EDT] Docker validation blocked in runtime host

- **Severity:** Medium
- **Status:** Mitigated (manual follow-up required)
- **Detected by:** Agent
- **Impact:**
  - Could not complete local docker deployment test from agent environment
- **Symptoms:**
  - `docker: command not found`
- **Root cause:**
  - Runtime host lacks Docker CLI/daemon
- **Immediate mitigation:**
  - Marked task as manual host validation
- **Permanent fix:**
  - Keep as explicit manual step in TODO for host with Docker installed
- **Verification:**
  - Manual non-docker dev run validated separately
- **Prevention follow-up:**
  - Documented as environment capability mismatch in RUNBOOK.md
- **Links:**
  - Commit: `1a4b984`
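
An environment capability mismatch like this one can be detected up front instead of surfacing as `docker: command not found` mid-task. The sketch below is an assumption about how such a check might look, not part of the harness: `missing_tools` is a hypothetical helper built on the standard library's `shutil.which`, which only confirms that an executable is on `PATH` and says nothing about whether a Docker daemon is running.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the required executables that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["docker"])
    if missing:
        # Mirrors the mitigation above: defer to a host that has the tool.
        print("manual host validation required; missing:", ", ".join(missing))
    else:
        print("docker CLI found; daemon availability still needs a separate check")
```
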