From 4c512a5161c4095711d803b1d932ac8c21d014a4 Mon Sep 17 00:00:00 2001
From: Paul Huliganga
Date: Tue, 24 Mar 2026 22:41:49 -0400
Subject: [PATCH] docs(ops): add INCIDENT_LOG and link from RUNBOOK

---
 INCIDENT_LOG.md | 109 ++++++++++++++++++++++++++++++++++++++++++++++++
 RUNBOOK.md      |   2 +
 2 files changed, 111 insertions(+)
 create mode 100644 INCIDENT_LOG.md

diff --git a/INCIDENT_LOG.md b/INCIDENT_LOG.md
new file mode 100644
index 0000000..07b359e
--- /dev/null
+++ b/INCIDENT_LOG.md
@@ -0,0 +1,109 @@
+# Incident Log — Recipe Manager Harness
+
+Purpose: track operational failures, impact, root cause, and permanent fixes.
+
+---
+
+## Template
+
+## [YYYY-MM-DD HH:MM TZ] Incident Title
+- **Severity:** Low / Medium / High
+- **Status:** Open / Mitigated / Resolved
+- **Detected by:** Monitor / Human / Agent
+- **Impact:**
+  - What stopped or degraded
+  - Duration
+- **Symptoms:**
+  - Exact error text
+  - Observable behavior
+- **Root cause:**
+  - Why it happened
+- **Immediate mitigation:**
+  - What was done to restore service
+- **Permanent fix:**
+  - Config/code/process changes
+- **Verification:**
+  - How we confirmed it works
+- **Prevention follow-up:**
+  - Guardrails/tests added
+- **Links:**
+  - Commit(s):
+  - Related files:
+  - Session/cron IDs:
+
+---
+
+## Recorded Incidents
+
+## [2026-03-24 08:00 EDT] Auto-iterator/monitor stalls due to model auth mismatch
+- **Severity:** High
+- **Status:** Resolved
+- **Detected by:** Human
+- **Impact:**
+  - Iterations stopped for ~10 hours
+  - No new recipe-manager commits during outage
+- **Symptoms:**
+  - Cron failures: `No API key found for provider "openai"`
+  - Repeated job errors with no productive iteration
+- **Root cause:**
+  - Cron jobs used `openai/...` model path (API-key provider) while environment was authenticated via `openai-codex` OAuth
+- **Immediate mitigation:**
+  - Disabled broken jobs
+  - Manually spawned recovery iterations
+- **Permanent fix:**
+  - Cron jobs updated to `openai-codex/gpt-5.3-codex`
+- **Verification:**
+  - Iterations resumed and commits landed again
+- **Prevention follow-up:**
+  - Runbook updated with provider-prefix rule
+- **Links:**
+  - Related files: RUNBOOK.md
+
+## [2026-03-24 21:40 EDT] Iteration skips due to stale session detection + wrong working dir
+- **Severity:** High
+- **Status:** Resolved
+- **Detected by:** Human + monitor alerts
+- **Impact:**
+  - Auto-iterator repeatedly skipped or produced STUCK responses
+- **Symptoms:**
+  - `SKIP: iteration already running` with no new commit
+  - `STUCK: ... AGENT_INSTRUCTIONS.md and TODO.md missing from /workspace`
+- **Root cause:**
+  - Stale completed sessions counted as active
+  - Iteration prompts sometimes lacked explicit project-root guard
+- **Immediate mitigation:**
+  - Spawned manual iteration with absolute path + pre-flight checks
+- **Permanent fix:**
+  - Added mandatory pre-flight guard in AGENT_INSTRUCTIONS.md
+  - Updated auto-iterator to require absolute path and freshness-based active-run detection
+- **Verification:**
+  - New iterations completed successfully with commits:
+    - `87e9181` (import test)
+    - `276e03c` (import UI page/form)
+    - `d4aed47` (parsed preview)
+- **Prevention follow-up:**
+  - Monitor updated to track `recipe-v1-iter*` labels for v1 phase
+- **Links:**
+  - Commit(s): `37b17f7`, `d4aed47`, `276e03c`, `87e9181`
+  - Related files: AGENT_INSTRUCTIONS.md, TODO.md, RUNBOOK.md
+
+## [2026-03-24 17:55 EDT] Docker validation blocked in runtime host
+- **Severity:** Medium
+- **Status:** Mitigated (manual follow-up required)
+- **Detected by:** Agent
+- **Impact:**
+  - Could not complete local docker deployment test from agent environment
+- **Symptoms:**
+  - `docker: command not found`
+- **Root cause:**
+  - Runtime host lacks Docker CLI/daemon
+- **Immediate mitigation:**
+  - Marked task as manual host validation
+- **Permanent fix:**
+  - Keep as explicit manual step in TODO for host with Docker installed
+- **Verification:**
+  - Manual non-docker dev run validated separately
+- **Prevention follow-up:**
+  - Documented as environment capability mismatch in RUNBOOK.md
+- **Links:**
+  - Commit: `1a4b984`
diff --git a/RUNBOOK.md b/RUNBOOK.md
index 4ff079c..145d41c 100644
--- a/RUNBOOK.md
+++ b/RUNBOOK.md
@@ -184,3 +184,5 @@ If noisy, set monitor to every 10–15 minutes.
 ---
 
 This runbook should be updated whenever a new failure mode appears.
+
+See also: `INCIDENT_LOG.md` for timestamped operational incidents and fixes.
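
Note on the 21:40 EDT incident: the permanent fix it records (absolute project path, pre-flight checks, freshness-based active-run detection) might look roughly like the sketch below. This is an illustration only, not the harness's actual code. The function names, the `finished` flag, and the 30-minute freshness window are all assumptions; only the two required file names come from the log.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List, Optional

# Files the incident log says must be present in the project root.
REQUIRED_FILES = ("AGENT_INSTRUCTIONS.md", "TODO.md")
# Assumed value: the log does not state what freshness window is used.
FRESHNESS_WINDOW = timedelta(minutes=30)


def preflight_errors(project_root: str) -> List[str]:
    """Return the reasons an iteration must not start (empty list means OK)."""
    root = Path(project_root)
    if not root.is_absolute():
        return ["project root must be an absolute path: %r" % project_root]
    if not root.is_dir():
        return ["project root is not a directory: %r" % project_root]
    return [
        "missing required file: %s" % name
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]


def is_run_active(
    last_activity: datetime,
    finished: bool,
    now: Optional[datetime] = None,
) -> bool:
    """Treat a session as active only if it is unfinished and recently updated."""
    if finished:
        # Completed sessions never block a new iteration.
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_activity <= FRESHNESS_WINDOW
```

Under these assumptions, a scheduler would skip only when `is_run_active(...)` is true for some session, and would report STUCK with the messages from `preflight_errors(...)` rather than starting in the wrong directory.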