Troubleshooting — When Things Go Wrong
Agents are remarkably capable, but they fail in predictable ways. This guide catalogs the common failure modes and how to fix each one.
Failure Taxonomy
Agent failures fall into five categories:
| Category | Symptom | Severity | Fix Effort |
|---|---|---|---|
| Stuck Loop | Same task attempted repeatedly | 🟡 Medium | Clarify spec or split task |
| Drift | Code diverges from spec | 🟠 High | Review + reset + constrain |
| Overengineering | Too much abstraction, unnecessary complexity | 🟡 Medium | Simplify constraints in spec |
| Test Theater | Tests pass but don't test anything real | 🔴 Critical | Rewrite tests, add examples |
| Context Overflow | Agent loses track mid-iteration | 🟡 Medium | Reduce task size |
Problem 1: The Stuck Loop
Symptom
The agent attempts the same task 3+ iterations in a row. Git log shows repeated attempts and reverts, or no commits at all.
Root Causes
1a. Task is too large
# Bad
- [ ] Implement the entire authentication system
# Fix: Split into smaller pieces
- [ ] Create JWT token generation function
- [ ] Create token refresh function
- [ ] Create auth middleware
- [ ] Wire auth into CLI commands
1b. Spec is ambiguous
The agent interprets the requirement differently each iteration, never matching what you expect.
# Bad
- [ ] Handle errors properly
# Fix: Be explicit
- [ ] Return HTTP 400 with { error: "message" } for validation failures
- [ ] Return HTTP 401 with { error: "Token expired" } for auth failures
- [ ] Log errors to stderr, never expose stack traces to users
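Written as code, those explicit rules might look like this — a sketch only; the error classes are hypothetical stand-ins, and wiring into your framework's handler is left out:

```typescript
// Hypothetical error classes; your codebase's names will differ.
class ValidationError extends Error {}
class AuthError extends Error {}

// Map an error to the HTTP response the spec requires.
function toHttpError(err: Error): { status: number; body: { error: string } } {
  if (err instanceof ValidationError) {
    return { status: 400, body: { error: err.message } };
  }
  if (err instanceof AuthError) {
    return { status: 401, body: { error: "Token expired" } };
  }
  console.error(err.stack); // log to stderr only; never expose stack traces
  return { status: 500, body: { error: "Internal server error" } };
}
```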
1c. External dependency is broken
The agent can't complete the task because an API, service, or tool isn't working.
# Detection: Agent's commit messages mention the same error repeatedly
# Fix: Add a note to the plan
- [ ] CLM API integration
> BLOCKED: CLM API returns 401 for all accounts. Skip this task.
> Will revisit when account provisioning is resolved.
1d. Tests are impossible to pass
A previous iteration wrote tests with wrong expectations, and now the agent can't make them pass.
# Fix: Reset tests to last known-good state
git checkout HEAD~3 -- tests/
# Then update the plan with clarification
Recovery Steps
- Check git log — is the agent making progress at all?
- Read the agent's output from the last 2-3 iterations (session transcripts)
- Identify which root cause matches
- Apply the appropriate fix
- Add a note to the plan explaining the resolution
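Step 1 can be partly automated. Here's a minimal sketch of a loop detector over recent commit subjects — the threshold of 3 is an arbitrary choice, tune it to your iteration cadence:

```typescript
// Flag commit subjects that repeat `threshold` or more times.
function repeatedSubjects(subjects: string[], threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const s of subjects) counts.set(s, (counts.get(s) ?? 0) + 1);
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([subject]) => subject);
}

// Usage (from the repo root):
//   const log = execSync("git log --oneline -10", { encoding: "utf8" });
//   const subjects = log.trim().split("\n").map((l) => l.replace(/^\S+\s+/, ""));
//   if (repeatedSubjects(subjects).length > 0) console.warn("Possible stuck loop");
```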
Problem 2: Architecture Drift
Symptom
Over many iterations, the codebase structure diverges from what the spec defines. Files appear in wrong directories, patterns change between iterations, or the agent introduces frameworks/tools not in the spec.
Root Causes
2a. Spec doesn't specify architecture strongly enough
# Bad (too vague)
### Project structure
src/
tests/
# Fix (explicit)
### Project structure
packages/
├── server/
│ ├── src/
│ │ ├── routes/ # Express route handlers
│ │ ├── services/ # Business logic (no HTTP awareness)
│ │ ├── models/ # Database access
│ │ └── index.ts # Server entry point
│ └── tests/
│ ├── routes/ # Integration tests (HTTP)
│ └── services/ # Unit tests (pure functions)
2b. Agent "improves" existing patterns
Iteration 8's agent thinks iteration 3's pattern is bad and refactors it, breaking the consistency.
# Fix: Add to spec constraints
### Constraints
- MUST NOT refactor code from previous iterations unless the current task requires it
- MUST follow existing patterns (look at how similar features are already implemented)
- MUST NOT introduce new dependencies without explicit approval
2c. Fresh context means no memory of decisions
Each iteration starts fresh. The agent in iteration 10 doesn't know WHY iteration 3 chose a particular approach.
# Fix: Document decisions in a DECISIONS.md file
## Architecture Decisions
### ADR-001: curl over fetch for HTTP calls
**Context:** Node.js fetch sends extra headers that cause SpringCM 500 errors.
**Decision:** Use child_process.exec with curl for all API calls.
**Status:** Accepted. DO NOT CHANGE.
### ADR-002: Shared package for cross-cutting utilities
**Context:** 7 packages had duplicated auth, env, and API code.
**Decision:** Extract to packages/shared/, import as docusign-direct-shared.
**Status:** Accepted. All new packages must use shared utilities.
Add DECISIONS.md to the files the agent reads during the Orient phase.
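In code, ADR-001 might translate to something like this — a sketch, not the project's actual helper; the URL and token are placeholders, and execFileSync is used instead of exec to avoid shell-quoting pitfalls:

```typescript
import { execFileSync } from "node:child_process";

// Build the curl argument list separately so it can be tested without a network.
function curlArgs(url: string, token: string): string[] {
  return ["-sS", "-H", `Authorization: Bearer ${token}`, url];
}

// Per ADR-001: shell out to curl rather than using Node's fetch.
function curlJson(url: string, token: string): unknown {
  const body = execFileSync("curl", curlArgs(url, token), { encoding: "utf8" });
  return JSON.parse(body);
}
```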
Recovery Steps
- git diff the current state against the spec's project structure
- Identify what drifted and when (git log + git blame)
- Reset if severe, or add corrective tasks to the plan
- Strengthen the spec's architecture section
- Add DECISIONS.md for non-obvious choices
Problem 3: Overengineering
Symptom
The agent creates elaborate abstractions, design patterns, and infrastructure that the spec doesn't call for. Factory factories. Abstract base classes for things with one implementation. Configuration systems for things with one value.
Root Causes
3a. Agent defaults to "enterprise" patterns
LLMs are trained on a lot of enterprise code. They gravitate toward abstraction.
# Fix: Add to spec constraints
### Constraints
- PREFER simple functions over classes
- PREFER direct implementation over abstraction layers
- MUST NOT create an interface unless there are 2+ implementations
- MUST NOT add configuration for things that have one value
- Follow YAGNI: You Aren't Gonna Need It
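The difference these constraints make, in miniature — the names here are illustrative, not from any real codebase:

```typescript
// Overengineered: an interface with one implementation, hidden behind a factory.
interface GreeterStrategy {
  greet(name: string): string;
}
class DefaultGreeter implements GreeterStrategy {
  greet(name: string): string { return `Hello, ${name}`; }
}
class GreeterFactory {
  static create(): GreeterStrategy { return new DefaultGreeter(); }
}

// What the constraints ask for: a plain function.
const greet = (name: string): string => `Hello, ${name}`;

// Both produce the same greeting; the second does it in one line instead of ten.
```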
3b. Task is too vague, agent fills the gap with architecture
# Bad
- [ ] Create the data layer
# Fix
- [ ] Create SQLite database with schema: users(id, name, email),
transactions(id, user_id, amount, date). Use better-sqlite3.
One file: src/db.ts. No ORM.
Recovery Steps
- Identify the unnecessary abstraction
- Add explicit simplicity constraints to the spec
- If the code works despite being over-engineered, leave it (unless it impedes future tasks)
- If it's blocking progress, simplify and recommit
Problem 4: Test Theater
Symptom
All tests pass, but when you actually USE the software, it doesn't work correctly. Tests are checking for existence (toBeDefined), not behavior.
This is the most dangerous failure mode because it's invisible.
Root Causes
4a. No example I/O in the spec
The agent doesn't know what correct output looks like, so it tests for "something came back."
# Fix: Add to spec
### Input/Output Examples
**QFX Import:**
Input file: (see data/sample.qfx)
Expected output:
[
{ date: "2024-01-15", amount: -42.50, payee: "COSTCO", memo: "Purchase", type: "debit" },
{ date: "2024-01-16", amount: 2500.00, payee: "EMPLOYER INC", memo: "Payroll", type: "credit" }
]
4b. No test quality standards in the spec
# Fix: Add to spec
### Testing Standards
- Every test must assert SPECIFIC values, not just "defined" or "truthy"
- Tests must include at least one edge case (empty input, null values)
- Tests must include at least one error case (invalid input, missing data)
- Use realistic test data, not "foo" and "bar"
- Test the PUBLIC behavior, not internal implementation details
4c. Agent writes tests after implementation (confirmation bias)
The agent sees what the code does and writes tests that confirm it — even if the code is wrong.
# Fix: Use the test-first pattern in the plan
- [ ] Write failing tests for QFX parser (based on spec examples)
- [ ] Implement QFX parser to pass tests
Recovery Steps
- Run the code yourself — does it actually work?
- Read the test assertions — are they testing behavior or existence?
- Add example I/O to the spec
- Add a "rewrite tests" task to the plan with explicit expected values
- Consider adding a test-first constraint to the spec
Problem 5: Context Overflow
Symptom
Agent starts strong but degrades mid-iteration. Later changes contradict earlier ones. The agent "forgets" what it was doing partway through a task.
Root Causes
5a. Task requires reading too many files
The agent fills its context window with file contents and loses track of the goal.
# Fix: Make tasks more focused
# Bad
- [ ] Refactor all 7 packages to use shared utilities
# Good
- [ ] Refactor clm-direct to use shared utilities
- [ ] Refactor docgen-direct to use shared utilities
- [ ] Refactor maestro-direct to use shared utilities
# ... (one per package)
5b. Spec is too long
If PROJECT-SPEC.md is 50 pages, the agent uses half its context just reading it.
# Fix: Section the spec so agents only read what they need
# In AGENT.md:
### Orient
- Read PROJECT-SPEC.md sections 1-2 (overview and tech stack)
- Read the acceptance criteria ONLY for the current task
- Read IMPLEMENTATION_PLAN.md
- Do NOT read sections you don't need this iteration
5c. Agent reads too many files during orient
# Fix: Limit the orient phase
### Orient
- Read IMPLEMENTATION_PLAN.md (always)
- Read PROJECT-SPEC.md section for current task (not the whole thing)
- Run git log --oneline -5 (not -50)
- Check build status: npm run build 2>&1 | tail -5
Recovery Steps
- Check if the task requires touching many files — if so, split it
- Trim the spec (move detailed examples to separate reference files)
- Adjust AGENT.md to limit the orient phase
- Reduce iteration timeout to force smaller tasks
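A crude way to catch an oversized spec before it burns context. The 4-characters-per-token ratio and the 8,000-token threshold are rough heuristics, not real tokenizer output:

```typescript
import { existsSync, readFileSync } from "node:fs";

// Rough token estimate: ~4 characters per token for English prose.
function roughTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

if (existsSync("PROJECT-SPEC.md")) {
  const spec = readFileSync("PROJECT-SPEC.md", "utf8");
  if (roughTokens(spec) > 8000) {
    console.warn(`Spec is ~${roughTokens(spec)} tokens; consider sectioning it.`);
  }
}
```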
The Meta-Problem: When to Give Up
Sometimes a task genuinely can't be done by an agent. Signs:
- Requires external knowledge the agent can't access (undocumented API behavior)
- Requires human judgment that can't be specified (design aesthetics, UX decisions)
- Requires real-time interaction with a service (OAuth browser flows, 2FA)
- Requires physical access (hardware testing, network configuration)
In these cases:
- Mark the task as HUMAN in the plan
- Do it yourself
- Commit the result
- Let the agent continue with the next task
- [x] Set up OAuth application in DocuSign admin panel (HUMAN)
- [ ] Implement JWT auth flow using the credentials from .env
There's no shame in doing parts yourself. The harness is a tool, not a religion.
Quick Reference: Failure → Fix
| Failure | Quick Fix |
|---|---|
| Agent repeats same task | Split the task into smaller pieces |
| Wrong architecture | Add explicit project structure to spec |
| Too much abstraction | Add YAGNI constraint to spec |
| Tests don't test anything | Add example I/O to spec |
| Agent adds unrequested features | Add "MUST NOT add features not in spec" constraint |
| Agent changes existing patterns | Add "MUST follow existing patterns" constraint |
| Agent uses wrong tool/framework | Be explicit about tech stack (MUST use X, MUST NOT use Y) |
| Progress stalls completely | Read last 3 transcripts, identify the blocker, unblock manually |
Every failure teaches you to write a better spec. That's the real loop.