Troubleshooting — When Things Go Wrong

Agents are remarkably capable, but they fail in predictable ways. This guide catalogs the common failure modes and how to fix each one.


Failure Taxonomy

Agent failures fall into five categories:

| Category | Symptom | Severity | Fix |
|---|---|---|---|
| Stuck Loop | Same task attempted repeatedly | 🟡 Medium | Clarify spec or split task |
| Drift | Code diverges from spec | 🟠 High | Review + reset + constrain |
| Overengineering | Too much abstraction, unnecessary complexity | 🟡 Medium | Add simplicity constraints to spec |
| Test Theater | Tests pass but don't test anything real | 🔴 Critical | Rewrite tests, add examples |
| Context Overflow | Agent loses track mid-iteration | 🟡 Medium | Reduce task size |

Problem 1: The Stuck Loop

Symptom

The agent attempts the same task 3+ iterations in a row. Git log shows repeated attempts and reverts, or no commits at all.
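
You can often spot this mechanically by scanning recent commit subjects for repeats. A minimal sketch in TypeScript, assuming it runs from the repo root (the script name and the threshold of 3 are illustrative, not part of the harness):

```ts
// detect-stuck-loop.ts: hypothetical helper for the harness operator.
import { execSync } from "node:child_process";

const subjects = execSync("git log --oneline -15", { encoding: "utf8" })
  .trim()
  .split("\n")
  .map((line) => line.replace(/^[0-9a-f]+\s+/, "")); // drop the short hash

const counts = new Map<string, number>();
for (const s of subjects) counts.set(s, (counts.get(s) ?? 0) + 1);

for (const [subject, n] of counts) {
  if (n >= 3) console.warn(`Possible stuck loop: "${subject}" appears ${n} times`);
}
```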

Root Causes

1a. Task is too large

# Bad
- [ ] Implement the entire authentication system

# Fix: Split into smaller pieces
- [ ] Create JWT token generation function
- [ ] Create token refresh function  
- [ ] Create auth middleware
- [ ] Wire auth into CLI commands

1b. Spec is ambiguous

The agent interprets the requirement differently each iteration, never matching what you expect.

# Bad
- [ ] Handle errors properly

# Fix: Be explicit
- [ ] Return HTTP 400 with { error: "message" } for validation failures
- [ ] Return HTTP 401 with { error: "Token expired" } for auth failures
- [ ] Log errors to stderr, never expose stack traces to users
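
To make that contract concrete, here is a minimal sketch of error-handling middleware enforcing those three rules, assuming an Express server (HttpError is a hypothetical helper, not something the spec names):

```ts
// errors.ts: a sketch of the explicit error policy above, assuming Express.
import type { NextFunction, Request, Response } from "express";

export class HttpError extends Error {
  constructor(public status: number, message: string) {
    super(message);
  }
}

export function errorHandler(
  err: unknown,
  _req: Request,
  res: Response,
  _next: NextFunction,
) {
  if (err instanceof HttpError) {
    // e.g. 400 for validation failures, 401 for expired tokens
    res.status(err.status).json({ error: err.message });
    return;
  }
  console.error(err); // stderr only; no stack trace reaches the client
  res.status(500).json({ error: "Internal server error" });
}
```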

1c. External dependency is broken

The agent can't complete the task because an API, service, or tool isn't working.

# Detection: Agent's commit messages mention the same error repeatedly
# Fix: Add a note to the plan
- [ ] CLM API integration
  > BLOCKED: CLM API returns 401 for all accounts. Skip this task.
  > Will revisit when account provisioning is resolved.

1d. Tests are impossible to pass

A previous iteration wrote tests with the wrong expectations, and now the agent can't make them pass.

# Fix: Reset tests to last known-good state
git checkout HEAD~3 -- tests/
# Then update the plan with clarification

Recovery Steps

  1. Check git log — is the agent making progress at all?
  2. Read the agent's output from the last 2-3 iterations (session transcripts)
  3. Identify which root cause matches
  4. Apply the appropriate fix
  5. Add a note to the plan explaining the resolution

Problem 2: Architecture Drift

Symptom

Over many iterations, the codebase structure diverges from what the spec defines. Files appear in wrong directories, patterns change between iterations, or the agent introduces frameworks/tools not in the spec.

Root Causes

2a. Spec doesn't specify architecture strongly enough

# Bad (too vague)
### Project structure
src/
tests/

# Fix (explicit)
### Project structure
packages/
├── server/
│   ├── src/
│   │   ├── routes/        # Express route handlers
│   │   ├── services/      # Business logic (no HTTP awareness)
│   │   ├── models/        # Database access
│   │   └── index.ts       # Server entry point
│   └── tests/
│       ├── routes/        # Integration tests (HTTP)
│       └── services/      # Unit tests (pure functions)

2b. Agent "improves" existing patterns Iteration 8's agent thinks iteration 3's pattern is bad and refactors it, breaking the consistency.

# Fix: Add to spec constraints
### Constraints
- MUST NOT refactor code from previous iterations unless the current task requires it
- MUST follow existing patterns (look at how similar features are already implemented)
- MUST NOT introduce new dependencies without explicit approval

2c. Fresh context means no memory of decisions

Each iteration starts fresh. The agent in iteration 10 doesn't know WHY iteration 3 chose a particular approach.

# Fix: Document decisions in a DECISIONS.md file
## Architecture Decisions

### ADR-001: curl over fetch for HTTP calls
**Context:** Node.js fetch sends extra headers that cause SpringCM 500 errors.
**Decision:** Use child_process.exec with curl for all API calls.
**Status:** Accepted. DO NOT CHANGE.

### ADR-002: Shared package for cross-cutting utilities
**Context:** 7 packages had duplicated auth, env, and API code.
**Decision:** Extract to packages/shared/, import as docusign-direct-shared.
**Status:** Accepted. All new packages must use shared utilities.

Add DECISIONS.md to the files the agent reads during the Orient phase.
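
As an illustration of the ADR-001 pattern (not the project's actual code), a hedged sketch; it uses execFile rather than the exec the ADR mentions, which sidesteps shell-quoting of the URL and token, and the function name and auth header are assumptions:

```ts
// A sketch of the curl-over-fetch pattern from ADR-001.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

export async function apiGet(url: string, token: string): Promise<unknown> {
  // --fail makes curl exit non-zero on HTTP errors, rejecting the promise
  const { stdout } = await execFileAsync("curl", [
    "--silent",
    "--fail",
    "-H",
    `Authorization: Bearer ${token}`,
    url,
  ]);
  return JSON.parse(stdout);
}
```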

Recovery Steps

  1. git diff the current state against the spec's project structure (see the sketch after this list)
  2. Identify what drifted and when (git log + git blame)
  3. Reset if severe, or add corrective tasks to the plan
  4. Strengthen the spec's architecture section
  5. Add DECISIONS.md for non-obvious choices
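
The check in step 1 can be partially automated. A rough sketch that flags tracked TypeScript files outside the directories the spec allows (the allowed prefixes are placeholders you would copy from the spec):

```ts
// check-structure.ts: hypothetical drift check, run from the repo root.
import { execSync } from "node:child_process";

const allowed = ["packages/server/src/", "packages/server/tests/"]; // placeholders

const strays = execSync("git ls-files", { encoding: "utf8" })
  .trim()
  .split("\n")
  .filter((f) => f.endsWith(".ts") && !allowed.some((dir) => f.startsWith(dir)));

if (strays.length > 0) {
  console.error(`Files outside the spec's structure:\n${strays.join("\n")}`);
  process.exit(1);
}
```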

Problem 3: Overengineering

Symptom

The agent creates elaborate abstractions, design patterns, and infrastructure that the spec doesn't call for. Factory factories. Abstract base classes for things with one implementation. Configuration systems for things with one value.

Root Causes

3a. Agent defaults to "enterprise" patterns

LLMs are trained on a lot of enterprise code. They gravitate toward abstraction.

# Fix: Add to spec constraints
### Constraints
- PREFER simple functions over classes
- PREFER direct implementation over abstraction layers
- MUST NOT create an interface unless there are 2+ implementations
- MUST NOT add configuration for things that have one value
- Follow YAGNI: You Aren't Gonna Need It
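
A contrived before/after of what those constraints rule out (the token example is illustrative, not from any spec):

```ts
// Over-engineered: an interface and a factory for a single implementation.
interface TokenGenerator {
  generate(userId: string): string;
}

class JwtTokenGenerator implements TokenGenerator {
  generate(userId: string): string {
    return `jwt-for-${userId}`; // placeholder for real signing
  }
}

class TokenGeneratorFactory {
  static create(): TokenGenerator {
    return new JwtTokenGenerator();
  }
}

// What the constraints push toward: one plain function.
export function generateToken(userId: string): string {
  return `jwt-for-${userId}`; // placeholder for real signing
}
```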

3b. Task is too vague, agent fills the gap with architecture

# Bad
- [ ] Create the data layer

# Fix
- [ ] Create SQLite database with schema: users(id, name, email), 
      transactions(id, user_id, amount, date). Use better-sqlite3.
      One file: src/db.ts. No ORM.
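
A sketch of the src/db.ts that the rewritten task pins down; the schema and library come from the task, while the database file name, column types, and constraints are assumptions:

```ts
// src/db.ts: sketch of the task's end state using better-sqlite3, no ORM.
import Database from "better-sqlite3";

const db = new Database("data.db"); // file name is an assumption

db.exec(`
  CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
  );
  CREATE TABLE IF NOT EXISTS transactions (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    amount REAL NOT NULL,
    date TEXT NOT NULL
  );
`);

export default db;
```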

Recovery Steps

  1. Identify the unnecessary abstraction
  2. Add explicit simplicity constraints to the spec
  3. If the code works despite being over-engineered, leave it (unless it impedes future tasks)
  4. If it's blocking progress, simplify and recommit

Problem 4: Test Theater

Symptom

All tests pass, but when you actually USE the software, it doesn't work correctly. Tests are checking for existence (toBeDefined), not behavior.

This is the most dangerous failure mode because it's invisible.

Root Causes

4a. No example I/O in the spec

The agent doesn't know what correct output looks like, so it tests for "something came back."

# Fix: Add to spec
### Input/Output Examples

**QFX Import:**
Input file: (see data/sample.qfx)
Expected output:
[
  { date: "2024-01-15", amount: -42.50, payee: "COSTCO", memo: "Purchase", type: "debit" },
  { date: "2024-01-16", amount: 2500.00, payee: "EMPLOYER INC", memo: "Payroll", type: "credit" }
]

4b. No test quality standards in the spec

# Fix: Add to spec
### Testing Standards
- Every test must assert SPECIFIC values, not just "defined" or "truthy"
- Tests must include at least one edge case (empty input, null values)
- Tests must include at least one error case (invalid input, missing data)
- Use realistic test data, not "foo" and "bar"
- Test the PUBLIC behavior, not internal implementation details
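
Applied to the QFX example from 4a, those standards produce tests like this sketch, which assumes a Vitest setup and a parseQfx function (its name and its behavior on empty or malformed input are assumptions):

```ts
// qfx.test.ts: the standards above applied to the spec's QFX example.
import { readFileSync } from "node:fs";
import { describe, expect, it } from "vitest";
import { parseQfx } from "../src/qfx"; // assumed module

describe("parseQfx", () => {
  it("parses the sample file into specific transactions", () => {
    const txns = parseQfx(readFileSync("data/sample.qfx", "utf8"));
    // Assert exact values, not just that something came back.
    expect(txns[0]).toEqual({
      date: "2024-01-15",
      amount: -42.5,
      payee: "COSTCO",
      memo: "Purchase",
      type: "debit",
    });
    expect(txns).toHaveLength(2);
  });

  it("returns an empty array when the file has no transactions", () => {
    expect(parseQfx("")).toEqual([]); // edge case (assumed behavior)
  });

  it("throws on malformed input", () => {
    expect(() => parseQfx("not qfx at all")).toThrow(); // error case (assumed behavior)
  });
});
```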

4c. Agent writes tests after implementation (confirmation bias)

The agent sees what the code does and writes tests that confirm it — even if the code is wrong.

# Fix: Use the test-first pattern in the plan
- [ ] Write failing tests for QFX parser (based on spec examples)
- [ ] Implement QFX parser to pass tests

Recovery Steps

  1. Run the code yourself — does it actually work?
  2. Read the test assertions — are they testing behavior or existence?
  3. Add example I/O to the spec
  4. Add a "rewrite tests" task to the plan with explicit expected values
  5. Consider adding a test-first constraint to the spec

Problem 5: Context Overflow

Symptom

Agent starts strong but degrades mid-iteration. Later changes contradict earlier ones. The agent "forgets" what it was doing partway through a task.

Root Causes

5a. Task requires reading too many files

The agent fills its context window with file contents and loses track of the goal.

# Fix: Make tasks more focused
# Bad
- [ ] Refactor all 7 packages to use shared utilities

# Good
- [ ] Refactor clm-direct to use shared utilities
- [ ] Refactor docgen-direct to use shared utilities
- [ ] Refactor maestro-direct to use shared utilities
# ... (one per package)

5b. Spec is too long

If PROJECT-SPEC.md is 50 pages, the agent uses half its context just reading it.

# Fix: Section the spec so agents only read what they need
# In AGENT.md:
### Orient
- Read PROJECT-SPEC.md sections 1-2 (overview and tech stack)
- Read the acceptance criteria ONLY for the current task
- Read IMPLEMENTATION_PLAN.md
- Do NOT read sections you don't need this iteration

5c. Agent reads too many files during orient

# Fix: Limit the orient phase
### Orient
- Read IMPLEMENTATION_PLAN.md (always)
- Read PROJECT-SPEC.md section for current task (not the whole thing)
- Run git log --oneline -5 (not -50)
- Check build status: npm run build 2>&1 | tail -5

Recovery Steps

  1. Check if the task requires touching many files — if so, split it
  2. Trim the spec (move detailed examples to separate reference files)
  3. Adjust AGENT.md to limit the orient phase
  4. Reduce iteration timeout to force smaller tasks

The Meta-Problem: When to Give Up

Sometimes a task genuinely can't be done by an agent. Signs:

  • Requires external knowledge the agent can't access (undocumented API behavior)
  • Requires human judgment that can't be specified (design aesthetics, UX decisions)
  • Requires real-time interaction with a service (OAuth browser flows, 2FA)
  • Requires physical access (hardware testing, network configuration)

In these cases:

  1. Mark the task as HUMAN in the plan
  2. Do it yourself
  3. Commit the result
  4. Let the agent continue with the next task
For example, in the plan:

- [x] Set up OAuth application in DocuSign admin panel (HUMAN)
- [ ] Implement JWT auth flow using the credentials from .env

There's no shame in doing parts yourself. The harness is a tool, not a religion.


Quick Reference: Failure → Fix

| Failure | Quick Fix |
|---|---|
| Agent repeats same task | Split the task into smaller pieces |
| Wrong architecture | Add explicit project structure to spec |
| Too much abstraction | Add YAGNI constraint to spec |
| Tests don't test anything | Add example I/O to spec |
| Agent adds unrequested features | Add "MUST NOT add features not in spec" constraint |
| Agent changes existing patterns | Add "MUST follow existing patterns" constraint |
| Agent uses wrong tool/framework | Be explicit about tech stack (MUST use X, MUST NOT use Y) |
| Progress stalls completely | Read last 3 transcripts, identify the blocker, unblock manually |

Every failure teaches you to write a better spec. That's the real loop.