agent-harness/DECISIONS.md


# Architecture Decision Records (ADR)
> Copy this file into your project root as `DECISIONS.md`.
> Use it to document non-obvious architecture choices so agents don't undo them.
> This is your defense against drift and "helpful improvements."
---
## Why Architecture Decision Records?
The agent harness spawns fresh agents each iteration. Each one starts with zero memory of:
- **Why** you chose approach X over Y
- **What** you tried that didn't work
- **Which** patterns are intentional vs accidental
Without ADRs, iteration 10's agent "improves" iteration 3's code in ways that break things. ADRs create continuity across fresh contexts.
---
## When to Write an ADR
Write a decision record when:
- You choose an unusual approach (curl instead of fetch)
- You explicitly avoid a "better" solution (no ORM, raw SQL instead)
- You make a tradeoff (simple+slow over complex+fast)
- You discover a gotcha with a tool/API (SpringCM headers, DocuSign auth quirks)
- An agent keeps trying to "fix" something that's working
**Not every decision needs an ADR.** Only the non-obvious ones. If the code is self-explanatory, skip the record.
---
## ADR Template
```markdown
### ADR-NNN: [Short Title]
**Date:** YYYY-MM-DD
**Status:** [Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
**Context:**
What's the situation? Why does this decision matter?
**Decision:**
What did you decide? Be specific and actionable.
**Consequences:**
What does this enable? What does it prevent? What are the tradeoffs?
**Alternatives Considered:**
What else did you try or think about? Why didn't you choose them?
```
---
## Example ADRs
### ADR-001: Use curl over Node.js fetch for HTTP calls
**Date:** 2024-03-15
**Status:** Accepted
**Context:**
The DocuSign SpringCM API returns 500 errors when called from Node.js using the native `fetch()` function. The errors are inconsistent — same request works in Postman but fails in Node.
Investigation showed that Node's fetch sends extra headers (`Connection: keep-alive`, `Accept-Encoding: gzip, deflate`) that SpringCM's proxy doesn't handle correctly.
**Decision:**
All HTTP calls to SpringCM (and potentially other DocuSign APIs) will use `child_process.exec` with `curl` instead of fetch or axios.
```typescript
import { execSync } from 'node:child_process';

// CORRECT
const result = execSync(
  `curl -X GET "${url}" -H "Authorization: Bearer ${token}"`,
  { encoding: 'utf-8' }
);

// DO NOT USE — Node's fetch sends headers SpringCM's proxy mishandles
// const result = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
```
**Consequences:**
✅ API calls work reliably
✅ Easier to debug (can copy curl command to terminal)
❌ Slightly less "Node-native" (using shell commands)
❌ Must escape shell arguments carefully
**Alternatives Considered:**
- **Axios with minimal headers:** Still sent headers that broke SpringCM
- **Got library:** Same issue as axios
- **Manual request crafting:** Too complex, curl is simpler
**Notes:**
If this decision becomes annoying (shell escaping hell), consider writing a thin wrapper:
```typescript
function curlGet(url: string, token: string): string {
  const safeUrl = shellEscape(url); // shellEscape: your escaping helper
  return execSync(
    `curl -X GET ${safeUrl} -H "Authorization: Bearer ${token}"`,
    { encoding: 'utf-8' }
  );
}
```
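If that wrapper ever becomes necessary, `execFileSync` with an argument array sidesteps shell escaping entirely, since arguments never pass through a shell. A sketch, not part of the original decision:

```typescript
import { execFileSync } from 'node:child_process';

// Arguments go straight to curl as an array, so no shell quoting is needed.
function curlGet(url: string, token: string): string {
  return execFileSync(
    'curl',
    ['-sS', url, '-H', `Authorization: Bearer ${token}`],
    { encoding: 'utf-8' }
  );
}
```

This keeps the "copy the command to a terminal" debuggability while removing the escaping footgun.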
---
### ADR-002: Shared package for cross-cutting utilities
**Date:** 2024-03-17
**Status:** Accepted
**Context:**
Seven packages (`clm-direct`, `docgen-direct`, `maestro-direct`, `template-direct`, `formbuilder-direct`, `springcm-direct`, `powerforms-direct`) had duplicated code for:
- JWT authentication
- Environment variable loading
- API error handling
- Retry logic
Changes required updating 7 files. Tests were inconsistent across packages.
**Decision:**
Extract shared utilities to `packages/shared/` and import as `docusign-direct-shared`.
```typescript
// Before (duplicated in each package)
const token = process.env.DOCUSIGN_TOKEN;
if (!token) throw new Error("Missing token");

// After (shared utility)
import { requireEnv } from 'docusign-direct-shared';
const token = requireEnv('DOCUSIGN_TOKEN');
```
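A minimal `requireEnv` might look like this (a hypothetical sketch, not the actual shared implementation):

```typescript
// Hypothetical sketch of the shared helper: read an env var or fail loudly
// with a consistent error message across all packages.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```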
**Consequences:**
✅ Single source of truth for auth logic
✅ Consistent error messages across packages
✅ Tests only need to cover shared code once
❌ Adds a build-time dependency (shared must build first)
❌ Breaking changes in shared affect all packages
**Alternatives Considered:**
- **Copy-paste approach:** Current state, unacceptable for long-term maintenance
- **Monolithic package:** All functionality in one big package — loses clarity of API-specific packages
- **External npm package:** Overkill for this codebase, adds publish/version complexity
**Migration:**
New packages MUST use shared utilities. Existing packages should migrate opportunistically (when touching auth code anyway, switch to shared).
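For another of the duplicated utilities listed in the context, retry logic, a shared helper might be sketched like this (name and backoff values are illustrative, not the actual shared code):

```typescript
// Hypothetical shared retry helper: retry an async operation with
// exponential backoff, rethrowing the last error once attempts run out.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff: 250ms, 500ms, 1000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```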
---
### ADR-003: No ORM — Raw SQL with better-sqlite3
**Date:** 2024-04-02
**Status:** Accepted
**Context:**
Early iterations suggested using Prisma or TypeORM for database access. The schema is simple (4 tables: users, transactions, categories, rules). Most queries are straightforward CRUD or aggregations.
ORMs add:
- Build complexity (Prisma requires codegen)
- Migration complexity (Prisma's migration format vs raw SQL)
- Learning curve (Prisma's query API vs SQL)
- Bundle size (~50KB for Prisma client)
**Decision:**
Use `better-sqlite3` with raw parameterized SQL queries. No ORM.
```typescript
// Query example
const transactions = db.prepare(`
  SELECT * FROM transactions
  WHERE account_id = ? AND date >= ? AND date <= ?
  ORDER BY date DESC
`).all(accountId, startDate, endDate);
```

```sql
-- Migration example (migrations/001-initial-schema.sql)
CREATE TABLE transactions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  account_id INTEGER NOT NULL,
  date TEXT NOT NULL,
  amount REAL NOT NULL,
  ...
);
```
**Consequences:**
✅ Simple: SQL is explicit, no query builder magic
✅ Fast: better-sqlite3 is one of the fastest SQLite bindings
✅ No build step for database code
✅ Migrations are just SQL files (easy to review and version)
❌ No automatic type generation from schema
❌ Must manually write types (acceptable for 4 tables)
**Alternatives Considered:**
- **Prisma:** Overkill for this schema size, adds complexity
- **TypeORM:** Active Record pattern feels wrong for this use case
- **Knex:** Query builder without ORM — middle ground, but still adds abstraction
**Review after:**
If the schema grows to 15+ tables with complex relationships, revisit this decision. For now, raw SQL is the right fit.
---
### ADR-004: Monte Carlo in main thread, not Web Workers
**Date:** 2024-04-10
**Status:** Accepted
**Context:**
Monte Carlo simulation runs 1,000+ iterations. Each iteration:
- Samples random market returns
- Calculates portfolio balance year by year (30+ years)
- Tracks success/failure
The initial implementation ran in a Web Worker to keep the UI responsive. However:
- The simulation completes in under 2 seconds on modern hardware
- Worker setup (separate build, message passing) adds ~100 LOC of complexity
- User expectation: click "Run Simulation" and see results immediately
**Decision:**
Run Monte Carlo in the main thread. Show a loading spinner during execution.
```typescript
// Main thread, synchronous
function runMonteCarlo(profile: RetirementProfile, runs: number): SimulationResult {
  const results = [];
  for (let i = 0; i < runs; i++) {
    results.push(runSingleSimulation(profile));
  }
  return aggregateResults(results);
}

// UI — defer the blocking work one tick so the spinner can paint first
button.onclick = () => {
  showSpinner();
  setTimeout(() => {
    const result = runMonteCarlo(profile, 1000);
    hideSpinner();
    renderResults(result);
  }, 0);
};
```
**Consequences:**
✅ Simpler code (no Worker setup, no message passing)
✅ Easier to debug (single execution context)
✅ Fast enough (<2s) that blocking isn't a problem
❌ UI freezes for ~1.5 seconds during simulation
❌ Can't run multiple simulations in parallel
**Alternatives Considered:**
- **Web Worker:** Complexity not justified for 1.5s task
- **WebAssembly:** Overkill, JS is fast enough
- **Async/await with chunking:** Adds complexity, still blocks event loop
**Review after:**
If simulation time grows to >5 seconds (e.g., 10,000 runs, more complex models), reconsider Web Workers.
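For context, one iteration of the loop might be sketched as follows — `RetirementProfile` and `runSingleSimulation` appear in the example above, but these internals (return distribution, success criterion) are purely illustrative:

```typescript
// One Monte Carlo iteration (illustrative): sample a yearly return,
// compound the balance minus withdrawals, report whether money lasted.
interface RetirementProfile {
  startingBalance: number;
  annualWithdrawal: number;
  years: number;
}

// Normal-ish annual return via Box-Muller; mean 7%, stdev 12% (made up)
function sampleAnnualReturn(): number {
  const u = Math.random();
  const v = Math.random();
  const z = Math.sqrt(-2 * Math.log(1 - u)) * Math.cos(2 * Math.PI * v);
  return 0.07 + 0.12 * z;
}

function runSingleSimulation(profile: RetirementProfile): boolean {
  let balance = profile.startingBalance;
  for (let year = 0; year < profile.years; year++) {
    balance = balance * (1 + sampleAnnualReturn()) - profile.annualWithdrawal;
    if (balance <= 0) return false; // ran out of money
  }
  return true; // portfolio survived the horizon
}
```

At 1,000 runs this is plain arithmetic in a tight loop, which is why it finishes in well under 2 seconds.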
---
## Tips for Writing Good ADRs
### 1. Write them when the decision is fresh
Don't wait until the project is done. Write the ADR immediately after you make the choice. You'll remember the reasoning.
### 2. Include what DIDN'T work
"We tried fetch — it sent headers that broke SpringCM" is more valuable than "We use curl." Future agents need to know the trap.
### 3. Be specific with code examples
Show the DO and DON'T patterns. Code is clearer than prose.
### 4. Include review triggers
"Review this after the schema grows to 15+ tables" gives future-you permission to change course.
### 5. Status field matters
- **Proposed:** Still discussing
- **Accepted:** This is the way
- **Deprecated:** Don't use this anymore (but still in codebase)
- **Superseded by ADR-XXX:** This approach was replaced
### 6. Keep them short
An ADR should fit on one screen. If it's a 5-page essay, you're over-explaining. Extract detail to separate docs and link them.
---
## Maintaining ADRs
### Add to AGENT.md Orient Phase
```markdown
## Orient
- Read PROJECT-SPEC.md
- Read IMPLEMENTATION_PLAN.md
- Read DECISIONS.md ← Add this
- Check git log --oneline -5
```
### Reference in Constraints
```markdown
## Constraints
- MUST follow architecture decisions documented in DECISIONS.md
- MUST NOT change approaches listed as "Accepted" without human approval
```
### Review Periodically
Every few weeks, scan DECISIONS.md:
- Are the decisions still valid?
- Should any be deprecated?
- Are agents following them?
---
_Decisions don't drift if they're written down._