agent-harness/PROJECT-SPEC.md

# Project Specification Template

> Copy this file into your project root as `PROJECT-SPEC.md`.
> Fill out each section. The more specific you are, the better the agent performs.
> This is the single source of truth — the agent reads it every iteration.

---

## 1. Project Overview

### What are we building?
<!-- One paragraph. What is it? Who is it for? What problem does it solve? -->

### Why does it matter?
<!-- Business context. Why build this now? What's the alternative? -->

### Success criteria
<!-- How do we know when we're done? Be specific and measurable. -->
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3

---

## 2. Technical Foundation

### Tech stack
<!-- Be explicit. Don't let the agent choose unless you want it to. -->
- **Language:**
- **Framework:**
- **Database:**
- **Build system:**
- **Test framework:**
- **Package manager:**

### Project structure
<!-- Define the directory layout you want. -->
```
project/
├── src/
├── tests/
├── docs/
└── ...
```

### Build & test commands
<!-- The agent runs these every iteration to verify its work. -->
```bash
# Build
npm run build

# Test
npm test

# Lint
npm run lint
```

### Coding standards
<!-- Style, patterns, conventions the agent should follow. -->
-
-
-

---

## 3. Requirements

### Functional Requirements

> List every feature the system must have. Each should be independently testable.
> Use the format: FR-NNN: Description — Acceptance criteria

#### FR-001: [Feature Name]
**Description:** What it does.
**Acceptance criteria:**
- [ ] Given X, when Y, then Z
- [ ] Edge case: ...
- [ ] Error case: ...

#### FR-002: [Feature Name]
**Description:** What it does.
**Acceptance criteria:**
- [ ] ...

<!-- Add as many as needed -->

### Non-Functional Requirements

#### NFR-001: Performance
- [ ] Response time < X ms for Y operation
- [ ] Handles Z concurrent users

#### NFR-002: Security
- [ ] No secrets in source code
- [ ] Input validation on all user inputs

#### NFR-003: Testing
- [ ] Minimum 80% code coverage
- [ ] All happy paths tested
- [ ] All error paths tested

---

## 4. Data Model

### Entities
<!-- Define your data structures. Be precise about types and constraints. -->

```
Entity: User
  - id: UUID (primary key)
  - name: string (required, max 100 chars)
  - email: string (required, unique, valid email)
  - created_at: timestamp (auto-set)
```

### Relationships
<!-- How entities relate to each other. -->

### Sample Data
<!-- Provide example data the agent can use for testing. -->

---

## 5. API / Interface Design

### Endpoints / Commands / UI Screens
<!-- Define what the user interacts with. -->

### Input/Output Examples
<!-- Show exact examples of what goes in and what comes out. -->

---

## 6. Architecture Decisions

### Constraints
<!-- Things the agent MUST or MUST NOT do. -->

> Four categories of constraints. Fill all four — even if one is just "none in this category."
> The ESCALATE category is critical for autonomous agents: it defines when to stop and ask rather than decide.

- **MUST:** ... (non-negotiable requirements)
- **MUST NOT:** ... (explicit prohibitions — prevents entire failure categories)
- **PREFER:** ... (soft preferences when multiple valid approaches exist)
- **ESCALATE:** ... (conditions where the agent must stop and ask the human rather than decide autonomously)

**Example ESCALATE entries:**
- If a task requires deleting production data → ask first
- If acceptance criteria are ambiguous and the spec doesn't resolve it → ask before implementing
- If a dependency needs to be added that wasn't in the tech stack → confirm before installing
- If the task conflicts with another requirement or constraint → flag the conflict and wait
- If the agent discovers a requirement gap not covered by any FR-NNN → don't fill the gap silently

### Dependencies
<!-- External services, APIs, libraries the project uses. -->

### Known Challenges
<!-- Things that might be tricky. Give the agent a heads-up. -->

---

## 7. Phasing (Optional)

> If the project is large, break it into phases. Each phase should be
> independently deployable and testable.

### Phase 1: Foundation
- [ ] Task A
- [ ] Task B

### Phase 2: Core Features
- [ ] Task C
- [ ] Task D

### Phase 3: Polish
- [ ] Task E
- [ ] Task F

---

## 8. Reference Materials

### External docs
<!-- Links to API docs, design systems, or other references. -->

### Existing code to learn from
<!-- Point at existing code the agent should study for patterns. -->

### Anti-patterns
<!-- Things you've tried that didn't work. Save the agent from repeating mistakes. -->

---

## 9. Evaluation Design

> How do you know the agent's output is good? Not "does it look reasonable?"
> but "can you prove it's correct, measurably and consistently?"
>
> Evaluation design is the only thing standing between AI output you can use
> as-is and AI output that requires 40 minutes of cleanup. Build evals before
> you need them — especially before model updates that could silently change behavior.

### Test cases
<!-- 3-5 test cases with known-good outputs. Run these after every model update. -->

#### TC-001: [Test Name]
**Input:** What you feed to the agent
**Expected output:** What you should get back (specific, verifiable)
**Verification:** How to check programmatically (command, assertion, manual checklist)

#### TC-002: [Test Name]
**Input:** ...
**Expected output:** ...
**Verification:** ...

### Verification steps
<!-- Commands or checklists to verify the output meets acceptance criteria. -->
```bash
# Example: verify the build passes
npm test

# Example: check that specific files exist
ls -la src/new-feature/

# Example: verify no regressions
npm run lint
```

### Run-after-update checklist
<!-- Steps to re-verify after any model or dependency update. -->
- [ ] Run all test cases (TC-001 through TC-00N)
- [ ] Compare outputs to expected baselines
- [ ] Check for regression in failure modes (see anti-patterns)
- [ ] Verify ESCALATE triggers still work correctly

### Regression baselines
<!-- Save representative outputs here so you can compare after updates. -->
<!-- Include at least one happy path, one edge case, and one failure case. -->
```json
{
  "tc_001_baseline": "...",
  "tc_002_baseline": "..."
}
```