agent-harness/PROJECT-SPEC.md

247 lines
6.2 KiB
Markdown

# Project Specification Template
> Copy this file into your project root as `PROJECT-SPEC.md`.
> Fill out each section. The more specific you are, the better the agent performs.
> This is the single source of truth — the agent reads it every iteration.
---
## 1. Project Overview
### What are we building?
<!-- One paragraph. What is it? Who is it for? What problem does it solve? -->
### Why does it matter?
<!-- Business context. Why build this now? What's the alternative? -->
### Success criteria
<!-- How do we know when we're done? Be specific and measurable. -->
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
---
## 2. Technical Foundation
### Tech stack
<!-- Be explicit. Don't let the agent choose unless you want it to. -->
- **Language:**
- **Framework:**
- **Database:**
- **Build system:**
- **Test framework:**
- **Package manager:**
### Project structure
<!-- Define the directory layout you want. -->
```
project/
├── src/
├── tests/
├── docs/
└── ...
```
### Build & test commands
<!-- The agent runs these every iteration to verify its work. -->
```bash
# Build
npm run build
# Test
npm test
# Lint
npm run lint
```
### Coding standards
<!-- Style, patterns, conventions the agent should follow. -->
-
-
-
---
## 3. Requirements
### Functional Requirements
> List every feature the system must have. Each should be independently testable.
> Use the format: FR-NNN: Description — Acceptance criteria
#### FR-001: [Feature Name]
**Description:** What it does.
**Acceptance criteria:**
- [ ] Given X, when Y, then Z
- [ ] Edge case: ...
- [ ] Error case: ...
#### FR-002: [Feature Name]
**Description:** What it does.
**Acceptance criteria:**
- [ ] ...
<!-- Add as many as needed -->
### Non-Functional Requirements
#### NFR-001: Performance
- [ ] Response time < X ms for Y operation
- [ ] Handles Z concurrent users
#### NFR-002: Security
- [ ] No secrets in source code
- [ ] Input validation on all user inputs
#### NFR-003: Testing
- [ ] Minimum 80% code coverage
- [ ] All happy paths tested
- [ ] All error paths tested
---
## 4. Data Model
### Entities
<!-- Define your data structures. Be precise about types and constraints. -->
```
Entity: User
- id: UUID (primary key)
- name: string (required, max 100 chars)
- email: string (required, unique, valid email)
- created_at: timestamp (auto-set)
```
### Relationships
<!-- How entities relate to each other. -->
### Sample Data
<!-- Provide example data the agent can use for testing. -->
---
## 5. API / Interface Design
### Endpoints / Commands / UI Screens
<!-- Define what the user interacts with. -->
### Input/Output Examples
<!-- Show exact examples of what goes in and what comes out. -->
---
## 6. Architecture Decisions
### Constraints
<!-- Things the agent MUST or MUST NOT do. -->
> Four categories of constraints. Fill all four — even if one is just "none in this category."
> The ESCALATE category is critical for autonomous agents: it defines when to stop and ask rather than decide.
- **MUST:** ... (non-negotiable requirements)
- **MUST NOT:** ... (explicit prohibitions — prevents entire failure categories)
- **PREFER:** ... (soft preferences when multiple valid approaches exist)
- **ESCALATE:** ... (conditions where the agent must stop and ask the human rather than decide autonomously)
**Example ESCALATE entries:**
- If a task requires deleting production data → ask first
- If acceptance criteria are ambiguous and the spec doesn't resolve it → ask before implementing
- If a dependency needs to be added that wasn't in the tech stack → confirm before installing
- If the task conflicts with another requirement or constraint → flag the conflict and wait
- If the agent discovers a requirement gap not covered by any FR-NNN → don't fill the gap silently
### Dependencies
<!-- External services, APIs, libraries the project uses. -->
### Known Challenges
<!-- Things that might be tricky. Give the agent a heads-up. -->
---
## 7. Phasing (Optional)
> If the project is large, break it into phases. Each phase should be
> independently deployable and testable.
### Phase 1: Foundation
- [ ] Task A
- [ ] Task B
### Phase 2: Core Features
- [ ] Task C
- [ ] Task D
### Phase 3: Polish
- [ ] Task E
- [ ] Task F
---
## 8. Reference Materials
### External docs
<!-- Links to API docs, design systems, or other references. -->
### Existing code to learn from
<!-- Point at existing code the agent should study for patterns. -->
### Anti-patterns
<!-- Things you've tried that didn't work. Save the agent from repeating mistakes. -->
---
## 9. Evaluation Design
> How do you know the agent's output is good? Not "does it look reasonable?"
> but "can you prove it's correct, measurably and consistently?"
>
> Evaluation design is the only thing standing between AI output you can use
> as-is and AI output that requires 40 minutes of cleanup. Build evals before
> you need them — especially before model updates that could silently change behavior.
### Test cases
<!-- 3-5 test cases with known-good outputs. Run these after every model update. -->
#### TC-001: [Test Name]
**Input:** What you feed to the agent
**Expected output:** What you should get back (specific, verifiable)
**Verification:** How to check programmatically (command, assertion, manual checklist)
#### TC-002: [Test Name]
**Input:** ...
**Expected output:** ...
**Verification:** ...
### Verification steps
<!-- Commands or checklists to verify the output meets acceptance criteria. -->
```bash
# Example: verify the build passes
npm test
# Example: check that specific files exist
ls -la src/new-feature/
# Example: verify no regressions
npm run lint
```
### Run-after-update checklist
<!-- Steps to re-verify after any model or dependency update. -->
- [ ] Run all test cases (TC-001 through TC-00N)
- [ ] Compare outputs to expected baselines
- [ ] Check for regression in failure modes (see anti-patterns)
- [ ] Verify ESCALATE triggers still work correctly
### Regression baselines
<!-- Save representative outputs here so you can compare after updates. -->
<!-- Include at least one happy path, one edge case, and one failure case. -->
```json
{
"tc_001_baseline": "...",
"tc_002_baseline": "..."
}
```