# Spec Creation Guide — The Interview Protocol

> The spec is the most important document in the entire harness.
> A bad spec produces bad code, no matter how good the agent is.
> This guide teaches you how to create a great spec through structured conversation.

---

## Why Interview-Based Spec Creation?

Most agent harness guides say "write a good spec" and move on. That's like telling someone "just write good code." The spec requires **two kinds of knowledge**:

1. **Domain knowledge** — what the human knows (goals, constraints, edge cases, things they've tried)
2. **Technical knowledge** — what the agent/engineer knows (architecture patterns, tooling, testing strategies)

Neither side has the full picture. The interview process brings them together.

### The Anti-Pattern: Agent-Written Specs

If you ask an agent to "analyze this codebase and write a spec," you get a description of **what exists**, not a plan for **what should exist**. The agent can't know:

- Why you're building this
- What you've tried that didn't work
- What tradeoffs you're willing to make
- What "done" looks like to you

### The Anti-Pattern: Human-Only Specs

If the human writes the spec alone, you get:

- Vague acceptance criteria ("it should be fast")
- Missing technical details (no build commands, no test strategy)
- Implied knowledge that the agent can't access
- Gaps where the human assumed things were obvious

---

## The Interview Protocol

### Phase 1: Vision & Context (5-10 minutes)

Start broad. Understand the "why" before the "what."

**Questions to ask:**

1. **"What are we building, in one sentence?"**
   - Forces clarity. If they can't say it in one sentence, the scope isn't clear yet.
   - Good: "A CLI toolkit for interacting with DocuSign APIs without a proxy server."
   - Bad: "Something to help with DocuSign stuff."
2. **"Who is this for?"**
   - The user? Other developers? An automated system?
   - The answer shapes API design, error messages, and documentation needs.
3. **"Why now? What's the trigger?"**
   - Understanding urgency and motivation reveals hidden requirements.
   - "I'm tired of copying tokens manually" → auto-refresh is a core requirement, not a nice-to-have.
4. **"What does 'done' look like? How will you know it's working?"**
   - Push for measurable criteria, not feelings.
   - "It works" → "I can run `cli auth` and get a valid token without opening a browser."
5. **"What have you tried before? What didn't work?"**
   - This is gold: anti-patterns save agents hours of wasted effort.
   - "Node.js fetch sends headers that break SpringCM" → use curl instead.

**What you're listening for:**

- Unstated assumptions ("obviously it needs to...")
- Emotional language (frustration = high-priority requirement)
- Scope creep indicators ("and eventually it could also...")

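Answers to questions 4 and 5 translate directly into spec sections. A sketch of what that capture might look like, reusing the examples above (the project details are illustrative):

```markdown
## Success Criteria
- Running `cli auth` prints a valid access token without opening a browser.
- A repeat run reuses the cached token instead of re-authenticating.

## Anti-Patterns
- Do not use Node.js `fetch` for SpringCM calls; it sends headers the API
  rejects. Use `curl` instead.
```

Note that the frustration ("tired of copying tokens manually") became a testable criterion, and the failed experiment became an explicit prohibition.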
---

### Phase 2: Requirements Extraction (10-15 minutes)

Now go feature by feature. For each feature, run **the requirement loop:**

1. **"Walk me through how you'd use this feature."**
   - Get the happy path first: a concrete scenario, not an abstract description.
   - "I'd run `cli templates list` and see my 20 most recent templates with names and IDs."
2. **"What could go wrong?"**
   - Error cases, edge cases, permissions issues.
   - "The token could be expired." → auto-refresh requirement.
   - "The account might not have that API enabled." → graceful error message.
3. **"What's the input? What's the output?"**
   - Be specific about formats, fields, and defaults.
   - "Input: template ID. Output: JSON with name, ID, folder, page count, created date."
4. **"How would you test this?"**
   - If they can describe a test, you have an acceptance criterion.
   - "I'd run it and check that I get at least one template back with a valid ID."
5. **"Is this a must-have or nice-to-have?"**
   - Prioritization prevents scope explosion.
   - Phase 1 = must-haves. Phase 2+ = nice-to-haves.

**Pro tip:** Number requirements as you go (FR-001, FR-002, ...). This creates a shared language for the rest of the project.

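Captured in the spec, one pass through the loop might produce an entry like this (the feature, ID, and numbers are hypothetical):

```markdown
### FR-003: List templates

**Description:** `cli templates list` prints the user's most recent templates.

**Input:** none (uses the stored auth token).
**Output:** up to 20 templates with name, template ID, and created date.

**Acceptance criteria:**
- [ ] `cli templates list` exits with code 0.
- [ ] At least one template is returned, and every row has a non-empty ID.
- [ ] An expired token is refreshed automatically before the request.

**Priority:** must-have (Phase 1)
```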
---

### Phase 3: Technical Discovery (10-15 minutes)

This is where the engineer/agent fills in what the human might not think to specify.

**Questions to explore together:**

1. **Tech stack confirmation**
   - "You're using TypeScript with npm workspaces — should we keep that pattern?"
   - Don't assume. The human might want to change direction.
2. **Existing code patterns**
   - Read the codebase. Identify patterns already in use.
   - "I see you're using Commander.js for CLI parsing — should all packages follow that?"
   - "Your auth module uses JWT with RSA keys — should new packages share that?"
3. **Build and test infrastructure**
   - "What are the build commands? What test framework? What's the CI/CD setup?"
   - If there's no test framework, adding one is a Phase 0 task.
4. **Data model and persistence**
   - "Where does data live? Files? Database? Environment variables?"
   - "How do packages share configuration?" (e.g., a monorepo root `.env`)
5. **Deployment and environment**
   - "Is this demo-only, or does it need production support?"
   - "What environments exist?" (demo, staging, production)
6. **Dependencies and external services**
   - "What APIs are involved? What are their quirks?"
   - "Any rate limits, authentication requirements, or known issues?"

**What the engineer contributes:**

- Suggest architecture patterns the human hasn't considered
- Identify missing infrastructure (test framework, linting, CI)
- Spot potential issues early (circular dependencies, shared state)
- Propose phasing based on technical dependencies

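The answers to question 3 should land in the spec as literal, copy-pasteable commands. A sketch, assuming an npm-workspaces monorepo (the script names are illustrative; copy the real ones from `package.json`):

```markdown
## Build & Test Commands

- Install: `npm install` (installs all workspace dependencies)
- Build: `npm run build` (compiles every package)
- Test: `npm test` (must pass before any phase is marked done)
```

Verify each command actually works before the spec ships; a broken build command stalls the agent at step one.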
---

### Phase 4: Constraint Mapping (5 minutes)

Explicitly capture the guardrails.

**Three categories:**

1. **MUST** — non-negotiable requirements
   - "MUST use curl for HTTP calls (fetch breaks SpringCM)"
   - "MUST store tokens in the `.env` file"
2. **MUST NOT** — explicit prohibitions
   - "MUST NOT commit secrets to git"
   - "MUST NOT use React for the frontend"
3. **PREFER** — soft preferences
   - "PREFER ES modules over CommonJS"
   - "PREFER shared utilities over code duplication"

**Why this matters:** Agents follow explicit constraints better than implied ones. A single MUST NOT prevents an entire category of mistakes. (A fourth category, ESCALATE, comes out of the self-containment test in Phase 6.)

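In the assembled spec this becomes one short, scannable block (the entries are illustrative, drawn from the examples above):

```markdown
## Constraints

- MUST: use curl for HTTP calls (fetch breaks SpringCM)
- MUST NOT: commit secrets to git
- PREFER: ES modules over CommonJS
- ESCALATE: if anything is ambiguous, stop and ask the human
```

Keeping all categories in one place lets the agent re-read the full guardrail set before every risky decision.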
---

### Phase 5: Spec Assembly (Agent's job)

After the interview, the agent assembles the spec:

1. **Fill in the PROJECT-SPEC.md template** with interview answers
2. **Add technical details** discovered from code review
3. **Write acceptance criteria** from the requirement conversations
4. **Propose phasing** based on dependencies
5. **Include anti-patterns** from the "what didn't work" answers
6. **Present to the human for review**

**The review conversation:**

- Read through each section together
- The human corrects misunderstandings
- The agent asks clarifying questions about gaps
- Iterate until the human says "yes, that's what I want"

---

### Phase 6: Self-Containment Test (5 minutes)

> **The critical test:** Can the spec be solved without the agent needing
> to fetch information not included in it?
>
> This is Tobi Lütke's insight: *Can you state a problem with enough
> context that the task is plausibly solvable without the agent going
> out and getting more information?*

**The test — read the spec as if:**

1. The reader has never seen your project before
2. The reader doesn't know your coding conventions or style
3. The reader has no access to information you don't include
4. The reader will stop and do nothing if anything is ambiguous

**The checklist:**

- [ ] Every acronym is defined on first use
- [ ] Referenced file paths actually exist and are correct
- [ ] External dependencies have pinned versions or install instructions
- [ ] Domain-specific terms are explained (not everyone knows what "JWT" or "FTS" means)
- [ ] The agent can find all referenced files without searching
- [ ] If removing any sentence would cause the agent to make mistakes, the spec isn't self-contained yet

**The failure mode this catches:** Agents fill gaps with statistical plausibility — they guess in ways that are often subtly wrong. A spec that relies on shared context (even 5 minutes of prior conversation) will produce outputs that look right but aren't.

**If the spec fails the test:** Add the missing context. If you can't (there's too much to document), add an ESCALATE constraint: "If you encounter information not covered by this spec, do not assume — ask the human."

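Parts of the checklist can be spot-checked mechanically. A minimal sketch of the "referenced file paths actually exist" item, assuming paths appear in backticks with a file extension (the regex and helper name are illustrative, not a standard tool):

```typescript
// Toy self-containment check: extract backtick-quoted paths from the
// spec text and report any that don't exist on disk.
import * as fs from "fs";
import * as path from "path";

function findMissingPaths(spec: string, root = "."): string[] {
  // Matches things like `package.json` or `config/default.yaml`.
  const refs = [...spec.matchAll(/`([\w./-]+\.[A-Za-z]+)`/g)].map((m) => m[1]);
  return refs.filter((p) => !fs.existsSync(path.join(root, p)));
}

const example = "Config lives in `config/default.json`; see `README.md`.";
console.log(findMissingPaths(example)); // lists whichever references are absent
```

This only catches dead references, not missing context, but it is a cheap first pass before the human review.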
---

## Spec Quality Checklist

Before handing a spec to agents, verify:

### Completeness

- [ ] Every feature has numbered acceptance criteria (FR-NNN)
- [ ] Data model is defined with types and constraints
- [ ] Build and test commands are specified and work
- [ ] Anti-patterns section exists, with real examples
- [ ] Phasing is defined, with dependencies noted
- [ ] All four constraint categories are filled (MUST / MUST NOT / PREFER / ESCALATE)
- [ ] Evaluation design section exists, with test cases and verification steps

### Clarity

- [ ] A stranger could read this and understand what to build
- [ ] No ambiguous words ("fast", "nice", "good") — use numbers
- [ ] Input/output examples exist for key operations
- [ ] Error cases are explicitly described

### Testability

- [ ] Every acceptance criterion can be verified by running code
- [ ] Sample data or fixtures are provided
- [ ] Performance criteria have specific thresholds
- [ ] "Done" is objectively measurable

### Feasibility

- [ ] Tech stack is proven for this type of project
- [ ] External dependencies are accessible (API keys, permissions)
- [ ] Scope fits the timeline (phasing handles overflow)
- [ ] Known challenges are documented with mitigation strategies

### Self-Containment

- [ ] A stranger could solve this without asking follow-up questions
- [ ] No domain-specific terms are used without definition
- [ ] All file paths, commands, and references are correct
- [ ] ESCALATE constraints cover situations where the spec is ambiguous

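Some clarity checks are automatable too. A minimal sketch of the "no ambiguous words" item (the word list is illustrative, not exhaustive; extend it with your own recurring offenders):

```typescript
// Toy vagueness check: flag words in a spec that should be replaced
// with concrete numbers or observable behavior.
const AMBIGUOUS = ["fast", "nice", "good", "simple", "easy", "robust"];

function findVagueWords(spec: string): string[] {
  const words = new Set(spec.toLowerCase().match(/[a-z]+/g) ?? []);
  return AMBIGUOUS.filter((w) => words.has(w));
}

console.log(findVagueWords("The CLI should be fast and easy to use."));
// flags "fast" and "easy"; "responds in under 200 ms" would pass clean
```

Each flagged word is a prompt to go back to the human and ask for a number or a testable behavior.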
---

## Common Interview Mistakes

### 1. Leading the witness

**Bad:** "You probably want auto-refresh, right?"

**Good:** "What happens when the token expires mid-session?"

### 2. Accepting vague answers

**Bad:** Human: "It should handle errors well." Agent: "Got it."

**Good:** "Can you give me an example of an error? What should the user see?"

### 3. Skipping the 'why'

**Bad:** Jumping straight to features.

**Good:** Understanding context first — it changes how you interpret every requirement.

### 4. Over-engineering the spec

**Bad:** A 50-page spec with UML diagrams for a CLI tool.

**Good:** Enough detail for an agent to work autonomously, and no more.

### 5. Forgetting anti-patterns

**Bad:** Only describing what TO do.

**Good:** Explicitly listing what NOT to do — it saves agents from repeating your mistakes.

---

## Template: Interview Notes

Use this to capture notes during the interview, before assembling the spec:

```markdown
# Interview Notes — [Project Name]

**Date:** YYYY-MM-DD
**Participants:** [Human], [Agent]

## Vision
- One-liner:
- Target user:
- Trigger/motivation:
- Success criteria:

## Features (raw notes)
1. Feature name — description — happy path — error cases — priority
2. ...

## Technical Context
- Existing stack:
- Patterns to follow:
- Patterns to avoid:
- Build/test commands:

## Constraints
- MUST:
- MUST NOT:
- PREFER:
- ESCALATE:

## Anti-patterns (things that didn't work)
1.
2.

## Open Questions
1.
2.
```

---

_This guide is a living document. Update it as you learn what works._