# Implementation Plan — DonkeyCar RL Autoresearch
> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.
---
## Wave 1: Real Training Foundation
**Goal:** Make the inner loop actually train and save models. Produce a real champion model.
**Gate:** champion model achieves mean_reward > 100 on training track.
**Status:** 🟠 In progress
### Stream 1A: Core Runner Rebuild
- [ ] **1A-01** — Rebuild `donkeycar_sb3_runner.py` with real PPO training (`model.learn()`), model save, and proper evaluation (`evaluate_policy()`)
- [ ] **1A-02** — Add `SpeedRewardWrapper` — reward = `speed * (1 - abs(cte)/max_cte)`; add `--reward-shaping` flag
- [ ] **1A-03** — Add champion model tracking — write `champion_manifest.json` when new best is found
- [ ] **1A-04** — Fix autoresearch controller to pass `learning_rate`, `save_dir`, `reward_shaping` args to runner
### Stream 1B: Tests
- [ ] **1B-01** — Write `tests/test_discretize_action.py` — action encoding, decoding, round-trip
- [ ] **1B-02** — Write `tests/test_autoresearch_controller.py` — GP fit, UCB computation, param round-trip, champion tracking
- [ ] **1B-03** — Write `tests/test_runner_integration.py` — mocked sim, training + save + eval cycle
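The round-trip property in 1B-01 might be checked as below. The function names, grid sizes, and index layout here are hypothetical placeholders, not the real module's API; the point is that every discrete index must survive decode → encode unchanged:

```python
def discretize_action(steer, throttle, n_steer=5, n_throttle=3):
    """Map continuous (steer in [-1, 1], throttle in [0, 1]) to a flat
    index. Hypothetical sketch; the real encoding may differ."""
    si = int(round((steer + 1.0) / 2.0 * (n_steer - 1)))
    ti = int(round(throttle * (n_throttle - 1)))
    return si * n_throttle + ti


def undiscretize_action(index, n_steer=5, n_throttle=3):
    """Inverse mapping: flat index back to (steer, throttle)."""
    si, ti = divmod(index, n_throttle)
    return si / (n_steer - 1) * 2.0 - 1.0, ti / (n_throttle - 1)


def test_round_trip():
    # Decode each index, re-encode, and require the same index back.
    for idx in range(5 * 3):
        steer, throttle = undiscretize_action(idx)
        assert discretize_action(steer, throttle) == idx
```

A pytest-style test like this needs no simulator, consistent with the note that 1B-* tasks run sim-free.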
### Stream 1C: First Real Autoresearch Run
- [ ] **1C-01** — Run 50-trial autoresearch with real PPO training; verify models saved
- [ ] **1C-02** — Save regression baseline: `champion_reward_phase1.txt`
- [ ] **1C-03** — Push all results and models to Gitea
- [ ] **1C-04** — Write Wave 1 process eval
---
## Wave 2: Multi-Track Generalization
**Goal:** Champion model drives any track with mean_reward > 50.
**Gate:** Wave 1 champion achieves mean_reward > 100. Wave 1 process eval complete.
**Status:** ⏸️ Not started — blocked on Wave 1
- [ ] **2-01** — Write `evaluate_champion.py` — load champion model, evaluate on specified track
- [ ] **2-02** — Implement multi-track training curriculum (train on 2 tracks alternately)
- [ ] **2-03** — Add domain randomization wrapper (randomize road width, lighting)
- [ ] **2-04** — Implement convergence detection in autoresearch (stop when GP sigma collapses)
- [ ] **2-05** — Add automatic Gitea push every N trials
- [ ] **2-06** — Evaluate champion on unseen track; record generalization gap
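The convergence detection in 2-04 could be a simple heuristic over the GP's posterior uncertainty: stop once the maximum posterior sigma over the candidate grid stays small for several consecutive trials. The function name, threshold, and window below are illustrative assumptions:

```python
def gp_has_converged(max_sigmas, threshold=0.05, window=10):
    """Convergence heuristic for task 2-04.

    `max_sigmas` holds, per completed trial, the maximum GP posterior
    std dev over the candidate grid. Returns True once the last
    `window` entries are all below `threshold` (i.e. the GP is no
    longer uncertain anywhere, so UCB exploration has little to gain).
    Threshold and window are placeholder values to be tuned.
    """
    if len(max_sigmas) < window:
        return False
    return max(max_sigmas[-window:]) < threshold
```

The autoresearch loop would append one sigma per trial and break out early when this returns True, instead of always running the full trial budget.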
---
## Wave 3: Racing / Speed Optimization
**Goal:** Fastest possible lap times on any track.
**Gate:** Wave 2 champion generalizes to ≥1 unseen track (mean_reward > 50).
**Status:** ⏸️ Not started — blocked on Wave 2
- [ ] **3-01** — Implement lap time measurement and logging
- [ ] **3-02** — Tune reward function for pure speed (aggressive speed weight)
- [ ] **3-03** — Fine-tune from the champion checkpoint on new tracks
- [ ] **3-04** — Head-to-head: autoresearch champion vs human-tuned baseline
- [ ] **3-05** — Research writeup / report
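Lap time measurement (3-01) can be driven from the step `info` dict. The sketch below assumes the sim exposes a lap counter; the `lap_count` key is a placeholder, and an injectable clock keeps the class testable without a simulator:

```python
import time


class LapTimer:
    """Sketch for task 3-01: accumulate per-lap times during an episode.

    Assumes the sim reports a monotonically increasing lap counter in
    the step info dict; the `lap_count` key name is hypothetical.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lap_start = None
        self._last_lap = 0
        self.lap_times = []

    def update(self, info):
        lap = info.get("lap_count", 0)
        now = self._clock()
        if self._lap_start is None:
            self._lap_start = now  # first step: start the stopwatch
        if lap > self._last_lap:  # crossed the start/finish line
            self.lap_times.append(now - self._lap_start)
            self._lap_start = now
            self._last_lap = lap
```

Calling `update(info)` once per env step is enough; the collected `lap_times` can then be logged alongside the mean-reward metrics.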
---
## Completion Signals
The agent outputs one of these at the end of each iteration:
- `<promise>PLANNED</promise>` — just created/updated the plan, ready to implement
- `<promise>DONE</promise>` — all tasks in current wave complete
- `<promise>STUCK</promise>` — needs human input (see ESCALATION REQUIRED block if present)
- `<promise>ERROR</promise>` — unrecoverable error
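A supervising harness can extract these signals from the agent's output with a small parser. This is a sketch of how a consumer might do it (the function name is made up; the tag format is exactly the one listed above):

```python
import re

# The four signal tags defined in this plan.
PROMISE_RE = re.compile(r"<promise>(PLANNED|DONE|STUCK|ERROR)</promise>")


def parse_completion_signal(output):
    """Return the last completion signal in the agent's output, or None
    if the agent emitted no recognized <promise> tag this iteration."""
    matches = PROMISE_RE.findall(output)
    return matches[-1] if matches else None
```

Taking the last match guards against the agent quoting an earlier signal while reasoning before emitting its real one.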
---
## Notes
- **Random policy data (300 trials):** The existing `autoresearch_results.jsonl` contains rewards from random-action policy runs. These are valid for `n_steer`/`n_throttle` discretization insights but NOT for `learning_rate` optimization. Do not mix with Phase 1 real training results. Create a separate results file: `autoresearch_results_phase1.jsonl`.
- **Model storage:** Large CNN models (>100MB) should be excluded from git or use git LFS. Add `agent/models/**/*.zip` to .gitignore if needed, and document download location.
- **Simulator requirement:** All live training tasks (1C-*) require DonkeyCar sim running on port 9091. Tests (1B-*) do NOT require the simulator.
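The model-storage note above translates into a small `.gitignore` addition (the path is the one named in the note; whether to prefer this over git LFS is the repo owner's call):

```
# Keep large trained models (>100MB CNN checkpoints) out of git history;
# use git LFS or an external download location instead.
agent/models/**/*.zip
```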