
# Implementation Plan — DonkeyCar RL Autoresearch
> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.
---
## ✅ Wave 1: Real Training Foundation — COMPLETE
All tasks done. Phase 1 champion achieved genuine forward driving.
## ✅ Wave 2: Track Completion — COMPLETE
All top 3 Phase 2 models complete the full track.
Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps.
Driving style: Right lane, very stable. Completes full track in ~2874 steps.
Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).
---
## Wave 3: Behavioral Control & Speed Optimization
**Goal:** Control driving style (lane, oscillation), measure lap time, optimize for speed.
**Gate:** Phase 2 champion completes full track (DONE ✅).
**Status:** 🟠 In progress
### Stream 3A: Enhanced Evaluator + Metrics
- [x] **3A-01** — Update champion to Phase 2 Trial 20
- [ ] **3A-02** — Add lap time measurement to evaluate_champion.py
- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode)
- [ ] **3A-04** — Add lane position histogram (distribution of CTE values)
- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl`
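The 3A metrics can be prototyped without the simulator. A minimal sketch, assuming steering actions and CTE values are collected per eval episode; the function names, histogram bin width, and JSONL record shape are assumptions, not the existing evaluate_champion.py API:

```python
import json
import statistics
from collections import Counter

def episode_metrics(steerings, ctes, bin_width=0.5):
    """Summarise one eval episode: oscillation (3A-03) is the population
    std of steering actions; lane position (3A-04) is a histogram of CTE
    values bucketed to bin_width (sim units)."""
    hist = Counter(round(c / bin_width) * bin_width for c in ctes)
    return {
        "steering_std": statistics.pstdev(steerings),
        "cte_histogram": dict(sorted(hist.items())),
    }

def append_summary(path, record):
    """3A-05: append one eval record as a JSON line to eval_summary.jsonl."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A low steering_std for Trial 20 would quantify the "very stable" driving style noted for the Wave 2 champion.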
### Stream 3B: Behavioral Reward Variants
- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with configurable target CTE offset
- [ ] **3B-02** — `AntiOscillationWrapper`: adds penalty for rapid steering changes (smoothness reward)
- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more (enforces right-lane rule)
- [ ] **3B-04** — Tests for all three wrappers (no simulator required)
- [ ] **3B-05** — Integrate wrapper selection into autoresearch_controller.py via `--behavior` flag
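The three 3B reward shapes reduce to pure functions, which is what makes the simulator-free tests of 3B-04 possible; the real wrappers would call these from `step()`. A sketch, where the default constants and the "negative CTE = left of center" sign convention are assumptions:

```python
def lane_position_reward(cte, target_cte=0.0, max_cte=8.0):
    """3B-01: 1.0 at target_cte, falling linearly to 0.0 at max_cte away."""
    return max(0.0, 1.0 - abs(cte - target_cte) / max_cte)

def anti_oscillation_penalty(steer, prev_steer, weight=0.5):
    """3B-02: penalty proportional to the steering change since last step."""
    return -weight * abs(steer - prev_steer)

def asymmetric_cte_reward(cte, max_cte=8.0, left_weight=2.0):
    """3B-03: left-of-center (cte < 0, assumed convention) is penalised
    left_weight times harder, nudging the policy into the right lane."""
    w = left_weight if cte < 0 else 1.0
    return max(0.0, 1.0 - w * abs(cte) / max_cte)
```

Keeping the shapes pure also makes the `--behavior` flag of 3B-05 a simple dispatch table from flag value to function.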
### Stream 3C: Speed Optimization
- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from sim info dict
- [ ] **3C-02** — Update reward to incorporate lap time: add a `lap_bonus` to the reward on lap completion
- [ ] **3C-03** — Run targeted autoresearch starting from Phase 2 champion checkpoint
- [ ] **3C-04** — Fine-tuning: load Phase 2 champion weights, continue training with speed reward
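3C-01/3C-02 hinge on detecting a finished lap from `last_lap_time` in the sim info dict. One way to sketch it, assuming the key holds the most recent lap's duration in seconds and changes when a new lap completes; `base_bonus` and `target_time` are made-up tuning constants:

```python
def lap_time_bonus(info, prev_lap_time, base_bonus=100.0, target_time=60.0):
    """Return (bonus, new_prev_lap_time). A new lap is detected when
    last_lap_time is nonzero and differs from the previously seen value;
    the bonus scales inversely with lap time, so faster laps pay more."""
    lap_time = info.get("last_lap_time", 0.0)
    if lap_time and lap_time != prev_lap_time:
        return base_bonus * target_time / lap_time, lap_time
    return 0.0, prev_lap_time
```

Carrying `prev_lap_time` between steps avoids paying the bonus on every step after the lap completes.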
### Stream 3D: Multi-Track Generalization
- [ ] **3D-01** — Evaluate champion on 2nd track (e.g., `donkey-mountain-track-v0`)
- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks
- [ ] **3D-03** — Measure generalization gap (train_track vs unseen_track reward)
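For 3D-02/3D-03, deterministic alternation and the gap metric are small enough to pin down now. A sketch, where the first track id is an assumed placeholder for the current training track (only `donkey-mountain-track-v0` is named in this plan):

```python
def pick_track(episode_idx,
               tracks=("donkey-generated-roads-v0", "donkey-mountain-track-v0")):
    """3D-02: alternate between tracks round-robin, one per episode."""
    return tracks[episode_idx % len(tracks)]

def generalization_gap(train_rewards, unseen_rewards):
    """3D-03: mean reward on the training track minus mean reward on the
    unseen track; smaller gap = better generalization."""
    return (sum(train_rewards) / len(train_rewards)
            - sum(unseen_rewards) / len(unseen_rewards))
```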
---
## Wave 4: Racing (future)
**Goal:** Fastest possible lap on any track.
**Gate:** Wave 3 complete. Multi-track generalization proven.
**Status:** ⏸️ Not started
- [ ] **4-01** — Pure lap time reward (replace CTE-based reward with time-based)
- [ ] **4-02** — Head-to-head: autoresearch champion vs human-tuned config
- [ ] **4-03** — Research paper / writeup structure
---
## Notes
- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- **CTE symmetry:** reward is symmetric → car picks left or right based on random NN init
- **Track ends!** The track has a physical finish — runs end on track completion, not timeout
- **Reward v4 (base × efficiency × speed):** Successfully eliminated all circular driving exploits
- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2)
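The "base × efficiency × speed" note above explains why v4 killed the circular-driving exploits: with multiplicative factors each in [0, 1], zeroing any one factor zeroes the whole reward, so spinning in place (speed factor ≈ 0 forward progress) earns nothing. A sketch of that shape — the individual factor definitions and constants here are assumptions, not the project's actual v4 code:

```python
def reward_v4(cte, speed, steering, max_cte=8.0, max_speed=10.0,
              steer_weight=0.3):
    """Multiplicative reward: each factor lies in [0, 1], so an exploit
    that maximises one factor while zeroing another scores zero overall."""
    base = max(0.0, 1.0 - abs(cte) / max_cte)                  # stay on line
    efficiency = max(0.0, 1.0 - steer_weight * abs(steering))  # smooth steering
    speed_term = min(max(speed, 0.0), max_speed) / max_speed   # forward progress
    return base * efficiency * speed_term
```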