# Implementation Plan — DonkeyCar RL Autoresearch

> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark tasks done immediately after commit.

---

## ✅ Wave 1: Real Training Foundation — COMPLETE

All tasks done. The Phase 1 champion achieved genuine forward driving.

## ✅ Wave 2: Track Completion — COMPLETE

All top 3 Phase 2 models complete the full track.
Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps.
Driving style: right lane, very stable. Completes the full track in ~2874 steps.
Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).

---

## Wave 3: Behavioral Control & Speed Optimization

**Goal:** Control driving style (lane choice, oscillation), measure lap time, optimize for speed.
**Gate:** Phase 2 champion completes the full track (DONE ✅).
**Status:** 🟠 In progress

### Stream 3A: Enhanced Evaluator + Metrics

- [x] **3A-01** — Update champion to Phase 2 Trial 20
- [ ] **3A-02** — Add lap-time measurement to evaluate_champion.py
- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode)
- [ ] **3A-04** — Add lane-position histogram (distribution of CTE values)
- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl`

### Stream 3B: Behavioral Reward Variants

- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with a configurable target CTE offset
- [ ] **3B-02** — `AntiOscillationWrapper`: adds a penalty for rapid steering changes (smoothness reward)
- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more heavily (enforces the right-lane rule)
- [ ] **3B-04** — Tests for all three wrappers (no simulator required)
- [ ] **3B-05** — Integrate wrapper selection into autoresearch_controller.py via a `--behavior` flag

### Stream 3C: Speed Optimization

- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from the sim info dict
- [ ] **3C-02** — Update reward to incorporate lap time: `reward += lap_bonus if lap_completed`
- [ ] **3C-03** — Run targeted autoresearch starting from the Phase 2 champion checkpoint
- [ ] **3C-04** — Fine-tuning: load Phase 2 champion weights, continue training with the speed reward

### Stream 3D: Multi-Track Generalization

- [ ] **3D-01** — Evaluate champion on a 2nd track (e.g., `donkey-mountain-track-v0`)
- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks
- [ ] **3D-03** — Measure generalization gap (train_track vs. unseen_track reward)

---

## Wave 4: Racing (future)

**Goal:** Fastest possible lap on any track.
**Gate:** Wave 3 complete; multi-track generalization proven.
**Status:** ⏸️ Not started

- [ ] **4-01** — Pure lap-time reward (replace the CTE-based reward with a time-based one)
- [ ] **4-02** — Head-to-head: autoresearch champion vs. human-tuned config
- [ ] **4-03** — Research paper / writeup structure

---

## Notes

- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- **CTE symmetry:** the reward is symmetric, so the car picks the left or right lane based on random NN initialization
- **Track ends!** The track has a physical finish — runs end on track completion, not timeout
- **Reward v4 (base × efficiency × speed):** successfully eliminated all circular-driving exploits
- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2)