# Implementation Plan — DonkeyCar RL Autoresearch
> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.

---
## ✅ Wave 1: Real Training Foundation — COMPLETE
All tasks done. Phase 1 champion achieved genuine forward driving.
## ✅ Wave 2: Track Completion — COMPLETE
All top 3 Phase 2 models complete the full track.

Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps.

Driving style: right lane, very stable. Completes the full track in ~2874 steps.

Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).

---
## Wave 3: Behavioral Control & Speed Optimization
**Goal:** Control driving style (lane, oscillation), measure lap time, optimize for speed.

**Gate:** Phase 2 champion completes the full track (DONE ✅).

**Status:** 🟠 In progress

### Stream 3A: Enhanced Evaluator + Metrics
- [x] **3A-01** — Update champion to Phase 2 Trial 20
- [ ] **3A-02** — Add lap time measurement to `evaluate_champion.py`
- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode)
- [ ] **3A-04** — Add lane position histogram (distribution of CTE values)
- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl`
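A minimal sketch of how 3A-02 through 3A-05 could fit together; the function names, summary fields, and 10-bin histogram are assumptions for illustration, not the existing `evaluate_champion.py` API:

```python
import json

import numpy as np


def summarize_episode(steering_actions, cte_values, lap_time_s=None):
    """Compute the Wave 3 eval metrics for one episode (sketch)."""
    return {
        # 3A-03: oscillation = std of steering actions over the episode
        "steering_std": float(np.std(steering_actions)),
        # 3A-04: lane position histogram over CTE values (10 bins assumed)
        "cte_hist": np.histogram(cte_values, bins=10)[0].tolist(),
        # 3A-02: lap time, if the sim reported one
        "lap_time_s": lap_time_s,
    }


def append_summary(summary, path="outerloop-results/eval_summary.jsonl"):
    # 3A-05: one JSON object per line, appended per evaluation run
    with open(path, "a") as f:
        f.write(json.dumps(summary) + "\n")
```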

### Stream 3B: Behavioral Reward Variants

- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with a configurable target CTE offset
- [ ] **3B-02** — `AntiOscillationWrapper`: adds a penalty for rapid steering changes (smoothness reward)
- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more heavily (enforces the right-lane rule)
- [ ] **3B-04** — Tests for all three wrappers (no simulator required)
- [ ] **3B-05** — Integrate wrapper selection into `autoresearch_controller.py` via a `--behavior` flag
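One way 3B-01 could look, written duck-typed (no gym import) so the shaping math is testable without a simulator, in the spirit of 3B-04. The real wrapper would subclass the env's wrapper base class, and `info["cte"]` is an assumption about what the sim reports:

```python
class LanePositionWrapper:
    """Sketch of the 3B-01 reward wrapper: steer toward a target lane offset."""

    def __init__(self, env, target_cte=0.0, max_cte=2.0):
        self.env = env
        self.target_cte = target_cte  # desired lateral offset (hypothetical default)
        self.max_cte = max_cte        # normalization constant (hypothetical default)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        cte = info.get("cte", 0.0)  # assumes the sim puts CTE in the info dict
        # reward = 1 - |cte - target| / max_cte, clipped at zero
        shaped = max(0.0, 1.0 - abs(cte - self.target_cte) / self.max_cte)
        return obs, shaped, done, info
```

Setting `target_cte` off-center is what would let the same wrapper express the right-lane preference.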

### Stream 3C: Speed Optimization

- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from the sim's info dict
- [ ] **3C-02** — Update the reward to incorporate lap time: `reward += lap_bonus if lap_completed else 0`
- [ ] **3C-03** — Run targeted autoresearch starting from the Phase 2 champion checkpoint
- [ ] **3C-04** — Fine-tuning: load Phase 2 champion weights, continue training with the speed reward
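A hedged sketch of the 3C-02 reward change, assuming `last_lap_time` arrives in the info dict as 3C-01 states; detecting completion by a change in that value, and the inverse-time bonus scale, are both assumptions:

```python
def add_lap_bonus(reward, info, prev_lap_time, lap_bonus=100.0):
    """Return (new_reward, new_prev_lap_time).

    A lap counts as completed when the sim's last_lap_time changes
    (assumption about how the info dict behaves across steps).
    """
    lap_time = info.get("last_lap_time", 0.0)
    lap_completed = lap_time > 0.0 and lap_time != prev_lap_time
    if lap_completed:
        # Faster laps earn a larger bonus (inverse-time scaling, assumed)
        reward += lap_bonus / lap_time
    return reward, lap_time
```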

### Stream 3D: Multi-Track Generalization

- [ ] **3D-01** — Evaluate champion on a 2nd track (e.g., `donkey-mountain-track-v0`)
- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks
- [ ] **3D-03** — Measure generalization gap (train_track vs. unseen_track reward)
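For 3D-03, the gap could be reported as a simple difference of mean episode rewards; the function name and the choice of mean reward as the metric are assumptions:

```python
def generalization_gap(train_track_rewards, unseen_track_rewards):
    """Gap = mean reward on the training track minus mean reward on the
    unseen track; a value near zero suggests the policy generalizes."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(train_track_rewards) - mean(unseen_track_rewards)
```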
---
## Wave 4: Racing (future)
**Goal:** Fastest possible lap on any track.

**Gate:** Wave 3 complete. Multi-track generalization proven.

**Status:** ⏸️ Not started

- [ ] **4-01** — Pure lap time reward (replace CTE-based reward with time-based)
- [ ] **4-02** — Head-to-head: autoresearch champion vs. human-tuned config
- [ ] **4-03** — Research paper / writeup structure
---
## Notes
- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- **CTE symmetry:** the reward is symmetric → the car picks left or right based on random NN init
- **Track ends!** The track has a physical finish — runs end on track completion, not timeout
- **Reward v4 (base × efficiency × speed):** successfully eliminated all circular driving exploits
- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2)
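The reward v4 note above can be sketched as a multiplicative composition. Every variable name and normalization here is illustrative, not the project's actual implementation:

```python
def reward_v4(cte, max_cte, speed, max_speed, progress_delta):
    """Illustrative sketch of a base × efficiency × speed reward."""
    base = max(0.0, 1.0 - abs(cte) / max_cte)    # lane-centering term
    efficiency = max(0.0, progress_delta)        # rewards forward progress only
    speed_term = max(0.0, speed / max_speed)     # normalized speed
    # Multiplying means any zero factor zeroes the whole reward, so
    # circling in place (no progress) earns nothing -- which is why a
    # multiplicative form kills circular-driving exploits.
    return base * efficiency * speed_term
```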