# Implementation Plan — DonkeyCar RL Autoresearch
Agent: read this at the start of every iteration. Pick the first unchecked task in the current active wave. Mark done immediately after commit.
## ✅ Wave 1: Real Training Foundation — COMPLETE
All tasks done. Phase 1 champion achieved genuine forward driving.
## ✅ Wave 2: Track Completion — COMPLETE
All top 3 Phase 2 models complete the full track. Champion: Trial 20 — `n_steer=3`, `n_throttle=5`, `lr=0.000225`, 13k steps. Driving style: right lane, very stable. Completes the full track in ~2874 steps. Key finding: `n_steer=3` > `n_steer=4` (fewer bins = more decisive = less oscillation).
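For context on the discretization finding: a sketch of how an `n_steer × n_throttle` grid maps a flat discrete action index to (steering, throttle) pairs. This is a hypothetical helper, not the trainer's actual discretization code, which this plan does not show.

```python
import numpy as np

def make_action_grid(n_steer: int, n_throttle: int,
                     steer_range=(-1.0, 1.0), throttle_range=(0.0, 1.0)):
    """Map a flat discrete action index to (steering, throttle).

    Fewer steering bins (e.g. n_steer=3: full left / straight / full right)
    force more decisive steering, which Wave 2 found reduces oscillation.
    """
    steer = np.linspace(*steer_range, n_steer)
    throttle = np.linspace(*throttle_range, n_throttle)
    # Index i encodes (steer_bin, throttle_bin) row-major.
    return [(s, t) for s in steer for t in throttle]

grid = make_action_grid(3, 5)  # champion config: 3 x 5 = 15 discrete actions
```

With `n_steer=3`, index 7 decodes to steering 0.0 (straight) at half throttle; with `n_steer=4` there is no exact "straight" bin, which may explain the oscillation observed in Wave 2.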
## Wave 3: Behavioral Control & Speed Optimization
Goal: Control driving style (lane, oscillation), measure lap time, optimize for speed. Gate: Phase 2 champion completes full track (DONE ✅). Status: 🟠 In progress
### Stream 3A: Enhanced Evaluator + Metrics
- 3A-01 — Update champion to Phase 2 Trial 20
- 3A-02 — Add lap time measurement to `evaluate_champion.py`
- 3A-03 — Add steering oscillation metric (std of steering actions per episode)
- 3A-04 — Add lane position histogram (distribution of CTE values)
- 3A-05 — Save eval summary to `outerloop-results/eval_summary.jsonl`
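Tasks 3A-03 through 3A-05 could be wired up roughly like this. This is a sketch: `summarize_episode` and `append_summary` are hypothetical helper names, and only the `eval_summary.jsonl` path comes from the plan; the CTE bucket edges are arbitrary placeholders.

```python
import json
import statistics

def summarize_episode(steering_actions, cte_values, lap_steps):
    """Per-episode metrics: oscillation (3A-03) and lane position (3A-04)."""
    # Oscillation metric: population std of steering commands in the episode.
    oscillation = statistics.pstdev(steering_actions)
    # Lane-position histogram: coarse CTE buckets (edges are placeholders).
    hist = {"left": 0, "center": 0, "right": 0}
    for cte in cte_values:
        if cte < -0.5:
            hist["left"] += 1
        elif cte > 0.5:
            hist["right"] += 1
        else:
            hist["center"] += 1
    return {"lap_steps": lap_steps, "steering_std": oscillation, "cte_hist": hist}

def append_summary(summary, path="outerloop-results/eval_summary.jsonl"):
    """3A-05: append one JSON object per eval run (JSONL format)."""
    with open(path, "a") as f:
        f.write(json.dumps(summary) + "\n")
```

A lower `steering_std` should correlate with the "very stable" driving style noted for the Trial 20 champion.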
### Stream 3B: Behavioral Reward Variants
- 3B-01 — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with configurable target CTE offset
- 3B-02 — `AntiOscillationWrapper`: adds a penalty for rapid steering changes (smoothness reward)
- 3B-03 — `AsymmetricCTEWrapper`: penalizes left-of-center more (enforces the right-lane rule)
- 3B-04 — Tests for all three wrappers (no simulator required)
- 3B-05 — Integrate wrapper selection into `autoresearch_controller.py` via a `--behavior` flag
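The three wrapper variants share simple reward math. Below is a sketch of just the reward terms: the real tasks wrap a simulator env, and the weights, defaults, and CTE sign convention (negative = left of center) are assumptions, not confirmed by the plan.

```python
def lane_position_reward(cte, target_cte, max_cte):
    """LanePositionWrapper core (3B-01): peak reward at cte == target_cte.

    A nonzero target_cte biases the car toward one lane; max_cte normalises
    so the reward stays roughly in [0, 1] while on track.
    """
    return 1.0 - abs(cte - target_cte) / max_cte

def anti_oscillation_penalty(prev_steer, steer, weight=0.5):
    """AntiOscillationWrapper core (3B-02): penalise rapid steering changes."""
    return -weight * abs(steer - prev_steer)

def asymmetric_cte_reward(cte, max_cte, left_weight=2.0):
    """AsymmetricCTEWrapper core (3B-03): left-of-center errors cost more,
    breaking the CTE symmetry noted in the Notes section below."""
    weight = left_weight if cte < 0 else 1.0  # assumed: cte < 0 means left
    return 1.0 - weight * abs(cte) / max_cte
```

Because each term is a pure function of scalars, the 3B-04 tests can assert on them directly with no simulator running.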
### Stream 3C: Speed Optimization
- 3C-01 — Measure actual lap time using `last_lap_time` from the sim info dict
- 3C-02 — Update reward to incorporate lap time: `reward += lap_bonus if lap_completed`
- 3C-03 — Run targeted autoresearch starting from the Phase 2 champion checkpoint
- 3C-04 — Fine-tuning: load Phase 2 champion weights, continue training with the speed reward
### Stream 3D: Multi-Track Generalization
- 3D-01 — Evaluate champion on a 2nd track (e.g., `donkey-mountain-track-v0`)
- 3D-02 — Track-agnostic training: alternate episodes between 2 tracks
- 3D-03 — Measure generalization gap (`train_track` vs `unseen_track` reward)
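The 3D tasks reduce to two small pieces, sketched here. The training-track env ID `donkey-generated-track-v0` is an assumption (only `donkey-mountain-track-v0` appears in the plan), and round-robin is just one simple way to alternate episodes.

```python
def generalization_gap(train_rewards, unseen_rewards):
    """3D-03 sketch: mean train-track reward minus mean unseen-track reward.

    Near zero means the policy transfers; large and positive means it
    overfits the training track. Inputs are per-episode reward lists.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean(train_rewards) - mean(unseen_rewards)

def track_schedule(n_episodes, tracks=("donkey-generated-track-v0",
                                       "donkey-mountain-track-v0")):
    """3D-02 sketch: alternate episodes between tracks (round-robin)."""
    return [tracks[i % len(tracks)] for i in range(n_episodes)]
```

The schedule list can drive episode resets in the training loop, and the gap metric can be logged alongside the 3A eval summary.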
## Wave 4: Racing (future)
Goal: Fastest possible lap on any track. Gate: Wave 3 complete. Multi-track generalization proven. Status: ⏸️ Not started
- 4-01 — Pure lap time reward (replace CTE-based reward with time-based)
- 4-02 — Head-to-head: autoresearch champion vs human-tuned config
- 4-03 — Research paper / writeup structure
## Notes
- Phase 2 key finding: `n_steer=3` outperforms `n_steer=4` (counterintuitive — fewer bins = better)
- CTE symmetry: reward is symmetric → car picks left or right based on random NN init
- Track ends! The track has a physical finish — runs end on track completion, not timeout
- Reward v4 (base × efficiency × speed): Successfully eliminated all circular driving exploits
- Champion model path: `agent/models/champion/model.zip` (Trial 20, Phase 2)