# Implementation Plan — DonkeyCar RL Autoresearch

Agent: read this at the start of every iteration. Pick the first unchecked task in the current active wave. Mark done immediately after commit.


## Wave 1: Real Training Foundation — COMPLETE

All tasks done. Phase 1 champion achieved genuine forward driving.

## Wave 2: Track Completion — COMPLETE

All top three Phase 2 models complete the full track. Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps. Driving style: right lane, very stable; completes the full track in ~2874 steps. Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).
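The n_steer/n_throttle hyperparameters above describe a discretized action space. A minimal sketch of how such a grid could be built, assuming a [-1, 1] steering range and [0, 1] throttle range (the helper names and ranges are illustrative, not the project's actual code):

```python
# Illustrative sketch of the discrete action grid implied by n_steer=3,
# n_throttle=5: each discrete action index maps to one (steering, throttle)
# pair. The [-1, 1] steering and [0, 1] throttle ranges are assumptions.
import itertools


def _linspace(lo, hi, n):
    """Evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [(lo + hi) / 2]
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]


def build_action_grid(n_steer=3, n_throttle=5):
    steers = _linspace(-1.0, 1.0, n_steer)      # hard left, straight, hard right
    throttles = _linspace(0.0, 1.0, n_throttle)
    return list(itertools.product(steers, throttles))
```

With n_steer=3 the only steering choices are hard left, straight, and hard right, which is consistent with the "fewer bins = more decisive" finding: there is no cluster of near-straight bins to oscillate between.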


## Wave 3: Behavioral Control & Speed Optimization

Goal: Control driving style (lane position, oscillation), measure lap time, optimize for speed. Gate: Phase 2 champion completes the full track (DONE). Status: 🟠 In progress

### Stream 3A: Enhanced Evaluator + Metrics

- [ ] 3A-01 — Update champion to Phase 2 Trial 20
- [ ] 3A-02 — Add lap time measurement to evaluate_champion.py
- [ ] 3A-03 — Add steering oscillation metric (std of steering actions per episode)
- [ ] 3A-04 — Add lane position histogram (distribution of CTE values)
- [ ] 3A-05 — Save eval summary to outerloop-results/eval_summary.jsonl
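The 3A-03, 3A-04, and 3A-05 metrics can be sketched as plain functions over per-episode logs. The function names and summary schema below are illustrative, not the actual evaluate_champion.py API:

```python
# Sketch of the Stream 3A episode metrics, assuming the evaluator collects
# per-step steering actions and CTE values into plain Python lists.
import json
import statistics
from collections import Counter


def steering_oscillation(steering_actions):
    """3A-03: std of steering actions over one episode (higher = more wobble)."""
    if len(steering_actions) < 2:
        return 0.0
    return statistics.stdev(steering_actions)


def lane_histogram(cte_values, bin_width=0.5):
    """3A-04: histogram of cross-track error, bucketed into bin_width bins."""
    return Counter(round(cte / bin_width) * bin_width for cte in cte_values)


def append_eval_summary(path, summary):
    """3A-05: append one JSON line per evaluation run (JSONL format)."""
    with open(path, "a") as f:
        f.write(json.dumps(summary) + "\n")
```

JSONL (one JSON object per line, opened in append mode) keeps the eval history greppable across autoresearch iterations without rewriting the whole file.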

### Stream 3B: Behavioral Reward Variants

- [ ] 3B-01 — LanePositionWrapper: reward = 1 - abs(cte - target)/max_cte with configurable target CTE offset
- [ ] 3B-02 — AntiOscillationWrapper: adds a penalty for rapid steering changes (smoothness reward)
- [ ] 3B-03 — AsymmetricCTEWrapper: penalizes left-of-center more heavily (enforces the right-lane rule)
- [ ] 3B-04 — Tests for all three wrappers (no simulator required)
- [ ] 3B-05 — Integrate wrapper selection into autoresearch_controller.py via a --behavior flag
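A minimal sketch of the 3B-02 idea, assuming the classic gym step signature `(obs, reward, done, info)`, a `[steering, throttle]` action layout, and an assumed `smooth_coef` penalty weight — not the project's actual wrapper:

```python
# Sketch of an anti-oscillation reward wrapper: subtract a penalty
# proportional to the change in steering between consecutive steps.
# smooth_coef and the [steering, throttle] action layout are assumptions.
class AntiOscillationWrapper:
    """Penalize rapid steering changes to encourage smooth driving."""

    def __init__(self, env, smooth_coef=0.5):
        self.env = env
        self.smooth_coef = smooth_coef
        self._last_steer = None

    def reset(self):
        self._last_steer = None
        return self.env.reset()

    def step(self, action):
        steer = action[0]
        obs, reward, done, info = self.env.step(action)
        if self._last_steer is not None:
            # Large step-to-step steering deltas reduce the reward.
            reward -= self.smooth_coef * abs(steer - self._last_steer)
        self._last_steer = steer
        return obs, reward, done, info
```

Because the penalty depends only on the action stream, this wrapper can be unit-tested against a dummy env with a constant reward, satisfying the "no simulator required" constraint of 3B-04.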

### Stream 3C: Speed Optimization

- [ ] 3C-01 — Measure actual lap time using last_lap_time from sim info dict
- [ ] 3C-02 — Update reward to incorporate lap time: reward += lap_bonus if lap_completed
- [ ] 3C-03 — Run targeted autoresearch starting from Phase 2 champion checkpoint
- [ ] 3C-04 — Fine-tuning: load Phase 2 champion weights, continue training with speed reward
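One possible shape for the 3C-02 lap bonus, assuming lap completion is detected by a change in the sim's last_lap_time field (per 3C-01); the bonus_scale value and inverse-time form are assumptions, not decided design:

```python
# Sketch of lap-time reward shaping: when last_lap_time changes, a lap
# just completed, so add a bonus that grows as lap time shrinks.
# bonus_scale and the change-detection heuristic are assumptions.
def lap_bonus_reward(reward, info, prev_lap_time, bonus_scale=100.0):
    """Return (shaped_reward, new_prev_lap_time).

    Call once per step with the sim's info dict and the last seen lap time.
    """
    lap_time = info.get("last_lap_time", 0.0)
    if lap_time and lap_time != prev_lap_time:
        # Faster laps earn a larger bonus (inverse-time shaping).
        reward += bonus_scale / lap_time
    return reward, lap_time
```

Making the bonus inversely proportional to lap time (rather than a flat constant) gives the fine-tuned policy a gradient toward faster laps, not just toward finishing.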

### Stream 3D: Multi-Track Generalization

- [ ] 3D-01 — Evaluate champion on 2nd track (e.g., donkey-mountain-track-v0)
- [ ] 3D-02 — Track-agnostic training: alternate episodes between 2 tracks
- [ ] 3D-03 — Measure generalization gap (train_track vs unseen_track reward)
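The 3D-03 gap reduces to a small helper once mean episode rewards on both tracks are available; the relative-gap normalization below is one possible convention, not a fixed project choice:

```python
# Sketch of the generalization-gap metric: absolute and relative drop in
# mean episode reward from the training track to an unseen track.
def generalization_gap(train_reward, unseen_reward):
    """Return (absolute_gap, relative_gap)."""
    gap = train_reward - unseen_reward
    rel = gap / abs(train_reward) if train_reward else float("inf")
    return gap, rel
```

Reporting the relative gap alongside the absolute one makes results comparable across reward variants whose raw scales differ (e.g. with and without the lap bonus).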

## Wave 4: Racing (future)

Goal: Fastest possible lap on any track. Gate: Wave 3 complete; multi-track generalization proven. Status: ⏸️ Not started

- [ ] 4-01 — Pure lap time reward (replace the CTE-based reward with a time-based one)
- [ ] 4-02 — Head-to-head: autoresearch champion vs human-tuned config
- [ ] 4-03 — Research paper / writeup structure

## Notes

- Phase 2 key finding: n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- CTE symmetry: the reward is symmetric, so the car picks the left or right lane based on random NN init
- Track ends! The track has a physical finish — runs end on track completion, not timeout
- Reward v4 (base × efficiency × speed) successfully eliminated all circular-driving exploits
- Champion model path: agent/models/champion/model.zip (Trial 20, Phase 2)