
# Implementation Plan — DonkeyCar RL Autoresearch
> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.
---
## ✅ Wave 1: Real Training Foundation — COMPLETE
All tasks done. Phase 1 champion achieved genuine forward driving.
## ✅ Wave 2: Track Completion — COMPLETE
All top 3 Phase 2 models complete the full track.
Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps.
Driving style: Right lane, very stable. Completes full track in ~2874 steps.
Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).
---
## Wave 3: Behavioral Control & Speed Optimization
**Goal:** Control driving style (lane, oscillation), measure lap time, optimize for speed.
**Gate:** Phase 2 champion completes full track (DONE ✅).
**Status:** 🟠 In progress
### Stream 3A: Enhanced Evaluator + Metrics
- [x] **3A-01** — Update champion to Phase 2 Trial 20
- [ ] **3A-02** — Add lap time measurement to evaluate_champion.py
- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode)
- [ ] **3A-04** — Add lane position histogram (distribution of CTE values)
- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl`
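The 3A metrics can be prototyped without the simulator. A minimal sketch, assuming steering actions and CTE values are collected per eval episode; the function names, histogram bin width, and JSONL record shape are assumptions, not the existing evaluate_champion.py API:

```python
import json
import statistics
from collections import Counter

def episode_metrics(steerings, ctes, bin_width=0.5):
    """Summarise one eval episode: oscillation (3A-03) is the population
    std of steering actions; lane position (3A-04) is a histogram of CTE
    values bucketed to bin_width (sim units)."""
    hist = Counter(round(c / bin_width) * bin_width for c in ctes)
    return {
        "steering_std": statistics.pstdev(steerings),
        "cte_histogram": dict(sorted(hist.items())),
    }

def append_summary(path, record):
    """3A-05: append one eval record as a JSON line to eval_summary.jsonl."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A low steering_std for Trial 20 would quantify the "very stable" driving style noted for the Wave 2 champion.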
### Stream 3B: Behavioral Reward Variants
- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with configurable target CTE offset
- [ ] **3B-02** — `AntiOscillationWrapper`: adds penalty for rapid steering changes (smoothness reward)
- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more (enforces right-lane rule)
- [ ] **3B-04** — Tests for all three wrappers (no simulator required)
- [ ] **3B-05** — Integrate wrapper selection into autoresearch_controller.py via `--behavior` flag
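The three 3B reward shapes reduce to pure functions, which is what makes the simulator-free tests of 3B-04 possible; the real wrappers would call these from `step()`. A sketch, where the default constants and the "negative CTE = left of center" sign convention are assumptions:

```python
def lane_position_reward(cte, target_cte=0.0, max_cte=8.0):
    """3B-01: 1.0 at target_cte, falling linearly to 0.0 at max_cte away."""
    return max(0.0, 1.0 - abs(cte - target_cte) / max_cte)

def anti_oscillation_penalty(steer, prev_steer, weight=0.5):
    """3B-02: penalty proportional to the steering change since last step."""
    return -weight * abs(steer - prev_steer)

def asymmetric_cte_reward(cte, max_cte=8.0, left_weight=2.0):
    """3B-03: left-of-center (cte < 0, assumed convention) is penalised
    left_weight times harder, nudging the policy into the right lane."""
    w = left_weight if cte < 0 else 1.0
    return max(0.0, 1.0 - w * abs(cte) / max_cte)
```

Keeping the shapes pure also makes the `--behavior` flag of 3B-05 a simple dispatch table from flag value to function.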
### Stream 3C: Speed Optimization
- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from sim info dict
- [ ] **3C-02** — Update reward to incorporate lap time: add a `lap_bonus` to the reward on lap completion
- [ ] **3C-03** — Run targeted autoresearch starting from Phase 2 champion checkpoint
- [ ] **3C-04** — Fine-tuning: load Phase 2 champion weights, continue training with speed reward
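3C-01/3C-02 hinge on detecting a finished lap from `last_lap_time` in the sim info dict. One way to sketch it, assuming the key holds the most recent lap's duration in seconds and changes when a new lap completes; `base_bonus` and `target_time` are made-up tuning constants:

```python
def lap_time_bonus(info, prev_lap_time, base_bonus=100.0, target_time=60.0):
    """Return (bonus, new_prev_lap_time). A new lap is detected when
    last_lap_time is nonzero and differs from the previously seen value;
    the bonus scales inversely with lap time, so faster laps pay more."""
    lap_time = info.get("last_lap_time", 0.0)
    if lap_time and lap_time != prev_lap_time:
        return base_bonus * target_time / lap_time, lap_time
    return 0.0, prev_lap_time
```

Carrying `prev_lap_time` between steps avoids paying the bonus on every step after the lap completes.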
### Stream 3D: Multi-Track Generalization
- [ ] **3D-01** — Evaluate champion on 2nd track (e.g., `donkey-mountain-track-v0`)
- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks
- [ ] **3D-03** — Measure generalization gap (train_track vs unseen_track reward)
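For 3D-02/3D-03, deterministic alternation and the gap metric are small enough to pin down now. A sketch, where the first track id is an assumed placeholder for the current training track (only `donkey-mountain-track-v0` is named in this plan):

```python
def pick_track(episode_idx,
               tracks=("donkey-generated-roads-v0", "donkey-mountain-track-v0")):
    """3D-02: alternate between tracks round-robin, one per episode."""
    return tracks[episode_idx % len(tracks)]

def generalization_gap(train_rewards, unseen_rewards):
    """3D-03: mean reward on the training track minus mean reward on the
    unseen track; smaller gap = better generalization."""
    return (sum(train_rewards) / len(train_rewards)
            - sum(unseen_rewards) / len(unseen_rewards))
```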
---
## Wave 4: Racing (future)
**Goal:** Fastest possible lap on any track.
**Gate:** Wave 3 complete. Multi-track generalization proven.
**Status:** ⏸️ Not started
- [ ] **4-01** — Pure lap time reward (replace CTE-based reward with time-based)
- [ ] **4-02** — Head-to-head: autoresearch champion vs human-tuned config
- [ ] **4-03** — Research paper / writeup structure
---
## Notes
- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- **CTE symmetry:** reward is symmetric → car picks left or right based on random NN init
- **Track ends!** The track has a physical finish — runs end on track completion, not timeout
- **Reward v4 (base × efficiency × speed):** Successfully eliminated all circular driving exploits
- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2)
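The "base × efficiency × speed" note above explains why v4 killed the circular-driving exploits: with multiplicative factors each in [0, 1], zeroing any one factor zeroes the whole reward, so spinning in place (speed factor ≈ 0 forward progress) earns nothing. A sketch of that shape — the individual factor definitions and constants here are assumptions, not the project's actual v4 code:

```python
def reward_v4(cte, speed, steering, max_cte=8.0, max_speed=10.0,
              steer_weight=0.3):
    """Multiplicative reward: each factor lies in [0, 1], so an exploit
    that maximises one factor while zeroing another scores zero overall."""
    base = max(0.0, 1.0 - abs(cte) / max_cte)                  # stay on line
    efficiency = max(0.0, 1.0 - steer_weight * abs(steering))  # smooth steering
    speed_term = min(max(speed, 0.0), max_speed) / max_speed   # forward progress
    return base * efficiency * speed_term
```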