# Implementation Plan — DonkeyCar RL Autoresearch
Agent: read this at the start of every iteration. Pick the first unchecked task in the currently active wave, and mark it done immediately after the commit.
## Wave 1: Real Training Foundation

**Goal:** Make the inner loop actually train and save models. Produce a real champion model.
**Gate:** The champion model achieves `mean_reward > 100` on the training track.
**Status:** 🟠 In progress

### Stream 1A: Core Runner Rebuild
- [ ] 1A-01 — Rebuild `donkeycar_sb3_runner.py` with real PPO training (`model.learn()`), model save, and proper evaluation (`evaluate_policy()`); see the runner sketch after this list
- [ ] 1A-02 — Add `SpeedRewardWrapper` — reward = `speed * (1 - abs(cte) / max_cte)`; add a `--reward-shaping` flag
- [ ] 1A-03 — Add champion model tracking — write `champion_manifest.json` when a new best is found
- [ ] 1A-04 — Fix the autoresearch controller to pass the `learning_rate`, `save_dir`, and `reward_shaping` args to the runner
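
As a point of reference for 1A-01 through 1A-03, here is a minimal sketch assuming Stable-Baselines3 with a Gymnasium-style env. The `info` keys (`speed`, `cte`), the default `max_cte`, the `CnnPolicy` choice, and the file layout are assumptions, not confirmed project API:

```python
import json
from pathlib import Path

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


class SpeedRewardWrapper(gym.Wrapper):
    """Reward shaping from 1A-02: speed * (1 - abs(cte) / max_cte)."""

    def __init__(self, env, max_cte: float = 8.0):  # max_cte default is a guess
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Assumes the sim reports speed and cross-track error via `info`.
        reward = info["speed"] * (1 - abs(info["cte"]) / self.max_cte)
        return obs, reward, terminated, truncated, info


def run_trial(env, learning_rate: float, save_dir: str, timesteps: int = 50_000):
    """One autoresearch trial: train for real, save, then evaluate (1A-01)."""
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=timesteps)  # actual training, not a stub
    model_path = Path(save_dir) / "model.zip"
    model.save(model_path)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
    return model_path, mean_reward, std_reward


def update_champion(manifest_path: Path, model_path: Path, mean_reward: float):
    """Write champion_manifest.json when a new best is found (1A-03)."""
    best = -float("inf")
    if manifest_path.exists():
        best = json.loads(manifest_path.read_text())["mean_reward"]
    if mean_reward > best:
        manifest_path.write_text(json.dumps(
            {"model_path": str(model_path), "mean_reward": mean_reward},
            indent=2))
```

The controller would then call `run_trial` once per hyperparameter suggestion and `update_champion` after each evaluation.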
### Stream 1B: Tests
- [ ] 1B-01 — Write `tests/test_discretize_action.py` — action encoding, decoding, round-trip (see the sketch after this list)
- [ ] 1B-02 — Write `tests/test_autoresearch_controller.py` — GP fit, UCB computation, param round-trip, champion tracking
- [ ] 1B-03 — Write `tests/test_runner_integration.py` — mocked sim, training + save + eval cycle
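
For 1B-01, a hypothetical shape of the round-trip test; the `encode`/`decode` pair below is a stand-in reference implementation, since the real discretization functions (and their signatures) live in the runner module:

```python
import pytest


def encode(steer: float, throttle: float, n_steer: int, n_throttle: int) -> int:
    """Map steer in [-1, 1] and throttle in [0, 1] onto a flat discrete index."""
    si = round((steer + 1) / 2 * (n_steer - 1))
    ti = round(throttle * (n_throttle - 1))
    return si * n_throttle + ti


def decode(index: int, n_steer: int, n_throttle: int) -> tuple[float, float]:
    """Inverse of encode: flat index back to (steer, throttle)."""
    si, ti = divmod(index, n_throttle)
    return si / (n_steer - 1) * 2 - 1, ti / (n_throttle - 1)


@pytest.mark.parametrize("n_steer,n_throttle", [(5, 3), (9, 7)])
def test_round_trip(n_steer, n_throttle):
    # Every discrete index must survive decode -> encode unchanged.
    for idx in range(n_steer * n_throttle):
        steer, throttle = decode(idx, n_steer, n_throttle)
        assert encode(steer, throttle, n_steer, n_throttle) == idx
```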
### Stream 1C: First Real Autoresearch Run
- [ ] 1C-01 — Run a 50-trial autoresearch with real PPO training; verify models are saved
- [ ] 1C-02 — Save the regression baseline: `champion_reward_phase1.txt`
- [ ] 1C-03 — Push all results and models to Gitea
- [ ] 1C-04 — Write the Wave 1 process eval
## Wave 2: Multi-Track Generalization

**Goal:** The champion model drives any track with `mean_reward > 50`.
**Gate:** The Wave 1 champion achieves `mean_reward > 100`; the Wave 1 process eval is complete.
**Status:** ⏸️ Not started — blocked on Wave 1
- [ ] 2-01 — Write `evaluate_champion.py` — load the champion model and evaluate it on a specified track
- [ ] 2-02 — Implement a multi-track training curriculum (train on 2 tracks alternately)
- [ ] 2-03 — Add a domain randomization wrapper (randomize road width and lighting)
- [ ] 2-04 — Implement convergence detection in autoresearch (stop when the GP's sigma collapses; see the sketch after this list)
- [ ] 2-05 — Add an automatic Gitea push every N trials
- [ ] 2-06 — Evaluate the champion on an unseen track; record the generalization gap
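
For 2-04 (and the UCB computation tested in 1B-02), a sketch assuming the controller keeps a scikit-learn Gaussian process over normalized hyperparameters; the library choice, the kernel, and the `sigma_threshold` default are all assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def ucb(gp: GaussianProcessRegressor, candidates: np.ndarray,
        kappa: float = 2.0) -> np.ndarray:
    """Standard UCB acquisition: mu + kappa * sigma."""
    mu, sigma = gp.predict(candidates, return_std=True)
    return mu + kappa * sigma


def gp_converged(X: np.ndarray, y: np.ndarray, candidates: np.ndarray,
                 sigma_threshold: float = 0.05) -> bool:
    """Stop the search once predictive sigma has collapsed everywhere (2-04)."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    _, sigma = gp.predict(candidates, return_std=True)
    return float(sigma.max()) < sigma_threshold
```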
## Wave 3: Racing / Speed Optimization

**Goal:** The fastest possible lap times on any track.
**Gate:** The Wave 2 champion generalizes to ≥1 unseen track (`mean_reward > 50`).
**Status:** ⏸️ Not started — blocked on Wave 2
- [ ] 3-01 — Implement lap time measurement and logging (see the lap-timer sketch after this list)
- [ ] 3-02 — Tune the reward function for pure speed (aggressive speed weight)
- [ ] 3-03 — Fine-tune from the champion checkpoint on new tracks
- [ ] 3-04 — Head-to-head: autoresearch champion vs. human-tuned baseline
- [ ] 3-05 — Research writeup / report
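
For 3-01, one possible lap-timer wrapper. Whether the sim exposes a lap counter in `info`, and under which key, is an assumption; wall-clock timing is used here, though step count times the sim timestep would avoid real-time jitter:

```python
import time

import gymnasium as gym


class LapTimerWrapper(gym.Wrapper):
    """Record a lap time each time the sim's lap counter increments."""

    def __init__(self, env, lap_key: str = "lap_count"):  # info key is assumed
        super().__init__(env)
        self.lap_key = lap_key
        self.lap_times: list[float] = []

    def reset(self, **kwargs):
        self._laps = 0
        self._lap_start = time.monotonic()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if info.get(self.lap_key, 0) > self._laps:  # a lap was completed
            now = time.monotonic()
            self.lap_times.append(now - self._lap_start)
            self._lap_start = now
            self._laps = info[self.lap_key]
        return obs, reward, terminated, truncated, info
```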
## Completion Signals
The agent outputs one of these at the end of each iteration:
- `<promise>PLANNED</promise>` — just created/updated the plan, ready to implement
- `<promise>DONE</promise>` — all tasks in the current wave are complete
- `<promise>STUCK</promise>` — needs human input (see the ESCALATION REQUIRED block if present)
- `<promise>ERROR</promise>` — unrecoverable error
## Notes
- **Random policy data (300 trials):** The existing `autoresearch_results.jsonl` contains rewards from random-action policy runs. These are valid for `n_steer`/`n_throttle` discretization insights but NOT for `learning_rate` optimization. Do not mix them with Phase 1 real-training results; create a separate results file, `autoresearch_results_phase1.jsonl`.
- **Model storage:** Large CNN models (>100 MB) should be excluded from git or tracked with Git LFS. Add `agent/models/**/*.zip` to `.gitignore` if needed, and document the download location.
- **Simulator requirement:** All live training tasks (1C-*) require the DonkeyCar sim running on port 9091. Tests (1B-*) do NOT require the simulator.