
Implementation Plan — DonkeyCar RL Autoresearch

Agent: read this at the start of every iteration. Pick the first unchecked task in the currently active wave. Mark it done immediately after committing.


Wave 1: Real Training Foundation

Goal: Make the inner loop actually train and save models. Produce a real champion model.
Gate: champion model achieves mean_reward > 100 on the training track.
Status: 🟠 In progress

Stream 1A: Core Runner Rebuild

  • 1A-01 — Rebuild donkeycar_sb3_runner.py with real PPO training (model.learn()), model saving, and proper evaluation (evaluate_policy()) (see the runner sketch after this list)
  • 1A-02 — Add SpeedRewardWrapper — reward = speed * (1 - abs(cte)/max_cte); add a --reward-shaping flag (wrapper sketch below)
  • 1A-03 — Add champion model tracking — write champion_manifest.json whenever a new best is found (manifest sketch below)
  • 1A-04 — Fix the autoresearch controller to pass the learning_rate, save_dir, and reward_shaping args to the runner
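
A minimal sketch of the 1A-01 train/save/eval core. The stable-baselines3 calls (model.learn(), model.save(), evaluate_policy()) are the real API; the env ID, save layout, and hyperparameters are placeholders, and the classic gym 4-tuple step API is assumed:

```python
# Sketch only: PPO train -> save -> evaluate, the core that 1A-01 replaces the stub with.
# Env ID and save layout are assumptions; the SB3 calls themselves are real.
import gym
import gym_donkeycar  # noqa: F401  (importing registers the donkey envs)
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def train_and_evaluate(learning_rate: float, save_dir: str,
                       total_timesteps: int = 100_000):
    env = gym.make("donkey-generated-track-v0")  # placeholder env ID
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=1)
    model.learn(total_timesteps=total_timesteps)   # real training, not a stub
    model.save(f"{save_dir}/model")                # persist the trained policy
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    env.close()
    return mean_reward, std_reward
```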
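For 1A-02, the reward formula is fixed by the task; only the plumbing is open. A wrapper sketch, assuming the sim reports speed and cross-track error through the step info dict (the key names are assumptions; verify against the sim's actual output):

```python
# Sketch: reward shaping wrapper for 1A-02, classic gym 4-tuple step API assumed.
# Implements reward = speed * (1 - abs(cte)/max_cte).
# The "speed"/"cte" info keys and the max_cte default are assumptions.
import gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, max_cte: float = 8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # discard the sim's native reward
        speed = info.get("speed", 0.0)
        cte = info.get("cte", self.max_cte)
        reward = speed * (1 - abs(cte) / self.max_cte)
        return obs, reward, done, info
```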
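For 1A-03, the manifest update can be a compare-and-rewrite of one small JSON file. A sketch with an assumed schema (trial_id / mean_reward / model_path are illustrative field names, not a fixed format):

```python
# Sketch: rewrite champion_manifest.json when a trial beats the stored best.
# The manifest field names are assumptions, not the project's fixed schema.
import json
from pathlib import Path

def update_champion(manifest_path: str, trial_id: int,
                    mean_reward: float, model_path: str) -> bool:
    path = Path(manifest_path)
    best = json.loads(path.read_text())["mean_reward"] if path.exists() else float("-inf")
    if mean_reward <= best:
        return False  # current champion stands
    path.write_text(json.dumps({
        "trial_id": trial_id,
        "mean_reward": mean_reward,
        "model_path": model_path,
    }, indent=2))
    return True  # new champion recorded
```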

Stream 1B: Tests

  • 1B-01 — Write tests/test_discretize_action.py — action encoding, decoding, and round-trip (see the example test after this list)
  • 1B-02 — Write tests/test_autoresearch_controller.py — GP fit, UCB computation, param round-trip, champion tracking
  • 1B-03 — Write tests/test_runner_integration.py — mocked sim, full train + save + eval cycle
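
The 1B-01 round-trip property is the load-bearing one: every discrete index must decode to an action that encodes back to the same index. A sketch, assuming hypothetical encode/decode helpers over an n_steer × n_throttle grid (module and function names are placeholders):

```python
# Sketch: round-trip test for 1B-01.
# `discretize_action`, `encode`, and `decode` are hypothetical names.
import pytest
from discretize_action import encode, decode

N_STEER, N_THROTTLE = 5, 3

@pytest.mark.parametrize("index", range(N_STEER * N_THROTTLE))
def test_round_trip(index):
    steer, throttle = decode(index, N_STEER, N_THROTTLE)
    assert encode(steer, throttle, N_STEER, N_THROTTLE) == index
```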

Stream 1C: First Real Autoresearch Run

  • 1C-01 — Run 50-trial autoresearch with real PPO training; verify models saved
  • 1C-02 — Save regression baseline: champion_reward_phase1.txt
  • 1C-03 — Push all results and models to Gitea
  • 1C-04 — Write Wave 1 process eval

Wave 2: Multi-Track Generalization

Goal: Champion model drives any track with mean_reward > 50.
Gate: Wave 1 champion achieves mean_reward > 100. Wave 1 process eval complete.
Status: ⏸️ Not started — blocked on Wave 1

  • 2-01 — Write evaluate_champion.py — load the champion model and evaluate it on a specified track
  • 2-02 — Implement a multi-track training curriculum (alternate training between 2 tracks)
  • 2-03 — Add a domain randomization wrapper (randomize road width, lighting) (see the sketch after this list)
  • 2-04 — Implement convergence detection in autoresearch (stop when the GP's posterior sigma collapses) (sketch below)
  • 2-05 — Add an automatic Gitea push every N trials
  • 2-06 — Evaluate the champion on an unseen track; record the generalization gap
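
For 2-03, randomization belongs at reset time so each episode sees one consistent world. A sketch only; the configuration hook and parameter names/ranges are entirely hypothetical, since the real sim's config channel must be checked first:

```python
# Sketch: reset-time domain randomization for 2-03.
# `set_config` and the parameter names/ranges are hypothetical placeholders;
# map them onto whatever configuration interface the sim actually exposes.
import random
import gym

class DomainRandomizationWrapper(gym.Wrapper):
    def reset(self, **kwargs):
        params = {
            "road_width": random.uniform(0.8, 1.2),     # relative scale, assumed
            "ambient_light": random.uniform(0.5, 1.5),  # brightness factor, assumed
        }
        sim = self.env.unwrapped
        if hasattr(sim, "set_config"):  # hypothetical config hook
            sim.set_config(params)
        return self.env.reset(**kwargs)
```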
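For 2-04, one workable criterion is that the GP's maximum posterior standard deviation over the candidate grid drops below a threshold, i.e. there is nothing left worth exploring. A sketch using scikit-learn's GaussianProcessRegressor (the threshold value is an assumption to tune):

```python
# Sketch: GP-sigma convergence check for 2-04.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def has_converged(gp: GaussianProcessRegressor, candidates: np.ndarray,
                  sigma_threshold: float = 0.05) -> bool:
    _, sigma = gp.predict(candidates, return_std=True)  # posterior std per candidate
    return float(sigma.max()) < sigma_threshold
```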

Wave 3: Racing / Speed Optimization

Goal: Fastest possible lap times on any track.
Gate: Wave 2 champion generalizes to ≥1 unseen track (mean_reward > 50).
Status: ⏸️ Not started — blocked on Wave 2

  • 3-01 — Implement lap time measurement and logging (sketch after this list)
  • 3-02 — Tune the reward function for pure speed (aggressive speed weight)
  • 3-03 — Fine-tune from the champion checkpoint on new tracks
  • 3-04 — Head-to-head: autoresearch champion vs. human-tuned baseline
  • 3-05 — Write the research report
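
For 3-01, if the sim exposes a lap counter in the step info dict, timing falls out of watching it increment. A sketch; the "lap_count" key is an assumption, and a waypoint or start-line check would substitute if no such field exists:

```python
# Sketch: lap timing wrapper for 3-01, classic gym 4-tuple step API assumed.
# The "lap_count" info key is an assumption about the sim's step output.
import time
import gym

class LapTimerWrapper(gym.Wrapper):
    def reset(self, **kwargs):
        self._lap_start = time.monotonic()
        self._last_lap = 0
        self.lap_times = []
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if info.get("lap_count", 0) > self._last_lap:
            now = time.monotonic()
            self.lap_times.append(now - self._lap_start)  # seconds for this lap
            self._lap_start = now
            self._last_lap = info["lap_count"]
        return obs, reward, done, info
```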

Completion Signals

The agent outputs one of these at the end of each iteration:

  • <promise>PLANNED</promise> — just created/updated the plan, ready to implement
  • <promise>DONE</promise> — all tasks in current wave complete
  • <promise>STUCK</promise> — needs human input (see ESCALATION REQUIRED block if present)
  • <promise>ERROR</promise> — unrecoverable error

Notes

  • Random policy data (300 trials): The existing autoresearch_results.jsonl contains rewards from random-action policy runs. These are valid for n_steer/n_throttle discretization insights but NOT for learning_rate optimization. Do not mix them with Phase 1 real-training results; create a separate results file: autoresearch_results_phase1.jsonl.
  • Model storage: Large CNN models (>100 MB) should be excluded from git or tracked with Git LFS. Add agent/models/**/*.zip to .gitignore if needed, and document the download location.
  • Simulator requirement: All live training tasks (1C-) require the DonkeyCar sim to be running on port 9091. Tests (1B-) do NOT require the simulator.