# Implementation Plan — DonkeyCar RL Autoresearch

> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.

---

## Wave 1: Real Training Foundation

**Goal:** Make the inner loop actually train and save models. Produce a real champion model.

**Gate:** champion model achieves mean_reward > 100 on the training track.

**Status:** 🟠 In progress

### Stream 1A: Core Runner Rebuild

- [ ] **1A-01** — Rebuild `donkeycar_sb3_runner.py` with real PPO training (`model.learn()`), model save, and proper evaluation (`evaluate_policy()`) (first sketch below)
- [ ] **1A-02** — Add `SpeedRewardWrapper` — reward = `speed * (1 - abs(cte)/max_cte)`; add `--reward-shaping` flag (second sketch below)
- [ ] **1A-03** — Add champion model tracking — write `champion_manifest.json` when a new best is found (third sketch below)
- [ ] **1A-04** — Fix autoresearch controller to pass `learning_rate`, `save_dir`, `reward_shaping` args to runner
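A minimal sketch of the 1A-01 runner core, assuming stable-baselines3's `PPO` and a gym-donkeycar environment; the env id, `conf` keys, and timestep budget below are illustrative, not the final interface:

```python
import gym
import gym_donkeycar  # noqa: F401  (registers the donkey-* env ids)
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def train_and_eval(learning_rate: float, save_dir: str,
                   timesteps: int = 100_000) -> tuple[float, float]:
    # The sim must already be running on port 9091 (see Notes below).
    env = gym.make("donkey-generated-track-v0",
                   conf={"host": "localhost", "port": 9091})
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=1)
    model.learn(total_timesteps=timesteps)   # real training, not a stub
    model.save(f"{save_dir}/ppo_donkey")     # persist the trained policy
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
    env.close()
    return mean_reward, std_reward
```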
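For 1A-02, a sketch of the reward-shaping wrapper, written against the classic gym 4-tuple `step()` API that gym-donkeycar uses and assuming the sim's `info` dict exposes `speed` and `cte`; `max_cte=5.0` is a placeholder for the sim's actual limit:

```python
import gym


class SpeedRewardWrapper(gym.Wrapper):
    """Reshape reward to speed * (1 - |cte|/max_cte): full credit for speed
    on the centerline, decaying linearly to zero at the track edge."""

    def __init__(self, env, max_cte: float = 5.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # discard the sim's reward
        speed = info.get("speed", 0.0)
        cte = info.get("cte", 0.0)
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))
        return obs, shaped, done, info
```

Keeping this behind the `--reward-shaping` flag preserves an unshaped baseline for comparison.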
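And for 1A-03, one shape the champion bookkeeping could take; the manifest fields are assumptions to pin down during implementation:

```python
import json
import time
from pathlib import Path


def maybe_update_champion(manifest_path: str, model_path: str,
                          mean_reward: float, params: dict) -> bool:
    """Overwrite champion_manifest.json iff this trial beats the stored best."""
    manifest = Path(manifest_path)
    if manifest.exists():
        best = json.loads(manifest.read_text()).get("mean_reward", float("-inf"))
        if mean_reward <= best:
            return False  # reigning champion keeps the title
    manifest.write_text(json.dumps({
        "model_path": model_path,
        "mean_reward": mean_reward,
        "params": params,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }, indent=2))
    return True
```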
### Stream 1B: Tests

- [ ] **1B-01** — Write `tests/test_discretize_action.py` — action encoding, decoding, round-trip (see the test sketch at the end of this plan)
- [ ] **1B-02** — Write `tests/test_autoresearch_controller.py` — GP fit, UCB computation, param round-trip, champion tracking
- [ ] **1B-03** — Write `tests/test_runner_integration.py` — mocked sim, training + save + eval cycle

### Stream 1C: First Real Autoresearch Run

- [ ] **1C-01** — Run 50-trial autoresearch with real PPO training; verify models saved
- [ ] **1C-02** — Save regression baseline: `champion_reward_phase1.txt`
- [ ] **1C-03** — Push all results and models to Gitea
- [ ] **1C-04** — Write Wave 1 process eval

---

## Wave 2: Multi-Track Generalization

**Goal:** Champion model drives any track with mean_reward > 50.

**Gate:** Wave 1 champion achieves mean_reward > 100. Wave 1 process eval complete.

**Status:** ⏸️ Not started — blocked on Wave 1

- [ ] **2-01** — Write `evaluate_champion.py` — load champion model, evaluate on specified track
- [ ] **2-02** — Implement multi-track training curriculum (train on 2 tracks alternately)
- [ ] **2-03** — Add domain randomization wrapper (randomize road width, lighting)
- [ ] **2-04** — Implement convergence detection in autoresearch (stop when GP sigma collapses; see the convergence sketch at the end of this plan)
- [ ] **2-05** — Add automatic Gitea push every N trials
- [ ] **2-06** — Evaluate champion on an unseen track; record the generalization gap

---

## Wave 3: Racing / Speed Optimization

**Goal:** Fastest possible lap times on any track.

**Gate:** Wave 2 champion generalizes to ≥1 unseen track (mean_reward > 50).

**Status:** ⏸️ Not started — blocked on Wave 2

- [ ] **3-01** — Implement lap time measurement and logging
- [ ] **3-02** — Tune reward function for pure speed (aggressive speed weight)
- [ ] **3-03** — Fine-tune from the champion checkpoint on new tracks
- [ ] **3-04** — Head-to-head: autoresearch champion vs human-tuned baseline
- [ ] **3-05** — Research writeup / report

---

## Completion Signals

The agent outputs one of these at the end of each iteration:

- `PLANNED` — just created/updated the plan, ready to implement
- `DONE` — all tasks in current wave complete
- `STUCK` — needs human input (see ESCALATION REQUIRED block if present)
- `ERROR` — unrecoverable error

---

## Notes

- **Random policy data (300 trials):** The existing `autoresearch_results.jsonl` contains rewards from random-action policy runs. These are valid for n_steer/n_throttle discretization insights but NOT for learning_rate optimization. Do not mix them with Phase 1 real training results; write Phase 1 results to a separate file, `autoresearch_results_phase1.jsonl`.
- **Model storage:** Large CNN models (>100MB) should be excluded from git or tracked with git LFS. Add `agent/models/**/*.zip` to `.gitignore` if needed, and document the download location.
- **Simulator requirement:** All live training tasks (1C-*) require the DonkeyCar sim running on port 9091. Tests (1B-*) do NOT require the simulator.
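---

For task 1B-01, a round-trip test sketch. The `encode_action`/`decode_action` names, signatures, and value ranges are hypothetical stand-ins for whatever `donkeycar_sb3_runner.py` actually exposes:

```python
import pytest

# Hypothetical interface: substitute the runner's real helpers.
from donkeycar_sb3_runner import decode_action, encode_action


@pytest.mark.parametrize("n_steer,n_throttle", [(3, 3), (5, 7), (9, 5)])
def test_action_round_trip(n_steer, n_throttle):
    for idx in range(n_steer * n_throttle):
        steer, throttle = decode_action(idx, n_steer, n_throttle)
        assert -1.0 <= steer <= 1.0    # steering assumed in [-1, 1]
        assert 0.0 <= throttle <= 1.0  # throttle assumed in [0, 1]
        assert encode_action(steer, throttle, n_steer, n_throttle) == idx
```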
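For task 2-04, a minimal convergence check, assuming the controller's GP is (or behaves like) scikit-learn's `GaussianProcessRegressor`; the `0.05` threshold and the candidate-grid argument are illustrative starting points:

```python
import numpy as np


def gp_converged(gp, candidate_grid: np.ndarray,
                 sigma_threshold: float = 0.05) -> bool:
    """Declare convergence when predictive uncertainty collapses everywhere:
    if even the most uncertain candidate falls below the threshold, further
    trials are unlikely to move the optimum."""
    _, sigma = gp.predict(candidate_grid, return_std=True)
    return float(np.max(sigma)) < sigma_threshold
```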