# Implementation Plan – DonkeyCar RL Autoresearch
> Agent: read this at the start of every iteration.
> Pick the first unchecked task in the current active wave.
> Mark done immediately after commit.
---
## Wave 1: Real Training Foundation
**Goal:** Make the inner loop actually train and save models. Produce a real champion model.
**Gate:** champion model achieves mean_reward > 100 on training track.
**Status:** 🔄 In progress
### Stream 1A: Core Runner Rebuild
- [ ] **1A-01** – Rebuild `donkeycar_sb3_runner.py` with real PPO training (`model.learn()`), model save, and proper evaluation (`evaluate_policy()`) – see the runner sketch after this list
- [ ] **1A-02** – Add `SpeedRewardWrapper` – reward = `speed * (1 - abs(cte)/max_cte)`; add `--reward-shaping` flag (wrapper sketch below)
- [ ] **1A-03** – Add champion model tracking – write `champion_manifest.json` whenever a new best is found (covered in the runner sketch)
- [ ] **1A-04** – Fix the autoresearch controller to pass `learning_rate`, `save_dir`, and `reward_shaping` args to the runner
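
A minimal sketch of what the rebuilt runner (1A-01) plus champion tracking (1A-03) could look like. It assumes an SB3/gym pairing that uses the classic 4-tuple `step` API (as with `gym_donkeycar`), an env id registered by `gym_donkeycar` (the exact id may differ), and the flag names from 1A-04; all paths and defaults are placeholders, not the final implementation.

```python
# donkeycar_sb3_runner.py sketch (hypothetical shape, not the final file)
import argparse
import json
from pathlib import Path

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

from speed_reward_wrapper import SpeedRewardWrapper  # hypothetical module; see the 1A-02 sketch below


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--save-dir", type=Path, default=Path("agent/models"))
    parser.add_argument("--reward-shaping", action="store_true")
    parser.add_argument("--timesteps", type=int, default=100_000)
    args = parser.parse_args()

    # Assumed env id; gym_donkeycar registers several "donkey-*" envs
    # and defaults to the local sim (port 9091 per the Notes section).
    env = gym.make("donkey-generated-track-v0")
    if args.reward_shaping:
        env = SpeedRewardWrapper(env)

    # Real training (1A-01): actually call learn() and persist the model.
    model = PPO("CnnPolicy", env, learning_rate=args.learning_rate, verbose=1)
    model.learn(total_timesteps=args.timesteps)

    args.save_dir.mkdir(parents=True, exist_ok=True)
    model_path = args.save_dir / "ppo_donkeycar.zip"
    model.save(model_path)

    # Proper evaluation (1A-01): mean/std reward over full episodes.
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)

    # Champion tracking (1A-03): persist the best run seen so far.
    manifest_path = args.save_dir / "champion_manifest.json"
    best = -float("inf")
    if manifest_path.exists():
        best = json.loads(manifest_path.read_text())["mean_reward"]
    if mean_reward > best:
        manifest_path.write_text(json.dumps({
            "model_path": str(model_path),
            "mean_reward": float(mean_reward),
            "std_reward": float(std_reward),
            "learning_rate": args.learning_rate,
        }, indent=2))


if __name__ == "__main__":
    main()
```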
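
And a sketch of the 1A-02 wrapper implementing the formula above. The `info` keys `"speed"` and `"cte"` and the `max_cte` default are assumptions based on the usual `gym_donkeycar` conventions; verify them against the actual env.

```python
import gym


class SpeedRewardWrapper(gym.Wrapper):
    """Reshape reward to speed * (1 - |cte|/max_cte), per 1A-02.

    Assumes the env reports speed and cross-track error in `info`
    under the keys "speed" and "cte" (gym_donkeycar convention).
    """

    def __init__(self, env: gym.Env, max_cte: float = 8.0):
        super().__init__(env)
        self.max_cte = max_cte  # assumed default; match the sim's off-track limit

    def step(self, action):
        obs, _reward, done, info = self.env.step(action)
        speed = info.get("speed", 0.0)
        cte = info.get("cte", 0.0)
        # Clamp at zero so the shaped reward never pays for driving off the track.
        shaped = speed * max(0.0, 1.0 - abs(cte) / self.max_cte)
        return obs, shaped, done, info
```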
### Stream 1B: Tests
- [ ] **1B-01** – Write `tests/test_discretize_action.py` – action encoding, decoding, round-trip (see the test sketch after this list)
- [ ] **1B-02** – Write `tests/test_autoresearch_controller.py` – GP fit, UCB computation, param round-trip, champion tracking
- [ ] **1B-03** – Write `tests/test_runner_integration.py` – mocked sim, training + save + eval cycle
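
For 1B-01, the core property to pin down is the round-trip. The module path and function names (`index_to_action`, `action_to_index`) below are hypothetical; adapt them to whatever the discretization helpers are actually called.

```python
# tests/test_discretize_action.py sketch; names are assumptions.
import pytest

from agent.discretize import action_to_index, index_to_action  # hypothetical


@pytest.mark.parametrize("n_steer,n_throttle", [(3, 3), (5, 4), (9, 7)])
def test_round_trip(n_steer, n_throttle):
    # Every discrete index should decode to a continuous (steer, throttle)
    # pair and re-encode to the same index.
    for idx in range(n_steer * n_throttle):
        steer, throttle = index_to_action(idx, n_steer, n_throttle)
        assert action_to_index(steer, throttle, n_steer, n_throttle) == idx
```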
### Stream 1C: First Real Autoresearch Run
- [ ] **1C-01** – Run 50-trial autoresearch with real PPO training; verify models are saved
- [ ] **1C-02** – Save regression baseline: `champion_reward_phase1.txt`
- [ ] **1C-03** – Push all results and models to Gitea
- [ ] **1C-04** – Write the Wave 1 process eval
---
## Wave 2: Multi-Track Generalization
**Goal:** Champion model drives any track with mean_reward > 50.
**Gate:** Wave 1 champion achieves mean_reward > 100. Wave 1 process eval complete.
**Status:** ⏸️ Not started – blocked on Wave 1
- [ ] **2-01** – Write `evaluate_champion.py` – load the champion model and evaluate it on a specified track (see the sketch after this list)
- [ ] **2-02** – Implement a multi-track training curriculum (alternate training between two tracks)
- [ ] **2-03** – Add a domain randomization wrapper (randomize road width, lighting)
- [ ] **2-04** – Implement convergence detection in autoresearch – stop when the GP posterior sigma collapses (see the convergence sketch below)
- [ ] **2-05** – Add an automatic Gitea push every N trials
- [ ] **2-06** – Evaluate the champion on an unseen track; record the generalization gap
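
A possible shape for `evaluate_champion.py` (2-01), reusing the manifest written in Wave 1. The default track env id is an assumption; `gym_donkeycar` registers several `donkey-*` envs, so pass whichever is installed.

```python
# evaluate_champion.py sketch (2-01); paths and env ids are placeholders.
import argparse
import json

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--manifest", default="agent/models/champion_manifest.json")
    parser.add_argument("--track", default="donkey-generated-roads-v0")
    parser.add_argument("--episodes", type=int, default=10)
    args = parser.parse_args()

    # Resolve the champion from the manifest written by the runner.
    with open(args.manifest) as f:
        manifest = json.load(f)
    model = PPO.load(manifest["model_path"])

    env = gym.make(args.track)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=args.episodes)
    print(f"{args.track}: mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")


if __name__ == "__main__":
    main()
```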
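
For 2-04, one simple convergence test, assuming the controller fits a scikit-learn `GaussianProcessRegressor` over a normalized parameter grid: declare convergence when the posterior standard deviation is small everywhere. The threshold is a placeholder to tune against real Phase 1 reward scales.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


def gp_converged(gp: GaussianProcessRegressor,
                 candidates: np.ndarray,
                 sigma_threshold: float = 0.05) -> bool:
    """Return True once the GP posterior standard deviation has collapsed
    below sigma_threshold (assumed units: normalized reward) across the
    whole candidate grid, i.e. the model is confident everywhere."""
    _mean, sigma = gp.predict(candidates, return_std=True)
    return bool(np.max(sigma) < sigma_threshold)
```

The autoresearch loop would call this after each `gp.fit(X, y)` and stop early instead of spending the remaining trial budget.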
---
## Wave 3: Racing / Speed Optimization
**Goal:** Fastest possible lap times on any track.
**Gate:** Wave 2 champion generalizes to ≥1 unseen track (mean_reward > 50).
**Status:** ⏸️ Not started – blocked on Wave 2
- [ ] **3-01** – Implement lap time measurement and logging
- [ ] **3-02** – Tune the reward function for pure speed (aggressive speed weight)
- [ ] **3-03** – Fine-tune from the champion checkpoint on new tracks (see the sketch after this list)
- [ ] **3-04** – Head-to-head: autoresearch champion vs. human-tuned baseline
- [ ] **3-05** – Write the research report
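
For 3-03, SB3 supports resuming training from a saved checkpoint directly; a sketch, with the env id and paths as placeholders:

```python
import gym
from stable_baselines3 import PPO

# Load the champion and continue training on a new track (3-03).
# reset_num_timesteps=False keeps the timestep counter and logging
# continuous across the fine-tuning run.
env = gym.make("donkey-mountain-track-v0")  # assumed env id
model = PPO.load("agent/models/ppo_donkeycar.zip", env=env)
model.learn(total_timesteps=50_000, reset_num_timesteps=False)
model.save("agent/models/ppo_donkeycar_finetuned.zip")
```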
---
## Completion Signals
The agent outputs one of these at the end of each iteration:
- `PLANNED` – just created/updated the plan, ready to implement
- `DONE` – all tasks in current wave complete
- `STUCK` – needs human input (see ESCALATION REQUIRED block if present)
- `ERROR` – unrecoverable error
---
## Notes
- **Random policy data (300 trials):** The existing `autoresearch_results.jsonl` contains rewards from random-action policy runs. These are valid for `n_steer`/`n_throttle` discretization insights but NOT for `learning_rate` optimization. Do not mix them with Phase 1 real-training results; create a separate results file: `autoresearch_results_phase1.jsonl`.
- **Model storage:** Large CNN models (>100 MB) should be excluded from git or tracked with git LFS. Add `agent/models/**/*.zip` to `.gitignore` if needed, and document the download location.
- **Simulator requirement:** All live training tasks (1C-*) require DonkeyCar sim running on port 9091. Tests (1B-*) do NOT require the simulator.