# Execution Board — Stream 1A: Core Runner Rebuild

**Feature:** Replace random-action inner loop with real PPO/DQN training, model save, and evaluated policy rewards
**Created:** 2026-04-13
**Branch:** main
**IMPLEMENTATION_PLAN tasks:** 1A-01, 1A-02, 1A-03, 1A-04
**Status:** 🟠 In progress

---

## 🎯 Goal

Rebuild `donkeycar_sb3_runner.py` so that every trial:

1. Trains a real PPO (or DQN) model using `model.learn(total_timesteps=N)`
2. Evaluates the trained model with `evaluate_policy()` (learned policy, NOT random)
3. Saves the model to disk
4. Tracks the champion model across all trials
5. Supports speed-aware reward shaping

---

## ⚠️ Dependencies

None — can start immediately.

---

## 📦 Packets

### Packet 1A-01 — Rebuild Runner with Real Training

**Status:** ⬜ Not started
**Est. effort:** 1 session
**Depends on:** none

**Goal:** Replace the random `env.action_space.sample()` loop with real `PPO.learn()` + `evaluate_policy()`; the core sequence is sketched after the steps below.

**Steps:**

1. Remove all legacy random-action loop code
2. Add `model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)` initialization
3. Add `model.learn(total_timesteps=timesteps)` training call
4. Add `mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)`
5. Add `model.save(save_dir)` — save after every successful training run
6. Print per-trial summary: timesteps, mean_reward, std_reward, save path
7. Keep `env.close()` + `time.sleep(2)` teardown (non-negotiable per ADR-006)
8. Add `--learning-rate` and `--save-dir` CLI args
9. Add DQN path: if `--agent dqn`, use DQN with DiscretizedActionWrapper
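As a reference for steps 2-7, here is a minimal sketch of the train/evaluate/save sequence using the standard Stable-Baselines3 API. `make_env()` is a hypothetical placeholder for however the runner constructs the DonkeyCar env, and `lr`, `timesteps`, `eval_episodes`, and `save_dir` are assumed to come from the parsed CLI args:

```python
# Sketch only. make_env() is a hypothetical helper; lr, timesteps,
# eval_episodes, and save_dir stand in for the parsed CLI args.
import os
import time

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = make_env()
try:
    # Steps 2-3: real training instead of env.action_space.sample()
    model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)
    model.learn(total_timesteps=timesteps)

    # Step 4: reward comes from the learned policy, not random episodes
    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=eval_episodes
    )

    # Step 5: SB3 appends .zip, so this writes save_dir/model.zip
    model.save(os.path.join(save_dir, 'model'))

    # Step 6: per-trial summary in the format the controller parses (1A-04)
    print(f'[SB3 Runner] mean_reward={mean_reward} std_reward={std_reward}')
finally:
    # Step 7: teardown is non-negotiable per ADR-006
    env.close()
    time.sleep(2)
```

Because `model` is assigned before anything is saved, this shape rules out the `NameError` the acceptance criteria guard against; a DQN branch (step 9) would differ only in constructing `DQN(...)` on a `DiscretizedActionWrapper`-wrapped env.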
**Files created/modified:**

- `agent/donkeycar_sb3_runner.py` — complete rebuild

**Known-answer tests:**

- PPO with 100 timesteps on a mocked env should produce a non-None model object
- A model saved to `save_dir/model.zip` should be loadable with `PPO.load()`

**Acceptance criteria:**

- [ ] Running `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model` with a live sim produces `/tmp/test-model/model.zip`
- [ ] `mean_reward` in output comes from `evaluate_policy()`, not random episodes
- [ ] Script exits with code 0 and calls `env.close()`
- [ ] `--learning-rate` flag is respected (check SB3 verbose output)
- [ ] No `NameError: name 'model' is not defined` is possible (`model` is always defined before save)

**Validation evidence:** `.harness/wave1-runner/validation/1A-01-validation.md`

---

### Packet 1A-02 — Speed-Aware Reward Wrapper

**Status:** ⬜ Not started
**Est. effort:** 1 session
**Depends on:** 1A-01

**Goal:** Add a `SpeedRewardWrapper` that replaces the default CTE-only reward with `speed * (1 - abs(cte)/max_cte)`.

**Steps:**

1. Create `agent/reward_wrapper.py` with `SpeedRewardWrapper(gym.Wrapper)`
2. In `step()`, extract `speed` and `cte` from the `info` dict (DonkeyCar provides these)
3. Compute the shaped reward: `speed * (1.0 - min(abs(cte)/max_cte, 1.0))`, minus a penalty on crash
4. Add a `--reward-shaping` boolean flag to the runner CLI
5. Apply the wrapper in the runner if the flag is set: `env = SpeedRewardWrapper(env, max_cte=8.0)`
6. Log which reward mode is active at startup

**Files created/modified:**

- `agent/reward_wrapper.py` — new file
- `agent/donkeycar_sb3_runner.py` — add `--reward-shaping` flag and wrapper application

**Acceptance criteria:**

- [ ] `SpeedRewardWrapper` replaces the reward when `--reward-shaping` is set
- [ ] Default behavior unchanged when the flag is not set
- [ ] Wrapper handles missing `speed` or `cte` in `info` gracefully (falls back to the original reward)
- [ ] Unit test passes without the simulator (mocked `info` dict)

**Validation evidence:** `.harness/wave1-runner/validation/1A-02-validation.md`

---

### Packet 1A-03 — Champion Model Tracking

**Status:** ⬜ Not started
**Est. effort:** 0.5 sessions
**Depends on:** 1A-01

**Goal:** Track the best model across all trials; maintain `agent/models/champion/` with the current best.

**Steps:**

1. After each trial, read `agent/models/champion/manifest.json` (if it exists) to get the current best reward
2. If the new `mean_reward > current_best_reward`, copy the model to `agent/models/champion/model.zip`
3. Write an updated `manifest.json`: `{trial, timestamp, params, mean_reward, model_path}`
4. Log `[CHAMPION] New best: mean_reward=X params=Y` to the console and the autoresearch log
5. Add a `champion` boolean field to the JSONL result record

**Files created/modified:**

- `agent/autoresearch_controller.py` — add champion tracking logic
- `agent/models/champion/` — directory for the champion model + manifest

**Known-answer tests:**

```python
# Rewards [50, 80, 60, 90, 70] → champion updates at trials 1, 2, 4
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed
```
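A minimal sketch of a `ChampionTracker` that would satisfy the test above, assuming the manifest layout from step 3. Copying `model.zip` into the champion directory (step 2) and the `trial` manifest field are left to the controller, since the test supplies neither:

```python
# Hypothetical sketch: persists only manifest.json; the model-file copy
# (step 2) is left to the controller.
import json
import os
import time


class ChampionTracker:
    def __init__(self, champion_dir):
        self.champion_dir = champion_dir
        os.makedirs(champion_dir, exist_ok=True)
        self.manifest_path = os.path.join(champion_dir, 'manifest.json')

    def _best_reward(self):
        # Step 1: read the current best from manifest.json, if present
        if os.path.exists(self.manifest_path):
            with open(self.manifest_path) as f:
                return json.load(f)['mean_reward']
        return float('-inf')

    def update_if_better(self, mean_reward, params, model_path):
        # Step 2: only a strictly better mean_reward displaces the champion
        if mean_reward <= self._best_reward():
            return False
        # Step 3: rewrite the manifest for the new champion
        manifest = {
            'timestamp': time.time(),
            'params': params,
            'mean_reward': mean_reward,
            'model_path': model_path,
        }
        with open(self.manifest_path, 'w') as f:
            json.dump(manifest, f, indent=2)
        return True
```

The strict `>` comparison means ties and regressions keep the existing champion, which is what the known-answer sequence expects: 60 and 70 never displace 80 and 90.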
**Acceptance criteria:**

- [ ] Champion manifest updated whenever a new best reward is found
- [ ] `agent/models/champion/model.zip` always contains the best model seen
- [ ] `champion` field in the JSONL is `true` for the best trial, `false` otherwise
- [ ] Known-answer champion tracking test passes

**Validation evidence:** `.harness/wave1-runner/validation/1A-03-validation.md`

---

### Packet 1A-04 — Autoresearch Controller Wiring

**Status:** ⬜ Not started
**Est. effort:** 0.5 sessions
**Depends on:** 1A-01, 1A-03

**Goal:** Update `autoresearch_controller.py` to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.

**Steps:**

1. Add `timesteps` to the GP search space: `{'type': 'int', 'min': 5000, 'max': 30000}`
2. Pass `--learning-rate`, `--save-dir`, and `--reward-shaping` to the runner subprocess
3. Save new results to `autoresearch_results_phase1.jsonl` (do NOT mix with random-policy data)
4. Parse `mean_reward` from the `[SB3 Runner] mean_reward=X` output line
5. Parse `std_reward` from the `[SB3 Runner] std_reward=X` output line (add to runner output)
6. Add a `--push-every N` flag: git add + commit + push every N trials
7. Add `--min-trials-before-gp 3` (default): use random sampling for the first N trials

**Files created/modified:**

- `agent/autoresearch_controller.py` — wire up new args, new results file, push support

**Acceptance criteria:**

- [ ] Phase 1 results go to `autoresearch_results_phase1.jsonl` only
- [ ] `learning_rate` arg is passed to and used by the runner
- [ ] `save_dir` is a trial-specific path: `agent/models/trial-{trial_number:04d}`
- [ ] Git push happens every N trials if `--push-every N` is set
- [ ] Random proposal is used for the first `min_trials_before_gp` trials

**Validation evidence:** `.harness/wave1-runner/validation/1A-04-validation.md`

---

## 🔢 Dependency Order

```
1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)
```

1A-02 and 1A-03 can run in parallel after 1A-01.

---

## 🏁 Stream Completion Criteria

- [ ] All 4 packets complete with validation evidence written
- [ ] `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t` produces a saved model
- [ ] `pytest tests/ -v` — stream 1A tests pass (once 1B is done)
- [ ] No `NameError: name 'model' is not defined` is possible in any code path
- [ ] Champion tracking works: `manifest.json` updated correctly
- [ ] IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked `[x]`
- [ ] EXECUTION_MASTER updated

---

## 📋 Mandatory Commit Trailer Format

```
feat(runner): 1A-NN —

Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A
```