# Execution Board — Stream 1A: Core Runner Rebuild

**Feature:** Replace the random-action inner loop with real PPO/DQN training, model saving, and evaluated-policy rewards

**Created:** 2026-04-13

**Branch:** main

**IMPLEMENTATION_PLAN tasks:** 1A-01, 1A-02, 1A-03, 1A-04

**Status:** 🟠 In progress

---

## 🎯 Goal

Rebuild `donkeycar_sb3_runner.py` so that every trial:

1. Trains a real PPO (or DQN) model using `model.learn(total_timesteps=N)`
2. Evaluates the trained model with `evaluate_policy()` (learned policy, NOT random)
3. Saves the model to disk
4. Tracks the champion model across all trials
5. Supports speed-aware reward shaping

---

## ⚠️ Dependencies

None — can start immediately.

---

## 📦 Packets

### Packet 1A-01 — Rebuild Runner with Real Training

**Status:** ⬜ Not started

**Est. effort:** 1 session

**Depends on:** none

**Goal:** Replace the random `env.action_space.sample()` loop with real `PPO.learn()` + `evaluate_policy()`.

**Steps:**

1. Remove all legacy random-action loop code
2. Add `model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)` initialization
3. Add the `model.learn(total_timesteps=timesteps)` training call
4. Add `mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)`
5. Add `model.save(save_dir)` — save after every successful training run
6. Print a per-trial summary: timesteps, mean_reward, std_reward, save path
7. Keep the `env.close()` + `time.sleep(2)` teardown (non-negotiable per ADR-006)
8. Add `--learning-rate` and `--save-dir` CLI args
9. Add a DQN path: if `--agent dqn`, use DQN with `DiscretizedActionWrapper`

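A minimal sketch of the trial core (steps 2 through 7), assuming Stable-Baselines3; env creation, CLI parsing, and the `run_trial` name are illustrative, not the final implementation:

```python
# Sketch only: function name and argument plumbing are assumptions.
import time

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, lr, timesteps, eval_episodes, save_dir):
    try:
        model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)  # step 2
        model.learn(total_timesteps=timesteps)                      # step 3
        mean_reward, std_reward = evaluate_policy(                  # step 4
            model, env, n_eval_episodes=eval_episodes)
        model.save(save_dir)                                        # step 5
        print(f"[SB3 Runner] mean_reward={mean_reward} "            # step 6
              f"std_reward={std_reward} saved={save_dir}")
        return mean_reward, std_reward
    finally:
        env.close()    # step 7: teardown is non-negotiable (ADR-006)
        time.sleep(2)
```

Because `model` is assigned before `save()` in straight-line code, the `NameError` failure mode named in the acceptance criteria cannot occur, and the `finally` block keeps the ADR-006 teardown on every code path.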

**Files created/modified:**

- `agent/donkeycar_sb3_runner.py` — complete rebuild

**Known-answer tests:**

- PPO trained for 100 timesteps on a mocked env should produce a non-None model object
- A model saved to `save_dir/model.zip` should be loadable with `PPO.load()`

**Acceptance criteria:**

- [ ] Running `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model` with a live sim produces `/tmp/test-model/model.zip`
- [ ] `mean_reward` in the output comes from `evaluate_policy()`, not from random episodes
- [ ] Script exits with code 0 and calls `env.close()`
- [ ] The `--learning-rate` flag is respected (check SB3 verbose output)
- [ ] No `NameError: name 'model' is not defined` is possible (`model` is always defined before save)

**Validation evidence:** `.harness/wave1-runner/validation/1A-01-validation.md`

---

### Packet 1A-02 — Speed-Aware Reward Wrapper

**Status:** ⬜ Not started

**Est. effort:** 1 session

**Depends on:** 1A-01

**Goal:** Add a `SpeedRewardWrapper` that replaces the default CTE-only reward with `speed * (1 - abs(cte)/max_cte)`.

**Steps:**

1. Create `agent/reward_wrapper.py` with `SpeedRewardWrapper(gym.Wrapper)`
2. In `step()`, extract `speed` and `cte` from the `info` dict (DonkeyCar provides these)
3. Compute the shaped reward, `speed * (1.0 - min(abs(cte)/max_cte, 1.0))`, minus a penalty on crash
4. Add a `--reward-shaping` boolean flag to the runner CLI
5. Apply the wrapper in the runner when the flag is set: `env = SpeedRewardWrapper(env, max_cte=8.0)`
6. Log which reward mode is active at startup

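A minimal sketch of the wrapper, assuming the classic Gym 4-tuple `step()` API; the crash check and penalty value are assumptions, not settled design:

```python
# Sketch only: the 'hit' key check and crash_penalty default are assumptions.
import gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, max_cte=8.0, crash_penalty=-10.0):
        super().__init__(env)
        self.max_cte = max_cte
        self.crash_penalty = crash_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed, cte = info.get('speed'), info.get('cte')
        if speed is None or cte is None:
            # Missing telemetry: fall back to the original reward
            return obs, reward, done, info
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))  # step 3
        if info.get('hit', 'none') != 'none':  # crash detection via 'hit' is an assumption
            shaped += self.crash_penalty
        return obs, shaped, done, info
```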

**Files created/modified:**

- `agent/reward_wrapper.py` — new file
- `agent/donkeycar_sb3_runner.py` — add the `--reward-shaping` flag and wrapper application

**Acceptance criteria:**

- [ ] `SpeedRewardWrapper` replaces the reward when `--reward-shaping` is set
- [ ] Default behavior is unchanged when the flag is not set
- [ ] Wrapper handles missing `speed` or `cte` in `info` gracefully (falls back to the original reward)
- [ ] Unit test passes without the simulator (mocked `info` dict; see the sketch below)

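One way that simulator-free test could look, with a stub env standing in for DonkeyCar (all names and canned values here are illustrative):

```python
# Illustrative unit test: a stub env supplies a canned info dict, no sim needed.
import gym
import numpy as np

class StubEnv(gym.Env):
    observation_space = gym.spaces.Box(0, 255, (120, 160, 3), np.uint8)
    action_space = gym.spaces.Box(-1.0, 1.0, (2,), np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        info = {'speed': 2.0, 'cte': 4.0}  # canned telemetry
        return self.observation_space.sample(), 1.0, False, info

def test_speed_reward_shaping():
    env = SpeedRewardWrapper(StubEnv(), max_cte=8.0)
    _, reward, _, _ = env.step(env.action_space.sample())
    assert abs(reward - 1.0) < 1e-6  # 2.0 * (1 - 4.0/8.0) == 1.0
```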

**Validation evidence:** `.harness/wave1-runner/validation/1A-02-validation.md`

---

### Packet 1A-03 — Champion Model Tracking

**Status:** ⬜ Not started

**Est. effort:** 0.5 sessions

**Depends on:** 1A-01

**Goal:** Track the best model across all trials; maintain `agent/models/champion/` with the current best.

**Steps:**

1. After each trial, read `agent/models/champion/manifest.json` (if it exists) to get the current best reward
2. If the new `mean_reward > current_best_reward`, copy the model to `agent/models/champion/model.zip`
3. Write an updated `manifest.json`: `{trial, timestamp, params, mean_reward, model_path}`
4. Log `[CHAMPION] New best: mean_reward=X params=Y` to the console and the autoresearch log
5. Add a `champion` boolean field to the JSONL result record


**Files created/modified:**

- `agent/autoresearch_controller.py` — add champion tracking logic
- `agent/models/champion/` — directory for the champion model + manifest

**Known-answer tests:**

```python
# Rewards [50, 80, 60, 90, 70] → champion updates at trials 1, 2, 4 (1-indexed)
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed
```

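A sketch of a `ChampionTracker` that would satisfy this test; `update_if_better` and its arguments come from the test above, the manifest fields follow step 3, and everything else is illustrative:

```python
# Sketch only: file layout and timestamp format are assumptions.
import json
import time
from pathlib import Path

class ChampionTracker:
    def __init__(self, champion_dir):
        self.dir = Path(champion_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.manifest = self.dir / 'manifest.json'

    def best_reward(self):
        # Step 1: read the current best from manifest.json, if it exists
        if self.manifest.exists():
            return json.loads(self.manifest.read_text())['mean_reward']
        return float('-inf')

    def update_if_better(self, mean_reward, params, model_path, trial=None):
        # Step 2: only a strictly better mean_reward dethrones the champion
        if mean_reward <= self.best_reward():
            return False
        # Step 3: persist the new champion's metadata
        self.manifest.write_text(json.dumps({
            'trial': trial, 'timestamp': time.time(),
            'params': params, 'mean_reward': mean_reward,
            'model_path': str(model_path),
        }))
        # Copying model.zip into the champion dir is elided here.
        return True
```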

**Acceptance criteria:**

- [ ] Champion manifest is updated whenever a new best reward is found
- [ ] `agent/models/champion/model.zip` always contains the best model seen so far
- [ ] The `champion` field in the JSONL is `true` for the best trial, `false` otherwise
- [ ] Known-answer champion tracking test passes

**Validation evidence:** `.harness/wave1-runner/validation/1A-03-validation.md`

---

### Packet 1A-04 — Autoresearch Controller Wiring

**Status:** ⬜ Not started

**Est. effort:** 0.5 sessions

**Depends on:** 1A-01, 1A-03

**Goal:** Update `autoresearch_controller.py` to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.

**Steps:**

1. Add `timesteps` to the GP search space: `{'type': 'int', 'min': 5000, 'max': 30000}`
2. Pass `--learning-rate`, `--save-dir`, and `--reward-shaping` to the runner subprocess
3. Save new results to `autoresearch_results_phase1.jsonl` (do NOT mix with random-policy data)
4. Parse `mean_reward` from the `[SB3 Runner] mean_reward=X` output line
5. Parse `std_reward` from the `[SB3 Runner] std_reward=X` output line (add it to the runner output)
6. Add a `--push-every N` flag: git add + commit + push every N trials
7. Add `--min-trials-before-gp 3` (default): use random sampling for the first N trials

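A minimal sketch of steps 2, 4, and 5, assuming the `[SB3 Runner] key=value` output format from 1A-01; the function name and parameter dict layout are illustrative:

```python
# Sketch only: launch the rebuilt runner for one trial and parse its stdout.
import re
import subprocess

def run_runner_trial(trial_number: int, params: dict) -> dict:
    cmd = [
        'python3', 'agent/donkeycar_sb3_runner.py',
        '--agent', 'ppo',
        '--timesteps', str(params['timesteps']),
        '--learning-rate', str(params['learning_rate']),
        '--save-dir', f'agent/models/trial-{trial_number:04d}',  # trial-specific path
    ]
    if params.get('reward_shaping'):
        cmd.append('--reward-shaping')
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    metrics = {}
    for key in ('mean_reward', 'std_reward'):  # steps 4-5
        m = re.search(rf'\[SB3 Runner\] {key}=([-+0-9.eE]+)', out)
        if m:
            metrics[key] = float(m.group(1))
    return metrics
```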

**Files created/modified:**

- `agent/autoresearch_controller.py` — wire up the new args, the new results file, and push support

**Acceptance criteria:**

- [ ] Phase 1 results go to `autoresearch_results_phase1.jsonl` only
- [ ] The `learning_rate` arg is passed to and used by the runner
- [ ] `save_dir` is a trial-specific path: `agent/models/trial-{trial_number:04d}`
- [ ] Git push happens every N trials when `--push-every N` is set
- [ ] Random proposals are used for the first `min_trials_before_gp` trials

**Validation evidence:** `.harness/wave1-runner/validation/1A-04-validation.md`

---

## 🔢 Dependency Order

```
1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)
```

1A-02 and 1A-03 can run in parallel after 1A-01.

---

## 🏁 Stream Completion Criteria

- [ ] All 4 packets complete, with validation evidence written
- [ ] `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t` produces a saved model
- [ ] `pytest tests/ -v` — stream 1A tests pass (once 1B is done)
- [ ] No `NameError: name 'model' is not defined` is possible in any code path
- [ ] Champion tracking works: `manifest.json` is updated correctly
- [ ] IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked `[x]`
- [ ] EXECUTION_MASTER updated

---

## 📋 Mandatory Commit Trailer Format

```
feat(runner): 1A-NN — <description>

Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A
```