# Execution Board — Stream 1A: Core Runner Rebuild

**Feature:** Replace the random-action inner loop with real PPO/DQN training, model saving, and evaluated-policy rewards

**Created:** 2026-04-13

**Branch:** main

**IMPLEMENTATION_PLAN tasks:** 1A-01, 1A-02, 1A-03, 1A-04

**Status:** 🟠 In progress

---

## 🎯 Goal

Rebuild `donkeycar_sb3_runner.py` so that every trial:

1. Trains a real PPO (or DQN) model using `model.learn(total_timesteps=N)`
2. Evaluates the trained model with `evaluate_policy()` (learned policy, NOT random)
3. Saves the model to disk
4. Tracks the champion model across all trials
5. Supports speed-aware reward shaping

---

## ⚠️ Dependencies

None — can start immediately.

---

## 📦 Packets

### Packet 1A-01 — Rebuild Runner with Real Training

**Status:** ⬜ Not started

**Est. effort:** 1 session

**Depends on:** none

**Goal:** Replace the random `env.action_space.sample()` loop with real `PPO.learn()` + `evaluate_policy()`.

**Steps:**

1. Remove all legacy random-action loop code
2. Add `model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)` initialization
3. Add the `model.learn(total_timesteps=timesteps)` training call
4. Add `mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)`
5. Add `model.save(save_dir)` — save after every successful training run
6. Print a per-trial summary: timesteps, mean_reward, std_reward, save path
7. Keep the `env.close()` + `time.sleep(2)` teardown (non-negotiable per ADR-006)
8. Add `--learning-rate` and `--save-dir` CLI args
9. Add a DQN path: if `--agent dqn`, use DQN with `DiscretizedActionWrapper`

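A minimal sketch of the trial core (steps 2 through 7), assuming Stable-Baselines3; env creation, CLI parsing, and the `run_trial` name are illustrative, not the final implementation:

```python
# Sketch only: function name and argument plumbing are assumptions.
import time

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, lr, timesteps, eval_episodes, save_dir):
    try:
        model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)  # step 2
        model.learn(total_timesteps=timesteps)                      # step 3
        mean_reward, std_reward = evaluate_policy(                  # step 4
            model, env, n_eval_episodes=eval_episodes)
        model.save(save_dir)                                        # step 5
        print(f"[SB3 Runner] mean_reward={mean_reward} "            # step 6
              f"std_reward={std_reward} saved={save_dir}")
        return mean_reward, std_reward
    finally:
        env.close()    # step 7: teardown is non-negotiable (ADR-006)
        time.sleep(2)
```

Because `model` is assigned before `save()` in straight-line code, the `NameError` failure mode named in the acceptance criteria cannot occur, and the `finally` block keeps the ADR-006 teardown on every code path.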

**Files created/modified:**

- `agent/donkeycar_sb3_runner.py` — complete rebuild

**Known-answer tests:**

- PPO trained for 100 timesteps on a mocked env should produce a non-None model object
- A model saved to `save_dir/model.zip` should be loadable with `PPO.load()`

**Acceptance criteria:**

- [ ] Running `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model` with a live sim produces `/tmp/test-model/model.zip`
- [ ] `mean_reward` in the output comes from `evaluate_policy()`, not from random episodes
- [ ] Script exits with code 0 and calls `env.close()`
- [ ] The `--learning-rate` flag is respected (check SB3 verbose output)
- [ ] No `NameError: name 'model' is not defined` is possible (`model` is always defined before save)

**Validation evidence:** `.harness/wave1-runner/validation/1A-01-validation.md`

---

### Packet 1A-02 — Speed-Aware Reward Wrapper

**Status:** ⬜ Not started

**Est. effort:** 1 session

**Depends on:** 1A-01

**Goal:** Add a `SpeedRewardWrapper` that replaces the default CTE-only reward with `speed * (1 - abs(cte)/max_cte)`.

**Steps:**

1. Create `agent/reward_wrapper.py` with `SpeedRewardWrapper(gym.Wrapper)`
2. In `step()`, extract `speed` and `cte` from the `info` dict (DonkeyCar provides these)
3. Compute the shaped reward, `speed * (1.0 - min(abs(cte)/max_cte, 1.0))`, minus a penalty on crash
4. Add a `--reward-shaping` boolean flag to the runner CLI
5. Apply the wrapper in the runner when the flag is set: `env = SpeedRewardWrapper(env, max_cte=8.0)`
6. Log which reward mode is active at startup

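A minimal sketch of the wrapper, assuming the classic Gym 4-tuple `step()` API; the crash check and penalty value are assumptions, not settled design:

```python
# Sketch only: the 'hit' key check and crash_penalty default are assumptions.
import gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, max_cte=8.0, crash_penalty=-10.0):
        super().__init__(env)
        self.max_cte = max_cte
        self.crash_penalty = crash_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed, cte = info.get('speed'), info.get('cte')
        if speed is None or cte is None:
            # Missing telemetry: fall back to the original reward
            return obs, reward, done, info
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))  # step 3
        if info.get('hit', 'none') != 'none':  # crash detection via 'hit' is an assumption
            shaped += self.crash_penalty
        return obs, shaped, done, info
```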

**Files created/modified:**

- `agent/reward_wrapper.py` — new file
- `agent/donkeycar_sb3_runner.py` — add the `--reward-shaping` flag and wrapper application

**Acceptance criteria:**

- [ ] `SpeedRewardWrapper` replaces the reward when `--reward-shaping` is set
- [ ] Default behavior is unchanged when the flag is not set
- [ ] Wrapper handles missing `speed` or `cte` in `info` gracefully (falls back to the original reward)
- [ ] Unit test passes without the simulator (mocked `info` dict; see the sketch below)

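One way that simulator-free test could look, with a stub env standing in for DonkeyCar (all names and canned values here are illustrative):

```python
# Illustrative unit test: a stub env supplies a canned info dict, no sim needed.
import gym
import numpy as np

class StubEnv(gym.Env):
    observation_space = gym.spaces.Box(0, 255, (120, 160, 3), np.uint8)
    action_space = gym.spaces.Box(-1.0, 1.0, (2,), np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        info = {'speed': 2.0, 'cte': 4.0}  # canned telemetry
        return self.observation_space.sample(), 1.0, False, info

def test_speed_reward_shaping():
    env = SpeedRewardWrapper(StubEnv(), max_cte=8.0)
    _, reward, _, _ = env.step(env.action_space.sample())
    assert abs(reward - 1.0) < 1e-6  # 2.0 * (1 - 4.0/8.0) == 1.0
```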

**Validation evidence:** `.harness/wave1-runner/validation/1A-02-validation.md`

---

### Packet 1A-03 — Champion Model Tracking

**Status:** ⬜ Not started

**Est. effort:** 0.5 sessions

**Depends on:** 1A-01

**Goal:** Track the best model across all trials; maintain `agent/models/champion/` with the current best.

**Steps:**

1. After each trial, read `agent/models/champion/manifest.json` (if it exists) to get the current best reward
2. If the new `mean_reward > current_best_reward`, copy the model to `agent/models/champion/model.zip`
3. Write an updated `manifest.json`: `{trial, timestamp, params, mean_reward, model_path}`
4. Log `[CHAMPION] New best: mean_reward=X params=Y` to the console and the autoresearch log
5. Add a `champion` boolean field to the JSONL result record


**Files created/modified:**

- `agent/autoresearch_controller.py` — add champion tracking logic
- `agent/models/champion/` — directory for the champion model + manifest

**Known-answer tests:**

```python
# Rewards [50, 80, 60, 90, 70] → champion updates at trials 1, 2, 4 (1-indexed)
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed
```

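A sketch of a `ChampionTracker` that would satisfy this test; `update_if_better` and its arguments come from the test above, the manifest fields follow step 3, and everything else is illustrative:

```python
# Sketch only: file layout and timestamp format are assumptions.
import json
import time
from pathlib import Path

class ChampionTracker:
    def __init__(self, champion_dir):
        self.dir = Path(champion_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.manifest = self.dir / 'manifest.json'

    def best_reward(self):
        # Step 1: read the current best from manifest.json, if it exists
        if self.manifest.exists():
            return json.loads(self.manifest.read_text())['mean_reward']
        return float('-inf')

    def update_if_better(self, mean_reward, params, model_path, trial=None):
        # Step 2: only a strictly better mean_reward dethrones the champion
        if mean_reward <= self.best_reward():
            return False
        # Step 3: persist the new champion's metadata
        self.manifest.write_text(json.dumps({
            'trial': trial, 'timestamp': time.time(),
            'params': params, 'mean_reward': mean_reward,
            'model_path': str(model_path),
        }))
        # Copying model.zip into the champion dir is elided here.
        return True
```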

**Acceptance criteria:**

- [ ] Champion manifest is updated whenever a new best reward is found
- [ ] `agent/models/champion/model.zip` always contains the best model seen so far
- [ ] The `champion` field in the JSONL is `true` for the best trial, `false` otherwise
- [ ] Known-answer champion tracking test passes

**Validation evidence:** `.harness/wave1-runner/validation/1A-03-validation.md`

---

### Packet 1A-04 — Autoresearch Controller Wiring

**Status:** ⬜ Not started

**Est. effort:** 0.5 sessions

**Depends on:** 1A-01, 1A-03

**Goal:** Update `autoresearch_controller.py` to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.

**Steps:**

1. Add `timesteps` to the GP search space: `{'type': 'int', 'min': 5000, 'max': 30000}`
2. Pass `--learning-rate`, `--save-dir`, and `--reward-shaping` to the runner subprocess
3. Save new results to `autoresearch_results_phase1.jsonl` (do NOT mix with random-policy data)
4. Parse `mean_reward` from the `[SB3 Runner] mean_reward=X` output line
5. Parse `std_reward` from the `[SB3 Runner] std_reward=X` output line (add it to the runner output)
6. Add a `--push-every N` flag: git add + commit + push every N trials
7. Add `--min-trials-before-gp 3` (default): use random sampling for the first N trials

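A minimal sketch of steps 2, 4, and 5, assuming the `[SB3 Runner] key=value` output format from 1A-01; the function name and parameter dict layout are illustrative:

```python
# Sketch only: launch the rebuilt runner for one trial and parse its stdout.
import re
import subprocess

def run_runner_trial(trial_number: int, params: dict) -> dict:
    cmd = [
        'python3', 'agent/donkeycar_sb3_runner.py',
        '--agent', 'ppo',
        '--timesteps', str(params['timesteps']),
        '--learning-rate', str(params['learning_rate']),
        '--save-dir', f'agent/models/trial-{trial_number:04d}',  # trial-specific path
    ]
    if params.get('reward_shaping'):
        cmd.append('--reward-shaping')
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    metrics = {}
    for key in ('mean_reward', 'std_reward'):  # steps 4-5
        m = re.search(rf'\[SB3 Runner\] {key}=([-+0-9.eE]+)', out)
        if m:
            metrics[key] = float(m.group(1))
    return metrics
```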

**Files created/modified:**

- `agent/autoresearch_controller.py` — wire up the new args, the new results file, and push support

**Acceptance criteria:**

- [ ] Phase 1 results go to `autoresearch_results_phase1.jsonl` only
- [ ] The `learning_rate` arg is passed to and used by the runner
- [ ] `save_dir` is a trial-specific path: `agent/models/trial-{trial_number:04d}`
- [ ] Git push happens every N trials when `--push-every N` is set
- [ ] Random proposals are used for the first `min_trials_before_gp` trials

**Validation evidence:** `.harness/wave1-runner/validation/1A-04-validation.md`

---

## 🔢 Dependency Order

```
1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)
```

1A-02 and 1A-03 can run in parallel after 1A-01.

---

## 🏁 Stream Completion Criteria

- [ ] All 4 packets complete, with validation evidence written
- [ ] `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t` produces a saved model
- [ ] `pytest tests/ -v` — stream 1A tests pass (once 1B is done)
- [ ] No `NameError: name 'model' is not defined` is possible in any code path
- [ ] Champion tracking works: `manifest.json` is updated correctly
- [ ] IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked `[x]`
- [ ] EXECUTION_MASTER updated

---

## 📋 Mandatory Commit Trailer Format

```
feat(runner): 1A-NN — <description>

Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A
```