# Execution Board — Stream 1A: Core Runner Rebuild
Feature: Replace random-action inner loop with real PPO/DQN training, model save, and evaluated policy rewards
Created: 2026-04-13
Branch: main
IMPLEMENTATION_PLAN tasks: 1A-01, 1A-02, 1A-03, 1A-04
Status: 🟠 In progress
## 🎯 Goal
Rebuild `donkeycar_sb3_runner.py` so that every trial:
- Trains a real PPO (or DQN) model using `model.learn(total_timesteps=N)`
- Evaluates the trained model with `evaluate_policy()` (learned policy, NOT random)
- Saves the model to disk
- Tracks the champion model across all trials
- Supports speed-aware reward shaping
## ⚠️ Dependencies
None — can start immediately.
## 📦 Packets
### Packet 1A-01 — Rebuild Runner with Real Training
Status: ⬜ Not started
Est. effort: 1 session
Depends on: none
Goal: Replace the random `env.action_space.sample()` loop with real `PPO.learn()` + `evaluate_policy()`.
Steps:
- Remove all legacy random-action loop code
- Add `model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)` initialization
- Add `model.learn(total_timesteps=timesteps)` training call
- Add `mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)`
- Add `model.save(save_dir)` — save after every successful training run
- Print per-trial summary: timesteps, mean_reward, std_reward, save path
- Keep `env.close()` + `time.sleep(2)` teardown (non-negotiable per ADR-006)
- Add `--learning-rate` and `--save-dir` CLI args
- Add DQN path: if `--agent dqn`, use DQN with `DiscretizedActionWrapper` (sketched below, after the file list)
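A minimal sketch of the per-trial train → evaluate → save flow (the `run_trial` name, the 5-episode eval default, and saving to `save_dir/model` so SB3 emits `save_dir/model.zip` are illustrative choices, not fixed by the plan):

```python
import os
import time

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, timesteps, lr, save_dir, eval_episodes=5):
    # model is assigned before anything else can fail, so it is always
    # defined by the time we save (no NameError path).
    model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)
    model.learn(total_timesteps=timesteps)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)
    # Saving to save_dir/model makes SB3 write save_dir/model.zip.
    model.save(os.path.join(save_dir, 'model'))
    print(f'[SB3 Runner] mean_reward={mean_reward}')
    print(f'[SB3 Runner] std_reward={std_reward}')
    env.close()
    time.sleep(2)  # teardown delay, non-negotiable per ADR-006
    return mean_reward, std_reward
```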
Files created/modified:
- `agent/donkeycar_sb3_runner.py` — complete rebuild
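For the DQN path, one possible shape for `DiscretizedActionWrapper`, assuming a 2-D continuous (steering, throttle) `Box` action space; the bin count and fixed throttle value are placeholder choices:

```python
import gym
import numpy as np

class DiscretizedActionWrapper(gym.ActionWrapper):
    def __init__(self, env, n_steering=7, throttle=0.3):
        super().__init__(env)
        # Evenly spaced steering bins across the continuous steering range.
        low, high = env.action_space.low[0], env.action_space.high[0]
        self.steering_values = np.linspace(low, high, n_steering)
        self.throttle = throttle
        self.action_space = gym.spaces.Discrete(n_steering)

    def action(self, act):
        # Map the discrete index back to a continuous (steering, throttle) pair.
        return np.array([self.steering_values[act], self.throttle], dtype=np.float32)
```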
Known-answer tests:
- PPO with 100 timesteps on a mocked env should produce a non-None model object
- Model saved to `save_dir/model.zip` should be loadable with `PPO.load()`
Acceptance criteria:
- Running `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model` with a live sim produces `/tmp/test-model/model.zip`
- `mean_reward` in output comes from `evaluate_policy()`, not random episodes
- Script exits with code 0 and calls `env.close()`
- `--learning-rate` flag is respected (check SB3 verbose output)
- No `NameError: name 'model' is not defined` possible (model always defined before save)
Validation evidence: `.harness/wave1-runner/validation/1A-01-validation.md`
### Packet 1A-02 — Speed-Aware Reward Wrapper
Status: ⬜ Not started
Est. effort: 1 session
Depends on: 1A-01
Goal: Add a `SpeedRewardWrapper` that replaces the default CTE-only reward with `speed * (1 - abs(cte)/max_cte)`.
Steps:
- Create `agent/reward_wrapper.py` with `SpeedRewardWrapper(gym.Wrapper)`
- In `step()`, extract `speed` and `cte` from the `info` dict (DonkeyCar provides these)
- Compute shaped reward: `speed * (1.0 - min(abs(cte)/max_cte, 1.0))` minus a penalty on crash
- Add `--reward-shaping` boolean flag to runner CLI
- Apply wrapper in runner if flag set: `env = SpeedRewardWrapper(env, max_cte=8.0)`
- Log which reward mode is active at startup
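A minimal sketch of the wrapper, assuming the classic 4-tuple `gym` step API; the `hit` key as crash signal and the penalty value are assumptions:

```python
import gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, max_cte=8.0, crash_penalty=-1.0):
        super().__init__(env)
        self.max_cte = max_cte
        self.crash_penalty = crash_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed = info.get('speed')
        cte = info.get('cte')
        if speed is None or cte is None:
            # Missing telemetry: fall back to the env's original reward.
            return obs, reward, done, info
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))
        if done and info.get('hit', 'none') != 'none':  # assumed crash signal
            shaped += self.crash_penalty
        return obs, shaped, done, info
```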
Files created/modified:
- `agent/reward_wrapper.py` — new file
- `agent/donkeycar_sb3_runner.py` — add `--reward-shaping` flag and wrapper application
Acceptance criteria:
- `SpeedRewardWrapper` replaces the reward when `--reward-shaping` is set
- Default behavior unchanged when flag not set
- Wrapper handles missing `speed` or `cte` in `info` gracefully (falls back to original reward)
- Unit test passes without simulator (mocked `info` dict)
Validation evidence: `.harness/wave1-runner/validation/1A-02-validation.md`
### Packet 1A-03 — Champion Model Tracking
Status: ⬜ Not started
Est. effort: 0.5 sessions
Depends on: 1A-01
Goal: Track the best model across all trials; maintain `agent/models/champion/` with the current best.
Steps:
- After each trial, read `agent/models/champion/manifest.json` (if it exists) to get the current best reward
- If new `mean_reward > current_best_reward`, copy the model to `agent/models/champion/model.zip`
- Write updated `manifest.json`: `{trial, timestamp, params, mean_reward, model_path}`
- Log `[CHAMPION] New best: mean_reward=X params=Y` to console and autoresearch log
- Add `champion` boolean field to the JSONL result record
Files created/modified:
- `agent/autoresearch_controller.py` — add champion tracking logic
- `agent/models/champion/` — directory for champion model + manifest
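One possible shape for the tracker exercised by the known-answer test below. `ChampionTracker` and `update_if_better` are names from the test, not an existing API; the model copy is skipped when the path does not exist so the test can pass dummy paths:

```python
import json
import os
import shutil
import time

class ChampionTracker:
    def __init__(self, champion_dir):
        self.champion_dir = champion_dir
        os.makedirs(champion_dir, exist_ok=True)
        self.manifest_path = os.path.join(champion_dir, 'manifest.json')

    def _best_reward(self):
        if not os.path.exists(self.manifest_path):
            return float('-inf')
        with open(self.manifest_path) as f:
            return json.load(f)['mean_reward']

    def update_if_better(self, mean_reward, params, model_path, trial=None):
        if mean_reward <= self._best_reward():
            return False
        if os.path.exists(model_path):  # dummy test paths are skipped
            shutil.copy(model_path, os.path.join(self.champion_dir, 'model.zip'))
        with open(self.manifest_path, 'w') as f:
            json.dump({'trial': trial, 'timestamp': time.time(), 'params': params,
                       'mean_reward': mean_reward, 'model_path': model_path}, f)
        return True
```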
Known-answer tests:

```python
# Rewards [50, 80, 60, 90, 70] → champion updates at trials 0, 1, and 3
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed trial numbers
```
Acceptance criteria:
- Champion manifest updated whenever a new best reward is found
- `agent/models/champion/model.zip` always contains the best model seen
- `champion` field in JSONL is `true` for the best trial, `false` otherwise
- Known-answer champion tracking test passes
Validation evidence: `.harness/wave1-runner/validation/1A-03-validation.md`
### Packet 1A-04 — Autoresearch Controller Wiring
Status: ⬜ Not started
Est. effort: 0.5 sessions
Depends on: 1A-01, 1A-03
Goal: Update `autoresearch_controller.py` to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.
Steps:
- Add `timesteps` to GP search space: `{'type': 'int', 'min': 5000, 'max': 30000}`
- Pass `--learning-rate`, `--save-dir`, `--reward-shaping` to the runner subprocess
- Save new results to `autoresearch_results_phase1.jsonl` (do NOT mix with random-policy data)
- Parse `mean_reward` from the `[SB3 Runner] mean_reward=X` output line (see the sketch after this list)
- Parse `std_reward` from the `[SB3 Runner] std_reward=X` output line (add to runner output)
- Add `--push-every N` flag: git add + commit + push every N trials
- Add `--min-trials-before-gp 3` (default): use random sampling for the first N trials
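A minimal sketch of the output parsing; the `parse_metric` helper is illustrative:

```python
import re

def parse_metric(output, name):
    # Pull the float after '<name>=' from the runner's summary line, or None.
    m = re.search(rf'\[SB3 Runner\] {name}=(-?[\d.]+)', output)
    return float(m.group(1)) if m else None

stdout = '[SB3 Runner] mean_reward=142.7\n[SB3 Runner] std_reward=12.3\n'
assert parse_metric(stdout, 'mean_reward') == 142.7
assert parse_metric(stdout, 'std_reward') == 12.3
```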
Files created/modified:
- `agent/autoresearch_controller.py` — wire up new args, new results file, push support
Acceptance criteria:
- Phase 1 results go to `autoresearch_results_phase1.jsonl` only
- `learning_rate` arg is passed to and used by the runner
- `save_dir` is a trial-specific path: `agent/models/trial-{trial_number:04d}`
- Git push happens every N trials if `--push-every N` is set
- Random proposal used for the first `min_trials_before_gp` trials
Validation evidence: `.harness/wave1-runner/validation/1A-04-validation.md`
## 🔢 Dependency Order
1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)
1A-02 and 1A-03 can run in parallel after 1A-01.
## 🏁 Stream Completion Criteria
- All 4 packets complete with validation evidence written
- `python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t` produces a saved model
- `pytest tests/ -v` — stream 1A tests pass (once 1B is done)
- No `NameError: name 'model' is not defined` possible in any code path
- Champion tracking works: `manifest.json` updated correctly
- IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked `[x]`
- EXECUTION_MASTER updated
## 📋 Mandatory Commit Trailer Format

```
feat(runner): 1A-NN — <description>
Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A
```