donkeycar-rl-autoresearch/.harness/wave1-runner/execution-board.md


Execution Board — Stream 1A: Core Runner Rebuild

Feature: Replace random-action inner loop with real PPO/DQN training, model save, and evaluated policy rewards
Created: 2026-04-13
Branch: main
IMPLEMENTATION_PLAN tasks: 1A-01, 1A-02, 1A-03, 1A-04
Status: 🟠 In progress


🎯 Goal

Rebuild donkeycar_sb3_runner.py so that every trial:

  1. Trains a real PPO (or DQN) model using model.learn(total_timesteps=N)
  2. Evaluates the trained model with evaluate_policy() (learned policy, NOT random)
  3. Saves the model to disk
  4. Tracks the champion model across all trials
  5. Supports speed-aware reward shaping

⚠️ Dependencies

None — can start immediately.


📦 Packets

Packet 1A-01 — Rebuild Runner with Real Training

Status: Not started
Est. effort: 1 session
Depends on: none

Goal: Replace random env.action_space.sample() loop with real PPO.learn() + evaluate_policy().

Steps:

  1. Remove all legacy random-action loop code
  2. Add model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1) initialization
  3. Add model.learn(total_timesteps=timesteps) training call
  4. Add mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)
  5. Add model.save(Path(save_dir) / 'model') — SB3 appends the .zip suffix, so the artifact lands at save_dir/model.zip; save after every successful training run
  6. Print per-trial summary: timesteps, mean_reward, std_reward, save path
  7. Keep env.close() + time.sleep(2) teardown (non-negotiable per ADR-006)
  8. Add --learning-rate and --save-dir CLI args
  9. Add DQN path: if --agent dqn, use DQN with DiscretizedActionWrapper
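
A minimal sketch of the rebuilt trial core under these steps (argument parsing, the DQN branch, and gym_donkeycar env construction omitted; the [SB3 Runner] line format anticipates the parsing spec in Packet 1A-04):

import time
from pathlib import Path

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, lr, timesteps, eval_episodes, save_dir):
    """Train a real PPO model, evaluate the learned policy, save, tear down."""
    try:
        model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)   # step 2
        model.learn(total_timesteps=timesteps)                       # step 3
        mean_reward, std_reward = evaluate_policy(
            model, env, n_eval_episodes=eval_episodes)               # step 4
        save_path = Path(save_dir) / 'model'  # SB3 appends .zip -> save_dir/model.zip
        model.save(save_path)                                        # step 5
        print(f'[SB3 Runner] mean_reward={mean_reward:.2f}')         # step 6
        print(f'[SB3 Runner] std_reward={std_reward:.2f}')
        print(f'[SB3 Runner] saved={save_path}.zip')
        return mean_reward, std_reward
    finally:
        env.close()   # step 7: teardown is non-negotiable per ADR-006
        time.sleep(2)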

Files created/modified:

  • agent/donkeycar_sb3_runner.py — complete rebuild

Known-answer tests:

  • PPO with 100 timesteps on mocked env should produce a non-None model object
  • Model saved to save_dir/model.zip should be loadable with PPO.load()
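
The second test reduces to a load-back check; PPO.load accepts the .zip path directly:

from stable_baselines3 import PPO

reloaded = PPO.load('/tmp/test-model/model.zip')  # path from the acceptance criteria
assert reloaded is not None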

Acceptance criteria:

  • Running python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model with a live sim produces /tmp/test-model/model.zip
  • mean_reward in output comes from evaluate_policy(), not random episodes
  • Script exits with code 0 and calls env.close()
  • --learning-rate flag is respected (check SB3 verbose output)
  • NameError: name 'model' is not defined cannot occur on any code path (model is always defined before save is attempted)

Validation evidence: .harness/wave1-runner/validation/1A-01-validation.md


Packet 1A-02 — Speed-Aware Reward Wrapper

Status: Not started
Est. effort: 1 session
Depends on: 1A-01

Goal: Add SpeedRewardWrapper that replaces default CTE-only reward with speed * (1 - abs(cte)/max_cte).

Steps:

  1. Create agent/reward_wrapper.py with SpeedRewardWrapper(gym.Wrapper)
  2. In step(), extract speed and cte from info dict (DonkeyCar provides these)
  3. Compute shaped reward: speed * (1.0 - min(abs(cte)/max_cte, 1.0)) minus penalty on crash
  4. Add --reward-shaping boolean flag to runner CLI
  5. Apply wrapper in runner if flag set: env = SpeedRewardWrapper(env, max_cte=8.0)
  6. Log which reward mode is active at startup
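
A minimal wrapper sketch, assuming the classic gym 4-tuple step API used by gym_donkeycar; the hit key and the crash penalty value are assumptions, not confirmed project constants:

import gym

class SpeedRewardWrapper(gym.Wrapper):
    """Shaped reward: speed * (1 - |cte|/max_cte), clamped, with a crash penalty."""

    def __init__(self, env, max_cte=8.0, crash_penalty=-10.0):
        super().__init__(env)
        self.max_cte = max_cte
        self.crash_penalty = crash_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed, cte = info.get('speed'), info.get('cte')
        if speed is None or cte is None:
            return obs, reward, done, info   # graceful fallback to original reward
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))   # step 3
        if info.get('hit', 'none') != 'none':                        # assumed crash signal
            shaped += self.crash_penalty
        return obs, shaped, done, info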

Files created/modified:

  • agent/reward_wrapper.py — new file
  • agent/donkeycar_sb3_runner.py — add --reward-shaping flag and wrapper application

Acceptance criteria:

  • SpeedRewardWrapper replaces reward when --reward-shaping is set
  • Default behavior unchanged when flag not set
  • Wrapper handles missing speed or cte in info gracefully (falls back to original reward)
  • Unit test passes without simulator (mocked info dict) — one possible shape is sketched below
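
One possible shape for that simulator-free test, with a stub env standing in for DonkeyCar (all names here are illustrative; SpeedRewardWrapper is imported from agent/reward_wrapper.py):

import gym
import numpy as np
from gym import spaces

class _StubEnv(gym.Env):
    action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
    observation_space = spaces.Box(0, 255, shape=(120, 160, 3), dtype=np.uint8)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.5, False, {'speed': 2.0, 'cte': 4.0}

def test_shaped_reward():
    env = SpeedRewardWrapper(_StubEnv(), max_cte=8.0)
    _, reward, _, _ = env.step(None)
    assert reward == 2.0 * (1.0 - 4.0 / 8.0)   # == 1.0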

Validation evidence: .harness/wave1-runner/validation/1A-02-validation.md


Packet 1A-03 — Champion Model Tracking

Status: Not started
Est. effort: 0.5 sessions
Depends on: 1A-01

Goal: Track the best model across all trials; maintain agent/models/champion/ with the current best.

Steps:

  1. After each trial, read agent/models/champion/manifest.json (if exists) to get current best reward
  2. If new mean_reward > current_best_reward, copy model to agent/models/champion/model.zip
  3. Write updated manifest.json: {trial, timestamp, params, mean_reward, model_path}
  4. Log [CHAMPION] New best: mean_reward=X params=Y to console and autoresearch log
  5. Add champion boolean field to JSONL result record

Files created/modified:

  • agent/autoresearch_controller.py — add champion tracking logic
  • agent/models/champion/ — directory for champion model + manifest

Known-answer tests:

# Rewards [50, 80, 60, 90, 70] → champion updates at trials 1, 2, 4 (1-based; the assert below uses 0-based indices)
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed
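
A minimal ChampionTracker sketch that passes this test; the class name and update_if_better signature come from the test above, the manifest fields from step 3 (the trial field is omitted because the test signature does not supply it, and the model.zip copy from step 2 is left out):

import json
import time
from pathlib import Path

class ChampionTracker:
    """Persist the best mean_reward seen so far in <champion_dir>/manifest.json."""

    def __init__(self, champion_dir):
        self.dir = Path(champion_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.manifest = self.dir / 'manifest.json'

    def _current_best(self):
        if self.manifest.exists():
            return json.loads(self.manifest.read_text())['mean_reward']
        return float('-inf')

    def update_if_better(self, mean_reward, params, model_path):
        """Rewrite the manifest and return True iff mean_reward is a new best."""
        if mean_reward <= self._current_best():
            return False
        self.manifest.write_text(json.dumps({
            'timestamp': time.time(),
            'params': params,
            'mean_reward': mean_reward,
            'model_path': str(model_path),
        }, indent=2))
        return True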

Acceptance criteria:

  • Champion manifest updated whenever new best reward is found
  • agent/models/champion/model.zip always contains the best model seen
  • champion field in JSONL is true for the best trial, false otherwise
  • Known-answer champion tracking test passes

Validation evidence: .harness/wave1-runner/validation/1A-03-validation.md


Packet 1A-04 — Autoresearch Controller Wiring

Status: Not started
Est. effort: 0.5 sessions
Depends on: 1A-01, 1A-03

Goal: Update autoresearch_controller.py to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.

Steps:

  1. Add timesteps to GP search space: {'type': 'int', 'min': 5000, 'max': 30000}
  2. Pass --learning-rate, --save-dir, --reward-shaping to runner subprocess
  3. Save new results to autoresearch_results_phase1.jsonl (do NOT mix with random-policy data)
  4. Parse mean_reward from [SB3 Runner] mean_reward=X output line
  5. Parse std_reward from [SB3 Runner] std_reward=X output line (add to runner output)
  6. Add --push-every N flag: git add + commit + push every N trials
  7. Add --min-trials-before-gp flag (default 3): use random sampling for the first N trials
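
A sketch of the subprocess call plus the metric parsing in steps 4–5 (flag names and the save-dir pattern come from this packet's spec; the regex and helper names are assumptions):

import re
import subprocess

def run_trial_subprocess(lr, timesteps, trial_number, reward_shaping=True):
    """Invoke the rebuilt runner and parse mean/std reward from its stdout."""
    cmd = ['python3', 'agent/donkeycar_sb3_runner.py', '--agent', 'ppo',
           '--timesteps', str(timesteps), '--learning-rate', str(lr),
           '--save-dir', f'agent/models/trial-{trial_number:04d}']
    if reward_shaping:
        cmd.append('--reward-shaping')
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    def grab(name):
        m = re.search(rf'\[SB3 Runner\] {name}=(-?[\d.]+)', out)
        if m is None:
            raise RuntimeError(f'no [SB3 Runner] {name}= line in runner output')
        return float(m.group(1))

    return grab('mean_reward'), grab('std_reward')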

Files created/modified:

  • agent/autoresearch_controller.py — wire up new args, new results file, push support

Acceptance criteria:

  • Phase 1 results go to autoresearch_results_phase1.jsonl only
  • learning_rate arg is passed to and used by the runner
  • save_dir is a trial-specific path: agent/models/trial-{trial_number:04d}
  • Git push happens every N trials if --push-every N is set
  • Random proposal used for first min_trials_before_gp trials

Validation evidence: .harness/wave1-runner/validation/1A-04-validation.md


🔢 Dependency Order

1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)

1A-02 and 1A-03 can run in parallel after 1A-01.


🏁 Stream Completion Criteria

  • All 4 packets complete with validation evidence written
  • python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t produces a saved model
  • pytest tests/ -v — stream 1A tests pass (once 1B is done)
  • NameError: name 'model' is not defined cannot occur in any code path
  • Champion tracking works: manifest.json updated correctly
  • IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked [x]
  • EXECUTION_MASTER updated

📋 Mandatory Commit Trailer Format

feat(runner): 1A-NN — <description>

Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A