donkeycar-rl-autoresearch/.harness/wave1-runner/execution-board.md


Execution Board — Stream 1A: Core Runner Rebuild

Feature: Replace random-action inner loop with real PPO/DQN training, model save, and evaluated policy rewards
Created: 2026-04-13
Branch: main
IMPLEMENTATION_PLAN tasks: 1A-01, 1A-02, 1A-03, 1A-04
Status: 🟠 In progress


🎯 Goal

Rebuild donkeycar_sb3_runner.py so that every trial:

  1. Trains a real PPO (or DQN) model using model.learn(total_timesteps=N)
  2. Evaluates the trained model with evaluate_policy() (learned policy, NOT random)
  3. Saves the model to disk
  4. Tracks the champion model across all trials
  5. Supports speed-aware reward shaping

⚠️ Dependencies

None — can start immediately.


📦 Packets

Packet 1A-01 — Rebuild Runner with Real Training

Status: Not started
Est. effort: 1 session
Depends on: none

Goal: Replace random env.action_space.sample() loop with real PPO.learn() + evaluate_policy().

Steps:

  1. Remove all legacy random-action loop code
  2. Add model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1) initialization
  3. Add model.learn(total_timesteps=timesteps) training call
  4. Add mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=eval_episodes)
  5. Add model.save(Path(save_dir) / 'model') — SB3 appends the .zip suffix, so the artifact lands at save_dir/model.zip; save after every successful training run
  6. Print per-trial summary: timesteps, mean_reward, std_reward, save path
  7. Keep env.close() + time.sleep(2) teardown (non-negotiable per ADR-006)
  8. Add --learning-rate and --save-dir CLI args
  9. Add DQN path: if --agent dqn, use DQN with DiscretizedActionWrapper
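
A minimal sketch of the rebuilt trial core under these steps (argument parsing, the DQN branch, and gym_donkeycar env construction omitted; the [SB3 Runner] line format anticipates the parsing spec in Packet 1A-04):

import time
from pathlib import Path

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, lr, timesteps, eval_episodes, save_dir):
    """Train a real PPO model, evaluate the learned policy, save, tear down."""
    try:
        model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)   # step 2
        model.learn(total_timesteps=timesteps)                       # step 3
        mean_reward, std_reward = evaluate_policy(
            model, env, n_eval_episodes=eval_episodes)               # step 4
        save_path = Path(save_dir) / 'model'  # SB3 appends .zip -> save_dir/model.zip
        model.save(save_path)                                        # step 5
        print(f'[SB3 Runner] mean_reward={mean_reward:.2f}')         # step 6
        print(f'[SB3 Runner] std_reward={std_reward:.2f}')
        print(f'[SB3 Runner] saved={save_path}.zip')
        return mean_reward, std_reward
    finally:
        env.close()   # step 7: teardown is non-negotiable per ADR-006
        time.sleep(2)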

Files created/modified:

  • agent/donkeycar_sb3_runner.py — complete rebuild

Known-answer tests:

  • PPO with 100 timesteps on mocked env should produce a non-None model object
  • Model saved to save_dir/model.zip should be loadable with PPO.load()
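
The second test reduces to a load-back check; PPO.load accepts the .zip path directly:

from stable_baselines3 import PPO

reloaded = PPO.load('/tmp/test-model/model.zip')  # path from the acceptance criteria
assert reloaded is not None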

Acceptance criteria:

  • Running python3 donkeycar_sb3_runner.py --agent ppo --timesteps 100 --save-dir /tmp/test-model with a live sim produces /tmp/test-model/model.zip
  • mean_reward in output comes from evaluate_policy(), not random episodes
  • Script exits with code 0 and calls env.close()
  • --learning-rate flag is respected (check SB3 verbose output)
  • NameError: name 'model' is not defined cannot occur on any code path (model is always defined before save is attempted)

Validation evidence: .harness/wave1-runner/validation/1A-01-validation.md


Packet 1A-02 — Speed-Aware Reward Wrapper

Status: Not started
Est. effort: 1 session
Depends on: 1A-01

Goal: Add SpeedRewardWrapper that replaces default CTE-only reward with speed * (1 - abs(cte)/max_cte).

Steps:

  1. Create agent/reward_wrapper.py with SpeedRewardWrapper(gym.Wrapper)
  2. In step(), extract speed and cte from info dict (DonkeyCar provides these)
  3. Compute shaped reward: speed * (1.0 - min(abs(cte)/max_cte, 1.0)) minus penalty on crash
  4. Add --reward-shaping boolean flag to runner CLI
  5. Apply wrapper in runner if flag set: env = SpeedRewardWrapper(env, max_cte=8.0)
  6. Log which reward mode is active at startup
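
A minimal wrapper sketch, assuming the classic gym 4-tuple step API used by gym_donkeycar; the hit key and the crash penalty value are assumptions, not confirmed project constants:

import gym

class SpeedRewardWrapper(gym.Wrapper):
    """Shaped reward: speed * (1 - |cte|/max_cte), clamped, with a crash penalty."""

    def __init__(self, env, max_cte=8.0, crash_penalty=-10.0):
        super().__init__(env)
        self.max_cte = max_cte
        self.crash_penalty = crash_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed, cte = info.get('speed'), info.get('cte')
        if speed is None or cte is None:
            return obs, reward, done, info   # graceful fallback to original reward
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))   # step 3
        if info.get('hit', 'none') != 'none':                        # assumed crash signal
            shaped += self.crash_penalty
        return obs, shaped, done, info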

Files created/modified:

  • agent/reward_wrapper.py — new file
  • agent/donkeycar_sb3_runner.py — add --reward-shaping flag and wrapper application

Acceptance criteria:

  • SpeedRewardWrapper replaces reward when --reward-shaping is set
  • Default behavior unchanged when flag not set
  • Wrapper handles missing speed or cte in info gracefully (falls back to original reward)
  • Unit test passes without simulator (mocked info dict) — one possible shape is sketched below
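
One possible shape for that simulator-free test, with a stub env standing in for DonkeyCar (all names here are illustrative; SpeedRewardWrapper is imported from agent/reward_wrapper.py):

import gym
import numpy as np
from gym import spaces

class _StubEnv(gym.Env):
    action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
    observation_space = spaces.Box(0, 255, shape=(120, 160, 3), dtype=np.uint8)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.5, False, {'speed': 2.0, 'cte': 4.0}

def test_shaped_reward():
    env = SpeedRewardWrapper(_StubEnv(), max_cte=8.0)
    _, reward, _, _ = env.step(None)
    assert reward == 2.0 * (1.0 - 4.0 / 8.0)   # == 1.0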

Validation evidence: .harness/wave1-runner/validation/1A-02-validation.md


Packet 1A-03 — Champion Model Tracking

Status: Not started
Est. effort: 0.5 sessions
Depends on: 1A-01

Goal: Track the best model across all trials; maintain agent/models/champion/ with the current best.

Steps:

  1. After each trial, read agent/models/champion/manifest.json (if exists) to get current best reward
  2. If new mean_reward > current_best_reward, copy model to agent/models/champion/model.zip
  3. Write updated manifest.json: {trial, timestamp, params, mean_reward, model_path}
  4. Log [CHAMPION] New best: mean_reward=X params=Y to console and autoresearch log
  5. Add champion boolean field to JSONL result record

Files created/modified:

  • agent/autoresearch_controller.py — add champion tracking logic
  • agent/models/champion/ — directory for champion model + manifest

Known-answer tests:

# Rewards [50, 80, 60, 90, 70] → champion updates at trials 1, 2, 4 (1-based; the assert below uses 0-based indices)
rewards = [50, 80, 60, 90, 70]
tracker = ChampionTracker('/tmp/test-champion')
champions = []
for i, r in enumerate(rewards):
    if tracker.update_if_better(r, params={}, model_path=f'trial-{i}'):
        champions.append(i)
assert champions == [0, 1, 3]  # 0-indexed
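
A minimal ChampionTracker sketch that passes this test; the class name and update_if_better signature come from the test above, the manifest fields from step 3 (the trial field is omitted because the test signature does not supply it, and the model.zip copy from step 2 is left out):

import json
import time
from pathlib import Path

class ChampionTracker:
    """Persist the best mean_reward seen so far in <champion_dir>/manifest.json."""

    def __init__(self, champion_dir):
        self.dir = Path(champion_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.manifest = self.dir / 'manifest.json'

    def _current_best(self):
        if self.manifest.exists():
            return json.loads(self.manifest.read_text())['mean_reward']
        return float('-inf')

    def update_if_better(self, mean_reward, params, model_path):
        """Rewrite the manifest and return True iff mean_reward is a new best."""
        if mean_reward <= self._current_best():
            return False
        self.manifest.write_text(json.dumps({
            'timestamp': time.time(),
            'params': params,
            'mean_reward': mean_reward,
            'model_path': str(model_path),
        }, indent=2))
        return True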

Acceptance criteria:

  • Champion manifest updated whenever new best reward is found
  • agent/models/champion/model.zip always contains the best model seen
  • champion field in JSONL is true for the best trial, false otherwise
  • Known-answer champion tracking test passes

Validation evidence: .harness/wave1-runner/validation/1A-03-validation.md


Packet 1A-04 — Autoresearch Controller Wiring

Status: Not started
Est. effort: 0.5 sessions
Depends on: 1A-01, 1A-03

Goal: Update autoresearch_controller.py to pass all required args to the rebuilt runner, use a separate Phase 1 results file, and add timesteps to the search space.

Steps:

  1. Add timesteps to GP search space: {'type': 'int', 'min': 5000, 'max': 30000}
  2. Pass --learning-rate, --save-dir, --reward-shaping to runner subprocess
  3. Save new results to autoresearch_results_phase1.jsonl (do NOT mix with random-policy data)
  4. Parse mean_reward from [SB3 Runner] mean_reward=X output line
  5. Parse std_reward from [SB3 Runner] std_reward=X output line (add to runner output)
  6. Add --push-every N flag: git add + commit + push every N trials
  7. Add --min-trials-before-gp flag (default 3): use random sampling for the first N trials
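
A sketch of the subprocess call plus the metric parsing in steps 4–5 (flag names and the save-dir pattern come from this packet's spec; the regex and helper names are assumptions):

import re
import subprocess

def run_trial_subprocess(lr, timesteps, trial_number, reward_shaping=True):
    """Invoke the rebuilt runner and parse mean/std reward from its stdout."""
    cmd = ['python3', 'agent/donkeycar_sb3_runner.py', '--agent', 'ppo',
           '--timesteps', str(timesteps), '--learning-rate', str(lr),
           '--save-dir', f'agent/models/trial-{trial_number:04d}']
    if reward_shaping:
        cmd.append('--reward-shaping')
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    def grab(name):
        m = re.search(rf'\[SB3 Runner\] {name}=(-?[\d.]+)', out)
        if m is None:
            raise RuntimeError(f'no [SB3 Runner] {name}= line in runner output')
        return float(m.group(1))

    return grab('mean_reward'), grab('std_reward')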

Files created/modified:

  • agent/autoresearch_controller.py — wire up new args, new results file, push support

Acceptance criteria:

  • Phase 1 results go to autoresearch_results_phase1.jsonl only
  • learning_rate arg is passed to and used by the runner
  • save_dir is a trial-specific path: agent/models/trial-{trial_number:04d}
  • Git push happens every N trials if --push-every N is set
  • Random proposal used for first min_trials_before_gp trials

Validation evidence: .harness/wave1-runner/validation/1A-04-validation.md


🔢 Dependency Order

1A-01 → 1A-02 (reward wrapper)
1A-01 → 1A-03 (champion tracking)
1A-01 + 1A-03 → 1A-04 (controller wiring)

1A-02 and 1A-03 can run in parallel after 1A-01.


🏁 Stream Completion Criteria

  • All 4 packets complete with validation evidence written
  • python3 donkeycar_sb3_runner.py --agent ppo --timesteps 5000 --save-dir /tmp/t produces a saved model
  • pytest tests/ -v — stream 1A tests pass (once 1B is done)
  • NameError: name 'model' is not defined cannot occur in any code path
  • Champion tracking works: manifest.json updated correctly
  • IMPLEMENTATION_PLAN tasks 1A-01 through 1A-04 marked [x]
  • EXECUTION_MASTER updated

📋 Mandatory Commit Trailer Format

feat(runner): 1A-NN — <description>

Agent: pi/claude-sonnet
Tests: N/A (sim required) | N/N passing
Tests-Added: +N
TypeScript: N/A