Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start; see the
config sketch after this list)
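
A minimal sketch of those knobs as a config; the names are illustrative,
the values are the ones listed above:

TRAINING_TRACKS = ["generated_track", "mountain_track"]  # train set
TEST_TRACKS = ["mini_monaco"]             # held out, zero-shot eval only
LR_BOUNDS = (1e-4, 2e-3)                  # wider range for scratch models
STEP_BUDGET_BOUNDS = (60_000, 250_000)    # total steps per trial
SEED_PARAMS = [{"lr": 0.0003}, {"lr": 0.001}]  # diverse GP seeds
WARM_START = False                        # fresh random weights every trial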
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller (acquisition step
sketched after this list)
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
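
For reference, the GP+UCB step in wave4_controller.py amounts to something
like the sketch below; the kernel, kappa, and candidate grid are
assumptions, and the search is shown over the LR axis only:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next_lr(observed_lrs, observed_rewards, kappa=2.0):
    # Fit a GP to the (log10 lr, reward) pairs observed so far.
    X = np.log10(np.asarray(observed_lrs)).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, observed_rewards)
    # Score candidates across the Wave 4 bounds [1e-4, 2e-3] by UCB.
    grid = np.linspace(np.log10(1e-4), np.log10(2e-3), 256).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    return 10 ** grid[np.argmax(mu + kappa * sigma), 0]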
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A

PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate(), which reads lr_schedule, not
model.learning_rate. So even with the optimizer's param_groups patched,
the first train() call resets the optimizer back to the saved LR before
any gradient step runs with the new value.
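
Minimal repro sketch (SB3 2.x; the checkpoint path and env are
placeholders, not the project's real ones):

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")                # stand-in env
model = PPO.load("checkpoint.zip", env=env)  # hypothetical checkpoint
for pg in model.policy.optimizer.param_groups:
    pg["lr"] = 1e-3                          # naive override
model.learn(total_timesteps=2048)            # train() runs _update_learning_rate()
print(model.policy.optimizer.param_groups[0]["lr"])  # back to the saved LR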
Complete 3-part fix in create_or_load_model() (get_schedule_fn is from
stable_baselines3.common.utils):
model.learning_rate = lr                 # attribute
model.lr_schedule = get_schedule_fn(lr)  # prevents train() reverting
for pg in model.policy.optimizer.param_groups:
    pg['lr'] = lr                        # immediate effect
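
Put together, the helper looks roughly like this; the signature and the
fresh-model branch are assumptions, not the real code:

from stable_baselines3 import PPO
from stable_baselines3.common.utils import get_schedule_fn

def create_or_load_model(path, env, lr):
    if path is None:
        return PPO("MlpPolicy", env, learning_rate=lr)  # fresh model
    model = PPO.load(path, env=env)
    model.learning_rate = lr                  # 1: attribute
    model.lr_schedule = get_schedule_fn(lr)   # 2: survives train()
    for pg in model.policy.optimizer.param_groups:
        pg["lr"] = lr                         # 3: immediate effect
    return model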
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path, including a live learn() call; they would have caught both bugs
(shape sketched after this list)
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
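
One of the new tests has roughly this shape, using the
create_or_load_model() sketch above; the body is hypothetical and the
stand-in env replaces the project's own:

import gymnasium as gym
from stable_baselines3 import PPO

def test_lr_override_survives_learn(tmp_path):
    env = gym.make("CartPole-v1")             # stand-in env
    PPO("MlpPolicy", env, learning_rate=0.000225,
        n_steps=64).save(tmp_path / "ckpt")
    model = create_or_load_model(tmp_path / "ckpt", env, lr=0.001)
    model.learn(total_timesteps=64)           # live learn(): runs train() once
    # without all three parts of the fix, train() reverts this to 0.000225
    assert model.policy.optimizer.param_groups[0]["lr"] == 0.001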
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A