Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects the lap_count increment; if last_lap_time
< min_lap_time (5.0s), it returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50; legitimate 12-second laps are unaffected
(see the wrapper sketch after this list).
Window size also increased from 30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with a manual eval loop
capped at MAX_EVAL_STEPS=3000 steps (see the eval-loop sketch below).
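Sketch of the short-lap guard from item 1 (assumes gym-donkeycar exposes
lap_count and last_lap_time in the step info dict and the Gymnasium
five-tuple step API; the real wrapper also keeps the 60-step window and
other shaping terms):

    import gymnasium as gym

    class SpeedRewardWrapper(gym.Wrapper):
        MIN_LAP_TIME = 5.0  # laps faster than this are physically implausible

        def __init__(self, env):
            super().__init__(env)
            self._prev_lap_count = 0

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            lap_count = info.get("lap_count", self._prev_lap_count)
            if lap_count > self._prev_lap_count:
                lap_time = info.get("last_lap_time", float("inf"))
                if lap_time < self.MIN_LAP_TIME:
                    # Circling the start/finish line: penalty scales with how
                    # implausibly fast the "lap" was (a 1 s lap gives -50).
                    reward = -10.0 * (self.MIN_LAP_TIME / lap_time)
            self._prev_lap_count = lap_count
            return obs, reward, terminated, truncated, info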
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
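Sketch of the capped eval loop from item 2 (MAX_EVAL_STEPS is from this
commit; the function name and Gymnasium reset/step signatures are
illustrative):

    MAX_EVAL_STEPS = 3000  # hard cap so wide, barrier-free tracks cannot run forever

    def evaluate_segment(model, env, deterministic=True):
        obs, _ = env.reset()
        total_reward, steps = 0.0, 0
        while steps < MAX_EVAL_STEPS:
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += float(reward)
            steps += 1
            if terminated or truncated:
                break
        return total_reward, steps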
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Fresh (scratch-trained) models are now created with verbose=1.
Without this, Wave 4 scratch-trained models produce no rollout stats in
the log, making it impossible to monitor training progress or spot
degenerate policies early.
Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.
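Minimal sketch of that change, assuming fresh models are built in
create_or_load_model() with a CNN policy (both assumptions; only the
verbose flag matters here):

    from stable_baselines3 import PPO

    def make_fresh_model(env, lr):
        # Fresh PPO defaults to verbose=0; verbose=1 makes SB3 print the
        # rollout/ep_rew_mean block each rollout, as warm-started models did.
        return PPO("CnnPolicy", env, learning_rate=lr, verbose=1)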
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start)
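Sketch of the Wave 4 configuration the bullets above describe
(TRAINING_TRACKS/TEST_TRACKS appear elsewhere in this log; the
SEARCH_SPACE/SEED_PARAMS names and seed step budgets are illustrative):

    TRAINING_TRACKS = ["generated_track", "mountain_track"]  # distinct looks, both marked
    TEST_TRACKS = ["mini_monaco"]                            # zero-shot, never trained on

    SEARCH_SPACE = {
        "learning_rate": (1e-4, 2e-3),          # wider range for scratch training
        "total_timesteps": (60_000, 250_000),   # larger budget for fresh weights
    }

    SEED_PARAMS = [
        {"learning_rate": 0.0003},  # seed step budgets not stated in this commit
        {"learning_rate": 0.001},
    ]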
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller (rough sketch below)
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
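Rough illustration of the GP+UCB proposal step referenced in the
wave4_controller.py bullet (the commit names the approach, not this code;
scikit-learn, the kernel choice, and kappa are assumptions):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def propose_next(trials, bounds, kappa=2.0, n_candidates=500, seed=0):
        """trials: list of ((lr, steps), reward); bounds: [(lo, hi), (lo, hi)]."""
        rng = np.random.default_rng(seed)
        X = np.array([t[0] for t in trials], dtype=float)
        y = np.array([t[1] for t in trials], dtype=float)
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)
        lo = np.array([b[0] for b in bounds])
        hi = np.array([b[1] for b in bounds])
        cand = rng.uniform(lo, hi, size=(n_candidates, len(bounds)))
        mu, sigma = gp.predict(cand, return_std=True)
        return cand[np.argmax(mu + kappa * sigma)]  # UCB: prefer high mean + high uncertainty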
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate() which reads lr_schedule, not
model.learning_rate. So even with param_groups patched, the first
gradient step reverts the optimizer to the saved LR.
Complete 3-part fix in create_or_load_model():
    model.learning_rate = lr                       # attribute
    model.lr_schedule = get_schedule_fn(lr)        # prevents train() reverting
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr                              # immediate effect
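The same three lines in context, as a standalone sketch of the warm-start
branch (function name and signature are illustrative; PPO.load and
get_schedule_fn are the SB3 APIs this commit relies on):

    from stable_baselines3 import PPO
    from stable_baselines3.common.utils import get_schedule_fn

    def load_with_lr_override(path, env, lr):
        """Warm-start from a saved PPO model but force a new learning rate."""
        model = PPO.load(path, env=env)
        # PPO.load() rebuilds lr_schedule from the SAVED learning rate, so all
        # three pieces of state must be overridden or train() reverts the LR.
        model.learning_rate = lr                    # the attribute
        model.lr_schedule = get_schedule_fn(lr)     # what _update_learning_rate() reads
        for pg in model.policy.optimizer.param_groups:
            pg["lr"] = lr                           # immediate effect on the optimizer
        return model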
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path including a live learn() call; would have caught both bugs
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A
PPO.load() restores the saved optimizer state (lr=0.000225 from Phase 2
champion). Setting model.learning_rate alone is insufficient because
_update_learning_rate() may not fire before the first gradient step, and
the optimizer's param_groups still hold the old value.
Fix: after PPO.load(), explicitly set lr on every optimizer param_group:
    model.learning_rate = lr
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr
Impact: all 8 previous Wave 3 trials actually trained at LR=0.000225
regardless of GP proposal. Results archived as:
autoresearch_results_phase3_CONTAMINATED_wrong_lr.jsonl
Phase 3 results cleared; autoresearch restarting from scratch.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Warren track surface is green carpet (not outdoor road), and the
episode-done condition (|CTE| > max_cte) does not fire when the car
crosses the INSIDE boundary. Car can drive off-track and bump into
chairs indefinitely, making scores meaningless as a test metric.
Changes:
- multitrack_runner.py: TEST_TRACKS now mini_monaco only
- wave3_controller.py: drop warren_reward from parse/save/champion paths
- tests/test_wave3.py: update assertions to match single test track
- All 83 tests pass
Track classification (final):
TRAIN : generated_road, generated_track, mountain_track
TEST : mini_monaco (outdoor, proper road, correct done condition)
SKIP : warren, warehouse, robo_racing_league, waveshare, circuit_launch
SKIP : avc_sparkfun (orange markings)
ADR-010 to be updated.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Bug: send_exit_scene_raw() opened a NEW TCP connection, creating a second
phantom vehicle. The exit_scene therefore applied to the phantom, leaving
the real training connection stuck on generated_road for the entire run.
Fix: _send_exit_scene() now calls env.unwrapped.viewer.exit_scene() on the
EXISTING TCP connection that the training env already holds. This is the
only reliable way to switch scenes mid-session (matches track_switcher.py).
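Sketch of the fixed helper (the env.unwrapped.viewer.exit_scene() call is
the one this commit describes; any retry/error handling in the real file
is omitted):

    def _send_exit_scene(env):
        # Leave the current scene over the env's OWN connection. Opening a fresh
        # raw socket (old send_exit_scene_raw) registers a phantom vehicle and the
        # exit_scene applies to that phantom instead of the live training session.
        env.unwrapped.viewer.exit_scene()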
Also:
- Removed send_exit_scene_raw() import from multitrack_runner.py
- Simplified initial connection (no spurious exit_scene at startup)
- Reduced search space: total_timesteps 80k-400k -> 30k-150k
- Reduced seed params: 150k/300k -> 45k/90k (~35-45 min per trial)
- Added test: test_close_and_switch_uses_viewer_not_raw_socket
83 tests passing
Agent: pi
Tests: 83 passed
Tests-Added: 1
TypeScript: N/A