Samples the car position every 100 steps during eval and computes
macro efficiency = net_displacement / total_sampled_path. If the
efficiency is < 0.3 after >= 500 steps, logs "WARNING: SHUTTLE
EXPLOIT?" together with the efficiency value.
Also logs reward/step per episode so anomalously high-scoring long
episodes can be diagnosed immediately.
This will tell us definitively whether Trials 9 and 14 (1435/1573
scores, 2000 steps each) were genuine driving or back-and-forth
shuttling on a mini_monaco straight.
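A minimal sketch of the detector logic, assuming numpy; car_position()
is a hypothetical helper for pulling (x, y) out of the observation:

    import logging
    import numpy as np

    SAMPLE_EVERY = 100   # steps between position samples
    MIN_STEPS = 500      # don't judge episodes shorter than this
    THRESHOLD = 0.3      # below this, flag possible shuttling

    def macro_efficiency(positions):
        # positions: (x, y) pairs sampled every SAMPLE_EVERY steps
        pts = np.asarray(positions, dtype=float)
        net = np.linalg.norm(pts[-1] - pts[0])                     # net displacement
        path = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()  # total sampled path
        return net / path if path > 0 else 1.0

    # inside the eval loop (illustrative):
    # if step % SAMPLE_EVERY == 0:
    #     positions.append(car_position(obs))
    # if step >= MIN_STEPS and macro_efficiency(positions) < THRESHOLD:
    #     logging.warning("SHUTTLE EXPLOIT? efficiency=%.3f",
    #                     macro_efficiency(positions))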
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The checkpoint code referenced save_dir inside train_multitrack(),
but save_dir is only defined in main(). Every trial since the
checkpoint fix landed crashed with NameError: name 'save_dir' is not
defined after the first segment, producing rc=101 and no GP data.
Fix: add a save_dir=None parameter to train_multitrack() and pass it
from the main() call site.
This explains why Trials 6-10 in the current run all produced None
results despite appearing to train normally for the first segment.
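A sketch of the fix; train_segment() and SEGMENTS are hypothetical
stand-ins for the real per-segment training code:

    import os

    def train_multitrack(segments, save_dir=None):
        # fix: save_dir is now a parameter, not a main()-scoped name
        for i, segment in enumerate(segments):
            model = train_segment(segment)
            if save_dir is not None:
                model.save(os.path.join(save_dir, f"segment_{i}"))

    def main():
        save_dir = "checkpoints/wave4"   # still defined in main(), as before
        train_multitrack(SEGMENTS, save_dir=save_dir)   # fix: passed explicitly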
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
ADR-013: Wave 4 train-from-scratch rationale (why no warm-start, why
generated_track+mountain_track, proven by 1943 overnight result)
ADR-014: Measure throughput before long runs (10+ hours lost to timeouts)
ADR-015: Per-segment checkpointing is non-negotiable
ADR-016: Verify fixes are running before walking away
These decisions existed only in conversation and were never written
down, so they were forgotten after context compaction and relearned
the hard way multiple times.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
All previously identified issues, now addressed:
- Controller was never restarted after cap/checkpoint fixes -> they never ran
- Timeout trials (score=0) were polluting GP data -> removed
- Overnight Trial 3 result (1943 mini_monaco) was unknown to GP -> added
GP now has 5 valid data points, including the 1943 score at
lr=0.000685, switch=17499. The GP should converge toward the longer
switching intervals that produced the only great result.
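Illustrative shape of the rebuilt GP dataset (sklearn is a stand-in;
the actual optimizer and trial record format may differ):

    from sklearn.gaussian_process import GaussianProcessRegressor

    # hypothetical trial records: (lr, switch_interval, score); score=0 = timeout
    trials = [
        (0.000685, 17499, 1943),   # overnight Trial 3, previously missing
        # ... the other valid Wave 4 trials ...
    ]
    valid = [t for t in trials if t[2] > 0]   # timeouts no longer pollute the fit
    X = [[lr, switch] for lr, switch, _ in valid]
    y = [score for *_, score in valid]
    gp = GaussianProcessRegressor().fit(X, y)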
Verified before relaunch:
- PARAM_SPACE max total_timesteps = 90000 ✓
- Checkpoint saves after every segment ✓
- Rescue eval on timeout ✓
- 102 tests passing ✓
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Three changes:
1. Lower total_timesteps cap: 120k → 90k
Actual throughput is 16 steps/sec (not 20 as estimated).
120k steps = 125 min training + 9 min overhead = 134 min > 2hr limit.
90k steps = 94 min + 8 min overhead = 102 min, safely within limit.
2. Per-segment checkpoint saves in multitrack_runner
model.save() called after every segment so the latest weights are
always on disk. If the runner is killed (timeout/crash/Ctrl+C),
training progress is never completely lost.
3. Timeout rescue eval in wave4_controller
If JOB_TIMEOUT fires and a checkpoint exists, immediately runs a
quick mini_monaco eval on the checkpoint so the trial still produces
a GP data point despite the timeout.
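A sketch of changes 2 and 3, assuming stable-baselines3 (PPO stands in
for the actual algorithm; segment_steps, quick_eval and record_gp_point
are hypothetical helpers):

    import os
    from stable_baselines3 import PPO

    # 2. multitrack_runner: latest weights always on disk
    def train_multitrack(model, segments, save_dir):
        for segment in segments:
            model.learn(total_timesteps=segment_steps(segment))
            model.save(os.path.join(save_dir, "latest"))   # per-segment checkpoint

    # 3. wave4_controller: rescue eval when JOB_TIMEOUT fires
    def on_job_timeout(checkpoint_path, trial_params):
        if os.path.exists(checkpoint_path + ".zip"):       # SB3 appends .zip on save
            model = PPO.load(checkpoint_path)
            score = quick_eval(model, track="mini_monaco") # short, step-capped eval
            record_gp_point(trial_params, score)           # trial still yields GP data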
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Trials 3 and 4 both proposed ~140k steps and hit the 2hr JOB_TIMEOUT,
wasting time and producing no GP data. At ~20 steps/sec, 120k steps
takes ~100 min, safely within the 2hr limit.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
StuckTerminationWrapper added to wrap_env stack (between ThrottleClamp
and SpeedReward):
- Terminates episode after stuck_steps=80 steps with <0.5m displacement
- Handles slow barrier contact that Unity hit detection misses
- Handles off-lap-line circles (efficiency→0 gave zero reward but no
termination; now gives -1.0 after 80 steps = ~4s of non-progress)
- Wrapper stack: ThrottleClamp → StuckTermination → SpeedReward
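A sketch of the wrapper, assuming a gymnasium-style API; car_position()
is a hypothetical helper for extracting the car's (x, y):

    from collections import deque
    import gymnasium as gym
    import numpy as np

    class StuckTerminationWrapper(gym.Wrapper):
        def __init__(self, env, stuck_steps=80, min_displacement=0.5):
            super().__init__(env)
            self.stuck_steps = stuck_steps
            self.min_displacement = min_displacement
            self.positions = deque(maxlen=stuck_steps)

        def reset(self, **kwargs):
            self.positions.clear()
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            self.positions.append(np.asarray(car_position(obs)))
            if len(self.positions) == self.stuck_steps:
                moved = np.linalg.norm(self.positions[-1] - self.positions[0])
                if moved < self.min_displacement:    # <0.5m over 80 steps (~4s)
                    reward, terminated = -1.0, True  # penalize and end the episode
            return obs, reward, terminated, truncated, info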
Also: a missing deque import in multitrack_runner.py caused a
NameError; the import has been added.
Phase 4 results cleared again (Trial 1 ran without StuckTermination).
Tests: 2 new stuck-termination tests, 102 total.
Agent: pi
Tests: 102 passed
Tests-Added: 2
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
< min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
Window size also increased from 30 to 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with a manual eval loop
capped at MAX_EVAL_STEPS=3000.
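Sketches of both fixes, assuming stable-baselines3 model.predict and a
gymnasium-style env; the lap-time plumbing is illustrative:

    MIN_LAP_TIME = 5.0
    MAX_EVAL_STEPS = 3000

    # 1. in SpeedRewardWrapper, after detecting a lap_count increment:
    def short_lap_penalty(lap_time, min_lap_time=MIN_LAP_TIME):
        if lap_time < min_lap_time:
            return -10.0 * (min_lap_time / lap_time)  # 1s lap -> -50
        return None  # legitimate lap (e.g. 12s), keep the normal reward

    # 2. manual eval loop replacing evaluate_policy:
    def capped_eval(model, env, max_steps=MAX_EVAL_STEPS):
        obs, _ = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):   # hard cap: no non-terminating episodes
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return total_reward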
Phase 4 results cleared (Trials 4-6 ran with the exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Without verbose=1 set on freshly created models, Wave 4 scratch-trained
models produce no rollout stats in the log, making it impossible to
monitor training progress or spot degenerate policies early.
Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.
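The fix amounts to setting verbose at model construction time (sketch;
PPO stands in for the actual algorithm, other kwargs elided):

    from stable_baselines3 import PPO

    # fresh models default to verbose=0; warm-started models inherited
    # verbose=1 from the Phase-2 save, which is why only they printed stats
    model = PPO("MlpPolicy", env, verbose=1)   # rollout/ep_rew_mean etc. now logged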
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A