v5 required for mountain hills (v4 gives zero gradient on hills - documented Exp 1).
Same simple approach as Exp 13 which worked: single track, minimal wrappers,
lap-based stopping. ThrottleClamp + V5Reward only.
Return to Wave 4 setup that produced Trial 9 (2000/2000 on generated_track).
v4 reward: base × efficiency × speed. Circles give ~0 reward naturally.
No StuckTerminationWrapper, no CTE patience, no progress terminator.
Just ThrottleClamp + V4Reward. Lap-based stopping criterion.
Previously circles ran 20+ seconds because the efficiency gate only returned
0 reward without terminating. After 20 consecutive steps of efficiency < 0.15
(~0.7 seconds at 27 steps/sec), episode now terminates with -1.0.
Also confirmed from telemetry diagnostic: CTE does report correctly when
car goes off-track (rises steadily to 6.2m before tree collision).
The grass exploit runs long only when the open grass area has no obstacles.
Efficiency gate termination is the most reliable catch for circles. It does
not catch straight-line driving on open grass, where efficiency stays high;
the active_node progress terminator catches that case.
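A minimal sketch of the consecutive-step gate described above, kept free of gym so it can be read on its own. The class and method names are hypothetical; only the threshold (0.15), the patience (20 steps) and the -1.0 terminal reward come from this message.

```python
class EfficiencyGateTerminator:
    """Ends an episode after `patience` consecutive low-efficiency steps."""

    def __init__(self, threshold=0.15, patience=20, penalty=-1.0):
        self.threshold = threshold
        self.patience = patience
        self.penalty = penalty
        self.low_steps = 0

    def reset(self):
        self.low_steps = 0

    def update(self, efficiency, reward):
        """Return (reward, terminated) for one environment step."""
        if efficiency < self.threshold:
            self.low_steps += 1
        else:
            # Any step above the threshold restarts the count.
            self.low_steps = 0
        if self.low_steps >= self.patience:
            return self.penalty, True
        return reward, False
```

A wrapper would call update() once per step and reset() at the start of each episode.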
User's insight: a circling car stays near the same track waypoints, so
active_node (sim's track progress indicator) never advances. Track the
maximum active_node reached this episode. If it hasn't increased in
progress_patience=60 steps (~3.3s), terminate.
This catches:
- Circular driving (active_node oscillates, max never increases)
- Stuck on cone/barrier (active_node frozen)
- NOT triggered by: legitimate cornering, slow forward progress, lap resets
On lap completion, active_node wraps to 0 — reset max_node_seen and counter.
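The max-node rule can be sketched as below. This is an illustration under assumptions: the class name is hypothetical, and lap completion is passed in as a flag because this message does not say how the wrapper detects it.

```python
class ProgressPatience:
    """Terminates when the highest active_node seen stops increasing."""

    def __init__(self, patience=60):
        self.patience = patience
        self.reset()

    def reset(self):
        self.max_node_seen = -1
        self.steps_without_progress = 0

    def update(self, active_node, lap_completed=False):
        """Return True when the episode should be terminated."""
        if lap_completed:
            # active_node wraps to 0 on a new lap, so tracking starts over.
            self.reset()
        if active_node > self.max_node_seen:
            self.max_node_seen = active_node
            self.steps_without_progress = 0
        else:
            # Oscillating or frozen nodes never beat the maximum.
            self.steps_without_progress += 1
        return self.steps_without_progress >= self.patience
```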
Also: Exp 12 — single track mountain training with lap-based stopping criterion.
Train until 3 consecutive laps in eval, not fixed step count.
When both DummyVecEnv cars get stuck against walls simultaneously, Unity
physics slows to 1-2 FPS (heavy collision computation). At that speed,
stuck_steps=40 takes 1+ minute of wall-clock time — observed twice by user.
Fix: add max_stuck_seconds=12.0 wall-clock timeout. Timer resets whenever
car moves >= min_displacement. Fires regardless of step count if car hasn't
moved in 12 real-world seconds. Both triggers preserved (step count OR time).
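A sketch of the two-trigger check (step count OR wall-clock time). Names are hypothetical, the 0.5 default for min_displacement is taken from the original StuckTerminationWrapper description, and the clock is injectable only so the logic can be exercised without waiting 12 seconds.

```python
import math
import time


class StuckDetector:
    """Flags a stuck car by step count or by elapsed wall-clock time."""

    def __init__(self, stuck_steps=40, max_stuck_seconds=12.0,
                 min_displacement=0.5, clock=time.monotonic):
        self.stuck_steps = stuck_steps
        self.max_stuck_seconds = max_stuck_seconds
        self.min_displacement = min_displacement
        self.clock = clock
        self.anchor = None
        self.steps_since_move = 0
        self.last_move_time = clock()

    def reset(self, position):
        self.anchor = position
        self.steps_since_move = 0
        self.last_move_time = self.clock()

    def update(self, position):
        """Return True when either trigger fires."""
        if self.anchor is None:
            self.reset(position)
            return False
        if math.dist(position, self.anchor) >= self.min_displacement:
            # Real movement resets both the step counter and the timer.
            self.reset(position)
            return False
        self.steps_since_move += 1
        timed_out = self.clock() - self.last_move_time >= self.max_stuck_seconds
        return self.steps_since_move >= self.stuck_steps or timed_out
```

When Unity drops to 1-2 FPS the time trigger fires after a handful of steps; at normal frame rates the step trigger fires first.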
Removed the progress_patience (active_node) terminator that was added
without sufficient evidence. Per ADR-020, mountain rollback is a learning
issue not a termination issue. Removed code should not be re-added without
specific evidence it is needed.
Only confirmed fix: CTE patience terminator catches grass exploit BEFORE
CTE exceeds 16m (the sim's determine_episode_over pass threshold).
- max_cte_terminate=4.0m
- cte_patience=20 steps
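How the two parameters might combine, as a sketch. It assumes the terminator counts consecutive steps with |CTE| above max_cte_terminate; the class name and that reading of "patience" are assumptions, not taken from the wrapper code.

```python
class CtePatience:
    """Terminates after sustained large cross-track error."""

    def __init__(self, max_cte_terminate=4.0, cte_patience=20):
        self.max_cte_terminate = max_cte_terminate
        self.cte_patience = cte_patience
        self.off_track_steps = 0

    def reset(self):
        self.off_track_steps = 0

    def update(self, cte):
        """Return True when the episode should be terminated."""
        if abs(cte) > self.max_cte_terminate:
            self.off_track_steps += 1
        else:
            # A brief excursion that recovers does not end the episode.
            self.off_track_steps = 0
        return self.off_track_steps >= self.cte_patience
```

With CTE rising steadily on grass, this ends the episode at roughly 4m plus 20 steps, well before the 16m threshold.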
Critical facts documented permanently:
- throttle_min=0.5 is baked into the action space as a floor (too fast for corners)
- throttle_min=0.2 + v5 reward CAN learn the hill (proved in Exp 9: mountain only, 90k steps)
- Mountain failure in parallel is contamination from grass exploit, not throttle
- Grass exploit root cause: sim determine_episode_over() passes when CTE>16m
- DO NOT confuse mountain rollback with stuck issue
- DO NOT change throttle_min as first response to mountain failure
v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).
Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
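The gate can be written in a few lines. This is a sketch: the function names, the max_cte default of 8.0 and the clamp of CTE quality at zero are assumptions; the speed / 10 scaling and the 0.15 threshold come from the v5 and v6 descriptions.

```python
def v5_reward(speed, cte, max_cte=8.0):
    """speed x CTE quality, with quality clamped at zero (assumed)."""
    cte_quality = max(0.0, 1.0 - abs(cte) / max_cte)
    return (speed / 10.0) * cte_quality


def v6_reward(speed, cte, efficiency, max_cte=8.0, gate=0.15):
    """v5 reward, zeroed only when efficiency falls below the gate."""
    if efficiency < gate:
        return 0.0
    # Above the gate, efficiency no longer scales the reward, so a slow
    # but forward-moving car on a hill still sees the full speed gradient.
    return v5_reward(speed, cte, max_cte)
```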
Also reduced stuck_steps from 80 to 40 (~2.5s vs ~5s) — user reported
car stuck against barriers for ~10s which is too long with DummyVecEnv.
Scripts in /tmp are lost on reboot and not reproducible.
All experiment scripts now committed to git with README.
Exp5 script was already gone (lost before this fix).
All others (Exp6-Exp10, overnight, wave5, etc.) now preserved.
Rule going forward: scripts saved to agent/experiments/ and committed
BEFORE running, not after.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The short-lap episode termination fix in SpeedRewardWrapper was not
working when multitrack_runner.py ran via command line because the env
was created as a plain gym.Wrapper chain, not VecTransposeImage(DummyVecEnv).
In custom scripts (Exp8, Exp9), env was explicitly:
VecTransposeImage(DummyVecEnv([make_env]))
This made episode termination work correctly.
In multitrack_runner.py, env was just wrap_env(raw) — a plain gym.Wrapper.
SB3 auto-wraps this internally but the terminated signal from
SpeedRewardWrapper.force_terminate did not propagate correctly,
so circle-exploit episodes were never terminated during training.
Fix: use VecTransposeImage(DummyVecEnv([...])) explicitly in main().
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Every test run now saves to agent/test-results/YYYY-MM-DD_HH-MM_<model>.log
so results are never lost. Also added 3-set Exp9 eval results to TEST_HISTORY.
Usage:
python3 agent/run_eval.py --model models/exp9-.../best_model.zip --sets 3
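The log name pattern can be produced as sketched here. The helper name is hypothetical, and how run_eval.py derives the <model> part from the --model path is not stated in this message, so it is passed in as a plain string.

```python
from datetime import datetime
from pathlib import Path


def result_log_path(model_name, now=None, root="agent/test-results"):
    """Build <root>/YYYY-MM-DD_HH-MM_<model>.log for one eval run."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M")
    return Path(root) / f"{stamp}_{model_name}.log"
```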
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.
Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.
Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
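The shape of the change, as a standalone sketch rather than the wrapper itself. Function signatures are hypothetical; the 5.0s minimum lap time and the -100 penalty are the values quoted in this log.

```python
MIN_LAP_TIME = 5.0          # seconds
SHORT_LAP_PENALTY = -100.0  # per short lap


def compute_reward_and_done(base_reward, lap_completed, last_lap_time):
    """Return (reward, force_terminate) for one step."""
    if lap_completed and last_lap_time < MIN_LAP_TIME:
        return SHORT_LAP_PENALTY, True
    return base_reward, False


def step_outcome(base_reward, terminated, lap_completed, last_lap_time):
    """Mirror of what step() does with the tuple: OR in the forced flag."""
    reward, force_terminate = compute_reward_and_done(
        base_reward, lap_completed, last_lap_time)
    return reward, terminated or force_terminate
```

Because the episode ends on the penalised step, no positive reward can be collected afterwards to offset it.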
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Every training segment now saves checkpoint_NNNNNNN.zip so the
full training history is preserved on disk. No checkpoint is ever
overwritten. model.zip still updated for crash recovery.
After a 90k-step run with 13 segments you now have:
  checkpoint_0006851.zip  <- step 6,851
  checkpoint_0013702.zip  <- step 13,702
  ...
  checkpoint_0090000.zip  <- step 90,000
  best_model.zip          <- highest scoring segment (reloaded at end)
  model.zip               <- latest weights (crash recovery)
This means we can NEVER again lose a good mid-training model.
If the model was driving at step 30k, checkpoint_0030000.zip exists.
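The naming scheme, sketched with a hypothetical helper. Zero-padding the step to 7 digits means a plain alphabetical directory listing is also in training order.

```python
def checkpoint_name(step):
    """Zero-padded so lexicographic order matches step order."""
    return f"checkpoint_{step:07d}.zip"
```

For example, checkpoint_name(30000) gives checkpoint_0030000.zip.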
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Saving only the final weights was the root cause of losing good models during training.
The model could learn to lap at step 30k then drift to a worse
policy by step 90k, and we only ever saved the final weights.
Changes to train_multitrack():
- Tracks best_segment_reward across all segments
- Saves best_model.zip whenever a new high score is achieved
- At end of training, RELOADS best_model.zip before returning
so the caller always gets the best policy found, not the drift
Both files saved per trial:
  model.zip       <- latest checkpoint (crash recovery)
  best_model.zip  <- best policy seen during training (used for eval)
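The bookkeeping behind best_model.zip reduces to a running maximum. A sketch with a hypothetical class name; the actual save and reload calls are left to the caller.

```python
class BestSegmentTracker:
    """Tracks the best segment reward seen so far in one trial."""

    def __init__(self):
        self.best_segment_reward = float("-inf")
        self.best_segment = None

    def update(self, segment_index, segment_reward):
        """Return True when the caller should overwrite best_model.zip."""
        if segment_reward > self.best_segment_reward:
            self.best_segment_reward = segment_reward
            self.best_segment = segment_index
            return True
        return False
```

If rewards per segment are 10, 50, 30, best_model.zip is written after segments 0 and 1 and is the file reloaded at the end, even though segment 2 ran last.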
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.
v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
StuckTerminationWrapper (not by the reward formula)
Tests updated to reflect v5 semantics (102 passing).
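Why the product form stalls on hills and v5 does not, as a worked derivative with respect to speed. The symbols are shorthand introduced for this note, not names from the code: b is v4's base (CTE) term, e is efficiency, s is speed, c is CTE.

```latex
\begin{align*}
r_{v4} &= b \cdot e \cdot s,
  & \frac{\partial r_{v4}}{\partial s} &= b \cdot e
    \;\to\; 0 \quad \text{as } b, e \to 0 \\
r_{v5} &= \frac{s}{10}\left(1 - \frac{|c|}{c_{\max}}\right),
  & \frac{\partial r_{v5}}{\partial s} &= \frac{1}{10}\left(1 - \frac{|c|}{c_{\max}}\right)
    > 0 \quad \text{for } |c| < c_{\max}
\end{align*}
```

In v4 the sensitivity to speed is itself scaled by the terms that collapse on the hill; in v5 it depends only on how well centred the car is.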
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The goal is a model that generalises to ANY road-surface track, not
specifically mini_monaco. mini_monaco (tight barriers, hairpins) was
a bad proxy for this. Generated_road is a much better zero-shot test:
same visual category, never seen during Wave 4 training.
eval_on_track.py lets us run the Wave 4 champion on any track with
the same wrappers used during training, plus shuttle-exploit detection.
Run after Trial 25 finishes:
python3 agent/eval_on_track.py --model agent/models/wave4-champion/model.zip --track donkey-generated-roads-v0 --episodes 3 --max-steps 3000
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Samples car position every 100 steps during eval. Computes macro
efficiency = net_displacement / total_sampled_path. If < 0.3 with
>= 500 steps, logs WARNING: SHUTTLE EXPLOIT? with the efficiency value.
Also logs reward/step per episode so anomalously high-scoring long
episodes can be diagnosed immediately.
This will tell us definitively whether Trials 9 and 14 (1435/1573
scores, 2000 steps each) were genuine driving or back-and-forth
shuttling on a mini_monaco straight.
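The macro efficiency check, sketched on a list of sampled (x, z) positions. Function names are hypothetical, and returning 1.0 for fewer than two samples is an assumption; the 0.3 threshold and 500-step minimum are the values given above.

```python
import math


def macro_efficiency(samples):
    """net_displacement / total_sampled_path over sampled positions."""
    if len(samples) < 2:
        return 1.0  # assumed: too little data to call it a shuttle
    path = sum(math.dist(a, b) for a, b in zip(samples, samples[1:]))
    if path == 0.0:
        return 0.0  # car never moved between samples
    return math.dist(samples[0], samples[-1]) / path


def is_shuttle_suspect(samples, steps, threshold=0.3, min_steps=500):
    """True when a long episode covered distance without going anywhere."""
    return steps >= min_steps and macro_efficiency(samples) < threshold
```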
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Checkpoint code added save_dir inside train_multitrack() but save_dir
is defined in main(). Every trial since the checkpoint fix was added
crashed with NameError (name 'save_dir' is not defined) after the first segment,
producing rc=101 and no GP data.
Fix: add save_dir=None parameter to train_multitrack() and pass it
from the main() call site.
This explains why Trials 6-10 in the current run all produced None
results despite appearing to train normally for the first segment.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
ADR-013: Wave 4 train-from-scratch rationale (why no warm-start, why
generated_track+mountain_track, proven by 1943 overnight result)
ADR-014: Measure throughput before long runs (10+ hours lost to timeouts)
ADR-015: Per-segment checkpointing is non-negotiable
ADR-016: Verify fixes are running before walking away
These decisions existed in conversation but were never written down,
causing them to be forgotten after context compaction and re-learned
the hard way multiple times.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
All previous issues:
- Controller was never restarted after cap/checkpoint fixes -> they never ran
- Timeout trials (score=0) were polluting GP data -> removed
- Overnight Trial 3 result (1943 mini_monaco) was unknown to GP -> added
GP now has 5 valid data points including the 1943 score at
lr=0.000685, switch=17499. GP should converge toward longer
switching intervals which produced the only great result.
Verified before relaunch:
- PARAM_SPACE max total_timesteps = 90000 ✓
- Checkpoint saves after every segment ✓
- Rescue eval on timeout ✓
- 102 tests passing ✓
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Three changes:
1. Lower total_timesteps cap: 120k → 90k
Actual throughput is 16 steps/sec (not 20 as estimated).
120k steps = 126 min training + 9 min overhead = 135 min > 2hr limit.
90k steps = 94 min + 8 min overhead = 102 min, safely within limit.
2. Per-segment checkpoint saves in multitrack_runner
model.save() called after every segment so the latest weights are
always on disk. If the runner is killed (timeout/crash/Ctrl+C),
training data is never completely lost.
3. Timeout rescue eval in wave4_controller
If JOB_TIMEOUT fires and a checkpoint exists, immediately runs a
quick mini_monaco eval on the checkpoint so the trial still produces
a GP data point despite the timeout.
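The budget arithmetic from change 1 as a reusable check. Helper names are hypothetical. At exactly 16 steps/sec the 120k case works out to 125 min of training, within a minute of the figure quoted above.

```python
def estimated_minutes(total_steps, steps_per_sec, overhead_min):
    """Training time plus fixed overhead, in minutes."""
    return total_steps / steps_per_sec / 60.0 + overhead_min


def fits_timeout(total_steps, steps_per_sec, overhead_min, limit_min=120.0):
    """True when the estimate stays under the job timeout."""
    return estimated_minutes(total_steps, steps_per_sec, overhead_min) < limit_min
```

Running this with a measured steps/sec before launching a long trial is the cheap way to avoid another timeout.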
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Trials 3+4 both proposed ~140k steps and hit the 2hr JOB_TIMEOUT,
wasting time and producing no GP data. At ~20 steps/sec, 120k steps
takes ~100 min, safely within the 2hr limit.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
StuckTerminationWrapper added to wrap_env stack (between ThrottleClamp
and SpeedReward):
- Terminates episode after stuck_steps=80 steps with <0.5m displacement
- Handles slow barrier contact that Unity hit detection misses
- Handles off-lap-line circles (efficiency→0 gave zero reward but no
termination; now gives -1.0 after 80 steps = ~4s of non-progress)
- Wrapper stack: ThrottleClamp → StuckTermination → SpeedReward
Also: missing deque import in multitrack_runner.py caused NameError.
Phase 4 results cleared again (Trial 1 ran without StuckTermination).
Tests: 2 new stuck-termination tests, 102 total.
Agent: pi
Tests: 102 passed
Tests-Added: 2
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
< min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
Window size also increased from 30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with manual eval loop
capped at MAX_EVAL_STEPS=3000 steps.
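The short-lap penalty from item 1 as a formula check. The function name and the floor on lap_time (to avoid division by zero) are assumptions; the rest follows the expression above.

```python
def short_lap_penalty(lap_time, min_lap_time=5.0):
    """-10 x (min_lap_time / lap_time) for laps shorter than min_lap_time."""
    if lap_time >= min_lap_time:
        return 0.0  # legitimate laps are unaffected
    lap_time = max(lap_time, 1e-3)  # assumed guard against a zero lap time
    return -10.0 * (min_lap_time / lap_time)
```

The penalty grows as the lap gets shorter, so tighter circles are punished harder.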
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Without verbose=1 set explicitly, Wave 4 scratch-trained models produce no rollout stats in
the log, making it impossible to monitor training progress or spot
degenerate policies early.
Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start)
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate() which reads lr_schedule, not
model.learning_rate. So even with param_groups patched, the first
gradient step reverts the optimizer to the saved LR.
Complete 3-part fix in create_or_load_model():
    model.learning_rate = lr                          # attribute
    model.lr_schedule = get_schedule_fn(lr)           # prevents train() reverting
    for pg in optimizer.param_groups: pg['lr'] = lr   # immediate effect
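The same three steps as a standalone helper, for illustration. Assumptions: the helper name is hypothetical, the optimizer is reached via model.policy.optimizer, and a constant lambda stands in for get_schedule_fn(lr) so the sketch needs no SB3 import (an SB3 schedule is a callable of progress_remaining).

```python
def override_learning_rate(model, lr):
    """Force a loaded model to train at `lr` instead of its saved LR."""
    # 1. Attribute, so anything reading model.learning_rate sees the new value.
    model.learning_rate = lr
    # 2. Schedule, so the per-update LR refresh does not restore the old LR.
    model.lr_schedule = lambda _progress_remaining: lr
    # 3. Optimizer groups, so the very next gradient step already uses lr.
    for group in model.policy.optimizer.param_groups:
        group["lr"] = lr
```

Skipping step 2 is what made the earlier half-fix revert on the first call to train().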
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path including a live learn() call; would have caught both bugs
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A