Fixes four root-cause bugs discovered before/during this experiment:
1. regen_road was silently doing nothing — TcpCarHandler.RegenRoad() bailed on
null TrainingManager; added direct RoadBuilder+PathManager fallback.
2. MapOverlay minimap not refreshing — fixed to check node[10] position change.
3. BrakeOnUpdateCallback: sends zero control before PPO gradient updates to
prevent car drifting during 3-8s CPU pause.
4. PathManager self-intersection fix: retry loop with XZ segment-segment math
(up to 20 retries) — verifiably different roads per seed.
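Illustrative Python sketch of the retry idea in item 4 (the real fix lives in the C# PathManager; every name below is hypothetical). It assumes the road centreline is a list of (x, z) points and that a candidate is rejected when any two non-adjacent segments properly cross:

```python
import random


def _orient(a, b, c):
    """Signed area of triangle a-b-c in the XZ plane (positive = counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1, p2, q1, q2):
    """True if p1-p2 and q1-q2 properly cross; touching or collinear cases are ignored."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def self_intersects(path):
    """Check every pair of non-adjacent segments of an open polyline."""
    n = len(path) - 1  # number of segments
    for i in range(n):
        for j in range(i + 2, n):  # i + 1 shares an endpoint, so skip it
            if segments_cross(path[i], path[i + 1], path[j], path[j + 1]):
                return True
    return False


def generate_path(seed, make_candidate, max_retries=20):
    """First attempt plus up to max_retries regenerations from one seeded RNG."""
    rng = random.Random(seed)
    candidate = make_candidate(rng)
    for _ in range(max_retries):
        if not self_intersects(candidate):
            break
        candidate = make_candidate(rng)
    return candidate  # may still self-intersect if every retry failed
```

Drawing all candidates from a single seeded RNG keeps the result reproducible per seed while still letting each retry differ from the last.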
Exp27 trains fresh weights with N_THROTTLE=3 (bins 0.2/0.5/1.0), ent_coef=0.05,
500k steps, regen_road TCP message per checkpoint. Peak: 462.7r/1580 steps @110k.
Also adds verify_minimap_refresh.py and verify_road_regen.py diagnostic scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_best_models.py: evaluates exp24/25/26 best models across 10 fixed random
roads (regen_road with fixed seeds) for fair head-to-head comparison.
eval_gentrack_on_minimonaco.py: zero-shot evaluation of gentrack specialists
(exp13, wave5-gentrack-only, wave4-trial-0009) on mini-monaco.
Results: exp26 > exp25 > exp24 on random roads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Loads exp25 best_model (381r @ 80k) to skip early exploration. Runs 300k
steps on generated_road with road regen every 10k steps. Python-side hit
check is now active (added late in exp25, not loaded then). Final cross-model
eval: exp26 best (9/10 full eps, 381.2r mean) — top performer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
multitrack_runner.py: adds Python-side hit check as a zero-latency backstop
— gym_donkeycar can delay hit!=none termination by one frame; this fires
on the same step and records stuck_reason for diagnostics.
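Rough sketch of the backstop described above, not the actual multitrack_runner.py code. It assumes the step info dict carries a "hit" key whose value is "none" when nothing was hit; the helper name and the stuck_reason format are made up for illustration:

```python
def apply_hit_check(info, terminated):
    """Terminate on the same step that reports a hit, without waiting for the sim."""
    hit = info.get("hit", "none")
    if hit != "none" and not terminated:
        # Copy the dict so the caller's info is not mutated in place.
        info = dict(info, stuck_reason=f"hit:{hit}")
        return True, info
    return terminated, info
```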
eval_on_track.py: logs hit value and stuck_reason at episode end; calls
exit_scene after eval so the sim returns to main menu (next gym.make() can
switch scenes); removes unused SPEED_SCALE constant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exp24: reconnect to sim after each 10k-step checkpoint. Reconnecting reloads
the scene → sdsandbox generates a new random road. Each training segment and
each checkpoint eval now runs on a different road layout, preventing overfitting
to a single road and giving meaningful generalization metrics in the eval logs.
Car.cs: add a short forward raycast in FixedUpdate to detect barriers the front
wheels are pressing against. WheelColliders do not fire OnCollisionEnter/Stay on
the car's MonoBehaviour, so nose-first barrier contact was invisible to Car.cs
collision callbacks. The raycast fires when throttle > 0.05 and a collider is
within 0.8m forward — registers the collision the same way OnCollisionStay does.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StuckTerminationWrapper: add low_speed_threshold + max_low_speed_seconds params.
Car pinned against a barrier has speed≈0 even while sliding laterally — lateral
drift was resetting the position-based displacement timer, leaving the car stuck
for up to max_episode_seconds. Speed-based check terminates after 2s at speed<0.5.
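Sketch of the speed-based check, assuming the wrapper is handed the current speed and a timestamp each step. The two parameter names come from this commit; the class and its methods are hypothetical:

```python
class LowSpeedTimer:
    """Fires once speed has stayed below the threshold for long enough."""

    def __init__(self, low_speed_threshold=0.5, max_low_speed_seconds=2.0):
        self.low_speed_threshold = low_speed_threshold
        self.max_low_speed_seconds = max_low_speed_seconds
        self._slow_since = None  # timestamp when the current slow spell began

    def reset(self):
        self._slow_since = None

    def update(self, speed, now):
        """Return True when the episode should be terminated."""
        if speed >= self.low_speed_threshold:
            self._slow_since = None  # only real speed resets the timer
            return False
        if self._slow_since is None:
            self._slow_since = now
        return now - self._slow_since >= self.max_low_speed_seconds
```

Position never enters the check, so lateral sliding along a barrier cannot reset it.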
Exp24: 7-bin discrete steering (DiscretizedActionWrapper) eliminates Gaussian policy
noise that caused rapid oscillation in exp23. max_episode_seconds reduced to 30s
since speed-based stuck detection now handles the barrier-contact cases.
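Sketch of a 7-bin steering mapping. The commit does not list the bin values, so even spacing over [-1, 1] is an assumption, as are the function names; the [steering, throttle] action order follows the usual gym_donkeycar convention:

```python
def steering_bins(n_bins=7, max_steer=1.0):
    """Evenly spaced steering values from -max_steer to +max_steer (assumed layout)."""
    step = 2.0 * max_steer / (n_bins - 1)
    return [-max_steer + i * step for i in range(n_bins)]


def action_from_index(index, throttle, bins):
    """Map a discrete policy output to a continuous [steering, throttle] action."""
    return [bins[index], throttle]
```

A discrete policy picks one bin outright, so no Gaussian sampling noise is added to the steering command.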
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- generated_road.unity + generated_track.unity: showBarrierMeshes 1→0.
Visible barrier meshes would appear in the camera observation and let the
policy learn from an artificial visual cue that won't exist at eval time.
- exp23: add PID-file guard — aborts immediately if another instance is
already running, preventing multiple cars from spawning in the sim.
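Sketch of a PID-file guard along these lines (POSIX only; the function names and file handling are illustrative, not the exp23 code):

```python
import os


def pid_is_running(pid):
    """Signal 0 performs the existence check without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def acquire_pid_file(path):
    """Return False if a live instance already holds the PID file, else claim it."""
    if os.path.exists(path):
        try:
            with open(path) as f:
                old_pid = int(f.read().strip())
        except ValueError:
            old_pid = None  # unreadable content is treated as stale
        if old_pid is not None and pid_is_running(old_pid):
            return False
    with open(path, "w") as f:
        f.write(str(os.getpid()))
    return True
```

A caller would abort immediately when acquire_pid_file() returns False.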
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: barriers were zero-thickness MeshCollider planes with no CCD on the
car. The car tunnelled through between frames. Every Python patch was trying to
catch in code what physics should enforce.
Unity (source only — build in progress):
- RoadBuilder.cs: CreateBarrier() now makes BoxCollider-per-segment with real 3D
volume (barrierThickness=1.0m default) + half-thickness overlap at corners to
seal gaps. CreateEndCap() seals open ends of non-looping tracks (generated_road).
- Car.cs: rb.collisionDetectionMode = Continuous in Awake() — prevents tunneling.
Python:
- reward_wrapper.py v7: removed CTE-patience termination, high-CTE negative
reward, solid_hit monitoring, low-speed/wedge detection. Kept: efficiency gate,
no-progress (active_node) termination, lap exploit guard. Reward = speed×CTE_quality.
- exp23_generated_road_clean.py: single track, no warm-start, 200k steps, clean
reward, MAX_EPISODE_SECONDS=120 as safety net only.
- tests: 17 tests covering clean reward properties.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run stopped at ~34k steps. ep_len_mean frozen at 118 due to MAX_EPISODE_SECONDS=18
cap. Barriers identified as zero-thickness MeshColliders (physics tunneling root cause).
Clean-slate rebuild planned: BoxCollider barriers + CCD on car + simplified reward.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact
or 4 sustained solid-hit frames; prevents car wedging against invisible barriers
- reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned
motionless (below threshold, no displacement) after grace period
- reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE >
max_cte_terminate (not after patience), so PPO cannot collect positive speed
rewards while driving the large outside-road circle
- tests: 23 passing unit tests covering all new termination paths
- exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track
with warm-start from champion model; exp22 is current active run
- SESSION_HANDOFF.md: live handoff doc for next session continuity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StuckTerminationWrapper wall-clock timer was resettable by barrier-sliding:
car drifting 0.5m along a wall repeatedly resets the 12s timer. At low sim
fps (1-2fps when both cars stuck), 40-step check also takes minutes.
Fix: added max_episode_seconds=30 — hard wall-clock limit per episode,
independent of position or sim fps. No episode can run longer than 30s.
Also adds monitor_training.sh: independent shell process that checks every
5 minutes and appends status to /tmp/training_monitor.log — works without
Claude being active.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exp 17 post-mortem: efficiency gate window=30 steps only covers ~40% of a
3.5s exploit circle at 22fps, giving a partial-arc efficiency of ~0.77, well
above the 0.15 threshold at which the gate fires. The car earned positive
reward while circling, outweighing the -10 lap penalty. Performance peaked at
80k then collapsed.
Exp 18 fixes:
- window_size 30→200: covers 2+ full exploit circles, driving efficiency→0
- min_lap_time 5s→12s: genuine laps are 13-16s (gentrack) and 27-29s (mountain);
anything under 12s is an exploit and terminates immediately
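The window arithmetic can be reproduced with a few lines, assuming efficiency means net displacement divided by path length and the exploit is a constant-radius circle (both are assumptions; the function name is hypothetical):

```python
import math


def arc_efficiency(window_steps, fps, circle_period_s):
    """Return (circles covered by the window, chord length / arc length)."""
    fraction = (window_steps / fps) / circle_period_s
    theta = 2.0 * math.pi * fraction          # angle swept within the window (rad)
    chord = 2.0 * abs(math.sin(theta / 2.0))  # net displacement on a unit circle
    return fraction, chord / theta            # arc length on a unit circle is theta


old_fraction, old_eff = arc_efficiency(30, 22, 3.5)
new_fraction, new_eff = arc_efficiency(200, 22, 3.5)
```

Under these assumptions the 30-step window sees roughly 0.39 of a circle at an efficiency near 0.77, while the 200-step window spans more than two circles and drops below the 0.15 gate.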
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- exp17_parallel_450k.py: parallel two-track training (generated_track:9091,
mountain_track:9093), 450k steps, v6 reward, HOST=localhost
- DECISIONS.md: ADR-025 (parallel strategy) and ADR-026 (mountain friction fix)
- docs/STATE.md: updated to April 2026 state with current champions and strategy
- docs/TEST_HISTORY.md: mountain friction fix notes + Exp 17 full design
- outerloop-results: exp14 finetune logs and robust mountain eval results
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v5 required for mountain hills (v4 gives zero gradient on hills - documented Exp 1).
Same simple approach as Exp 13 which worked: single track, minimal wrappers,
lap-based stopping. ThrottleClamp + V5Reward only.
Return to Wave 4 setup that produced Trial 9 (2000/2000 on generated_track).
v4 reward: base × efficiency × speed. Circles give ~0 reward naturally.
No StuckTerminationWrapper, no CTE patience, no progress terminator.
Just ThrottleClamp + V4Reward. Lap-based stopping criterion.
Previously circles ran 20+ seconds because the efficiency gate only returned
0 reward without terminating. After 20 consecutive steps of efficiency < 0.15
(~0.7 seconds at 27 steps/sec), episode now terminates with -1.0.
Also confirmed from telemetry diagnostic: CTE does report correctly when
car goes off-track (rises steadily to 6.2m before tree collision).
The grass exploit runs long only when the open grass area has no obstacles.
Efficiency gate termination is the most reliable catch for circles, on track
or on open grass. Straight-line grass driving has high efficiency and slips
past the gate, but the active_node progress terminator catches that case.
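Sketch of the consecutive-step termination described in this commit. The threshold, patience, and penalty values are taken from the message; the class is hypothetical and assumes the efficiency value is computed elsewhere:

```python
class EfficiencyTerminator:
    """Zero reward while efficiency is low; terminate after `patience` such steps in a row."""

    def __init__(self, threshold=0.15, patience=20, penalty=-1.0):
        self.threshold = threshold
        self.patience = patience
        self.penalty = penalty
        self._low_steps = 0

    def reset(self):
        self._low_steps = 0

    def step(self, efficiency, reward):
        """Return (reward, terminated) for this step."""
        if efficiency < self.threshold:
            self._low_steps += 1
            if self._low_steps >= self.patience:
                return self.penalty, True
            return 0.0, False
        self._low_steps = 0  # any efficient step breaks the streak
        return reward, False
```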
User's insight: a circling car stays near the same track waypoints, so
active_node (sim's track progress indicator) never advances. Track the
maximum active_node reached this episode. If it hasn't increased in
progress_patience=60 steps (~3.3s), terminate.
This catches:
- Circular driving (active_node oscillates, max never increases)
- Stuck on cone/barrier (active_node frozen)
- NOT triggered by: legitimate cornering, slow forward progress, lap resets
On lap completion, active_node wraps to 0 — reset max_node_seen and counter.
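Sketch of the max-node logic above. The names progress_patience and max_node_seen come from this message; the class and the lap_completed flag are hypothetical stand-ins for however the wrapper detects a lap:

```python
class ProgressTerminator:
    """Terminate when the highest active_node seen stops increasing."""

    def __init__(self, progress_patience=60):
        self.progress_patience = progress_patience
        self.reset()

    def reset(self):
        self.max_node_seen = -1
        self.steps_without_progress = 0

    def step(self, active_node, lap_completed=False):
        """Return True when the episode should be terminated."""
        if lap_completed:
            self.reset()  # active_node wraps to 0 at the start of a new lap
        if active_node > self.max_node_seen:
            self.max_node_seen = active_node
            self.steps_without_progress = 0
            return False
        self.steps_without_progress += 1
        return self.steps_without_progress >= self.progress_patience
```

Because only the maximum counts, an active_node that oscillates while circling never resets the counter.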
Also: Exp 12 — single track mountain training with lap-based stopping criterion.
Train until 3 consecutive laps in eval, not fixed step count.
When both DummyVecEnv cars get stuck against walls simultaneously, Unity
physics slows to 1-2 FPS (heavy collision computation). At that speed,
stuck_steps=40 takes 1+ minute of wall-clock time — observed twice by user.
Fix: add max_stuck_seconds=12.0 wall-clock timeout. Timer resets whenever
car moves >= min_displacement. Fires regardless of step count if car hasn't
moved in 12 real-world seconds. Both triggers preserved (step count OR time).
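Sketch of the dual trigger (step count OR wall-clock time). Parameter names follow this message; the class, the 2D position tuple, and the min_displacement default of 0.5 are assumptions:

```python
import math


class StuckDetector:
    """Fires after stuck_steps steps or max_stuck_seconds seconds without displacement."""

    def __init__(self, stuck_steps=40, max_stuck_seconds=12.0, min_displacement=0.5):
        self.stuck_steps = stuck_steps
        self.max_stuck_seconds = max_stuck_seconds
        self.min_displacement = min_displacement
        self._anchor = None
        self._anchor_time = 0.0
        self._steps = 0

    def reset(self, pos, now):
        self._anchor = pos
        self._anchor_time = now
        self._steps = 0

    def update(self, pos, now):
        """Return True when either trigger fires; `now` would come from time.monotonic()."""
        if math.dist(pos, self._anchor) >= self.min_displacement:
            self.reset(pos, now)  # the car moved, so both triggers restart
            return False
        self._steps += 1
        too_many_steps = self._steps >= self.stuck_steps
        too_long = now - self._anchor_time >= self.max_stuck_seconds
        return too_many_steps or too_long
```

The time trigger matters when the sim drops to 1-2 FPS, where 40 steps can take over a minute.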
Removed the progress_patience (active_node) terminator that was added
without sufficient evidence. Per ADR-020, mountain rollback is a learning
issue not a termination issue. Removed code should not be re-added without
specific evidence it is needed.
Only confirmed fix: CTE patience terminator catches grass exploit BEFORE
CTE exceeds 16m (the sim's determine_episode_over pass threshold).
- max_cte_terminate=4.0m
- cte_patience=20 steps
Critical facts documented permanently:
- throttle_min=0.5 bakes into action space (too fast for corners)
- throttle_min=0.2 + v5 reward CAN learn hill (proved Exp 9, mountain only 90k)
- Mountain failure in parallel is contamination from grass exploit, not throttle
- Grass exploit root cause: sim determine_episode_over() passes when CTE>16m
- DO NOT confuse mountain rollback with stuck issue
- DO NOT change throttle_min as first response to mountain failure
v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).
Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
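Sketch contrasting a gate with a multiplier. The linear cte_quality form and max_cte=4.0 are placeholders, and the multiplier function only illustrates the v4-style behaviour rather than reproducing the v4 formula:

```python
def cte_quality(cte, max_cte=4.0):
    """Assumed form: 1.0 on the centreline, falling linearly to 0.0 at max_cte."""
    return max(0.0, 1.0 - abs(cte) / max_cte)


def reward_gate(speed, cte, efficiency, threshold=0.15):
    """v6-style: efficiency only decides whether any reward is paid."""
    if efficiency < threshold:
        return 0.0
    return speed * cte_quality(cte)


def reward_multiplier(speed, cte, efficiency):
    """v4-style illustration: efficiency scales the reward directly."""
    return speed * cte_quality(cte) * efficiency
```

For a slow hill climb with speed=0.5 and efficiency=0.2, the gate pays the full speed × CTE_quality term, while the multiplier shrinks it by a further factor of five.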
Also reduced stuck_steps from 80 to 40 (~5s down to ~2.5s); the user reported
cars stuck against barriers for ~10s, which is too long with DummyVecEnv.
Scripts in /tmp are lost on reboot and not reproducible.
All experiment scripts now committed to git with README.
Exp5 script was already gone (lost before this fix).
All others (Exp6-Exp10, overnight, wave5, etc.) now preserved.
Rule going forward: scripts saved to agent/experiments/ and committed
BEFORE running, not after.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The short-lap episode termination fix in SpeedRewardWrapper was not
working when multitrack_runner.py ran via command line because the env
was created as a plain gym.Wrapper chain, not VecTransposeImage(DummyVecEnv).
In custom scripts (Exp8, Exp9), env was explicitly:
VecTransposeImage(DummyVecEnv([make_env]))
This made episode termination work correctly.
In multitrack_runner.py, env was just wrap_env(raw) — a plain gym.Wrapper.
SB3 auto-wraps this internally but the terminated signal from
SpeedRewardWrapper.force_terminate did not propagate correctly,
so circle-exploit episodes were never terminated during training.
Fix: use VecTransposeImage(DummyVecEnv([...])) explicitly in main().
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Every test run now saves to agent/test-results/YYYY-MM-DD_HH-MM_<model>.log
so results are never lost. Also added 3-set Exp9 eval results to TEST_HISTORY.
Usage:
python3 agent/run_eval.py --model models/exp9-.../best_model.zip --sets 3
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.
Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.
Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
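Sketch of the (reward, force_terminate) contract. The standalone function signatures and the min_lap_time default are assumptions; the real method lives on SpeedRewardWrapper and takes different arguments:

```python
def compute_reward_and_done(base_reward, lap_time, min_lap_time=5.0,
                            short_lap_penalty=-100.0):
    """Return (reward, force_terminate); lap_time is None when no lap just finished."""
    if lap_time is not None and lap_time < min_lap_time:
        return short_lap_penalty, True  # short lap: penalise and end the episode
    return base_reward, False


def step_outcome(base_reward, lap_time, terminated):
    """How step() would fold force_terminate into the terminated flag."""
    reward, force_terminate = compute_reward_and_done(base_reward, lap_time)
    return reward, terminated or force_terminate
```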
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Every training segment now saves checkpoint_NNNNNNN.zip so the
full training history is preserved on disk. No checkpoint is ever
overwritten. model.zip still updated for crash recovery.
After a 90k-step run with 13 segments you now have:
checkpoint_0006851.zip <- step 6,851
checkpoint_0013702.zip <- step 13,702
...
checkpoint_0090000.zip <- step 90,000
best_model.zip <- highest scoring segment (reloaded at end)
model.zip <- latest weights (crash recovery)
This means we can NEVER again lose a good mid-training model.
If the model was driving at step 30k, checkpoint_0030000.zip exists.
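Sketch of the naming scheme, assuming a 7-digit zero-padded step count as in the listing above (the helper name and the overwrite guard are illustrative):

```python
import os


def checkpoint_path(save_dir, step):
    """Build checkpoint_NNNNNNN.zip and refuse to overwrite an existing file."""
    path = os.path.join(save_dir, f"checkpoint_{step:07d}.zip")
    if os.path.exists(path):
        raise FileExistsError(path)
    return path
```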
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
This was the root cause of losing good models during training.
The model could learn to lap at step 30k then drift to a worse
policy by step 90k, and we only ever saved the final weights.
Changes to train_multitrack():
- Tracks best_segment_reward across all segments
- Saves best_model.zip whenever a new high score is achieved
- At end of training, RELOADS best_model.zip before returning
so the caller always gets the best policy found, not the drift
Both files saved per trial:
model.zip <- latest checkpoint (crash recovery)
best_model.zip <- best policy seen during training (used for eval)
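Sketch of the best-segment bookkeeping with the save calls injected as callbacks (hypothetical names; in the real loop these would presumably be model.save(...) calls, followed by a reload of best_model.zip after the last segment):

```python
def run_segments(segment_rewards, save_latest, save_best):
    """Track the best segment; save_best runs only when a new high score is reached."""
    best_segment_reward = float("-inf")
    best_index = None
    for index, reward in enumerate(segment_rewards):
        save_latest(index)  # model.zip: always the newest weights
        if reward > best_segment_reward:
            best_segment_reward = reward
            best_index = index
            save_best(index)  # best_model.zip: best policy seen so far
    return best_index, best_segment_reward
```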
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A