Root cause: barriers were zero-thickness MeshCollider planes with no CCD on the
car. The car tunnelled through between frames. Every Python patch was trying to
catch in code what physics should enforce.
Unity (source only — build in progress):
- RoadBuilder.cs: CreateBarrier() now makes BoxCollider-per-segment with real 3D
volume (barrierThickness=1.0m default) + half-thickness overlap at corners to
seal gaps. CreateEndCap() seals open ends of non-looping tracks (generated_road).
- Car.cs: rb.collisionDetectionMode = CollisionDetectionMode.Continuous in Awake(), which prevents tunnelling.
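A rough arithmetic sketch of why a zero-thickness collider can be skipped under discrete collision detection. It treats the car as a point and uses illustrative speed and timestep values; none of these numbers are measurements from this project, and the function names are hypothetical.

```python
def distance_per_physics_step(speed_mps: float, fixed_dt_s: float) -> float:
    """Distance the car covers between two consecutive physics samples."""
    return speed_mps * fixed_dt_s


def can_skip_barrier(speed_mps: float, fixed_dt_s: float,
                     barrier_thickness_m: float) -> bool:
    """Point-car approximation (assumption): discrete detection only sees the
    car at sampled positions, so a barrier thinner than one step of travel
    can lie entirely between two samples and never register an overlap."""
    return distance_per_physics_step(speed_mps, fixed_dt_s) > barrier_thickness_m


# Illustrative values: 10 m/s at a 0.02 s physics step moves 0.2 m per step.
print(can_skip_barrier(10.0, 0.02, 0.0))  # zero-thickness plane
print(can_skip_barrier(10.0, 0.02, 1.0))  # barrierThickness = 1.0 m
```

Under this approximation any non-zero speed can cross a zero-thickness plane, while a 1.0 m box is only at risk at speeds well beyond what the car reaches. Continuous detection on the car closes the remaining gap.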
Python:
- reward_wrapper.py v7: removed CTE-patience termination, high-CTE negative
reward, solid_hit monitoring, low-speed/wedge detection. Kept: efficiency gate,
no-progress (active_node) termination, lap exploit guard. Reward = speed×CTE_quality.
- exp23_generated_road_clean.py: single track, no warm-start, 200k steps, clean
reward, MAX_EPISODE_SECONDS=120 as safety net only.
- tests: 17 tests covering clean reward properties.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact
or 4 sustained solid-hit frames; prevents car wedging against invisible barriers
- reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned
motionless (below threshold, no displacement) after grace period
- reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE >
max_cte_terminate (not after patience), so PPO cannot collect positive speed
rewards while driving the large outside-road circle
- tests: 23 passing unit tests covering all new termination paths
- exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track
with warm-start from champion model; exp22 is current active run
- SESSION_HANDOFF.md: live handoff doc for next session continuity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
User's insight: a circling car stays near the same track waypoints, so
active_node (sim's track progress indicator) never advances. Track the
maximum active_node reached this episode. If it hasn't increased in
progress_patience=60 steps (~3.3s), terminate.
This catches:
- Circular driving (active_node oscillates, max never increases)
- Stuck on cone/barrier (active_node frozen)
- NOT triggered by: legitimate cornering, slow forward progress, lap resets
On lap completion, active_node wraps to 0 — reset max_node_seen and counter.
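A minimal sketch of the max-node patience rule described above. The class name, the lap_completed flag and the way lap completion is detected are assumptions for illustration, not the wrapper's actual code.

```python
class ProgressPatience:
    """Terminate when the highest active_node seen has not increased
    for `patience` consecutive steps."""

    def __init__(self, patience: int = 60):
        self.patience = patience
        self.reset()

    def reset(self) -> None:
        self.max_node_seen = -1
        self.steps_since_progress = 0

    def update(self, active_node: int, lap_completed: bool = False) -> bool:
        """Return True when the episode should be terminated."""
        if lap_completed:
            # active_node wraps to 0 on a new lap, so start tracking afresh.
            self.reset()
        if active_node > self.max_node_seen:
            self.max_node_seen = active_node
            self.steps_since_progress = 0
            return False
        self.steps_since_progress += 1
        return self.steps_since_progress >= self.patience


# A circling car oscillates between nodes 0 and 1: the max never rises.
tracker = ProgressPatience(patience=3)
for node in (0, 1, 0, 1, 0):
    done = tracker.update(node)
print(done)
```

Tracking the maximum rather than the latest node is what makes oscillation count as non-progress.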
Also: Exp 12 — single track mountain training with lap-based stopping criterion.
Train until 3 consecutive laps in eval, not fixed step count.
When both DummyVecEnv cars get stuck against walls simultaneously, Unity
physics slows to 1-2 FPS (heavy collision computation). At that speed,
stuck_steps=40 takes 1+ minute of wall-clock time — observed twice by user.
Fix: add max_stuck_seconds=12.0 wall-clock timeout. Timer resets whenever
car moves >= min_displacement. Fires regardless of step count if car hasn't
moved in 12 real-world seconds. Both triggers preserved (step count OR time).
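A sketch of the dual trigger (step count OR wall-clock time), assuming displacement is measured from the last position at which the car was judged to have moved. The class name, the injectable clock and the 0.5 m default are illustrative assumptions.

```python
import math
import time


class StuckDetector:
    """Fires when the car has not moved min_displacement within either
    stuck_steps env steps or max_stuck_seconds of wall-clock time."""

    def __init__(self, stuck_steps=40, max_stuck_seconds=12.0,
                 min_displacement=0.5, clock=time.monotonic):
        self.stuck_steps = stuck_steps
        self.max_stuck_seconds = max_stuck_seconds
        self.min_displacement = min_displacement
        self.clock = clock  # injectable so the timer can be tested

    def reset(self, pos) -> None:
        self.anchor = pos
        self.steps = 0
        self.anchor_time = self.clock()

    def update(self, pos) -> bool:
        """Return True when the episode should be terminated as stuck."""
        if math.dist(pos, self.anchor) >= self.min_displacement:
            self.reset(pos)  # car moved: restart both the counter and the timer
            return False
        self.steps += 1
        too_many_steps = self.steps >= self.stuck_steps
        too_long = self.clock() - self.anchor_time >= self.max_stuck_seconds
        return too_many_steps or too_long
```

When the sim drops to 1-2 FPS the step counter advances slowly, so the time check is the one that fires first.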
Removed the progress_patience (active_node) terminator that was added
without sufficient evidence. Per ADR-020, mountain rollback is a learning
issue not a termination issue. Removed code should not be re-added without
specific evidence it is needed.
Only confirmed fix: CTE patience terminator catches grass exploit BEFORE
CTE exceeds 16m (the sim's determine_episode_over pass threshold).
- max_cte_terminate=4.0m
- cte_patience=20 steps
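A minimal sketch of the CTE patience check with the two values above. It assumes the counter tracks consecutive steps and resets as soon as the car is back within the limit; the class name is hypothetical.

```python
class CtePatience:
    """Terminate after cte_patience consecutive steps with
    |CTE| above max_cte_terminate."""

    def __init__(self, max_cte_terminate: float = 4.0, cte_patience: int = 20):
        self.max_cte_terminate = max_cte_terminate
        self.cte_patience = cte_patience
        self.steps_over = 0

    def reset(self) -> None:
        self.steps_over = 0

    def update(self, cte: float) -> bool:
        """Return True when the episode should be terminated."""
        if abs(cte) > self.max_cte_terminate:
            self.steps_over += 1
        else:
            self.steps_over = 0  # back on the road: forgive earlier excursions
        return self.steps_over >= self.cte_patience
```

With a 4.0 m limit this fires long before the car reaches the 16 m threshold mentioned above.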
Critical facts documented permanently:
- throttle_min=0.5 bakes into action space (too fast for corners)
- throttle_min=0.2 + v5 reward CAN learn the hill (proved in Exp 9: mountain only, 90k steps)
- Mountain failure in parallel is contamination from grass exploit, not throttle
- Grass exploit root cause: sim determine_episode_over() passes when CTE>16m
- DO NOT confuse mountain rollback with stuck issue
- DO NOT change throttle_min as first response to mountain failure
v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).
Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
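A sketch of the v6 gate as described above, using the v5 formula for the ungated reward. The max_cte default and the clamp of CTE quality at zero are assumptions; the function name is hypothetical.

```python
def v6_reward(speed: float, cte: float, efficiency: float,
              max_cte: float = 8.0, efficiency_gate: float = 0.15) -> float:
    """Efficiency acts as an on/off gate, not a multiplier."""
    if efficiency < efficiency_gate:
        return 0.0  # car is circling: no reward at all
    # Assumption: clamp so that |CTE| > max_cte cannot produce a negative value.
    cte_quality = max(0.0, 1.0 - abs(cte) / max_cte)
    return (speed / 10.0) * cte_quality


# Slow climb on a hill: efficiency is low but above the gate,
# so the reward still scales with speed and keeps its gradient.
print(v6_reward(speed=2.0, cte=0.0, efficiency=0.3))
print(v6_reward(speed=4.0, cte=0.0, efficiency=0.3))
# Circling: gated to zero whatever the speed.
print(v6_reward(speed=4.0, cte=0.0, efficiency=0.05))
```

Because efficiency never scales the reward once the gate is passed, doubling speed on the hill doubles the reward.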
Also reduced stuck_steps from 80 to 40 (~2.5s vs ~5s) — user reported
car stuck against barriers for ~10s which is too long with DummyVecEnv.
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.
Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.
Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
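A simplified sketch of the (reward, force_terminate) contract. The argument list is an assumption, and the -100 and 5.0 s defaults are taken from the figures quoted in these messages rather than from the code.

```python
def compute_reward_and_done(base_reward: float, lap_completed: bool,
                            last_lap_time: float, min_lap_time: float = 5.0,
                            short_lap_penalty: float = -100.0):
    """Return (reward, force_terminate). A lap faster than min_lap_time is
    treated as the circle exploit and ends the episode immediately."""
    if lap_completed and last_lap_time < min_lap_time:
        return short_lap_penalty, True
    return base_reward, False


# Inside step(), the flag would be folded into the env's own signal:
reward, force_terminate = compute_reward_and_done(0.4, True, 1.2)
terminated = False or force_terminate
print(reward, terminated)
```

Ending the episode on the spot removes the stream of small positive rewards that made circling worthwhile despite the penalty.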
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.
v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
StuckTerminationWrapper (not by the reward formula)
Tests updated to reflect v5 semantics (102 passing).
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
StuckTerminationWrapper added to wrap_env stack (between ThrottleClamp
and SpeedReward):
- Terminates episode after stuck_steps=80 steps with <0.5m displacement
- Handles slow barrier contact that Unity hit detection misses
- Handles off-lap-line circles (efficiency→0 gave zero reward but no
termination; now gives -1.0 after 80 steps = ~4s of non-progress)
- Wrapper stack: ThrottleClamp → StuckTermination → SpeedReward
Also: missing deque import in multitrack_runner.py caused NameError.
Phase 4 results cleared again (Trial 1 ran without StuckTermination).
Tests: 2 new stuck-termination tests, 102 total.
Agent: pi
Tests: 102 passed
Tests-Added: 2
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
< min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
Window size also increased from 30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with manual eval loop
capped at MAX_EVAL_STEPS=3000 steps.
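Both fixes can be sketched in a few lines. The penalty function follows the formula in item 1; the eval loop assumes an older gym-style step() returning a 4-tuple, and the function names and the guard against non-positive lap times are assumptions.

```python
MAX_EVAL_STEPS = 3000


def short_lap_penalty(lap_time: float, min_lap_time: float = 5.0,
                      scale: float = 10.0) -> float:
    """Penalty grows as the lap gets shorter; legitimate laps get 0."""
    if lap_time <= 0:
        raise ValueError("lap_time must be positive")  # assumption: guard
    if lap_time >= min_lap_time:
        return 0.0
    return -scale * (min_lap_time / lap_time)


def run_capped_episode(env, policy, max_steps: int = MAX_EVAL_STEPS) -> float:
    """Manual eval loop that stops after max_steps even if the env never
    reports done (assumes step() returns obs, reward, done, info)."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward


print(short_lap_penalty(1.0))   # 1-second lap
print(short_lap_penalty(12.0))  # legitimate lap
```

The cap bounds the reward any single eval episode can report, which is what stops a never-terminating episode from inflating segment_reward.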
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start)
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate() which reads lr_schedule, not
model.learning_rate. So even with param_groups patched, the first
gradient step reverts the optimizer to the saved LR.
Complete 3-part fix in create_or_load_model():
    model.learning_rate = lr                      # attribute
    model.lr_schedule = get_schedule_fn(lr)       # prevents train() reverting
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr                             # immediate effect
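The three parts can be exercised without Stable-Baselines3 by using a stand-in object with the same attribute layout. override_lr and constant_schedule are hypothetical names; with a real model the schedule would come from the library helper named above, and the stand-in only mimics the attributes involved.

```python
from types import SimpleNamespace


def constant_schedule(value: float):
    """Schedule that ignores progress_remaining and returns a fixed LR."""
    def schedule(progress_remaining: float) -> float:
        return value
    return schedule


def override_lr(model, lr: float) -> None:
    model.learning_rate = lr                       # attribute
    model.lr_schedule = constant_schedule(lr)      # what train() reads back
    for group in model.policy.optimizer.param_groups:
        group["lr"] = lr                           # immediate effect


# Stand-in for a loaded model whose saved LR was 0.000225.
model = SimpleNamespace(
    learning_rate=0.000225,
    lr_schedule=constant_schedule(0.000225),
    policy=SimpleNamespace(
        optimizer=SimpleNamespace(param_groups=[{"lr": 0.000225}])
    ),
)
override_lr(model, 0.001)
print(model.lr_schedule(1.0), model.policy.optimizer.param_groups[0]["lr"])
```

Patching only param_groups is the half-fix: the next update reads lr_schedule and writes the saved value back into the optimizer.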
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path including a live learn() call; would have caught both bugs
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A
Warren track surface is green carpet (not outdoor road), and the
episode-done condition (|CTE| > max_cte) does not fire when the car
crosses the INSIDE boundary. Car can drive off-track and bump into
chairs indefinitely, making scores meaningless as a test metric.
Changes:
- multitrack_runner.py: TEST_TRACKS now mini_monaco only
- wave3_controller.py: drop warren_reward from parse/save/champion paths
- tests/test_wave3.py: update assertions to match single test track
- All 83 tests pass
Track classification (final):
TRAIN : generated_road, generated_track, mountain_track
TEST : mini_monaco (outdoor, proper road, correct done condition)
SKIP : warren, warehouse, robo_racing_league, waveshare, circuit_launch
SKIP : avc_sparkfun (orange markings)
ADR-010 to be updated.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Bug: send_exit_scene_raw() opened a NEW TCP connection, creating a second
phantom vehicle. The sim sent exit_scene to the phantom, leaving the real
training connection stuck on generated_road for the entire run.
Fix: _send_exit_scene() now calls env.unwrapped.viewer.exit_scene() on the
EXISTING TCP connection that the training env already holds. This is the
only reliable way to switch scenes mid-session (matches track_switcher.py).
Also:
- Removed send_exit_scene_raw() import from multitrack_runner.py
- Simplified initial connection (no spurious exit_scene at startup)
- Reduced search space: total_timesteps 80k-400k -> 30k-150k
- Reduced seed params: 150k/300k -> 45k/90k (~35-45 min per trial)
- Added test: test_close_and_switch_uses_viewer_not_raw_socket
83 tests passing
Agent: pi
Tests: 83 passed
Tests-Added: 1
TypeScript: N/A
PHASE 2 MILESTONE DOCUMENTED:
All 3 top models complete the full track with distinct driving styles:
- Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION ✅
- Trial 8 (n_steer=4): Left/center lane, oscillating (still completes!)
- Trial 18 (n_steer=3): Right shoulder, very accurate line following
Key finding: fewer steering bins (n_steer=3) = better driving (counterintuitive)
CTE symmetry explains left/right preference: random NN init determines which side
BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py):
- LanePositionWrapper: target a specific CTE offset (control left/right preference)
- AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 oscillation)
- AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more)
- CombinedBehavioralWrapper: all three combined in one wrapper
ENHANCED EVALUATOR (agent/evaluate_champion.py):
- Full metrics: reward, lap time, oscillation score, CTE distribution, lane position
- --compare flag: runs all top Phase 2 models side by side with comparison table
- Saves eval summary to outerloop-results/eval_summary.jsonl
- Detects lap completion events from sim info dict
IMPLEMENTATION PLAN updated: Wave 3 streams defined
RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps
Champion updated to Trial 20 (Phase 2)
Agent: pi/claude-sonnet
Tests: 53/53 passing (+13 behavioral wrapper tests)
Tests-Added: +13
TypeScript: N/A
ROOT CAUSE:
donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
to its own heading), so it earned positive reward indefinitely while circling.
v3 WAS INSUFFICIENT:
v3 applied efficiency only to the speed BONUS: original × (1 + speed×eff×scale)
But 'original' from sim was still exploitable: CTE≈0 while spinning → original=1.0/step
Efficiency killed the speed bonus but not the base reward.
47k-step run: spinning = 1.0/step × 47k = 47k reward (never crashes in circle)
v4 FIX — base × efficiency × speed:
reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
Completely ignores sim's bogus forward_vel reward.
Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
ALL three terms must be high to earn reward — cannot be gamed.
Key new test: test_circling_at_zero_cte_gives_near_zero_reward
Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
forward_beats_circling_by_3x confirmed.
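A sketch of the v4 formula and the circling case. The message does not define efficiency, so this assumes net displacement divided by path length over a window of recent positions; the max_cte and speed_scale values and both function names are illustrative.

```python
import math


def path_efficiency(positions) -> float:
    """Assumed definition: straight-line distance between the first and last
    position, divided by the distance actually travelled."""
    path = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if path == 0.0:
        return 0.0
    return math.dist(positions[0], positions[-1]) / path


def v4_reward(cte: float, speed: float, efficiency: float,
              max_cte: float = 8.0, speed_scale: float = 0.1) -> float:
    cte_quality = max(0.0, 1.0 - abs(cte) / max_cte)  # clamp is an assumption
    return cte_quality * efficiency * (1.0 + speed_scale * speed)


straight = [(float(i), 0.0) for i in range(21)]
circle = [(math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
          for k in range(21)]

print(v4_reward(0.0, 5.0, path_efficiency(straight)))
print(v4_reward(0.0, 5.0, path_efficiency(circle)))
```

A full circle ends where it started, so its efficiency and therefore its reward are close to zero even at CTE=0.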
Also: update Phase 2 autoresearch timesteps test, research log updated.
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A
Problems fixed:
- Timesteps 5k-30k caused all trials to time out (PPO+CNN+CPU needs ~0.1s/step)
- New range: 1000-5000 steps fits well within 480s timeout
- PPO random init policy outputs throttle~0 -> car sits still -> fix with ThrottleClampWrapper (min 0.2)
- Sim stuck detection: if speed<0.02 for 100 consecutive steps, stop training and report error
- Sim frozen detection: if observation unchanged for 30 steps, stop training (connection lost)
- eval_episodes reduced to 3 to speed up evaluation phase
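The throttle clamp and the low-speed check above can be sketched as follows. The (steer, throttle) action layout, the upper throttle bound and both names are assumptions for illustration.

```python
def clamp_throttle(action, throttle_min: float = 0.2, throttle_max: float = 1.0):
    """Raise throttle to at least throttle_min so an untrained policy that
    outputs throttle near 0 still moves the car."""
    steer, throttle = action
    return (steer, min(max(throttle, throttle_min), throttle_max))


class LowSpeedCounter:
    """Reports a stuck sim after `patience` consecutive steps below min_speed."""

    def __init__(self, min_speed: float = 0.02, patience: int = 100):
        self.min_speed = min_speed
        self.patience = patience
        self.count = 0

    def update(self, speed: float) -> bool:
        # Any step at or above min_speed resets the consecutive count.
        self.count = self.count + 1 if speed < self.min_speed else 0
        return self.count >= self.patience


print(clamp_throttle((0.3, 0.0)))
print(clamp_throttle((-0.5, 0.7)))
```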
Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: 0 (behaviour change only)
TypeScript: N/A
- Rebuilt donkeycar_sb3_runner.py: real PPO/DQN model.learn() + evaluate_policy() + model.save()
- Added SpeedRewardWrapper: reward = speed * (1 - |cte|/max_cte)
- Added ChampionTracker: tracks best model across all trials, writes manifest.json
- Rebuilt autoresearch_controller.py: Phase 1 results separated from random-policy data
- Added timesteps to GP search space
- Added --push-every N for automatic git push
- Added 37 passing tests: discretize_action, reward_wrapper, autoresearch_controller, runner_integration
- Scaffolded project with agent harness (large mode): PROJECT-SPEC, DECISIONS, IMPLEMENTATION_PLAN, EXECUTION_MASTER
- Fixed: model.save() is no longer called before the model is defined (this was the root cause of all prior NameError crashes)
- Fixed: random policy replaced with real trained policy evaluation
Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: +37
TypeScript: N/A