- reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact
or 4 sustained solid-hit frames; prevents car wedging against invisible barriers
- reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned
motionless (below threshold, no displacement) after grace period
- reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE >
max_cte_terminate (not after patience), so PPO cannot collect positive speed
rewards while driving the large outside-road circle
- tests: 23 passing unit tests covering all new termination paths
- exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track
with warm-start from champion model; exp22 is current active run
- SESSION_HANDOFF.md: live handoff doc for next session continuity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
User's insight: a circling car stays near the same track waypoints, so
active_node (sim's track progress indicator) never advances. Track the
maximum active_node reached this episode. If it hasn't increased in
progress_patience=60 steps (~3.3s), terminate.
This catches:
- Circular driving (active_node oscillates, max never increases)
- Stuck on cone/barrier (active_node frozen)
- NOT triggered by: legitimate cornering, slow forward progress, lap resets
On lap completion, active_node wraps to 0 — reset max_node_seen and counter.
Also: Exp 12 — single track mountain training with lap-based stopping criterion.
Train until 3 consecutive laps in eval, not fixed step count.
Removed the progress_patience (active_node) terminator that was added
without sufficient evidence. Per ADR-020, mountain rollback is a learning
issue not a termination issue. Removed code should not be re-added without
specific evidence it is needed.
Only confirmed fix: CTE patience terminator catches grass exploit BEFORE
CTE exceeds 16m (the sim's determine_episode_over pass threshold).
- max_cte_terminate=4.0m
- cte_patience=20 steps
Critical facts documented permanently:
- throttle_min=0.5 bakes into action space (too fast for corners)
- throttle_min=0.2 + v5 reward CAN learn hill (proved Exp 9, mountain only 90k)
- Mountain failure in parallel is contamination from grass exploit, not throttle
- Grass exploit root cause: sim determine_episode_over() passes when CTE>16m
- DO NOT confuse mountain rollback with stuck issue
- DO NOT change throttle_min as first response to mountain failure
v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).
Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
Also reduced stuck_steps from 80 to 40 (~2.5s vs ~5s) — user reported
car stuck against barriers for ~10s which is too long with DummyVecEnv.
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.
Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.
Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.
v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
StuckTerminationWrapper (not by the reward formula)
Tests updated to reflect v5 semantics (102 passing).
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
< min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
Window size also increased from 30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with manual eval loop
capped at MAX_EVAL_STEPS=3000 steps.
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
ROOT CAUSE:
donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
to its own heading), so it earned positive reward indefinitely while circling.
v3 WAS INSUFFICIENT:
v3 applied efficiency only to the speed BONUS: original × (1 + speed×eff×scale)
But 'original' from sim was still exploitable: CTE≈0 while spinning → original=1.0/step
Efficiency killed the speed bonus but not the base reward.
47k-step run: spinning = 1.0/step × 47k = 47k reward (never crashes in circle)
v4 FIX — base × efficiency × speed:
reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
Completely ignores sim's bogus forward_vel reward.
Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
ALL three terms must be high to earn reward — cannot be gamed.
Key new test: test_circling_at_zero_cte_gives_near_zero_reward
Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
forward_beats_circling_by_3x confirmed.
Also: update Phase 2 autoresearch timesteps test, research log updated.
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A
- Rebuilt donkeycar_sb3_runner.py: real PPO/DQN model.learn() + evaluate_policy() + model.save()
- Added SpeedRewardWrapper: reward = speed * (1 - |cte|/max_cte)
- Added ChampionTracker: tracks best model across all trials, writes manifest.json
- Rebuilt autoresearch_controller.py: Phase 1 results separated from random-policy data
- Added timesteps to GP search space
- Added --push-every N for automatic git push
- Added 37 passing tests: discretize_action, reward_wrapper, autoresearch_controller, runner_integration
- Scaffolded project with agent harness (large mode): PROJECT-SPEC, DECISIONS, IMPLEMENTATION_PLAN, EXECUTION_MASTER
- Fixed: model.save() never called before model is defined (was root cause of all prior NameError crashes)
- Fixed: random policy replaced with real trained policy evaluation
Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: +37
TypeScript: N/A