The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient: the model stayed alive between laps, accumulating
small positive rewards, so circling remained net-positive despite the
penalty.
Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot,
so no further reward can accrue, which makes the circle exploit strictly
worse than any forward-driving behaviour.
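A minimal sketch of the flow, assuming a gymnasium-style wrapper (the
helper's exact signature here is illustrative, not the real one):

    # Sketch only: step() with the force-terminate path.
    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward, force_terminate = self._compute_reward_and_done(obs, info)
        if force_terminate:    # short lap detected
            terminated = True  # episode ends now; no further reward
        return obs, reward, terminated, truncated, info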
Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Problem with v4 on mountain_track: the CTE, efficiency, and speed terms
all collapse toward zero simultaneously when the car slows on the hill,
so the product carries no gradient signal for 'apply more throttle'.
v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
StuckTerminationWrapper (not by the reward formula)
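A minimal sketch of v5 (max_cte=8.0 and the plumbing of speed/CTE out of
the sim are assumptions, not the project's actual values):

    # Sketch only: v5 reward. max_cte default is illustrative.
    def v5_reward(speed: float, cte: float, max_cte: float = 8.0) -> float:
        centring = 1.0 - abs(cte) / max_cte  # 1.0 on centreline, 0.0 at edge
        return (speed / 10.0) * centring     # only fast AND centred pays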
Tests updated to reflect v5 semantics (102 passing).
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at the start/finish line, completing laps in 1-2 sim-seconds
and accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects the lap_count increment; if last_lap_time
< min_lap_time (5.0s), it returns penalty = -10 × (min_lap_time / lap_time),
so a 1-second lap gives a -50 penalty; legitimate 12-second laps are
unaffected (first sketch below). Window size also increased from 30 → 60
to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with a manual eval loop
capped at MAX_EVAL_STEPS=3000 steps (second sketch below).
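A sketch of the short-lap penalty (function and attribute names are
illustrative; the 5.0s threshold and -10 factor are from this change):

    MIN_LAP_TIME = 5.0  # seconds; anything quicker is treated as circling

    def short_lap_penalty(last_lap_time: float) -> float:
        # Faster exploit laps get steeper penalties:
        # a 1 s lap scores -10 * (5.0 / 1.0) = -50.
        if 0.0 < last_lap_time < MIN_LAP_TIME:
            return -10.0 * (MIN_LAP_TIME / last_lap_time)
        return 0.0  # legitimate ~12 s laps are unaffected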
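And a sketch of the capped manual eval loop that replaced
evaluate_policy (gymnasium-style API; `env` and `model` are assumed in
scope from the runner, and variable names are illustrative):

    MAX_EVAL_STEPS = 3000

    obs, _ = env.reset()
    segment_reward = 0.0
    for _ in range(MAX_EVAL_STEPS):  # hard cap: no runaway episodes
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        segment_reward += float(reward)
        if terminated or truncated:
            break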
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
ROOT CAUSE:
donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
to its own heading), so it earned positive reward indefinitely while circling.
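A toy illustration of the failure mode (numbers invented): a car's
velocity stays roughly aligned with its heading even while circling, so
the dot product never goes negative.

    import numpy as np

    heading = np.array([0.0, 1.0])                  # unit facing direction
    velocity = 2.0 * heading                        # circling: velocity tracks heading
    forward_vel = float(np.dot(heading, velocity))  # 2.0 > 0 on every step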
v3 WAS INSUFFICIENT:
v3 applied efficiency only to the speed BONUS: original × (1 + speed × eff × scale).
But the 'original' reward from the sim was still exploitable: CTE ≈ 0 while
spinning → original = 1.0/step. Efficiency killed the speed bonus but not
the base reward. Over a 47k-step run, spinning earned 1.0/step × 47k = 47k
reward (the car never crashes while circling).
v4 FIX — base × efficiency × speed:
reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
Completely ignores sim's bogus forward_vel reward.
Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
ALL three terms must be high to earn reward — cannot be gamed.
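A minimal sketch of v4 (here `efficiency` is assumed to mean net
displacement over distance travelled, ≈0 when spinning in place; the
max_cte and speed_scale defaults are illustrative):

    def v4_reward(cte: float, efficiency: float, speed: float,
                  max_cte: float = 8.0, speed_scale: float = 0.3) -> float:
        base = 1.0 - abs(cte) / max_cte  # centring term; ignores sim's forward_vel
        return base * efficiency * (1.0 + speed_scale * speed)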
Key new test: test_circling_at_zero_cte_gives_near_zero_reward
Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
forward_beats_circling_by_3x confirmed.
Also: updated the Phase 2 autoresearch timesteps test; research log updated.
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A
- Rebuilt donkeycar_sb3_runner.py: real PPO/DQN model.learn() + evaluate_policy() + model.save() (sketch after this list)
- Added SpeedRewardWrapper: reward = speed * (1 - |cte|/max_cte)
- Added ChampionTracker: tracks best model across all trials, writes manifest.json
- Rebuilt autoresearch_controller.py: Phase 1 results separated from random-policy data
- Added timesteps to GP search space
- Added --push-every N for automatic git push
- Added 37 passing tests: discretize_action, reward_wrapper, autoresearch_controller, runner_integration
- Scaffolded project with agent harness (large mode): PROJECT-SPEC, DECISIONS, IMPLEMENTATION_PLAN, EXECUTION_MASTER
- Fixed: model.save() was being called before model was defined (root cause of all prior NameError crashes)
- Fixed: random policy replaced with real trained policy evaluation
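A minimal sketch of the rebuilt runner flow (make_wrapped_donkey_env is
a hypothetical helper standing in for env construction; hyperparameters
and paths are illustrative; the define-before-save ordering is the point
of the NameError fix):

    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    env = make_wrapped_donkey_env()           # assumed helper: builds + wraps the sim env
    model = PPO("MlpPolicy", env, verbose=0)  # define the model first
    model.learn(total_timesteps=50_000)       # real training, not a random policy
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)
    model.save("models/trial_model.zip")      # save only after model exists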
Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: +37
TypeScript: N/A