fix: close short-lap circle exploit and cap segment eval episode length

Two reward hacking behaviours observed during Wave 4 training:

1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
   Model circles at start/finish line completing laps in 1-2 sim-seconds,
   accumulating lap_count indefinitely with no genuine track progress.
   Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
   < min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
   A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
   Window size also increased from 30 → 60 to catch slower circles.
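
   For reference, the penalty curve can be sketched on its own
   (`short_lap_penalty` is a hypothetical helper name for illustration; the
   0.1 s floor mirrors the wrapper's `max(lap_time, 0.1)` guard, and the
   wrapper only enters this branch when last_lap_time < min_lap_time):

   ```python
   def short_lap_penalty(lap_time: float, min_lap_time: float = 5.0) -> float:
       # -10 × (min_lap_time / lap_time), with lap_time floored at 0.1 s
       # so near-zero lap times give a large but bounded penalty.
       return -10.0 * (min_lap_time / max(lap_time, 0.1))

   print(short_lap_penalty(1.0))   # -50.0  (exploit lap)
   print(short_lap_penalty(0.05))  # -500.0 (floor bounds the blow-up)
   ```

   At the 5.0 s threshold the penalty bottoms out near -10 and grows as
   laps get shorter.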

2. Non-terminating segment eval episodes:
   evaluate_policy on wide tracks (no barriers) could run indefinitely,
   inflating segment_reward to 200k+. Replaced with manual eval loop
   capped at MAX_EVAL_STEPS=3000 steps.
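
   A minimal sketch of the capped-eval idea (NeverDoneEnv and capped_eval
   are hypothetical stand-ins; the real loop in multitrack_runner uses
   model.predict against the vectorised sim env):

   ```python
   class NeverDoneEnv:
       """Toy stand-in for a wide track where `done` never fires."""
       def reset(self):
           return 0.0
       def step(self, action):
           return 0.0, 1.0, False, {}  # obs, reward, done, info

   def capped_eval(env, policy, max_eval_steps=3000):
       # Manual eval loop: accumulates reward but never runs past
       # max_eval_steps, so a non-terminating episode cannot inflate it.
       obs = env.reset()
       ep_reward = 0.0
       for _ in range(max_eval_steps):
           obs, reward, done, _ = env.step(policy(obs))
           ep_reward += reward
           if done:
               break
       return ep_reward

   print(capped_eval(NeverDoneEnv(), lambda obs: 0))  # 3000.0, not unbounded
   ```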

Wave 4 results cleared (trials 4-6 ran with exploitable reward).

Tests: 4 new reward wrapper tests, 100 total passing.

Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Paul Huliganga 2026-04-15 09:06:25 -04:00
parent 1be95b7c82
commit 5d1227833d
7 changed files with 193 additions and 18 deletions

@@ -279,15 +279,21 @@ def train_multitrack(model, first_env, total_timesteps, steps_per_switch):
         )
         steps_done += segment_steps

-        # Quick segment reward estimate (run one short episode deterministically)
+        # Quick segment reward estimate — one deterministic episode,
+        # capped at MAX_EVAL_STEPS to prevent non-terminating episodes
+        # (e.g. car driving forever on wide generated_track) inflating the metric.
+        MAX_EVAL_STEPS = 3000
        try:
-            seg_reward, _ = evaluate_policy(
-                model, env,
-                n_eval_episodes=1,
-                deterministic=True,
-                return_episode_rewards=False,
-                warn=False,
-            )
+            obs = env.reset()
+            ep_reward = 0.0
+            for _ in range(MAX_EVAL_STEPS):
+                action, _ = model.predict(obs, deterministic=True)
+                obs, reward, done, info = env.step(action)
+                ep_reward += float(reward[0] if hasattr(reward, '__len__') else reward)
+                done_flag = done[0] if hasattr(done, '__len__') else done
+                if done_flag:
+                    break
+            seg_reward = ep_reward
             log(f'[W3 Runner][TRAIN] track={track_name} segment_reward={seg_reward:.2f}')
             segment_rewards.append((track_name, float(seg_reward)))
         except Exception as e:

@@ -606,3 +606,42 @@
[2026-04-14 22:43:59] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
[2026-04-14 22:43:59] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
[2026-04-14 22:43:59] [AutoResearch] Only 1 results — using random proposal.
[2026-04-15 09:03:29] [AutoResearch] GP UCB top-5 candidates:
[2026-04-15 09:03:29] UCB=2.3107 mu=0.3981 sigma=0.9563 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.001405531880392808, 'timesteps': 26173}
[2026-04-15 09:03:29] UCB=2.3049 mu=0.8602 sigma=0.7224 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.001793493447174312, 'timesteps': 19198}
[2026-04-15 09:03:29] UCB=2.2813 mu=0.4904 sigma=0.8954 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011616192816742616, 'timesteps': 13887}
[2026-04-15 09:03:29] UCB=2.2767 mu=0.5194 sigma=0.8787 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011646447444663046, 'timesteps': 21199}
[2026-04-15 09:03:29] UCB=2.2525 mu=0.6254 sigma=0.8136 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0010196345864901517, 'timesteps': 22035}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
[2026-04-15 09:03:29] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
[2026-04-15 09:03:29] [AutoResearch] Only 1 results — using random proposal.
[2026-04-15 09:04:15] [AutoResearch] GP UCB top-5 candidates:
[2026-04-15 09:04:15] UCB=2.3107 mu=0.3981 sigma=0.9563 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.001405531880392808, 'timesteps': 26173}
[2026-04-15 09:04:15] UCB=2.3049 mu=0.8602 sigma=0.7224 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.001793493447174312, 'timesteps': 19198}
[2026-04-15 09:04:15] UCB=2.2813 mu=0.4904 sigma=0.8954 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011616192816742616, 'timesteps': 13887}
[2026-04-15 09:04:15] UCB=2.2767 mu=0.5194 sigma=0.8787 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011646447444663046, 'timesteps': 21199}
[2026-04-15 09:04:15] UCB=2.2525 mu=0.6254 sigma=0.8136 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0010196345864901517, 'timesteps': 22035}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
[2026-04-15 09:04:15] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
[2026-04-15 09:04:15] [AutoResearch] Only 1 results — using random proposal.
[2026-04-15 09:05:43] [AutoResearch] GP UCB top-5 candidates:
[2026-04-15 09:05:43] UCB=2.3107 mu=0.3981 sigma=0.9563 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.001405531880392808, 'timesteps': 26173}
[2026-04-15 09:05:43] UCB=2.3049 mu=0.8602 sigma=0.7224 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.001793493447174312, 'timesteps': 19198}
[2026-04-15 09:05:43] UCB=2.2813 mu=0.4904 sigma=0.8954 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011616192816742616, 'timesteps': 13887}
[2026-04-15 09:05:43] UCB=2.2767 mu=0.5194 sigma=0.8787 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011646447444663046, 'timesteps': 21199}
[2026-04-15 09:05:43] UCB=2.2525 mu=0.6254 sigma=0.8136 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0010196345864901517, 'timesteps': 22035}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
[2026-04-15 09:05:43] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
[2026-04-15 09:05:43] [AutoResearch] Only 1 results — using random proposal.

@@ -345,3 +345,18 @@
[2026-04-14 22:44:13] [Wave3] Only 0 results — using random proposal.
[2026-04-14 22:44:13] [Champion] 🏆 NEW BEST! Trial 3: score=1500.00 (mini_monaco=1500.0) params={'learning_rate': 0.0002, 'steps_per_switch': 8000, 'total_timesteps': 150000}
[2026-04-14 22:44:13] [Champion] 🏆 NEW BEST! Trial 1: score=2000.00 (mini_monaco=2000.0) params={}
[2026-04-15 09:03:51] [Wave3] Seed trial 1/2: using hardcoded params.
[2026-04-15 09:03:51] [Wave3] Seed trial 2/2: using hardcoded params.
[2026-04-15 09:03:51] [Wave3] Only 0 results — using random proposal.
[2026-04-15 09:03:51] [Champion] 🏆 NEW BEST! Trial 3: score=1500.00 (mini_monaco=1500.0) params={'learning_rate': 0.0002, 'steps_per_switch': 8000, 'total_timesteps': 150000}
[2026-04-15 09:03:51] [Champion] 🏆 NEW BEST! Trial 1: score=2000.00 (mini_monaco=2000.0) params={}
[2026-04-15 09:04:44] [Wave3] Seed trial 1/2: using hardcoded params.
[2026-04-15 09:04:44] [Wave3] Seed trial 2/2: using hardcoded params.
[2026-04-15 09:04:44] [Wave3] Only 0 results — using random proposal.
[2026-04-15 09:04:44] [Champion] 🏆 NEW BEST! Trial 3: score=1500.00 (mini_monaco=1500.0) params={'learning_rate': 0.0002, 'steps_per_switch': 8000, 'total_timesteps': 150000}
[2026-04-15 09:04:44] [Champion] 🏆 NEW BEST! Trial 1: score=2000.00 (mini_monaco=2000.0) params={}
[2026-04-15 09:06:00] [Wave3] Seed trial 1/2: using hardcoded params.
[2026-04-15 09:06:00] [Wave3] Seed trial 2/2: using hardcoded params.
[2026-04-15 09:06:00] [Wave3] Only 0 results — using random proposal.
[2026-04-15 09:06:00] [Champion] 🏆 NEW BEST! Trial 3: score=1500.00 (mini_monaco=1500.0) params={'learning_rate': 0.0002, 'steps_per_switch': 8000, 'total_timesteps': 150000}
[2026-04-15 09:06:00] [Champion] 🏆 NEW BEST! Trial 1: score=2000.00 (mini_monaco=2000.0) params={}

@@ -119,3 +119,15 @@
[2026-04-15 07:15:57] score=1943.10 params={'learning_rate': 0.0006852550685205609, 'steps_per_switch': 17499, 'total_timesteps': 157743}
[2026-04-15 07:15:57] score=222.07 params={'learning_rate': 0.001, 'steps_per_switch': 6000, 'total_timesteps': 80000}
[2026-04-15 07:15:57] score=45.67 params={'learning_rate': 0.0003, 'steps_per_switch': 6000, 'total_timesteps': 80000}
[2026-04-15 07:15:59] [Wave4] ✅ Git push complete after trial 5
[2026-04-15 07:16:01]
[Wave4] ========== Trial 6/25 ==========
[2026-04-15 07:16:01] [Wave4] GP UCB top-5 proposals:
[2026-04-15 07:16:01] UCB=2.4565 mu=0.8712 σ=0.7926 params={'learning_rate': 0.0011062087200910864, 'steps_per_switch': 18318, 'total_timesteps': 194470}
[2026-04-15 07:16:01] UCB=2.4485 mu=0.9338 σ=0.7573 params={'learning_rate': 0.0004307107164246544, 'steps_per_switch': 19141, 'total_timesteps': 199878}
[2026-04-15 07:16:01] UCB=2.4478 mu=0.8840 σ=0.7819 params={'learning_rate': 0.00041215765557335777, 'steps_per_switch': 16229, 'total_timesteps': 203707}
[2026-04-15 07:16:01] UCB=2.4468 mu=0.8283 σ=0.8092 params={'learning_rate': 0.0009928039664024839, 'steps_per_switch': 19629, 'total_timesteps': 113788}
[2026-04-15 07:16:01] UCB=2.4456 mu=0.9298 σ=0.7579 params={'learning_rate': 0.0002412156295150517, 'steps_per_switch': 19116, 'total_timesteps': 179367}
[2026-04-15 07:16:01] [Wave4] Proposed params: {'learning_rate': 0.0011062087200910864, 'steps_per_switch': 18318, 'total_timesteps': 194470}
[2026-04-15 07:16:03] [Wave4] Launching trial 6: {'learning_rate': 0.0011062087200910864, 'steps_per_switch': 18318, 'total_timesteps': 194470}
[2026-04-15 07:16:03] [Wave4] Command: python3 /home/paulh/projects/donkeycar-rl-autoresearch/agent/multitrack_runner.py --total-timesteps 194470 --steps-per-switch 18318 --learning-rate 0.0011062087200910864 --eval-episodes 3 --save-dir /home/paulh/projects/donkeycar-rl-autoresearch/agent/models/wave4-trial-0006

@@ -1,5 +0,0 @@
{"trial": 1, "timestamp": "2026-04-15T00:02:45.732560", "params": {"learning_rate": 0.0003, "steps_per_switch": 6000, "total_timesteps": 80000}, "combined_test_score": 45.6693, "mini_monaco_reward": 45.6693, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/wave4-trial-0001/model.zip", "champion": true, "run_status": "ok", "elapsed_sec": 4699.276456594467}
{"trial": 2, "timestamp": "2026-04-15T01:21:38.620202", "params": {"learning_rate": 0.001, "steps_per_switch": 6000, "total_timesteps": 80000}, "combined_test_score": 222.0731, "mini_monaco_reward": 222.0731, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/wave4-trial-0002/model.zip", "champion": true, "run_status": "ok", "elapsed_sec": 4728.351642370224}
{"trial": 3, "timestamp": "2026-04-15T03:15:46.643415", "params": {"learning_rate": 0.0006852550685205609, "steps_per_switch": 17499, "total_timesteps": 157743}, "combined_test_score": 1943.1038, "mini_monaco_reward": 1943.1038, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/wave4-trial-0003/model.zip", "champion": true, "run_status": "ok", "elapsed_sec": 6843.732668876648}
{"trial": 4, "timestamp": "2026-04-15T05:15:51.127688", "params": {"learning_rate": 0.0003250095463348546, "steps_per_switch": 19054, "total_timesteps": 197116}, "combined_test_score": 0.0, "mini_monaco_reward": 0.0, "model_path": null, "champion": false, "run_status": "error_rc-9", "elapsed_sec": 7200.456610918045}
{"trial": 5, "timestamp": "2026-04-15T07:15:57.431753", "params": {"learning_rate": 0.0003927960467617446, "steps_per_switch": 19892, "total_timesteps": 201785}, "combined_test_score": 0.0, "mini_monaco_reward": 0.0, "model_path": null, "champion": false, "run_status": "error_rc-9", "elapsed_sec": 7202.279730081558}

@@ -62,20 +62,24 @@ class SpeedRewardWrapper(gym.Wrapper):
         self,
         env,
         speed_scale: float = 0.1,
-        window_size: int = 30,
+        window_size: int = 60,  # increased from 30 — catches slower circles
         min_efficiency: float = 0.05,
         max_cte: float = 8.0,
+        min_lap_time: float = 5.0,  # laps faster than this are penalised as exploits
     ):
         super().__init__(env)
         self.speed_scale = speed_scale
         self.window_size = window_size
         self.min_efficiency = min_efficiency
         self.max_cte = max_cte
+        self.min_lap_time = min_lap_time
         self._pos_history = deque(maxlen=window_size + 1)
+        self._last_lap_count = 0  # track lap completions to detect short-lap exploit

     def reset(self, **kwargs):
         result = self.env.reset(**kwargs)
         self._pos_history.clear()
+        self._last_lap_count = 0
         return result

     def step(self, action):
@@ -104,11 +108,36 @@ class SpeedRewardWrapper(gym.Wrapper):
         """
         Compute reward from scratch using CTE × efficiency × speed.
         Bypasses sim's exploitable forward_vel-based reward.
+
+        Exploit patches
+        ---------------
+        Short-lap circle: model circles at start/finish line triggering
+        lap completions every 1-2 sim-seconds. Detected via lap_count
+        increment + last_lap_time < min_lap_time → large penalty.
         """
         # Crash / episode over
         if done:
             return -1.0
+
+        # --- Short-lap exploit detection ---
+        # Fires exactly once per lap completion, only when the lap was too fast.
+        try:
+            current_lap_count = int(info.get('lap_count', 0) or 0)
+        except (TypeError, ValueError):
+            current_lap_count = self._last_lap_count
+        if current_lap_count > self._last_lap_count:
+            # A new lap just completed
+            self._last_lap_count = current_lap_count
+            try:
+                lap_time = float(info.get('last_lap_time', 999.0) or 999.0)
+            except (TypeError, ValueError):
+                lap_time = 999.0
+            if lap_time < self.min_lap_time:
+                # Tiny-circle exploit — heavy penalty proportional to how short the lap was
+                return -10.0 * (self.min_lap_time / max(lap_time, 0.1))
+            # Legitimate lap — no penalty, fall through to normal reward
+
         # Update position history
         pos = info.get('pos', None)
         if pos is not None:

@@ -231,3 +231,82 @@ def test_reward_resets_on_episode_reset():
     # Should get reasonable reward after fresh start
     assert rewards[-1] > 0, "Should get positive reward after reset and straight driving"
+
+
+# ---------------------------------------------------------------------------
+# Short-lap exploit patch tests
+# ---------------------------------------------------------------------------
+def test_short_lap_triggers_penalty():
+    """
+    A lap completed faster than min_lap_time must return a large penalty,
+    not a positive reward. This closes the start/finish circle exploit.
+    """
+    env = MockEnv(speed=3.0, cte=0.0, pos=(0., 0., 0.))
+    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
+    wrapper.reset()
+    # Simulate step where a new lap completes in 1 second (exploit)
+    info = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
+            'lap_count': 1, 'last_lap_time': 1.0}
+    reward = wrapper._compute_reward(done=False, info=info)
+    assert reward < 0, f'Short lap (1s) should penalise, got reward={reward}'
+    assert reward <= -10.0, f'Short lap penalty should be large (<= -10), got {reward}'
+
+
+def test_legitimate_lap_not_penalised():
+    """
+    A lap completed above min_lap_time must NOT trigger the penalty.
+    """
+    env = MockEnv(speed=3.0, cte=0.0, pos=(0., 0., 0.))
+    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
+    wrapper.reset()
+    # First step — no lap yet
+    info_no_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
+                   'lap_count': 0, 'last_lap_time': 0.0}
+    wrapper._compute_reward(done=False, info=info_no_lap)
+    # Legitimate lap at 12 seconds
+    info = {'cte': 0.2, 'speed': 3.0, 'pos': (1.0, 0.0, 0.0),
+            'lap_count': 1, 'last_lap_time': 12.0}
+    reward = wrapper._compute_reward(done=False, info=info)
+    assert reward >= 0, f'Legitimate lap (12s) should not be penalised, got {reward}'
+
+
+def test_lap_count_not_double_penalised():
+    """
+    Penalty fires exactly once per short lap, not on every subsequent step.
+    """
+    env = MockEnv(speed=3.0, cte=0.0, pos=(0., 0., 0.))
+    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
+    wrapper.reset()
+    # Short lap fires on step where lap_count increments
+    info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
+                'lap_count': 1, 'last_lap_time': 1.5}
+    r1 = wrapper._compute_reward(done=False, info=info_lap)
+    assert r1 < 0
+    # Next step same lap_count — should get normal reward, not another penalty
+    info_next = {'cte': 0.0, 'speed': 3.0, 'pos': (0.1, 0.0, 0.0),
+                 'lap_count': 1, 'last_lap_time': 1.5}
+    r2 = wrapper._compute_reward(done=False, info=info_next)
+    assert r2 >= 0, f'Penalty should not repeat on same lap_count, got r2={r2}'
+
+
+def test_lap_count_resets_on_episode_reset():
+    """lap_count tracker must reset when the episode resets."""
+    env = MockEnv(speed=3.0, cte=0.0, pos=(0., 0., 0.))
+    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
+    wrapper.reset()
+    # Complete a short lap
+    info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
+                'lap_count': 1, 'last_lap_time': 1.0}
+    wrapper._compute_reward(done=False, info=info_lap)
+    assert wrapper._last_lap_count == 1
+    # Reset episode — counter must go back to 0
+    wrapper.reset()
+    assert wrapper._last_lap_count == 0