fix: reward v6 — efficiency gate prevents circular driving, stuck_steps 80→40

v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).
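
A minimal sketch of the gate described above, not the code from this commit. `path_efficiency` is a hypothetical stand-in for the wrapper's `_compute_efficiency()`, whose body is not part of this diff; it assumes efficiency is first-to-last displacement divided by the summed step distances inside the window. The `max_cte=8.0` default and the `speed / 10.0` normalisation are taken from the wrapper.

```python
import math

def path_efficiency(positions):
    """Net displacement / total path length over a window of (x, z) points."""
    if len(positions) < 2:
        return 1.0  # assumption: too little history to judge, let the gate pass
    total_path = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if total_path < 1e-6:
        return 1.0  # assumption: a stationary car is not treated as circling
    return math.dist(positions[0], positions[-1]) / total_path

def reward_v6(speed, cte, efficiency, max_cte=8.0, min_efficiency=0.15):
    """Gate first, then the v5 formula: speed_norm × cte_quality."""
    if efficiency < min_efficiency:
        return 0.0  # gate fails: circling earns nothing
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)
    speed_norm = min(speed / 10.0, 1.0)
    return cte_quality * speed_norm

# 30-point windows: a circle completed every 12 steps (radius 0.5) vs a straight line.
circle = [(0.5 * math.cos(2 * math.pi * i / 12),
           0.5 * math.sin(2 * math.pi * i / 12)) for i in range(30)]
straight = [(0.5 * i, 0.0) for i in range(30)]

eff_circle = path_efficiency(circle)      # about 0.13, below the 0.15 threshold
eff_straight = path_efficiency(straight)  # 1.0

reward_circle = reward_v6(speed=3.0, cte=0.0, efficiency=eff_circle)      # gated to 0.0
reward_straight = reward_v6(speed=3.0, cte=0.0, efficiency=eff_straight)  # 0.3
```

Under these assumptions the circle used in the updated tests lands near 0.13, which matches the "circles ≈ 0.13" note next to the new `min_efficiency` default.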

Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
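
The gate-vs-multiplier argument can be checked numerically. The sketch below compares the slope of reward with respect to speed under the v4 and v6 formulas from the docstring in this commit. The hill state (speed 0.2, cte_quality 0.5, efficiency 0.2) is an illustrative assumption, not a measured value, and the helper names are hypothetical.

```python
def reward_v4(speed, cte_quality, eff, speed_scale=0.1):
    # v4: efficiency multiplies the whole reward
    return cte_quality * eff * (1.0 + speed_scale * speed)

def reward_v6(speed, cte_quality, eff, min_efficiency=0.15):
    # v6: efficiency only decides pass/fail
    if eff < min_efficiency:
        return 0.0
    return cte_quality * min(speed / 10.0, 1.0)

def slope_wrt_speed(reward_fn, speed, cte_quality, eff, h=1e-3):
    # Forward finite difference of reward with respect to speed
    return (reward_fn(speed + h, cte_quality, eff)
            - reward_fn(speed, cte_quality, eff)) / h

# Assumed hill state: slow, moderately centred, low but above-threshold efficiency
hill = dict(speed=0.2, cte_quality=0.5, eff=0.2)
slope_v4 = slope_wrt_speed(reward_v4, **hill)  # cte_quality * eff * speed_scale = 0.01
slope_v6 = slope_wrt_speed(reward_v6, **hill)  # cte_quality / 10 = 0.05

# A circling car (efficiency below threshold) earns nothing under v6
reward_circling = reward_v6(speed=3.0, cte_quality=1.0, eff=0.13)  # 0.0
```

Under v4 the slope scales with efficiency, so it shrinks toward zero as the car stalls. Under v6 it depends only on cte_quality once the gate passes.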

Also reduced stuck_steps from 80 to 40 (~5s down to ~2.5s). A user reported
the car staying stuck against barriers for ~10s, which is too long with DummyVecEnv.
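
The step counts translate to wall-clock time as sketched below. The 16 steps/s rate is only inferred from the "40 steps ≈ 2.5s" figure above, and the 8 steps/s case is a hypothetical slower rate; neither is stated anywhere in the code.

```python
def stuck_timeout_seconds(stuck_steps: int, steps_per_second: float) -> float:
    """Time the car can sit below min_displacement before the episode ends."""
    return stuck_steps / steps_per_second

# Assumption: nominal rate implied by the commit message (40 steps ≈ 2.5 s)
NOMINAL_STEPS_PER_SECOND = 16.0

old_timeout = stuck_timeout_seconds(80, NOMINAL_STEPS_PER_SECOND)  # 5.0 s
new_timeout = stuck_timeout_seconds(40, NOMINAL_STEPS_PER_SECOND)  # 2.5 s

# Hypothetical: if the effective step rate halves, the old setting stretches to 10 s
slow_old_timeout = stuck_timeout_seconds(80, 8.0)  # 10.0 s
```

Because the timeout is counted in steps, any slowdown in the effective step rate lengthens it proportionally.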
Paul Huliganga 2026-04-19 12:02:55 -04:00
parent 21addf268e
commit beb04f3ebe
5 changed files with 110 additions and 69 deletions

@@ -36,7 +36,7 @@ def make_env(track_id, port):
     def _init():
         raw = gym.make(track_id, conf={'host': HOST, 'port': port})
         env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
-        env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+        env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
         env = SpeedRewardWrapper(env)
         return env
     return _init

@@ -177,7 +177,7 @@ class StuckTerminationWrapper(gym.Wrapper):
 def wrap_env(raw_env):
     """Apply standard wrappers: throttle clamp + stuck detection + speed reward."""
     env = ThrottleClampWrapper(raw_env, throttle_min=THROTTLE_MIN)
-    env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+    env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
     env = SpeedRewardWrapper(env, speed_scale=SPEED_SCALE)
     return env

@@ -1,6 +1,6 @@
 """
-Speed + Progress Reward Wrapper for DonkeyCar RL v4 (Full Bypass)
-====================================================================
+Speed + Progress Reward Wrapper for DonkeyCar RL v6 (Speed×CTE + Efficiency Gate)
+=====================================================================================
 REWARD HACKING HISTORY:
 v1 additive: speed × (1-cte/max_cte) → boundary oscillation
@@ -8,9 +8,15 @@ REWARD HACKING HISTORY:
 v3 path efficiency: original × (1+speed×eff×scale) → still circling!
 WHY v3 failed: efficiency killed the SPEED BONUS but not the BASE reward.
 A spinning car at CTE≈0 still earns 1.0/step × thousands of steps.
-v4 (THIS VERSION): Completely bypass sim's reward. Multiply base reward by
-efficiency so circling yields ZERO reward regardless of CTE.
+v4: base × eff × (1 + speed_scale × speed) → zero gradient on hills!
+WHY v4 failed on hills: speed≈0 AND eff≈0 AND cte_quality varies → all
+three terms near zero simultaneously → no gradient to push ANY term up.
+v5: speed × CTE_quality (no efficiency) → circular driving returns!
+WHY v5 failed: dropped efficiency entirely. Circular driving at CTE≈0
+with speed>0 earns positive reward indefinitely. Observed in Exp 11.
+v6 (THIS VERSION): v5 reward + efficiency GATE.
+Keeps v5's gradient properties (non-zero gradient on hills) but adds
+a binary efficiency check that zeros reward when car is circling.
 ROOT CAUSE OF CIRCLING:
 The sim's own calc_reward() uses `forward_vel` = dot(car_heading, velocity).
@@ -18,24 +24,35 @@ ROOT CAUSE OF CIRCLING:
 so forward_vel > 0 always, giving positive reward while circling indefinitely.
 We bypass this entirely.
-FORMULA (v4):
-base = 1.0 - min(abs(cte) / max_cte, 1.0) # CTE quality [0,1]
-eff = net_displacement / total_path_length # Forward progress [0,1]
-shaped = base × eff × (1 + speed_scale × speed) # All three must be high
-On done/crash: shaped = -1.0
+FORMULA (v6):
+cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1] centred=1
+speed_norm = min(speed / 10.0, 1.0) # [0,1] normalised
+efficiency = net_displacement / total_path # [0,1] straight=1, circle=0
+if efficiency < min_efficiency:
+    reward = 0.0 # GATE: circling → zero reward (but not negative)
+else:
+    reward = cte_quality × speed_norm # v5 formula (gradient on hills)
+On done/crash: reward = -1.0
+WHY GATE NOT MULTIPLIER:
+v4 used efficiency as a multiplier: reward = base × eff × speed_bonus.
+On a hill: speed≈0, eff≈0, base≈0.5 → reward≈0 and ∂reward/∂speed≈0.
+No gradient to push speed up → car stays stuck.
+v6 gate: efficiency is either PASS or FAIL. When efficiency > threshold
+(car moving forward at all), reward = speed × CTE_quality. On a hill:
+car is stuck but still has eff > 0 (not literally circling), so the gate
+passes and the reward = speed × CTE_quality. ∂reward/∂speed > 0 → gradient
+pushes toward more throttle. Circle has eff ≈ 0 → gate fails → reward = 0.
 PROPERTIES:
-- Spinning (eff≈0): shaped ≈ 0 (no reward)
-- On track, slow (eff≈1): shaped ≈ base (CTE reward only)
-- On track, fast (eff≈1): shaped > base (CTE + speed bonus)
-- Off track (base≈0): shaped ≈ 0 (penalty via done)
-- Cannot be gamed: ALL THREE terms must be high simultaneously
-RESEARCH NOTE (2026-04-13):
-v3 was insufficient: circling at start gave 1.0/step × 47k steps = 47k reward.
-v4 makes efficiency a multiplier on the entire reward, not just the speed bonus.
-See docs/RESEARCH_LOG.md for full hacking history.
+- Circling (eff<threshold): reward = 0 (no incentive to circle)
+- On track, stuck (eff>0): reward = speed × CTE (gradient toward unstuck)
+- On track, fast: reward = high (speed + centred)
+- Off track: reward ≈ 0 (CTE_quality ≈ 0)
+- Crash: reward = -1.0
 """
 import gymnasium as gym
@@ -62,8 +79,8 @@ class SpeedRewardWrapper(gym.Wrapper):
         self,
         env,
         speed_scale: float = 0.1,
-        window_size: int = 60, # increased from 30 — catches slower circles
-        min_efficiency: float = 0.05,
+        window_size: int = 30, # captures 2+ full circles at typical circling speed
+        min_efficiency: float = 0.15, # gate threshold: circles ≈ 0.13, wobbly straight ≈ 0.98
         max_cte: float = 8.0,
         min_lap_time: float = 5.0, # laps faster than this are penalised as exploits
     ):
@@ -109,26 +126,36 @@ class SpeedRewardWrapper(gym.Wrapper):
     def _compute_reward_and_done(self, done: bool, info: dict):
         """
-        v5: speed × CTE-quality reward.
-        reward = speed × (1 - |cte| / max_cte)
-        Simpler than v4. Directly incentivises going FAST while staying
-        centred. On a hill: car slows → reward drops → clear gradient
-        signal to apply more throttle. v4's efficiency term gave zero
-        gradient when the car was stuck (all three terms collapsed to zero
-        simultaneously, so no direction to improve).
-        Exploit protection (unchanged):
-        - Short-lap penalty: laps < min_lap_time → large negative reward
-        - StuckTerminationWrapper: done=True after 80 steps of <0.5m movement
+        v6: speed × CTE-quality + efficiency gate.
+        reward = speed_norm × cte_quality (when efficiency >= threshold)
+        reward = 0.0 (when efficiency < threshold → circling)
+        reward = -1.0 (on crash/done)
+        The efficiency gate prevents circular driving (eff≈0 for circles)
+        without killing gradient on hills (eff>0 for a stuck-but-not-circling
+        car, so the gate passes and speed×CTE gradient pushes toward unstuck).
+        Exploit protection:
+        - Efficiency gate: circles → reward = 0
+        - Short-lap penalty: laps < min_lap_time → large negative + terminate
+        - StuckTerminationWrapper: done=True after stuck_steps of no movement
         - Crash: done=True → -1.0
         """
+        # Track position for efficiency calculation
+        try:
+            pos = info.get('pos', (0.0, 0.0, 0.0))
+            pos_x = float(pos[0])
+            pos_z = float(pos[2]) # z is forward in Unity coordinate system
+            self._pos_history.append(np.array([pos_x, pos_z]))
+        except (TypeError, ValueError, IndexError):
+            pass
         # Crash / episode over
         if done:
             return -1.0, False
-        # --- Short-lap exploit detection (unchanged) ---
+        # --- Short-lap exploit detection ---
         try:
             current_lap_count = int(info.get('lap_count', 0) or 0)
         except (TypeError, ValueError):
@@ -141,13 +168,16 @@ class SpeedRewardWrapper(gym.Wrapper):
             except (TypeError, ValueError):
                 lap_time = 999.0
             if lap_time < self.min_lap_time:
-                # Short-lap exploit: penalty AND terminate episode immediately.
-                # Penalty alone is insufficient — the model stays alive and
-                # keeps accumulating small rewards between laps.
-                # Termination removes that loophole completely.
                 penalty = -10.0 * (self.min_lap_time / max(lap_time, 0.1))
                 return penalty, True # (reward, force_terminate)
-            # Legitimate lap — fall through to normal reward
+        # --- Efficiency gate: detect circular driving ---
+        efficiency = self._compute_efficiency()
+        if efficiency < self.min_efficiency:
+            # Car is circling — zero reward but don't terminate.
+            # Zero (not negative) so there's no perverse incentive to crash
+            # early to avoid accumulating penalties.
+            return 0.0, False
         # --- CTE quality: how centred is the car? ---
         try:
@@ -162,10 +192,7 @@ class SpeedRewardWrapper(gym.Wrapper):
         except (TypeError, ValueError):
             speed = 0.0
-        # --- v5 reward: speed × CTE quality ---
-        # Fast + centred = high reward. Slow (hill) = low reward → gradient
-        # pushes policy toward higher throttle. Off-track = near-zero.
-        # Normalise speed so max reward ≈ 1.0 at reasonable speed (10 m/s).
+        # --- v6 reward: speed × CTE quality (same as v5, but gated) ---
         speed_norm = min(speed / 10.0, 1.0)
         return cte_quality * speed_norm, False

@@ -56,7 +56,7 @@ log(f'Log file: {log_path}')
 def make_env(track_id, throttle_min):
     raw = gym.make(track_id)
     env = ThrottleClampWrapper(raw, throttle_min=throttle_min)
-    env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+    env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
     env = SpeedRewardWrapper(env)
     return env

@@ -69,20 +69,28 @@ def test_sim_reward_is_completely_ignored():
 def test_circling_at_zero_cte_gives_near_zero_reward():
     """
-    v5: circling protection is handled by lap-time penalty + StuckTermination,
-    NOT by the reward formula. A circling car at CTE=0 with speed CAN earn
-    reward per step. This test verifies the formula works as designed:
-    reward = speed_norm * cte_quality. Circling is stopped by other mechanisms.
+    v6: circling (low efficiency) should yield zero reward via the efficiency gate.
+    After enough steps of circular motion, the efficiency drops below threshold
+    and the gate zeros the reward.
     """
     env = MockEnv(speed=3.0, cte=0.0)
-    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=20)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.15)
     wrapped.reset()
-    # At CTE=0 and speed=3, expected reward = (3/10) * 1.0 = 0.3
-    _, r, _, _, _ = wrapped.step(0)
-    expected = (3.0 / 10.0) * 1.0
-    assert abs(r - expected) < 0.05, (
-        f"v5: reward at CTE=0, speed=3 should be ~{expected:.2f}, got {r:.4f}")
+    # Drive in a circle for enough steps to fill the position window
+    rewards = []
+    for i in range(40):
+        angle = 2 * math.pi * i / 12 # completes circle every 12 steps
+        env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
+        _, r, _, _, _ = wrapped.step(0)
+        rewards.append(r)
+    # After 20+ steps of circular motion, efficiency gate should kick in
+    # Last few rewards should be 0.0
+    assert rewards[-1] == 0.0, (
+        f"v6: circular driving should yield 0.0 reward via efficiency gate, got {rewards[-1]:.4f}")
+    assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3, (
+        f"v6: most of last 5 rewards during circle should be 0.0, got {rewards[-5:]}")
 def test_forward_driving_earns_positive_reward():
@@ -97,23 +105,29 @@ def test_forward_driving_earns_positive_reward():
 def test_forward_beats_circling_by_large_margin():
     """
-    v5: forward driving at moderate CTE should beat driving with high CTE.
-    The reward directly penalises being off-centre.
+    v6: forward driving earns positive reward; circular driving earns zero.
+    The efficiency gate ensures this gap.
     """
-    # On track (CTE=1m) at speed=5
-    env_on = MockEnv(speed=5.0, cte=1.0)
-    wrapped_on = SpeedRewardWrapper(env_on, speed_scale=0.1)
-    wrapped_on.reset()
-    _, r_on, _, _, _ = wrapped_on.step(0)
+    # Forward driving at CTE=1m, speed=5
+    env_fwd = MockEnv(speed=5.0, cte=1.0)
+    wrapped_fwd = SpeedRewardWrapper(env_fwd, speed_scale=0.1, window_size=30)
+    wrapped_fwd.reset()
+    for i in range(35):
+        env_fwd.set_pos([i * 0.5, 0., 0.]) # straight line
+        _, r_fwd, _, _, _ = wrapped_fwd.step(0)
-    # Off track (CTE=7m) at same speed
-    env_off = MockEnv(speed=5.0, cte=7.0)
-    wrapped_off = SpeedRewardWrapper(env_off, speed_scale=0.1)
-    wrapped_off.reset()
-    _, r_off, _, _, _ = wrapped_off.step(0)
+    # Circular driving at CTE=0, speed=5
+    env_circ = MockEnv(speed=5.0, cte=0.0)
+    wrapped_circ = SpeedRewardWrapper(env_circ, speed_scale=0.1, window_size=30)
+    wrapped_circ.reset()
+    for i in range(35):
+        angle = 2 * math.pi * i / 12
+        env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
+        _, r_circ, _, _, _ = wrapped_circ.step(0)
-    assert r_on > r_off * 3, (
-        f"On-track ({r_on:.2f}) should beat off-track ({r_off:.2f}) by 3x")
+    assert r_fwd > 0, f"Forward driving should earn positive reward, got {r_fwd}"
+    assert r_circ == 0.0, f"Circular driving should earn 0 reward, got {r_circ}"
+    assert r_fwd > r_circ, f"Forward ({r_fwd:.3f}) must beat circling ({r_circ:.3f})"
 def test_crash_gives_negative_reward():