fix: reward v6 — efficiency gate prevents circular driving, stuck_steps 80→40

v5 dropped the efficiency term to get gradient signal on hills, but this re-enabled circular driving (observed in Exp 11). v6 adds efficiency back as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise reward = speed × CTE_quality (same as v5). Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient on hills (all terms → 0 simultaneously). v6's gate passes when efficiency is above threshold (car moving forward, even slowly on hill) and only blocks when car is truly circling. Also reduced stuck_steps from 80 to 40 (~2.5s vs ~5s) — user reported car stuck against barriers for ~10s which is too long with DummyVecEnv.
2026-04-19 12:02:55 -04:00 · 2026-04-19 12:02:55 -04:00 · beb04f3ebe
parent 21addf268e
commit beb04f3ebe
5 changed files with 110 additions and 69 deletions
--- a/agent/experiments/exp11_parallel_envs.py
+++ b/agent/experiments/exp11_parallel_envs.py
@ -36,7 +36,7 @@ def make_env(track_id, port):
    def _init():
        raw = gym.make(track_id, conf={'host': HOST, 'port': port})
        env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
-        env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+        env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
        env = SpeedRewardWrapper(env)
        return env
    return _init
--- a/agent/multitrack_runner.py
+++ b/agent/multitrack_runner.py
@ -177,7 +177,7 @@ class StuckTerminationWrapper(gym.Wrapper):
 def wrap_env(raw_env):
    """Apply standard wrappers: throttle clamp + stuck detection + speed reward."""
    env = ThrottleClampWrapper(raw_env, throttle_min=THROTTLE_MIN)
-    env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+    env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
    env = SpeedRewardWrapper(env, speed_scale=SPEED_SCALE)
    return env
--- a/agent/reward_wrapper.py
+++ b/agent/reward_wrapper.py
@ -1,6 +1,6 @@
 """
-Speed + Progress Reward Wrapper for DonkeyCar RL — v4 (Full Bypass)
+Speed + Progress Reward Wrapper for DonkeyCar RL — v6 (Speed×CTE + Efficiency Gate)
-====================================================================
+=====================================================================================
 REWARD HACKING HISTORY:
  v1 additive:      speed × (1-cte/max_cte)         → boundary oscillation
@ -8,9 +8,15 @@ REWARD HACKING HISTORY:
  v3 path efficiency: original × (1+speed×eff×scale) → still circling!
     WHY v3 failed: efficiency killed the SPEED BONUS but not the BASE reward.
     A spinning car at CTE≈0 still earns 1.0/step × thousands of steps.
-
+  v4: base × eff × (1 + speed_scale × speed)        → zero gradient on hills!
-  v4 (THIS VERSION): Completely bypass sim's reward. Multiply base reward by
+     WHY v4 failed on hills: speed≈0 AND eff≈0 AND cte_quality varies → all
-     efficiency so circling yields ZERO reward regardless of CTE.
+     three terms near zero simultaneously → no gradient to push ANY term up.
  v5: speed × CTE_quality (no efficiency)            → circular driving returns!
     WHY v5 failed: dropped efficiency entirely. Circular driving at CTE≈0
     with speed>0 earns positive reward indefinitely. Observed in Exp 11.
  v6 (THIS VERSION): v5 reward + efficiency GATE.
     Keeps v5's gradient properties (non-zero gradient on hills) but adds
     a binary efficiency check that zeros reward when car is circling.
 ROOT CAUSE OF CIRCLING:
  The sim's own calc_reward() uses `forward_vel` = dot(car_heading, velocity).
@ -18,24 +24,35 @@ ROOT CAUSE OF CIRCLING:
  so forward_vel > 0 always, giving positive reward while circling indefinitely.
  We bypass this entirely.
-FORMULA (v4):
+FORMULA (v6):
-    base     = 1.0 - min(abs(cte) / max_cte, 1.0)    # CTE quality [0,1]
+    cte_quality = 1.0 - min(|cte| / max_cte, 1.0)   # [0,1] centred=1
-    eff      = net_displacement / total_path_length    # Forward progress [0,1]
+    speed_norm  = min(speed / 10.0, 1.0)              # [0,1] normalised
-    shaped   = base × eff × (1 + speed_scale × speed) # All three must be high
+    efficiency  = net_displacement / total_path        # [0,1] straight=1, circle=0
-    On done/crash: shaped = -1.0
+    if efficiency < min_efficiency:
        reward = 0.0     # GATE: circling → zero reward (but not negative)
    else:
        reward = cte_quality × speed_norm    # v5 formula (gradient on hills)
    On done/crash: reward = -1.0
 WHY GATE NOT MULTIPLIER:
    v4 used efficiency as a multiplier: reward = base × eff × speed_bonus.
    On a hill: speed≈0, eff≈0, base≈0.5 → reward≈0 and ∂reward/∂speed≈0.
    No gradient to push speed up — car stays stuck.
    v6 gate: efficiency is either PASS or FAIL. When efficiency > threshold
    (car moving forward at all), reward = speed × CTE_quality. On a hill:
    car is stuck but still has eff > 0 (not literally circling), so the gate
    passes and the reward = speed × CTE_quality. ∂reward/∂speed > 0 → gradient
    pushes toward more throttle. Circle has eff ≈ 0 → gate fails → reward = 0.
 PROPERTIES:
-    - Spinning (eff≈0):           shaped ≈ 0          (no reward)
+    - Circling (eff<threshold): reward = 0  (no incentive to circle)
-    - On track, slow (eff≈1):     shaped ≈ base       (CTE reward only)
+    - On track, stuck (eff>0):  reward = speed × CTE (gradient toward unstuck)
-    - On track, fast (eff≈1):     shaped > base       (CTE + speed bonus)
+    - On track, fast:           reward = high       (speed + centred)
-    - Off track (base≈0):         shaped ≈ 0          (penalty via done)
+    - Off track:                reward ≈ 0          (CTE_quality → 0)
-    - Cannot be gamed:            ALL THREE terms must be high simultaneously
+    - Crash:                    reward = -1.0
 RESEARCH NOTE (2026-04-13):
    v3 was insufficient — circling at start gave 1.0/step × 47k steps = 47k reward.
    v4 makes efficiency a multiplier on the entire reward, not just the speed bonus.
    See docs/RESEARCH_LOG.md for full hacking history.
 """
 import gymnasium as gym
@ -62,8 +79,8 @@ class SpeedRewardWrapper(gym.Wrapper):
        self,
        env,
        speed_scale: float = 0.1,
-        window_size: int = 60,        # increased from 30 — catches slower circles
+        window_size: int = 30,         # captures 2+ full circles at typical circling speed
-        min_efficiency: float = 0.05,
+        min_efficiency: float = 0.15,  # gate threshold: circles ≈ 0.13, wobbly straight ≈ 0.98
        max_cte: float = 8.0,
        min_lap_time: float = 5.0,    # laps faster than this are penalised as exploits
    ):
@ -109,26 +126,36 @@ class SpeedRewardWrapper(gym.Wrapper):
    def _compute_reward_and_done(self, done: bool, info: dict):
        """
-        v5: speed × CTE-quality reward.
+        v6: speed × CTE-quality + efficiency gate.
-        reward = speed × (1 - |cte| / max_cte)
+        reward = speed_norm × cte_quality   (when efficiency >= threshold)
        reward = 0.0                        (when efficiency < threshold — circling)
        reward = -1.0                       (on crash/done)
-        Simpler than v4.  Directly incentivises going FAST while staying
+        The efficiency gate prevents circular driving (eff≈0 for circles)
-        centred.  On a hill: car slows → reward drops → clear gradient
+        without killing gradient on hills (eff>0 for a stuck-but-not-circling
-        signal to apply more throttle.  v4's efficiency term gave zero
+        car, so the gate passes and speed×CTE gradient pushes toward unstuck).
        gradient when the car was stuck (all three terms collapsed to zero
        simultaneously, so no direction to improve).
-        Exploit protection (unchanged):
+        Exploit protection:
-        - Short-lap penalty: laps < min_lap_time → large negative reward
+        - Efficiency gate: circles → reward = 0
-        - StuckTerminationWrapper: done=True after 80 steps of <0.5m movement
+        - Short-lap penalty: laps < min_lap_time → large negative + terminate
        - StuckTerminationWrapper: done=True after stuck_steps of no movement
        - Crash: done=True → -1.0
        """
        # Track position for efficiency calculation
        try:
            pos = info.get('pos', (0.0, 0.0, 0.0))
            pos_x = float(pos[0])
            pos_z = float(pos[2])  # z is forward in Unity coordinate system
            self._pos_history.append(np.array([pos_x, pos_z]))
        except (TypeError, ValueError, IndexError):
            pass
        # Crash / episode over
        if done:
            return -1.0, False
-        # --- Short-lap exploit detection (unchanged) ---
+        # --- Short-lap exploit detection ---
        try:
            current_lap_count = int(info.get('lap_count', 0) or 0)
        except (TypeError, ValueError):
@ -141,13 +168,16 @@ class SpeedRewardWrapper(gym.Wrapper):
            except (TypeError, ValueError):
                lap_time = 999.0
            if lap_time < self.min_lap_time:
                # Short-lap exploit: penalty AND terminate episode immediately.
                # Penalty alone is insufficient — the model stays alive and
                # keeps accumulating small rewards between laps.
                # Termination removes that loophole completely.
                penalty = -10.0 * (self.min_lap_time / max(lap_time, 0.1))
                return penalty, True   # (reward, force_terminate)
-            # Legitimate lap — fall through to normal reward
+
        # --- Efficiency gate: detect circular driving ---
        efficiency = self._compute_efficiency()
        if efficiency < self.min_efficiency:
            # Car is circling — zero reward but don't terminate.
            # Zero (not negative) so there's no perverse incentive to crash
            # early to avoid accumulating penalties.
            return 0.0, False
        # --- CTE quality: how centred is the car? ---
        try:
@ -162,10 +192,7 @@ class SpeedRewardWrapper(gym.Wrapper):
        except (TypeError, ValueError):
            speed = 0.0
-        # --- v5 reward: speed × CTE quality ---
+        # --- v6 reward: speed × CTE quality (same as v5, but gated) ---
        # Fast + centred = high reward.  Slow (hill) = low reward → gradient
        # pushes policy toward higher throttle.  Off-track = near-zero.
        # Normalise speed so max reward ≈ 1.0 at reasonable speed (10 m/s).
        speed_norm = min(speed / 10.0, 1.0)
        return cte_quality * speed_norm, False
--- a/agent/run_eval.py
+++ b/agent/run_eval.py
@ -56,7 +56,7 @@ log(f'Log file: {log_path}')
 def make_env(track_id, throttle_min):
    raw = gym.make(track_id)
    env = ThrottleClampWrapper(raw, throttle_min=throttle_min)
-    env = StuckTerminationWrapper(env, stuck_steps=80, min_displacement=0.5)
+    env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
    env = SpeedRewardWrapper(env)
    return env
--- a/tests/test_reward_wrapper.py
+++ b/tests/test_reward_wrapper.py
@ -69,20 +69,28 @@ def test_sim_reward_is_completely_ignored():
 def test_circling_at_zero_cte_gives_near_zero_reward():
    """
-    v5: circling protection is handled by lap-time penalty + StuckTermination,
+    v6: circling (low efficiency) should yield zero reward via the efficiency gate.
-    NOT by the reward formula.  A circling car at CTE=0 with speed CAN earn
+    After enough steps of circular motion, the efficiency drops below threshold
-    reward per step.  This test verifies the formula works as designed:
+    and the gate zeros the reward.
    reward = speed_norm * cte_quality.  Circling is stopped by other mechanisms.
    """
    env = MockEnv(speed=3.0, cte=0.0)
-    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=20)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.15)
    wrapped.reset()
-    # At CTE=0 and speed=3, expected reward = (3/10) * 1.0 = 0.3
+    # Drive in a circle for enough steps to fill the position window
    rewards = []
    for i in range(40):
        angle = 2 * math.pi * i / 12  # completes circle every 12 steps
        env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
        _, r, _, _, _ = wrapped.step(0)
-    expected = (3.0 / 10.0) * 1.0
+        rewards.append(r)
-    assert abs(r - expected) < 0.05, (
+
-        f"v5: reward at CTE=0, speed=3 should be ~{expected:.2f}, got {r:.4f}")
+    # After 20+ steps of circular motion, efficiency gate should kick in
    # Last few rewards should be 0.0
    assert rewards[-1] == 0.0, (
        f"v6: circular driving should yield 0.0 reward via efficiency gate, got {rewards[-1]:.4f}")
    assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3, (
        f"v6: most of last 5 rewards during circle should be 0.0, got {rewards[-5:]}")
 def test_forward_driving_earns_positive_reward():
@ -97,23 +105,29 @@ def test_forward_driving_earns_positive_reward():
 def test_forward_beats_circling_by_large_margin():
    """
-    v5: forward driving at moderate CTE should beat driving with high CTE.
+    v6: forward driving earns positive reward; circular driving earns zero.
-    The reward directly penalises being off-centre.
+    The efficiency gate ensures this gap.
    """
-    # On track (CTE=1m) at speed=5
+    # Forward driving at CTE=1m, speed=5
-    env_on = MockEnv(speed=5.0, cte=1.0)
+    env_fwd = MockEnv(speed=5.0, cte=1.0)
-    wrapped_on = SpeedRewardWrapper(env_on, speed_scale=0.1)
+    wrapped_fwd = SpeedRewardWrapper(env_fwd, speed_scale=0.1, window_size=30)
-    wrapped_on.reset()
+    wrapped_fwd.reset()
-    _, r_on, _, _, _ = wrapped_on.step(0)
+    for i in range(35):
        env_fwd.set_pos([i * 0.5, 0., 0.])  # straight line
        _, r_fwd, _, _, _ = wrapped_fwd.step(0)
-    # Off track (CTE=7m) at same speed
+    # Circular driving at CTE=0, speed=5
-    env_off = MockEnv(speed=5.0, cte=7.0)
+    env_circ = MockEnv(speed=5.0, cte=0.0)
-    wrapped_off = SpeedRewardWrapper(env_off, speed_scale=0.1)
+    wrapped_circ = SpeedRewardWrapper(env_circ, speed_scale=0.1, window_size=30)
-    wrapped_off.reset()
+    wrapped_circ.reset()
-    _, r_off, _, _, _ = wrapped_off.step(0)
+    for i in range(35):
        angle = 2 * math.pi * i / 12
        env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
        _, r_circ, _, _, _ = wrapped_circ.step(0)
-    assert r_on > r_off * 3, (
+    assert r_fwd > 0, f"Forward driving should earn positive reward, got {r_fwd}"
-        f"On-track ({r_on:.2f}) should beat off-track ({r_off:.2f}) by 3x")
+    assert r_circ == 0.0, f"Circular driving should earn 0 reward, got {r_circ}"
    assert r_fwd > r_circ, f"Forward ({r_fwd:.3f}) must beat circling ({r_circ:.3f})"
 def test_crash_gives_negative_reward():