fix(core): replace exploit bandaids with solid physics barriers + clean reward

Root cause: barriers were zero-thickness MeshCollider planes with no CCD on the
car. The car tunnelled through between frames. Every Python patch was trying to
catch in code what physics should enforce.

Unity (source only — build in progress):
- RoadBuilder.cs: CreateBarrier() now makes BoxCollider-per-segment with real 3D
  volume (barrierThickness=1.0m default) + half-thickness overlap at corners to
  seal gaps. CreateEndCap() seals open ends of non-looping tracks (generated_road).
- Car.cs: rb.collisionDetectionMode = Continuous in Awake() — prevents tunneling.

Python:
- reward_wrapper.py v7: removed CTE-patience termination, high-CTE negative
  reward, solid_hit monitoring, low-speed/wedge detection. Kept: efficiency gate,
  no-progress (active_node) termination, lap exploit guard. Reward = speed×CTE_quality.
- exp23_generated_road_clean.py: single track, no warm-start, 200k steps, clean
  reward, MAX_EPISODE_SECONDS=120 as safety net only.
- tests: 17 tests covering clean reward properties.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paul Huliganga 2026-05-05 15:56:00 -04:00
parent c5c4ca658e
commit 2d52bb4ffc
4 changed files with 560 additions and 884 deletions


@ -12,238 +12,171 @@ If the user says only `continue`, interpret it using the instruction above.
## Current Goal
Stabilize the Unity simulator geometry and collision behavior enough that:
Run a clean, trustworthy exp23 on `generated_road` with:
- Solid BoxCollider barriers (car physically cannot escape)
- Clean reward: speed × CTE_quality + efficiency gate
- No artificial episode caps or Python-side exploit patches
- `generated_road` and `generated_track` both run without bad invisible barrier placement
- barrier contacts terminate episodes appropriately
- RL can restart from a trustworthy simulator build
Get RL training producing genuine improvement again.
## Important Paths
Project:
- `/home/paulh/projects/donkeycar-rl-autoresearch`
Unity source project:
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim`
Unity build output:
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin`
Current runtime simulator folders in use:
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin`
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy`
## Current RL Experiment Files
Unity build log:
- `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
- `agent/experiments/exp21_generated_pair_warm_v4.py`
- `agent/experiments/exp22_generated_pair_warm_v6.py`
## What Was Fixed This Session
Latest model/output folder:
### Root cause identified and fixed
- `agent/models/exp22-generated-pair-warm-v6`
**The car was escaping the track because:**
1. Barriers were zero-thickness `MeshCollider` planes — no physical volume
2. Car Rigidbody had no CCD — default `Discrete` mode allows tunneling
Current training run:
Both problems created a simulator where the car could literally teleport through
barrier walls between physics frames. Every Python-side "fix" (CTE termination,
time caps, hit detection) was attempting in Python what the physics engine was
failing to enforce.
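
The tunnelling described above can be illustrated with a one-dimensional toy model. This is only a sketch, not Unity's actual collision pipeline: the function names, the 10 m/s speed, and the 50 Hz step are assumptions chosen for illustration. It shows that sampling positions once per frame can miss a zero-thickness wall entirely, while a swept test between samples, or a wall with real thickness, catches the crossing.

```python
def crosses_discrete(x_positions, wall_x, thickness):
    """Discrete check: a hit registers only if a sampled position lies inside the wall."""
    lo, hi = wall_x - thickness / 2, wall_x + thickness / 2
    return any(lo <= x <= hi for x in x_positions)


def crosses_swept(x_positions, wall_x, thickness):
    """Swept check: test the whole segment between consecutive samples."""
    lo, hi = wall_x - thickness / 2, wall_x + thickness / 2
    return any(min(a, b) <= hi and max(a, b) >= lo
               for a, b in zip(x_positions, x_positions[1:]))


def simulate(speed, dt, steps, start=0.0):
    """Sampled positions of a car moving at constant speed along x."""
    return [start + speed * dt * i for i in range(steps + 1)]


# Hypothetical numbers: 10 m/s at a 0.02 s step means 0.2 m travelled per frame.
path = simulate(speed=10.0, dt=0.02, steps=10, start=0.05)

thin_discrete = crosses_discrete(path, wall_x=1.0, thickness=0.0)   # plane, per-frame sampling
thin_swept = crosses_swept(path, wall_x=1.0, thickness=0.0)         # plane, swept test
thick_discrete = crosses_discrete(path, wall_x=1.0, thickness=1.0)  # 1.0 m wall, per-frame sampling
```

In this toy setup the 1.0 m wall is five frames deep at the assumed speed, which is why both the thicker collider and the continuous mode help independently.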
- launched `agent/experiments/exp22_generated_pair_warm_v6.py`
- PID file: `agent/models/exp22-generated-pair-warm-v6/current.pid`
- current PID at launch time: `609054`
- log: `agent/models/exp22-generated-pair-warm-v6/run_2026-05-05_141929_strictcte.log`
- startup verified: connected to `localhost:9091` and `localhost:9093`, loaded `generated_road` and `generated_track`, attached warm-start model, reached `Starting training...`
Latest urgent exploit fix:
- User observed generated_road still doing the large outside circle exploit.
- Stopped the previous run immediately.
- Patched `agent/reward_wrapper.py` so high CTE receives negative reward immediately during the patience window instead of falling through to positive speed reward.
- Patched `agent/experiments/exp22_generated_pair_warm_v6.py`:
- `MAX_CTE_TERMINATE = 2.5`
- `CTE_PATIENCE = 3`
- Added regression test `test_high_cte_never_gets_positive_speed_reward_before_termination`.
- Verified `python3 -m pytest -q tests/test_reward_wrapper.py`: `21 passed`.
## What Was Learned
### Training status
The latest meaningful `exp22` run was poor and should not be resumed as-is.
From `agent/models/exp22-generated-pair-warm-v6/run_2026-04-28_2132_openfix.log`:
- best `generated_track` eval reached only about `92` steps
- run was not trustworthy due to ongoing barrier-placement concerns
### Simulator behavior
- Invisible barriers are collider-only by default, so the user cannot see them in the standalone player
- Diagnostic probe showed both tracks could advance from the start before hitting `left_barrier`, so there was no obvious full-width blocker across the road start
- User screenshot suggested the car was getting trapped near the shoulder/edge, consistent with barrier corridor too close to the drivable edge
- User also reported that barrier contact sometimes blocks the car without promptly ending the episode
### Collision semantics
The user does **not** want every barrier brush to terminate the episode.
Desired behavior:
- light brush: can continue
- sustained contact: terminate
- head-on / abrupt stop: terminate quickly
## Code Changes Already Made
### Unity / simulator side
### Unity changes (source updated, build in progress)
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
- Rewrote `CreateBarrier()`: now creates one `BoxCollider` per segment with real
3D volume (`barrierThickness` wide — default 1.0m)
- Segment boxes overlap by `barrierThickness * 0.5` to close corner gaps
- Added `CreateEndCap()`: seals the two open ends of non-looping tracks
(`generated_road` is `closeLoop=0` — without end caps the car can drive off
the ends of the track)
- Added `public float barrierThickness = 1.0f` field (inspector-editable)
- `showBarrierMeshes=true` now shows proper translucent 3D boxes, not flat planes
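
As a rough illustration of the per-segment box layout described above, the sketch below computes one box per polyline segment in Python. The real implementation lives in `RoadBuilder.cs` (C#) and is not shown here, so `barrier_boxes`, the tuple layout, and the choice to extend each end of a segment by half the thickness are all assumptions about how the overlap might be realised.

```python
import math


def barrier_boxes(points, thickness=1.0):
    """Return one (center_x, center_z, length, thickness, yaw_deg) box per segment.

    Assumption: each box is lengthened by thickness * 0.5 at both ends so that
    neighbouring boxes overlap where the polyline bends.
    """
    boxes = []
    overlap = thickness * 0.5
    for (x0, z0), (x1, z1) in zip(points, points[1:]):
        dx, dz = x1 - x0, z1 - z0
        seg_len = math.hypot(dx, dz)
        if seg_len == 0.0:
            continue  # skip duplicate points
        cx, cz = (x0 + x1) / 2, (z0 + z1) / 2
        yaw = math.degrees(math.atan2(dx, dz))  # heading about the vertical axis
        boxes.append((cx, cz, seg_len + 2 * overlap, thickness, yaw))
    return boxes


# Hypothetical L-shaped barrier edge with a 90 degree corner at (0, 10).
boxes = barrier_boxes([(0, 0), (0, 10), (10, 10)], thickness=1.0)
```

Both boxes in this example extend 0.5 m past the shared corner point, so the corner is covered by two volumes rather than left as a seam.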
Implemented structural refactor:
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/Car.cs`
- Added `rb.collisionDetectionMode = CollisionDetectionMode.Continuous;` in
`Awake()` — prevents tunneling even against any remaining thin geometry
- explicit `closeLoop` support
- explicit road-edge generation
- barrier edges derived from left/right road edges instead of guessed centerline offset
- open tracks do not force wraparound
- debug polyline support via gizmos
### Python changes (committed)
Added runtime-visible debug barrier support:
`agent/reward_wrapper.py` → v7 (clean)
- REMOVED: CTE-patience termination, high-CTE negative reward, solid_hit
monitoring, low-speed/wedge detection, all exploit-closing bandaids
- KEPT: efficiency gate (zero reward when circling), no-progress termination
(active_node), lap exploit guard
- Reward: `speed_norm × CTE_quality` when efficiency passes gate
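
The v7 reward rule can be sketched as a pure function. The formula follows the wrapper's docstring (`cte_quality × speed_norm`, zero when efficiency is below the gate, `-1.0` on done); the function names and the handling of very short or stationary histories are assumptions made for this example, not the wrapper's actual code.

```python
import math


def path_efficiency(positions):
    """Net displacement divided by total path length over a window of (x, z) points."""
    if len(positions) < 2:
        return 1.0  # assumption: too little history passes the gate
    total = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if total == 0.0:
        return 1.0  # assumption: a stationary car is left to the no-progress check
    return math.dist(positions[0], positions[-1]) / total


def clean_reward(speed, cte, positions, max_cte=8.0, min_efficiency=0.15, done=False):
    """speed_norm x cte_quality, gated by path efficiency; -1.0 on done."""
    if done:
        return -1.0
    if path_efficiency(positions) < min_efficiency:
        return 0.0  # circling: no incentive
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)
    speed_norm = min(max(speed, 0.0) / 10.0, 1.0)
    return cte_quality * speed_norm


straight = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
circle = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)]
```

A car that returns to its starting point within the window has zero net displacement, so the gate fails regardless of speed or CTE.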
- `showBarrierMeshes`
- `barrierDebugColor`
- barrier objects now include `MeshFilter`
- optional `MeshRenderer` added for visible translucent barriers
`agent/experiments/exp23_generated_road_clean.py`
- Single track: `generated_road` on port 9091
- No warm-start (fresh PPO weights)
- `MAX_EPISODE_SECONDS=120` (generous safety net, not a training constraint)
- LR=0.0003, 200k total steps, checkpoints every 10k
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_road.unity`
`tests/test_reward_wrapper.py` — 17 tests, all pass
- `closeLoop = 0`
- `doAddBarriers = 1`
- `showBarrierMeshes = 1`
- pinned road variation arrays to one entry
- `roadOffsets.Array.data[0] = 2.2`
## Current State
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_track.unity`
### Unity build
- Build launched with PID 37896 on 2026-05-05
- Log: `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
- Check: `grep -q "Exiting batchmode successfully" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log && echo OK`
- `showBarrierMeshes = 1`
- `roadOffsetW = 2.2`
- barriers still enabled
### After build completes
1. Sync to both runtime folders:
```bash
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
'/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
'/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
```
### Python / RL side
2. Launch sims (only port 9091 needed for exp23 — single env):
```powershell
$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' `
-ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
```
`/home/paulh/projects/donkeycar-rl-autoresearch/agent/reward_wrapper.py`
3. Verify port:
```bash
python3 -c "import socket; s=socket.socket(); s.settimeout(3); s.connect(('127.0.0.1',9091)); print('PORT 9091: OK'); s.close()"
```
Latest intent:
4. Visually verify barriers in the sim window:
- `showBarrierMeshes=1` is already set in both scene files
- Translucent box barriers should be visible on BOTH sides of the road
- Verify no gaps at corners
- Verify end-cap walls at start and finish of generated_road
- **Do not start exp23 until Paul confirms barriers look correct**
- do **not** terminate instantly on every barrier hit
- terminate on sustained obstacle contact
- terminate on head-on style stop
5. Launch exp23:
```bash
cd /home/paulh/projects/donkeycar-rl-autoresearch
SAVE_DIR=agent/models/exp23-generated-road-clean
mkdir -p $SAVE_DIR
nohup python3 agent/experiments/exp23_generated_road_clean.py \
> $SAVE_DIR/run_$(date +%Y-%m-%d_%H%M%S)_clean.log 2>&1 &
echo $! > $SAVE_DIR/current.pid
```
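
Once `current.pid` has been written, a small helper along these lines could confirm that the run is still alive. It is a sketch that assumes a Linux/WSL shell and uses only the standard library; `run_is_alive` is a hypothetical name, not part of the project.

```python
import os
from pathlib import Path


def run_is_alive(pid_file):
    """Return True if the PID recorded in pid_file refers to a live process."""
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        return False  # missing or malformed PID file
    try:
        os.kill(pid, 0)  # signal 0 performs an existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

For example, `run_is_alive('agent/models/exp23-generated-road-clean/current.pid')` would be checked before tailing the log.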
Current patch in file:
## Key Parameters (exp23)
- tracks `_solid_hit_steps`
- tracks `_prev_speed`
- classifies solid hits via `hit` containing `barrier`, `wall`, or `tree`
- immediate terminate on abrupt speed collapse while colliding
- terminate after several consecutive solid-hit frames
| Setting | Value | Why |
|---|---|---|
| Track | generated_road | Single track — diagnose before adding second |
| LR | 0.0003 | Standard PPO starting LR |
| Total steps | 200k | More room to learn with clean signal |
| max_episode_seconds | 120s | Safety net only — physics does the work |
| MAX_CTE_TERMINATE | none | Removed — barriers are physical now |
| Warm-start | none | Previous warm-starts trained on broken reward |
| showBarrierMeshes | ON | Verify visually before committing to long run |
This was meant to replace the too-aggressive “any barrier hit = immediate death” logic.
## Success Criteria
## Most Recent Verified Build Status
Unity batch build for the debug-visible barrier version completed successfully.
Evidence:
- build log ended with `Exiting batchmode successfully now!`
- return code `0`
The successful build has now been synced into both `Downloads` runtime folders and both simulators have been relaunched.
Current verified runtime state:
- main folder process owns port `9091`
- main folder also owns private API port `9092`
- copy folder process owns port `9093`
- copy folder also owns private API port `9094`
- Linux socket probe reported `PORT 9091: OK`, `PORT 9092: OK`, `PORT 9093: OK`, and `PORT 9094: OK`
- latest runtime build includes double-sided barrier mesh triangles for visual/debug barrier rendering
Note: the Windows profile uses shared Unity PlayerPrefs/registry values under `HKCU:\Software\DonkeyCar\donkey_sim`. Explicit `--port` args bind the servers correctly, but the in-sim UI can still show the saved PlayerPrefs value. Before launch, set `port_h2088097884`/`portPrivateAPI_h1325370089` to `9091`/`9092`, start the main sim, then set them to `9093`/`9094` and start the copy. Also keep passing explicit `--port 9091` and `--port 9093`.
Latest user visual inspection before double-sided patch:
- `generated_road`: barriers visible on both sides except missing on left side at the very start before the first curve
- `generated_track`: barrier visible only on the right/inside side when driving clockwise; no visible left/outside barrier
Likely diagnosis: barrier mesh was generated as a single-sided vertical plane and the Standard shader culled backfaces, so some debug barrier surfaces existed but were invisible from the road/camera side.
Latest simulator-side patch:
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
- `CreateBarrier(...)` now emits reverse-facing triangles for every barrier quad, making debug barrier meshes visible from both sides
- failed attempt: `Unlit/Transparent` made both tracks' barriers black in the standalone player
- failed attempt: duplicating reverse-facing triangles made `generated_track` barriers black, likely due to coplanar transparent overdraw/z-fighting on the closed/scaled track
- current debug barrier mesh is back to one triangle set per quad; material uses `Standard` transparent mode with forced pale fallback color, alpha blend, culling off, and emission enabled so barriers should stay light/translucent while remaining visible from both sides
- Unity Windows batch build succeeded after this patch
- rebuilt output synced to both runtime folders and relaunched with explicit ports
## Immediate Next Steps
1. Monitor current exp22 training log/checkpoints.
2. Determine:
- are barriers too close to the road edge globally?
- or only wrong at specific bends / first-corner geometry?
3. Fix geometry if needed before restarting RL.
4. Only after geometry is visually verified, restart `exp22` or a successor experiment.
- Car cannot drive past the barrier walls (verify visually)
- ep_len_mean should INCREASE over checkpoints (not frozen at 118)
- eval steps should improve at 20k, 30k, 40k checkpoints
- No evidence of outside-road circling in the reward curve
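
The `ep_len_mean` criterion could be checked from the run log with a short script. This sketch assumes the log contains Stable-Baselines3's tabular output rows of the form `|    ep_len_mean    | 118    |`; the helper names and the "last third versus first third" heuristic are illustrative choices, not an established metric.

```python
import re

# Matches one SB3-style table row for ep_len_mean (assumed log format).
EP_LEN_ROW = re.compile(r"\|\s*ep_len_mean\s*\|\s*([-+0-9.eE]+)\s*\|")


def ep_len_series(log_text):
    """Extract every ep_len_mean value from the log text, in order."""
    return [float(m.group(1)) for m in EP_LEN_ROW.finditer(log_text)]


def is_improving(series, min_gain=1.0):
    """Crude check: mean of the last third exceeds mean of the first third."""
    if len(series) < 3:
        return False
    k = len(series) // 3
    head = sum(series[:k]) / k
    tail = sum(series[-k:]) / k
    return tail - head >= min_gain


sample = (
    "|    ep_len_mean     | 118      |\n"
    "|    ep_len_mean     | 140      |\n"
    "|    ep_len_mean     | 190      |\n"
)
```

A series that stays flat at 118 would fail this check, which matches the "frozen at 118" symptom named above.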
## Useful Commands
### Sync latest build into runtime folders
### Check build log
```bash
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
tail -20 /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log
grep "Exiting batchmode\|Build failed\|error\|Error" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log | tail -5
```
### Launch sims from Windows side
```powershell
$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' -ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
Start-Sleep -Seconds 4
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9093 -Type DWord
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9094 -Type DWord
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy\donkey_sim.exe' -ArgumentList '--port','9093' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy'
### Monitor exp23
```bash
tail -f agent/models/exp23-generated-road-clean/run_*_clean.log
```
### Verify ports
```bash
python3 - <<'PY'
import socket
for p in (9091, 9093):
    s = socket.socket()
    s.settimeout(3)
    try:
        s.connect(('127.0.0.1', p))
        print(f'PORT {p}: OK')
    except Exception as e:
        print(f'PORT {p}: FAIL {e}')
    finally:
        s.close()
for p in (9091,):
    s = socket.socket(); s.settimeout(3)
    try: s.connect(('127.0.0.1', p)); print(f'PORT {p}: OK')
    except Exception as e: print(f'PORT {p}: FAIL {e}')
    finally: s.close()
PY
```
## Notes for Next Session
- If the user says `continue`, do not ask broad questions. Start with the immediate next steps above.
- Prefer direct verification over more RL training.
- Do not restart long training until the user has visually confirmed the debug-visible barriers look correct.
- If the user says `continue`, do not ask broad questions. Check build log → sync → launch → verify barriers → start exp23.
- **Barrier visual confirmation is required before starting exp23.** Paul must see the translucent 3D boxes on both sides of the road with no gaps before committing to a 200k training run.
- The second sim (port 9093) is not needed for exp23 — only launch one sim.
- Do not add generated_track back until generated_road training is verified working.


@ -0,0 +1,236 @@
"""
Exp 23: clean slate on generated_road with solid barriers and a simple reward.
What changed from exp22:
- Single track: generated_road on port 9091 only (diagnose one track first)
- Simulator now uses BoxCollider barriers + CCD on the car Rigidbody.
The car physically cannot escape. No Python-side exploit patches needed.
- Reward wrapper v7: speed × CTE_quality + efficiency gate + no-progress kill.
Removed: CTE-patience termination, solid_hit detection, wedge detection,
MAX_EPISODE_SECONDS hard cap.
- StuckTerminationWrapper: max_episode_seconds raised to 120s (genuine safety
net only; physics handles the actual containment).
- No warm-start: fresh PPO weights. Previous warm-starts were trained under
broken reward/barrier conditions and add more noise than signal.
- Total steps: 200k (more room to learn with clean signal).
"""
import os
import sys
import time
from datetime import datetime
sys.path.insert(0, '/home/paulh/projects/donkeycar-rl-autoresearch/agent')
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
from donkeycar_sb3_runner import ThrottleClampWrapper
from multitrack_runner import StuckTerminationWrapper
from reward_wrapper import SpeedRewardWrapper
HOST = 'localhost'
THROTTLE_MIN = 0.2
LR = 0.0003
TOTAL_STEPS = 200_000
CHECKPOINT_EVERY = 10_000
SAVE_DIR = '/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/exp23-generated-road-clean'
os.makedirs(SAVE_DIR, exist_ok=True)
# Reward wrapper v7 params — clean and minimal
EFFICIENCY_WINDOW = 30
MIN_EFFICIENCY = 0.15
MAX_CTE = 8.0
MIN_LAP_TIME = 12.0
PROGRESS_PATIENCE = 100 # steps without new waypoint → terminate
# StuckTerminationWrapper — generous limit, physics does the real work now
MAX_STUCK_SECONDS = 5.0
MAX_EPISODE_SECONDS = 120.0 # safety net only
def log(msg):
    print(f'[{datetime.now().strftime("%H:%M:%S")}] {msg}', flush=True)

def make_env(track_id, port):
    def _init():
        raw = gym.make(track_id, conf={'host': HOST, 'port': port})
        env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
        env = StuckTerminationWrapper(
            env,
            stuck_steps=40,
            min_displacement=0.5,
            max_stuck_seconds=MAX_STUCK_SECONDS,
            max_episode_seconds=MAX_EPISODE_SECONDS,
        )
        env = SpeedRewardWrapper(
            env,
            window_size=EFFICIENCY_WINDOW,
            min_efficiency=MIN_EFFICIENCY,
            max_cte=MAX_CTE,
            min_lap_time=MIN_LAP_TIME,
            progress_patience=PROGRESS_PATIENCE,
        )
        return env
    return _init

def make_eval_env(track_id, port):
    inner = make_env(track_id, port)()
    return VecTransposeImage(DummyVecEnv([lambda e=inner: e]))
log('=' * 60)
log('Exp 23: generated_road — clean barriers, clean reward')
log(f' Sim: {HOST}:9091 -> generated_road')
log(f' throttle_min={THROTTLE_MIN}, lr={LR}, total={TOTAL_STEPS:,}')
log(f' Reward: v7 (speed×CTE, efficiency gate, no-progress kill)')
log(f' Max stuck: {MAX_STUCK_SECONDS}s, episode cap: {MAX_EPISODE_SECONDS}s (safety net)')
log(f' Progress patience: {PROGRESS_PATIENCE} steps')
log(f' Checkpoints every {CHECKPOINT_EVERY:,} steps')
log('=' * 60)
log('Creating DummyVecEnv on generated_road...')
env = DummyVecEnv([make_env('donkey-generated-roads-v0', 9091)])
env = VecTransposeImage(env)
log(f' VecEnv num_envs={env.num_envs}, obs={env.observation_space.shape}')
model = PPO(
    'CnnPolicy',
    env,
    learning_rate=LR,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
    device='cpu',
)
# Write PID for external monitoring
pid_path = os.path.join(SAVE_DIR, 'current.pid')
with open(pid_path, 'w') as f:
    f.write(str(os.getpid()))
log(f'Fresh PPO model created. Starting training...')
best_total_steps = float('-inf')
best_total_reward = float('-inf')
steps_done = 0
run_tag = datetime.now().strftime('%Y-%m-%d_%H%M%S') + '_clean'
log_path = os.path.join(SAVE_DIR, f'run_{run_tag}.log')
best_model_path = os.path.join(SAVE_DIR, 'best_model.zip')
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)],
)
file_log = logging.getLogger('exp23')

def flog(msg):
    ts = datetime.now().strftime('%H:%M:%S')
    file_log.info(f'[{ts}] {msg}')
flog('=' * 60)
flog(f'Exp 23 started — PID {os.getpid()}')
flog(f'Log: {log_path}')
flog('=' * 60)
while steps_done < TOTAL_STEPS:
    seg_steps = min(CHECKPOINT_EVERY, TOTAL_STEPS - steps_done)
    model.learn(total_timesteps=seg_steps, reset_num_timesteps=False)
    steps_done += seg_steps
    ckpt = os.path.join(SAVE_DIR, f'checkpoint_{steps_done:07d}')
    model.save(ckpt)
    model.save(os.path.join(SAVE_DIR, 'model'))
    flog(f'[{steps_done:,}/{TOTAL_STEPS:,}] Checkpoint saved: {ckpt}.zip')
    # Mid-training eval on generated_road
    try:
        obs = env.reset()
        ep_rewards = np.zeros(env.num_envs)
        ep_steps = np.zeros(env.num_envs)
        done_mask = np.zeros(env.num_envs, dtype=bool)
        for _ in range(2000):
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, dones, infos = env.step(action)
            for i in range(env.num_envs):
                if not done_mask[i]:
                    ep_rewards[i] += rewards[i]
                    ep_steps[i] += 1
                    if dones[i]:
                        done_mask[i] = True
            if done_mask.all():
                break
        total_steps_eval = int(ep_steps.sum())
        total_reward_eval = float(ep_rewards.sum())
        status = '' if ep_steps[0] >= 2000 else f'❌@{int(ep_steps[0])}'
        flog(f' Eval: gen_road={total_reward_eval:.1f}r/{int(ep_steps[0])}s {status}')
        if (total_steps_eval > best_total_steps
                or (total_steps_eval == best_total_steps
                    and total_reward_eval > best_total_reward)):
            best_total_steps = total_steps_eval
            best_total_reward = total_reward_eval
            model.save(best_model_path)
            flog(f' NEW BEST: steps={best_total_steps} reward={best_total_reward:.1f}')
    except Exception as e:
        flog(f' Eval error: {e}')
env.close()
# ── Final evaluation ──────────────────────────────────────────────────────────
flog('=' * 60)
flog('FINAL EVALUATION: best_model on generated_road')
flog('=' * 60)
EVAL_SETS = 3
EVAL_MAX_STEPS = 2000
steps_list = []
reward_list = []
for s in range(1, EVAL_SETS + 1):
    try:
        eval_env = make_eval_env('donkey-generated-roads-v0', 9091)
        eval_model = PPO.load(best_model_path, env=eval_env, device='cpu')
        obs = eval_env.reset()
        done = False
        total_s = 0
        total_r = 0.0
        while not done and total_s < EVAL_MAX_STEPS:
            action, _ = eval_model.predict(obs, deterministic=True)
            result = eval_env.step(action)
            obs, r, done = result[0], result[1], result[2]
            if hasattr(done, '__len__'):
                done = bool(done[0])
            total_r += float(r) if not hasattr(r, '__len__') else float(r[0])
            total_s += 1
        status = '' if total_s >= EVAL_MAX_STEPS else f'❌@{total_s}'
        flog(f' Set {s}: {total_r:.1f}r / {total_s}s {status}')
        steps_list.append(total_s)
        reward_list.append(total_r)
        eval_env.close()
    except Exception as e:
        flog(f' Set {s} error: {e}')
if steps_list:
    flog(f' Mean: {np.mean(steps_list):.0f} steps / {np.mean(reward_list):.1f} reward')
flog('Exp 23 complete.')


@ -1,58 +1,36 @@
"""
Speed + Progress Reward Wrapper for DonkeyCar RL v6 (Speed×CTE + Efficiency Gate)
=====================================================================================
Speed × CTE Reward Wrapper for DonkeyCar RL v7 (Clean)
=========================================================
REWARD HACKING HISTORY:
v1 additive: speed × (1-cte/max_cte) → boundary oscillation
v2 multiplicative: original × (1+speed×scale) → circular driving (on-track)
v3 path efficiency: original × (1+speed×eff×scale) → still circling!
WHY v3 failed: efficiency killed the SPEED BONUS but not the BASE reward.
A spinning car at CTE≈0 still earns 1.0/step × thousands of steps.
v4: base × eff × (1 + speed_scale × speed) → zero gradient on hills!
WHY v4 failed on hills: speed≈0 AND eff≈0 AND cte_quality varies → all
three terms near zero simultaneously → no gradient to push ANY term up.
v5: speed × CTE_quality (no efficiency) → circular driving returns!
WHY v5 failed: dropped efficiency entirely. Circular driving at CTE≈0
with speed>0 earns positive reward indefinitely. Observed in Exp 11.
v6 (THIS VERSION): v5 reward + efficiency GATE.
Keeps v5's gradient properties (non-zero gradient on hills) but adds
a binary efficiency check that zeros reward when car is circling.
The simulator now uses solid BoxCollider barriers with Continuous Collision
Detection on the car Rigidbody. The car physically cannot escape the track.
This removes the need for every Python-side exploit patch that lived here:
ROOT CAUSE OF CIRCLING:
The sim's own calc_reward() uses `forward_vel` = dot(car_heading, velocity).
A spinning car is ALWAYS moving "forward" relative to its own heading,
so forward_vel > 0 always, giving positive reward while circling indefinitely.
We bypass this entirely.
REMOVED (simulator now enforces these physically):
- CTE-patience termination (car can't get far off track anyway)
- High-CTE negative reward patch
- solid_hit / barrier-contact monitoring
- low-speed / wedge detection
FORMULA (v6):
cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1] centred=1
speed_norm = min(speed / 10.0, 1.0) # [0,1] normalised
efficiency = net_displacement / total_path # [0,1] straight=1, circle=0
KEPT (still needed; physics can't detect these):
- Efficiency gate: zero reward when circling
(car on-track but spinning in circles, not advancing)
- No-progress termination: active_node not advancing
(car stuck at waypoint, not completing the course)
- Lap exploit check: super-fast laps are physically impossible but kept
as a sanity guard
FORMULA:
cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1]: centred=1
speed_norm = min(speed / 10.0, 1.0) # [0,1]: normalised
efficiency = net_displacement / total_path # [0,1]: straight=1, circle=0
if efficiency < min_efficiency:
reward = 0.0 # GATE: circling → zero reward (but not negative)
reward = 0.0 # circling — no incentive
else:
reward = cte_quality × speed_norm # v5 formula (gradient on hills)
reward = cte_quality × speed_norm
On done/crash: reward = -1.0
WHY GATE NOT MULTIPLIER:
v4 used efficiency as a multiplier: reward = base × eff × speed_bonus.
On a hill: speed≈0, eff≈0, base≈0.5 → reward≈0 and ∂reward/∂speed≈0.
No gradient to push speed up → car stays stuck.
v6 gate: efficiency is either PASS or FAIL. When efficiency > threshold
(car moving forward at all), reward = speed × CTE_quality. On a hill:
car is stuck but still has eff > 0 (not literally circling), so the gate
passes and the reward = speed × CTE_quality. ∂reward/∂speed > 0 → gradient
pushes toward more throttle. Circle has eff ≈ 0 → gate fails → reward = 0.
PROPERTIES:
- Circling (eff<threshold): reward = 0 (no incentive to circle)
- On track, stuck (eff>0): reward = speed × CTE (gradient toward unstuck)
- On track, fast: reward = high (speed + centred)
- Off track: reward → 0 (CTE_quality → 0)
- Crash: reward = -1.0
"""
import gymnasium as gym
@ -62,92 +40,49 @@ from collections import deque
class SpeedRewardWrapper(gym.Wrapper):
"""
Full reward bypass: speed × CTE_quality, gated by efficiency.
Completely ignores the sim's own reward (which uses forward_vel and is
exploitable by circular/spinning motion).
Exploit termination:
- Sustained high CTE (> max_cte_terminate for cte_patience steps): grass exploit
- No track progress (active_node max not advancing for progress_patience steps):
catches circular driving, stuck-on-cone, stuck-on-barrier.
A circling car stays near the same waypoints, so active_node never advances.
A stuck car never advances either. Forward driving always advances.
Reward = speed × CTE_quality, gated by path efficiency.
Args:
env: gymnasium environment
speed_scale: speed bonus multiplier (default 0.1)
window_size: steps for efficiency gate (default 30)
min_efficiency: efficiency gate threshold (default 0.15)
max_cte: track half-width for reward normalization (default 8.0)
min_lap_time: laps faster than this are penalised as exploits
max_cte_terminate: terminate if CTE > this for cte_patience steps
cte_patience: steps of sustained high CTE before termination
progress_patience: steps without new max active_node before termination
env: gymnasium environment
window_size: steps for efficiency gate history (default 30)
min_efficiency: efficiency threshold; below this, reward = 0 (default 0.15)
max_cte: CTE at which reward reaches 0 (default 8.0)
min_lap_time: laps faster than this are penalised (exploit guard)
progress_patience: steps without new max active_node before termination
"""
def __init__(
self,
env,
speed_scale: float = 0.1,
window_size: int = 30,
min_efficiency: float = 0.15,
max_cte: float = 8.0,
min_lap_time: float = 5.0,
max_cte_terminate: float = 4.0,
cte_patience: int = 20,
progress_patience: int = 60,
efficiency_patience: int = 20, # steps of low efficiency before termination
low_speed_patience: int = 20,
low_speed_threshold: float = 0.2,
low_speed_min_displacement: float = 0.25,
low_speed_grace_steps: int = 20,
):
super().__init__(env)
self.speed_scale = speed_scale
self.window_size = window_size
self.min_efficiency = min_efficiency
self.max_cte = max_cte
self.min_lap_time = min_lap_time
self.max_cte_terminate = max_cte_terminate
self.cte_patience = cte_patience
self.progress_patience = progress_patience
self.efficiency_patience = efficiency_patience
self.low_speed_patience = low_speed_patience
self.low_speed_threshold = low_speed_threshold
self.low_speed_min_displacement = low_speed_min_displacement
self.low_speed_grace_steps = low_speed_grace_steps
self._pos_history = deque(maxlen=window_size + 1)
self._last_lap_count = 0
self._high_cte_steps = 0
self._max_node_seen = -1
self._no_progress_steps = 0
self._low_eff_steps = 0
self._solid_hit_steps = 0
self._prev_speed = 0.0
self._episode_steps = 0
self._low_speed_steps = 0
self._low_speed_anchor = None
def reset(self, **kwargs):
result = self.env.reset(**kwargs)
self._pos_history.clear()
self._last_lap_count = 0
self._high_cte_steps = 0
self._max_node_seen = -1
self._no_progress_steps = 0
self._low_eff_steps = 0
self._solid_hit_steps = 0
self._prev_speed = 0.0
self._episode_steps = 0
self._low_speed_steps = 0
self._low_speed_anchor = None
return result
def step(self, action):
result = self.env.step(action)
# Handle both 4-tuple (old gym) and 5-tuple (gymnasium) APIs
if len(result) == 5:
obs, _sim_reward, terminated, truncated, info = result
done = terminated or truncated
@ -158,159 +93,54 @@ class SpeedRewardWrapper(gym.Wrapper):
else:
raise ValueError(f'Unexpected step() result length: {len(result)}')
# Completely ignore _sim_reward — compute our own
shaped, force_terminate = self._compute_reward_and_done(done, info)
shaped, force_terminate = self._compute_reward(done, info)
if force_terminate:
terminated = True
done = True
if len(result) == 5:
return obs, shaped, terminated, truncated, info
else:
return obs, shaped, done, info
return obs, shaped, done, info
def _compute_reward_and_done(self, done: bool, info: dict):
"""
v6.1: speed × CTE-quality + efficiency gate + grass/rollback terminators.
New termination conditions:
- Sustained high CTE: CTE > max_cte_terminate for cte_patience steps
→ terminate. Stops the grass exploit (car exits track gap and
drives indefinitely on grass with CTE just under max_cte=8.0).
- No track progress: active_node doesn't advance for progress_patience
steps → terminate. Stops mountain rollback (car goes up, rolls
back, IS moving so StuckWrapper doesn't fire, but never advances).
reward = speed_norm × cte_quality (when efficiency >= threshold)
reward = 0.0 (when circling)
reward = -1.0 (on crash/termination)
"""
# Track position for efficiency calculation
current_pos = None
def _compute_reward(self, done: bool, info: dict):
# Record position for efficiency calculation
try:
pos = info.get('pos', (0.0, 0.0, 0.0))
pos_x = float(pos[0])
pos_z = float(pos[2])
current_pos = np.array([pos_x, pos_z])
self._pos_history.append(current_pos)
self._pos_history.append(np.array([float(pos[0]), float(pos[2])]))
except (TypeError, ValueError, IndexError):
pass
self._episode_steps += 1
# Crash / episode over
if done:
return -1.0, False
# --- CTE value for all checks ---
try:
cte = float(info.get('cte', 0.0) or 0.0)
except (TypeError, ValueError):
cte = 0.0
# --- Speed / collision classification ---
try:
speed = max(0.0, float(info.get('speed', 0.0) or 0.0))
except (TypeError, ValueError):
speed = 0.0
try:
hit = str(info.get('hit', 'none') or 'none').lower()
except Exception:
hit = 'none'
solid_hit = (
hit != 'none' and (
'barrier' in hit or
'wall' in hit or
'tree' in hit
)
)
# Allow brief brushes, but terminate on:
# 1. a head-on style stop: car was moving, then collision arrives with
# a large speed drop; or
# 2. sustained obstacle contact over several telemetry frames.
if solid_hit:
head_on_impact = self._prev_speed >= 1.5 and speed <= 0.35
if head_on_impact:
self._prev_speed = speed
return -1.0, True
self._solid_hit_steps += 1
if self._solid_hit_steps >= 4:
self._prev_speed = speed
return -1.0, True
else:
self._solid_hit_steps = 0
# --- Wheels-spinning / barrier wedge termination ---
# CTE can remain deceptively acceptable when the car is pressed against
# a generated-road barrier or invisible collider. If speed stays near
# zero and position does not meaningfully change after the launch grace
# period, kill the episode quickly with a negative reward.
if (
current_pos is not None
and self._episode_steps > self.low_speed_grace_steps
and speed <= self.low_speed_threshold
):
if self._low_speed_anchor is None:
self._low_speed_anchor = current_pos
self._low_speed_steps = 1
else:
moved = float(np.linalg.norm(current_pos - self._low_speed_anchor))
if moved >= self.low_speed_min_displacement:
self._low_speed_anchor = current_pos
self._low_speed_steps = 0
else:
self._low_speed_steps += 1
if self._low_speed_steps >= self.low_speed_patience:
self._prev_speed = speed
return -1.0, True
else:
self._low_speed_steps = 0
self._low_speed_anchor = current_pos
# --- Grass / outside-road exploit: high CTE is bad immediately ---
# Do not let the policy collect positive speed reward while it is
# outside the useful road corridor. Earlier versions only terminated
# after patience frames, but still paid positive reward during those
# frames; PPO learned large fast circles outside generated_road.
if abs(cte) > self.max_cte_terminate:
self._high_cte_steps += 1
if self._high_cte_steps >= self.cte_patience:
self._prev_speed = speed
return -1.0, True # too long off-track — terminate
self._prev_speed = speed
return -0.25, False
else:
self._high_cte_steps = 0
# --- Circle / stuck exploit: no track progress termination ---
# Track the highest active_node (track waypoint) reached this episode.
# A circling car stays near the same waypoints — max_node never advances.
# A stuck car never advances either. Only genuine forward driving advances.
# On lap completion, active_node resets to 0 — we reset our tracker too.
# --- No-progress termination ---
# Terminates episodes where the car isn't advancing along the track
# (circling near the start, stuck against a barrier, etc.).
try:
active_node = int(info.get('active_node', -1) or 0)
total_nodes = int(info.get('total_nodes', 1) or 1)
except (TypeError, ValueError):
active_node = -1
total_nodes = 1
if active_node >= 0:
if active_node > self._max_node_seen:
# New furthest point reached — genuine forward progress
self._max_node_seen = active_node
self._no_progress_steps = 0
else:
self._no_progress_steps += 1
if self._no_progress_steps >= self.progress_patience:
self._prev_speed = speed
return -1.0, True # no forward progress — terminate
return -1.0, True
# --- Lap detection: reset progress tracker + exploit guard ---
try:
current_lap_count = int(info.get('lap_count', 0) or 0)
except (TypeError, ValueError):
@ -318,7 +148,6 @@ class SpeedRewardWrapper(gym.Wrapper):
if current_lap_count > self._last_lap_count:
self._last_lap_count = current_lap_count
# Reset progress tracker — active_node wraps to 0 on new lap
self._max_node_seen = -1
self._no_progress_steps = 0
try:
@ -326,47 +155,22 @@ class SpeedRewardWrapper(gym.Wrapper):
except (TypeError, ValueError):
lap_time = 999.0
if lap_time < self.min_lap_time:
penalty = -10.0 * (self.min_lap_time / max(lap_time, 0.1))
self._prev_speed = speed
return penalty, True
return -10.0 * (self.min_lap_time / max(lap_time, 0.1)), True
# --- Efficiency gate: detect circular driving ---
# Count consecutive steps of low efficiency. After patience steps, terminate.
# Previously this just returned 0 reward (no termination) which let circles
# run for 20+ seconds. Now we terminate after ~20 steps (~0.7s).
efficiency = self._compute_efficiency()
if efficiency < self.min_efficiency:
self._low_eff_steps += 1
if self._low_eff_steps >= self.efficiency_patience:
self._prev_speed = speed
return -1.0, True # circle too long — terminate
self._prev_speed = speed
return 0.0, False # still accumulating — zero reward
else:
self._low_eff_steps = 0
# --- Efficiency gate: zero reward when circling ---
if self._compute_efficiency() < self.min_efficiency:
return 0.0, False
# --- CTE quality ---
# --- Core reward: speed × CTE quality ---
cte_quality = 1.0 - min(abs(cte) / self.max_cte, 1.0)
# --- Speed ---
# --- v6 reward: speed × CTE quality ---
speed_norm = min(speed / 10.0, 1.0)
self._prev_speed = speed
speed_norm = min(speed / 10.0, 1.0)
return cte_quality * speed_norm, False
def _compute_efficiency(self) -> float:
"""Path efficiency = net_displacement / total_path_length."""
if len(self._pos_history) < 3:
return 1.0 # Insufficient history — give benefit of doubt
return 1.0
positions = list(self._pos_history)
net = np.linalg.norm(positions[-1] - positions[0])
total = sum(
np.linalg.norm(positions[i + 1] - positions[i])
for i in range(len(positions) - 1)
)
return float(net / total) if total > 1e-6 else 1.0
def theoretical_max_per_step(self, max_speed: float = 10.0) -> float:
"""Upper bound on reward/step (efficiency=1, CTE=0, max speed)."""
return 1.0 * 1.0 * (1.0 + self.speed_scale * max_speed)
net = float(np.linalg.norm(positions[-1] - positions[0]))
total = float(sum(np.linalg.norm(positions[i+1] - positions[i])
for i in range(len(positions) - 1)))
return net / total if total > 1e-6 else 1.0
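Reviewer sketch (not part of the commit): the efficiency gate depends on the ratio computed in `_compute_efficiency`, so it may help to see that ratio on two synthetic paths. The standalone function `path_efficiency` below is a hypothetical stand-in that uses only the standard library instead of NumPy. The 31-point window and the 12-steps-per-revolution circle are assumptions chosen to match the values used in the tests.

```python
import math
from collections import deque


def path_efficiency(points):
    """Net displacement divided by total path length for (x, z) points.

    Hypothetical stand-in for the wrapper's _compute_efficiency.
    """
    pts = list(points)
    if len(pts) < 3:
        return 1.0  # too little history: give the benefit of the doubt
    net = math.dist(pts[-1], pts[0])
    total = sum(math.dist(pts[i + 1], pts[i]) for i in range(len(pts) - 1))
    return net / total if total > 1e-6 else 1.0


# Straight line: 0.5 m per step along x, window of 31 positions.
straight = deque(((i * 0.5, 0.0) for i in range(31)), maxlen=31)

# Circle of radius 0.5 m, one revolution every 12 steps.
circle = deque(
    (
        (0.5 * math.cos(2 * math.pi * i / 12), 0.5 * math.sin(2 * math.pi * i / 12))
        for i in range(31)
    ),
    maxlen=31,
)

straight_eff = path_efficiency(straight)  # net equals total on a straight path
circle_eff = path_efficiency(circle)      # net is at most one diameter

print(f"straight: {straight_eff:.3f}, circle: {circle_eff:.3f}")
```

With these inputs the straight path scores 1.0 and the circle falls below the `min_efficiency=0.15` threshold used in the tests, which is why the gate zeroes the reward for circling.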


@ -1,30 +1,25 @@
"""
Tests for reward_wrapper.py v4 (full sim bypass base × efficiency × speed).
"""
"""Tests for reward_wrapper.py v7 (clean: speed×CTE + efficiency gate)."""
import sys, os, math, pytest
import numpy as np
import gymnasium as gym
from collections import deque
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent'))
from reward_wrapper import SpeedRewardWrapper
import gymnasium as gym
# ---- Mock Environments ----
class MockEnv(gym.Env):
"""Configurable mock gymnasium.Env."""
metadata = {'render_modes': []}
def __init__(self, speed=2.0, cte=0.0, pos=(0., 0., 0.), done=False, use_5tuple=True):
super().__init__()
self.action_space = gym.spaces.Discrete(5)
self.action_space = gym.spaces.Discrete(5)
self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8)
self._speed = speed
self._cte = cte
self._pos = list(pos)
self._done = done
self._speed = speed
self._cte = cte
self._pos = list(pos)
self._done = done
self._use_5tuple = use_5tuple
def set_pos(self, p): self._pos = list(p)
@ -34,523 +29,231 @@ class MockEnv(gym.Env):
return np.zeros((120, 160, 3), dtype=np.uint8), {}
def step(self, action):
obs = np.zeros((120, 160, 3), dtype=np.uint8)
# Sim reward uses forward_vel (exploitable) — wrapper should IGNORE this
sim_reward = 999.0 # Deliberately bogus — wrapper must not use this
info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos}
obs = np.zeros((120, 160, 3), dtype=np.uint8)
sim_reward = 999.0 # deliberately bogus — wrapper must ignore this
info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos}
if self._use_5tuple:
return obs, sim_reward, self._done, False, info
return obs, sim_reward, self._done, info
def close(self): pass
# ── Helpers ──────────────────────────────────────────────────────────────────
def make_info(cte=0.5, speed=2.0, pos=None, active_node=1, lap_count=0, lap_time=0.0):
return {
'cte': cte, 'speed': speed,
'pos': pos or (0., 0., 0.),
'active_node': active_node, 'total_nodes': 100,
'lap_count': lap_count, 'last_lap_time': lap_time,
}
def step_wrapped(wrapped_env, env, pos, cte=0.5, speed=2.0):
env.set_pos(pos)
env.set_cte(cte)
env._speed = speed
return wrapped_env.step(0)
# ---- Core v4 Properties ----
# ── Core reward properties ────────────────────────────────────────────────────
def test_sim_reward_is_completely_ignored():
"""
The wrapper must NOT use the sim's reward (999.0).
v4 computes reward from scratch using CTE/pos/speed only.
"""
env = MockEnv(speed=2.0, cte=0.5, pos=(0., 0., 0.))
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
wrapped = SpeedRewardWrapper(env)
wrapped.reset()
_, reward, _, _, _ = wrapped.step(0)
assert reward != 999.0, "Wrapper must not pass through sim's bogus reward"
assert reward < 10.0, f"Reward should be small, got {reward}"
assert reward != 999.0
assert reward < 10.0
def test_circling_at_zero_cte_gives_near_zero_reward():
"""
v6: circling (low efficiency) should yield zero reward via the efficiency gate.
After enough steps of circular motion, the efficiency drops below threshold
and the gate zeros the reward.
"""
env = MockEnv(speed=3.0, cte=0.0)
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.15)
def test_crash_gives_negative_one():
env = MockEnv(speed=5.0, cte=0.0, done=True)
wrapped = SpeedRewardWrapper(env)
wrapped.reset()
# Drive in a circle for enough steps to fill the position window
rewards = []
for i in range(40):
angle = 2 * math.pi * i / 12 # completes circle every 12 steps
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
_, r, _, _, _ = wrapped.step(0)
rewards.append(r)
# After 20+ steps of circular motion, efficiency gate should kick in
# Last few rewards should be 0.0
assert rewards[-1] == 0.0, (
f"v6: circular driving should yield 0.0 reward via efficiency gate, got {rewards[-1]:.4f}")
assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3, (
f"v6: most of last 5 rewards during circle should be 0.0, got {rewards[-5:]}")
_, reward, _, _, _ = wrapped.step(0)
assert reward == -1.0
def test_forward_driving_earns_positive_reward():
"""Straight-line driving at low CTE and reasonable speed earns positive reward."""
env = MockEnv(speed=5.0, cte=0.5)
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
wrapped = SpeedRewardWrapper(env, window_size=10)
wrapped.reset()
_, r, _, _, _ = wrapped.step(0)
# reward = (5/10) * (1 - 0.5/8) = 0.5 * 0.9375 = 0.469
assert r > 0.3, f"Forward driving should earn >0.3 reward, got {r:.4f}"
# reward = (5/10) * (1 - 0.5/8.0) = 0.5 * 0.9375 = 0.469
assert r > 0.3, f"Forward driving should earn >0.3, got {r:.4f}"
def test_forward_beats_circling_by_large_margin():
"""
v6: forward driving earns positive reward; circular driving earns zero.
The efficiency gate ensures this gap.
"""
# Forward driving at CTE=1m, speed=5
def test_higher_cte_reduces_reward():
env_low = MockEnv(speed=2.0, cte=0.5)
env_high = MockEnv(speed=2.0, cte=4.0)
w_low = SpeedRewardWrapper(env_low, window_size=5)
w_high = SpeedRewardWrapper(env_high, window_size=5)
w_low.reset(); w_high.reset()
for i in range(10):
env_low.set_pos( [i * 0.3, 0., 0.])
env_high.set_pos([i * 0.3, 0., 0.])
_, r_low, _, _, _ = w_low.step(0)
_, r_high, _, _, _ = w_high.step(0)
assert r_low > r_high
def test_higher_speed_increases_reward():
env_slow = MockEnv(speed=0.5, cte=1.0)
env_fast = MockEnv(speed=3.0, cte=1.0)
w_slow = SpeedRewardWrapper(env_slow, window_size=10)
w_fast = SpeedRewardWrapper(env_fast, window_size=10)
w_slow.reset(); w_fast.reset()
for i in range(15):
env_slow.set_pos([i * 0.1, 0., 0.])
env_fast.set_pos([i * 0.3, 0., 0.])
_, r_slow, _, _, _ = w_slow.step(0)
_, r_fast, _, _, _ = w_fast.step(0)
assert r_fast > r_slow
def test_4tuple_compatibility():
env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False)
env.set_pos([0., 0., 0.])
wrapped = SpeedRewardWrapper(env)
wrapped.reset()
result = wrapped.step(0)
assert len(result) == 4
_, reward, done, info = result
assert isinstance(reward, float)
assert reward != 999.0
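Reviewer sketch (not part of the commit): the core reward tests above check orderings only, so a worked calculation of speed × CTE quality may make the expected magnitudes easier to see. The function name `clean_reward` is hypothetical. `max_cte=8.0` is assumed from the comment in `test_forward_driving_earns_positive_reward`, and the divisor 10.0 is taken from the wrapper's `speed_norm` line.

```python
def clean_reward(speed, cte, max_cte=8.0, max_speed=10.0):
    """Hypothetical re-statement of the core reward: speed_norm * cte_quality."""
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)    # 1 at centre, 0 at max_cte
    speed_norm = min(max(speed, 0.0) / max_speed, 1.0)  # clipped to [0, 1]
    return cte_quality * speed_norm


on_track = clean_reward(5.0, 0.5)    # 0.5 * 0.9375
at_edge = clean_reward(5.0, 8.0)     # cte_quality is 0 at max_cte
saturated = clean_reward(25.0, 0.0)  # speed_norm clips at 1.0

print(on_track, at_edge, saturated)
```

Because both factors are clipped to [0, 1], the per-step reward under this formula stays within [0, 1] before the crash and lap penalties are applied.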
# ── Efficiency gate ───────────────────────────────────────────────────────────
def test_circling_earns_zero_reward():
env = MockEnv(speed=3.0, cte=0.0)
wrapped = SpeedRewardWrapper(env, window_size=30, min_efficiency=0.15)
wrapped.reset()
rewards = []
for i in range(40):
angle = 2 * math.pi * i / 12
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
_, r, _, _, _ = wrapped.step(0)
rewards.append(r)
assert rewards[-1] == 0.0
assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3
def test_forward_beats_circling():
env_fwd = MockEnv(speed=5.0, cte=1.0)
wrapped_fwd = SpeedRewardWrapper(env_fwd, speed_scale=0.1, window_size=30)
wrapped_fwd.reset()
w_fwd = SpeedRewardWrapper(env_fwd, window_size=30)
w_fwd.reset()
for i in range(35):
env_fwd.set_pos([i * 0.5, 0., 0.]) # straight line
_, r_fwd, _, _, _ = wrapped_fwd.step(0)
env_fwd.set_pos([i * 0.5, 0., 0.])
_, r_fwd, _, _, _ = w_fwd.step(0)
# Circular driving at CTE=0, speed=5
env_circ = MockEnv(speed=5.0, cte=0.0)
wrapped_circ = SpeedRewardWrapper(env_circ, speed_scale=0.1, window_size=30)
wrapped_circ.reset()
w_circ = SpeedRewardWrapper(env_circ, window_size=30)
w_circ.reset()
for i in range(35):
angle = 2 * math.pi * i / 12
env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
_, r_circ, _, _, _ = wrapped_circ.step(0)
_, r_circ, _, _, _ = w_circ.step(0)
assert r_fwd > 0, f"Forward driving should earn positive reward, got {r_fwd}"
assert r_circ == 0.0, f"Circular driving should earn 0 reward, got {r_circ}"
assert r_fwd > r_circ, f"Forward ({r_fwd:.3f}) must beat circling ({r_circ:.3f})"
assert r_fwd > 0
assert r_circ == 0.0
def test_crash_gives_negative_reward():
"""Episode termination (done=True) must always give -1.0."""
env = MockEnv(speed=5.0, cte=0.0, done=True)
wrapped = SpeedRewardWrapper(env, speed_scale=0.2)
wrapped.reset()
_, reward, _, _, _ = wrapped.step(0)
assert reward == -1.0, f"Crash reward must be -1.0, got {reward}"
def test_high_cte_reduces_reward():
"""Higher CTE should reduce reward (closer to track edge = lower base)."""
env_low = MockEnv(speed=2.0, cte=0.5)
env_high = MockEnv(speed=2.0, cte=4.0)
wrapped_low = SpeedRewardWrapper(env_low, speed_scale=0.1, window_size=5)
wrapped_high = SpeedRewardWrapper(env_high, speed_scale=0.1, window_size=5)
wrapped_low.reset()
wrapped_high.reset()
# Drive straight so efficiency fills up
for i in range(10):
env_low.set_pos([i * 0.3, 0., 0.])
env_high.set_pos([i * 0.3, 0., 0.])
_, r_low, _, _, _ = wrapped_low.step(0)
_, r_high, _, _, _ = wrapped_high.step(0)
assert r_low > r_high, f"Low CTE ({r_low:.3f}) should reward more than high CTE ({r_high:.3f})"
def test_speed_bonus_increases_reward_when_on_track():
"""Faster forward driving earns more reward than slower forward driving."""
env_slow = MockEnv(speed=0.5, cte=1.0)
env_fast = MockEnv(speed=3.0, cte=1.0)
wrapped_slow = SpeedRewardWrapper(env_slow, speed_scale=0.1, window_size=10)
wrapped_fast = SpeedRewardWrapper(env_fast, speed_scale=0.1, window_size=10)
wrapped_slow.reset()
wrapped_fast.reset()
for i in range(15):
env_slow.set_pos([i * 0.1, 0., 0.])
env_fast.set_pos([i * 0.3, 0., 0.]) # Fast car covers more ground
_, r_slow, _, _, _ = wrapped_slow.step(0)
_, r_fast, _, _, _ = wrapped_fast.step(0)
assert r_fast > r_slow, f"Fast ({r_fast:.3f}) should earn more than slow ({r_slow:.3f})"
def test_theoretical_max_per_step():
"""Max reward/step = 1.0 × 1.0 × (1 + scale × max_speed) = 2.0 at scale=0.1, max=10."""
env = MockEnv()
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
assert wrapped.theoretical_max_per_step(max_speed=10.0) == pytest.approx(2.0, abs=1e-6)
def test_4tuple_step_compatibility():
"""Wrapper must handle 4-tuple step() return (old gym API)."""
env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False)
env.set_pos([0., 0., 0.])
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
wrapped.reset()
result = wrapped.step(0)
assert len(result) == 4, f"Expected 4-tuple, got {len(result)}"
_, reward, done, info = result
assert isinstance(reward, float)
assert reward != 999.0, "Should not use sim reward"
def test_reward_resets_on_episode_reset():
"""After reset, position history clears so efficiency recalculates cleanly."""
def test_history_clears_on_reset():
env = MockEnv(speed=2.0, cte=0.5)
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
wrapped = SpeedRewardWrapper(env, window_size=10)
wrapped.reset()
# Fill with circular data
for i in range(15):
angle = 2 * math.pi * i / 12
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
wrapped.step(0)
# After reset, start fresh straight
wrapped.reset()
rewards = []
for i in range(5):
env.set_pos([i * 0.3, 0., 0.])
_, r, _, _, _ = wrapped.step(0)
rewards.append(r)
# Should get reasonable reward after fresh start
assert rewards[-1] > 0, "Should get positive reward after reset and straight driving"
assert rewards[-1] > 0
# ---------------------------------------------------------------------------
# Short-lap exploit patch tests
# ---------------------------------------------------------------------------
# ── No-progress termination ───────────────────────────────────────────────────
def test_short_lap_triggers_penalty():
"""
A lap completed faster than min_lap_time must return a large penalty,
not a positive reward. This closes the start/finish circle exploit.
"""
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
# Simulate step where a new lap completes in 1 second (exploit)
info = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
'lap_count': 1, 'last_lap_time': 1.0}
reward, _ = wrapper._compute_reward_and_done(done=False, info=info)
assert reward < 0, f'Short lap (1s) should penalise, got reward={reward}'
assert reward <= -10.0, f'Short lap penalty should be large (<= -10), got {reward}'
def test_legitimate_lap_not_penalised():
"""
A lap completed above min_lap_time must NOT trigger the penalty.
"""
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
# First step — no lap yet
info_no_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
'lap_count': 0, 'last_lap_time': 0.0}
wrapper._compute_reward_and_done(done=False, info=info_no_lap)
# Legitimate lap at 12 seconds
info = {'cte': 0.2, 'speed': 3.0, 'pos': (1.0, 0.0, 0.0),
'lap_count': 1, 'last_lap_time': 12.0}
reward, _ = wrapper._compute_reward_and_done(done=False, info=info)
assert reward >= 0, f'Legitimate lap (12s) should not be penalised, got {reward}'
def test_lap_count_not_double_penalised():
"""
Penalty fires exactly once per short lap, not on every subsequent step.
"""
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
# Short lap fires on step where lap_count increments
info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
'lap_count': 1, 'last_lap_time': 1.5}
r1, _ = wrapper._compute_reward_and_done(done=False, info=info_lap)
assert r1 < 0
# Next step same lap_count — should get normal reward, not another penalty
info_next = {'cte': 0.0, 'speed': 3.0, 'pos': (0.1, 0.0, 0.0),
'lap_count': 1, 'last_lap_time': 1.5}
r2, _ = wrapper._compute_reward_and_done(done=False, info=info_next)
assert r2 >= 0, f'Penalty should not repeat on same lap_count, got r2={r2}'
def test_lap_count_resets_on_episode_reset():
"""lap_count tracker must reset when the episode resets."""
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
# Complete a short lap
info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
'lap_count': 1, 'last_lap_time': 1.0}
wrapper._compute_reward_and_done(done=False, info=info_lap)
assert wrapper._last_lap_count == 1
# Reset episode — counter must go back to 0
wrapper.reset()
assert wrapper._last_lap_count == 0
# ---------------------------------------------------------------------------
# v6.1 exploit terminator tests
# ---------------------------------------------------------------------------
def test_sustained_high_cte_terminates_episode():
"""
Grass exploit fix: if CTE exceeds max_cte_terminate for cte_patience
consecutive steps, the episode must be force-terminated with -1.0 reward.
This catches the generated_track gap where car drives indefinitely on grass.
"""
env = MockEnv(speed=3.0, cte=5.0) # CTE=5.0 > max_cte_terminate=4.0
wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5)
wrapper.reset()
rewards = []
terminated = []
for _ in range(10):
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0}
r, force_term = wrapper._compute_reward_and_done(done=False, info=info)
rewards.append(r)
terminated.append(force_term)
# High CTE should be punished immediately, then terminate at step 5
assert rewards[0] < 0, f'High CTE should be negative immediately, got {rewards[0]}'
assert terminated[4] == True, f'Should force-terminate at step 5, got {terminated}'
assert rewards[4] == -1.0, f'Termination reward should be -1.0, got {rewards[4]}'
assert terminated[0] == False, 'Should not terminate at step 1'
def test_high_cte_never_gets_positive_speed_reward_before_termination():
"""
Regression for generated_road outside-circle exploit: while CTE is outside
the allowed corridor, the wrapper must not pay positive speed reward during
the patience window. The policy should receive negative feedback
immediately, then termination.
"""
env = MockEnv(speed=5.0, cte=3.0)
wrapper = SpeedRewardWrapper(env, max_cte_terminate=2.5, cte_patience=3)
wrapper.reset()
rewards = []
terminated = []
for i in range(3):
info = {
'cte': 3.0,
'speed': 5.0,
'pos': (float(i), 0.0, 0.0),
'active_node': i,
'total_nodes': 100,
'lap_count': 0,
'last_lap_time': 0.0,
}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
rewards.append(r)
terminated.append(ft)
assert rewards[:2] == [-0.25, -0.25]
assert rewards[2] == -1.0
assert terminated == [False, False, True]
def test_high_cte_resets_when_back_on_track():
"""
High CTE counter must reset when car returns to track.
Prevents false termination after a brief excursion.
"""
env = MockEnv(speed=3.0, cte=0.5)
wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5)
wrapper.reset()
# 3 steps high CTE
for _ in range(3):
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
assert ft == False, 'Should not terminate after only 3 steps'
# 1 step back on track resets counter
info = {'cte': 1.0, 'speed': 3.0, 'pos': (0., 0., 0.),
'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0}
wrapper._compute_reward_and_done(done=False, info=info)
assert wrapper._high_cte_steps == 0, 'CTE counter should reset when back on track'
# 5 more steps high CTE — should now terminate (counter starts fresh)
for i in range(5):
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
assert ft == True, 'Should terminate after 5 new consecutive high-CTE steps'
def test_no_track_progress_terminates_episode():
"""
Circle/stuck exploit fix: if max active_node doesn't advance for
progress_patience steps, the episode must be force-terminated.
A circling car stays near the same waypoints, so max_node never increases.
"""
def test_no_progress_terminates():
env = MockEnv(speed=3.0, cte=0.5)
wrapper = SpeedRewardWrapper(env, progress_patience=10)
wrapper.reset()
# First step initialises max_node to 5, then 10 more steps stuck at 5 → terminate
for i in range(12):
info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(i)*0.1, 0., 0.),
'active_node': 5, 'total_nodes': 100,
'lap_count': 0, 'last_lap_time': 0.0}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
r, ft = wrapper._compute_reward(False, make_info(active_node=5, pos=(i*0.1, 0., 0.)))
if ft:
break
assert ft == True, 'Should terminate when max active_node not advancing'
assert ft is True
assert r == -1.0
def test_low_speed_no_displacement_terminates_barrier_wedge():
"""
Regression for invisible-barrier wedge: wheels can be commanded but the car
remains nearly motionless with acceptable CTE. This must terminate quickly
instead of returning zero/positive reward indefinitely.
"""
env = MockEnv(speed=0.05, cte=0.5)
wrapper = SpeedRewardWrapper(
env,
low_speed_grace_steps=2,
low_speed_patience=3,
low_speed_threshold=0.2,
low_speed_min_displacement=0.25,
progress_patience=100,
)
wrapper.reset()
terminated = False
reward = None
for _ in range(8):
info = {
'cte': 0.5,
'speed': 0.05,
'pos': (1.0, 0.0, 1.0),
'active_node': 5,
'total_nodes': 100,
'lap_count': 0,
'last_lap_time': 0.0,
}
reward, terminated = wrapper._compute_reward_and_done(done=False, info=info)
if terminated:
break
assert terminated is True
assert reward == -1.0
def test_low_speed_counter_resets_after_meaningful_displacement():
"""Slow starts should not terminate if the car is still changing position."""
env = MockEnv(speed=0.05, cte=0.5)
wrapper = SpeedRewardWrapper(
env,
low_speed_grace_steps=0,
low_speed_patience=3,
low_speed_threshold=0.2,
low_speed_min_displacement=0.25,
progress_patience=100,
)
wrapper.reset()
for i in range(6):
info = {
'cte': 0.5,
'speed': 0.05,
'pos': (float(i) * 0.3, 0.0, 0.0),
'active_node': i,
'total_nodes': 100,
'lap_count': 0,
'last_lap_time': 0.0,
}
reward, terminated = wrapper._compute_reward_and_done(done=False, info=info)
assert terminated is False
def test_track_progress_resets_counter():
"""
Advancing to a new max active_node must reset the no-progress counter.
"""
env = MockEnv(speed=3.0, cte=0.5)
def test_progress_resets_counter():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, progress_patience=5)
wrapper.reset()
# Step forward: nodes 0, 1, 2, 3 — each new node resets counter
for node in range(4):
info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(node)*0.5, 0., 0.),
'active_node': node, 'total_nodes': 100,
'lap_count': 0, 'last_lap_time': 0.0}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
assert ft == False, f'Should not terminate when advancing (node {node})'
assert wrapper._no_progress_steps == 0, 'Counter should reset on new max node'
r, ft = wrapper._compute_reward(False, make_info(active_node=node, pos=(node*0.5, 0., 0.)))
assert ft is False
assert wrapper._no_progress_steps == 0
def test_circle_exploit_terminates():
"""
A car circling near the same spot should be terminated.
active_node oscillates but never exceeds the initial max.
"""
env = MockEnv(speed=3.0, cte=0.5)
def test_circling_active_node_terminates():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, progress_patience=10)
wrapper.reset()
# Set max_node to 10
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
'active_node': 10, 'total_nodes': 100,
'lap_count': 0, 'last_lap_time': 0.0}
wrapper._compute_reward_and_done(done=False, info=info)
# Now oscillate between nodes 8-10 (circling near node 10)
wrapper._compute_reward(False, make_info(active_node=10))
terminated = False
for i in range(20):
node = 8 + (i % 3) # oscillates 8, 9, 10, 8, 9, 10...
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
'active_node': node, 'total_nodes': 100,
'lap_count': 0, 'last_lap_time': 0.0}
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
r, ft = wrapper._compute_reward(False, make_info(active_node=8 + (i % 3)))
if ft:
terminated = True
break
assert terminated, 'Circling (oscillating active_node, no new max) should terminate'
assert terminated
def test_lap_completion_resets_progress_tracker():
"""
On lap completion, active_node resets to 0. Progress tracker must also
reset so the car isn't immediately terminated for 'no progress'.
"""
env = MockEnv(speed=3.0, cte=0.5)
env = MockEnv()
wrapper = SpeedRewardWrapper(env, progress_patience=5, min_lap_time=5.0)
wrapper.reset()
# Drive to near end of track
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
'active_node': 99, 'total_nodes': 100,
'lap_count': 0, 'last_lap_time': 0.0}
wrapper._compute_reward_and_done(done=False, info=info)
wrapper._compute_reward(False, make_info(active_node=99))
assert wrapper._max_node_seen == 99
# Complete a valid lap
info = {'cte': 0.5, 'speed': 3.0, 'pos': (0., 0., 0.),
'active_node': 0, 'total_nodes': 100,
'lap_count': 1, 'last_lap_time': 12.0} # 12s lap = valid
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
# Progress tracker should be reset
assert wrapper._max_node_seen == -1, 'max_node_seen should reset on lap completion'
r, ft = wrapper._compute_reward(False, make_info(active_node=0, lap_count=1, lap_time=12.0))
assert wrapper._max_node_seen == -1
assert wrapper._no_progress_steps == 0
assert ft == False, 'Valid lap should not terminate'
assert ft is False
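Reviewer sketch (not part of the commit): the no-progress tests above exercise a small piece of bookkeeping that is easier to follow in isolation. The class `ProgressTracker` below is a hypothetical extraction of that logic and does not exist in `reward_wrapper.py`. It assumes, as the wrapper does, that a negative `active_node` means the node is unknown and that a new lap resets the tracker.

```python
class ProgressTracker:
    """Hypothetical helper mirroring the wrapper's no-progress bookkeeping."""

    def __init__(self, patience):
        self.patience = patience
        self.max_node_seen = -1
        self.no_progress_steps = 0

    def update(self, active_node):
        """Return True when the episode should be terminated."""
        if active_node < 0:
            return False  # node unknown: skip the check
        if active_node > self.max_node_seen:
            # New furthest waypoint reached: genuine forward progress.
            self.max_node_seen = active_node
            self.no_progress_steps = 0
            return False
        self.no_progress_steps += 1
        return self.no_progress_steps >= self.patience

    def on_new_lap(self):
        """active_node wraps to 0 on a new lap, so start tracking afresh."""
        self.max_node_seen = -1
        self.no_progress_steps = 0


tracker = ProgressTracker(patience=10)
tracker.update(10)  # furthest node so far is 10

fired_at = None
for i in range(20):
    # Oscillate 8, 9, 10, 8, 9, 10, ... without ever exceeding node 10.
    if tracker.update(8 + (i % 3)):
        fired_at = i
        break

print("terminated at loop index", fired_at)
```

Revisiting node 10 does not count as progress because the comparison is strict, which is what lets the check catch a car oscillating around the same few waypoints.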
# ── Lap exploit guard ─────────────────────────────────────────────────────────
def test_short_lap_penalised():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
r, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.0))
assert r < 0
assert r <= -10.0
def test_legitimate_lap_not_penalised():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
wrapper._compute_reward(False, make_info(lap_count=0))
r, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=12.0, pos=(1., 0., 0.)))
assert r >= 0
def test_lap_penalty_fires_once():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
r1, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5))
assert r1 < 0
r2, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5, pos=(0.1, 0., 0.)))
assert r2 >= 0
def test_lap_count_resets_on_episode_reset():
env = MockEnv()
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
wrapper.reset()
wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.0))
assert wrapper._last_lap_count == 1
wrapper.reset()
assert wrapper._last_lap_count == 0
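Reviewer sketch (not part of the commit): `test_short_lap_penalised` asserts `r <= -10.0`, and the reason that bound always holds follows from the penalty formula in the wrapper. The function `short_lap_penalty` is a hypothetical restatement of that formula. Returning `None` for a legitimate lap is an assumption made for illustration only, since the wrapper simply continues to the normal reward in that case.

```python
def short_lap_penalty(lap_time, min_lap_time=5.0):
    """Hypothetical restatement of the lap exploit guard's penalty.

    Returns None when the lap is slow enough to be considered legitimate.
    """
    if lap_time >= min_lap_time:
        return None
    # The ratio exceeds 1 for any short lap, so the penalty is below -10.
    # The 0.1 s floor keeps the division bounded for near-zero lap times.
    return -10.0 * (min_lap_time / max(lap_time, 0.1))


one_second_lap = short_lap_penalty(1.0)  # -10 * (5 / 1)
near_limit_lap = short_lap_penalty(4.9)  # just under the threshold
legit_lap = short_lap_penalty(12.0)      # no penalty

print(one_second_lap, near_limit_lap, legit_lap)
```

The penalty scales inversely with lap time, so the faster the bogus lap, the larger the penalty, bounded at -500 for `min_lap_time=5.0` by the 0.1 s floor.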