diff --git a/agent/SESSION_HANDOFF.md b/agent/SESSION_HANDOFF.md
index 5c1de5e..0f72043 100644
--- a/agent/SESSION_HANDOFF.md
+++ b/agent/SESSION_HANDOFF.md
@@ -12,238 +12,171 @@ If the user says only `continue`, interpret it using the instruction above.
 ## Current Goal
-Stabilize the Unity simulator geometry and collision behavior enough that:
+Run a clean, trustworthy exp23 on `generated_road` with:
+- Solid BoxCollider barriers (car physically cannot escape)
+- Clean reward: speed × CTE_quality + efficiency gate
+- No artificial episode caps or Python-side exploit patches
-- `generated_road` and `generated_track` both run without bad invisible barrier placement
-- barrier contacts terminate episodes appropriately
-- RL can restart from a trustworthy simulator build
+Get RL training producing genuine improvement again.
 ## Important Paths
 Project:
-
 - `/home/paulh/projects/donkeycar-rl-autoresearch`
 Unity source project:
-
 - `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim`
 Unity build output:
-
 - `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin`
 Current runtime simulator folders in use:
-
 - `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin`
 - `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy`
-## Current RL Experiment Files
+Unity build log:
+- `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
-- `agent/experiments/exp21_generated_pair_warm_v4.py`
-- `agent/experiments/exp22_generated_pair_warm_v6.py`
+## What Was Fixed This Session
-Latest model/output folder:
+### Root cause identified and fixed
-- `agent/models/exp22-generated-pair-warm-v6`
+**The car was escaping the track because:**
+1. Barriers were zero-thickness `MeshCollider` planes — no physical volume
+2. Car Rigidbody had no CCD — default `Discrete` mode allows tunneling
-Current training run:
+Both problems created a simulator where the car could literally teleport through
+barrier walls between physics frames.
Every Python-side "fix" (CTE termination,
+time caps, hit detection) was attempting in Python what the physics engine was
+failing to enforce.
-- launched `agent/experiments/exp22_generated_pair_warm_v6.py`
-- PID file: `agent/models/exp22-generated-pair-warm-v6/current.pid`
-- current PID at launch time: `609054`
-- log: `agent/models/exp22-generated-pair-warm-v6/run_2026-05-05_141929_strictcte.log`
-- startup verified: connected to `localhost:9091` and `localhost:9093`, loaded `generated_road` and `generated_track`, attached warm-start model, reached `Starting training...`
-
-Latest urgent exploit fix:
-
-- User observed generated_road still doing the large outside circle exploit.
-- Stopped the previous run immediately.
-- Patched `agent/reward_wrapper.py` so high CTE receives negative reward immediately during the patience window instead of falling through to positive speed reward.
-- Patched `agent/experiments/exp22_generated_pair_warm_v6.py`:
-  - `MAX_CTE_TERMINATE = 2.5`
-  - `CTE_PATIENCE = 3`
-- Added regression test `test_high_cte_never_gets_positive_speed_reward_before_termination`.
-- Verified `python3 -m pytest -q tests/test_reward_wrapper.py`: `21 passed`.
-
-## What Was Learned
-
-### Training status
-
-The latest meaningful `exp22` run was poor and should not be resumed as-is.
-
-From `agent/models/exp22-generated-pair-warm-v6/run_2026-04-28_2132_openfix.log`:
-
-- best `generated_track` eval reached only about `92` steps
-- run was not trustworthy due to ongoing barrier-placement concerns
-
-### Simulator behavior
-
-- Invisible barriers are collider-only by default, so the user cannot see them in the standalone player
-- Diagnostic probe showed both tracks could advance from the start before hitting `left_barrier`, so there was no obvious full-width blocker across the road start
-- User screenshot suggested the car was getting trapped near the shoulder/edge, consistent with barrier corridor too close to the drivable edge
-- User also reported that barrier contact sometimes blocks the car without promptly ending the episode
-
-### Collision semantics
-
-The user does **not** want every barrier brush to terminate the episode.
-
-Desired behavior:
-
-- light brush: can continue
-- sustained contact: terminate
-- head-on / abrupt stop: terminate quickly
-
-## Code Changes Already Made
-
-### Unity / simulator side
+### Unity changes (source updated, build in progress)
 `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
+- Rewrote `CreateBarrier()`: now creates one `BoxCollider` per segment with real
+  3D volume (`barrierThickness` wide — default 1.0m)
+- Segment boxes overlap by `barrierThickness * 0.5` to close corner gaps
+- Added `CreateEndCap()`: seals the two open ends of non-looping tracks
+  (`generated_road` is `closeLoop=0` — without end caps the car can drive off
+  the ends of the track)
+- Added `public float barrierThickness = 1.0f` field (inspector-editable)
+- `showBarrierMeshes=true` now shows proper translucent 3D boxes, not flat planes
-Implemented structural refactor:
+`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/Car.cs`
+- Added `rb.collisionDetectionMode = CollisionDetectionMode.Continuous;` in
+  `Awake()` — prevents tunneling even against any remaining thin
geometry
-- explicit `closeLoop` support
-- explicit road-edge generation
-- barrier edges derived from left/right road edges instead of guessed centerline offset
-- open tracks do not force wraparound
-- debug polyline support via gizmos
+### Python changes (committed)
-Added runtime-visible debug barrier support:
+`agent/reward_wrapper.py` → v7 (clean)
+- REMOVED: CTE-patience termination, high-CTE negative reward, solid_hit
+  monitoring, low-speed/wedge detection, all exploit-closing bandaids
+- KEPT: efficiency gate (zero reward when circling), no-progress termination
+  (active_node), lap exploit guard
+- Reward: `speed_norm × CTE_quality` when efficiency passes gate
-- `showBarrierMeshes`
-- `barrierDebugColor`
-- barrier objects now include `MeshFilter`
-- optional `MeshRenderer` added for visible translucent barriers
+`agent/experiments/exp23_generated_road_clean.py`
+- Single track: `generated_road` on port 9091
+- No warm-start (fresh PPO weights)
+- `MAX_EPISODE_SECONDS=120` (generous safety net, not a training constraint)
+- LR=0.0003, 200k total steps, checkpoints every 10k
-`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_road.unity`
+`tests/test_reward_wrapper.py` — 17 tests, all pass
-- `closeLoop = 0`
-- `doAddBarriers = 1`
-- `showBarrierMeshes = 1`
-- pinned road variation arrays to one entry
-- `roadOffsets.Array.data[0] = 2.2`
+## Current State
-`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_track.unity`
+### Unity build
+- Build launched with PID 37896 on 2026-05-05
+- Log: `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
+- Check: `grep -q "Exiting batchmode successfully" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log && echo OK`
-- `showBarrierMeshes = 1`
-- `roadOffsetW = 2.2`
-- barriers still enabled
+### After build completes
+1. Sync to both runtime folders:
+```bash
+rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
+      '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
+rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
+      '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
+```
-### Python / RL side
+2. Launch sims (only port 9091 needed for exp23 — single env):
+```powershell
+$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
+Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
+Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
+Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' `
+    -ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
+```
-`/home/paulh/projects/donkeycar-rl-autoresearch/agent/reward_wrapper.py`
+3. Verify port:
+```bash
+python3 -c "import socket; s=socket.socket(); s.settimeout(3); s.connect(('127.0.0.1',9091)); print('PORT 9091: OK'); s.close()"
+```
-Latest intent:
+4. Visually verify barriers in the sim window:
+   - `showBarrierMeshes=1` is already set in both scene files
+   - Translucent box barriers should be visible on BOTH sides of the road
+   - Verify no gaps at corners
+   - Verify end-cap walls at start and finish of generated_road
+   - **Do not start exp23 until Paul confirms barriers look correct**
-- do **not** terminate instantly on every barrier hit
-- terminate on sustained obstacle contact
-- terminate on head-on style stop
+5. Launch exp23:
+```bash
+cd /home/paulh/projects/donkeycar-rl-autoresearch
+SAVE_DIR=agent/models/exp23-generated-road-clean
+mkdir -p $SAVE_DIR
+nohup python3 agent/experiments/exp23_generated_road_clean.py \
+    > $SAVE_DIR/run_$(date +%Y-%m-%d_%H%M%S)_clean.log 2>&1 &
+echo $! > $SAVE_DIR/current.pid
+```
-Current patch in file:
+## Key Parameters (exp23)
-- tracks `_solid_hit_steps`
-- tracks `_prev_speed`
-- classifies solid hits via `hit` containing `barrier`, `wall`, or `tree`
-- immediate terminate on abrupt speed collapse while colliding
-- terminate after several consecutive solid-hit frames
+| Setting | Value | Why |
+|---|---|---|
+| Track | generated_road | Single track — diagnose before adding second |
+| LR | 0.0003 | Standard PPO starting LR |
+| Total steps | 200k | More room to learn with clean signal |
+| max_episode_seconds | 120s | Safety net only — physics does the work |
+| MAX_CTE_TERMINATE | none | Removed — barriers are physical now |
+| Warm-start | none | Previous warm-starts trained on broken reward |
+| showBarrierMeshes | ON | Verify visually before committing to long run |
-This was meant to replace the too-aggressive “any barrier hit = immediate death” logic.
+## Success Criteria
-## Most Recent Verified Build Status
-
-Unity batch build for the debug-visible barrier version completed successfully.
-
-Evidence:
-
-- build log ended with `Exiting batchmode successfully now!`
-- return code `0`
-
-The successful build has now been synced into both `Downloads` runtime folders and both simulators have been relaunched.
-
-Current verified runtime state:
-
-- main folder process owns port `9091`
-- main folder also owns private API port `9092`
-- copy folder process owns port `9093`
-- copy folder also owns private API port `9094`
-- Linux socket probe reported `PORT 9091: OK`, `PORT 9092: OK`, `PORT 9093: OK`, and `PORT 9094: OK`
-- latest runtime build includes double-sided barrier mesh triangles for visual/debug barrier rendering
-
-Note: the Windows profile uses shared Unity PlayerPrefs/registry values under `HKCU:\Software\DonkeyCar\donkey_sim`. Explicit `--port` args bind the servers correctly, but the in-sim UI can still show the saved PlayerPrefs value.
Before launch, set `port_h2088097884`/`portPrivateAPI_h1325370089` to `9091`/`9092`, start the main sim, then set them to `9093`/`9094` and start the copy. Also keep passing explicit `--port 9091` and `--port 9093`.
-
-Latest user visual inspection before double-sided patch:
-
-- `generated_road`: barriers visible on both sides except missing on left side at the very start before the first curve
-- `generated_track`: barrier visible only on the right/inside side when driving clockwise; no visible left/outside barrier
-
-Likely diagnosis: barrier mesh was generated as a single-sided vertical plane and the Standard shader culled backfaces, so some debug barrier surfaces existed but were invisible from the road/camera side.
-
-Latest simulator-side patch:
-
-- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
-- `CreateBarrier(...)` now emits reverse-facing triangles for every barrier quad, making debug barrier meshes visible from both sides
-- failed attempt: `Unlit/Transparent` made both tracks' barriers black in the standalone player
-- failed attempt: duplicating reverse-facing triangles made `generated_track` barriers black, likely due coplanar transparent overdraw/z-fighting on the closed/scaled track
-- current debug barrier mesh is back to one triangle set per quad; material uses `Standard` transparent mode with forced pale fallback color, alpha blend, culling off, and emission enabled so barriers should stay light/translucent while remaining visible from both sides
-- Unity Windows batch build succeeded after this patch
-- rebuilt output synced to both runtime folders and relaunched with explicit ports
-
-## Immediate Next Steps
-
-1. Monitor current exp22 training log/checkpoints.
-
-2. Determine:
-   - are barriers too close to the road edge globally?
-   - or only wrong at specific bends / first-corner geometry?
-
-3. Fix geometry if needed before restarting RL.
-
-4. Only after geometry is visually verified, restart `exp22` or a successor experiment.
+- Car cannot drive past the barrier walls (verify visually)
+- ep_len_mean should INCREASE over checkpoints (not frozen at 118)
+- eval steps should improve at 20k, 30k, 40k checkpoints
+- No evidence of outside-road circling in the reward curve
 ## Useful Commands
-### Sync latest build into runtime folders
-
+### Check build log
 ```bash
-rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
-rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
+tail -20 /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log
+grep "Exiting batchmode\|Build failed\|error\|Error" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log | tail -5
 ```
-### Launch sims from Windows side
-
-```powershell
-$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
-
-Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
-Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
-Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' -ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
-
-Start-Sleep -Seconds 4
-
-Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9093 -Type DWord
-Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9094 -Type DWord
-Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy\donkey_sim.exe' -ArgumentList '--port','9093' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy'
+### Monitor exp23
+```bash
+tail -f agent/models/exp23-generated-road-clean/run_*_clean.log
 ```
 ### Verify ports
-
 ```bash
 python3 - <<'PY'
 import socket
-for p in (9091, 9093):
-    s = socket.socket()
-    s.settimeout(3)
-    try:
-        s.connect(('127.0.0.1', p))
-        print(f'PORT {p}: OK')
-    except Exception as e:
-        print(f'PORT {p}: FAIL {e}')
-    finally:
-        s.close()
+for p in (9091,):
+    s = socket.socket(); s.settimeout(3)
+    try: s.connect(('127.0.0.1', p)); print(f'PORT {p}: OK')
+    except Exception as e: print(f'PORT {p}: FAIL {e}')
+    finally: s.close()
 PY
 ```
 ## Notes for Next Session
-- If the user says `continue`, do not ask broad questions. Start with the immediate next steps above.
-- Prefer direct verification over more RL training.
-- Do not restart long training until the user has visually confirmed the debug-visible barriers look correct.
+- If the user says `continue`, do not ask broad questions. Check build log → sync → launch → verify barriers → start exp23.
+- **Barrier visual confirmation is required before starting exp23.** Paul must see the translucent 3D boxes on both sides of the road with no gaps before committing to a 200k training run.
+- The second sim (port 9093) is not needed for exp23 — only launch one sim.
+- Do not add generated_track back until generated_road training is verified working.
diff --git a/agent/experiments/exp23_generated_road_clean.py b/agent/experiments/exp23_generated_road_clean.py
new file mode 100644
index 0000000..a2f70c8
--- /dev/null
+++ b/agent/experiments/exp23_generated_road_clean.py
@@ -0,0 +1,236 @@
+"""
+Exp 23: Clean slate — generated_road, solid barriers, simple reward.
+
+What changed from exp22:
+  - Single track: generated_road on port 9091 only (diagnose one track first)
+  - Simulator now uses BoxCollider barriers + CCD on the car Rigidbody.
+    The car physically cannot escape. No Python-side exploit patches needed.
+  - Reward wrapper v7: speed × CTE_quality + efficiency gate + no-progress kill.
+    Removed: CTE-patience termination, solid_hit detection, wedge detection,
+    MAX_EPISODE_SECONDS hard cap.
+  - StuckTerminationWrapper: max_episode_seconds raised to 120s (genuine safety
+    net only — physics handles the actual containment).
+  - No warm-start: fresh PPO weights. Previous warm-starts were trained under
+    broken reward/barrier conditions and add more noise than signal.
+  - Total steps: 200k (more room to learn with clean signal).
+"""
+import os
+import sys
+import time
+from datetime import datetime
+
+sys.path.insert(0, '/home/paulh/projects/donkeycar-rl-autoresearch/agent')
+
+import gymnasium as gym
+import numpy as np
+from stable_baselines3 import PPO
+from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
+
+from donkeycar_sb3_runner import ThrottleClampWrapper
+from multitrack_runner import StuckTerminationWrapper
+from reward_wrapper import SpeedRewardWrapper
+
+
+HOST = 'localhost'
+THROTTLE_MIN = 0.2
+LR = 0.0003
+TOTAL_STEPS = 200_000
+CHECKPOINT_EVERY = 10_000
+SAVE_DIR = '/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/exp23-generated-road-clean'
+os.makedirs(SAVE_DIR, exist_ok=True)
+
+# Reward wrapper v7 params — clean and minimal
+EFFICIENCY_WINDOW = 30
+MIN_EFFICIENCY = 0.15
+MAX_CTE = 8.0
+MIN_LAP_TIME = 12.0
+PROGRESS_PATIENCE = 100  # steps without new waypoint → terminate
+
+# StuckTerminationWrapper — generous limit, physics does the real work now
+MAX_STUCK_SECONDS = 5.0
+MAX_EPISODE_SECONDS = 120.0  # safety net only
+
+
+def log(msg):
+    print(f'[{datetime.now().strftime("%H:%M:%S")}] {msg}', flush=True)
+
+
+def make_env(track_id, port):
+    def _init():
+        raw = gym.make(track_id, conf={'host': HOST, 'port': port})
+        env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
+        env = StuckTerminationWrapper(
+            env,
+            stuck_steps=40,
+            min_displacement=0.5,
+            max_stuck_seconds=MAX_STUCK_SECONDS,
+            max_episode_seconds=MAX_EPISODE_SECONDS,
+        )
+        env = SpeedRewardWrapper(
+            env,
+            window_size=EFFICIENCY_WINDOW,
+            min_efficiency=MIN_EFFICIENCY,
+            max_cte=MAX_CTE,
+            min_lap_time=MIN_LAP_TIME,
+            progress_patience=PROGRESS_PATIENCE,
+        )
+        return env
+    return _init
+
+
+def make_eval_env(track_id, port):
+    inner = make_env(track_id, port)()
+    return VecTransposeImage(DummyVecEnv([lambda e=inner: e]))
+
+
+log('=' * 60)
+log('Exp 23: generated_road — clean barriers, clean reward')
+log(f' Sim: {HOST}:9091 -> generated_road')
+log(f' throttle_min={THROTTLE_MIN}, lr={LR}, total={TOTAL_STEPS:,}')
+log(f' Reward: v7 (speed×CTE, efficiency gate, no-progress kill)')
+log(f' Max stuck: {MAX_STUCK_SECONDS}s, episode cap: {MAX_EPISODE_SECONDS}s (safety net)')
+log(f' Progress patience: {PROGRESS_PATIENCE} steps')
+log(f' Checkpoints every {CHECKPOINT_EVERY:,} steps')
+log('=' * 60)
+
+log('Creating DummyVecEnv on generated_road...')
+env = DummyVecEnv([make_env('donkey-generated-roads-v0', 9091)])
+env = VecTransposeImage(env)
+log(f' VecEnv num_envs={env.num_envs}, obs={env.observation_space.shape}')
+
+model = PPO(
+    'CnnPolicy',
+    env,
+    learning_rate=LR,
+    n_steps=2048,
+    batch_size=64,
+    n_epochs=10,
+    gamma=0.99,
+    gae_lambda=0.95,
+    clip_range=0.2,
+    ent_coef=0.01,
+    verbose=1,
+    device='cpu',
+)
+
+# Write PID for external monitoring
+pid_path = os.path.join(SAVE_DIR, 'current.pid')
+with open(pid_path, 'w') as f:
+    f.write(str(os.getpid()))
+
+log(f'Fresh PPO model created. Starting training...')
+
+best_total_steps = float('-inf')
+best_total_reward = float('-inf')
+steps_done = 0
+run_tag = datetime.now().strftime('%Y-%m-%d_%H%M%S') + '_clean'
+log_path = os.path.join(SAVE_DIR, f'run_{run_tag}.log')
+best_model_path = os.path.join(SAVE_DIR, 'best_model.zip')
+
+import logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(message)s',
+    handlers=[logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)],
+)
+file_log = logging.getLogger('exp23')
+
+def flog(msg):
+    ts = datetime.now().strftime('%H:%M:%S')
+    file_log.info(f'[{ts}] {msg}')
+
+flog('=' * 60)
+flog(f'Exp 23 started — PID {os.getpid()}')
+flog(f'Log: {log_path}')
+flog('=' * 60)
+
+while steps_done < TOTAL_STEPS:
+    seg_steps = min(CHECKPOINT_EVERY, TOTAL_STEPS - steps_done)
+    model.learn(total_timesteps=seg_steps, reset_num_timesteps=False)
+    steps_done += seg_steps
+
+    ckpt = os.path.join(SAVE_DIR, f'checkpoint_{steps_done:07d}')
+    model.save(ckpt)
+    model.save(os.path.join(SAVE_DIR, 'model'))
+    flog(f'[{steps_done:,}/{TOTAL_STEPS:,}] Checkpoint saved: {ckpt}.zip')
+
+    # Mid-training eval on generated_road
+    try:
+        obs = env.reset()
+        ep_rewards = np.zeros(env.num_envs)
+        ep_steps = np.zeros(env.num_envs)
+        done_mask = np.zeros(env.num_envs, dtype=bool)
+
+        for _ in range(2000):
+            action, _ = model.predict(obs, deterministic=True)
+            obs, rewards, dones, infos = env.step(action)
+            for i in range(env.num_envs):
+                if not done_mask[i]:
+                    ep_rewards[i] += rewards[i]
+                    ep_steps[i] += 1
+                    if dones[i]:
+                        done_mask[i] = True
+            if done_mask.all():
+                break
+
+        total_steps_eval = int(ep_steps.sum())
+        total_reward_eval = float(ep_rewards.sum())
+
+        status = '✅' if ep_steps[0] >= 2000 else f'❌@{int(ep_steps[0])}'
+        flog(f' Eval: gen_road={total_reward_eval:.1f}r/{int(ep_steps[0])}s {status}')
+
+        if (total_steps_eval > best_total_steps
+                or (total_steps_eval == best_total_steps
+                    and total_reward_eval > best_total_reward)):
+            best_total_steps = total_steps_eval
+            best_total_reward = total_reward_eval
+            model.save(best_model_path)
+            flog(f' NEW BEST: steps={best_total_steps} reward={best_total_reward:.1f}')
+
+    except Exception as e:
+        flog(f' Eval error: {e}')
+
+env.close()
+
+# ── Final evaluation ──────────────────────────────────────────────────────────
+flog('=' * 60)
+flog('FINAL EVALUATION: best_model on generated_road')
+flog('=' * 60)
+
+EVAL_SETS = 3
+EVAL_MAX_STEPS = 2000
+
+steps_list = []
+reward_list = []
+
+for s in range(1, EVAL_SETS + 1):
+    try:
+        eval_env = make_eval_env('donkey-generated-roads-v0', 9091)
+        eval_model = PPO.load(best_model_path, env=eval_env, device='cpu')
+        obs = eval_env.reset()
+        done = False
+        total_s = 0
+        total_r = 0.0
+
+        while not done and total_s < EVAL_MAX_STEPS:
+            action, _ = eval_model.predict(obs, deterministic=True)
+            result = eval_env.step(action)
+            obs, r, done = result[0], result[1], result[2]
+            if hasattr(done, '__len__'):
+                done = bool(done[0])
+            total_r += float(r) if not hasattr(r, '__len__') else float(r[0])
+            total_s += 1
+
+        status = '✅' if total_s >= EVAL_MAX_STEPS else f'❌@{total_s}'
+        flog(f' Set {s}: {total_r:.1f}r / {total_s}s {status}')
+        steps_list.append(total_s)
+        reward_list.append(total_r)
+        eval_env.close()
+
+    except Exception as e:
+        flog(f' Set {s} error: {e}')
+
+if steps_list:
+    flog(f' Mean: {np.mean(steps_list):.0f} steps / {np.mean(reward_list):.1f} reward')
+
+flog('Exp 23 complete.')
diff --git a/agent/reward_wrapper.py b/agent/reward_wrapper.py
index 2696bc0..2a1cc5a 100644
--- a/agent/reward_wrapper.py
+++ b/agent/reward_wrapper.py
@@ -1,58 +1,36 @@
 """
-Speed + Progress Reward Wrapper for DonkeyCar RL — v6 (Speed×CTE + Efficiency Gate)
-=====================================================================================
+Speed × CTE Reward Wrapper for DonkeyCar RL — v7 (Clean)
+=========================================================
-REWARD HACKING HISTORY:
-  v1 additive: speed × (1-cte/max_cte) → boundary oscillation
-  v2 multiplicative:
original × (1+speed×scale) → circular driving (on-track) - v3 path efficiency: original × (1+speed×eff×scale) → still circling! - WHY v3 failed: efficiency killed the SPEED BONUS but not the BASE reward. - A spinning car at CTE≈0 still earns 1.0/step × thousands of steps. - v4: base × eff × (1 + speed_scale × speed) → zero gradient on hills! - WHY v4 failed on hills: speed≈0 AND eff≈0 AND cte_quality varies → all - three terms near zero simultaneously → no gradient to push ANY term up. - v5: speed × CTE_quality (no efficiency) → circular driving returns! - WHY v5 failed: dropped efficiency entirely. Circular driving at CTE≈0 - with speed>0 earns positive reward indefinitely. Observed in Exp 11. - v6 (THIS VERSION): v5 reward + efficiency GATE. - Keeps v5's gradient properties (non-zero gradient on hills) but adds - a binary efficiency check that zeros reward when car is circling. +The simulator now uses solid BoxCollider barriers with Continuous Collision +Detection on the car Rigidbody. The car physically cannot escape the track. +This removes the need for every Python-side exploit patch that lived here: -ROOT CAUSE OF CIRCLING: - The sim's own calc_reward() uses `forward_vel` = dot(car_heading, velocity). - A spinning car is ALWAYS moving "forward" relative to its own heading, - so forward_vel > 0 always, giving positive reward while circling indefinitely. - We bypass this entirely. 
+ REMOVED (simulator now enforces these physically): + - CTE-patience termination (car can't get far off track anyway) + - High-CTE negative reward patch + - solid_hit / barrier-contact monitoring + - low-speed / wedge detection -FORMULA (v6): - cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1] centred=1 - speed_norm = min(speed / 10.0, 1.0) # [0,1] normalised - efficiency = net_displacement / total_path # [0,1] straight=1, circle=0 + KEPT (still needed — physics can't detect these): + - Efficiency gate: zero reward when circling + (car on-track but spinning in circles, not advancing) + - No-progress termination: active_node not advancing + (car stuck at waypoint, not completing the course) + - Lap exploit check: super-fast laps are physically impossible but kept + as a sanity guard + +FORMULA: + cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1]: centred=1 + speed_norm = min(speed / 10.0, 1.0) # [0,1]: normalised + efficiency = net_displacement / total_path # [0,1]: straight=1, circle=0 if efficiency < min_efficiency: - reward = 0.0 # GATE: circling → zero reward (but not negative) + reward = 0.0 # circling — no incentive else: - reward = cte_quality × speed_norm # v5 formula (gradient on hills) + reward = cte_quality × speed_norm On done/crash: reward = -1.0 - -WHY GATE NOT MULTIPLIER: - v4 used efficiency as a multiplier: reward = base × eff × speed_bonus. - On a hill: speed≈0, eff≈0, base≈0.5 → reward≈0 and ∂reward/∂speed≈0. - No gradient to push speed up — car stays stuck. - - v6 gate: efficiency is either PASS or FAIL. When efficiency > threshold - (car moving forward at all), reward = speed × CTE_quality. On a hill: - car is stuck but still has eff > 0 (not literally circling), so the gate - passes and the reward = speed × CTE_quality. ∂reward/∂speed > 0 → gradient - pushes toward more throttle. Circle has eff ≈ 0 → gate fails → reward = 0. 
- -PROPERTIES: - - Circling (eff0): reward = speed × CTE (gradient toward unstuck) - - On track, fast: reward = high (speed + centred) - - Off track: reward ≈ 0 (CTE_quality → 0) - - Crash: reward = -1.0 """ import gymnasium as gym @@ -62,92 +40,49 @@ from collections import deque class SpeedRewardWrapper(gym.Wrapper): """ - Full reward bypass: speed × CTE_quality, gated by efficiency. - - Completely ignores the sim's own reward (which uses forward_vel and is - exploitable by circular/spinning motion). - - Exploit termination: - - Sustained high CTE (> max_cte_terminate for cte_patience steps): grass exploit - - No track progress (active_node max not advancing for progress_patience steps): - catches circular driving, stuck-on-cone, stuck-on-barrier. - A circling car stays near the same waypoints — active_node never advances. - A stuck car never advances either. Forward driving always advances. + Reward = speed × CTE_quality, gated by path efficiency. Args: - env: gymnasium environment - speed_scale: speed bonus multiplier (default 0.1) - window_size: steps for efficiency gate (default 30) - min_efficiency: efficiency gate threshold (default 0.15) - max_cte: track half-width for reward normalization (default 8.0) - min_lap_time: laps faster than this are penalised as exploits - max_cte_terminate: terminate if CTE > this for cte_patience steps - cte_patience: steps of sustained high CTE before termination - progress_patience: steps without new max active_node before termination + env: gymnasium environment + window_size: steps for efficiency gate history (default 30) + min_efficiency: efficiency threshold — below this, reward = 0 (default 0.15) + max_cte: CTE at which reward reaches 0 (default 8.0) + min_lap_time: laps faster than this are penalised (exploit guard) + progress_patience: steps without new max active_node before termination """ def __init__( self, env, - speed_scale: float = 0.1, window_size: int = 30, min_efficiency: float = 0.15, max_cte: float = 8.0, 
min_lap_time: float = 5.0, - max_cte_terminate: float = 4.0, - cte_patience: int = 20, progress_patience: int = 60, - efficiency_patience: int = 20, # steps of low efficiency before termination - low_speed_patience: int = 20, - low_speed_threshold: float = 0.2, - low_speed_min_displacement: float = 0.25, - low_speed_grace_steps: int = 20, ): super().__init__(env) - self.speed_scale = speed_scale self.window_size = window_size self.min_efficiency = min_efficiency self.max_cte = max_cte self.min_lap_time = min_lap_time - self.max_cte_terminate = max_cte_terminate - self.cte_patience = cte_patience self.progress_patience = progress_patience - self.efficiency_patience = efficiency_patience - self.low_speed_patience = low_speed_patience - self.low_speed_threshold = low_speed_threshold - self.low_speed_min_displacement = low_speed_min_displacement - self.low_speed_grace_steps = low_speed_grace_steps + self._pos_history = deque(maxlen=window_size + 1) self._last_lap_count = 0 - self._high_cte_steps = 0 self._max_node_seen = -1 self._no_progress_steps = 0 - self._low_eff_steps = 0 - self._solid_hit_steps = 0 - self._prev_speed = 0.0 - self._episode_steps = 0 - self._low_speed_steps = 0 - self._low_speed_anchor = None def reset(self, **kwargs): result = self.env.reset(**kwargs) self._pos_history.clear() self._last_lap_count = 0 - self._high_cte_steps = 0 self._max_node_seen = -1 self._no_progress_steps = 0 - self._low_eff_steps = 0 - self._solid_hit_steps = 0 - self._prev_speed = 0.0 - self._episode_steps = 0 - self._low_speed_steps = 0 - self._low_speed_anchor = None return result def step(self, action): result = self.env.step(action) - # Handle both 4-tuple (old gym) and 5-tuple (gymnasium) APIs if len(result) == 5: obs, _sim_reward, terminated, truncated, info = result done = terminated or truncated @@ -158,159 +93,54 @@ class SpeedRewardWrapper(gym.Wrapper): else: raise ValueError(f'Unexpected step() result length: {len(result)}') - # Completely ignore _sim_reward — 
compute our own - shaped, force_terminate = self._compute_reward_and_done(done, info) + shaped, force_terminate = self._compute_reward(done, info) if force_terminate: terminated = True done = True if len(result) == 5: return obs, shaped, terminated, truncated, info - else: - return obs, shaped, done, info + return obs, shaped, done, info - def _compute_reward_and_done(self, done: bool, info: dict): - """ - v6.1: speed × CTE-quality + efficiency gate + grass/rollback terminators. - - New termination conditions: - - Sustained high CTE: CTE > max_cte_terminate for cte_patience steps - → terminate. Stops the grass exploit (car exits track gap and - drives indefinitely on grass with CTE just under max_cte=8.0). - - No track progress: active_node doesn't advance for progress_patience - steps → terminate. Stops mountain rollback (car goes up, rolls - back, IS moving so StuckWrapper doesn't fire, but never advances). - - reward = speed_norm × cte_quality (when efficiency >= threshold) - reward = 0.0 (when circling) - reward = -1.0 (on crash/termination) - """ - # Track position for efficiency calculation - current_pos = None + def _compute_reward(self, done: bool, info: dict): + # Record position for efficiency calculation try: pos = info.get('pos', (0.0, 0.0, 0.0)) - pos_x = float(pos[0]) - pos_z = float(pos[2]) - current_pos = np.array([pos_x, pos_z]) - self._pos_history.append(current_pos) + self._pos_history.append(np.array([float(pos[0]), float(pos[2])])) except (TypeError, ValueError, IndexError): pass - self._episode_steps += 1 - - # Crash / episode over if done: return -1.0, False - # --- CTE value for all checks --- try: cte = float(info.get('cte', 0.0) or 0.0) except (TypeError, ValueError): cte = 0.0 - # --- Speed / collision classification --- try: speed = max(0.0, float(info.get('speed', 0.0) or 0.0)) except (TypeError, ValueError): speed = 0.0 - try: - hit = str(info.get('hit', 'none') or 'none').lower() - except Exception: - hit = 'none' - - solid_hit = ( - 
hit != 'none' and ( - 'barrier' in hit or - 'wall' in hit or - 'tree' in hit - ) - ) - - # Allow brief brushes, but terminate on: - # 1. a head-on style stop: car was moving, then collision arrives with - # a large speed drop; or - # 2. sustained obstacle contact over several telemetry frames. - if solid_hit: - head_on_impact = self._prev_speed >= 1.5 and speed <= 0.35 - if head_on_impact: - self._prev_speed = speed - return -1.0, True - - self._solid_hit_steps += 1 - if self._solid_hit_steps >= 4: - self._prev_speed = speed - return -1.0, True - else: - self._solid_hit_steps = 0 - - # --- Wheels-spinning / barrier wedge termination --- - # CTE can remain deceptively acceptable when the car is pressed against - # a generated-road barrier or invisible collider. If speed stays near - # zero and position does not meaningfully change after the launch grace - # period, kill the episode quickly with a negative reward. - if ( - current_pos is not None - and self._episode_steps > self.low_speed_grace_steps - and speed <= self.low_speed_threshold - ): - if self._low_speed_anchor is None: - self._low_speed_anchor = current_pos - self._low_speed_steps = 1 - else: - moved = float(np.linalg.norm(current_pos - self._low_speed_anchor)) - if moved >= self.low_speed_min_displacement: - self._low_speed_anchor = current_pos - self._low_speed_steps = 0 - else: - self._low_speed_steps += 1 - - if self._low_speed_steps >= self.low_speed_patience: - self._prev_speed = speed - return -1.0, True - else: - self._low_speed_steps = 0 - self._low_speed_anchor = current_pos - - # --- Grass / outside-road exploit: high CTE is bad immediately --- - # Do not let the policy collect positive speed reward while it is - # outside the useful road corridor. Earlier versions only terminated - # after patience frames, but still paid positive reward during those - # frames; PPO learned large fast circles outside generated_road. 
- if abs(cte) > self.max_cte_terminate: - self._high_cte_steps += 1 - if self._high_cte_steps >= self.cte_patience: - self._prev_speed = speed - return -1.0, True # too long off-track — terminate - self._prev_speed = speed - return -0.25, False - else: - self._high_cte_steps = 0 - - # --- Circle / stuck exploit: no track progress termination --- - # Track the highest active_node (track waypoint) reached this episode. - # A circling car stays near the same waypoints — max_node never advances. - # A stuck car never advances either. Only genuine forward driving advances. - # On lap completion, active_node resets to 0 — we reset our tracker too. + # --- No-progress termination --- + # Terminates episodes where the car isn't advancing along the track + # (circling near the start, stuck against a barrier, etc.). try: active_node = int(info.get('active_node', -1) or 0) - total_nodes = int(info.get('total_nodes', 1) or 1) except (TypeError, ValueError): active_node = -1 - total_nodes = 1 if active_node >= 0: if active_node > self._max_node_seen: - # New furthest point reached — genuine forward progress self._max_node_seen = active_node self._no_progress_steps = 0 else: self._no_progress_steps += 1 if self._no_progress_steps >= self.progress_patience: - self._prev_speed = speed - return -1.0, True # no forward progress — terminate - + return -1.0, True + # --- Lap detection: reset progress tracker + exploit guard --- try: current_lap_count = int(info.get('lap_count', 0) or 0) except (TypeError, ValueError): @@ -318,7 +148,6 @@ class SpeedRewardWrapper(gym.Wrapper): if current_lap_count > self._last_lap_count: self._last_lap_count = current_lap_count - # Reset progress tracker — active_node wraps to 0 on new lap self._max_node_seen = -1 self._no_progress_steps = 0 try: @@ -326,47 +155,22 @@ class SpeedRewardWrapper(gym.Wrapper): except (TypeError, ValueError): lap_time = 999.0 if lap_time < self.min_lap_time: - penalty = -10.0 * (self.min_lap_time / max(lap_time, 0.1)) - 
self._prev_speed = speed - return penalty, True + return -10.0 * (self.min_lap_time / max(lap_time, 0.1)), True - # --- Efficiency gate: detect circular driving --- - # Count consecutive steps of low efficiency. After patience steps, terminate. - # Previously this just returned 0 reward (no termination) which let circles - # run for 20+ seconds. Now we terminate after ~20 steps (~0.7s). - efficiency = self._compute_efficiency() - if efficiency < self.min_efficiency: - self._low_eff_steps += 1 - if self._low_eff_steps >= self.efficiency_patience: - self._prev_speed = speed - return -1.0, True # circle too long — terminate - self._prev_speed = speed - return 0.0, False # still accumulating — zero reward - else: - self._low_eff_steps = 0 + # --- Efficiency gate: zero reward when circling --- + if self._compute_efficiency() < self.min_efficiency: + return 0.0, False - # --- CTE quality --- + # --- Core reward: speed × CTE quality --- cte_quality = 1.0 - min(abs(cte) / self.max_cte, 1.0) - - # --- Speed --- - # --- v6 reward: speed × CTE quality --- - speed_norm = min(speed / 10.0, 1.0) - self._prev_speed = speed + speed_norm = min(speed / 10.0, 1.0) return cte_quality * speed_norm, False def _compute_efficiency(self) -> float: - """Path efficiency = net_displacement / total_path_length.""" if len(self._pos_history) < 3: - return 1.0 # Insufficient history — give benefit of doubt - + return 1.0 positions = list(self._pos_history) - net = np.linalg.norm(positions[-1] - positions[0]) - total = sum( - np.linalg.norm(positions[i + 1] - positions[i]) - for i in range(len(positions) - 1) - ) - return float(net / total) if total > 1e-6 else 1.0 - - def theoretical_max_per_step(self, max_speed: float = 10.0) -> float: - """Upper bound on reward/step (efficiency=1, CTE=0, max speed).""" - return 1.0 * 1.0 * (1.0 + self.speed_scale * max_speed) + net = float(np.linalg.norm(positions[-1] - positions[0])) + total = float(sum(np.linalg.norm(positions[i+1] - positions[i]) + for i in 
range(len(positions) - 1))) + return net / total if total > 1e-6 else 1.0 diff --git a/tests/test_reward_wrapper.py b/tests/test_reward_wrapper.py index 5b6731c..1d19782 100644 --- a/tests/test_reward_wrapper.py +++ b/tests/test_reward_wrapper.py @@ -1,30 +1,25 @@ -""" -Tests for reward_wrapper.py v4 (full sim bypass — base × efficiency × speed). -""" +"""Tests for reward_wrapper.py v7 (clean: speed×CTE + efficiency gate).""" import sys, os, math, pytest import numpy as np -import gymnasium as gym -from collections import deque sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent')) from reward_wrapper import SpeedRewardWrapper +import gymnasium as gym -# ---- Mock Environments ---- class MockEnv(gym.Env): - """Configurable mock gymnasium.Env.""" metadata = {'render_modes': []} def __init__(self, speed=2.0, cte=0.0, pos=(0., 0., 0.), done=False, use_5tuple=True): super().__init__() - self.action_space = gym.spaces.Discrete(5) + self.action_space = gym.spaces.Discrete(5) self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8) - self._speed = speed - self._cte = cte - self._pos = list(pos) - self._done = done + self._speed = speed + self._cte = cte + self._pos = list(pos) + self._done = done self._use_5tuple = use_5tuple def set_pos(self, p): self._pos = list(p) @@ -34,523 +29,231 @@ class MockEnv(gym.Env): return np.zeros((120, 160, 3), dtype=np.uint8), {} def step(self, action): - obs = np.zeros((120, 160, 3), dtype=np.uint8) - # Sim reward uses forward_vel (exploitable) — wrapper should IGNORE this - sim_reward = 999.0 # Deliberately bogus — wrapper must not use this - info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos} + obs = np.zeros((120, 160, 3), dtype=np.uint8) + sim_reward = 999.0 # deliberately bogus — wrapper must ignore this + info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos} if self._use_5tuple: return obs, sim_reward, self._done, False, info return obs, sim_reward, self._done, info 
- def close(self): pass + +# ── Helpers ────────────────────────────────────────────────────────────────── + +def make_info(cte=0.5, speed=2.0, pos=None, active_node=1, lap_count=0, lap_time=0.0): + return { + 'cte': cte, 'speed': speed, + 'pos': pos or (0., 0., 0.), + 'active_node': active_node, 'total_nodes': 100, + 'lap_count': lap_count, 'last_lap_time': lap_time, + } -def step_wrapped(wrapped_env, env, pos, cte=0.5, speed=2.0): - env.set_pos(pos) - env.set_cte(cte) - env._speed = speed - return wrapped_env.step(0) - - -# ---- Core v4 Properties ---- +# ── Core reward properties ──────────────────────────────────────────────────── def test_sim_reward_is_completely_ignored(): - """ - The wrapper must NOT use the sim's reward (999.0). - v4 computes reward from scratch using CTE/pos/speed only. - """ env = MockEnv(speed=2.0, cte=0.5, pos=(0., 0., 0.)) - wrapped = SpeedRewardWrapper(env, speed_scale=0.1) + wrapped = SpeedRewardWrapper(env) wrapped.reset() _, reward, _, _, _ = wrapped.step(0) - assert reward != 999.0, "Wrapper must not pass through sim's bogus reward" - assert reward < 10.0, f"Reward should be small, got {reward}" + assert reward != 999.0 + assert reward < 10.0 -def test_circling_at_zero_cte_gives_near_zero_reward(): - """ - v6: circling (low efficiency) should yield zero reward via the efficiency gate. - After enough steps of circular motion, the efficiency drops below threshold - and the gate zeros the reward. 
- """ - env = MockEnv(speed=3.0, cte=0.0) - wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.15) +def test_crash_gives_negative_one(): + env = MockEnv(speed=5.0, cte=0.0, done=True) + wrapped = SpeedRewardWrapper(env) wrapped.reset() - - # Drive in a circle for enough steps to fill the position window - rewards = [] - for i in range(40): - angle = 2 * math.pi * i / 12 # completes circle every 12 steps - env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)]) - _, r, _, _, _ = wrapped.step(0) - rewards.append(r) - - # After 20+ steps of circular motion, efficiency gate should kick in - # Last few rewards should be 0.0 - assert rewards[-1] == 0.0, ( - f"v6: circular driving should yield 0.0 reward via efficiency gate, got {rewards[-1]:.4f}") - assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3, ( - f"v6: most of last 5 rewards during circle should be 0.0, got {rewards[-5:]}") + _, reward, _, _, _ = wrapped.step(0) + assert reward == -1.0 def test_forward_driving_earns_positive_reward(): - """Straight-line driving at low CTE and reasonable speed earns positive reward.""" env = MockEnv(speed=5.0, cte=0.5) - wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10) + wrapped = SpeedRewardWrapper(env, window_size=10) wrapped.reset() _, r, _, _, _ = wrapped.step(0) - # reward = (5/10) * (1 - 0.5/8) = 0.5 * 0.9375 = 0.469 - assert r > 0.3, f"Forward driving should earn >0.3 reward, got {r:.4f}" + # reward = (5/10) * (1 - 0.5/8.0) = 0.5 * 0.9375 = 0.469 + assert r > 0.3, f"Forward driving should earn >0.3, got {r:.4f}" -def test_forward_beats_circling_by_large_margin(): - """ - v6: forward driving earns positive reward; circular driving earns zero. - The efficiency gate ensures this gap. 
- """ - # Forward driving at CTE=1m, speed=5 +def test_higher_cte_reduces_reward(): + env_low = MockEnv(speed=2.0, cte=0.5) + env_high = MockEnv(speed=2.0, cte=4.0) + w_low = SpeedRewardWrapper(env_low, window_size=5) + w_high = SpeedRewardWrapper(env_high, window_size=5) + w_low.reset(); w_high.reset() + for i in range(10): + env_low.set_pos( [i * 0.3, 0., 0.]) + env_high.set_pos([i * 0.3, 0., 0.]) + _, r_low, _, _, _ = w_low.step(0) + _, r_high, _, _, _ = w_high.step(0) + assert r_low > r_high + + +def test_higher_speed_increases_reward(): + env_slow = MockEnv(speed=0.5, cte=1.0) + env_fast = MockEnv(speed=3.0, cte=1.0) + w_slow = SpeedRewardWrapper(env_slow, window_size=10) + w_fast = SpeedRewardWrapper(env_fast, window_size=10) + w_slow.reset(); w_fast.reset() + for i in range(15): + env_slow.set_pos([i * 0.1, 0., 0.]) + env_fast.set_pos([i * 0.3, 0., 0.]) + _, r_slow, _, _, _ = w_slow.step(0) + _, r_fast, _, _, _ = w_fast.step(0) + assert r_fast > r_slow + + +def test_4tuple_compatibility(): + env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False) + env.set_pos([0., 0., 0.]) + wrapped = SpeedRewardWrapper(env) + wrapped.reset() + result = wrapped.step(0) + assert len(result) == 4 + _, reward, done, info = result + assert isinstance(reward, float) + assert reward != 999.0 + + +# ── Efficiency gate ─────────────────────────────────────────────────────────── + +def test_circling_earns_zero_reward(): + env = MockEnv(speed=3.0, cte=0.0) + wrapped = SpeedRewardWrapper(env, window_size=30, min_efficiency=0.15) + wrapped.reset() + rewards = [] + for i in range(40): + angle = 2 * math.pi * i / 12 + env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)]) + _, r, _, _, _ = wrapped.step(0) + rewards.append(r) + assert rewards[-1] == 0.0 + assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3 + + +def test_forward_beats_circling(): env_fwd = MockEnv(speed=5.0, cte=1.0) - wrapped_fwd = SpeedRewardWrapper(env_fwd, speed_scale=0.1, window_size=30) - wrapped_fwd.reset() + 
w_fwd = SpeedRewardWrapper(env_fwd, window_size=30) + w_fwd.reset() for i in range(35): - env_fwd.set_pos([i * 0.5, 0., 0.]) # straight line - _, r_fwd, _, _, _ = wrapped_fwd.step(0) + env_fwd.set_pos([i * 0.5, 0., 0.]) + _, r_fwd, _, _, _ = w_fwd.step(0) - # Circular driving at CTE=0, speed=5 env_circ = MockEnv(speed=5.0, cte=0.0) - wrapped_circ = SpeedRewardWrapper(env_circ, speed_scale=0.1, window_size=30) - wrapped_circ.reset() + w_circ = SpeedRewardWrapper(env_circ, window_size=30) + w_circ.reset() for i in range(35): angle = 2 * math.pi * i / 12 env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)]) - _, r_circ, _, _, _ = wrapped_circ.step(0) + _, r_circ, _, _, _ = w_circ.step(0) - assert r_fwd > 0, f"Forward driving should earn positive reward, got {r_fwd}" - assert r_circ == 0.0, f"Circular driving should earn 0 reward, got {r_circ}" - assert r_fwd > r_circ, f"Forward ({r_fwd:.3f}) must beat circling ({r_circ:.3f})" + assert r_fwd > 0 + assert r_circ == 0.0 -def test_crash_gives_negative_reward(): - """Episode termination (done=True) must always give -1.0.""" - env = MockEnv(speed=5.0, cte=0.0, done=True) - wrapped = SpeedRewardWrapper(env, speed_scale=0.2) - wrapped.reset() - _, reward, _, _, _ = wrapped.step(0) - assert reward == -1.0, f"Crash reward must be -1.0, got {reward}" - - -def test_high_cte_reduces_reward(): - """Higher CTE should reduce reward (closer to track edge = lower base).""" - env_low = MockEnv(speed=2.0, cte=0.5) - env_high = MockEnv(speed=2.0, cte=4.0) - - wrapped_low = SpeedRewardWrapper(env_low, speed_scale=0.1, window_size=5) - wrapped_high = SpeedRewardWrapper(env_high, speed_scale=0.1, window_size=5) - wrapped_low.reset() - wrapped_high.reset() - - # Drive straight so efficiency fills up - for i in range(10): - env_low.set_pos([i * 0.3, 0., 0.]) - env_high.set_pos([i * 0.3, 0., 0.]) - _, r_low, _, _, _ = wrapped_low.step(0) - _, r_high, _, _, _ = wrapped_high.step(0) - - assert r_low > r_high, f"Low CTE 
({r_low:.3f}) should reward more than high CTE ({r_high:.3f})" - - -def test_speed_bonus_increases_reward_when_on_track(): - """Faster forward driving earns more reward than slower forward driving.""" - env_slow = MockEnv(speed=0.5, cte=1.0) - env_fast = MockEnv(speed=3.0, cte=1.0) - - wrapped_slow = SpeedRewardWrapper(env_slow, speed_scale=0.1, window_size=10) - wrapped_fast = SpeedRewardWrapper(env_fast, speed_scale=0.1, window_size=10) - wrapped_slow.reset() - wrapped_fast.reset() - - for i in range(15): - env_slow.set_pos([i * 0.1, 0., 0.]) - env_fast.set_pos([i * 0.3, 0., 0.]) # Fast car covers more ground - _, r_slow, _, _, _ = wrapped_slow.step(0) - _, r_fast, _, _, _ = wrapped_fast.step(0) - - assert r_fast > r_slow, f"Fast ({r_fast:.3f}) should earn more than slow ({r_slow:.3f})" - - -def test_theoretical_max_per_step(): - """Max reward/step = 1.0 × 1.0 × (1 + scale × max_speed) = 2.0 at scale=0.1, max=10.""" - env = MockEnv() - wrapped = SpeedRewardWrapper(env, speed_scale=0.1) - assert wrapped.theoretical_max_per_step(max_speed=10.0) == pytest.approx(2.0, abs=1e-6) - - -def test_4tuple_step_compatibility(): - """Wrapper must handle 4-tuple step() return (old gym API).""" - env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False) - env.set_pos([0., 0., 0.]) - wrapped = SpeedRewardWrapper(env, speed_scale=0.1) - wrapped.reset() - result = wrapped.step(0) - assert len(result) == 4, f"Expected 4-tuple, got {len(result)}" - _, reward, done, info = result - assert isinstance(reward, float) - assert reward != 999.0, "Should not use sim reward" - - -def test_reward_resets_on_episode_reset(): - """After reset, position history clears so efficiency recalculates cleanly.""" +def test_history_clears_on_reset(): env = MockEnv(speed=2.0, cte=0.5) - wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10) + wrapped = SpeedRewardWrapper(env, window_size=10) wrapped.reset() - - # Fill with circular data for i in range(15): angle = 2 * math.pi * i / 12 
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)]) wrapped.step(0) - - # After reset, start fresh straight wrapped.reset() rewards = [] for i in range(5): env.set_pos([i * 0.3, 0., 0.]) _, r, _, _, _ = wrapped.step(0) rewards.append(r) - - # Should get reasonable reward after fresh start - assert rewards[-1] > 0, "Should get positive reward after reset and straight driving" + assert rewards[-1] > 0 -# --------------------------------------------------------------------------- -# Short-lap exploit patch tests -# --------------------------------------------------------------------------- +# ── No-progress termination ─────────────────────────────────────────────────── -def test_short_lap_triggers_penalty(): - """ - A lap completed faster than min_lap_time must return a large penalty, - not a positive reward. This closes the start/finish circle exploit. - """ - env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.)) - wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) - wrapper.reset() - - # Simulate step where a new lap completes in 1 second (exploit) - info = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0), - 'lap_count': 1, 'last_lap_time': 1.0} - reward, _ = wrapper._compute_reward_and_done(done=False, info=info) - assert reward < 0, f'Short lap (1s) should penalise, got reward={reward}' - assert reward <= -10.0, f'Short lap penalty should be large (<= -10), got {reward}' - - -def test_legitimate_lap_not_penalised(): - """ - A lap completed above min_lap_time must NOT trigger the penalty. 
- """ - env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.)) - wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) - wrapper.reset() - - # First step — no lap yet - info_no_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0), - 'lap_count': 0, 'last_lap_time': 0.0} - wrapper._compute_reward_and_done(done=False, info=info_no_lap) - - # Legitimate lap at 12 seconds - info = {'cte': 0.2, 'speed': 3.0, 'pos': (1.0, 0.0, 0.0), - 'lap_count': 1, 'last_lap_time': 12.0} - reward, _ = wrapper._compute_reward_and_done(done=False, info=info) - assert reward >= 0, f'Legitimate lap (12s) should not be penalised, got {reward}' - - -def test_lap_count_not_double_penalised(): - """ - Penalty fires exactly once per short lap, not on every subsequent step. - """ - env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.)) - wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) - wrapper.reset() - - # Short lap fires on step where lap_count increments - info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0), - 'lap_count': 1, 'last_lap_time': 1.5} - r1, _ = wrapper._compute_reward_and_done(done=False, info=info_lap) - assert r1 < 0 - - # Next step same lap_count — should get normal reward, not another penalty - info_next = {'cte': 0.0, 'speed': 3.0, 'pos': (0.1, 0.0, 0.0), - 'lap_count': 1, 'last_lap_time': 1.5} - r2, _ = wrapper._compute_reward_and_done(done=False, info=info_next) - assert r2 >= 0, f'Penalty should not repeat on same lap_count, got r2={r2}' - - -def test_lap_count_resets_on_episode_reset(): - """lap_count tracker must reset when the episode resets.""" - env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.)) - wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) - wrapper.reset() - - # Complete a short lap - info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0), - 'lap_count': 1, 'last_lap_time': 1.0} - wrapper._compute_reward_and_done(done=False, info=info_lap) - assert wrapper._last_lap_count == 1 - - # Reset episode — counter must go back to 0 - wrapper.reset() - 
assert wrapper._last_lap_count == 0 - - -# --------------------------------------------------------------------------- -# v6.1 exploit terminator tests -# --------------------------------------------------------------------------- - -def test_sustained_high_cte_terminates_episode(): - """ - Grass exploit fix: if CTE exceeds max_cte_terminate for cte_patience - consecutive steps, the episode must be force-terminated with -1.0 reward. - This catches the generated_track gap where car drives indefinitely on grass. - """ - env = MockEnv(speed=3.0, cte=5.0) # CTE=5.0 > max_cte_terminate=4.0 - wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5) - wrapper.reset() - - rewards = [] - terminated = [] - for _ in range(10): - info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.), - 'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0} - r, force_term = wrapper._compute_reward_and_done(done=False, info=info) - rewards.append(r) - terminated.append(force_term) - - # High CTE should be punished immediately, then terminate at step 5 - assert rewards[0] < 0, f'High CTE should be negative immediately, got {rewards[0]}' - assert terminated[4] == True, f'Should force-terminate at step 5, got {terminated}' - assert rewards[4] == -1.0, f'Termination reward should be -1.0, got {rewards[4]}' - assert terminated[0] == False, 'Should not terminate at step 1' - - -def test_high_cte_never_gets_positive_speed_reward_before_termination(): - """ - Regression for generated_road outside-circle exploit: while CTE is outside - the allowed corridor, the wrapper must not pay positive speed reward during - the patience window. The policy should receive negative feedback - immediately, then termination. 
- """ - env = MockEnv(speed=5.0, cte=3.0) - wrapper = SpeedRewardWrapper(env, max_cte_terminate=2.5, cte_patience=3) - wrapper.reset() - - rewards = [] - terminated = [] - for i in range(3): - info = { - 'cte': 3.0, - 'speed': 5.0, - 'pos': (float(i), 0.0, 0.0), - 'active_node': i, - 'total_nodes': 100, - 'lap_count': 0, - 'last_lap_time': 0.0, - } - r, ft = wrapper._compute_reward_and_done(done=False, info=info) - rewards.append(r) - terminated.append(ft) - - assert rewards[:2] == [-0.25, -0.25] - assert rewards[2] == -1.0 - assert terminated == [False, False, True] - - -def test_high_cte_resets_when_back_on_track(): - """ - High CTE counter must reset when car returns to track. - Prevents false termination after a brief excursion. - """ - env = MockEnv(speed=3.0, cte=0.5) - wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5) - wrapper.reset() - - # 3 steps high CTE - for _ in range(3): - info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.), - 'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0} - r, ft = wrapper._compute_reward_and_done(done=False, info=info) - assert ft == False, 'Should not terminate after only 3 steps' - - # 1 step back on track resets counter - info = {'cte': 1.0, 'speed': 3.0, 'pos': (0., 0., 0.), - 'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0} - wrapper._compute_reward_and_done(done=False, info=info) - assert wrapper._high_cte_steps == 0, 'CTE counter should reset when back on track' - - # 5 more steps high CTE — should now terminate (counter starts fresh) - for i in range(5): - info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.), - 'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0} - r, ft = wrapper._compute_reward_and_done(done=False, info=info) - assert ft == True, 'Should terminate after 5 new consecutive high-CTE steps' - - -def test_no_track_progress_terminates_episode(): - """ - Circle/stuck exploit fix: if max active_node doesn't advance for - progress_patience steps, the episode must be 
force-terminated. - A circling car stays near the same waypoints — max_node never increases. - """ +def test_no_progress_terminates(): env = MockEnv(speed=3.0, cte=0.5) wrapper = SpeedRewardWrapper(env, progress_patience=10) wrapper.reset() - - # First step initialises max_node to 5, then 10 more steps stuck at 5 → terminate for i in range(12): - info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(i)*0.1, 0., 0.), - 'active_node': 5, 'total_nodes': 100, - 'lap_count': 0, 'last_lap_time': 0.0} - r, ft = wrapper._compute_reward_and_done(done=False, info=info) + r, ft = wrapper._compute_reward(False, make_info(active_node=5, pos=(i*0.1, 0., 0.))) if ft: break - - assert ft == True, 'Should terminate when max active_node not advancing' + assert ft is True assert r == -1.0 -def test_low_speed_no_displacement_terminates_barrier_wedge(): - """ - Regression for invisible-barrier wedge: wheels can be commanded but the car - remains nearly motionless with acceptable CTE. This must terminate quickly - instead of returning zero/positive reward indefinitely. 
- """ - env = MockEnv(speed=0.05, cte=0.5) - wrapper = SpeedRewardWrapper( - env, - low_speed_grace_steps=2, - low_speed_patience=3, - low_speed_threshold=0.2, - low_speed_min_displacement=0.25, - progress_patience=100, - ) - wrapper.reset() - - terminated = False - reward = None - for _ in range(8): - info = { - 'cte': 0.5, - 'speed': 0.05, - 'pos': (1.0, 0.0, 1.0), - 'active_node': 5, - 'total_nodes': 100, - 'lap_count': 0, - 'last_lap_time': 0.0, - } - reward, terminated = wrapper._compute_reward_and_done(done=False, info=info) - if terminated: - break - - assert terminated is True - assert reward == -1.0 - - -def test_low_speed_counter_resets_after_meaningful_displacement(): - """Slow starts should not terminate if the car is still changing position.""" - env = MockEnv(speed=0.05, cte=0.5) - wrapper = SpeedRewardWrapper( - env, - low_speed_grace_steps=0, - low_speed_patience=3, - low_speed_threshold=0.2, - low_speed_min_displacement=0.25, - progress_patience=100, - ) - wrapper.reset() - - for i in range(6): - info = { - 'cte': 0.5, - 'speed': 0.05, - 'pos': (float(i) * 0.3, 0.0, 0.0), - 'active_node': i, - 'total_nodes': 100, - 'lap_count': 0, - 'last_lap_time': 0.0, - } - reward, terminated = wrapper._compute_reward_and_done(done=False, info=info) - assert terminated is False - - -def test_track_progress_resets_counter(): - """ - Advancing to a new max active_node must reset the no-progress counter. 
- """ - env = MockEnv(speed=3.0, cte=0.5) +def test_progress_resets_counter(): + env = MockEnv() wrapper = SpeedRewardWrapper(env, progress_patience=5) wrapper.reset() - - # Step forward: nodes 0, 1, 2, 3 — each new node resets counter for node in range(4): - info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(node)*0.5, 0., 0.), - 'active_node': node, 'total_nodes': 100, - 'lap_count': 0, 'last_lap_time': 0.0} - r, ft = wrapper._compute_reward_and_done(done=False, info=info) - assert ft == False, f'Should not terminate when advancing (node {node})' - assert wrapper._no_progress_steps == 0, 'Counter should reset on new max node' + r, ft = wrapper._compute_reward(False, make_info(active_node=node, pos=(node*0.5, 0., 0.))) + assert ft is False + assert wrapper._no_progress_steps == 0 -def test_circle_exploit_terminates(): - """ - A car circling near the same spot should be terminated. - active_node oscillates but never exceeds the initial max. - """ - env = MockEnv(speed=3.0, cte=0.5) +def test_circling_active_node_terminates(): + env = MockEnv() wrapper = SpeedRewardWrapper(env, progress_patience=10) wrapper.reset() - - # Set max_node to 10 - info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.), - 'active_node': 10, 'total_nodes': 100, - 'lap_count': 0, 'last_lap_time': 0.0} - wrapper._compute_reward_and_done(done=False, info=info) - - # Now oscillate between nodes 8-10 (circling near node 10) + wrapper._compute_reward(False, make_info(active_node=10)) terminated = False for i in range(20): - node = 8 + (i % 3) # oscillates 8, 9, 10, 8, 9, 10... 
- info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.), - 'active_node': node, 'total_nodes': 100, - 'lap_count': 0, 'last_lap_time': 0.0} - r, ft = wrapper._compute_reward_and_done(done=False, info=info) + r, ft = wrapper._compute_reward(False, make_info(active_node=8 + (i % 3))) if ft: terminated = True break - - assert terminated, 'Circling (oscillating active_node, no new max) should terminate' + assert terminated def test_lap_completion_resets_progress_tracker(): - """ - On lap completion, active_node resets to 0. Progress tracker must also - reset so the car isn't immediately terminated for 'no progress'. - """ - env = MockEnv(speed=3.0, cte=0.5) + env = MockEnv() wrapper = SpeedRewardWrapper(env, progress_patience=5, min_lap_time=5.0) wrapper.reset() - - # Drive to near end of track - info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.), - 'active_node': 99, 'total_nodes': 100, - 'lap_count': 0, 'last_lap_time': 0.0} - wrapper._compute_reward_and_done(done=False, info=info) + wrapper._compute_reward(False, make_info(active_node=99)) assert wrapper._max_node_seen == 99 - - # Complete a valid lap - info = {'cte': 0.5, 'speed': 3.0, 'pos': (0., 0., 0.), - 'active_node': 0, 'total_nodes': 100, - 'lap_count': 1, 'last_lap_time': 12.0} # 12s lap = valid - r, ft = wrapper._compute_reward_and_done(done=False, info=info) - - # Progress tracker should be reset - assert wrapper._max_node_seen == -1, 'max_node_seen should reset on lap completion' + r, ft = wrapper._compute_reward(False, make_info(active_node=0, lap_count=1, lap_time=12.0)) + assert wrapper._max_node_seen == -1 assert wrapper._no_progress_steps == 0 - assert ft == False, 'Valid lap should not terminate' + assert ft is False + + +# ── Lap exploit guard ───────────────────────────────────────────────────────── + +def test_short_lap_penalised(): + env = MockEnv() + wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) + wrapper.reset() + r, _ = wrapper._compute_reward(False, make_info(lap_count=1, 
lap_time=1.0)) + assert r < 0 + assert r <= -10.0 + + +def test_legitimate_lap_not_penalised(): + env = MockEnv() + wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) + wrapper.reset() + wrapper._compute_reward(False, make_info(lap_count=0)) + r, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=12.0, pos=(1., 0., 0.))) + assert r >= 0 + + +def test_lap_penalty_fires_once(): + env = MockEnv() + wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) + wrapper.reset() + r1, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5)) + assert r1 < 0 + r2, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5, pos=(0.1, 0., 0.))) + assert r2 >= 0 + + +def test_lap_count_resets_on_episode_reset(): + env = MockEnv() + wrapper = SpeedRewardWrapper(env, min_lap_time=5.0) + wrapper.reset() + wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.0)) + assert wrapper._last_lap_count == 1 + wrapper.reset() + assert wrapper._last_lap_count == 0
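
The reward arithmetic that the tests above exercise can be sketched on its own, outside the wrapper. This is a minimal illustration, not the wrapper's code: the function names `path_efficiency` and `core_reward` are hypothetical, `max_cte=8.0` is assumed from the `0.5/8.0` comment in `test_forward_driving_earns_positive_reward`, and the speed cap of 10.0 is taken from `min(speed / 10.0, 1.0)` in the diff.

```python
import math


def path_efficiency(positions):
    """Net displacement / total path length over a window of (x, z) points."""
    if len(positions) < 3:
        return 1.0  # too little history: treat as efficient
    net = math.dist(positions[-1], positions[0])
    total = sum(
        math.dist(positions[i + 1], positions[i])
        for i in range(len(positions) - 1)
    )
    return net / total if total > 1e-6 else 1.0


def core_reward(speed, cte, max_cte=8.0, speed_cap=10.0):
    """speed_norm x cte_quality; max_cte=8.0 is an assumed default."""
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)
    speed_norm = min(max(speed, 0.0) / speed_cap, 1.0)
    return cte_quality * speed_norm


# Straight line: every step adds to net displacement.
straight = [(i * 0.5, 0.0) for i in range(30)]

# Circle of radius 0.5, one revolution every 12 steps (as in the tests).
circle = [
    (0.5 * math.cos(2 * math.pi * i / 12), 0.5 * math.sin(2 * math.pi * i / 12))
    for i in range(30)
]

print(core_reward(speed=5.0, cte=0.5))
print(path_efficiency(straight))
print(path_efficiency(circle))
```

With a 30-point window, the circular path's efficiency should fall below the `min_efficiency=0.15` threshold used in `test_circling_earns_zero_reward`, while the straight path stays at about 1.0. That gap is what the efficiency gate relies on.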