fix(core): replace exploit bandaids with solid physics barriers + clean reward
Root cause: barriers were zero-thickness MeshCollider planes with no CCD on the car. The car tunnelled through between frames. Every Python patch was trying to catch in code what physics should enforce.

Unity (source only — build in progress):
- RoadBuilder.cs: CreateBarrier() now makes BoxCollider-per-segment with real 3D volume (barrierThickness=1.0m default) + half-thickness overlap at corners to seal gaps. CreateEndCap() seals open ends of non-looping tracks (generated_road).
- Car.cs: rb.collisionDetectionMode = Continuous in Awake() — prevents tunneling.

Python:
- reward_wrapper.py v7: removed CTE-patience termination, high-CTE negative reward, solid_hit monitoring, low-speed/wedge detection. Kept: efficiency gate, no-progress (active_node) termination, lap exploit guard. Reward = speed×CTE_quality.
- exp23_generated_road_clean.py: single track, no warm-start, 200k steps, clean reward, MAX_EPISODE_SECONDS=120 as safety net only.
- tests: 17 tests covering clean reward properties.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent c5c4ca658e
commit 2d52bb4ffc
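The root cause named in the commit message can be illustrated with a small 1-D model. This is a hedged sketch, not Unity code: the step size, the speed, and the function names `discrete_hit` and `swept_hit` are assumptions chosen for illustration. It shows why sampling positions once per step can miss a zero-thickness wall, while a swept test (the idea behind continuous collision detection) or a wall thicker than one step catches the contact.

```python
# Illustrative 1-D model of tunnelling; all numbers are assumptions.

def discrete_hit(x_start, speed, dt, steps, wall_min, wall_max):
    """Check overlap only at the sampled positions (like Discrete mode)."""
    x = x_start
    for _ in range(steps):
        x += speed * dt
        if wall_min <= x <= wall_max:
            return True
    return False


def swept_hit(x_start, speed, dt, steps, wall_min, wall_max):
    """Check the whole segment travelled in each step (the idea behind CCD)."""
    x = x_start
    for _ in range(steps):
        x_next = x + speed * dt
        # The travelled interval [x, x_next] intersects [wall_min, wall_max].
        if x <= wall_max and x_next >= wall_min:
            return True
        x = x_next
    return False


DT = 0.02      # assumed physics step in seconds
SPEED = 10.0   # assumed car speed in m/s, i.e. 0.2 m travelled per step

# A zero-thickness wall at x = 1.05 sits between two sampled positions.
thin_discrete = discrete_hit(0.0, SPEED, DT, 20, 1.05, 1.05)
thin_swept = swept_hit(0.0, SPEED, DT, 20, 1.05, 1.05)
# A 1.0 m thick wall is wider than one step, so sampling cannot skip it.
thick_discrete = discrete_hit(0.0, SPEED, DT, 20, 1.05, 2.05)

print(thin_discrete, thin_swept, thick_discrete)  # False True True
```

Under these assumptions either fix alone stops the escape in the model; the commit applies both, which also covers any thin geometry that remains.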
|
|
@ -12,238 +12,171 @@ If the user says only `continue`, interpret it using the instruction above.
|
||||||
|
|
||||||
## Current Goal
|
## Current Goal
|
||||||
|
|
||||||
Stabilize the Unity simulator geometry and collision behavior enough that:
|
Run a clean, trustworthy exp23 on `generated_road` with:
|
||||||
|
- Solid BoxCollider barriers (car physically cannot escape)
|
||||||
|
- Clean reward: speed × CTE_quality + efficiency gate
|
||||||
|
- No artificial episode caps or Python-side exploit patches
|
||||||
|
|
||||||
- `generated_road` and `generated_track` both run without bad invisible barrier placement
|
Get RL training producing genuine improvement again.
|
||||||
- barrier contacts terminate episodes appropriately
|
|
||||||
- RL can restart from a trustworthy simulator build
|
|
||||||
|
|
||||||
## Important Paths
|
## Important Paths
|
||||||
|
|
||||||
Project:
|
Project:
|
||||||
|
|
||||||
- `/home/paulh/projects/donkeycar-rl-autoresearch`
|
- `/home/paulh/projects/donkeycar-rl-autoresearch`
|
||||||
|
|
||||||
Unity source project:
|
Unity source project:
|
||||||
|
|
||||||
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim`
|
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim`
|
||||||
|
|
||||||
Unity build output:
|
Unity build output:
|
||||||
|
|
||||||
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin`
|
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin`
|
||||||
|
|
||||||
Current runtime simulator folders in use:
|
Current runtime simulator folders in use:
|
||||||
|
|
||||||
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin`
|
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin`
|
||||||
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy`
|
- `/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy`
|
||||||
|
|
||||||
## Current RL Experiment Files
|
Unity build log:
|
||||||
|
- `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
|
||||||
|
|
||||||
- `agent/experiments/exp21_generated_pair_warm_v4.py`
|
## What Was Fixed This Session
|
||||||
- `agent/experiments/exp22_generated_pair_warm_v6.py`
|
|
||||||
|
|
||||||
Latest model/output folder:
|
### Root cause identified and fixed
|
||||||
|
|
||||||
- `agent/models/exp22-generated-pair-warm-v6`
|
**The car was escaping the track because:**
|
||||||
|
1. Barriers were zero-thickness `MeshCollider` planes — no physical volume
|
||||||
|
2. Car Rigidbody had no CCD — default `Discrete` mode allows tunneling
|
||||||
|
|
||||||
Current training run:
|
Together, these two problems let the car tunnel straight through
|
||||||
|
barrier walls between physics frames. Every Python-side "fix" (CTE termination,
|
||||||
|
time caps, hit detection) was attempting in Python what the physics engine was
|
||||||
|
failing to enforce.
|
||||||
|
|
||||||
- launched `agent/experiments/exp22_generated_pair_warm_v6.py`
|
### Unity changes (source updated, build in progress)
|
||||||
- PID file: `agent/models/exp22-generated-pair-warm-v6/current.pid`
|
|
||||||
- current PID at launch time: `609054`
|
|
||||||
- log: `agent/models/exp22-generated-pair-warm-v6/run_2026-05-05_141929_strictcte.log`
|
|
||||||
- startup verified: connected to `localhost:9091` and `localhost:9093`, loaded `generated_road` and `generated_track`, attached warm-start model, reached `Starting training...`
|
|
||||||
|
|
||||||
Latest urgent exploit fix:
|
|
||||||
|
|
||||||
- User observed generated_road still doing the large outside circle exploit.
|
|
||||||
- Stopped the previous run immediately.
|
|
||||||
- Patched `agent/reward_wrapper.py` so high CTE receives negative reward immediately during the patience window instead of falling through to positive speed reward.
|
|
||||||
- Patched `agent/experiments/exp22_generated_pair_warm_v6.py`:
|
|
||||||
- `MAX_CTE_TERMINATE = 2.5`
|
|
||||||
- `CTE_PATIENCE = 3`
|
|
||||||
- Added regression test `test_high_cte_never_gets_positive_speed_reward_before_termination`.
|
|
||||||
- Verified `python3 -m pytest -q tests/test_reward_wrapper.py`: `21 passed`.
|
|
||||||
|
|
||||||
## What Was Learned
|
|
||||||
|
|
||||||
### Training status
|
|
||||||
|
|
||||||
The latest meaningful `exp22` run was poor and should not be resumed as-is.
|
|
||||||
|
|
||||||
From `agent/models/exp22-generated-pair-warm-v6/run_2026-04-28_2132_openfix.log`:
|
|
||||||
|
|
||||||
- best `generated_track` eval reached only about `92` steps
|
|
||||||
- run was not trustworthy due to ongoing barrier-placement concerns
|
|
||||||
|
|
||||||
### Simulator behavior
|
|
||||||
|
|
||||||
- Invisible barriers are collider-only by default, so the user cannot see them in the standalone player
|
|
||||||
- Diagnostic probe showed both tracks could advance from the start before hitting `left_barrier`, so there was no obvious full-width blocker across the road start
|
|
||||||
- User screenshot suggested the car was getting trapped near the shoulder/edge, consistent with barrier corridor too close to the drivable edge
|
|
||||||
- User also reported that barrier contact sometimes blocks the car without promptly ending the episode
|
|
||||||
|
|
||||||
### Collision semantics
|
|
||||||
|
|
||||||
The user does **not** want every barrier brush to terminate the episode.
|
|
||||||
|
|
||||||
Desired behavior:
|
|
||||||
|
|
||||||
- light brush: can continue
|
|
||||||
- sustained contact: terminate
|
|
||||||
- head-on / abrupt stop: terminate quickly
|
|
||||||
|
|
||||||
## Code Changes Already Made
|
|
||||||
|
|
||||||
### Unity / simulator side
|
|
||||||
|
|
||||||
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
|
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
|
||||||
|
- Rewrote `CreateBarrier()`: now creates one `BoxCollider` per segment with real
|
||||||
|
3D volume (`barrierThickness` wide — default 1.0m)
|
||||||
|
- Segment boxes overlap by `barrierThickness * 0.5` to close corner gaps
|
||||||
|
- Added `CreateEndCap()`: seals the two open ends of non-looping tracks
|
||||||
|
(`generated_road` is `closeLoop=0` — without end caps the car can drive off
|
||||||
|
the ends of the track)
|
||||||
|
- Added `public float barrierThickness = 1.0f` field (inspector-editable)
|
||||||
|
- `showBarrierMeshes=true` now shows proper translucent 3D boxes, not flat planes
|
||||||
|
|
||||||
Implemented structural refactor:
|
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/Car.cs`
|
||||||
|
- Added `rb.collisionDetectionMode = CollisionDetectionMode.Continuous;` in
|
||||||
|
`Awake()` — prevents tunneling even against any remaining thin geometry
|
||||||
|
|
||||||
- explicit `closeLoop` support
|
### Python changes (committed)
|
||||||
- explicit road-edge generation
|
|
||||||
- barrier edges derived from left/right road edges instead of guessed centerline offset
|
|
||||||
- open tracks do not force wraparound
|
|
||||||
- debug polyline support via gizmos
|
|
||||||
|
|
||||||
Added runtime-visible debug barrier support:
|
`agent/reward_wrapper.py` → v7 (clean)
|
||||||
|
- REMOVED: CTE-patience termination, high-CTE negative reward, solid_hit
|
||||||
|
monitoring, low-speed/wedge detection, all exploit-closing bandaids
|
||||||
|
- KEPT: efficiency gate (zero reward when circling), no-progress termination
|
||||||
|
(active_node), lap exploit guard
|
||||||
|
- Reward: `speed_norm × CTE_quality` when efficiency passes gate
|
||||||
|
|
||||||
- `showBarrierMeshes`
|
`agent/experiments/exp23_generated_road_clean.py`
|
||||||
- `barrierDebugColor`
|
- Single track: `generated_road` on port 9091
|
||||||
- barrier objects now include `MeshFilter`
|
- No warm-start (fresh PPO weights)
|
||||||
- optional `MeshRenderer` added for visible translucent barriers
|
- `MAX_EPISODE_SECONDS=120` (generous safety net, not a training constraint)
|
||||||
|
- LR=0.0003, 200k total steps, checkpoints every 10k
|
||||||
|
|
||||||
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_road.unity`
|
`tests/test_reward_wrapper.py` — 17 tests, all pass
|
||||||
|
|
||||||
- `closeLoop = 0`
|
## Current State
|
||||||
- `doAddBarriers = 1`
|
|
||||||
- `showBarrierMeshes = 1`
|
|
||||||
- pinned road variation arrays to one entry
|
|
||||||
- `roadOffsets.Array.data[0] = 2.2`
|
|
||||||
|
|
||||||
`/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scenes/generated_track.unity`
|
### Unity build
|
||||||
|
- Build launched with PID 37896 on 2026-05-05
|
||||||
|
- Log: `C:\Users\Paul\AppData\Local\Temp\unity_rebuild.log`
|
||||||
|
- Check: `grep -q "Exiting batchmode successfully" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log && echo OK`
|
||||||
|
|
||||||
- `showBarrierMeshes = 1`
|
### After build completes
|
||||||
- `roadOffsetW = 2.2`
|
1. Sync to both runtime folders:
|
||||||
- barriers still enabled
|
```bash
|
||||||
|
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
|
||||||
|
'/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
|
||||||
|
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' \
|
||||||
|
'/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
|
||||||
|
```
|
||||||
|
|
||||||
### Python / RL side
|
2. Launch sims (only port 9091 needed for exp23 — single env):
|
||||||
|
```powershell
|
||||||
|
$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
|
||||||
|
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
|
||||||
|
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
|
||||||
|
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' `
|
||||||
|
-ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
|
||||||
|
```
|
||||||
|
|
||||||
`/home/paulh/projects/donkeycar-rl-autoresearch/agent/reward_wrapper.py`
|
3. Verify port:
|
||||||
|
```bash
|
||||||
|
python3 -c "import socket; s=socket.socket(); s.settimeout(3); s.connect(('127.0.0.1',9091)); print('PORT 9091: OK'); s.close()"
|
||||||
|
```
|
||||||
|
|
||||||
Latest intent:
|
4. Visually verify barriers in the sim window:
|
||||||
|
- `showBarrierMeshes=1` is already set in both scene files
|
||||||
|
- Translucent box barriers should be visible on BOTH sides of the road
|
||||||
|
- Verify no gaps at corners
|
||||||
|
- Verify end-cap walls at start and finish of generated_road
|
||||||
|
- **Do not start exp23 until Paul confirms barriers look correct**
|
||||||
|
|
||||||
- do **not** terminate instantly on every barrier hit
|
5. Launch exp23:
|
||||||
- terminate on sustained obstacle contact
|
```bash
|
||||||
- terminate on head-on style stop
|
cd /home/paulh/projects/donkeycar-rl-autoresearch
|
||||||
|
SAVE_DIR=agent/models/exp23-generated-road-clean
|
||||||
|
mkdir -p $SAVE_DIR
|
||||||
|
nohup python3 agent/experiments/exp23_generated_road_clean.py \
|
||||||
|
> $SAVE_DIR/run_$(date +%Y-%m-%d_%H%M%S)_clean.log 2>&1 &
|
||||||
|
echo $! > $SAVE_DIR/current.pid
|
||||||
|
```
|
||||||
|
|
||||||
Current patch in file:
|
## Key Parameters (exp23)
|
||||||
|
|
||||||
- tracks `_solid_hit_steps`
|
| Setting | Value | Why |
|
||||||
- tracks `_prev_speed`
|
|---|---|---|
|
||||||
- classifies solid hits via `hit` containing `barrier`, `wall`, or `tree`
|
| Track | generated_road | Single track — diagnose before adding second |
|
||||||
- immediate terminate on abrupt speed collapse while colliding
|
| LR | 0.0003 | Standard PPO starting LR |
|
||||||
- terminate after several consecutive solid-hit frames
|
| Total steps | 200k | More room to learn with clean signal |
|
||||||
|
| max_episode_seconds | 120s | Safety net only — physics does the work |
|
||||||
|
| MAX_CTE_TERMINATE | none | Removed — barriers are physical now |
|
||||||
|
| Warm-start | none | Previous warm-starts trained on broken reward |
|
||||||
|
| showBarrierMeshes | ON | Verify visually before committing to long run |
|
||||||
|
|
||||||
This was meant to replace the too-aggressive “any barrier hit = immediate death” logic.
|
## Success Criteria
|
||||||
|
|
||||||
## Most Recent Verified Build Status
|
- Car cannot drive past the barrier walls (verify visually)
|
||||||
|
- ep_len_mean should INCREASE over checkpoints (not frozen at 118)
|
||||||
Unity batch build for the debug-visible barrier version completed successfully.
|
- eval steps should improve at 20k, 30k, 40k checkpoints
|
||||||
|
- No evidence of outside-road circling in the reward curve
|
||||||
Evidence:
|
|
||||||
|
|
||||||
- build log ended with `Exiting batchmode successfully now!`
|
|
||||||
- return code `0`
|
|
||||||
|
|
||||||
The successful build has now been synced into both `Downloads` runtime folders and both simulators have been relaunched.
|
|
||||||
|
|
||||||
Current verified runtime state:
|
|
||||||
|
|
||||||
- main folder process owns port `9091`
|
|
||||||
- main folder also owns private API port `9092`
|
|
||||||
- copy folder process owns port `9093`
|
|
||||||
- copy folder also owns private API port `9094`
|
|
||||||
- Linux socket probe reported `PORT 9091: OK`, `PORT 9092: OK`, `PORT 9093: OK`, and `PORT 9094: OK`
|
|
||||||
- latest runtime build includes double-sided barrier mesh triangles for visual/debug barrier rendering
|
|
||||||
|
|
||||||
Note: the Windows profile uses shared Unity PlayerPrefs/registry values under `HKCU:\Software\DonkeyCar\donkey_sim`. Explicit `--port` args bind the servers correctly, but the in-sim UI can still show the saved PlayerPrefs value. Before launch, set `port_h2088097884`/`portPrivateAPI_h1325370089` to `9091`/`9092`, start the main sim, then set them to `9093`/`9094` and start the copy. Also keep passing explicit `--port 9091` and `--port 9093`.
|
|
||||||
|
|
||||||
Latest user visual inspection before double-sided patch:
|
|
||||||
|
|
||||||
- `generated_road`: barriers visible on both sides except missing on left side at the very start before the first curve
|
|
||||||
- `generated_track`: barrier visible only on the right/inside side when driving clockwise; no visible left/outside barrier
|
|
||||||
|
|
||||||
Likely diagnosis: barrier mesh was generated as a single-sided vertical plane and the Standard shader culled backfaces, so some debug barrier surfaces existed but were invisible from the road/camera side.
|
|
||||||
|
|
||||||
Latest simulator-side patch:
|
|
||||||
|
|
||||||
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Assets/Scripts/RoadBuilder.cs`
|
|
||||||
- `CreateBarrier(...)` now emits reverse-facing triangles for every barrier quad, making debug barrier meshes visible from both sides
|
|
||||||
- failed attempt: `Unlit/Transparent` made both tracks' barriers black in the standalone player
|
|
||||||
- failed attempt: duplicating reverse-facing triangles made `generated_track` barriers black, likely due coplanar transparent overdraw/z-fighting on the closed/scaled track
|
|
||||||
- current debug barrier mesh is back to one triangle set per quad; material uses `Standard` transparent mode with forced pale fallback color, alpha blend, culling off, and emission enabled so barriers should stay light/translucent while remaining visible from both sides
|
|
||||||
- Unity Windows batch build succeeded after this patch
|
|
||||||
- rebuilt output synced to both runtime folders and relaunched with explicit ports
|
|
||||||
|
|
||||||
## Immediate Next Steps
|
|
||||||
|
|
||||||
1. Monitor current exp22 training log/checkpoints.
|
|
||||||
|
|
||||||
2. Determine:
|
|
||||||
- are barriers too close to the road edge globally?
|
|
||||||
- or only wrong at specific bends / first-corner geometry?
|
|
||||||
|
|
||||||
3. Fix geometry if needed before restarting RL.
|
|
||||||
|
|
||||||
4. Only after geometry is visually verified, restart `exp22` or a successor experiment.
|
|
||||||
|
|
||||||
## Useful Commands
|
## Useful Commands
|
||||||
|
|
||||||
### Sync latest build into runtime folders
|
### Check build log
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin/'
|
tail -20 /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log
|
||||||
rsync -a --delete '/mnt/c/Users/Paul/Documents/projects/sdsandbox/sdsim/Builds/DonkeySimWin/' '/mnt/c/Users/Paul/Downloads/DonkeySimWin/DonkeySimWin - Copy/'
|
grep "Exiting batchmode\|Build failed\|error\|Error" /mnt/c/Users/Paul/AppData/Local/Temp/unity_rebuild.log | tail -5
|
||||||
```
|
```
|
||||||
|
|
||||||
### Launch sims from Windows side
|
### Monitor exp23
|
||||||
|
```bash
|
||||||
```powershell
|
tail -f agent/models/exp23-generated-road-clean/run_*_clean.log
|
||||||
$key = 'HKCU:\Software\DonkeyCar\donkey_sim'
|
|
||||||
|
|
||||||
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9091 -Type DWord
|
|
||||||
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9092 -Type DWord
|
|
||||||
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin\donkey_sim.exe' -ArgumentList '--port','9091' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin'
|
|
||||||
|
|
||||||
Start-Sleep -Seconds 4
|
|
||||||
|
|
||||||
Set-ItemProperty -Path $key -Name 'port_h2088097884' -Value 9093 -Type DWord
|
|
||||||
Set-ItemProperty -Path $key -Name 'portPrivateAPI_h1325370089' -Value 9094 -Type DWord
|
|
||||||
Start-Process -FilePath 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy\donkey_sim.exe' -ArgumentList '--port','9093' -WorkingDirectory 'C:\Users\Paul\Downloads\DonkeySimWin\DonkeySimWin - Copy'
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Verify ports
|
### Verify ports
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 - <<'PY'
|
python3 - <<'PY'
|
||||||
import socket
|
import socket
|
||||||
for p in (9091, 9093):
|
for p in (9091,):
|
||||||
s = socket.socket()
|
    s = socket.socket(); s.settimeout(3)
|
||||||
s.settimeout(3)
|
    try: s.connect(('127.0.0.1', p)); print(f'PORT {p}: OK')
|
||||||
try:
|
    except Exception as e: print(f'PORT {p}: FAIL {e}')
|
||||||
s.connect(('127.0.0.1', p))
|
    finally: s.close()
|
||||||
print(f'PORT {p}: OK')
|
|
||||||
except Exception as e:
|
|
||||||
print(f'PORT {p}: FAIL {e}')
|
|
||||||
finally:
|
|
||||||
s.close()
|
|
||||||
PY
|
PY
|
||||||
```
|
```
|
||||||
|
|
||||||
## Notes for Next Session
|
## Notes for Next Session
|
||||||
|
|
||||||
- If the user says `continue`, do not ask broad questions. Start with the immediate next steps above.
|
- If the user says `continue`, do not ask broad questions. Check build log → sync → launch → verify barriers → start exp23.
|
||||||
- Prefer direct verification over more RL training.
|
- **Barrier visual confirmation is required before starting exp23.** Paul must see the translucent 3D boxes on both sides of the road with no gaps before committing to a 200k training run.
|
||||||
- Do not restart long training until the user has visually confirmed the debug-visible barriers look correct.
|
- The second sim (port 9093) is not needed for exp23 — only launch one sim.
|
||||||
|
- Do not add generated_track back until generated_road training is verified working.
|
||||||
|
|
|
||||||
|
|
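The notes above describe `CreateBarrier()` as building one box per barrier segment, `barrierThickness` wide, with neighbouring boxes overlapping by half a thickness to close corner gaps. The following is a rough top-down (x, z) sketch of that geometry in Python, not the C# from `RoadBuilder.cs`. The function name `segment_boxes`, the choice to lengthen each box by half a thickness at both ends, and the yaw convention are all assumptions made for illustration.

```python
import math


def segment_boxes(points, thickness=1.0):
    """Return (center_x, center_z, length, thickness, yaw_deg) per segment.

    Assumption: each box is lengthened by thickness * 0.5 at both ends so
    that neighbouring boxes overlap where the polyline bends.
    """
    boxes = []
    for (x0, z0), (x1, z1) in zip(points, points[1:]):
        dx, dz = x1 - x0, z1 - z0
        seg_len = math.hypot(dx, dz)
        if seg_len == 0.0:
            continue  # skip duplicate points
        center_x = (x0 + x1) / 2.0
        center_z = (z0 + z1) / 2.0
        length = seg_len + thickness  # half a thickness added at each end
        # Assumed convention: yaw about the up axis, 0 degrees along +z.
        yaw_deg = math.degrees(math.atan2(dx, dz))
        boxes.append((center_x, center_z, length, thickness, yaw_deg))
    return boxes


# Hypothetical barrier edge with one bend.
edge = [(0.0, 0.0), (0.0, 4.0), (3.0, 8.0)]
for box in segment_boxes(edge):
    print(box)
```

With this layout the two boxes share the region around the bend at (0, 4), which is the property the visual check for gaps at corners is meant to confirm in the real build.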
@ -0,0 +1,236 @@
"""
Exp 23: Clean slate — generated_road, solid barriers, simple reward.

What changed from exp22:
- Single track: generated_road on port 9091 only (diagnose one track first)
- Simulator now uses BoxCollider barriers + CCD on the car Rigidbody.
  The car physically cannot escape. No Python-side exploit patches needed.
- Reward wrapper v7: speed × CTE_quality + efficiency gate + no-progress kill.
  Removed: CTE-patience termination, solid_hit detection, wedge detection,
  MAX_EPISODE_SECONDS hard cap.
- StuckTerminationWrapper: max_episode_seconds raised to 120s (genuine safety
  net only — physics handles the actual containment).
- No warm-start: fresh PPO weights. Previous warm-starts were trained under
  broken reward/barrier conditions and add more noise than signal.
- Total steps: 200k (more room to learn with clean signal).
"""
import os
import sys
import time
from datetime import datetime

sys.path.insert(0, '/home/paulh/projects/donkeycar-rl-autoresearch/agent')

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

from donkeycar_sb3_runner import ThrottleClampWrapper
from multitrack_runner import StuckTerminationWrapper
from reward_wrapper import SpeedRewardWrapper


HOST = 'localhost'
THROTTLE_MIN = 0.2
LR = 0.0003
TOTAL_STEPS = 200_000
CHECKPOINT_EVERY = 10_000
SAVE_DIR = '/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/exp23-generated-road-clean'
os.makedirs(SAVE_DIR, exist_ok=True)

# Reward wrapper v7 params — clean and minimal
EFFICIENCY_WINDOW = 30
MIN_EFFICIENCY = 0.15
MAX_CTE = 8.0
MIN_LAP_TIME = 12.0
PROGRESS_PATIENCE = 100  # steps without new waypoint → terminate

# StuckTerminationWrapper — generous limit, physics does the real work now
MAX_STUCK_SECONDS = 5.0
MAX_EPISODE_SECONDS = 120.0  # safety net only


def log(msg):
    print(f'[{datetime.now().strftime("%H:%M:%S")}] {msg}', flush=True)


def make_env(track_id, port):
    def _init():
        raw = gym.make(track_id, conf={'host': HOST, 'port': port})
        env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
        env = StuckTerminationWrapper(
            env,
            stuck_steps=40,
            min_displacement=0.5,
            max_stuck_seconds=MAX_STUCK_SECONDS,
            max_episode_seconds=MAX_EPISODE_SECONDS,
        )
        env = SpeedRewardWrapper(
            env,
            window_size=EFFICIENCY_WINDOW,
            min_efficiency=MIN_EFFICIENCY,
            max_cte=MAX_CTE,
            min_lap_time=MIN_LAP_TIME,
            progress_patience=PROGRESS_PATIENCE,
        )
        return env
    return _init


def make_eval_env(track_id, port):
    inner = make_env(track_id, port)()
    return VecTransposeImage(DummyVecEnv([lambda e=inner: e]))


log('=' * 60)
log('Exp 23: generated_road — clean barriers, clean reward')
log(f' Sim: {HOST}:9091 -> generated_road')
log(f' throttle_min={THROTTLE_MIN}, lr={LR}, total={TOTAL_STEPS:,}')
log(f' Reward: v7 (speed×CTE, efficiency gate, no-progress kill)')
log(f' Max stuck: {MAX_STUCK_SECONDS}s, episode cap: {MAX_EPISODE_SECONDS}s (safety net)')
log(f' Progress patience: {PROGRESS_PATIENCE} steps')
log(f' Checkpoints every {CHECKPOINT_EVERY:,} steps')
log('=' * 60)

log('Creating DummyVecEnv on generated_road...')
env = DummyVecEnv([make_env('donkey-generated-roads-v0', 9091)])
env = VecTransposeImage(env)
log(f' VecEnv num_envs={env.num_envs}, obs={env.observation_space.shape}')

model = PPO(
    'CnnPolicy',
    env,
    learning_rate=LR,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
    device='cpu',
)

# Write PID for external monitoring
pid_path = os.path.join(SAVE_DIR, 'current.pid')
with open(pid_path, 'w') as f:
    f.write(str(os.getpid()))

log(f'Fresh PPO model created. Starting training...')

best_total_steps = float('-inf')
best_total_reward = float('-inf')
steps_done = 0
run_tag = datetime.now().strftime('%Y-%m-%d_%H%M%S') + '_clean'
log_path = os.path.join(SAVE_DIR, f'run_{run_tag}.log')
best_model_path = os.path.join(SAVE_DIR, 'best_model.zip')

import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)],
)
file_log = logging.getLogger('exp23')

def flog(msg):
    ts = datetime.now().strftime('%H:%M:%S')
    file_log.info(f'[{ts}] {msg}')

flog('=' * 60)
flog(f'Exp 23 started — PID {os.getpid()}')
flog(f'Log: {log_path}')
flog('=' * 60)

while steps_done < TOTAL_STEPS:
    seg_steps = min(CHECKPOINT_EVERY, TOTAL_STEPS - steps_done)
    model.learn(total_timesteps=seg_steps, reset_num_timesteps=False)
    steps_done += seg_steps

    ckpt = os.path.join(SAVE_DIR, f'checkpoint_{steps_done:07d}')
    model.save(ckpt)
    model.save(os.path.join(SAVE_DIR, 'model'))
    flog(f'[{steps_done:,}/{TOTAL_STEPS:,}] Checkpoint saved: {ckpt}.zip')

    # Mid-training eval on generated_road
    try:
        obs = env.reset()
        ep_rewards = np.zeros(env.num_envs)
        ep_steps = np.zeros(env.num_envs)
        done_mask = np.zeros(env.num_envs, dtype=bool)

        for _ in range(2000):
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, dones, infos = env.step(action)
            for i in range(env.num_envs):
                if not done_mask[i]:
                    ep_rewards[i] += rewards[i]
                    ep_steps[i] += 1
                    if dones[i]:
                        done_mask[i] = True
            if done_mask.all():
                break

        total_steps_eval = int(ep_steps.sum())
        total_reward_eval = float(ep_rewards.sum())

        status = '✅' if ep_steps[0] >= 2000 else f'❌@{int(ep_steps[0])}'
        flog(f' Eval: gen_road={total_reward_eval:.1f}r/{int(ep_steps[0])}s {status}')

        if (total_steps_eval > best_total_steps
                or (total_steps_eval == best_total_steps
                    and total_reward_eval > best_total_reward)):
            best_total_steps = total_steps_eval
            best_total_reward = total_reward_eval
            model.save(best_model_path)
            flog(f' NEW BEST: steps={best_total_steps} reward={best_total_reward:.1f}')

    except Exception as e:
        flog(f' Eval error: {e}')

env.close()

# ── Final evaluation ──────────────────────────────────────────────────────────
flog('=' * 60)
flog('FINAL EVALUATION: best_model on generated_road')
flog('=' * 60)

EVAL_SETS = 3
EVAL_MAX_STEPS = 2000

steps_list = []
reward_list = []

for s in range(1, EVAL_SETS + 1):
    try:
        eval_env = make_eval_env('donkey-generated-roads-v0', 9091)
        eval_model = PPO.load(best_model_path, env=eval_env, device='cpu')
        obs = eval_env.reset()
        done = False
        total_s = 0
        total_r = 0.0

        while not done and total_s < EVAL_MAX_STEPS:
            action, _ = eval_model.predict(obs, deterministic=True)
            result = eval_env.step(action)
            obs, r, done = result[0], result[1], result[2]
            if hasattr(done, '__len__'):
                done = bool(done[0])
            total_r += float(r) if not hasattr(r, '__len__') else float(r[0])
            total_s += 1

        status = '✅' if total_s >= EVAL_MAX_STEPS else f'❌@{total_s}'
        flog(f' Set {s}: {total_r:.1f}r / {total_s}s {status}')
        steps_list.append(total_s)
        reward_list.append(total_r)
        eval_env.close()

    except Exception as e:
        flog(f' Set {s} error: {e}')

if steps_list:
    flog(f' Mean: {np.mean(steps_list):.0f} steps / {np.mean(reward_list):.1f} reward')

flog('Exp 23 complete.')
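The experiment script passes `window_size`, `min_efficiency`, and `max_cte` into `SpeedRewardWrapper`. As a rough guide to what those parameters control, here is a minimal sketch of the v7 formula as stated in the reward wrapper's docstring. It is not the wrapper's actual code: the function names `path_efficiency` and `v7_reward`, the position-window handling, and the treatment of a window with no movement are assumptions.

```python
import math
from collections import deque


def path_efficiency(positions):
    """Net displacement divided by total path length over (x, z) points."""
    pts = list(positions)
    total = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    if total == 0.0:
        return 1.0  # assumption: no movement yet is not treated as circling
    return math.dist(pts[0], pts[-1]) / total


def v7_reward(speed, cte, positions, done=False,
              max_cte=8.0, min_efficiency=0.15):
    """speed_norm x cte_quality, zeroed by the efficiency gate."""
    if done:
        return -1.0  # crash / termination penalty
    if path_efficiency(positions) < min_efficiency:
        return 0.0   # circling: no incentive
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)
    speed_norm = min(speed / 10.0, 1.0)
    return cte_quality * speed_norm


# Hypothetical position windows (maxlen mirrors EFFICIENCY_WINDOW = 30).
straight = deque([(0, 0), (0, 1), (0, 2)], maxlen=30)
loop = deque([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)], maxlen=30)

print(v7_reward(5.0, 2.0, straight))  # 0.375
print(v7_reward(5.0, 0.0, loop))      # 0.0
```

The gate is binary rather than multiplicative, so a car that makes forward progress keeps the full `cte_quality * speed_norm` signal, while a car that returns to where it started within the window earns nothing.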
@ -1,58 +1,36 @@
|
||||||
"""
|
"""
|
||||||
Speed + Progress Reward Wrapper for DonkeyCar RL — v6 (Speed×CTE + Efficiency Gate)
|
Speed × CTE Reward Wrapper for DonkeyCar RL — v7 (Clean)
|
||||||
=====================================================================================
|
=========================================================
|
||||||
|
|
||||||
REWARD HACKING HISTORY:
|
The simulator now uses solid BoxCollider barriers with Continuous Collision
|
||||||
v1 additive: speed × (1-cte/max_cte) → boundary oscillation
|
Detection on the car Rigidbody. The car physically cannot escape the track.
|
||||||
v2 multiplicative: original × (1+speed×scale) → circular driving (on-track)
|
This removes the need for every Python-side exploit patch that lived here:
|
||||||
v3 path efficiency: original × (1+speed×eff×scale) → still circling!
|
|
||||||
WHY v3 failed: efficiency killed the SPEED BONUS but not the BASE reward.
|
|
||||||
A spinning car at CTE≈0 still earns 1.0/step × thousands of steps.
|
|
||||||
v4: base × eff × (1 + speed_scale × speed) → zero gradient on hills!
|
|
||||||
WHY v4 failed on hills: speed≈0 AND eff≈0 AND cte_quality varies → all
|
|
||||||
three terms near zero simultaneously → no gradient to push ANY term up.
|
|
||||||
v5: speed × CTE_quality (no efficiency) → circular driving returns!
|
|
||||||
WHY v5 failed: dropped efficiency entirely. Circular driving at CTE≈0
|
|
||||||
with speed>0 earns positive reward indefinitely. Observed in Exp 11.
|
|
||||||
v6 (THIS VERSION): v5 reward + efficiency GATE.
|
|
||||||
Keeps v5's gradient properties (non-zero gradient on hills) but adds
|
|
||||||
a binary efficiency check that zeros reward when car is circling.
|
|
||||||
|
|
||||||
ROOT CAUSE OF CIRCLING:
|
REMOVED (simulator now enforces these physically):
|
||||||
The sim's own calc_reward() uses `forward_vel` = dot(car_heading, velocity).
|
- CTE-patience termination (car can't get far off track anyway)
|
||||||
A spinning car is ALWAYS moving "forward" relative to its own heading,
|
- High-CTE negative reward patch
|
||||||
so forward_vel > 0 always, giving positive reward while circling indefinitely.
|
- solid_hit / barrier-contact monitoring
|
||||||
We bypass this entirely.
|
- low-speed / wedge detection
|
||||||
|
|
||||||
FORMULA (v6):
|
KEPT (still needed — physics can't detect these):
|
||||||
cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1] centred=1
|
- Efficiency gate: zero reward when circling
|
||||||
speed_norm = min(speed / 10.0, 1.0) # [0,1] normalised
|
(car on-track but spinning in circles, not advancing)
|
||||||
efficiency = net_displacement / total_path # [0,1] straight=1, circle=0
|
- No-progress termination: active_node not advancing
|
||||||
|
(car stuck at waypoint, not completing the course)
|
||||||
|
- Lap exploit check: laps faster than min_lap_time should now be physically
|
||||||
|
impossible; the check is kept as a sanity guard
|
||||||
|
|
||||||
|
FORMULA:
|
||||||
|
cte_quality = 1.0 - min(|cte| / max_cte, 1.0) # [0,1]: centred=1
|
||||||
|
speed_norm = min(speed / 10.0, 1.0) # [0,1]: normalised
|
||||||
|
efficiency = net_displacement / total_path # [0,1]: straight=1, circle=0
|
||||||
|
|
||||||
if efficiency < min_efficiency:
|
if efficiency < min_efficiency:
|
||||||
reward = 0.0 # GATE: circling → zero reward (but not negative)
|
reward = 0.0 # circling — no incentive
|
||||||
else:
|
else:
|
||||||
reward = cte_quality × speed_norm # v5 formula (gradient on hills)
|
reward = cte_quality × speed_norm
|
||||||
|
|
||||||
On done/crash: reward = -1.0
|
On done/crash: reward = -1.0
|
||||||
|
|
||||||
WHY GATE NOT MULTIPLIER:
|
|
||||||
v4 used efficiency as a multiplier: reward = base × eff × speed_bonus.
|
|
||||||
On a hill: speed≈0, eff≈0, base≈0.5 → reward≈0 and ∂reward/∂speed≈0.
|
|
||||||
No gradient to push speed up — car stays stuck.
|
|
||||||
|
|
||||||
v6 gate: efficiency is either PASS or FAIL. When efficiency > threshold
|
|
||||||
(car moving forward at all), reward = speed × CTE_quality. On a hill:
|
|
||||||
car is stuck but still has eff > 0 (not literally circling), so the gate
|
|
||||||
passes and the reward = speed × CTE_quality. ∂reward/∂speed > 0 → gradient
|
|
||||||
pushes toward more throttle. Circle has eff ≈ 0 → gate fails → reward = 0.
|
|
||||||
|
|
||||||
PROPERTIES:
|
|
||||||
- Circling (eff<threshold): reward = 0 (no incentive to circle)
|
|
||||||
- On track, stuck (eff>0): reward = speed × CTE (gradient toward unstuck)
|
|
||||||
- On track, fast: reward = high (speed + centred)
|
|
||||||
- Off track: reward ≈ 0 (CTE_quality → 0)
|
|
||||||
- Crash: reward = -1.0
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import gymnasium as gym
|
import gymnasium as gym
|
||||||
|
|
@ -62,92 +40,49 @@ from collections import deque
|
||||||
|
|
||||||
class SpeedRewardWrapper(gym.Wrapper):
|
class SpeedRewardWrapper(gym.Wrapper):
|
||||||
"""
|
"""
|
||||||
Full reward bypass: speed × CTE_quality, gated by efficiency.
|
Reward = speed × CTE_quality, gated by path efficiency.
|
||||||
|
|
||||||
Completely ignores the sim's own reward (which uses forward_vel and is
|
|
||||||
exploitable by circular/spinning motion).
|
|
||||||
|
|
||||||
Exploit termination:
|
|
||||||
- Sustained high CTE (> max_cte_terminate for cte_patience steps): grass exploit
|
|
||||||
- No track progress (active_node max not advancing for progress_patience steps):
|
|
||||||
catches circular driving, stuck-on-cone, stuck-on-barrier.
|
|
||||||
A circling car stays near the same waypoints — active_node never advances.
|
|
||||||
A stuck car never advances either. Forward driving always advances.
|
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
env: gymnasium environment
|
env: gymnasium environment
|
||||||
speed_scale: speed bonus multiplier (default 0.1)
|
window_size: steps for efficiency gate history (default 30)
|
||||||
window_size: steps for efficiency gate (default 30)
|
min_efficiency: efficiency threshold — below this, reward = 0 (default 0.15)
|
||||||
min_efficiency: efficiency gate threshold (default 0.15)
|
max_cte: CTE at which reward reaches 0 (default 8.0)
|
||||||
max_cte: track half-width for reward normalization (default 8.0)
|
min_lap_time: laps faster than this are penalised as exploits (default 5.0)
|
||||||
min_lap_time: laps faster than this are penalised as exploits
|
progress_patience: steps without new max active_node before termination (default 60)
|
||||||
max_cte_terminate: terminate if CTE > this for cte_patience steps
|
|
||||||
cte_patience: steps of sustained high CTE before termination
|
|
||||||
progress_patience: steps without new max active_node before termination
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
self,
|
self,
|
||||||
env,
|
env,
|
||||||
speed_scale: float = 0.1,
|
|
||||||
window_size: int = 30,
|
window_size: int = 30,
|
||||||
min_efficiency: float = 0.15,
|
min_efficiency: float = 0.15,
|
||||||
max_cte: float = 8.0,
|
max_cte: float = 8.0,
|
||||||
min_lap_time: float = 5.0,
|
min_lap_time: float = 5.0,
|
||||||
max_cte_terminate: float = 4.0,
|
|
||||||
cte_patience: int = 20,
|
|
||||||
progress_patience: int = 60,
|
progress_patience: int = 60,
|
||||||
efficiency_patience: int = 20, # steps of low efficiency before termination
|
|
||||||
low_speed_patience: int = 20,
|
|
||||||
low_speed_threshold: float = 0.2,
|
|
||||||
low_speed_min_displacement: float = 0.25,
|
|
||||||
low_speed_grace_steps: int = 20,
|
|
||||||
):
|
):
|
||||||
super().__init__(env)
|
super().__init__(env)
|
||||||
self.speed_scale = speed_scale
|
|
||||||
self.window_size = window_size
|
self.window_size = window_size
|
||||||
self.min_efficiency = min_efficiency
|
self.min_efficiency = min_efficiency
|
||||||
self.max_cte = max_cte
|
self.max_cte = max_cte
|
||||||
self.min_lap_time = min_lap_time
|
self.min_lap_time = min_lap_time
|
||||||
self.max_cte_terminate = max_cte_terminate
|
|
||||||
self.cte_patience = cte_patience
|
|
||||||
self.progress_patience = progress_patience
|
self.progress_patience = progress_patience
|
||||||
self.efficiency_patience = efficiency_patience
|
|
||||||
self.low_speed_patience = low_speed_patience
|
|
||||||
self.low_speed_threshold = low_speed_threshold
|
|
||||||
self.low_speed_min_displacement = low_speed_min_displacement
|
|
||||||
self.low_speed_grace_steps = low_speed_grace_steps
|
|
||||||
self._pos_history = deque(maxlen=window_size + 1)
|
self._pos_history = deque(maxlen=window_size + 1)
|
||||||
self._last_lap_count = 0
|
self._last_lap_count = 0
|
||||||
self._high_cte_steps = 0
|
|
||||||
self._max_node_seen = -1
|
self._max_node_seen = -1
|
||||||
self._no_progress_steps = 0
|
self._no_progress_steps = 0
|
||||||
self._low_eff_steps = 0
|
|
||||||
self._solid_hit_steps = 0
|
|
||||||
self._prev_speed = 0.0
|
|
||||||
self._episode_steps = 0
|
|
||||||
self._low_speed_steps = 0
|
|
||||||
self._low_speed_anchor = None
|
|
||||||
|
|
||||||
def reset(self, **kwargs):
|
def reset(self, **kwargs):
|
||||||
result = self.env.reset(**kwargs)
|
result = self.env.reset(**kwargs)
|
||||||
self._pos_history.clear()
|
self._pos_history.clear()
|
||||||
self._last_lap_count = 0
|
self._last_lap_count = 0
|
||||||
self._high_cte_steps = 0
|
|
||||||
self._max_node_seen = -1
|
self._max_node_seen = -1
|
||||||
self._no_progress_steps = 0
|
self._no_progress_steps = 0
|
||||||
self._low_eff_steps = 0
|
|
||||||
self._solid_hit_steps = 0
|
|
||||||
self._prev_speed = 0.0
|
|
||||||
self._episode_steps = 0
|
|
||||||
self._low_speed_steps = 0
|
|
||||||
self._low_speed_anchor = None
|
|
||||||
return result
|
return result
|
||||||
|
|
||||||
def step(self, action):
|
def step(self, action):
|
||||||
result = self.env.step(action)
|
result = self.env.step(action)
|
||||||
|
|
||||||
# Handle both 4-tuple (old gym) and 5-tuple (gymnasium) APIs
|
|
||||||
if len(result) == 5:
|
if len(result) == 5:
|
||||||
obs, _sim_reward, terminated, truncated, info = result
|
obs, _sim_reward, terminated, truncated, info = result
|
||||||
done = terminated or truncated
|
done = terminated or truncated
|
||||||
|
|
@ -158,159 +93,54 @@ class SpeedRewardWrapper(gym.Wrapper):
|
||||||
else:
|
else:
|
||||||
raise ValueError(f'Unexpected step() result length: {len(result)}')
|
raise ValueError(f'Unexpected step() result length: {len(result)}')
|
||||||
|
|
||||||
# Completely ignore _sim_reward — compute our own
|
shaped, force_terminate = self._compute_reward(done, info)
|
||||||
shaped, force_terminate = self._compute_reward_and_done(done, info)
|
|
||||||
if force_terminate:
|
if force_terminate:
|
||||||
terminated = True
|
terminated = True
|
||||||
done = True
|
done = True
|
||||||
|
|
||||||
if len(result) == 5:
|
if len(result) == 5:
|
||||||
return obs, shaped, terminated, truncated, info
|
return obs, shaped, terminated, truncated, info
|
||||||
else:
|
return obs, shaped, done, info
|
||||||
return obs, shaped, done, info
|
|
||||||
|
|
||||||
def _compute_reward_and_done(self, done: bool, info: dict):
|
def _compute_reward(self, done: bool, info: dict):
|
||||||
"""
|
# Record position for efficiency calculation
|
||||||
v6.1: speed × CTE-quality + efficiency gate + grass/rollback terminators.
|
|
||||||
|
|
||||||
New termination conditions:
|
|
||||||
- Sustained high CTE: CTE > max_cte_terminate for cte_patience steps
|
|
||||||
→ terminate. Stops the grass exploit (car exits track gap and
|
|
||||||
drives indefinitely on grass with CTE just under max_cte=8.0).
|
|
||||||
- No track progress: active_node doesn't advance for progress_patience
|
|
||||||
steps → terminate. Stops mountain rollback (car goes up, rolls
|
|
||||||
back, IS moving so StuckWrapper doesn't fire, but never advances).
|
|
||||||
|
|
||||||
reward = speed_norm × cte_quality (when efficiency >= threshold)
|
|
||||||
reward = 0.0 (when circling)
|
|
||||||
reward = -1.0 (on crash/termination)
|
|
||||||
"""
|
|
||||||
# Track position for efficiency calculation
|
|
||||||
current_pos = None
|
|
||||||
try:
|
try:
|
||||||
pos = info.get('pos', (0.0, 0.0, 0.0))
|
pos = info.get('pos', (0.0, 0.0, 0.0))
|
||||||
pos_x = float(pos[0])
|
self._pos_history.append(np.array([float(pos[0]), float(pos[2])]))
|
||||||
pos_z = float(pos[2])
|
|
||||||
current_pos = np.array([pos_x, pos_z])
|
|
||||||
self._pos_history.append(current_pos)
|
|
||||||
except (TypeError, ValueError, IndexError):
|
except (TypeError, ValueError, IndexError):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
self._episode_steps += 1
|
|
||||||
|
|
||||||
# Crash / episode over
|
|
||||||
if done:
|
if done:
|
||||||
return -1.0, False
|
return -1.0, False
|
||||||
|
|
||||||
# --- CTE value for all checks ---
|
|
||||||
try:
|
try:
|
||||||
cte = float(info.get('cte', 0.0) or 0.0)
|
cte = float(info.get('cte', 0.0) or 0.0)
|
||||||
except (TypeError, ValueError):
|
except (TypeError, ValueError):
|
||||||
cte = 0.0
|
cte = 0.0
|
||||||
|
|
||||||
# --- Speed / collision classification ---
|
|
||||||
try:
|
try:
|
||||||
speed = max(0.0, float(info.get('speed', 0.0) or 0.0))
|
speed = max(0.0, float(info.get('speed', 0.0) or 0.0))
|
||||||
except (TypeError, ValueError):
|
except (TypeError, ValueError):
|
||||||
speed = 0.0
|
speed = 0.0
|
||||||
|
|
||||||
try:
|
# --- No-progress termination ---
|
||||||
hit = str(info.get('hit', 'none') or 'none').lower()
|
# Terminates episodes where the car isn't advancing along the track
|
||||||
except Exception:
|
# (circling near the start, stuck against a barrier, etc.).
|
||||||
hit = 'none'
|
|
||||||
|
|
||||||
solid_hit = (
|
|
||||||
hit != 'none' and (
|
|
||||||
'barrier' in hit or
|
|
||||||
'wall' in hit or
|
|
||||||
'tree' in hit
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
# Allow brief brushes, but terminate on:
|
|
||||||
# 1. a head-on style stop: car was moving, then collision arrives with
|
|
||||||
# a large speed drop; or
|
|
||||||
# 2. sustained obstacle contact over several telemetry frames.
|
|
||||||
if solid_hit:
|
|
||||||
head_on_impact = self._prev_speed >= 1.5 and speed <= 0.35
|
|
||||||
if head_on_impact:
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -1.0, True
|
|
||||||
|
|
||||||
self._solid_hit_steps += 1
|
|
||||||
if self._solid_hit_steps >= 4:
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -1.0, True
|
|
||||||
else:
|
|
||||||
self._solid_hit_steps = 0
|
|
||||||
|
|
||||||
# --- Wheels-spinning / barrier wedge termination ---
|
|
||||||
# CTE can remain deceptively acceptable when the car is pressed against
|
|
||||||
# a generated-road barrier or invisible collider. If speed stays near
|
|
||||||
# zero and position does not meaningfully change after the launch grace
|
|
||||||
# period, kill the episode quickly with a negative reward.
|
|
||||||
if (
|
|
||||||
current_pos is not None
|
|
||||||
and self._episode_steps > self.low_speed_grace_steps
|
|
||||||
and speed <= self.low_speed_threshold
|
|
||||||
):
|
|
||||||
if self._low_speed_anchor is None:
|
|
||||||
self._low_speed_anchor = current_pos
|
|
||||||
self._low_speed_steps = 1
|
|
||||||
else:
|
|
||||||
moved = float(np.linalg.norm(current_pos - self._low_speed_anchor))
|
|
||||||
if moved >= self.low_speed_min_displacement:
|
|
||||||
self._low_speed_anchor = current_pos
|
|
||||||
self._low_speed_steps = 0
|
|
||||||
else:
|
|
||||||
self._low_speed_steps += 1
|
|
||||||
|
|
||||||
if self._low_speed_steps >= self.low_speed_patience:
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -1.0, True
|
|
||||||
else:
|
|
||||||
self._low_speed_steps = 0
|
|
||||||
self._low_speed_anchor = current_pos
|
|
||||||
|
|
||||||
# --- Grass / outside-road exploit: high CTE is bad immediately ---
|
|
||||||
# Do not let the policy collect positive speed reward while it is
|
|
||||||
# outside the useful road corridor. Earlier versions only terminated
|
|
||||||
# after patience frames, but still paid positive reward during those
|
|
||||||
# frames; PPO learned large fast circles outside generated_road.
|
|
||||||
if abs(cte) > self.max_cte_terminate:
|
|
||||||
self._high_cte_steps += 1
|
|
||||||
if self._high_cte_steps >= self.cte_patience:
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -1.0, True # too long off-track — terminate
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -0.25, False
|
|
||||||
else:
|
|
||||||
self._high_cte_steps = 0
|
|
||||||
|
|
||||||
# --- Circle / stuck exploit: no track progress termination ---
|
|
||||||
# Track the highest active_node (track waypoint) reached this episode.
|
|
||||||
# A circling car stays near the same waypoints — max_node never advances.
|
|
||||||
# A stuck car never advances either. Only genuine forward driving advances.
|
|
||||||
# On lap completion, active_node resets to 0 — we reset our tracker too.
|
|
||||||
try:
|
try:
|
||||||
active_node = int(info.get('active_node', -1) or 0)
|
active_node = int(info.get('active_node', -1) or 0)
|
||||||
total_nodes = int(info.get('total_nodes', 1) or 1)
|
|
||||||
except (TypeError, ValueError):
|
except (TypeError, ValueError):
|
||||||
active_node = -1
|
active_node = -1
|
||||||
total_nodes = 1
|
|
||||||
|
|
||||||
if active_node >= 0:
|
if active_node >= 0:
|
||||||
if active_node > self._max_node_seen:
|
if active_node > self._max_node_seen:
|
||||||
# New furthest point reached — genuine forward progress
|
|
||||||
self._max_node_seen = active_node
|
self._max_node_seen = active_node
|
||||||
self._no_progress_steps = 0
|
self._no_progress_steps = 0
|
||||||
else:
|
else:
|
||||||
self._no_progress_steps += 1
|
self._no_progress_steps += 1
|
||||||
if self._no_progress_steps >= self.progress_patience:
|
if self._no_progress_steps >= self.progress_patience:
|
||||||
self._prev_speed = speed
|
return -1.0, True  # no forward progress for progress_patience steps: terminate
|
||||||
return -1.0, True # no forward progress — terminate
|
|
||||||
|
|
||||||
|
|
||||||
|
# --- Lap detection: reset progress tracker + exploit guard ---
|
||||||
try:
|
try:
|
||||||
current_lap_count = int(info.get('lap_count', 0) or 0)
|
current_lap_count = int(info.get('lap_count', 0) or 0)
|
||||||
except (TypeError, ValueError):
|
except (TypeError, ValueError):
|
||||||
|
|
@ -318,7 +148,6 @@ class SpeedRewardWrapper(gym.Wrapper):
|
||||||
|
|
||||||
if current_lap_count > self._last_lap_count:
|
if current_lap_count > self._last_lap_count:
|
||||||
self._last_lap_count = current_lap_count
|
self._last_lap_count = current_lap_count
|
||||||
# Reset progress tracker — active_node wraps to 0 on new lap
|
|
||||||
self._max_node_seen = -1
|
self._max_node_seen = -1
|
||||||
self._no_progress_steps = 0
|
self._no_progress_steps = 0
|
||||||
try:
|
try:
|
||||||
|
|
@ -326,47 +155,22 @@ class SpeedRewardWrapper(gym.Wrapper):
|
||||||
except (TypeError, ValueError):
|
except (TypeError, ValueError):
|
||||||
lap_time = 999.0
|
lap_time = 999.0
|
||||||
if lap_time < self.min_lap_time:
|
if lap_time < self.min_lap_time:
|
||||||
penalty = -10.0 * (self.min_lap_time / max(lap_time, 0.1))
|
return -10.0 * (self.min_lap_time / max(lap_time, 0.1)), True
|
||||||
self._prev_speed = speed
|
|
||||||
return penalty, True
|
|
||||||
|
|
||||||
# --- Efficiency gate: detect circular driving ---
|
# --- Efficiency gate: zero reward when circling ---
|
||||||
# Count consecutive steps of low efficiency. After patience steps, terminate.
|
if self._compute_efficiency() < self.min_efficiency:
|
||||||
# Previously this just returned 0 reward (no termination) which let circles
|
return 0.0, False
|
||||||
# run for 20+ seconds. Now we terminate after ~20 steps (~0.7s).
|
|
||||||
efficiency = self._compute_efficiency()
|
|
||||||
if efficiency < self.min_efficiency:
|
|
||||||
self._low_eff_steps += 1
|
|
||||||
if self._low_eff_steps >= self.efficiency_patience:
|
|
||||||
self._prev_speed = speed
|
|
||||||
return -1.0, True # circle too long — terminate
|
|
||||||
self._prev_speed = speed
|
|
||||||
return 0.0, False # still accumulating — zero reward
|
|
||||||
else:
|
|
||||||
self._low_eff_steps = 0
|
|
||||||
|
|
||||||
# --- CTE quality ---
|
# --- Core reward: speed × CTE quality ---
|
||||||
cte_quality = 1.0 - min(abs(cte) / self.max_cte, 1.0)
|
cte_quality = 1.0 - min(abs(cte) / self.max_cte, 1.0)
|
||||||
|
speed_norm = min(speed / 10.0, 1.0)
|
||||||
# --- Speed ---
|
|
||||||
# --- v6 reward: speed × CTE quality ---
|
|
||||||
speed_norm = min(speed / 10.0, 1.0)
|
|
||||||
self._prev_speed = speed
|
|
||||||
return cte_quality * speed_norm, False
|
return cte_quality * speed_norm, False
|
||||||
|
|
||||||
def _compute_efficiency(self) -> float:
|
def _compute_efficiency(self) -> float:
|
||||||
"""Path efficiency = net_displacement / total_path_length."""
|
|
||||||
if len(self._pos_history) < 3:
|
if len(self._pos_history) < 3:
|
||||||
return 1.0 # Insufficient history — give benefit of doubt
|
return 1.0  # too little history to judge: treat as efficient
|
||||||
|
|
||||||
positions = list(self._pos_history)
|
positions = list(self._pos_history)
|
||||||
net = np.linalg.norm(positions[-1] - positions[0])
|
net = float(np.linalg.norm(positions[-1] - positions[0]))
|
||||||
total = sum(
|
total = float(sum(np.linalg.norm(positions[i+1] - positions[i])
|
||||||
np.linalg.norm(positions[i + 1] - positions[i])
|
for i in range(len(positions) - 1)))
|
||||||
for i in range(len(positions) - 1)
|
return net / total if total > 1e-6 else 1.0
|
||||||
)
|
|
||||||
return float(net / total) if total > 1e-6 else 1.0
|
|
||||||
|
|
||||||
def theoretical_max_per_step(self, max_speed: float = 10.0) -> float:
|
|
||||||
"""Upper bound on reward/step (efficiency=1, CTE=0, max speed)."""
|
|
||||||
return 1.0 * 1.0 * (1.0 + self.speed_scale * max_speed)
|
|
||||||
|
|
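The v7 reward described in the docstring above can be sketched in isolation, outside the wrapper class. This is a minimal illustration, not the wrapper itself: `path_efficiency` and `gated_reward` are hypothetical helper names, positions are assumed to be `(x, z)` pairs, and the constants (`max_cte=8.0`, `min_efficiency=0.15`, speed normalised by 10.0) are copied from the defaults shown in the diff.

```python
import math


def path_efficiency(points):
    """Net displacement divided by total path length over a window of (x, z) points."""
    if len(points) < 3:
        return 1.0  # too little history to judge: treat as efficient
    net = math.dist(points[0], points[-1])
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return net / total if total > 1e-6 else 1.0


def gated_reward(speed, cte, points, max_cte=8.0, min_efficiency=0.15):
    """speed_norm * cte_quality, zeroed when the recent path looks like circling."""
    if path_efficiency(points) < min_efficiency:
        return 0.0  # gate fails: no incentive to circle
    cte_quality = 1.0 - min(abs(cte) / max_cte, 1.0)
    speed_norm = min(max(speed, 0.0) / 10.0, 1.0)
    return cte_quality * speed_norm


# Straight line: efficiency is 1.0, so the gate passes.
straight = [(i * 0.5, 0.0) for i in range(31)]

# Two full circles of radius 0.5: the car ends where it started, so efficiency is ~0.
circle = [
    (0.5 * math.cos(2 * math.pi * i / 12), 0.5 * math.sin(2 * math.pi * i / 12))
    for i in range(25)
]

print(gated_reward(5.0, 0.5, straight))  # (5/10) * (1 - 0.5/8.0)
print(gated_reward(5.0, 0.0, circle))    # gate fails
```

The straight-line case reproduces the arithmetic in the comment of `test_forward_driving_earns_positive_reward`, and the circular case mirrors the trajectory used in `test_circling_earns_zero_reward`.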
|
||||||
|
|
@ -1,30 +1,25 @@
|
||||||
"""
|
"""Tests for reward_wrapper.py v7 (clean: speed×CTE + efficiency gate)."""
|
||||||
Tests for reward_wrapper.py v4 (full sim bypass — base × efficiency × speed).
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys, os, math, pytest
|
import sys, os, math, pytest
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import gymnasium as gym
|
|
||||||
from collections import deque
|
|
||||||
|
|
||||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent'))
|
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent'))
|
||||||
from reward_wrapper import SpeedRewardWrapper
|
from reward_wrapper import SpeedRewardWrapper
|
||||||
|
|
||||||
|
import gymnasium as gym
|
||||||
|
|
||||||
# ---- Mock Environments ----
|
|
||||||
|
|
||||||
class MockEnv(gym.Env):
|
class MockEnv(gym.Env):
|
||||||
"""Configurable mock gymnasium.Env."""
|
|
||||||
metadata = {'render_modes': []}
|
metadata = {'render_modes': []}
|
||||||
|
|
||||||
def __init__(self, speed=2.0, cte=0.0, pos=(0., 0., 0.), done=False, use_5tuple=True):
|
def __init__(self, speed=2.0, cte=0.0, pos=(0., 0., 0.), done=False, use_5tuple=True):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.action_space = gym.spaces.Discrete(5)
|
self.action_space = gym.spaces.Discrete(5)
|
||||||
self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8)
|
self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8)
|
||||||
self._speed = speed
|
self._speed = speed
|
||||||
self._cte = cte
|
self._cte = cte
|
||||||
self._pos = list(pos)
|
self._pos = list(pos)
|
||||||
self._done = done
|
self._done = done
|
||||||
self._use_5tuple = use_5tuple
|
self._use_5tuple = use_5tuple
|
||||||
|
|
||||||
def set_pos(self, p): self._pos = list(p)
|
def set_pos(self, p): self._pos = list(p)
|
||||||
|
|
@ -34,523 +29,231 @@ class MockEnv(gym.Env):
|
||||||
return np.zeros((120, 160, 3), dtype=np.uint8), {}
|
return np.zeros((120, 160, 3), dtype=np.uint8), {}
|
||||||
|
|
||||||
def step(self, action):
|
def step(self, action):
|
||||||
obs = np.zeros((120, 160, 3), dtype=np.uint8)
|
obs = np.zeros((120, 160, 3), dtype=np.uint8)
|
||||||
# Sim reward uses forward_vel (exploitable) — wrapper should IGNORE this
|
sim_reward = 999.0 # deliberately bogus — wrapper must ignore this
|
||||||
sim_reward = 999.0 # Deliberately bogus — wrapper must not use this
|
info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos}
|
||||||
info = {'speed': self._speed, 'cte': self._cte, 'pos': self._pos}
|
|
||||||
if self._use_5tuple:
|
if self._use_5tuple:
|
||||||
return obs, sim_reward, self._done, False, info
|
return obs, sim_reward, self._done, False, info
|
||||||
return obs, sim_reward, self._done, info
|
return obs, sim_reward, self._done, info
|
||||||
|
|
||||||
def close(self): pass
|
|
||||||
|
# ── Helpers ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def make_info(cte=0.5, speed=2.0, pos=None, active_node=1, lap_count=0, lap_time=0.0):
|
||||||
|
return {
|
||||||
|
'cte': cte, 'speed': speed,
|
||||||
|
'pos': pos or (0., 0., 0.),
|
||||||
|
'active_node': active_node, 'total_nodes': 100,
|
||||||
|
'lap_count': lap_count, 'last_lap_time': lap_time,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def step_wrapped(wrapped_env, env, pos, cte=0.5, speed=2.0):
|
# ── Core reward properties ────────────────────────────────────────────────────
|
||||||
env.set_pos(pos)
|
|
||||||
env.set_cte(cte)
|
|
||||||
env._speed = speed
|
|
||||||
return wrapped_env.step(0)
|
|
||||||
|
|
||||||
|
|
||||||
# ---- Core v4 Properties ----
|
|
||||||
|
|
||||||
def test_sim_reward_is_completely_ignored():
|
def test_sim_reward_is_completely_ignored():
|
||||||
"""
|
|
||||||
The wrapper must NOT use the sim's reward (999.0).
|
|
||||||
v4 computes reward from scratch using CTE/pos/speed only.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=2.0, cte=0.5, pos=(0., 0., 0.))
|
env = MockEnv(speed=2.0, cte=0.5, pos=(0., 0., 0.))
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
|
wrapped = SpeedRewardWrapper(env)
|
||||||
wrapped.reset()
|
wrapped.reset()
|
||||||
_, reward, _, _, _ = wrapped.step(0)
|
_, reward, _, _, _ = wrapped.step(0)
|
||||||
assert reward != 999.0, "Wrapper must not pass through sim's bogus reward"
|
assert reward != 999.0
|
||||||
assert reward < 10.0, f"Reward should be small, got {reward}"
|
assert reward < 10.0
|
||||||
|
|
||||||
|
|
||||||
def test_circling_at_zero_cte_gives_near_zero_reward():
|
def test_crash_gives_negative_one():
|
||||||
"""
|
env = MockEnv(speed=5.0, cte=0.0, done=True)
|
||||||
v6: circling (low efficiency) should yield zero reward via the efficiency gate.
|
wrapped = SpeedRewardWrapper(env)
|
||||||
After enough steps of circular motion, the efficiency drops below threshold
|
|
||||||
and the gate zeros the reward.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.0)
|
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.15)
|
|
||||||
wrapped.reset()
|
wrapped.reset()
|
||||||
|
_, reward, _, _, _ = wrapped.step(0)
|
||||||
# Drive in a circle for enough steps to fill the position window
|
assert reward == -1.0
|
||||||
rewards = []
|
|
||||||
for i in range(40):
|
|
||||||
angle = 2 * math.pi * i / 12 # completes circle every 12 steps
|
|
||||||
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
|
||||||
_, r, _, _, _ = wrapped.step(0)
|
|
||||||
rewards.append(r)
|
|
||||||
|
|
||||||
# After 20+ steps of circular motion, efficiency gate should kick in
|
|
||||||
# Last few rewards should be 0.0
|
|
||||||
assert rewards[-1] == 0.0, (
|
|
||||||
f"v6: circular driving should yield 0.0 reward via efficiency gate, got {rewards[-1]:.4f}")
|
|
||||||
assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3, (
|
|
||||||
f"v6: most of last 5 rewards during circle should be 0.0, got {rewards[-5:]}")
|
|
||||||
|
|
||||||
|
|
||||||
def test_forward_driving_earns_positive_reward():
|
def test_forward_driving_earns_positive_reward():
|
||||||
"""Straight-line driving at low CTE and reasonable speed earns positive reward."""
|
|
||||||
env = MockEnv(speed=5.0, cte=0.5)
|
env = MockEnv(speed=5.0, cte=0.5)
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
|
wrapped = SpeedRewardWrapper(env, window_size=10)
|
||||||
wrapped.reset()
|
wrapped.reset()
|
||||||
_, r, _, _, _ = wrapped.step(0)
|
_, r, _, _, _ = wrapped.step(0)
|
||||||
# reward = (5/10) * (1 - 0.5/8) = 0.5 * 0.9375 = 0.469
|
# reward = (5/10) * (1 - 0.5/8.0) = 0.5 * 0.9375 = 0.469
|
||||||
assert r > 0.3, f"Forward driving should earn >0.3 reward, got {r:.4f}"
|
assert r > 0.3, f"Forward driving should earn >0.3, got {r:.4f}"
|
||||||
|
|
||||||
|
|
||||||
def test_forward_beats_circling_by_large_margin():
|
def test_higher_cte_reduces_reward():
|
||||||
"""
|
env_low = MockEnv(speed=2.0, cte=0.5)
|
||||||
v6: forward driving earns positive reward; circular driving earns zero.
|
env_high = MockEnv(speed=2.0, cte=4.0)
|
||||||
The efficiency gate ensures this gap.
|
w_low = SpeedRewardWrapper(env_low, window_size=5)
|
||||||
"""
|
w_high = SpeedRewardWrapper(env_high, window_size=5)
|
||||||
# Forward driving at CTE=1m, speed=5
|
w_low.reset(); w_high.reset()
|
||||||
|
for i in range(10):
|
||||||
|
env_low.set_pos( [i * 0.3, 0., 0.])
|
||||||
|
env_high.set_pos([i * 0.3, 0., 0.])
|
||||||
|
_, r_low, _, _, _ = w_low.step(0)
|
||||||
|
_, r_high, _, _, _ = w_high.step(0)
|
||||||
|
assert r_low > r_high
|
||||||
|
|
||||||
|
|
||||||
|
def test_higher_speed_increases_reward():
|
||||||
|
env_slow = MockEnv(speed=0.5, cte=1.0)
|
||||||
|
env_fast = MockEnv(speed=3.0, cte=1.0)
|
||||||
|
w_slow = SpeedRewardWrapper(env_slow, window_size=10)
|
||||||
|
w_fast = SpeedRewardWrapper(env_fast, window_size=10)
|
||||||
|
w_slow.reset(); w_fast.reset()
|
||||||
|
for i in range(15):
|
||||||
|
env_slow.set_pos([i * 0.1, 0., 0.])
|
||||||
|
env_fast.set_pos([i * 0.3, 0., 0.])
|
||||||
|
_, r_slow, _, _, _ = w_slow.step(0)
|
||||||
|
_, r_fast, _, _, _ = w_fast.step(0)
|
||||||
|
assert r_fast > r_slow
|
||||||
|
|
||||||
|
|
||||||
|
def test_4tuple_compatibility():
|
||||||
|
env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False)
|
||||||
|
env.set_pos([0., 0., 0.])
|
||||||
|
wrapped = SpeedRewardWrapper(env)
|
||||||
|
wrapped.reset()
|
||||||
|
result = wrapped.step(0)
|
||||||
|
assert len(result) == 4
|
||||||
|
_, reward, done, info = result
|
||||||
|
assert isinstance(reward, float)
|
||||||
|
assert reward != 999.0
|
||||||
|
|
||||||
|
|
||||||
|
# ── Efficiency gate ───────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def test_circling_earns_zero_reward():
|
||||||
|
env = MockEnv(speed=3.0, cte=0.0)
|
||||||
|
wrapped = SpeedRewardWrapper(env, window_size=30, min_efficiency=0.15)
|
||||||
|
wrapped.reset()
|
||||||
|
rewards = []
|
||||||
|
for i in range(40):
|
||||||
|
angle = 2 * math.pi * i / 12
|
||||||
|
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
||||||
|
_, r, _, _, _ = wrapped.step(0)
|
||||||
|
rewards.append(r)
|
||||||
|
assert rewards[-1] == 0.0
|
||||||
|
assert sum(1 for r in rewards[-5:] if r == 0.0) >= 3
|
||||||
|
|
||||||
|
|
||||||
|
def test_forward_beats_circling():
|
||||||
env_fwd = MockEnv(speed=5.0, cte=1.0)
|
env_fwd = MockEnv(speed=5.0, cte=1.0)
|
||||||
wrapped_fwd = SpeedRewardWrapper(env_fwd, speed_scale=0.1, window_size=30)
|
w_fwd = SpeedRewardWrapper(env_fwd, window_size=30)
|
||||||
wrapped_fwd.reset()
|
w_fwd.reset()
|
||||||
for i in range(35):
|
for i in range(35):
|
||||||
env_fwd.set_pos([i * 0.5, 0., 0.]) # straight line
|
env_fwd.set_pos([i * 0.5, 0., 0.])
|
||||||
_, r_fwd, _, _, _ = wrapped_fwd.step(0)
|
_, r_fwd, _, _, _ = w_fwd.step(0)
|
||||||
|
|
||||||
# Circular driving at CTE=0, speed=5
|
|
||||||
env_circ = MockEnv(speed=5.0, cte=0.0)
|
env_circ = MockEnv(speed=5.0, cte=0.0)
|
||||||
wrapped_circ = SpeedRewardWrapper(env_circ, speed_scale=0.1, window_size=30)
|
w_circ = SpeedRewardWrapper(env_circ, window_size=30)
|
||||||
wrapped_circ.reset()
|
w_circ.reset()
|
||||||
for i in range(35):
|
for i in range(35):
|
||||||
angle = 2 * math.pi * i / 12
|
angle = 2 * math.pi * i / 12
|
||||||
env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
env_circ.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
||||||
_, r_circ, _, _, _ = wrapped_circ.step(0)
|
_, r_circ, _, _, _ = w_circ.step(0)
|
||||||
|
|
||||||
assert r_fwd > 0, f"Forward driving should earn positive reward, got {r_fwd}"
|
assert r_fwd > 0
|
||||||
assert r_circ == 0.0, f"Circular driving should earn 0 reward, got {r_circ}"
|
assert r_circ == 0.0
|
||||||
assert r_fwd > r_circ, f"Forward ({r_fwd:.3f}) must beat circling ({r_circ:.3f})"
|
|
||||||
|
|
||||||
|
|
||||||
def test_crash_gives_negative_reward():
|
def test_history_clears_on_reset():
|
||||||
"""Episode termination (done=True) must always give -1.0."""
|
|
||||||
env = MockEnv(speed=5.0, cte=0.0, done=True)
|
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.2)
|
|
||||||
wrapped.reset()
|
|
||||||
_, reward, _, _, _ = wrapped.step(0)
|
|
||||||
assert reward == -1.0, f"Crash reward must be -1.0, got {reward}"
|
|
||||||
|
|
||||||
|
|
||||||
def test_high_cte_reduces_reward():
|
|
||||||
"""Higher CTE should reduce reward (closer to track edge = lower base)."""
|
|
||||||
env_low = MockEnv(speed=2.0, cte=0.5)
|
|
||||||
env_high = MockEnv(speed=2.0, cte=4.0)
|
|
||||||
|
|
||||||
wrapped_low = SpeedRewardWrapper(env_low, speed_scale=0.1, window_size=5)
|
|
||||||
wrapped_high = SpeedRewardWrapper(env_high, speed_scale=0.1, window_size=5)
|
|
||||||
wrapped_low.reset()
|
|
||||||
wrapped_high.reset()
|
|
||||||
|
|
||||||
# Drive straight so efficiency fills up
|
|
||||||
for i in range(10):
|
|
||||||
env_low.set_pos([i * 0.3, 0., 0.])
|
|
||||||
env_high.set_pos([i * 0.3, 0., 0.])
|
|
||||||
_, r_low, _, _, _ = wrapped_low.step(0)
|
|
||||||
_, r_high, _, _, _ = wrapped_high.step(0)
|
|
||||||
|
|
||||||
assert r_low > r_high, f"Low CTE ({r_low:.3f}) should reward more than high CTE ({r_high:.3f})"
|
|
||||||
|
|
||||||
|
|
||||||
def test_speed_bonus_increases_reward_when_on_track():
|
|
||||||
"""Faster forward driving earns more reward than slower forward driving."""
|
|
||||||
env_slow = MockEnv(speed=0.5, cte=1.0)
|
|
||||||
env_fast = MockEnv(speed=3.0, cte=1.0)
|
|
||||||
|
|
||||||
wrapped_slow = SpeedRewardWrapper(env_slow, speed_scale=0.1, window_size=10)
|
|
||||||
wrapped_fast = SpeedRewardWrapper(env_fast, speed_scale=0.1, window_size=10)
|
|
||||||
wrapped_slow.reset()
|
|
||||||
wrapped_fast.reset()
|
|
||||||
|
|
||||||
for i in range(15):
|
|
||||||
env_slow.set_pos([i * 0.1, 0., 0.])
|
|
||||||
env_fast.set_pos([i * 0.3, 0., 0.]) # Fast car covers more ground
|
|
||||||
_, r_slow, _, _, _ = wrapped_slow.step(0)
|
|
||||||
_, r_fast, _, _, _ = wrapped_fast.step(0)
|
|
||||||
|
|
||||||
assert r_fast > r_slow, f"Fast ({r_fast:.3f}) should earn more than slow ({r_slow:.3f})"
|
|
||||||
|
|
||||||
|
|
||||||
def test_theoretical_max_per_step():
|
|
||||||
"""Max reward/step = 1.0 × 1.0 × (1 + scale × max_speed) = 2.0 at scale=0.1, max=10."""
|
|
||||||
env = MockEnv()
|
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
|
|
||||||
assert wrapped.theoretical_max_per_step(max_speed=10.0) == pytest.approx(2.0, abs=1e-6)
|
|
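The bound quoted in the docstring of this removed test can be checked by hand. The sketch below only restates that docstring's arithmetic; the factor names `cte_quality` and `efficiency` are assumptions for the two `1.0` terms, and the function is a standalone illustration rather than the wrapper's method.

```python
def theoretical_max_per_step(speed_scale, max_speed, cte_quality=1.0, efficiency=1.0):
    """Upper bound from the docstring: 1.0 x 1.0 x (1 + scale x max_speed)."""
    return cte_quality * efficiency * (1.0 + speed_scale * max_speed)


# scale=0.1, max_speed=10 gives 1 + 0.1 * 10 = 2.0
print(theoretical_max_per_step(speed_scale=0.1, max_speed=10.0))
```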
||||||
|
|
||||||
|
|
||||||
def test_4tuple_step_compatibility():
|
|
||||||
"""Wrapper must handle 4-tuple step() return (old gym API)."""
|
|
||||||
env = MockEnv(speed=2.0, cte=0.5, use_5tuple=False)
|
|
||||||
env.set_pos([0., 0., 0.])
|
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
|
|
||||||
wrapped.reset()
|
|
||||||
result = wrapped.step(0)
|
|
||||||
assert len(result) == 4, f"Expected 4-tuple, got {len(result)}"
|
|
||||||
_, reward, done, info = result
|
|
||||||
assert isinstance(reward, float)
|
|
||||||
assert reward != 999.0, "Should not use sim reward"
|
|
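Supporting both the old 4-tuple and the newer 5-tuple `step()` return usually comes down to unpacking by length and repacking in the same shape. The sketch below shows one way to do that; `normalize_step` and `repack_step` are hypothetical helper names, not functions from this repository, and folding `truncated` into a single `done` flag is a simplification made for the example.

```python
def normalize_step(result):
    """Unpack a 4- or 5-tuple step result into (obs, reward, done, info, is_5tuple)."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated or truncated), info, True
    if len(result) == 4:
        obs, reward, done, info = result
        return obs, reward, bool(done), info, False
    raise ValueError(f"Unexpected step() result of length {len(result)}")


def repack_step(obs, reward, done, info, is_5tuple):
    """Return the result in the same tuple shape the inner env used."""
    if is_5tuple:
        # Simplification: report the merged flag as `terminated`, never `truncated`.
        return obs, reward, done, False, info
    return obs, reward, done, info


old_api = repack_step(*normalize_step((None, 999.0, False, {})))
new_api = repack_step(*normalize_step((None, 999.0, False, False, {})))
print(len(old_api), len(new_api))
```

A wrapper built this way can replace the simulator's reward between the two calls without caring which API the inner environment speaks.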
||||||
|
|
||||||
|
|
||||||
def test_reward_resets_on_episode_reset():
|
|
||||||
"""After reset, position history clears so efficiency recalculates cleanly."""
|
|
||||||
env = MockEnv(speed=2.0, cte=0.5)
|
env = MockEnv(speed=2.0, cte=0.5)
|
||||||
wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
|
wrapped = SpeedRewardWrapper(env, window_size=10)
|
||||||
wrapped.reset()
|
wrapped.reset()
|
||||||
|
|
||||||
# Fill with circular data
|
|
||||||
for i in range(15):
|
for i in range(15):
|
||||||
angle = 2 * math.pi * i / 12
|
angle = 2 * math.pi * i / 12
|
||||||
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
env.set_pos([0.5 * math.cos(angle), 0., 0.5 * math.sin(angle)])
|
||||||
wrapped.step(0)
|
wrapped.step(0)
|
||||||
|
|
||||||
# After reset, start fresh straight
|
|
||||||
wrapped.reset()
|
wrapped.reset()
|
||||||
rewards = []
|
rewards = []
|
||||||
for i in range(5):
|
for i in range(5):
|
||||||
env.set_pos([i * 0.3, 0., 0.])
|
env.set_pos([i * 0.3, 0., 0.])
|
||||||
_, r, _, _, _ = wrapped.step(0)
|
_, r, _, _, _ = wrapped.step(0)
|
||||||
rewards.append(r)
|
rewards.append(r)
|
||||||
|
assert rewards[-1] > 0
|
||||||
# Should get reasonable reward after fresh start
|
|
||||||
assert rewards[-1] > 0, "Should get positive reward after reset and straight driving"
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ── No-progress termination ───────────────────────────────────────────────────
|
||||||
# Short-lap exploit patch tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
def test_short_lap_triggers_penalty():
|
def test_no_progress_terminates():
|
||||||
"""
|
|
||||||
A lap completed faster than min_lap_time must return a large penalty,
|
|
||||||
not a positive reward. This closes the start/finish circle exploit.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
|
|
||||||
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
# Simulate step where a new lap completes in 1 second (exploit)
|
|
||||||
info = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
|
|
||||||
'lap_count': 1, 'last_lap_time': 1.0}
|
|
||||||
reward, _ = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert reward < 0, f'Short lap (1s) should penalise, got reward={reward}'
|
|
||||||
assert reward <= -10.0, f'Short lap penalty should be large (<= -10), got {reward}'
|
|
||||||
|
|
||||||
|
|
||||||
def test_legitimate_lap_not_penalised():
|
|
||||||
"""
|
|
||||||
A lap completed above min_lap_time must NOT trigger the penalty.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
|
|
||||||
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
# First step — no lap yet
|
|
||||||
info_no_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
|
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
wrapper._compute_reward_and_done(done=False, info=info_no_lap)
|
|
||||||
|
|
||||||
# Legitimate lap at 12 seconds
|
|
||||||
info = {'cte': 0.2, 'speed': 3.0, 'pos': (1.0, 0.0, 0.0),
|
|
||||||
'lap_count': 1, 'last_lap_time': 12.0}
|
|
||||||
reward, _ = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert reward >= 0, f'Legitimate lap (12s) should not be penalised, got {reward}'
|
|
||||||
|
|
||||||
|
|
||||||
def test_lap_count_not_double_penalised():
|
|
||||||
"""
|
|
||||||
Penalty fires exactly once per short lap, not on every subsequent step.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
|
|
||||||
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
# Short lap fires on step where lap_count increments
|
|
||||||
info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
|
|
||||||
'lap_count': 1, 'last_lap_time': 1.5}
|
|
||||||
r1, _ = wrapper._compute_reward_and_done(done=False, info=info_lap)
|
|
||||||
assert r1 < 0
|
|
||||||
|
|
||||||
# Next step same lap_count — should get normal reward, not another penalty
|
|
||||||
info_next = {'cte': 0.0, 'speed': 3.0, 'pos': (0.1, 0.0, 0.0),
|
|
||||||
'lap_count': 1, 'last_lap_time': 1.5}
|
|
||||||
r2, _ = wrapper._compute_reward_and_done(done=False, info=info_next)
|
|
||||||
assert r2 >= 0, f'Penalty should not repeat on same lap_count, got r2={r2}'
|
|
||||||
|
|
||||||
|
|
||||||
def test_lap_count_resets_on_episode_reset():
|
|
||||||
"""lap_count tracker must reset when the episode resets."""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.0, pos=(0.,0.,0.))
|
|
||||||
wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
# Complete a short lap
|
|
||||||
info_lap = {'cte': 0.0, 'speed': 3.0, 'pos': (0.0, 0.0, 0.0),
|
|
||||||
'lap_count': 1, 'last_lap_time': 1.0}
|
|
||||||
wrapper._compute_reward_and_done(done=False, info=info_lap)
|
|
||||||
assert wrapper._last_lap_count == 1
|
|
||||||
|
|
||||||
# Reset episode — counter must go back to 0
|
|
||||||
wrapper.reset()
|
|
||||||
assert wrapper._last_lap_count == 0
|
|
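The lap-guard behaviour these tests exercise (penalise a lap faster than `min_lap_time`, fire once per `lap_count` increment, clear on reset) can be sketched as a small state machine. This is an illustration only: `LapGuard` is a hypothetical class, and the `-10.0` penalty is an assumed value that merely satisfies the `<= -10.0` assertion; the wrapper's real penalty is not shown in this diff.

```python
class LapGuard:
    """Hypothetical sketch of a short-lap exploit guard."""

    def __init__(self, min_lap_time=5.0, penalty=-10.0):
        self.min_lap_time = min_lap_time
        self.penalty = penalty  # assumed value, see lead-in
        self.last_lap_count = 0

    def reset(self):
        # Called on episode reset so a stale lap count cannot mask a new lap.
        self.last_lap_count = 0

    def check(self, info):
        """Return the penalty on the step a too-fast lap completes, else None."""
        lap_count = info.get("lap_count", 0)
        if lap_count <= self.last_lap_count:
            return None  # no new lap on this step
        self.last_lap_count = lap_count
        if info.get("last_lap_time", 0.0) < self.min_lap_time:
            return self.penalty
        return None


guard = LapGuard(min_lap_time=5.0)
first = guard.check({"lap_count": 1, "last_lap_time": 1.0})   # short lap
repeat = guard.check({"lap_count": 1, "last_lap_time": 1.0})  # same lap_count
print(first, repeat)
```

Keying the penalty to the increment of `lap_count`, rather than to `last_lap_time` alone, is what keeps it from repeating on every later step of the same lap.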
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# v6.1 exploit terminator tests
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
def test_sustained_high_cte_terminates_episode():
|
|
||||||
"""
|
|
||||||
Grass exploit fix: if CTE exceeds max_cte_terminate for cte_patience
|
|
||||||
consecutive steps, the episode must be force-terminated with -1.0 reward.
|
|
||||||
This catches the generated_track gap where car drives indefinitely on grass.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=5.0) # CTE=5.0 > max_cte_terminate=4.0
|
|
||||||
wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
rewards = []
|
|
||||||
terminated = []
|
|
||||||
for _ in range(10):
|
|
||||||
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
|
|
||||||
'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
r, force_term = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
rewards.append(r)
|
|
||||||
terminated.append(force_term)
|
|
||||||
|
|
||||||
# High CTE should be punished immediately, then terminate at step 5
|
|
||||||
assert rewards[0] < 0, f'High CTE should be negative immediately, got {rewards[0]}'
|
|
||||||
assert terminated[4] == True, f'Should force-terminate at step 5, got {terminated}'
|
|
||||||
assert rewards[4] == -1.0, f'Termination reward should be -1.0, got {rewards[4]}'
|
|
||||||
assert terminated[0] == False, 'Should not terminate at step 1'
|
|
||||||
|
|
||||||
|
|
||||||
def test_high_cte_never_gets_positive_speed_reward_before_termination():
|
|
||||||
"""
|
|
||||||
Regression for generated_road outside-circle exploit: while CTE is outside
|
|
||||||
the allowed corridor, the wrapper must not pay positive speed reward during
|
|
||||||
the patience window. The policy should receive negative feedback
|
|
||||||
immediately, then termination.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=5.0, cte=3.0)
|
|
||||||
wrapper = SpeedRewardWrapper(env, max_cte_terminate=2.5, cte_patience=3)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
rewards = []
|
|
||||||
terminated = []
|
|
||||||
for i in range(3):
|
|
||||||
info = {
|
|
||||||
'cte': 3.0,
|
|
||||||
'speed': 5.0,
|
|
||||||
'pos': (float(i), 0.0, 0.0),
|
|
||||||
'active_node': i,
|
|
||||||
'total_nodes': 100,
|
|
||||||
'lap_count': 0,
|
|
||||||
'last_lap_time': 0.0,
|
|
||||||
}
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
rewards.append(r)
|
|
||||||
terminated.append(ft)
|
|
||||||
|
|
||||||
assert rewards[:2] == [-0.25, -0.25]
|
|
||||||
assert rewards[2] == -1.0
|
|
||||||
assert terminated == [False, False, True]
|
|
||||||
|
|
||||||
|
|
||||||
def test_high_cte_resets_when_back_on_track():
|
|
||||||
"""
|
|
||||||
High CTE counter must reset when car returns to track.
|
|
||||||
Prevents false termination after a brief excursion.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(env, max_cte_terminate=4.0, cte_patience=5)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
# 3 steps high CTE
|
|
||||||
for _ in range(3):
|
|
||||||
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
|
|
||||||
'active_node': 0, 'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert ft == False, 'Should not terminate after only 3 steps'
|
|
||||||
|
|
||||||
# 1 step back on track resets counter
|
|
||||||
info = {'cte': 1.0, 'speed': 3.0, 'pos': (0., 0., 0.),
|
|
||||||
'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert wrapper._high_cte_steps == 0, 'CTE counter should reset when back on track'
|
|
||||||
|
|
||||||
# 5 more steps high CTE — should now terminate (counter starts fresh)
|
|
||||||
for i in range(5):
|
|
||||||
info = {'cte': 5.0, 'speed': 3.0, 'pos': (0., 0., 0.),
|
|
||||||
'active_node': 1, 'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert ft == True, 'Should terminate after 5 new consecutive high-CTE steps'
|
|
||||||
|
|
||||||
|
|
||||||
def test_no_track_progress_terminates_episode():
|
|
||||||
"""
|
|
||||||
Circle/stuck exploit fix: if max active_node doesn't advance for
|
|
||||||
progress_patience steps, the episode must be force-terminated.
|
|
||||||
A circling car stays near the same waypoints — max_node never increases.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.5)
|
env = MockEnv(speed=3.0, cte=0.5)
|
||||||
wrapper = SpeedRewardWrapper(env, progress_patience=10)
|
wrapper = SpeedRewardWrapper(env, progress_patience=10)
|
||||||
wrapper.reset()
|
wrapper.reset()
|
||||||
|
|
||||||
# First step initialises max_node to 5, then 10 more steps stuck at 5 → terminate
|
|
||||||
for i in range(12):
|
for i in range(12):
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(i)*0.1, 0., 0.),
|
r, ft = wrapper._compute_reward(False, make_info(active_node=5, pos=(i*0.1, 0., 0.)))
|
||||||
'active_node': 5, 'total_nodes': 100,
|
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
if ft:
|
if ft:
|
||||||
break
|
break
|
||||||
|
assert ft is True
|
||||||
assert ft == True, 'Should terminate when max active_node not advancing'
|
|
||||||
assert r == -1.0
|
assert r == -1.0
|
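The no-progress rule tested above can be sketched as a counter over the highest `active_node` seen so far. This is a minimal sketch, not the wrapper's code: `NoProgressTracker` is a hypothetical class, and only the attribute names `max_node_seen` and `no_progress_steps` mirror the private attributes the tests inspect.

```python
class NoProgressTracker:
    """Hypothetical sketch: terminate when the max waypoint index stops advancing."""

    def __init__(self, patience):
        self.patience = patience
        self.reset()

    def reset(self):
        # -1 means "no node seen yet", so the first update always counts as progress.
        self.max_node_seen = -1
        self.no_progress_steps = 0

    def update(self, active_node):
        """Record one step; return True when the episode should be terminated."""
        if active_node > self.max_node_seen:
            self.max_node_seen = active_node
            self.no_progress_steps = 0
        else:
            self.no_progress_steps += 1
        return self.no_progress_steps >= self.patience


# Stuck at node 5: the first call initialises the max, ten more calls exhaust patience.
stuck = NoProgressTracker(patience=10)
stuck_flags = [stuck.update(5) for _ in range(12)]

# Circling: nodes oscillate 8, 9, 10 but never exceed the earlier max of 10.
circling = NoProgressTracker(patience=10)
circling.update(10)
circling_flags = [circling.update(8 + (i % 3)) for i in range(20)]
print(stuck_flags.index(True), circling_flags.index(True))
```

Because the counter compares against the running maximum rather than the previous node, oscillating between nearby waypoints counts as no progress. A completed lap would call `reset()` so that `active_node` wrapping back to 0 is not mistaken for a stall.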
||||||
|
|
||||||
|
|
||||||
def test_low_speed_no_displacement_terminates_barrier_wedge():
|
def test_progress_resets_counter():
|
||||||
"""
|
env = MockEnv()
|
||||||
Regression for invisible-barrier wedge: wheels can be commanded but the car
|
|
||||||
remains nearly motionless with acceptable CTE. This must terminate quickly
|
|
||||||
instead of returning zero/positive reward indefinitely.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=0.05, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(
|
|
||||||
env,
|
|
||||||
low_speed_grace_steps=2,
|
|
||||||
low_speed_patience=3,
|
|
||||||
low_speed_threshold=0.2,
|
|
||||||
low_speed_min_displacement=0.25,
|
|
||||||
progress_patience=100,
|
|
||||||
)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
terminated = False
|
|
||||||
reward = None
|
|
||||||
for _ in range(8):
|
|
||||||
info = {
|
|
||||||
'cte': 0.5,
|
|
||||||
'speed': 0.05,
|
|
||||||
'pos': (1.0, 0.0, 1.0),
|
|
||||||
'active_node': 5,
|
|
||||||
'total_nodes': 100,
|
|
||||||
'lap_count': 0,
|
|
||||||
'last_lap_time': 0.0,
|
|
||||||
}
|
|
||||||
reward, terminated = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
if terminated:
|
|
||||||
break
|
|
||||||
|
|
||||||
assert terminated is True
|
|
||||||
assert reward == -1.0
|
|
||||||
|
|
||||||
|
|
||||||
def test_low_speed_counter_resets_after_meaningful_displacement():
|
|
||||||
"""Slow starts should not terminate if the car is still changing position."""
|
|
||||||
env = MockEnv(speed=0.05, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(
|
|
||||||
env,
|
|
||||||
low_speed_grace_steps=0,
|
|
||||||
low_speed_patience=3,
|
|
||||||
low_speed_threshold=0.2,
|
|
||||||
low_speed_min_displacement=0.25,
|
|
||||||
progress_patience=100,
|
|
||||||
)
|
|
||||||
wrapper.reset()
|
|
||||||
|
|
||||||
for i in range(6):
|
|
||||||
info = {
|
|
||||||
'cte': 0.5,
|
|
||||||
'speed': 0.05,
|
|
||||||
'pos': (float(i) * 0.3, 0.0, 0.0),
|
|
||||||
'active_node': i,
|
|
||||||
'total_nodes': 100,
|
|
||||||
'lap_count': 0,
|
|
||||||
'last_lap_time': 0.0,
|
|
||||||
}
|
|
||||||
reward, terminated = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert terminated is False
|
|
||||||
|
|
||||||
|
|
||||||
def test_track_progress_resets_counter():
|
|
||||||
"""
|
|
||||||
Advancing to a new max active_node must reset the no-progress counter.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(env, progress_patience=5)
|
wrapper = SpeedRewardWrapper(env, progress_patience=5)
|
||||||
wrapper.reset()
|
wrapper.reset()
|
||||||
|
|
||||||
# Step forward: nodes 0, 1, 2, 3 — each new node resets counter
|
|
||||||
for node in range(4):
|
for node in range(4):
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (float(node)*0.5, 0., 0.),
|
r, ft = wrapper._compute_reward(False, make_info(active_node=node, pos=(node*0.5, 0., 0.)))
|
||||||
'active_node': node, 'total_nodes': 100,
|
assert ft is False
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
assert wrapper._no_progress_steps == 0
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert ft == False, f'Should not terminate when advancing (node {node})'
|
|
||||||
assert wrapper._no_progress_steps == 0, 'Counter should reset on new max node'
|
|
||||||
|
|
||||||
|
|
||||||
def test_circle_exploit_terminates():
|
def test_circling_active_node_terminates():
|
||||||
"""
|
env = MockEnv()
|
||||||
A car circling near the same spot should be terminated.
|
|
||||||
active_node oscillates but never exceeds the initial max.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(env, progress_patience=10)
|
wrapper = SpeedRewardWrapper(env, progress_patience=10)
|
||||||
wrapper.reset()
|
wrapper.reset()
|
||||||
|
wrapper._compute_reward(False, make_info(active_node=10))
|
||||||
# Set max_node to 10
|
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
|
|
||||||
'active_node': 10, 'total_nodes': 100,
|
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
|
|
||||||
# Now oscillate between nodes 8-10 (circling near node 10)
|
|
||||||
terminated = False
|
terminated = False
|
||||||
for i in range(20):
|
for i in range(20):
|
||||||
node = 8 + (i % 3) # oscillates 8, 9, 10, 8, 9, 10...
|
r, ft = wrapper._compute_reward(False, make_info(active_node=8 + (i % 3)))
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
|
|
||||||
'active_node': node, 'total_nodes': 100,
|
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
if ft:
|
if ft:
|
||||||
terminated = True
|
terminated = True
|
||||||
break
|
break
|
||||||
|
assert terminated
|
||||||
assert terminated, 'Circling (oscillating active_node, no new max) should terminate'
|
|
||||||
|
|
||||||
|
|
||||||
def test_lap_completion_resets_progress_tracker():
|
def test_lap_completion_resets_progress_tracker():
|
||||||
"""
|
env = MockEnv()
|
||||||
On lap completion, active_node resets to 0. Progress tracker must also
|
|
||||||
reset so the car isn't immediately terminated for 'no progress'.
|
|
||||||
"""
|
|
||||||
env = MockEnv(speed=3.0, cte=0.5)
|
|
||||||
wrapper = SpeedRewardWrapper(env, progress_patience=5, min_lap_time=5.0)
|
wrapper = SpeedRewardWrapper(env, progress_patience=5, min_lap_time=5.0)
|
||||||
wrapper.reset()
|
wrapper.reset()
|
||||||
|
wrapper._compute_reward(False, make_info(active_node=99))
|
||||||
# Drive to near end of track
|
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (1., 0., 0.),
|
|
||||||
'active_node': 99, 'total_nodes': 100,
|
|
||||||
'lap_count': 0, 'last_lap_time': 0.0}
|
|
||||||
wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
assert wrapper._max_node_seen == 99
|
assert wrapper._max_node_seen == 99
|
||||||
|
r, ft = wrapper._compute_reward(False, make_info(active_node=0, lap_count=1, lap_time=12.0))
|
||||||
# Complete a valid lap
|
assert wrapper._max_node_seen == -1
|
||||||
info = {'cte': 0.5, 'speed': 3.0, 'pos': (0., 0., 0.),
|
|
||||||
'active_node': 0, 'total_nodes': 100,
|
|
||||||
'lap_count': 1, 'last_lap_time': 12.0} # 12s lap = valid
|
|
||||||
r, ft = wrapper._compute_reward_and_done(done=False, info=info)
|
|
||||||
|
|
||||||
# Progress tracker should be reset
|
|
||||||
assert wrapper._max_node_seen == -1, 'max_node_seen should reset on lap completion'
|
|
||||||
assert wrapper._no_progress_steps == 0
|
assert wrapper._no_progress_steps == 0
|
||||||
assert ft == False, 'Valid lap should not terminate'
|
assert ft is False
|
||||||
|
|
||||||
|
|
# ── Lap exploit guard ─────────────────────────────────────────────────────────

def test_short_lap_penalised():
    env = MockEnv()
    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
    wrapper.reset()
    r, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.0))
    assert r < 0
    assert r <= -10.0


def test_legitimate_lap_not_penalised():
    env = MockEnv()
    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
    wrapper.reset()
    wrapper._compute_reward(False, make_info(lap_count=0))
    r, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=12.0, pos=(1., 0., 0.)))
    assert r >= 0


def test_lap_penalty_fires_once():
    env = MockEnv()
    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
    wrapper.reset()
    r1, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5))
    assert r1 < 0
    r2, _ = wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.5, pos=(0.1, 0., 0.)))
    assert r2 >= 0


def test_lap_count_resets_on_episode_reset():
    env = MockEnv()
    wrapper = SpeedRewardWrapper(env, min_lap_time=5.0)
    wrapper.reset()
    wrapper._compute_reward(False, make_info(lap_count=1, lap_time=1.0))
    assert wrapper._last_lap_count == 1
    wrapper.reset()
    assert wrapper._last_lap_count == 0