fix: path-efficiency reward (v3) defeats circular driving exploit
CRITICAL BUG FIX — Circular Driving:
- v2 reward still hackable: car circles at starting line with low CTE + positive speed
- Confirmed in data: trial 5 mean_reward=4582, cv=0.0% (physically impossible for genuine driving)
- Statistical signature: cv < 1% with high reward = consistent exploit, not genuine driving

ROOT CAUSE: Neither CTE nor raw speed can distinguish forward vs circular motion.
Both have: low CTE (on centerline) + positive speed (moving) = same reward.
Missing dimension: TRACK PROGRESS (net advance along track).

FIX — Path Efficiency Reward (v3):
    efficiency = net_displacement / total_path_length  (sliding window of 30 steps)
    shaped = original × (1 + speed_scale × speed × efficiency)
- Forward driving: efficiency ≈ 1.0 → full speed bonus
- Circular driving: efficiency ≈ 0.0 → speed bonus disappears
- Cannot be hacked: circling means returning to same positions (low net_displacement)

Tests:
- test_efficiency_near_zero_for_circular_driving: confirmed <0.2 efficiency for circles
- test_efficiency_near_one_for_straight_driving: confirmed >0.90 for straight line
- test_straight_driving_gets_higher_reward_than_circular: KEY guarantee
- test_speed_bonus_disappears_when_circling: bonus suppressed after window fills

Research documentation:
- Full analysis with data table added to docs/RESEARCH_LOG.md
- cv% identified as reward hacking indicator
- Archived circular data + models

Clean start: new autoresearch_results_phase1.jsonl, new champion dir

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +6 (path efficiency, anti-circular)
TypeScript: N/A
parent d25bc71008
commit fcb6ea1ac2
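A quick arithmetic check of the trial-5 exploit signature cited above (an illustrative sketch using only figures from this commit; the 1.3/step cap comes from the v2 analysis below):

```python
# Trial 5 circled on-track at speed ≈ 3 m/s under the v2 reward:
# per-step cap = 1.0 * (1 + speed_scale * speed) = 1.0 * (1 + 0.1 * 3) = 1.3
steps = 4787                      # trial 5 timesteps
per_step_cap = 1.0 * (1 + 0.1 * 3)
theoretical_max = steps * per_step_cap
print(round(theoretical_max, 1))            # 6223.1
print(round(4582.80 / theoretical_max, 3))  # 0.736 → the "74% of max" signature
```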
@@ -0,0 +1,46 @@
+"""
+== DATA ANALYSIS: Circular Driving Detection (2026-04-13) ==
+
+FINDINGS from Phase 1 data (autoresearch_results_phase1.jsonl):
+
+Trial  mean_rwd  std     rps    cv%    verdict
+1      270.56    0.143   0.086  0.1%   ⚠️ LOW STD suspicious — possibly circling
+4      627.69    2.35    0.147  0.4%   OK — low variance, moderate reward
+5      4582.80   0.485   0.957  0.0%   🚨 CIRCULAR — 74% of theoretical max, cv=0.0%
+6      454.06    2.73    0.092  0.6%   OK — consistent, plausible
+10     682.74    420.91  0.153  61.7%  ⚠️ UNSTABLE — extremely high variance
+11     404.52    14.47   0.084  3.6%   OK — reasonable variance
+
+KEY SIGNATURES OF CIRCULAR DRIVING:
+1. cv (coefficient of variation) < 1% with mean_reward > 200 → very CONSISTENT circling
+   - Trial 5: cv=0.0%, mean=4582 → textbook circular motion
+   - Trial 1: cv=0.1%, mean=270 → likely also circling, but slower
+
+2. reward/step approaching theoretical max → car is getting near-optimal reward continuously
+   - Trial 5: 0.957/step ≈ 74% of max (speed ≈ 3 m/s) → sustained on-track fast motion
+   - This is achievable by circling at the starting line!
+
+3. User visual confirmation → car going left in circles at starting position
+
+WHY OUR REWARD WRAPPER v2 STILL ALLOWS CIRCLING:
+The v2 fix was correct for the ADDITIVE formula (speed × f(cte)).
+The MULTIPLICATIVE formula prevents off-track hacking.
+BUT: a car circling ON-TRACK still gets the full speed bonus!
+- Car circles at start (CTE ≈ 0) → original_reward > 0
+- Car has speed 3 → shaped = 1.0 × (1 + 0.1 × 3) = 1.3/step
+- Over 4787 steps: max = 6223, actual = 4582 → 74% of max (car is on track most of the time!)
+
+THE FUNDAMENTAL PROBLEM:
+Neither CTE nor speed can distinguish FORWARD driving from CIRCULAR driving.
+Both have: low CTE (car is centered), positive speed (car is moving).
+
+We need a reward component that is ZERO for circular motion and POSITIVE for forward progress.
+
+SOLUTION: Path Efficiency Reward
+    efficiency = net_displacement / path_length  (over sliding window)
+    - Forward driving: efficiency ≈ 1.0 (all movement is productive)
+    - Circular driving: efficiency ≈ 0.0 (lots of movement, no net advance)
+    - Shaped reward: original × (1 + speed_scale × speed × efficiency)
+"""
+
+print(__doc__)
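As a quick cross-check of the SOLUTION formula above (an illustrative sketch, not a file in this commit), here is the efficiency and resulting per-step reward for straight vs circular motion:

```python
import numpy as np

def path_efficiency(positions):
    """net_displacement / total_path_length over a window of positions."""
    p = np.asarray(positions, dtype=np.float64)
    net = np.linalg.norm(p[-1] - p[0])
    total = sum(np.linalg.norm(p[i + 1] - p[i]) for i in range(len(p) - 1))
    return net / total if total > 1e-6 else 1.0

straight = [(0.1 * i, 0.0, 0.0) for i in range(30)]        # forward driving
angles = np.linspace(0.0, 2.0 * np.pi, 30)
circle = [(np.cos(a), 0.0, np.sin(a)) for a in angles]     # one full circle

for name, pos in [("straight", straight), ("circle", circle)]:
    eff = path_efficiency(pos)
    shaped = 1.0 * (1 + 0.1 * 3 * eff)  # original=1.0, speed_scale=0.1, speed=3
    print(f"{name}: efficiency={eff:.3f} shaped={shaped:.3f}/step")
# straight: efficiency=1.000 shaped=1.300/step
# circle:   efficiency≈0.000 shaped≈1.000/step
```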
@@ -220,3 +220,69 @@
 [2026-04-13 13:11:06] mean_reward=627.6915 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0009549126527603771, 'timesteps': 4279, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 13:11:06] mean_reward=454.0640 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0005165618383365869, 'timesteps': 4929, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 13:11:06] mean_reward=306.1739 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.0003097316245852375, 'timesteps': 4938, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:11:07] [AutoResearch] Git push complete after trial 10
+[2026-04-13 13:11:09]
+[AutoResearch] ========== Trial 11/50 ==========
+[2026-04-13 13:11:09] [AutoResearch] GP UCB top-5 candidates:
+[2026-04-13 13:11:09] UCB=2.7195 mu=2.5127 sigma=0.1034 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.000557522373554661, 'timesteps': 4805}
+[2026-04-13 13:11:09] UCB=2.5925 mu=1.9024 sigma=0.3451 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.00041865775623053806, 'timesteps': 4329}
+[2026-04-13 13:11:09] UCB=2.5803 mu=1.1875 sigma=0.6964 params={'n_steer': 7, 'n_throttle': 4, 'learning_rate': 0.00058177865639138, 'timesteps': 4419}
+[2026-04-13 13:11:09] UCB=2.4298 mu=2.0749 sigma=0.1775 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.0009718592685897328, 'timesteps': 4382}
+[2026-04-13 13:11:09] UCB=2.2735 mu=1.8243 sigma=0.2246 params={'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.0010522226184685407, 'timesteps': 4546}
+[2026-04-13 13:11:09] [AutoResearch] Proposed: {'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.000557522373554661, 'timesteps': 4805, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:11:11] [AutoResearch] Launching trial 11: {'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.000557522373554661, 'timesteps': 4805, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] [AutoResearch] Trial 11 finished in 313.9s, returncode=0
+[2026-04-13 13:16:25] [AutoResearch] Trial 11: mean_reward=404.5225 std_reward=14.4655
+[2026-04-13 13:16:25] [AutoResearch] === Trial 11 Summary ===
+[2026-04-13 13:16:25] Total Phase 1 runs: 11
+[2026-04-13 13:16:25] Champion: trial=5 mean_reward=4582.7984 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.0006801262090358742, 'timesteps': 4787, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] Top 5:
+[2026-04-13 13:16:25] mean_reward=4582.7984 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.0006801262090358742, 'timesteps': 4787, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] mean_reward=682.7352 params={'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.0010464507674264373, 'timesteps': 4450, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] mean_reward=627.6915 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0009549126527603771, 'timesteps': 4279, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] mean_reward=454.0640 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0005165618383365869, 'timesteps': 4929, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:25] mean_reward=404.5225 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.000557522373554661, 'timesteps': 4805, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:27]
+[AutoResearch] ========== Trial 12/50 ==========
+[2026-04-13 13:16:27] [AutoResearch] GP UCB top-5 candidates:
+[2026-04-13 13:16:27] UCB=13.7452 mu=12.5336 sigma=0.6058 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0020405598509922246, 'timesteps': 4862}
+[2026-04-13 13:16:27] UCB=10.6142 mu=10.0669 sigma=0.2737 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.0015753222508456746, 'timesteps': 4690}
+[2026-04-13 13:16:27] UCB=10.1293 mu=9.1255 sigma=0.5019 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0019449351233484984, 'timesteps': 4583}
+[2026-04-13 13:16:27] UCB=9.8667 mu=8.7033 sigma=0.5817 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.001937955818890541, 'timesteps': 4781}
+[2026-04-13 13:16:27] UCB=8.4705 mu=6.9561 sigma=0.7572 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.00246225593347489, 'timesteps': 4601}
+[2026-04-13 13:16:27] [AutoResearch] Proposed: {'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0020405598509922246, 'timesteps': 4862, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:16:29] [AutoResearch] Launching trial 12: {'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0020405598509922246, 'timesteps': 4862, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] [AutoResearch] Trial 12 finished in 440.6s, returncode=0
+[2026-04-13 13:23:50] [AutoResearch] Trial 12: mean_reward=14.6215 std_reward=0.0161
+[2026-04-13 13:23:50] [AutoResearch] === Trial 12 Summary ===
+[2026-04-13 13:23:50] Total Phase 1 runs: 12
+[2026-04-13 13:23:50] Champion: trial=5 mean_reward=4582.7984 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.0006801262090358742, 'timesteps': 4787, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] Top 5:
+[2026-04-13 13:23:50] mean_reward=4582.7984 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.0006801262090358742, 'timesteps': 4787, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] mean_reward=682.7352 params={'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.0010464507674264373, 'timesteps': 4450, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] mean_reward=627.6915 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0009549126527603771, 'timesteps': 4279, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] mean_reward=454.0640 params={'n_steer': 6, 'n_throttle': 3, 'learning_rate': 0.0005165618383365869, 'timesteps': 4929, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:50] mean_reward=404.5225 params={'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.000557522373554661, 'timesteps': 4805, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:52]
+[AutoResearch] ========== Trial 13/50 ==========
+[2026-04-13 13:23:52] [AutoResearch] GP UCB top-5 candidates:
+[2026-04-13 13:23:52] UCB=7.4556 mu=6.6123 sigma=0.4217 params={'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.00163041176468028, 'timesteps': 4836}
+[2026-04-13 13:23:52] UCB=7.1150 mu=6.5952 sigma=0.2599 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.0011607060392442735, 'timesteps': 4643}
+[2026-04-13 13:23:52] UCB=4.9263 mu=4.0036 sigma=0.4613 params={'n_steer': 8, 'n_throttle': 2, 'learning_rate': 0.0015871232867074373, 'timesteps': 4489}
+[2026-04-13 13:23:52] UCB=3.6250 mu=1.9044 sigma=0.8603 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0006337063098992063, 'timesteps': 1815}
+[2026-04-13 13:23:52] UCB=3.3082 mu=1.7605 sigma=0.7739 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.00018730022865904181, 'timesteps': 2136}
+[2026-04-13 13:23:52] [AutoResearch] Proposed: {'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.00163041176468028, 'timesteps': 4836, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:23:54] [AutoResearch] Launching trial 13: {'n_steer': 7, 'n_throttle': 2, 'learning_rate': 0.00163041176468028, 'timesteps': 4836, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 13:35:25] [AutoResearch] GP UCB top-5 candidates:
+[2026-04-13 13:35:25] UCB=2.7567 mu=1.2278 sigma=0.7644 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.002270622623224986, 'timesteps': 3888}
+[2026-04-13 13:35:25] UCB=2.7300 mu=1.1710 sigma=0.7795 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.002011397993568161, 'timesteps': 4033}
+[2026-04-13 13:35:25] UCB=2.6457 mu=1.4878 sigma=0.5790 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.00219005726516088, 'timesteps': 4774}
+[2026-04-13 13:35:25] UCB=2.6320 mu=1.1819 sigma=0.7250 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.0020813954690263674, 'timesteps': 4022}
+[2026-04-13 13:35:25] UCB=2.5412 mu=1.2499 sigma=0.6457 params={'n_steer': 8, 'n_throttle': 3, 'learning_rate': 0.0025942479713410636, 'timesteps': 4135}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
+[2026-04-13 13:35:25] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
+[2026-04-13 13:35:25] [AutoResearch] Only 1 results — using random proposal.
@@ -8,3 +8,5 @@
 {"trial": 8, "timestamp": "2026-04-13T13:01:28.616838", "params": {"n_steer": 8, "n_throttle": 3, "learning_rate": 0.0003097316245852375, "timesteps": 4938, "agent": "ppo", "eval_episodes": 3, "reward_shaping": true}, "mean_reward": 306.1739, "std_reward": 13.6044, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/trial-0008/model.zip", "champion": false, "run_status": "ok", "elapsed_sec": 303.6810266971588, "reward_hacking_suspected": false}
 {"trial": 9, "timestamp": "2026-04-13T13:05:16.112705", "params": {"n_steer": 7, "n_throttle": 3, "learning_rate": 0.0014813539623020004, "timesteps": 4054, "agent": "ppo", "eval_episodes": 3, "reward_shaping": true}, "mean_reward": 15.5625, "std_reward": 0.0011, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/trial-0009/model.zip", "champion": false, "run_status": "ok", "elapsed_sec": 223.47979998588562, "reward_hacking_suspected": false}
 {"trial": 10, "timestamp": "2026-04-13T13:11:06.106880", "params": {"n_steer": 7, "n_throttle": 2, "learning_rate": 0.0010464507674264373, "timesteps": 4450, "agent": "ppo", "eval_episodes": 3, "reward_shaping": true}, "mean_reward": 682.7352, "std_reward": 420.9113, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/trial-0010/model.zip", "champion": false, "run_status": "ok", "elapsed_sec": 345.9794178009033, "reward_hacking_suspected": false}
+{"trial": 11, "timestamp": "2026-04-13T13:16:25.498543", "params": {"n_steer": 7, "n_throttle": 3, "learning_rate": 0.000557522373554661, "timesteps": 4805, "agent": "ppo", "eval_episodes": 3, "reward_shaping": true}, "mean_reward": 404.5225, "std_reward": 14.4655, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/trial-0011/model.zip", "champion": false, "run_status": "ok", "elapsed_sec": 313.93063950538635, "reward_hacking_suspected": false}
+{"trial": 12, "timestamp": "2026-04-13T13:23:50.091027", "params": {"n_steer": 6, "n_throttle": 3, "learning_rate": 0.0020405598509922246, "timesteps": 4862, "agent": "ppo", "eval_episodes": 3, "reward_shaping": true}, "mean_reward": 14.6215, "std_reward": 0.0161, "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/trial-0012/model.zip", "champion": false, "run_status": "ok", "elapsed_sec": 440.5779378414154, "reward_hacking_suspected": false}
@@ -1,56 +1,76 @@
 """
-Speed-Aware Reward Wrapper for DonkeyCar RL — v2 (Hack-Proof)
-==============================================================
+Progress-Based Reward Wrapper for DonkeyCar RL — v3 (Anti-Circular)
+====================================================================
 
-DESIGN PRINCIPLE: Speed should only be rewarded when the car is
-genuinely progressing down the track. The original DonkeyCar reward
-already correctly signals track presence — we build on top of it.
+PROBLEM HISTORY:
+    v1 (additive): speed × (1 - cte/max_cte)
+        → Hacked by oscillating at track boundary (trials 8+13 in corrupted data)
+
+    v2 (multiplicative): original × (1 + speed_scale × speed)
+        → Still hacked by circling ON the track (trial 5: cv=0.0%, 4582 reward)
+        → Circular motion has low CTE + positive speed → full speed bonus
+        → Neither CTE nor raw speed can distinguish forward vs circular motion
+
+    v3 (path efficiency): original × (1 + speed_scale × speed × path_efficiency)
+        → Path efficiency = net_displacement / path_length over sliding window
+        → Forward driving: efficiency ≈ 1.0 (all movement is productive)
+        → Circular driving: efficiency ≈ 0.0 (movement cancels out, no net advance)
+        → Speed bonus disappears when circling → car incentivized to go FORWARD
 
 FORMULA:
-    if original_reward > 0 (car is on track and centered):
-        shaped = original_reward × (1 + speed_scale × speed)
-    else (car is off track / crashed):
-        shaped = original_reward (no speed bonus — cannot be hacked)
+    efficiency = |pos_t - pos_{t-window}| / Σ|pos_i - pos_{i-1}|
+               = net_displacement / total_path_length
+
+    shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
 
-WHY THIS IS HACK-PROOF:
-The previous formula (speed × (1 - cte/max_cte)) could be maximized
-by oscillating at the track boundary — the model learned this in practice.
-
-The multiplicative formula is bounded by the original DonkeyCar reward:
-- Off track → original_reward ≤ 0 → no speed multiplier possible
-- The model CANNOT increase reward by going fast off-track
-- Speed bonus only accumulates when genuinely driving on the track
+    (when original_reward ≤ 0: no bonus, just penalty — same as v2)
 
 RESEARCH NOTE (2026-04-13):
-The additive formula caused reward hacking in Phase 1 — trials 8 and 13
-achieved mean_reward=1936 and 1139 respectively by oscillating at the
-track boundary. This design was developed to prevent that exploit.
-See docs/RESEARCH_LOG.md for full details.
+Circular driving discovered in Phase 1 despite v2 fix.
+Trial 5: mean_reward=4582, cv=0.0% over 4787 steps.
+User visually confirmed: car circling at start line.
+See docs/RESEARCH_LOG.md for full analysis.
 
 TUNING:
-    speed_scale=0.1 means a car going 5 m/s gets a 50% bonus on top of
-    the base CTE reward. This is a meaningful but not overwhelming incentive.
-    Increase to 0.3+ to prioritize speed more aggressively (Phase 3).
+    window_size: how many steps to measure efficiency over (default 30)
+        - Too small: noisy, sensitive to brief oscillations
+        - Too large: slow to detect circling, may miss short circular segments
+    speed_scale: speed bonus multiplier (default 0.1)
+    min_efficiency: minimum efficiency before the speed bonus disappears (default 0.05)
 """
 
 import gymnasium as gym
 import numpy as np
+from collections import deque
 
 
 class SpeedRewardWrapper(gym.Wrapper):
     """
-    Hack-proof speed reward: multiplicative bonus ONLY when on track.
+    Path-efficiency-gated speed reward.
+    The speed bonus scales with how much NET FORWARD PROGRESS the car is making.
 
     Args:
         env: gymnasium environment
-        speed_scale: multiplier for speed bonus (default 0.1)
-            shaped = original × (1 + speed_scale × speed) when on track
-            shaped = original when off track
+        speed_scale: speed bonus multiplier (default 0.1)
+        window_size: number of steps for efficiency measurement (default 30)
+        min_efficiency: efficiency floor below which speed bonus is zero (default 0.05)
     """
 
-    def __init__(self, env, speed_scale: float = 0.1):
+    def __init__(self, env, speed_scale: float = 0.1, window_size: int = 30, min_efficiency: float = 0.05):
         super().__init__(env)
         self.speed_scale = speed_scale
+        self.window_size = window_size
+        self.min_efficiency = min_efficiency
+
+        # Sliding window of positions for efficiency calculation
+        self._pos_history = deque(maxlen=window_size + 1)
+        self._path_length = 0.0
+
+    def reset(self, **kwargs):
+        result = self.env.reset(**kwargs)
+        self._pos_history.clear()
+        self._path_length = 0.0
+        return result
 
     def step(self, action):
         result = self.env.step(action)
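For orientation, applying the v3 wrapper would look roughly like this (a hedged sketch: the environment ID is a placeholder, only the constructor signature comes from the diff above):

```python
import gymnasium as gym
from reward_wrapper import SpeedRewardWrapper

# Placeholder env ID — substitute the project's actual DonkeyCar sim environment.
env = gym.make("donkey-generated-track-v0")

# Defaults from the v3 signature above.
env = SpeedRewardWrapper(env, speed_scale=0.1, window_size=30, min_efficiency=0.05)

obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```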
@@ -73,30 +93,68 @@ class SpeedRewardWrapper(gym.Wrapper):
         else:
             return obs, shaped, done, info
 
+    def _get_pos(self, info: dict):
+        """Extract position from info dict. Returns None if unavailable."""
+        pos = info.get('pos', None)
+        if pos is None:
+            return None
+        try:
+            return np.array(pos[:3], dtype=np.float64)
+        except (TypeError, IndexError, ValueError):
+            return None
+
+    def _compute_efficiency(self) -> float:
+        """
+        Compute path efficiency = net displacement / total path length over window.
+        Returns 1.0 if there is insufficient history (can't penalize yet).
+        Returns 1.0 if the car is not moving at all (left to the health check).
+        """
+        if len(self._pos_history) < 3:
+            return 1.0  # Not enough history, give benefit of the doubt
+
+        positions = list(self._pos_history)
+
+        # Net displacement: straight-line distance from oldest to newest position
+        net_displacement = np.linalg.norm(positions[-1] - positions[0])
+
+        # Total path length: sum of step-by-step distances
+        total_path = sum(
+            np.linalg.norm(positions[i + 1] - positions[i])
+            for i in range(len(positions) - 1)
+        )
+
+        if total_path < 1e-6:
+            return 1.0  # Car not moving at all, don't penalize (will be caught by health check)
+
+        return float(net_displacement / total_path)
+
     def _shape_reward(self, original_reward: float, info: dict) -> float:
-        """
-        Multiplicative speed bonus — only when on track.
-        Falls back gracefully if speed not in info dict.
-        """
+        """Apply path-efficiency-gated speed bonus."""
+        # Update position history
+        pos = self._get_pos(info)
+        if pos is not None:
+            self._pos_history.append(pos)
 
         # Only apply speed bonus when genuinely on track (positive CTE reward)
         if original_reward <= 0:
             return original_reward  # Off track / crashed — no speed reward
 
-        # Extract speed from info dict
+        # Extract speed
         try:
-            speed = float(info.get('speed', 0.0))
-            if speed is None:
-                return original_reward
-            speed = max(0.0, speed)  # No negative speed bonus
+            speed = max(0.0, float(info.get('speed', 0.0) or 0.0))
         except (TypeError, ValueError):
-            return original_reward  # Graceful fallback
+            return original_reward
 
-        # Multiplicative bonus: reward grows with speed, but only on track
-        # Hack-proof: cannot increase by going fast off-track
-        shaped = original_reward * (1.0 + self.speed_scale * speed)
+        # Compute path efficiency (detects circular motion)
+        efficiency = self._compute_efficiency()
+
+        # Clamp efficiency: below min_efficiency, no speed bonus (linear rescale to [0, 1])
+        effective_efficiency = max(0.0, (efficiency - self.min_efficiency) / (1.0 - self.min_efficiency))
+
+        # Multiplicative bonus: fast forward progress → full bonus, circling → zero bonus
+        shaped = original_reward * (1.0 + self.speed_scale * speed * effective_efficiency)
         return shaped
 
     def theoretical_max_per_step(self, max_speed: float = 10.0) -> float:
-        """Returns the theoretical max reward per step for bounds checking."""
-        # original_reward ≤ 1.0, so shaped ≤ 1.0 × (1 + speed_scale × max_speed)
-        return 1.0 * (1.0 + self.speed_scale * max_speed)
+        """Upper bound on reward per step (for hack detection calibration)."""
+        return 1.0 * (1.0 + self.speed_scale * max_speed * 1.0)  # efficiency=1 at best
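The `effective_efficiency` clamp above is a linear rescale from [min_efficiency, 1] onto [0, 1]; a few spot values with the default floor of 0.05 (illustrative numbers only):

```python
MIN_EFFICIENCY = 0.05  # default floor from the diff above

def effective_efficiency(eff: float) -> float:
    # Same expression as in _shape_reward
    return max(0.0, (eff - MIN_EFFICIENCY) / (1.0 - MIN_EFFICIENCY))

for eff in (0.0, 0.05, 0.5, 1.0):
    print(eff, round(effective_efficiency(eff), 3))
# 0.0  → 0.0    circling: bonus fully suppressed
# 0.05 → 0.0    at the floor: still no bonus
# 0.5  → 0.474  partial progress: partial bonus
# 1.0  → 1.0    straight driving: full bonus
```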
@@ -180,3 +180,70 @@ From this experience, we derived the following principles for DonkeyCar RL rewar
 3. **Does the multiplicative reward fix change the optimal hyperparameter region?** Re-run autoresearch with fixed reward and compare top configurations.
 4. **Can we detect reward hacking automatically?** A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training.
 5. **What does a genuinely good reward look like?** After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
+
+---
+
+## 2026-04-13 — Critical Discovery: Circular Driving Exploit (v2 Reward Still Hackable)
+
+### Finding: Car Learns to Circle at Starting Line
+
+**User observation (direct visual):** "The model found a way to rig the reward by going left in circles — it was off the track and then back on track, but detected as failure. Model uses this as best way to maximize reward."
+
+**Data confirmation:**
+
+| Trial | mean_reward | std_reward | cv%      | r/step    | verdict |
+|-------|-------------|------------|----------|-----------|---------|
+| 1     | 270.56      | 0.143      | 0.1%     | 0.086     | ⚠️ CIRCULAR (suspiciously low std) |
+| 5     | **4582.80** | **0.485**  | **0.0%** | **0.957** | 🚨 CIRCULAR (confirmed) |
+| 10    | 682.74      | 420.91     | 61.7%    | 0.153     | ⚠️ UNSTABLE (sometimes circles, sometimes crashes) |
+
+**Statistical signature of circular motion:**
+- cv (coefficient of variation = std/mean) < 1% with high reward → very consistent behavior
+- Circular driving IS very consistent: every circle is the same
+- Legitimate driving is stochastic: different obstacles, curves, luck
+- Trial 5: cv=0.0% over 3 eval episodes → textbook circling
+
+**Why v2 reward still allowed this:**
+- v2 fix: `reward = original × (1 + speed_scale × speed)` ONLY when on track
+- Car circling at the starting line HAS: low CTE (on track centerline) + positive speed
+- Result: full speed bonus for circling → 4582 reward over 4787 steps
+- CTE and raw speed cannot distinguish forward from circular motion
+
+### Root Cause: Missing Dimension — Track Progress
+
+The fundamental issue: **neither CTE nor speed captures PROGRESS along the track.**
+- CTE measures: am I near the centerline? (yes for circles)
+- Speed measures: am I moving? (yes for circles)
+- Progress measures: am I getting anywhere new? (NO for circles)
+
+### Fix: Path Efficiency Reward (v3)
+
+**Formula:**
+```
+efficiency = net_displacement / total_path_length   (over sliding window of 30 steps)
+shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
+```
+
+**Why this works:**
+- Forward driving: `efficiency ≈ 1.0` (all movement is productive)
+- Circular driving: `efficiency ≈ 0.0` (lots of steps, car returns to start position)
+- The speed bonus disappears when circling → car incentivized to go FORWARD
+
+**Proof (tests):**
+- `test_efficiency_near_zero_for_circular_driving`: efficiency < 0.2 after full circle
+- `test_efficiency_near_one_for_straight_driving`: efficiency > 0.90 for straight line
+- `test_straight_driving_gets_higher_reward_than_circular`: key guarantee test
+
+**Data archived:**
+- `autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl` (12 records, circular)
+- `models/ARCHIVED_circular_driving/` (trial-0001 through trial-0013)
+
+### Lesson: cv% is a Reward Hacking Indicator
+
+| cv%                | Interpretation |
+|--------------------|----------------|
+| < 1% + high reward | Likely reward hacking (very consistent exploit) |
+| 1-10%              | Normal RL variance |
+| > 50%              | Unstable policy, inconsistent behavior |
+
+This metric will be added to the autoresearch result logging and summary.
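A minimal sketch of how the cv% check described above might be wired into result logging (function names here are hypothetical, not from the repo; the thresholds come from the tables above):

```python
def cv_percent(mean_reward: float, std_reward: float) -> float:
    """Coefficient of variation (std/mean) as a percentage."""
    return 100.0 * std_reward / mean_reward if mean_reward else float("inf")

def hacking_suspected(mean_reward: float, std_reward: float,
                      high_reward: float = 200.0) -> bool:
    # Per the tables above: cv < 1% combined with high reward → likely a consistent exploit
    return mean_reward > high_reward and cv_percent(mean_reward, std_reward) < 1.0

print(hacking_suspected(4582.80, 0.485))  # True  (trial 5 — circular)
print(hacking_suspected(404.52, 14.47))   # False (trial 11 — normal variance)
```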
@@ -1,185 +1,240 @@
 """
-Tests for reward_wrapper.py v2 (hack-proof multiplicative formula) — no simulator required.
+Tests for reward_wrapper.py v3 (path efficiency / anti-circular) — no simulator required.
 """
 
 import sys
 import os
+import math
 import pytest
 import numpy as np
 import gymnasium as gym
+from collections import deque
 
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent'))
 
 from reward_wrapper import SpeedRewardWrapper
 
 
-class MockStepEnv(gym.Env):
-    """Mock gymnasium.Env for testing SpeedRewardWrapper."""
-    metadata = {'render_modes': []}
-
-    def __init__(self, speed=2.0, original_reward=1.0, done=False, use_5tuple=True):
-        super().__init__()
-        self._speed = speed
-        self._reward = original_reward
-        self._done = done
-        self._use_5tuple = use_5tuple
-        self.action_space = gym.spaces.Discrete(5)
-        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(120, 160, 3), dtype=np.uint8)
-
-    def reset(self, seed=None, **kwargs):
-        return np.zeros((120, 160, 3), dtype=np.uint8), {}
-
-    def step(self, action):
-        obs = np.zeros((120, 160, 3), dtype=np.uint8)
-        info = {'speed': self._speed}
-        if self._use_5tuple:
-            return obs, self._reward, self._done, False, info
-        else:
-            return obs, self._reward, self._done, info
-
-    def close(self):
-        pass
+def make_env_with_pos(speed=2.0, original_reward=1.0, done=False, pos=(0.0, 0.0, 0.0)):
+    """Create a mock env that returns a specific position in info dict."""
+    class PosEnv(gym.Env):
+        metadata = {'render_modes': []}
+
+        def __init__(self):
+            super().__init__()
+            self.action_space = gym.spaces.Discrete(5)
+            self.observation_space = gym.spaces.Box(low=0, high=255, shape=(120, 160, 3), dtype=np.uint8)
+            self._pos = list(pos)
+            self._speed = speed
+            self._reward = original_reward
+            self._done = done
+
+        def set_pos(self, p):
+            self._pos = list(p)
+
+        def reset(self, seed=None, **kwargs):
+            return np.zeros((120, 160, 3), dtype=np.uint8), {}
+
+        def step(self, action):
+            obs = np.zeros((120, 160, 3), dtype=np.uint8)
+            info = {'speed': self._speed, 'pos': self._pos}
+            return obs, self._reward, self._done, False, info
+
+        def close(self):
+            pass
+
+    return PosEnv()
 
 
-# ---- Hack-Proof Guarantee Tests ----
+# ---- Core Anti-Hacking Tests (inherited from v2) ----
 
 def test_no_speed_bonus_when_off_track():
-    """
-    CRITICAL: Off-track reward (≤ 0) must NOT get a speed bonus.
-    This is the core anti-hacking guarantee.
-    """
-    env = MockStepEnv(speed=10.0, original_reward=-1.0)  # Off track, very fast
+    """Off-track reward (≤ 0) must NOT get a speed bonus regardless of efficiency."""
+    env = make_env_with_pos(speed=10.0, original_reward=-1.0)
    wrapped = SpeedRewardWrapper(env, speed_scale=0.5)
+    wrapped.reset()
     _, reward, _, _, _ = wrapped.step(0)
-    assert reward == -1.0, \
-        f"Off-track reward must not get speed bonus, got {reward}"
+    assert reward == -1.0, f"Off-track reward must not get bonus, got {reward}"
 
 
 def test_no_speed_bonus_when_reward_zero():
-    """Reward exactly 0 (boundary case) should not get speed bonus."""
-    env = MockStepEnv(speed=5.0, original_reward=0.0)
+    """Reward exactly 0 should not get speed bonus."""
+    env = make_env_with_pos(speed=5.0, original_reward=0.0)
     wrapped = SpeedRewardWrapper(env, speed_scale=0.5)
+    wrapped.reset()
     _, reward, _, _, _ = wrapped.step(0)
     assert reward == 0.0, f"Zero reward should stay zero, got {reward}"
 
 
-def test_speed_bonus_scales_with_speed_when_on_track():
-    """When on track (positive reward), faster = higher shaped reward."""
-    env_slow = MockStepEnv(speed=1.0, original_reward=0.8)
-    env_fast = MockStepEnv(speed=5.0, original_reward=0.8)
-
-    wrapped_slow = SpeedRewardWrapper(env_slow, speed_scale=0.1)
-    wrapped_fast = SpeedRewardWrapper(env_fast, speed_scale=0.1)
-
-    _, r_slow, _, _, _ = wrapped_slow.step(0)
-    _, r_fast, _, _, _ = wrapped_fast.step(0)
-
-    assert r_fast > r_slow, f"Faster on-track should reward more: {r_fast:.3f} vs {r_slow:.3f}"
-
-
-def test_multiplicative_formula_correct():
-    """
-    Verify exact formula: shaped = original × (1 + speed_scale × speed)
-    """
-    original_reward = 0.6
-    speed = 3.0
-    speed_scale = 0.1
-    expected = original_reward * (1.0 + speed_scale * speed)  # 0.6 × 1.3 = 0.78
-
-    env = MockStepEnv(speed=speed, original_reward=original_reward)
-    wrapped = SpeedRewardWrapper(env, speed_scale=speed_scale)
-    _, reward, _, _, _ = wrapped.step(0)
-    assert reward == pytest.approx(expected, abs=1e-6), \
-        f"Expected {expected:.6f}, got {reward:.6f}"
+# ---- Path Efficiency Tests ----
+
+def _simulate_straight_driving(wrapped_env, env, steps=40, speed=3.0, step_size=0.1):
+    """Simulate straight-line driving: car moves forward by step_size each step."""
+    wrapped_env.reset()
+    rewards = []
+    for i in range(steps):
+        env.set_pos([i * step_size, 0.0, 0.0])
+        env._speed = speed
+        _, r, _, _, _ = wrapped_env.step(0)
+        rewards.append(r)
+    return rewards
+
+
+def _simulate_circular_driving(wrapped_env, env, steps=40, speed=3.0, radius=0.5):
+    """Simulate circular driving: car moves in a circle, returns to start."""
+    wrapped_env.reset()
+    rewards = []
+    for i in range(steps):
+        angle = 2 * math.pi * i / steps
+        x = radius * math.cos(angle)
+        z = radius * math.sin(angle)
+        env.set_pos([x, 0.0, z])
+        env._speed = speed
+        _, r, _, _, _ = wrapped_env.step(0)
+        rewards.append(r)
+    return rewards
+
+
+def test_straight_driving_gets_higher_reward_than_circular():
+    """
+    CRITICAL: Straight driving must produce more total reward than circular driving
+    at the same speed and base reward. This is the core anti-circular guarantee.
+    """
+    env_straight = make_env_with_pos(speed=3.0, original_reward=0.8)
+    env_circular = make_env_with_pos(speed=3.0, original_reward=0.8)
+
+    wrapped_straight = SpeedRewardWrapper(env_straight, speed_scale=0.1, window_size=20)
+    wrapped_circular = SpeedRewardWrapper(env_circular, speed_scale=0.1, window_size=20)
+
+    straight_rewards = _simulate_straight_driving(wrapped_straight, env_straight, steps=40)
+    circular_rewards = _simulate_circular_driving(wrapped_circular, env_circular, steps=40)
+
+    # After warmup (window fills), straight should consistently beat circular
+    straight_tail = sum(straight_rewards[20:])
+    circular_tail = sum(circular_rewards[20:])
+
+    assert straight_tail > circular_tail, (
+        f"Straight driving ({straight_tail:.2f}) should beat circular ({circular_tail:.2f})"
+    )
+
+
+def test_efficiency_near_one_for_straight_driving():
+    """Path efficiency should be near 1.0 for straight-line motion."""
+    env = make_env_with_pos(speed=3.0, original_reward=1.0)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
+    wrapped.reset()
+
+    # Drive in a straight line
+    for i in range(15):
+        env.set_pos([i * 0.2, 0.0, 0.0])
+        wrapped.step(0)
+
+    efficiency = wrapped._compute_efficiency()
+    assert efficiency > 0.90, f"Straight driving efficiency should be >0.90, got {efficiency:.4f}"
+
+
+def test_efficiency_near_zero_for_circular_driving():
+    """Path efficiency should be near 0.0 for full circular motion."""
+    env = make_env_with_pos(speed=3.0, original_reward=1.0)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=20)
+    wrapped.reset()
+
+    # Drive a full circle (returns to start position)
+    radius = 1.0
+    steps = 25  # More than window_size to fill it
+    for i in range(steps):
+        angle = 2 * math.pi * i / 24  # 24 steps = full circle
+        x = radius * math.cos(angle)
+        z = radius * math.sin(angle)
+        env.set_pos([x, 0.0, z])
+        wrapped.step(0)
+
+    efficiency = wrapped._compute_efficiency()
+    assert efficiency < 0.2, f"Circular driving efficiency should be <0.2, got {efficiency:.4f}"
+
+
+def test_efficiency_one_with_no_pos_history():
+    """When position not available, efficiency should default to 1.0 (no penalty)."""
+    class NoPosEnv(gym.Env):
+        metadata = {'render_modes': []}
+
+        def __init__(self):
+            super().__init__()
+            self.action_space = gym.spaces.Discrete(5)
+            self.observation_space = gym.spaces.Box(low=0, high=255, shape=(120, 160, 3), dtype=np.uint8)
+
+        def reset(self, seed=None, **kwargs):
+            return np.zeros((120, 160, 3), dtype=np.uint8), {}
+
+        def step(self, action):
+            return np.zeros((120, 160, 3), dtype=np.uint8), 0.8, False, False, {'speed': 2.0}  # No pos
+
+        def close(self):
+            pass
+
+    wrapped = SpeedRewardWrapper(NoPosEnv(), speed_scale=0.1)
+    wrapped.reset()
     _, reward, _, _, _ = wrapped.step(0)
+    # Without pos, efficiency=1.0, so reward = 0.8 * (1 + 0.1*2*1.0) = 0.96
+    assert reward > 0.8, f"Without pos, should get speed bonus (efficiency=1.0), got {reward}"
 
 
-def test_cannot_hack_by_going_fast_off_track():
-    """
-    Demonstrate that the previous formula could be hacked but this one cannot.
-    Fast off-track (speed=10) must give same or worse result than slow off-track (speed=1).
-    """
-    env_fast_offtrack = MockStepEnv(speed=10.0, original_reward=-1.0)
-    env_slow_offtrack = MockStepEnv(speed=1.0, original_reward=-1.0)
-
-    wrapped_fast = SpeedRewardWrapper(env_fast_offtrack, speed_scale=0.5)
-    wrapped_slow = SpeedRewardWrapper(env_slow_offtrack, speed_scale=0.5)
-
-    _, r_fast, _, _, _ = wrapped_fast.step(0)
-    _, r_slow, _, _, _ = wrapped_slow.step(0)
-
-    assert r_fast == r_slow == -1.0, \
-        f"Off-track reward must be identical regardless of speed: fast={r_fast}, slow={r_slow}"
+def test_efficiency_resets_on_episode_reset():
+    """Position history should clear on reset, so each episode starts fresh."""
+    env = make_env_with_pos(speed=3.0, original_reward=1.0)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.1, window_size=10)
+    wrapped.reset()
+
+    # Fill with circular data
+    radius = 0.5
+    for i in range(15):
+        angle = 2 * math.pi * i / 12
+        env.set_pos([radius * math.cos(angle), 0.0, radius * math.sin(angle)])
+        wrapped.step(0)
+
+    eff_before_reset = wrapped._compute_efficiency()
+
+    # Reset and drive straight for a few steps
+    wrapped.reset()
+    for i in range(3):
+        env.set_pos([i * 0.3, 0.0, 0.0])
+        wrapped.step(0)
+
+    eff_after_reset = wrapped._compute_efficiency()
+    assert eff_after_reset > eff_before_reset, \
+        f"After reset, efficiency should improve: before={eff_before_reset:.3f}, after={eff_after_reset:.3f}"
+
+
+def test_speed_bonus_disappears_when_circling():
+    """After circling for window_size steps, speed bonus should be nearly zero."""
+    env = make_env_with_pos(speed=5.0, original_reward=1.0)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.5, window_size=20, min_efficiency=0.05)
+    wrapped.reset()
+
+    # Warm up with circular motion
+    radius = 0.5
+    rewards = []
+    for i in range(30):
+        angle = 2 * math.pi * (i % 20) / 20  # Full circle every 20 steps
+        env.set_pos([radius * math.cos(angle), 0.0, radius * math.sin(angle)])
+        _, r, _, _, _ = wrapped.step(0)
+        rewards.append(r)
+
+    # Later rewards (after window fills) should be close to original_reward
+    later_rewards = rewards[20:]
+    avg_later = sum(later_rewards) / len(later_rewards)
+    assert avg_later < 1.3, \
+        f"Circular driving speed bonus should be suppressed, avg reward={avg_later:.3f} (original=1.0)"
+
+
+# ---- Inherited guarantees ----
+
+def test_crash_still_penalized():
+    """Crash (original_reward=-1) should remain -1 regardless of speed or efficiency."""
+    env = make_env_with_pos(speed=8.0, original_reward=-1.0, done=True)
+    wrapped = SpeedRewardWrapper(env, speed_scale=0.2)
+    wrapped.reset()
+    _, reward, _, _, _ = wrapped.step(0)
+    assert reward == -1.0, f"Crash reward should remain -1.0, got {reward}"
 
 
 def test_theoretical_max_per_step():
-    """
-    Verify theoretical_max_per_step returns correct upper bound.
-    With speed_scale=0.1 and max_speed=10.0: max = 1.0 × (1 + 0.1×10) = 2.0
-    """
-    env = MockStepEnv()
+    """Max reward/step bounded: original(1.0) × (1 + speed_scale × max_speed)."""
+    env = make_env_with_pos()
     wrapped = SpeedRewardWrapper(env, speed_scale=0.1)
-    max_reward = wrapped.theoretical_max_per_step(max_speed=10.0)
-    assert max_reward == pytest.approx(2.0, abs=1e-6), \
-        f"Max per step should be 2.0, got {max_reward}"
-
-
-def test_fallback_when_speed_not_in_info():
-    """If info doesn't have speed, fall back to original reward."""
-    class NoSpeedEnv(gym.Env):
-        metadata = {'render_modes': []}
-        def __init__(self):
-            super().__init__()
-            self.action_space = gym.spaces.Discrete(5)
-            self.observation_space = gym.spaces.Box(low=0, high=255, shape=(120, 160, 3), dtype=np.uint8)
-        def reset(self, seed=None, **kwargs):
-            return np.zeros((120, 160, 3), dtype=np.uint8), {}
-        def step(self, action):
-            return np.zeros((120, 160, 3), dtype=np.uint8), 0.75, False, False, {}  # No 'speed' key
-        def close(self):
-            pass
-
-    wrapped = SpeedRewardWrapper(NoSpeedEnv(), speed_scale=0.5)
-    _, reward, _, _, _ = wrapped.step(0)
-    # speed=0.0 default → shaped = 0.75 × (1 + 0.5 × 0.0) = 0.75
-    assert reward == pytest.approx(0.75, abs=1e-6), \
-        f"Should fall back gracefully, got {reward}"
-
-
-def test_wrapper_preserves_observation():
-    """SpeedRewardWrapper must not modify observations."""
-    class FixedObsEnv(gym.Env):
-        metadata = {'render_modes': []}
-        def __init__(self):
-            super().__init__()
-            self.action_space = gym.spaces.Discrete(5)
-            self.observation_space = gym.spaces.Box(low=0, high=255, shape=(120, 160, 3), dtype=np.uint8)
-        def reset(self, seed=None, **kwargs):
-            return np.zeros((120, 160, 3), dtype=np.uint8), {}
-        def step(self, action):
-            return np.zeros((120, 160, 3), dtype=np.uint8), 0.8, False, False, {'speed': 2.0}
-        def close(self):
-            pass
-
-    wrapped = SpeedRewardWrapper(FixedObsEnv())
-    obs, _, _, _, _ = wrapped.step(0)
-    np.testing.assert_array_equal(obs, np.zeros((120, 160, 3), dtype=np.uint8))
-
-
-def test_4tuple_step_compatibility():
-    """Wrapper should handle 4-tuple step() return (old gym API)."""
-    env = MockStepEnv(speed=2.0, original_reward=0.8, use_5tuple=False)
-    wrapped = SpeedRewardWrapper(env)
-    result = wrapped.step(0)
-    assert len(result) == 4, f"Expected 4-tuple, got {len(result)}"
-    _, reward, done, info = result
-    assert isinstance(reward, float)
-    assert reward > 0.8, "Speed bonus should increase reward when on track"
-
-
-def test_crash_still_penalized():
-    """Crash (original_reward=-1) should remain -1, not improved by speed."""
-    env = MockStepEnv(speed=8.0, original_reward=-1.0, done=True)
-    wrapped = SpeedRewardWrapper(env, speed_scale=0.2)
-    _, reward, _, _, _ = wrapped.step(0)
-    assert reward == -1.0, f"Crash reward should remain -1.0, got {reward}"
+    assert wrapped.theoretical_max_per_step(max_speed=10.0) == pytest.approx(2.0, abs=1e-6)
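To run just the new anti-circular tests locally, something like the following should work (the tests/ path is an assumption based on the sys.path setup in the test file):

```python
# Equivalent to: pytest tests/test_reward_wrapper.py -v -k "circular or efficiency"
import pytest

pytest.main(["tests/test_reward_wrapper.py", "-v", "-k", "circular or efficiency"])
```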