donkeycar-rl-autoresearch/agent/outerloop-results
Paul Huliganga fcb6ea1ac2 fix: path-efficiency reward (v3) defeats circular driving exploit
CRITICAL BUG FIX — Circular Driving:
- v2 reward still hackable: car circles at starting line with low CTE + positive speed
- Confirmed in data: trial 5 mean_reward=4582, cv=0.0% (physically impossible for genuine driving)
- Statistical signature: cv <1% with high reward = consistent exploit, not genuine driving
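
The cv% signature above can be checked mechanically. The following is a minimal sketch, not the repository's detector: the function names, the 1000.0 "high reward" cutoff, and the use of per-episode rewards as input are all assumptions made for illustration. Only the "cv < 1% with high reward" rule comes from the notes above.

```python
import statistics


def coefficient_of_variation_pct(rewards):
    """Return the coefficient of variation (population stdev / |mean|) in percent."""
    mean = statistics.fmean(rewards)
    if mean == 0:
        return float("inf")  # cv is undefined at zero mean; treat as "not consistent"
    return statistics.pstdev(rewards) / abs(mean) * 100.0


def looks_like_reward_hack(rewards, cv_threshold_pct=1.0, high_reward=1000.0):
    """Flag a trial whose episode rewards are both high and suspiciously uniform.

    cv_threshold_pct follows the "cv < 1%" rule from the commit notes.
    high_reward is a hypothetical cutoff and would need tuning per reward scale.
    """
    if len(rewards) < 2:
        return False  # a single episode carries no variance information
    mean = statistics.fmean(rewards)
    cv = coefficient_of_variation_pct(rewards)
    return mean >= high_reward and cv < cv_threshold_pct
```

A trial with identical high episode rewards, such as `[4582.0, 4582.0, 4582.0]`, would be flagged, while a trial with modest, scattered rewards would not.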

ROOT CAUSE: Neither CTE nor raw speed can distinguish forward from circular motion.
Both produce low CTE (on centerline) + positive speed (moving), so both earn the same reward.
Missing dimension: TRACK PROGRESS (net advance along track)

FIX — Path Efficiency Reward (v3):
  efficiency = net_displacement / total_path_length  (sliding window of 30 steps)
  shaped = original * (1 + speed_scale * speed * efficiency)
  - Forward driving: efficiency ≈ 1.0 → full speed bonus
  - Circular driving: efficiency ≈ 0.0 → speed bonus disappears
  - Closes the circling exploit: a loop completed within the window returns to the same positions (low net_displacement)
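
The formula above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code from this commit: the class name `PathEfficiencyShaper`, the 2D `(x, z)` position tuples, the default `speed_scale=0.1`, and returning zero efficiency before two positions are recorded are all hypothetical choices. Only the 30-step window and the two equations come from the notes above.

```python
import math
from collections import deque


class PathEfficiencyShaper:
    """Scale the speed bonus by net_displacement / total_path_length over a sliding window."""

    def __init__(self, window=30, speed_scale=0.1):
        # window + 1 positions span `window` movement steps.
        self.positions = deque(maxlen=window + 1)
        self.speed_scale = speed_scale  # hypothetical default

    def efficiency(self):
        """Return a value in [0, 1]; 0.0 when there is no movement evidence yet."""
        if len(self.positions) < 2:
            return 0.0  # assumption: no bonus until at least one step is observed
        pts = list(self.positions)
        total = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
        if total == 0.0:
            return 0.0  # stationary car: no path, no bonus
        net = math.dist(pts[0], pts[-1])
        return min(1.0, net / total)  # clamp against floating-point overshoot

    def shape(self, original, speed, position):
        """Record the new position and return the shaped reward."""
        self.positions.append(position)
        return original * (1.0 + self.speed_scale * speed * self.efficiency())
```

Feeding positions along a straight line keeps the efficiency near 1.0, so the speed bonus applies in full. Feeding positions around a loop that closes within the window drives the efficiency toward 0.0, so the shaped reward falls back to `original`. A loop much longer than the window would look locally straight, so the window length would have to be chosen relative to the smallest loop the car can drive.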

Tests:
  - test_efficiency_near_zero_for_circular_driving: confirmed <0.2 efficiency for circles
  - test_efficiency_near_one_for_straight_driving: confirmed >0.90 for straight line
  - test_straight_driving_gets_higher_reward_than_circular: KEY guarantee
  - test_speed_bonus_disappears_when_circling: bonus suppressed after window fills

Research documentation:
  - Full analysis with data table added to docs/RESEARCH_LOG.md
  - cv% identified as reward hacking indicator
  - Archived circular data + models

Clean start: new autoresearch_results_phase1.jsonl, new champion dir

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +6 (path efficiency, anti-circular)
TypeScript: N/A
2026-04-13 13:36:17 -04:00
model-000 Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
model-001 Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
model-002 Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
model-003 Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
autoresearch_log.txt AUTORESEARCH: 300 total trials complete - best mean_reward=141.85 at n_steer=8, n_throttle=5, lr=0.00202 2026-04-13 01:56:06 -04:00
autoresearch_phase1_log_CORRUPTED_circular_driving.txt fix: path-efficiency reward (v3) defeats circular driving exploit 2026-04-13 13:36:17 -04:00
autoresearch_phase1_log_CORRUPTED_reward_hacking.txt fix: hack-proof reward shaping + reward hacking detection + research log 2026-04-13 12:27:48 -04:00
autoresearch_results.jsonl AUTORESEARCH: 300 total trials complete - best mean_reward=141.85 at n_steer=8, n_throttle=5, lr=0.00202 2026-04-13 01:56:06 -04:00
autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl fix: path-efficiency reward (v3) defeats circular driving exploit 2026-04-13 13:36:17 -04:00
autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl fix: hack-proof reward shaping + reward hacking detection + research log 2026-04-13 12:27:48 -04:00
clean_sweep_results.jsonl AUTORESEARCH: Full Karpathy-style GP+UCB meta-controller, clean base data, fixed all paths, ready to run 2026-04-13 00:52:00 -04:00
nohup_outerloop.log Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
outer_monitor.log Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00
sweep_results.jsonl Initial commit: stable RL sweep runner, legacy and new scripts, full docs included 2026-04-12 22:57:50 -04:00