Commit Graph

7 Commits

Author SHA1 Message Date
Paul Huliganga 47d8e5b346 fix: short-lap exploit now TERMINATES the episode, not just penalises
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.

Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.

Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.
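The (reward, force_terminate) return pattern described above can be sketched as follows. This is a minimal illustration, not the repo's code: the v5 reward formula and the -100 penalty are taken from these commit messages, but `MIN_LAP_TIME`, `step_outcome`, and all argument names are hypothetical.

```python
MIN_LAP_TIME = 5.0  # seconds; assumed carried over from the earlier short-lap fix

def _compute_reward_and_done(speed, cte, max_cte, last_lap_time):
    """Return (reward, force_terminate)."""
    if last_lap_time is not None and last_lap_time < MIN_LAP_TIME:
        # Short lap detected: penalise AND end the episode immediately,
        # so no further reward can be accumulated by circling.
        return -100.0, True
    # v5 reward from the previous commit: speed x CTE-quality
    reward = (speed / 10.0) * (1.0 - abs(cte) / max_cte)
    return reward, False

def step_outcome(speed, cte, max_cte, last_lap_time, crashed):
    # Sketch of how step() would fold force_terminate into terminated.
    reward, force_terminate = _compute_reward_and_done(speed, cte, max_cte, last_lap_time)
    terminated = crashed or force_terminate
    return reward, terminated
```

With termination, a short lap forfeits all future reward, which is what makes circling strictly worse than driving forward.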

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-18 10:42:23 -04:00
Paul Huliganga b8a13dea81 feat: v5 reward — speed × CTE-quality, drop efficiency term
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.

v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
  StuckTerminationWrapper (not by the reward formula)

Tests updated to reflect v5 semantics (102 passing).
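The v5 formula is simple enough to state as a one-liner. A minimal sketch, assuming a `max_cte` default of 8.0 (not stated in the commit):

```python
def v5_reward(speed, cte, max_cte=8.0):
    # v5: directly reward speed while penalising distance from centerline.
    # Slowing on a hill lowers reward monotonically, giving a clear
    # gradient toward more throttle.
    return (speed / 10.0) * (1.0 - abs(cte) / max_cte)
```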

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-17 13:25:38 -04:00
Paul Huliganga 5d1227833d fix: close short-lap circle exploit and cap segment eval episode length
Two reward hacking behaviours observed during Wave 4 training:

1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
   Model circles at start/finish line completing laps in 1-2 sim-seconds,
   accumulating lap_count indefinitely with no genuine track progress.
   Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
   < min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
   A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
   Window size also increased from 30 → 60 to catch slower circles.
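The scaled penalty in fix 1 can be sketched directly from the formula in the message; the function name and the early-return shape are illustrative:

```python
MIN_LAP_TIME = 5.0  # seconds, per the commit

def short_lap_penalty(lap_time, min_lap_time=MIN_LAP_TIME):
    # Penalty grows the faster the bogus lap: -10 * (min_lap_time / lap_time).
    if lap_time >= min_lap_time:
        return 0.0  # legitimate lap: unaffected
    return -10.0 * (min_lap_time / lap_time)
```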

2. Non-terminating segment eval episodes:
   evaluate_policy on wide tracks (no barriers) could run indefinitely,
   inflating segment_reward to 200k+. Replaced with manual eval loop
   capped at MAX_EVAL_STEPS=3000 steps.

Phase 4 results cleared (trials 4-6 ran with exploitable reward).

Tests: 4 new reward wrapper tests, 100 total passing.

Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
2026-04-15 09:06:25 -04:00
Paul Huliganga c8a495dd22 fix: reward v4 — full sim bypass kills circular driving at root
ROOT CAUSE:
  donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
  A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
  to its own heading), so it earned positive reward indefinitely while circling.

v3 WAS INSUFFICIENT:
  v3 applied efficiency only to the speed BONUS: original × (1 + speed×eff×scale)
  But 'original' from sim was still exploitable: CTE≈0 while spinning → original=1.0/step
  Efficiency killed the speed bonus but not the base reward.
  47k-step run: spinning = 1.0/step × 47k = 47k reward (never crashes in circle)

v4 FIX — base × efficiency × speed:
  reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
  Completely ignores sim's bogus forward_vel reward.
  Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
  ALL three terms must be high to earn reward — cannot be gamed.
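A direct sketch of the v4 product, with assumed defaults for `max_cte` and `speed_scale` (neither value appears in the commit):

```python
def v4_reward(cte, efficiency, speed, max_cte=8.0, speed_scale=0.5):
    # All three terms must be high to earn reward. Spinning drives
    # efficiency toward zero, which zeroes the whole product regardless
    # of CTE or speed, so the forward_vel exploit earns nothing.
    return (1.0 - abs(cte) / max_cte) * efficiency * (1.0 + speed_scale * speed)
```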

Key new test: test_circling_at_zero_cte_gives_near_zero_reward
  Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
  forward_beats_circling_by_3x confirmed.

Also: update Phase 2 autoresearch timesteps test, research log updated.

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A
2026-04-13 20:56:32 -04:00
Paul Huliganga fcb6ea1ac2 fix: path-efficiency reward (v3) defeats circular driving exploit
CRITICAL BUG FIX — Circular Driving:
- v2 reward still hackable: car circles at starting line with low CTE + positive speed
- Confirmed in data: trial 5 mean_reward=4582, cv=0.0% (physically impossible for genuine driving)
- Statistical signature: cv <1% with high reward = consistent exploit, not genuine driving

ROOT CAUSE: Neither CTE nor raw speed can distinguish forward vs circular motion.
Both have: low CTE (on centerline) + positive speed (moving) = same reward.
Missing dimension: TRACK PROGRESS (net advance along track)

FIX — Path Efficiency Reward (v3):
  efficiency = net_displacement / total_path_length  (sliding window of 30 steps)
  shaped = original × (1 + speed_scale × speed × efficiency)
  - Forward driving: efficiency ≈ 1.0 → full speed bonus
  - Circular driving: efficiency ≈ 0.0 → speed bonus disappears
  - Cannot be hacked: circling means returning to same positions (low net_displacement)
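The sliding-window efficiency and the shaped reward can be sketched as below. Function names, the cold-start return value for an unfilled window, and the `speed_scale` default are assumptions, not the repo's code:

```python
import math

WINDOW = 30  # steps, per the commit

def path_efficiency(positions):
    # efficiency = net_displacement / total_path_length over the window.
    if len(positions) < 2:
        return 1.0  # window not yet filled: assume efficient (assumption)
    total = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if total == 0.0:
        return 0.0  # no movement at all
    net = math.dist(positions[0], positions[-1])
    return net / total

def v3_shaped(original, speed, efficiency, speed_scale=0.5):
    # shaped = original * (1 + speed_scale * speed * efficiency):
    # the speed bonus disappears when efficiency collapses to zero.
    return original * (1.0 + speed_scale * speed * efficiency)
```

A closed loop returns to its start, so net displacement (and with it the bonus) goes to zero, while straight driving keeps efficiency near 1.0.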

Tests:
  - test_efficiency_near_zero_for_circular_driving: confirmed <0.2 efficiency for circles
  - test_efficiency_near_one_for_straight_driving: confirmed >0.90 for straight line
  - test_straight_driving_gets_higher_reward_than_circular: KEY guarantee
  - test_speed_bonus_disappears_when_circling: bonus suppressed after window fills

Research documentation:
  - Full analysis with data table added to docs/RESEARCH_LOG.md
  - cv% identified as reward hacking indicator
  - Archived circular data + models

Clean start: new autoresearch_results_phase1.jsonl, new champion dir

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +6 (path efficiency, anti-circular)
TypeScript: N/A
2026-04-13 13:36:17 -04:00
Paul Huliganga 5e93dae316 fix: hack-proof reward shaping + reward hacking detection + research log
CRITICAL BUG FIX — Reward Hacking:
- Old formula: speed × (1 - |cte|/max_cte) could be maximized by oscillating
  at track boundary regardless of on-track behavior (trials 8+13 hit 1936+1139)
- New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track
- Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked
- Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track
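The on-track gate described above can be sketched in a few lines; the function name and `speed_scale` default are illustrative:

```python
def v2_shaped(original_reward, speed, speed_scale=0.5):
    # Speed bonus applies ONLY while on track (original_reward > 0).
    # Off-track there is nothing to multiply, so going fast cannot help.
    if original_reward <= 0:
        return original_reward
    return original_reward * (1.0 + speed_scale * speed)
```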

Reward Hacking Auto-Detection:
- check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking
- Flagged results are excluded from GP fitting (won't optimize toward hacking params)
- reward_hacking_suspected field added to JSONL result records
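The detector reduces to a per-step threshold check. The function name `check_for_reward_hacking` and the 3.0 threshold come from the commit; its arguments are assumed:

```python
HACKING_THRESHOLD = 3.0  # reward per step, per the commit

def check_for_reward_hacking(total_reward, steps, threshold=HACKING_THRESHOLD):
    # Flag trials with implausibly high per-step reward; flagged trials
    # are excluded from GP fitting so the search cannot optimise toward
    # hacking parameters.
    if steps == 0:
        return False
    return (total_reward / steps) > threshold
```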

Research Documentation:
- docs/RESEARCH_LOG.md created: full chronological research history
  - Random policy bug discovery and impact
  - Throttle clamp fix
  - Reward hacking discovery with evidence table
  - Hack-proof design rationale
  - Lessons learned + future research questions
- Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/

Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +9 (reward wrapper hack-proof tests)
TypeScript: N/A
2026-04-13 12:27:48 -04:00
Paul Huliganga c804189dd0 feat: Wave 1 complete — real PPO training, model save, GP+UCB autoresearch, 37 tests passing
- Rebuilt donkeycar_sb3_runner.py: real PPO/DQN model.learn() + evaluate_policy() + model.save()
- Added SpeedRewardWrapper: reward = speed * (1 - |cte|/max_cte)
- Added ChampionTracker: tracks best model across all trials, writes manifest.json
- Rebuilt autoresearch_controller.py: Phase 1 results separated from random-policy data
- Added timesteps to GP search space
- Added --push-every N for automatic git push
- Added 37 passing tests: discretize_action, reward_wrapper, autoresearch_controller, runner_integration
- Scaffolded project with agent harness (large mode): PROJECT-SPEC, DECISIONS, IMPLEMENTATION_PLAN, EXECUTION_MASTER
- Fixed: model.save() no longer called before the model is defined (was root cause of all prior NameError crashes)
- Fixed: random policy replaced with real trained policy evaluation

Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: +37
TypeScript: N/A
2026-04-13 10:03:15 -04:00