This was the root cause of losing good models during training.
The model could learn to lap at step 30k then drift to a worse
policy by step 90k, and we only ever saved the final weights.
Changes to train_multitrack():
- Tracks best_segment_reward across all segments
- Saves best_model.zip whenever a new high score is achieved
- At end of training, RELOADS best_model.zip before returning
so the caller always gets the best policy found, not the drifted
final weights (see the sketch below)
Both files saved per trial:
model.zip <- latest checkpoint (crash recovery)
best_model.zip <- best policy seen during training (used for eval)
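A minimal sketch of the pattern inside train_multitrack(); make_segments()
and train_one_segment() are hypothetical stand-ins for the real segment loop:

    import os
    from stable_baselines3 import PPO

    def train_multitrack(params, env, save_dir):
        best_segment_reward = float("-inf")
        model = PPO("CnnPolicy", env, **params)        # assumed constructor
        for segment in make_segments(params):          # hypothetical helper
            segment_reward = train_one_segment(model, segment)
            model.save(os.path.join(save_dir, "model.zip"))  # crash recovery
            if segment_reward > best_segment_reward:
                best_segment_reward = segment_reward
                model.save(os.path.join(save_dir, "best_model.zip"))
        # Reload the best checkpoint so the caller gets the peak policy,
        # not wherever the final segment drifted to.
        return PPO.load(os.path.join(save_dir, "best_model.zip"), env=env)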
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.
v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
StuckTerminationWrapper (not by the reward formula)
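A minimal sketch of the v5 formula; the clamp to zero past max_cte is an
assumption, the /10 speed normalisation is from the formula above:

    def v5_reward(speed, cte, max_cte):
        # Centring term: 1.0 on the lap line, 0.0 at the max-CTE boundary.
        centring = max(0.0, 1.0 - abs(cte) / max_cte)
        return (speed / 10.0) * centring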
Tests updated to reflect v5 semantics (102 passing).
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Samples car position every 100 steps during eval. Computes macro
efficiency = net_displacement / total_sampled_path. If < 0.3 with
>= 500 steps, logs WARNING: SHUTTLE EXPLOIT? with the efficiency value.
Also logs reward/step per episode so anomalously high-scoring long
episodes can be diagnosed immediately.
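A minimal sketch of the detector; the function name and (x, z) position
tuples are assumptions, the thresholds are from the text:

    import logging
    import numpy as np

    def check_macro_efficiency(positions, step_count):
        # positions: (x, z) pairs sampled every 100 eval steps
        if step_count < 500 or len(positions) < 2:
            return
        pts = np.asarray(positions, dtype=float)
        net_displacement = np.linalg.norm(pts[-1] - pts[0])
        total_sampled_path = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
        efficiency = net_displacement / max(total_sampled_path, 1e-9)
        if efficiency < 0.3:
            logging.warning("SHUTTLE EXPLOIT? macro efficiency = %.3f", efficiency)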
This will tell us definitively whether Trials 9 and 14 (1435/1573
scores, 2000 steps each) were genuine driving or back-and-forth
shuttling on a mini_monaco straight.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
The checkpoint fix referenced save_dir inside train_multitrack(), but
save_dir is defined in main(). Every trial since that fix landed
crashed with NameError: name 'save_dir' is not defined after the
first segment,
producing rc=101 and no GP data.
Fix: add save_dir=None parameter to train_multitrack() and pass it
from the main() call site.
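A sketch of the shape of the fix (body elided; the signature and the
call site are the change):

    def train_multitrack(params, save_dir=None):
        ...                                   # segment loop saves into save_dir

    # in main(), where save_dir is actually defined:
    train_multitrack(params, save_dir=save_dir)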
This explains why Trials 6-10 in the current run all produced None
results despite appearing to train normally for the first segment.
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
All previous issues addressed:
- Controller was never restarted after cap/checkpoint fixes -> they never ran
- Timeout trials (score=0) were polluting GP data -> removed
- Overnight Trial 3 result (1943 mini_monaco) was unknown to GP -> added
GP now has 5 valid data points including the 1943 score at
lr=0.000685, switch=17499. GP should converge toward longer
switching intervals which produced the only great result.
Verified before relaunch:
- PARAM_SPACE max total_timesteps = 90000 ✓
- Checkpoint saves after every segment ✓
- Rescue eval on timeout ✓
- 102 tests passing ✓
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Three changes:
1. Lower total_timesteps cap: 120k → 90k
Actual throughput is 16 steps/sec (not 20 as estimated).
120k steps = 125 min training + 9 min overhead = 134 min > 2hr limit.
90k steps = 94 min + 8 min overhead = 102 min, safely within limit.
2. Per-segment checkpoint saves in multitrack_runner
model.save() called after every segment so the latest weights are
always on disk. If the runner is killed (timeout/crash/Ctrl+C),
training data is never completely lost.
3. Timeout rescue eval in wave4_controller
If JOB_TIMEOUT fires and a checkpoint exists, immediately runs a
quick mini_monaco eval on the checkpoint so the trial still produces
a GP data point despite the timeout.
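A sketch of the rescue path; make_env() and quick_eval() are hypothetical
helper names:

    import os
    from stable_baselines3 import PPO

    def rescue_eval_on_timeout(checkpoint_path):
        if not os.path.exists(checkpoint_path):
            return None                        # nothing saved yet: no GP point
        model = PPO.load(checkpoint_path)
        env = make_env("mini_monaco")          # hypothetical factory
        return quick_eval(model, env)          # short capped rollout -> GP point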
Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
StuckTerminationWrapper added to wrap_env stack (between ThrottleClamp
and SpeedReward):
- Terminates episode after stuck_steps=80 steps with <0.5m displacement
- Handles slow barrier contact that Unity hit detection misses
- Handles off-lap-line circles (efficiency→0 gave zero reward but no
termination; now gives -1.0 after 80 steps = ~4s of non-progress)
- Wrapper stack: ThrottleClamp → StuckTermination → SpeedReward
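A minimal sketch of the wrapper, assuming the Gymnasium 5-tuple step API
and pos_x/pos_z info keys:

    import math
    from collections import deque
    import gymnasium as gym

    class StuckTerminationWrapper(gym.Wrapper):
        def __init__(self, env, stuck_steps=80, min_displacement=0.5):
            super().__init__(env)
            self.min_displacement = min_displacement
            self.positions = deque(maxlen=stuck_steps)

        def reset(self, **kwargs):
            self.positions.clear()
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            self.positions.append((info["pos_x"], info["pos_z"]))  # assumed keys
            if len(self.positions) == self.positions.maxlen:
                dx = self.positions[-1][0] - self.positions[0][0]
                dz = self.positions[-1][1] - self.positions[0][1]
                if math.hypot(dx, dz) < self.min_displacement:
                    # ~4 s with no net progress: penalise and end the episode
                    return obs, -1.0, True, truncated, info
            return obs, reward, terminated, truncated, info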
Also: fixed a missing deque import in multitrack_runner.py that
caused a NameError.
Phase 4 results cleared again (Trial 1 ran without StuckTermination).
Tests: 2 new stuck-termination tests, 102 total.
Agent: pi
Tests: 102 passed
Tests-Added: 2
TypeScript: N/A
Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
< min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives a -50 penalty; legitimate 12-second laps are
unaffected (sketch after item 2). Window size also increased from
30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with manual eval loop
capped at MAX_EVAL_STEPS=3000 steps.
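Minimal sketches of both fixes; helper names and the Gymnasium-style step
API are assumptions, the thresholds are from the text:

    MIN_LAP_TIME = 5.0    # seconds

    def short_lap_penalty(last_lap_time):
        # Applied when lap_count increments; a 1 s lap yields -10 * (5/1) = -50.
        if last_lap_time < MIN_LAP_TIME:
            return -10.0 * (MIN_LAP_TIME / last_lap_time)
        return 0.0        # 12 s laps and slower are unaffected

    MAX_EVAL_STEPS = 3000

    def capped_eval(model, env):
        # Manual loop replacing evaluate_policy; cannot run forever.
        obs, _ = env.reset()
        total, steps, done = 0.0, 0, False
        while not done and steps < MAX_EVAL_STEPS:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
            steps += 1
        return total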
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Set verbose=1 when constructing fresh models. Without it, Wave 4
scratch-trained models produce no rollout stats in the log, making it
impossible to monitor training progress or spot degenerate policies
early.
Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.
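A one-line sketch of the change, assuming the fresh model is built with
SB3's PPO constructor (policy name and other kwargs illustrative):

    from stable_baselines3 import PPO

    model = PPO("CnnPolicy", env, learning_rate=lr, verbose=1)  # rollout stats logged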
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start)
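A hypothetical shape of the resulting search space; the actual structure
in wave4_controller.py may differ, and the bounds are taken from the list
above:

    PARAM_SPACE = {
        "learning_rate":   (1e-4, 2e-3),       # wider range for scratch training
        "total_timesteps": (60_000, 250_000),  # fresh models need more steps
    }
    SEED_PARAMS = [
        {"learning_rate": 3e-4},               # diverse seeds from the start
        {"learning_rate": 1e-3},
    ]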
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate() which reads lr_schedule, not
model.learning_rate. So even with param_groups patched, the first
gradient step reverts the optimizer to the saved LR.
Complete 3-part fix in create_or_load_model():
    model.learning_rate = lr                        # attribute
    model.lr_schedule = get_schedule_fn(lr)         # prevents train() reverting
    for pg in model.policy.optimizer.param_groups:  # immediate effect
        pg['lr'] = lr
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path including a live learn() call; would have caught both bugs
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A
PPO.load() restores the saved optimizer state (lr=0.000225 from Phase 2
champion). Setting model.learning_rate alone is insufficient because
_update_learning_rate() may not fire before the first gradient step, and
the optimizer's param_groups still hold the old value.
Fix: after PPO.load(), explicitly set lr on every optimizer param_group:
    model.learning_rate = lr
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr
Impact: all 8 previous Wave 3 trials actually trained at LR=0.000225
regardless of GP proposal. Results archived as:
autoresearch_results_phase3_CONTAMINATED_wrong_lr.jsonl
Phase 3 results cleared; autoresearch restarting from scratch.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Warren track surface is green carpet (not outdoor road), and the
episode-done condition (|CTE| > max_cte) does not fire when the car
crosses the INSIDE boundary. Car can drive off-track and bump into
chairs indefinitely, making scores meaningless as a test metric.
Changes:
- multitrack_runner.py: TEST_TRACKS now mini_monaco only
- wave3_controller.py: drop warren_reward from parse/save/champion paths
- tests/test_wave3.py: update assertions to match single test track
- All 83 tests pass
Track classification (final):
TRAIN : generated_road, generated_track, mountain_track
TEST : mini_monaco (outdoor, proper road, correct done condition)
SKIP : warren, warehouse, robo_racing_league, waveshare, circuit_launch
SKIP : avc_sparkfun (orange markings)
ADR-010 to be updated.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Bug: send_exit_scene_raw() opened a NEW TCP connection, creating a second
phantom vehicle. The sim sent exit_scene to the phantom, leaving the real
training connection stuck on generated_road for the entire run.
Fix: _send_exit_scene() now calls env.unwrapped.viewer.exit_scene() on the
EXISTING TCP connection that the training env already holds. This is the
only reliable way to switch scenes mid-session (matches track_switcher.py).
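A minimal sketch of the fixed helper; the viewer call is from the text,
the wrapper around it is assumed:

    def _send_exit_scene(env):
        # Reuse the TCP connection the training env already holds; a fresh
        # socket registers a second (phantom) vehicle that absorbs the message.
        env.unwrapped.viewer.exit_scene()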
Also:
- Removed send_exit_scene_raw() import from multitrack_runner.py
- Simplified initial connection (no spurious exit_scene at startup)
- Reduced search space: total_timesteps 80k-400k -> 30k-150k
- Reduced seed params: 150k/300k -> 45k/90k (~35-45 min per trial)
- Added test: test_close_and_switch_uses_viewer_not_raw_socket
83 tests passing
Agent: pi
Tests: 83 passed
Tests-Added: 1
TypeScript: N/A
PHASE 2 MILESTONE DOCUMENTED:
All 3 top models complete the full track with distinct driving styles:
- Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION ✅
- Trial 8 (n_steer=4): Left/center lane, oscillating (still completes!)
- Trial 18 (n_steer=3): Right shoulder, very accurate line following
Key finding: fewer steering bins (n_steer=3) = better driving (counterintuitive)
CTE symmetry explains left/right preference: random NN init determines which side
BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py):
- LanePositionWrapper: target a specific CTE offset (control left/right preference)
- AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 oscillation)
- AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more)
- CombinedBehavioralWrapper: all three combined in one wrapper
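A minimal sketch of one of the wrappers (AntiOscillationWrapper), assuming
a Gymnasium-style env with a [steering, throttle] action layout; the
penalty coefficient is illustrative:

    import gymnasium as gym

    class AntiOscillationWrapper(gym.Wrapper):
        def __init__(self, env, penalty_scale=0.5):
            super().__init__(env)
            self.penalty_scale = penalty_scale
            self.prev_steer = 0.0

        def reset(self, **kwargs):
            self.prev_steer = 0.0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            steer = float(action[0])
            reward -= self.penalty_scale * abs(steer - self.prev_steer)
            self.prev_steer = steer
            return obs, reward, terminated, truncated, info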
ENHANCED EVALUATOR (agent/evaluate_champion.py):
- Full metrics: reward, lap time, oscillation score, CTE distribution, lane position
- --compare flag: runs all top Phase 2 models side by side with comparison table
- Saves eval summary to outerloop-results/eval_summary.jsonl
- Detects lap completion events from sim info dict
IMPLEMENTATION PLAN updated: Wave 3 streams defined
RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps
Champion updated to Trial 20 (Phase 2)
Agent: pi/claude-sonnet
Tests: 53/53 passing (+13 behavioral wrapper tests)
Tests-Added: +13
TypeScript: N/A
ROOT CAUSE:
donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
to its own heading), so it earned positive reward indefinitely while circling.
v3 WAS INSUFFICIENT:
v3 applied efficiency only to the speed BONUS: original × (1 + speed×eff×scale)
But 'original' from sim was still exploitable: CTE≈0 while spinning → original=1.0/step
Efficiency killed the speed bonus but not the base reward.
47k-step run: spinning = 1.0/step × 47k = 47k reward (never crashes in circle)
v4 FIX — base × efficiency × speed:
reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
Completely ignores sim's bogus forward_vel reward.
Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
ALL three terms must be high to earn reward — cannot be gamed.
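A minimal sketch of v4; argument names, the clamp to zero, and the
speed_scale default are illustrative:

    def v4_reward(cte, max_cte, efficiency, speed, speed_scale=1.0):
        # Multiplicative: any one term near zero kills the reward, so a
        # spinning car (efficiency ~ 0) earns ~0 regardless of CTE or speed.
        centring = max(0.0, 1.0 - abs(cte) / max_cte)
        return centring * efficiency * (1.0 + speed_scale * speed)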
Key new test: test_circling_at_zero_cte_gives_near_zero_reward
Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
forward_beats_circling_by_3x confirmed.
Also: update Phase 2 autoresearch timesteps test, research log updated.
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A