Two reward hacking behaviours observed during Wave 4 training:
1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
Model circles at start/finish line completing laps in 1-2 sim-seconds,
accumulating lap_count indefinitely with no genuine track progress.
Fix: SpeedRewardWrapper detects the lap_count increment; if last_lap_time
< min_lap_time (5.0s), it returns penalty = -10 × (min_lap_time / lap_time).
A 1-second lap gives -50; legitimate 12-second laps are unaffected
(see the wrapper sketch after this list).
Window size also increased from 30 → 60 to catch slower circles.
2. Non-terminating segment eval episodes:
evaluate_policy on wide tracks (no barriers) could run indefinitely,
inflating segment_reward to 200k+. Replaced with a manual eval loop
capped at MAX_EVAL_STEPS=3000 steps (see the eval-loop sketch below).
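Sketch of the short-lap guard from item 1 (assumes gym-donkeycar exposes
lap_count and last_lap_time in the step info dict and the Gymnasium
five-tuple step API; the real wrapper also keeps the 60-step window and
other shaping terms):

    import gymnasium as gym

    class SpeedRewardWrapper(gym.Wrapper):
        MIN_LAP_TIME = 5.0  # laps faster than this are physically implausible

        def __init__(self, env):
            super().__init__(env)
            self._prev_lap_count = 0

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            lap_count = info.get("lap_count", self._prev_lap_count)
            if lap_count > self._prev_lap_count:
                lap_time = info.get("last_lap_time", float("inf"))
                if lap_time < self.MIN_LAP_TIME:
                    # Circling the start/finish line: penalty scales with how
                    # implausibly fast the "lap" was (a 1 s lap gives -50).
                    reward = -10.0 * (self.MIN_LAP_TIME / lap_time)
            self._prev_lap_count = lap_count
            return obs, reward, terminated, truncated, info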
Phase 4 results cleared (trials 4-6 ran with exploitable reward).
Tests: 4 new reward wrapper tests, 100 total passing.
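Sketch of the capped eval loop from item 2 (MAX_EVAL_STEPS is from this
commit; the function name and Gymnasium reset/step signatures are
illustrative):

    MAX_EVAL_STEPS = 3000  # hard cap so wide, barrier-free tracks cannot run forever

    def evaluate_segment(model, env, deterministic=True):
        obs, _ = env.reset()
        total_reward, steps = 0.0, 0
        while steps < MAX_EVAL_STEPS:
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += float(reward)
            steps += 1
            if terminated or truncated:
                break
        return total_reward, steps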
Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
Fresh (scratch-trained) models are now created with verbose=1.
Without this, Wave 4 scratch-trained models produce no rollout stats in
the log, making it impossible to monitor training progress or spot
degenerate policies early.
Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.
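Minimal sketch of that change, assuming fresh models are built in
create_or_load_model() with a CNN policy (both assumptions; only the
verbose flag matters here):

    from stable_baselines3 import PPO

    def make_fresh_model(env, lr):
        # Fresh PPO defaults to verbose=0; verbose=1 makes SB3 print the
        # rollout/ep_rew_mean block each rollout, as warm-started models did.
        return PPO("CnnPolicy", env, learning_rate=lr, verbose=1)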
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
Strategy change driven by Trial 1 data analysis:
- generated_road removed: too similar to generated_track, and Phase-2
warm-start caused catastrophic forgetting (reward 2388→37 in one rotation)
- mountain_track mean reward was only 17 — model never converged there
- mini_monaco score 24.9 (37 steps) — model was outputting degenerate actions
Wave 4 approach:
- NO warm-start: fresh random weights every trial
- Train: generated_track + mountain_track (visually distinct backgrounds,
both have road markings — forces model to learn general mark-following)
- Test (zero-shot): mini_monaco only (never seen during training)
- Wider LR search: [1e-4, 2e-3] (scratch model needs different range)
- Larger step budgets: 60k-250k total (fresh model needs more time)
- Seed params: lr=0.0003 and lr=0.001 (diverse from the start)
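Sketch of the Wave 4 configuration the bullets above describe
(TRAINING_TRACKS/TEST_TRACKS appear elsewhere in this log; the
SEARCH_SPACE/SEED_PARAMS names and seed step budgets are illustrative):

    TRAINING_TRACKS = ["generated_track", "mountain_track"]  # distinct looks, both marked
    TEST_TRACKS = ["mini_monaco"]                            # zero-shot, never trained on

    SEARCH_SPACE = {
        "learning_rate": (1e-4, 2e-3),          # wider range for scratch training
        "total_timesteps": (60_000, 250_000),   # larger budget for fresh weights
    }

    SEED_PARAMS = [
        {"learning_rate": 0.0003},  # seed step budgets not stated in this commit
        {"learning_rate": 0.001},
    ]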
Files:
- multitrack_runner.py: 2 training tracks, no warm-start auto-detection
- wave4_controller.py: new Wave 4 GP+UCB controller (rough sketch below)
- tests updated: TRAINING_TRACKS assertion, seed param tests → wave4
- 96 tests passing
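Rough illustration of the GP+UCB proposal step referenced in the
wave4_controller.py bullet (the commit names the approach, not this code;
scikit-learn, the kernel choice, and kappa are assumptions):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def propose_next(trials, bounds, kappa=2.0, n_candidates=500, seed=0):
        """trials: list of ((lr, steps), reward); bounds: [(lo, hi), (lo, hi)]."""
        rng = np.random.default_rng(seed)
        X = np.array([t[0] for t in trials], dtype=float)
        y = np.array([t[1] for t in trials], dtype=float)
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)
        lo = np.array([b[0] for b in bounds])
        hi = np.array([b[1] for b in bounds])
        cand = rng.uniform(lo, hi, size=(n_candidates, len(bounds)))
        mu, sigma = gp.predict(cand, return_std=True)
        return cand[np.argmax(mu + kappa * sigma)]  # UCB: prefer high mean + high uncertainty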
ADR-013 to follow.
Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
PPO.load() bakes lr_schedule=FloatSchedule(saved_lr) into the model.
train() calls _update_learning_rate() which reads lr_schedule, not
model.learning_rate. So even with param_groups patched, the first
gradient step reverts the optimizer to the saved LR.
Complete 3-part fix in create_or_load_model():
    model.learning_rate = lr                       # attribute
    model.lr_schedule = get_schedule_fn(lr)        # prevents train() reverting
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr                              # immediate effect
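The same three lines in context, as a standalone sketch of the warm-start
branch (function name and signature are illustrative; PPO.load and
get_schedule_fn are the SB3 APIs this commit relies on):

    from stable_baselines3 import PPO
    from stable_baselines3.common.utils import get_schedule_fn

    def load_with_lr_override(path, env, lr):
        """Warm-start from a saved PPO model but force a new learning rate."""
        model = PPO.load(path, env=env)
        # PPO.load() rebuilds lr_schedule from the SAVED learning rate, so all
        # three pieces of state must be overridden or train() reverts the LR.
        model.learning_rate = lr                    # the attribute
        model.lr_schedule = get_schedule_fn(lr)     # what _update_learning_rate() reads
        for pg in model.policy.optimizer.param_groups:
            pg["lr"] = lr                           # immediate effect on the optimizer
        return model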
Also:
- SEED_PARAMS: second seed now uses LR=0.001 (was 0.000225) so GP
starts with real LR diversity instead of two identical seeds
- tests/test_end_to_end.py: 13 new tests covering the full LR override
path including a live learn() call; would have caught both bugs
- Phase 3 results re-cleared (seed trial 1 ran with half-fix)
- 96 tests total, all passing
Agent: pi
Tests: 96 passed
Tests-Added: 13
TypeScript: N/A
PPO.load() restores the saved optimizer state (lr=0.000225 from Phase 2
champion). Setting model.learning_rate alone is insufficient because
_update_learning_rate() may not fire before the first gradient step, and
the optimizer's param_groups still hold the old value.
Fix: after PPO.load(), explicitly set lr on every optimizer param_group:
    model.learning_rate = lr
    for pg in model.policy.optimizer.param_groups:
        pg['lr'] = lr
Impact: all 8 previous Wave 3 trials actually trained at LR=0.000225
regardless of GP proposal. Results archived as:
autoresearch_results_phase3_CONTAMINATED_wrong_lr.jsonl
Phase 3 results cleared; autoresearch restarting from scratch.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Warren track surface is green carpet (not outdoor road), and the
episode-done condition (|CTE| > max_cte) does not fire when the car
crosses the INSIDE boundary. Car can drive off-track and bump into
chairs indefinitely, making scores meaningless as a test metric.
Changes:
- multitrack_runner.py: TEST_TRACKS now mini_monaco only
- wave3_controller.py: drop warren_reward from parse/save/champion paths
- tests/test_wave3.py: update assertions to match single test track
- All 83 tests pass
Track classification (final):
TRAIN : generated_road, generated_track, mountain_track
TEST : mini_monaco (outdoor, proper road, correct done condition)
SKIP : warren, warehouse, robo_racing_league, waveshare, circuit_launch
SKIP : avc_sparkfun (orange markings)
ADR-010 to be updated.
Agent: pi
Tests: 83 passed
Tests-Added: 0
TypeScript: N/A
Bug: send_exit_scene_raw() opened a NEW TCP connection, creating a second
phantom vehicle. The exit_scene therefore applied to the phantom, leaving
the real training connection stuck on generated_road for the entire run.
Fix: _send_exit_scene() now calls env.unwrapped.viewer.exit_scene() on the
EXISTING TCP connection that the training env already holds. This is the
only reliable way to switch scenes mid-session (matches track_switcher.py).
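Sketch of the fixed helper (the env.unwrapped.viewer.exit_scene() call is
the one this commit describes; any retry/error handling in the real file
is omitted):

    def _send_exit_scene(env):
        # Leave the current scene over the env's OWN connection. Opening a fresh
        # raw socket (old send_exit_scene_raw) registers a phantom vehicle and the
        # exit_scene applies to that phantom instead of the live training session.
        env.unwrapped.viewer.exit_scene()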
Also:
- Removed send_exit_scene_raw() import from multitrack_runner.py
- Simplified initial connection (no spurious exit_scene at startup)
- Reduced search space: total_timesteps 80k-400k -> 30k-150k
- Reduced seed params: 150k/300k -> 45k/90k (~35-45 min per trial)
- Added test: test_close_and_switch_uses_viewer_not_raw_socket
83 tests passing
Agent: pi
Tests: 83 passed
Tests-Added: 1
TypeScript: N/A