donkeycar-rl-autoresearch

Commit Graph

Author	SHA1	Message	Date
Paul Huliganga	1d53bf613f	feat(exp29): fine-tune wave4-trial-0009 on generated track (continuous actions) Warm-starts from wave4-trial-0009/model.zip (best mini-monaco model, completed laps). Fine-tunes on generated track with continuous Box action space preserved (no DiscretizedActionWrapper) at LR=0.00005. 50k steps, checkpoint every 5k, zero-shot mini-monaco eval at end. Tests whether additional generated-track exposure improves corner handling on mini-monaco without catastrophic forgetting of driving skill. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 15:32:43 -04:00
Paul Huliganga	ee91b8f9a3	feat(exp28): fine-tune exp26 best_model on generated-track with variable throttle Warm-starts from exp26/best_model (best road model) and fine-tunes on donkey-generated-track-v0 (shadows, trees) at LR=0.00005. Adds N_THROTTLE=3 variable throttle to force learning corner braking. 50k steps, eval on mini-monaco (zero-shot) at completion. Goal: visual diversity + throttle variation → better mini-monaco generalization. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 15:32:37 -04:00
Paul Huliganga	36be93e357	feat(exp27): random roads with variable throttle + road regen + self-intersection fix Fixes four root-cause bugs discovered before/during this experiment: 1. regen_road was silently doing nothing: TcpCarHandler.RegenRoad() bailed on null TrainingManager; added direct RoadBuilder+PathManager fallback. 2. MapOverlay minimap not refreshing: fixed to check node[10] position change. 3. BrakeOnUpdateCallback: sends zero control before PPO gradient updates to prevent car drifting during 3-8s CPU pause. 4. PathManager self-intersection fix: retry loop with XZ segment-segment math (up to 20 retries), giving verifiably different roads per seed. Exp27 trains fresh weights with N_THROTTLE=3 (bins 0.2/0.5/1.0), ent_coef=0.05, 500k steps, regen_road TCP message per checkpoint. Peak: 462.7r/1580 steps @110k. Also adds verify_minimap_refresh.py and verify_road_regen.py diagnostic scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 15:32:32 -04:00
Paul Huliganga	0615b22cb9	feat(eval): cross-model evaluation scripts for exp24/25/26 + gentrack→minimonaco eval_best_models.py: evaluates exp24/25/26 best models across 10 fixed random roads (regen_road with fixed seeds) for fair head-to-head comparison. eval_gentrack_on_minimonaco.py: zero-shot evaluation of gentrack specialists (exp13, wave5-gentrack-only, wave4-trial-0009) on mini-monaco. Results: exp26 > exp25 > exp24 on random roads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 15:32:21 -04:00
Paul Huliganga	8de4838c6b	feat(exp26): warm-start training from exp25 best_model (300k steps) Loads exp25 best_model (381r @ 80k) to skip early exploration. Runs 300k steps on generated_road with road regen every 10k steps. Python-side hit check is now active (added late in exp25, not loaded then). Final cross-model eval: exp26 best (9/10 full eps, 381.2r mean) — top performer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 15:32:16 -04:00
Paul Huliganga	75f7857250	chore(exp23): launched — clean barriers verified, training started Exp 23 running PID 647921 on generated_road:9091. - Barriers visually confirmed by Paul (3D box barriers, both sides, end caps visible) - Unity build synced to runtime folders - Fresh PPO, 200k steps, v7 clean reward Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 16:04:21 -04:00
Paul Huliganga	c5c4ca658e	chore(exp22): update wedgefix run log — training stopped for strategy rethink Run stopped at ~34k steps. ep_len_mean frozen at 118 due to MAX_EPISODE_SECONDS=18 cap. Barriers identified as zero-thickness MeshColliders (physics tunneling root cause). Clean-slate rebuild planned: BoxCollider barriers + CCD on car + simplified reward. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 15:36:18 -04:00
Paul Huliganga	138c65270f	feat(exp22): add solid-hit/wedge/high-CTE exploit fixes and generated-pair warm experiments - reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact or 4 sustained solid-hit frames; prevents car wedging against invisible barriers - reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned motionless (below threshold, no displacement) after grace period - reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE > max_cte_terminate (not after patience), so PPO cannot collect positive speed rewards while driving the large outside-road circle - tests: 23 passing unit tests covering all new termination paths - exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track with warm-start from champion model; exp22 is current active run - SESSION_HANDOFF.md: live handoff doc for next session continuity Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 14:46:13 -04:00
Paul Huliganga	0da04327ef	docs: capture robust mountain finetune winner at 36k and preserve eval comparison	2026-04-20 00:43:27 -04:00
Paul Huliganga	1be95b7c82	wave3: autoresearch trial 5 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-15 07:15:57 -04:00
Paul Huliganga	2a747bb97c	wave3: autoresearch trial 5 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-14 18:22:44 -04:00
Paul Huliganga	349396f967	fix: stream runner output in real-time instead of buffering Replace subprocess.run(capture_output=True) with Popen + line-by-line iteration so every line from multitrack_runner.py appears in the nohup log immediately rather than only after the trial completes (~35-90 min). - stdout/stderr merged via stderr=STDOUT - line-buffered (bufsize=1) - deadline-based timeout replaces subprocess timeout kwarg - output accumulated in list for parse_runner_output() as before Agent: pi Tests: 30 passed Tests-Added: 0 TypeScript: N/A	2026-04-14 15:13:10 -04:00
Paul Huliganga	e68d618d29	feat: Phase 3 — behavioral control, enhanced evaluator, 53 tests PHASE 2 MILESTONE DOCUMENTED: All 3 top models complete the full track with distinct driving styles: - Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION ✅ - Trial 8 (n_steer=4): Left/center lane, oscillating (still completes!) - Trial 18 (n_steer=3): Right shoulder, very accurate line following Key finding: fewer steering bins (n_steer=3) = better driving (counterintuitive) CTE symmetry explains left/right preference: random NN init determines which side BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py): - LanePositionWrapper: target a specific CTE offset (control left/right preference) - AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 oscillation) - AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more) - CombinedBehavioralWrapper: all three combined in one wrapper ENHANCED EVALUATOR (agent/evaluate_champion.py): - Full metrics: reward, lap time, oscillation score, CTE distribution, lane position - --compare flag: runs all top Phase 2 models side by side with comparison table - Saves eval summary to outerloop-results/eval_summary.jsonl - Detects lap completion events from sim info dict IMPLEMENTATION PLAN updated: Wave 3 streams defined RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps Champion updated to Trial 20 (Phase 2) Agent: pi/claude-sonnet Tests: 53/53 passing (+13 behavioral wrapper tests) Tests-Added: +13 TypeScript: N/A	2026-04-14 09:28:43 -04:00
Paul Huliganga	d25bc71008	autoresearch: phase1 trial 10 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-13 13:11:06 -04:00
Paul Huliganga	5e93dae316	fix: hack-proof reward shaping + reward hacking detection + research log CRITICAL BUG FIX — Reward Hacking: - Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating at track boundary regardless of on-track behavior (trials 8+13 hit 1936+1139) - New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track - Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked - Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track Reward Hacking Auto-Detection: - check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking - Flagged results are excluded from GP fitting (won't optimize toward hacking params) - reward_hacking_suspected field added to JSONL result records Research Documentation: - docs/RESEARCH_LOG.md created: full chronological research history - Random policy bug discovery and impact - Throttle clamp fix - Reward hacking discovery with evidence table - Hack-proof design rationale - Lessons learned + future research questions - Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl - Archived hacked models: models/ARCHIVED_reward_hacking/ Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset Agent: pi/claude-sonnet Tests: 40/40 passing Tests-Added: +9 (reward wrapper hack-proof tests) TypeScript: N/A	2026-04-13 12:27:48 -04:00
Paul Huliganga	0c6263352b	autoresearch: phase1 trial 10 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-13 12:01:17 -04:00
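
Note on commit 5e93dae316: the reward rule described there (speed bonus applied only while on track, plus a reward-per-step check for suspected hacking) can be sketched roughly as below. This is an illustrative reconstruction from the commit message, not the repository's reward wrapper. The function names, the default speed_scale of 0.1, the clamping of negative speed, and the choice to pass a non-positive reward through unchanged are all assumptions. Only the formula shape and the 3.0 reward/step threshold come from the log.

```python
def shaped_reward(original_reward: float, speed: float, speed_scale: float = 0.1) -> float:
    """Scale the simulator reward by speed, but only while on track.

    Assumption: "off track" is signalled by original_reward <= 0, as the
    commit message states; in that case no speed bonus is applied.
    """
    if original_reward <= 0.0:
        # Off track: return the reward unchanged, so speed cannot add to it.
        return original_reward
    # Assumption: negative (reverse) speed is clamped so it cannot flip the sign.
    bonus = speed_scale * max(speed, 0.0)
    return original_reward * (1.0 + bonus)


def reward_hacking_suspected(total_reward: float, n_steps: int, threshold: float = 3.0) -> bool:
    """Flag a result whose mean reward per step exceeds the threshold."""
    if n_steps <= 0:
        return False  # nothing to judge on an empty episode
    return total_reward / n_steps > threshold
```

With this shape, a fast car that is off track gains nothing from its speed, which is the property the commit's test_cannot_hack_by_going_fast_off_track test name points at.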

16 Commits
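
Note on commit 349396f967: the change from subprocess.run(capture_output=True) to Popen with line-by-line iteration might look something like the sketch below. It follows the points listed in the commit message (stderr merged into stdout, bufsize=1, a deadline instead of the timeout kwarg, lines accumulated in a list). The function name run_streaming and its return shape are hypothetical, and the actual runner code may differ.

```python
import subprocess
import sys
import time


def run_streaming(cmd, timeout_s):
    """Run cmd, echoing each output line as it arrives; return (returncode, lines)."""
    deadline = time.monotonic() + timeout_s
    lines = []
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,                 # line-buffered on the parent's side (text mode)
    )
    try:
        for line in proc.stdout:
            sys.stdout.write(line)  # shows up in the log immediately
            sys.stdout.flush()
            lines.append(line)      # kept for later parsing
            # The deadline is only checked when a line arrives, so a child
            # that goes silent is not interrupted by this loop alone.
            if time.monotonic() > deadline:
                proc.kill()
                raise TimeoutError(f"command exceeded {timeout_s}s")
        returncode = proc.wait()
    finally:
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()
    return returncode, lines
```

If the child is itself a Python script, it may also need to run unbuffered (for example with python -u) for its lines to arrive promptly; bufsize=1 only affects the parent's end of the pipe.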