donkeycar-rl-autoresearch

Commit Graph

Author	SHA1	Message	Date
Paul Huliganga	c5c4ca658e	chore(exp22): update wedgefix run log — training stopped for strategy rethink Run stopped at ~34k steps. ep_len_mean frozen at 118 due to MAX_EPISODE_SECONDS=18 cap. Barriers identified as zero-thickness MeshColliders (physics tunneling root cause). Clean-slate rebuild planned: BoxCollider barriers + CCD on car + simplified reward. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 15:36:18 -04:00
Paul Huliganga	138c65270f	feat(exp22): add solid-hit/wedge/high-CTE exploit fixes and generated-pair warm experiments - reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact or 4 sustained solid-hit frames; prevents car wedging against invisible barriers - reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned motionless (below threshold, no displacement) after grace period - reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE > max_cte_terminate (not after patience), so PPO cannot collect positive speed rewards while driving the large outside-road circle - tests: 23 passing unit tests covering all new termination paths - exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track with warm-start from champion model; exp22 is current active run - SESSION_HANDOFF.md: live handoff doc for next session continuity Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 14:46:13 -04:00
Paul Huliganga	0da04327ef	docs: capture robust mountain finetune winner at 36k and preserve eval comparison	2026-04-20 00:43:27 -04:00
Paul Huliganga	1be95b7c82	wave3: autoresearch trial 5 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-15 07:15:57 -04:00
Paul Huliganga	2a747bb97c	wave3: autoresearch trial 5 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-14 18:22:44 -04:00
Paul Huliganga	349396f967	fix: stream runner output in real-time instead of buffering Replace subprocess.run(capture_output=True) with Popen + line-by-line iteration so every line from multitrack_runner.py appears in the nohup log immediately rather than only after the trial completes (~35-90 min). - stdout/stderr merged via stderr=STDOUT - line-buffered (bufsize=1) - deadline-based timeout replaces subprocess timeout kwarg - output accumulated in list for parse_runner_output() as before Agent: pi Tests: 30 passed Tests-Added: 0 TypeScript: N/A	2026-04-14 15:13:10 -04:00
Paul Huliganga	e68d618d29	feat: Phase 3 — behavioral control, enhanced evaluator, 53 tests PHASE 2 MILESTONE DOCUMENTED: All 3 top models complete the full track with distinct driving styles: - Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION ✅ - Trial 8 (n_steer=4): Left/center lane, oscillating (still completes!) - Trial 18 (n_steer=3): Right shoulder, very accurate line following Key finding: fewer steering bins (n_steer=3) = better driving (counterintuitive) CTE symmetry explains left/right preference: random NN init determines which side BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py): - LanePositionWrapper: target a specific CTE offset (control left/right preference) - AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 oscillation) - AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more) - CombinedBehavioralWrapper: all three combined in one wrapper ENHANCED EVALUATOR (agent/evaluate_champion.py): - Full metrics: reward, lap time, oscillation score, CTE distribution, lane position - --compare flag: runs all top Phase 2 models side by side with comparison table - Saves eval summary to outerloop-results/eval_summary.jsonl - Detects lap completion events from sim info dict IMPLEMENTATION PLAN updated: Wave 3 streams defined RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps Champion updated to Trial 20 (Phase 2) Agent: pi/claude-sonnet Tests: 53/53 passing (+13 behavioral wrapper tests) Tests-Added: +13 TypeScript: N/A	2026-04-14 09:28:43 -04:00
Paul Huliganga	d25bc71008	autoresearch: phase1 trial 10 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-13 13:11:06 -04:00
Paul Huliganga	5e93dae316	fix: hack-proof reward shaping + reward hacking detection + research log CRITICAL BUG FIX — Reward Hacking: - Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating at track boundary regardless of on-track behavior (trials 8+13 hit 1936+1139) - New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track - Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked - Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track Reward Hacking Auto-Detection: - check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking - Flagged results are excluded from GP fitting (won't optimize toward hacking params) - reward_hacking_suspected field added to JSONL result records Research Documentation: - docs/RESEARCH_LOG.md created: full chronological research history - Random policy bug discovery and impact - Throttle clamp fix - Reward hacking discovery with evidence table - Hack-proof design rationale - Lessons learned + future research questions - Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl - Archived hacked models: models/ARCHIVED_reward_hacking/ Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset Agent: pi/claude-sonnet Tests: 40/40 passing Tests-Added: +9 (reward wrapper hack-proof tests) TypeScript: N/A	2026-04-13 12:27:48 -04:00
Paul Huliganga	0c6263352b	autoresearch: phase1 trial 10 results Agent: pi Tests: N/A Tests-Added: 0 TypeScript: N/A	2026-04-13 12:01:17 -04:00
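
Note on 349396f967: the switch from subprocess.run(capture_output=True) to Popen with line-by-line iteration can be sketched roughly as below. This is a minimal illustration built only from the commit message, not the repository's actual code; the function name run_streaming and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def run_streaming(cmd, timeout_s):
    """Run cmd, echo each output line as it arrives, return (returncode, lines).

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    deadline = time.monotonic() + timeout_s
    lines = []  # accumulated output, so a parser can still see the full text later
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the stdout stream
        text=True,
        bufsize=1,  # line-buffered (only meaningful in text mode)
    )
    try:
        for line in proc.stdout:
            # Forward the line immediately so it reaches the log without delay.
            print(line, end="", flush=True)
            lines.append(line)
            # Deadline check replaces the timeout kwarg of subprocess.run.
            # Limitation of this sketch: it only fires when a line arrives,
            # so a child that goes completely silent is not interrupted.
            if time.monotonic() > deadline:
                proc.kill()
                break
    finally:
        proc.stdout.close()
        proc.wait()
    return proc.returncode, lines
```

A call such as run_streaming([sys.executable, "-u", "child_script.py"], timeout_s=3600) would print each line as the child emits it and return the collected lines; child_script.py is a placeholder name. The child must also flush its own output (for Python children, the -u flag), otherwise lines can still arrive in bursts.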

10 Commits
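
Note on 5e93dae316: the reward formula and the hacking check described in that commit message can be sketched as follows. This is an illustration under stated assumptions, not the repository's code. The function name shaped_reward, both signatures, and the clamping of negative speed are assumptions; only the formula original_reward × (1 + speed_scale × speed), the on-track condition (original_reward > 0), the name check_for_reward_hacking, and the 3.0 reward/step threshold come from the commit message.

```python
def shaped_reward(original_reward, speed, speed_scale):
    """Speed bonus applied only while on track (hypothetical signature).

    Off track is taken to mean original_reward <= 0, as in the commit message.
    In that case the reward passes through unchanged, so speed cannot raise it.
    """
    if original_reward <= 0:
        return original_reward
    # Assumption: speed is treated as a non-negative magnitude.
    bonus = speed_scale * max(speed, 0.0)
    return original_reward * (1.0 + bonus)


def check_for_reward_hacking(total_reward, total_steps, threshold=3.0):
    """Flag a result whose mean reward per step exceeds the threshold.

    The signature is an assumption; the 3.0 reward/step threshold is from
    the commit message.
    """
    if total_steps <= 0:
        return False
    return (total_reward / total_steps) > threshold
```

With this shape, going fast off track leaves a non-positive reward unchanged, which is the property the commit's test_cannot_hack_by_going_fast_off_track test is named for. A flagged result would then carry reward_hacking_suspected in its JSONL record and be left out of GP fitting, as the commit describes.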