Warm-starts from wave4-trial-0009/model.zip (best mini-monaco model; completes
laps). Fine-tunes on the generated track at LR=0.00005, keeping the continuous Box
action space (no DiscretizedActionWrapper). 50k steps, checkpoint every 5k,
zero-shot mini-monaco eval at the end.
Tests whether additional generated-track exposure improves corner handling on
mini-monaco without catastrophic forgetting of driving skill.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes four root-cause bugs discovered before/during this experiment:
1. regen_road was silently doing nothing: TcpCarHandler.RegenRoad() returned early
when TrainingManager was null; added a direct RoadBuilder+PathManager fallback.
2. MapOverlay minimap was not refreshing; it now detects a change in node[10]'s position.
3. BrakeOnUpdateCallback: sends a zero control command before PPO gradient updates so
the car does not drift during the 3-8 s CPU pause.
4. PathManager self-intersection fix: retry loop with an XZ-plane segment-segment
intersection test (up to 20 retries); produces verifiably different roads per seed.
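A minimal sketch of the self-intersection retry in fix 4, assuming the road is a polyline of (x, z) points with height ignored. The helper names (`segments_intersect`, `is_self_intersecting`, `generate_road`) and the random-walk generator `random_walk_path` are hypothetical stand-ins for PathManager's actual code; collinear overlaps are ignored for brevity, and returning None after the last retry is an assumed fallback.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, z) in the ground plane; height (y) is ignored


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc: >0 counter-clockwise, <0 clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_intersect(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 properly cross (collinear touching ignored)."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_self_intersecting(path: List[Point]) -> bool:
    """Check every pair of non-adjacent segments of the polyline."""
    n_segments = len(path) - 1
    for i in range(n_segments):
        # Start at i + 2: adjacent segments share an endpoint and never properly cross.
        for j in range(i + 2, n_segments):
            if segments_intersect(path[i], path[i + 1], path[j], path[j + 1]):
                return True
    return False


def random_walk_path(rng: random.Random, n_points: int = 30,
                     step: float = 5.0, max_turn: float = 0.3) -> List[Point]:
    """Hypothetical road generator: a random walk with bounded heading changes."""
    x, z, heading = 0.0, 0.0, 0.0
    path = [(x, z)]
    for _ in range(n_points - 1):
        heading += rng.uniform(-max_turn, max_turn)
        x += step * math.cos(heading)
        z += step * math.sin(heading)
        path.append((x, z))
    return path


def generate_road(seed: int,
                  make_path: Callable[[random.Random], List[Point]],
                  max_retries: int = 20) -> Optional[List[Point]]:
    """Seeded generation with retries; returns None if every attempt self-intersects."""
    rng = random.Random(seed)  # one RNG per seed, so the retry sequence is deterministic
    for _ in range(1 + max_retries):
        path = make_path(rng)
        if not is_self_intersecting(path):
            return path
    return None
```

Because the RNG is seeded once and then consumed by each retry, the same seed always yields the same accepted road, while different seeds yield different ones.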
Exp27 trains from fresh weights with N_THROTTLE=3 (bins 0.2/0.5/1.0), ent_coef=0.05,
500k steps, and a regen_road TCP message at each checkpoint. Peak: 462.7 reward /
1580 steps at 110k steps.
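One possible flat encoding of a discretized action space, sketched for illustration. Only the throttle bins (0.2/0.5/1.0) come from this commit; the evenly spaced steering values, the default n_steer=3, and the `decode_action` index layout are assumptions, not the project's actual action wrapper.

```python
from typing import List, Sequence, Tuple

THROTTLE_BINS: Tuple[float, ...] = (0.2, 0.5, 1.0)  # N_THROTTLE=3 bins from Exp27


def steer_bins(n_steer: int) -> List[float]:
    """Assumed layout: n_steer steering values evenly spaced over [-1, 1]."""
    if n_steer < 2:
        raise ValueError("n_steer must be at least 2")
    return [-1.0 + 2.0 * i / (n_steer - 1) for i in range(n_steer)]


def decode_action(index: int, n_steer: int = 3,
                  throttle_bins: Sequence[float] = THROTTLE_BINS) -> Tuple[float, float]:
    """Map a flat Discrete(n_steer * n_throttle) index to (steering, throttle)."""
    n_throttle = len(throttle_bins)
    if not 0 <= index < n_steer * n_throttle:
        raise ValueError(f"action index {index} out of range")
    # Row-major layout: each steering value is paired with every throttle bin.
    steer_idx, throttle_idx = divmod(index, n_throttle)
    return steer_bins(n_steer)[steer_idx], throttle_bins[throttle_idx]
```

With these bins no action has zero throttle, so under this encoding the policy cannot choose to coast.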
Also adds verify_minimap_refresh.py and verify_road_regen.py diagnostic scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_best_models.py: evaluates exp24/25/26 best models across 10 fixed random
roads (regen_road with fixed seeds) for fair head-to-head comparison.
eval_gentrack_on_minimonaco.py: zero-shot evaluation of gentrack specialists
(exp13, wave5-gentrack-only, wave4-trial-0009) on mini-monaco.
Results: exp26 > exp25 > exp24 on random roads.
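A sketch of the fixed-seed comparison idea behind eval_best_models.py: every model is evaluated on the same list of seeds, so each one sees identical roads. `run_episode` is a hypothetical callable (in the real script it would presumably send regen_road with the seed and roll out one episode), and ranking by full-episode count before mean reward is an assumption.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: run_episode(model_name, seed) -> (total_reward, completed_full_episode)
RunEpisode = Callable[[str, int], Tuple[float, bool]]


def compare_models(model_names: Sequence[str], seeds: Sequence[int],
                   run_episode: RunEpisode) -> Dict[str, dict]:
    """Evaluate every model on the same seeded roads."""
    results: Dict[str, dict] = {}
    for name in model_names:
        rewards: List[float] = []
        full_episodes = 0
        for seed in seeds:  # identical seed list => identical roads for every model
            reward, completed = run_episode(name, seed)
            rewards.append(reward)
            full_episodes += int(completed)
        results[name] = {
            "mean_reward": statistics.fmean(rewards),
            "full_episodes": full_episodes,
            "n_roads": len(seeds),
        }
    return results


def rank_models(results: Dict[str, dict]) -> List[str]:
    """Assumed ordering: more full episodes first, then higher mean reward."""
    return sorted(results,
                  key=lambda n: (results[n]["full_episodes"], results[n]["mean_reward"]),
                  reverse=True)
```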
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Loads the exp25 best_model (381 reward at 80k steps) to skip early exploration. Runs
300k steps on generated_road with road regen every 10k steps. The Python-side hit
check is now active (it was added late in exp25 and not loaded during that run).
Final cross-model eval: exp26 best is the top performer (9/10 full episodes, 381.2
mean reward).
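Road regen every 10k steps can be expressed as a small step-count schedule. This is a plain-Python sketch with a hypothetical `send_regen` callable and an assumed per-regen seed counter; in Stable-Baselines3 the same check would sit in a `BaseCallback` subclass's `_on_step`, reading `self.num_timesteps` and returning True to keep training.

```python
from typing import Callable


class RoadRegenSchedule:
    """Request a road regeneration every `every` environment steps."""

    def __init__(self, send_regen: Callable[[int], None],
                 every: int = 10_000, base_seed: int = 0) -> None:
        self.send_regen = send_regen  # hypothetical: e.g. sends the regen_road message
        self.every = every
        self.base_seed = base_seed
        self._next = every
        self._regen_count = 0

    def on_step(self, num_timesteps: int) -> bool:
        """Call once per training step; returns True when a regen was requested."""
        if num_timesteps < self._next:
            return False
        self._regen_count += 1
        self.send_regen(self.base_seed + self._regen_count)  # assumed seed scheme
        # Skip past every boundary already crossed; with n vectorized envs,
        # num_timesteps can advance by n per call.
        while self._next <= num_timesteps:
            self._next += self.every
        return True
```

Advancing `_next` in a loop rather than by a single interval keeps the schedule from firing twice in a row when the step counter jumps past a boundary.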
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run stopped at ~34k steps. ep_len_mean was frozen at 118 by the MAX_EPISODE_SECONDS=18
cap. Barriers were identified as zero-thickness MeshColliders, the root cause of the
physics tunneling. Clean-slate rebuild planned: BoxCollider barriers + continuous
collision detection (CCD) on the car + simplified reward.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reward_wrapper: detect barrier/wall/tree solid hits; terminate on a head-on impact
or after 4 sustained solid-hit frames, which prevents the car from wedging against
invisible barriers
- reward_wrapper: add low-speed/wedge termination, which ends the episode when the car
is pinned motionless (speed below threshold, no displacement) after a grace period
- reward_wrapper: fix the high-CTE exploit by returning -0.25 immediately when CTE >
max_cte_terminate (rather than after the patience window), so PPO cannot collect
positive speed rewards by driving the large circle outside the road
- tests: 23 passing unit tests covering all new termination paths
- exp20/21/22: add vectorized DummyVecEnv experiments on generated_road + generated_track,
warm-started from the champion model; exp22 is the currently active run
- SESSION_HANDOFF.md: live handoff document for continuity into the next session
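The three termination paths above can be sketched as a single per-step checker. Only the -0.25 high-CTE reward, the 4-frame hit streak, and the CTE > max_cte_terminate condition come from this commit; every other threshold, reading "sustained" as consecutive frames, and the `TerminationChecker` interface itself are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TerminationConfig:
    max_cte_terminate: float = 5.0   # hypothetical value
    high_cte_reward: float = -0.25   # from the commit message
    solid_hit_frames: int = 4        # "4 sustained solid-hit frames"
    min_speed: float = 0.5           # hypothetical, m/s
    min_displacement: float = 0.05   # hypothetical, metres per step
    grace_steps: int = 50            # hypothetical: no wedge check right after reset
    wedge_patience: int = 30         # hypothetical: consecutive pinned steps allowed


class TerminationChecker:
    """Per-step termination logic.

    check() returns (done, reward_override, reason); reward_override=None means
    "keep the normal step reward" (an assumption).
    """

    def __init__(self, cfg: TerminationConfig) -> None:
        self.cfg = cfg
        self.reset()

    def reset(self) -> None:
        self.step_count = 0
        self.hit_streak = 0
        self.still_steps = 0

    def check(self, cte: float, speed: float, displacement: float,
              solid_hit: bool, head_on: bool) -> Tuple[bool, Optional[float], str]:
        cfg = self.cfg
        self.step_count += 1
        # High-CTE exploit fix: terminate immediately, not after a patience window.
        if abs(cte) > cfg.max_cte_terminate:
            return True, cfg.high_cte_reward, "high_cte"
        # Solid hits: a head-on impact ends the episode at once; otherwise count the streak.
        self.hit_streak = self.hit_streak + 1 if solid_hit else 0
        if solid_hit and head_on:
            return True, None, "head_on"
        if self.hit_streak >= cfg.solid_hit_frames:
            return True, None, "sustained_hit"
        # Wedge detection, only once the grace period has passed.
        pinned = speed < cfg.min_speed and displacement < cfg.min_displacement
        if self.step_count > cfg.grace_steps and pinned:
            self.still_steps += 1
        else:
            self.still_steps = 0
        if self.still_steps >= cfg.wedge_patience:
            return True, None, "wedged"
        return False, None, ""
```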
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace subprocess.run(capture_output=True) with Popen + line-by-line
iteration so every line from multitrack_runner.py appears in the nohup
log immediately rather than only after the trial completes (~35-90 min).
- stdout/stderr merged via stderr=STDOUT
- line-buffered (bufsize=1)
- deadline-based timeout replaces subprocess timeout kwarg
- output accumulated in list for parse_runner_output() as before
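A minimal sketch of the streaming runner described above; the function name `run_streaming` is hypothetical. Two caveats apply to this pattern: the deadline is only checked when a line arrives, so a child that hangs without printing still blocks the loop; and `bufsize=1` only line-buffers the parent's end of the pipe, so a Python child such as multitrack_runner.py also needs `-u` or `PYTHONUNBUFFERED=1`, because Python block-buffers stdout when it is a pipe.

```python
import subprocess
import sys
import time
from typing import List, Optional, Sequence, Tuple


def run_streaming(cmd: Sequence[str], timeout_s: float) -> Tuple[Optional[int], List[str]]:
    """Run cmd, echoing each merged stdout/stderr line as it arrives.

    Returns (returncode, lines); returncode is None if the deadline passed.
    """
    deadline = time.monotonic() + timeout_s
    lines: List[str] = []
    timed_out = False
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # lands in the nohup log right away
            sys.stdout.flush()
            lines.append(line)      # kept for later parsing of the runner output
            # Only checked when a line arrives; a silent hang still blocks here.
            if time.monotonic() > deadline:
                timed_out = True
                proc.kill()
                break
        returncode = proc.wait()
    return (None if timed_out else returncode), lines
```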
Agent: pi
Tests: 30 passed
Tests-Added: 0
TypeScript: N/A
PHASE 2 MILESTONE DOCUMENTED:
All 3 top models complete the full track with distinct driving styles:
- Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION ✅
- Trial 8 (n_steer=4): Left/center lane, oscillating (still completes!)
- Trial 18 (n_steer=3): Right shoulder, very accurate line following
Key finding: fewer steering bins (n_steer=3) gave better driving than n_steer=4 (counterintuitive)
CTE symmetry explains the left/right preference: the CTE penalty is symmetric about the
centreline, so the random NN init determines which side a model settles on
BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py):
- LanePositionWrapper: target a specific CTE offset (control left/right preference)
- AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 / Trial 8 oscillation)
- AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more)
- CombinedBehavioralWrapper: all three combined in one wrapper
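The three shaping ideas can be sketched as plain functions plus a small stateful combiner. The weights, the target offset, and the sign convention (positive CTE = left of centre) are assumptions; flip the sign if the simulator reports CTE the other way. These are illustrations, not the actual classes in agent/behavioral_wrappers.py.

```python
from typing import Optional

# Assumed sign convention: cte > 0 means the car is left of centre.


def lane_position_penalty(cte: float, target_offset: float, weight: float = 0.1) -> float:
    """LanePositionWrapper idea: penalise distance from a target lateral offset."""
    return -weight * abs(cte - target_offset)


def oscillation_penalty(steer: float, prev_steer: float, weight: float = 0.05) -> float:
    """AntiOscillationWrapper idea: penalise large step-to-step steering changes."""
    return -weight * abs(steer - prev_steer)


def asymmetric_cte_penalty(cte: float, left_weight: float = 0.2,
                           right_weight: float = 0.05) -> float:
    """AsymmetricCTEWrapper idea: deviating left costs more than deviating right."""
    weight = left_weight if cte > 0 else right_weight
    return -weight * abs(cte)


class CombinedShaping:
    """CombinedBehavioralWrapper idea: add all three terms to the base reward."""

    def __init__(self, target_offset: float = -0.5) -> None:
        self.target_offset = target_offset  # negative = right of centre (assumed)
        self.prev_steer: Optional[float] = None

    def reset(self) -> None:
        self.prev_steer = None

    def shape(self, base_reward: float, cte: float, steer: float) -> float:
        reward = base_reward + lane_position_penalty(cte, self.target_offset)
        reward += asymmetric_cte_penalty(cte)
        if self.prev_steer is not None:  # no oscillation term on the first step
            reward += oscillation_penalty(steer, self.prev_steer)
        self.prev_steer = steer
        return reward
```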
ENHANCED EVALUATOR (agent/evaluate_champion.py):
- Full metrics: reward, lap time, oscillation score, CTE distribution, lane position
- --compare flag: runs all top Phase 2 models side by side with a comparison table
- Saves eval summary to outerloop-results/eval_summary.jsonl
- Detects lap completion events from sim info dict
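A sketch of per-episode summary metrics and the JSONL append. Defining the oscillation score as the mean absolute change between consecutive steering commands is an assumption, as are the field names, the centre-band threshold, and the CTE sign convention; lap completion is passed in as a flag because the sim info key is not named here.

```python
import json
import statistics
from pathlib import Path
from typing import Sequence, Union


def oscillation_score(steering: Sequence[float]) -> float:
    """Assumed metric: mean absolute change between consecutive steering commands."""
    if len(steering) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(steering, steering[1:])]
    return statistics.fmean(diffs)


def summarize_episode(model: str, rewards: Sequence[float], steering: Sequence[float],
                      cte: Sequence[float], lap_completed: bool,
                      centre_band: float = 0.25) -> dict:
    """Collapse one evaluation episode into a flat, JSON-serializable record."""
    mean_cte = statistics.fmean(cte)
    if abs(mean_cte) <= centre_band:
        lane = "centre"
    else:
        lane = "left" if mean_cte > 0 else "right"  # assumes cte > 0 is left of centre
    return {
        "model": model,
        "reward": sum(rewards),
        "steps": len(rewards),
        "lap_completed": lap_completed,
        "oscillation": oscillation_score(steering),
        "cte_mean": mean_cte,
        "cte_std": statistics.pstdev(cte),
        "cte_abs_max": max(abs(c) for c in cte),
        "lane_position": lane,
    }


def append_jsonl(path: Union[str, Path], record: dict) -> None:
    """Append one JSON object per line, e.g. to outerloop-results/eval_summary.jsonl."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```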
IMPLEMENTATION PLAN updated: Wave 3 streams defined
RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps
Champion updated to Trial 20 (Phase 2)
Agent: pi/claude-sonnet
Tests: 53/53 passing (+13 behavioral wrapper tests)
Tests-Added: +13
TypeScript: N/A