feat: add exp17 parallel DummyVecEnv 450k training + strategy docs

- exp17_parallel_450k.py: parallel two-track training (generated_track:9091, mountain_track:9093), 450k steps, v6 reward, HOST=localhost
- DECISIONS.md: ADR-025 (parallel strategy) and ADR-026 (mountain friction fix)
- docs/STATE.md: updated to April 2026 state with current champions and strategy
- docs/TEST_HISTORY.md: mountain friction fix notes + Exp 17 full design
- outerloop-results: exp14 finetune logs and robust mountain eval results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 02:42:20 -04:00 · 2026-04-28 02:42:20 -04:00 · b504b89b2a
parent 6e2427571a
commit b504b89b2a
8 changed files with 538 additions and 83 deletions
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -576,3 +576,79 @@ experts, not as obviously reusable initializations for the other track.
 - If transfer is revisited, it likely needs a more careful method than naive direct
  warm-starting on the other track
 - Mountain physics issues should be addressed before revisiting transfer conclusions
 ---
 ## ADR-025: Parallel DummyVecEnv with 400k+ Steps is the Primary Multi-Track Strategy
 **Date:** 2026-04-27
 **Status:** Active
 **Context:** After Wave 4 (25 trials, 80% failure rate), Exp 10 (catastrophic forgetting),
 Exp 11b (infrastructure works but 90k steps insufficient), and Exp 15/16 (cross-track
 warm starts failed both directions), the only multi-track approach that did not have a
 fundamental flaw was parallel DummyVecEnv. Exp 11b failed only because each track received
 ~45k effective steps, below the ~60–90k that single-track training needs.
 **Decision:** The primary next strategy is:
 1. Two sim instances (one per training track, separate ports)
 2. SB3 `DummyVecEnv([env_generated, env_mountain])` — PPO sees both tracks in every batch
 3. 400,000–500,000 total timesteps (~200–250k effective per track)
 4. v6 reward (efficiency gate + CTE patience terminator) on both envs
 5. No warm start — train from random weights
 6. Checkpoint every 20k steps, track mini_monaco zero-shot score throughout
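Because the run depends on two sims already listening on separate ports (item 1), a pre-flight check can fail fast instead of hanging on a connection. The sketch below only tests that each port accepts a TCP connection. It assumes the sims listen on plain TCP at those ports (as gym-donkeycar's `host`/`port` config implies) and that a brief probe connection is harmless to the sim. `sim_port_open` and `check_sims` are hypothetical helpers, not part of the project.

```python
import socket


def sim_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        # The connection is closed immediately; this only proves a listener exists.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_sims(host: str, ports: dict) -> list:
    """Return the names of tracks whose sim port is not reachable."""
    return [track for track, port in ports.items() if not sim_port_open(host, port)]
```

Calling `check_sims('localhost', {'generated_track': 9091, 'mountain_track': 9093})` before building the DummyVecEnv should return an empty list when both sims are up. It does not verify which track each sim has loaded.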
 **Why parallel DummyVecEnv:**
 - PPO is an on-policy algorithm: each update uses only the rollout it has just collected.
  Swapping environments mid-training means consecutive updates see only one track, which
  disrupts value estimates and causes catastrophic forgetting.
  DummyVecEnv feeds both tracks into every PPO rollout batch, so neither track drops out of
  the gradient signal.
 - SB3's vectorised-environment API supports this directly, as long as every sub-env shares
  the same observation and action spaces.
 **Why 400k+ steps:**
 - Single-track training converges in ~60–90k steps.
 - Two parallel tracks need at least 2× the budget because each track supplies only half of
  every rollout batch. Interference between the two tasks adds further overhead.
 - Exp 11b at 90k steps (effectively 45k per track) produced only 194-step drives on both tracks.
  At 400k, each track gets ~200k effective steps, roughly 2–3× the ~60–90k single-track budget.
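The budget arithmetic can be checked in a few lines. This sketch assumes SB3's default PPO `n_steps=2048` (the Exp 17 script does not override it) and the script's pattern of calling `model.learn(seg, reset_num_timesteps=False)` in 20k segments. Because `learn()` only stops between whole rollouts of `n_steps * n_envs` transitions, each segment rounds up, so the model trains for slightly more than the nominal total. The function names are illustrative.

```python
import math


def per_track_steps(total_steps: int, n_envs: int) -> int:
    """DummyVecEnv steps every sub-env once per vec step, so the budget splits evenly."""
    return total_steps // n_envs


def chunked_ppo_timesteps(total_steps: int, segment: int,
                          n_steps: int = 2048, n_envs: int = 2) -> int:
    """Timesteps PPO actually runs when training is split into
    learn(segment, reset_num_timesteps=False) calls tracked by a nominal counter."""
    rollout = n_steps * n_envs
    nominal = actual = 0
    while nominal < total_steps:
        seg = min(segment, total_steps - nominal)
        # `actual` is always a multiple of the rollout size, so each call
        # collects ceil(seg / rollout) full rollouts.
        actual += math.ceil(seg / rollout) * rollout
        nominal += seg
    return actual
```

For Exp 17 this gives ~225k steps per track (2.5 to 3.75× the 60–90k single-track range) and 462,848 actual timesteps for the nominal 450k.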
 **Rejected alternatives:**
 - Round-robin close-and-switch: disrupts PPO, 80% failure rate across 25 trials
 - Cross-track warm starts: failed both directions (ADR-024)
 - More autoresearch trials on round-robin: the method is fundamentally unreliable
 **Fallback if 400k parallel fails:** Curriculum — train generated_track alone for 150k steps,
 then add mountain to the DummyVecEnv pool for 250k more steps.
 ---
 ## ADR-026: Mountain Track Friction Fix — Use Road Material on Hill Colliders
 **Date:** 2026-04-27
 **Status:** Accepted (scene fix applied; simulator rebuild pending)
 **Context:** `WheelPhys.cs` multiplies wheel grip stiffness by the static friction of the
 surface the wheel is hitting. The mountain_track scene assigned Slippery physics material
 (staticFriction=0.1) to 4 track surface colliders from the long_road prefab, giving the
 car 1/5 the normal traction on the hill. This caused visible wheelspin at full throttle and
 made hill climbing genuinely difficult for learned policies.
 **Decision:** Replace the 4 Slippery material assignments in `mountain_track.unity` with the
 Road material (staticFriction=0.5). This is a targeted scene-level override; the Slippery
 material asset itself is unchanged and remains available for intentionally slippery surfaces.
 **Fix location:** `sdsim/Assets/Scenes/mountain_track.unity` — all 4 PrefabModification
 entries that set `propertyPath: m_Material` on long_road colliders now reference Road
 (GUID 7884193b0ead347a38a13a67f294dfb5) instead of Slippery (GUID c0e12c099c364af4e9e311a43d0f12c4).
 **To activate:** Rebuild the Unity simulator binary after pulling the updated scene file.
 No Python code changes needed.
 **What this does NOT change:**
 - `Slippery.physicMaterial` asset — unchanged (still used by thunderhill, circuit_launch)
 - `Donkey_new_phys.prefab` strut colliders — also reference Slippery, but these are car body
  parts that the wheels don't touch. WheelPhys.cs only reads friction from ground hits.
 - mini_monaco.unity — also has one Slippery reference; left intentional for now
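 **Expected effect:** Once the simulator is rebuilt with this scene, hill wheelspin should stop.
 The policy should find it easier to climb the hill at throttle_min=0.2, and multi-track results
 should be more interpretable since we are no longer fighting a physics artifact. Exp 17 itself
 runs on the existing pre-built binary, so its mountain_track results still include the
 low-friction behaviour.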
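After pulling the scene file, the override GUIDs can be checked without opening the Unity Editor. This sketch assumes the standard Unity text-YAML layout for prefab modifications, where `value:` and `objectReference: {..., guid: ...}` follow `propertyPath: m_Material` within a couple of lines. `material_override_guids` is a hypothetical helper; the two GUIDs are the ones listed above.

```python
import re

SLIPPERY_GUID = 'c0e12c099c364af4e9e311a43d0f12c4'
ROAD_GUID = '7884193b0ead347a38a13a67f294dfb5'

_GUID_RE = re.compile(r'objectReference:\s*\{[^}]*guid:\s*([0-9a-f]{32})')


def material_override_guids(scene_text: str) -> list:
    """Return the GUID referenced by each `propertyPath: m_Material` override, in file order."""
    lines = scene_text.splitlines()
    guids = []
    for i, line in enumerate(lines):
        if line.strip() != 'propertyPath: m_Material':
            continue
        # Assumed layout: `value:` then `objectReference:` follow within a few lines.
        for follow in lines[i + 1:i + 4]:
            match = _GUID_RE.search(follow)
            if match:
                guids.append(match.group(1))
                break
    return guids
```

Reading `sdsim/Assets/Scenes/mountain_track.unity` and passing its text in, one would expect `SLIPPERY_GUID` to be absent and `ROAD_GUID` to appear at least four times.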
--- a/agent/experiments/README.md
+++ b/agent/experiments/README.md
@@ -5,6 +5,7 @@ Each corresponds to an entry in docs/TEST_HISTORY.md.
 | Script | Experiment | Key change |
 |---|---|---|
 | exp17_parallel_450k.py | Exp 17 | Parallel DummyVecEnv, 450k steps, v6 reward, HOST=localhost |
 | mountain_v5.py | Exp 5 | v5 reward + throttle_min=0.5, direct model.learn() |
 | mountain_continue.py | Exp 4 | Continued Exp3 training |
 | mountain_high_throttle.py | Exp 3 | throttle_min=0.5, old v4 reward |
--- a/agent/experiments/exp17_parallel_450k.py
+++ b/agent/experiments/exp17_parallel_450k.py
@@ -0,0 +1,199 @@
 """
 Exp 17: Parallel DummyVecEnv — generated_track + mountain_track, 450k steps.
 Strategy: Exp 11b proved the parallel DummyVecEnv infrastructure is stable.
 The only failure mode was insufficient training budget (~45k effective steps
 per track). This experiment raises the budget fivefold, to ~225k per track.
 Changes from Exp 11b:
  - HOST: 10.0.0.55 → localhost  (WSL/Windows share ports)
  - TOTAL_STEPS: 90k → 450k
  - CHECKPOINT_EVERY: 6k → 20k
  - SAVE_DIR: exp17-parallel-450k
 Everything else identical to Exp 11b (same reward, wrappers, lr, throttle_min).
 Setup — TWO sim instances required:
  Sim 1: launch donkey_sim.exe, select generated_track, port 9091 (default)
  Sim 2: launch a second donkey_sim.exe with --port 9093, select mountain_track
         Command: donkey_sim.exe --port 9093
  Both sims must be running and on the correct tracks before starting this script.
 Evaluation:
  - Mid-training: both training tracks evaluated at each 20k checkpoint (one deterministic
    episode of up to 2000 steps each; mini_monaco is not evaluated mid-training)
  - End-of-training: all 4 tracks evaluated sequentially (port 9091)
 """
 import sys, os, time
 # Resolve agent/ relative to this file (agent/experiments/) instead of a hard-coded home path.
 AGENT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
 sys.path.insert(0, AGENT_DIR)
 from multitrack_runner import log, StuckTerminationWrapper
 from donkeycar_sb3_runner import ThrottleClampWrapper
 from reward_wrapper import SpeedRewardWrapper
 from stable_baselines3 import PPO
 from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
 import gymnasium as gym
 import numpy as np
 HOST             = 'localhost'
 THROTTLE_MIN     = 0.2
 LR               = 0.000725
 TOTAL_STEPS      = 450_000
 CHECKPOINT_EVERY = 20_000
 SAVE_DIR         = os.path.join(AGENT_DIR, 'models', 'exp17-parallel-450k')
 os.makedirs(SAVE_DIR, exist_ok=True)
 def make_env(track_id, port):
    def _init():
        raw = gym.make(track_id, conf={'host': HOST, 'port': port})
        env = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
        env = StuckTerminationWrapper(env, stuck_steps=40, min_displacement=0.5)
        env = SpeedRewardWrapper(env)
        return env
    return _init
 log('=' * 60)
 log('Exp 17: Parallel DummyVecEnv — 450k steps')
 log(f'  Sim 1: {HOST}:9091 → generated_track')
 log(f'  Sim 2: {HOST}:9093 → mountain_track')
 log(f'  throttle_min={THROTTLE_MIN}, lr={LR}, total={TOTAL_STEPS:,}')
 log(f'  Reward: v6 (speed × CTE_quality, efficiency gate >= 0.15)')
 log(f'  Stuck termination: 40 steps (~2.5s)')
 log(f'  Checkpoints: every {CHECKPOINT_EVERY:,} steps')
 log('=' * 60)
 log('Creating DummyVecEnv with two tracks...')
 env = DummyVecEnv([
    make_env('donkey-generated-track-v0', 9091),
    make_env('donkey-mountain-track-v0', 9093),
 ])
 env = VecTransposeImage(env)
 log(f'  VecEnv num_envs={env.num_envs}, obs={env.observation_space.shape}')
 model = PPO('CnnPolicy', env, learning_rate=LR, verbose=1, device='cpu')
 log('PPO created. Starting training...')
 best_reward = float('-inf')
 steps_done = 0
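 while steps_done < TOTAL_STEPS:
    seg_steps = min(CHECKPOINT_EVERY, TOTAL_STEPS - steps_done)
    # PPO stops learn() only at whole-rollout boundaries (n_steps * n_envs = 2048 * 2 with
    # SB3's default n_steps), so model.num_timesteps runs slightly ahead of steps_done.
    model.learn(total_timesteps=seg_steps, reset_num_timesteps=False)
    steps_done += seg_steps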
    ckpt = os.path.join(SAVE_DIR, f'checkpoint_{steps_done:07d}')
    model.save(ckpt)
    model.save(os.path.join(SAVE_DIR, 'model'))
    log(f'[{steps_done:,}/{TOTAL_STEPS:,}] Checkpoint saved: {ckpt}.zip')
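    # Eval on both training tracks using the existing DummyVecEnv connections.
    # DummyVecEnv auto-resets a sub-env when its episode ends, so done_mask freezes
    # each track's totals at the end of its first episode.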
    try:
        obs = env.reset()
        ep_rewards = np.zeros(env.num_envs)
        ep_steps   = np.zeros(env.num_envs)
        done_mask  = np.zeros(env.num_envs, dtype=bool)
        for _ in range(2000):
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, dones, infos = env.step(action)
            for i in range(env.num_envs):
                if not done_mask[i]:
                    ep_rewards[i] += rewards[i]
                    ep_steps[i]   += 1
                    if dones[i]:
                        done_mask[i] = True
            if done_mask.all():
                break
        status0 = '✅' if ep_steps[0] >= 2000 else f'❌@{int(ep_steps[0])}'
        status1 = '✅' if ep_steps[1] >= 2000 else f'❌@{int(ep_steps[1])}'
        log(f'  Eval: gen_track={ep_rewards[0]:.1f}r/{int(ep_steps[0])}s {status0}  '
            f'mountain={ep_rewards[1]:.1f}r/{int(ep_steps[1])}s {status1}')
        total_reward = ep_rewards.sum()
        if total_reward > best_reward:
            best_reward = total_reward
            model.save(os.path.join(SAVE_DIR, 'best_model'))
            log(f'  ⭐ NEW BEST: {best_reward:.1f} combined reward')
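    except Exception as e:
        log(f'  Eval error: {e}')
        import traceback; traceback.print_exc()
    # The eval moved the training envs away from PPO's cached observation. Clearing it
    # makes the next model.learn() reset both envs before collecting the next rollout.
    model._last_obs = None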
 model.save(os.path.join(SAVE_DIR, 'model'))
 log(f'\nTraining complete. Best combined reward: {best_reward:.1f}')
 env.close()
 time.sleep(5)
 # --- Final eval on all 4 tracks (sequential, port 9091) ---
 log('\n' + '=' * 60)
 log('FINAL EVALUATION: best_model on 4 tracks (3 sets each)')
 log('=' * 60)
 EVAL_TRACKS = [
    ('donkey-generated-track-v0',  'generated_track'),
    ('donkey-mountain-track-v0',   'mountain_track'),
    ('donkey-minimonaco-track-v0', 'mini_monaco'),
    ('donkey-generated-roads-v0',  'generated_road'),
 ]
 EVAL_PORT     = 9091
 EVAL_SETS     = 3
 EVAL_MAX_STEPS = 2000
 best_model_path  = os.path.join(SAVE_DIR, 'best_model.zip')
 results_by_track = {}
 for track_id, track_name in EVAL_TRACKS:
    log(f'\n--- {track_name} ---')
    steps_list = []
    for s in range(1, EVAL_SETS + 1):
        eval_env = None
        try:
            raw       = gym.make(track_id, conf={'host': HOST, 'port': EVAL_PORT})
            inner     = ThrottleClampWrapper(raw, throttle_min=THROTTLE_MIN)
            inner     = StuckTerminationWrapper(inner, stuck_steps=40, min_displacement=0.5)
            inner     = SpeedRewardWrapper(inner)
            eval_env  = VecTransposeImage(DummyVecEnv([lambda e=inner: e]))
            eval_model = PPO.load(best_model_path, env=eval_env, device='cpu')
            obs = eval_env.reset()
            total_r, total_s, done = 0.0, 0, False
            while not done and total_s < EVAL_MAX_STEPS:
                action, _ = eval_model.predict(obs, deterministic=True)
                # SB3 VecEnvs always return the 4-tuple (obs, rewards, dones, infos).
                obs, r, d, info = eval_env.step(action)
                done = bool(d[0])
                total_r += float(r[0])
                total_s += 1
            status = '✅' if total_s >= EVAL_MAX_STEPS else f'❌@{total_s}'
            log(f'  Set {s}: {total_r:.1f}r / {total_s}s {status}')
            steps_list.append(total_s)
        except Exception as e:
            log(f'  Set {s}: ERROR — {e}')
            steps_list.append(0)
        finally:
            # Release the sim connection even if the set failed, then let the sim settle.
            if eval_env is not None:
                eval_env.close()
            time.sleep(3)
    mean_steps = np.mean(steps_list) if steps_list else 0
    results_by_track[track_name] = steps_list
    log(f'  Mean: {mean_steps:.0f} steps')
 log('\n' + '=' * 60)
 log('SUMMARY')
 log('=' * 60)
 for track_name, steps_list in results_by_track.items():
    steps_str = '/'.join(str(s) for s in steps_list)
    mean      = np.mean(steps_list)
    verdict   = '✅' if mean >= 1500 else '⚠️' if mean >= 500 else '❌'
    log(f'  {verdict} {track_name:20s}: {steps_str}  mean={mean:.0f}')
 log(f'\n=== Exp 17 COMPLETE ===')
--- a/agent/outerloop-results/exp14_finetune_log.txt
+++ b/agent/outerloop-results/exp14_finetune_log.txt
@@ -0,0 +1,61 @@
 2026-04-20T00:08:21.090963 Loading warm-start model from models/exp14-mountain-v5/best_model.zip
 2026-04-20T00:09:16.674927 Loading warm-start model from models/exp14-mountain-v5/best_model.zip using throttle_min=0.2 env
 2026-04-20T00:09:19.055092 Switching model to env with throttle_min=0.4
 2026-04-20T00:10:27.385278 Loading warm-start model from models/exp14-mountain-v5/best_model.zip using base throttle_min=0.2 env
 2026-04-20T00:11:08.699368 ERROR during fine-tune: 'NoneType' object is not callable
 2026-04-20T00:11:08.901669 Fine-tune complete. steps_done=0
 2026-04-20T00:14:43.472139 Loading warm-start model from models/exp14-mountain-v5/best_model.zip using base throttle_min=0.2 env
 2026-04-20T00:17:44.473941 Loading warm-start model from models/exp14-mountain-v5/best_model.zip using base throttle_min=0.2 env
 2026-04-20T00:21:10.924456 Loading warm-start model from models/exp14-mountain-v5/best_model.zip using base throttle_min=0.2 env
 2026-04-20T00:25:31.932947 Loading warm-start model from models/exp14-mountain-v5/best_model.zip without creating a temp env
 2026-04-20T00:28:59.848890 [6000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0006000.zip
 2026-04-20T00:28:59.848966 ERROR during fine-tune: name 'make_env' is not defined
 2026-04-20T00:29:00.509181 Fine-tune complete. steps_done=6000
 2026-04-20T00:31:09.594830 Loading warm-start model from models/exp14-mountain-v5/best_model.zip without creating a temp env
 2026-04-20T00:34:50.056288 [6000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0006000.zip
 2026-04-20T00:35:04.415348 ERROR during fine-tune: name 'json' is not defined
 2026-04-20T00:35:04.546033 Fine-tune complete. steps_done=6000
 2026-04-20T00:37:47.831240 Loading warm-start model from models/exp14-mountain-v5/best_model.zip without creating a temp env
 2026-04-20T00:41:21.675776 [6000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0006000.zip
 2026-04-20T00:41:43.554021   Eval @ 6000: mean_steps=384.7 mean_lap=21.59375
 2026-04-20T00:41:43.694831   ⭐ NEW BEST (mean lap 21.59s) saved
 2026-04-20T00:45:26.980198 [12000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0012000.zip
 2026-04-20T00:45:42.741989   Eval @ 12000: mean_steps=187.7 mean_lap=None
 2026-04-20T00:49:24.586893 [18000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0018000.zip
 2026-04-20T00:49:42.795830   Eval @ 18000: mean_steps=287.3 mean_lap=None
 2026-04-20T00:53:15.614884 [24000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0024000.zip
 2026-04-20T00:53:37.070339   Eval @ 24000: mean_steps=374.7 mean_lap=21.765625
 2026-04-20T00:57:09.352148 [30000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0030000.zip
 2026-04-20T00:57:36.938090   Eval @ 30000: mean_steps=537.7 mean_lap=22.046875
 2026-04-20T00:57:36.938120 Switching env to throttle_min=0.2
 2026-04-20T01:00:55.914640 [36000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0036000.zip
 2026-04-20T01:01:56.665949   Eval @ 36000: mean_steps=1451.7 mean_lap=28.434895833333332
 2026-04-20T01:05:10.807288 [42000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0042000.zip
 2026-04-20T01:05:57.449632   Eval @ 42000: mean_steps=1067.7 mean_lap=27.44140625
 2026-04-20T01:08:54.843851 [48000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0048000.zip
 2026-04-20T01:10:00.878424   Eval @ 48000: mean_steps=1626.7 mean_lap=29.776785714285715
 2026-04-20T01:13:16.089861 [54000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0054000.zip
 2026-04-20T01:14:18.435622   Eval @ 54000: mean_steps=1528.3 mean_lap=30.234375
 2026-04-20T01:17:25.682859 [60000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0060000.zip
 2026-04-20T01:18:28.243356   Eval @ 60000: mean_steps=1533.0 mean_lap=34.33125
 2026-04-20T01:21:38.247436 [66000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0066000.zip
 2026-04-20T01:21:54.995379   Eval @ 66000: mean_steps=163.7 mean_lap=None
 2026-04-20T01:25:14.752223 [72000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0072000.zip
 2026-04-20T01:26:11.926001   Eval @ 72000: mean_steps=1389.7 mean_lap=43.21484375
 2026-04-20T01:29:24.138321 [78000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0078000.zip
 2026-04-20T01:29:59.928582   Eval @ 78000: mean_steps=757.0 mean_lap=43.453125
 2026-04-20T01:33:15.187091 [84000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0084000.zip
 2026-04-20T01:33:49.188449   Eval @ 84000: mean_steps=704.7 mean_lap=41.046875
 2026-04-20T01:36:57.554346 [90000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0090000.zip
 2026-04-20T01:38:12.054640   Eval @ 90000: mean_steps=1819.0 mean_lap=None
 2026-04-20T01:41:29.620560 [96000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0096000.zip
 2026-04-20T01:42:07.583154   Eval @ 96000: mean_steps=813.0 mean_lap=None
 2026-04-20T01:45:23.503967 [102000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0102000.zip
 2026-04-20T01:45:59.052782   Eval @ 102000: mean_steps=747.3 mean_lap=None
 2026-04-20T01:49:02.510514 [108000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0108000.zip
 2026-04-20T01:49:27.462705   Eval @ 108000: mean_steps=466.0 mean_lap=None
 2026-04-20T01:52:40.338223 [114000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0114000.zip
 2026-04-20T01:53:31.593848   Eval @ 114000: mean_steps=1169.0 mean_lap=None
 2026-04-20T01:56:39.035861 [120000/120000] Checkpoint saved: models/exp14-mountain-v5-finetune/checkpoint_0120000.zip
 2026-04-20T01:57:28.658996   Eval @ 120000: mean_steps=1125.0 mean_lap=None
 2026-04-20T01:57:28.795051 Fine-tune complete. steps_done=120000
--- a/agent/outerloop-results/exp14_finetune_results.jsonl
+++ b/agent/outerloop-results/exp14_finetune_results.jsonl
@@ -0,0 +1,20 @@
 {"steps_done": 6000, "throttle_min": 0.4, "mean_steps": 384.6666666666667, "mean_lap_time": 21.59375, "per_set": [{"steps": 205, "laps": 0, "lap_times": []}, {"steps": 177, "laps": 0, "lap_times": []}, {"steps": 772, "laps": 1, "lap_times": [21.59375]}]}
 {"steps_done": 12000, "throttle_min": 0.4, "mean_steps": 187.66666666666666, "mean_lap_time": null, "per_set": [{"steps": 145, "laps": 0, "lap_times": []}, {"steps": 345, "laps": 0, "lap_times": []}, {"steps": 73, "laps": 0, "lap_times": []}]}
 {"steps_done": 18000, "throttle_min": 0.4, "mean_steps": 287.3333333333333, "mean_lap_time": null, "per_set": [{"steps": 233, "laps": 0, "lap_times": []}, {"steps": 244, "laps": 0, "lap_times": []}, {"steps": 385, "laps": 0, "lap_times": []}]}
 {"steps_done": 24000, "throttle_min": 0.4, "mean_steps": 374.6666666666667, "mean_lap_time": 21.765625, "per_set": [{"steps": 178, "laps": 0, "lap_times": []}, {"steps": 359, "laps": 0, "lap_times": []}, {"steps": 587, "laps": 1, "lap_times": [21.765625]}]}
 {"steps_done": 30000, "throttle_min": 0.4, "mean_steps": 537.6666666666666, "mean_lap_time": 22.046875, "per_set": [{"steps": 854, "laps": 1, "lap_times": [22.046875]}, {"steps": 365, "laps": 0, "lap_times": []}, {"steps": 394, "laps": 0, "lap_times": []}]}
 {"steps_done": 36000, "throttle_min": 0.2, "mean_steps": 1451.6666666666667, "mean_lap_time": 28.434895833333332, "per_set": [{"steps": 1540, "laps": 2, "lap_times": [29.34375, 26.84375]}, {"steps": 2000, "laps": 3, "lap_times": [29.4375, 28.4375, 27.015625]}, {"steps": 815, "laps": 1, "lap_times": [29.53125]}]}
 {"steps_done": 42000, "throttle_min": 0.2, "mean_steps": 1067.6666666666667, "mean_lap_time": 27.44140625, "per_set": [{"steps": 467, "laps": 0, "lap_times": []}, {"steps": 2000, "laps": 3, "lap_times": [27.046875, 27.703125, 27.125]}, {"steps": 736, "laps": 1, "lap_times": [27.890625]}]}
 {"steps_done": 48000, "throttle_min": 0.2, "mean_steps": 1626.6666666666667, "mean_lap_time": 29.776785714285715, "per_set": [{"steps": 2000, "laps": 3, "lap_times": [30.796875, 29.828125, 28.734375]}, {"steps": 880, "laps": 1, "lap_times": [30.65625]}, {"steps": 2000, "laps": 3, "lap_times": [29.703125, 29.203125, 29.515625]}]}
 {"steps_done": 54000, "throttle_min": 0.2, "mean_steps": 1528.3333333333333, "mean_lap_time": 30.234375, "per_set": [{"steps": 585, "laps": 0, "lap_times": []}, {"steps": 2000, "laps": 3, "lap_times": [32.734375, 29.8125, 30.8125]}, {"steps": 2000, "laps": 3, "lap_times": [31.171875, 29.71875, 27.15625]}]}
 {"steps_done": 60000, "throttle_min": 0.2, "mean_steps": 1533.0, "mean_lap_time": 34.33125, "per_set": [{"steps": 2000, "laps": 2, "lap_times": [39.140625, 33.140625]}, {"steps": 599, "laps": 0, "lap_times": []}, {"steps": 2000, "laps": 3, "lap_times": [34.21875, 31.953125, 33.203125]}]}
 {"steps_done": 66000, "throttle_min": 0.2, "mean_steps": 163.66666666666666, "mean_lap_time": null, "per_set": [{"steps": 154, "laps": 0, "lap_times": []}, {"steps": 146, "laps": 0, "lap_times": []}, {"steps": 191, "laps": 0, "lap_times": []}]}
 {"steps_done": 72000, "throttle_min": 0.2, "mean_steps": 1389.6666666666667, "mean_lap_time": 43.21484375, "per_set": [{"steps": 2000, "laps": 2, "lap_times": [50.140625, 35.6875]}, {"steps": 169, "laps": 0, "lap_times": []}, {"steps": 2000, "laps": 2, "lap_times": [39.890625, 47.140625]}]}
 {"steps_done": 78000, "throttle_min": 0.2, "mean_steps": 757.0, "mean_lap_time": 43.453125, "per_set": [{"steps": 174, "laps": 0, "lap_times": []}, {"steps": 1074, "laps": 1, "lap_times": [46.03125]}, {"steps": 1023, "laps": 1, "lap_times": [40.875]}]}
 {"steps_done": 84000, "throttle_min": 0.2, "mean_steps": 704.6666666666666, "mean_lap_time": 41.046875, "per_set": [{"steps": 953, "laps": 1, "lap_times": [40.21875]}, {"steps": 181, "laps": 0, "lap_times": []}, {"steps": 980, "laps": 1, "lap_times": [41.875]}]}
 {"steps_done": 90000, "throttle_min": 0.2, "mean_steps": 1819.0, "mean_lap_time": null, "per_set": [{"steps": 2000, "laps": 0, "lap_times": []}, {"steps": 1963, "laps": 0, "lap_times": []}, {"steps": 1494, "laps": 0, "lap_times": []}]}
 {"steps_done": 96000, "throttle_min": 0.2, "mean_steps": 813.0, "mean_lap_time": null, "per_set": [{"steps": 1671, "laps": 0, "lap_times": []}, {"steps": 385, "laps": 0, "lap_times": []}, {"steps": 383, "laps": 0, "lap_times": []}]}
 {"steps_done": 102000, "throttle_min": 0.2, "mean_steps": 747.3333333333334, "mean_lap_time": null, "per_set": [{"steps": 715, "laps": 0, "lap_times": []}, {"steps": 932, "laps": 0, "lap_times": []}, {"steps": 595, "laps": 0, "lap_times": []}]}
 {"steps_done": 108000, "throttle_min": 0.2, "mean_steps": 466.0, "mean_lap_time": null, "per_set": [{"steps": 468, "laps": 0, "lap_times": []}, {"steps": 476, "laps": 0, "lap_times": []}, {"steps": 454, "laps": 0, "lap_times": []}]}
 {"steps_done": 114000, "throttle_min": 0.2, "mean_steps": 1169.0, "mean_lap_time": null, "per_set": [{"steps": 1318, "laps": 0, "lap_times": []}, {"steps": 1278, "laps": 0, "lap_times": []}, {"steps": 911, "laps": 0, "lap_times": []}]}
 {"steps_done": 120000, "throttle_min": 0.2, "mean_steps": 1125.0, "mean_lap_time": null, "per_set": [{"steps": 941, "laps": 0, "lap_times": []}, {"steps": 1492, "laps": 0, "lap_times": []}, {"steps": 942, "laps": 0, "lap_times": []}]}
--- a/agent/outerloop-results/robust_eval_mountain.jsonl
+++ b/agent/outerloop-results/robust_eval_mountain.jsonl
@@ -0,0 +1,13 @@
 {"set": 1, "episode": 1, "steps": 195, "reward": 313.8098858782323, "laps": 0, "lap_times": [], "timestamp": "2026-04-19T23:55:01.852800"}
 {"set": 1, "episode": 2, "steps": 907, "reward": 821.3252189619088, "laps": 0, "lap_times": [], "timestamp": "2026-04-19T23:55:15.580688"}
 {"set": 1, "episode": 3, "steps": 187, "reward": 312.3699834933941, "laps": 0, "lap_times": [], "timestamp": "2026-04-19T23:55:20.305057"}
 {"set_summary": {"set": 1, "mean_steps": 429.6666666666667, "mean_reward": 482.50169611117843}}
 {"set": 2, "episode": 1, "steps": 1684, "reward": 2886.7210297683996, "laps": 2, "lap_times": [30.796875, 27.3125], "timestamp": "2026-04-19T23:55:43.831212"}
 {"set": 2, "episode": 2, "steps": 1791, "reward": 2724.1041878786637, "laps": 2, "lap_times": [29.234375, 31.578125], "timestamp": "2026-04-19T23:56:08.736059"}
 {"set": 2, "episode": 3, "steps": 2000, "reward": 3338.140802157104, "laps": 3, "lap_times": [29.828125, 27.828125, 29.171875], "timestamp": "2026-04-19T23:56:34.963968"}
 {"set_summary": {"set": 2, "mean_steps": 1825.0, "mean_reward": 2982.9886732680557}}
 {"set": 3, "episode": 1, "steps": 189, "reward": 304.40264326371107, "laps": 0, "lap_times": [], "timestamp": "2026-04-19T23:56:39.723007"}
 {"set": 3, "episode": 2, "steps": 2000, "reward": 3396.2255747133167, "laps": 3, "lap_times": [29.875, 28.75, 27.765625], "timestamp": "2026-04-19T23:57:05.989723"}
 {"set": 3, "episode": 3, "steps": 773, "reward": 1300.720640436186, "laps": 1, "lap_times": [31.265625], "timestamp": "2026-04-19T23:57:18.198014"}
 {"set_summary": {"set": 3, "mean_steps": 987.3333333333334, "mean_reward": 1667.116286137738}}
 {"overall": {"mean_steps_across_sets": 1080.6666666666667, "mean_reward_across_sets": 1710.8688851723239}}
--- a/docs/STATE.md
+++ b/docs/STATE.md
@@ -1,100 +1,83 @@
-# Project State — April 16, 2026 (post-testing)
+# Project State — April 27, 2026
 ## The Goal
 Train a DonkeyCar model that generalises to any road-surface track
 (outdoor, asphalt, lane markings) — demonstrated by driving a
 never-seen track without crashing.
 ---
-## Confirmed Working Models (tested today, observed by user)
+## Current Champion Models
-### ✅ Phase 2 Champion — generated_road
+### ✅ exp13-gentrack-v4 — generated_track specialist
- **Path:** `models/champion/model.zip`
+- **Path:** `models/exp13-gentrack-v4/best_model.zip`
- **Trained on:** generated_road only, ~13k steps, lr=0.000225
+- **Trained on:** generated_track only, ~30k steps (stopped early), lr=0.000725, throttle_min=0.2
- **Test result:** Drove full 2000 steps, 2013 reward. User: "driving very well, stayed in right-hand lane, very very good"
+- **Reward:** v4 (base × efficiency × speed_bonus)
- **Other tracks:** Confirmed fails on generated_track (old multitrack_eval)
+- **Performance:** Drives generated_track reliably, clean laps
 - **Zero-shot:** Fails on mountain_track (expected — single-track specialist)
-### ✅ Wave 4 Trial 9 — generated_track AND mini_monaco
+### ✅ exp14-mountain-v5-finetune ft_036k — mountain specialist
 - **Path:** `models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`
 - **Trained on:** mountain_track, fine-tuned from exp14 base, checkpoint at 36k steps
 - **Reward:** v5 (speed × CTE-quality), throttle_floor=0.2 (switched from 0.4 at 30k)
 - **Performance:** 9/9 successful episodes, 25 total laps, mean lap 27.93s, best lap 26.16s
 - **Zero-shot:** Fails on generated_track (expected — single-track specialist)
 ### ⭐ Wave 4 Trial 9 — best generalising model (but not reproducible)
 - **Path:** `models/wave4-trial-0009/model.zip`
- **Trained on:** generated_track + mountain_track from scratch, ~90k steps, lr=0.000725, switch=6,851
+- **Trained on:** generated_track + mountain_track, ~90k steps, lr=0.000725, switch=6,851
- **Test on generated_track:** 3/3 episodes drove full 2000 steps, 13–16 second genuine laps
+- **Performance:** generated_track 2000/2000, mini_monaco 2000/2000 (zero-shot)
- **Test on mini_monaco:** Full 2000 steps, 40-second genuine laps (zero-shot — never seen during training)
+- **Problem:** Re-running with the same hyperparameters several times failed every time; this result was most likely a lucky random seed.
 - **This is our best model**
 ### ✅ Wave 4 Trial 19 — generated_track (mostly)
 - **Path:** `models/wave4-trial-0019/model.zip`
 - **Trained on:** generated_track + mountain_track from scratch, ~74k steps, lr=0.000629, switch=8,211
 - **Test on generated_track:** 2/3 episodes drove full 2000 steps, 14–17 second genuine laps. 1 crash.
 - **mini_monaco score during training:** 231 (best "honest" result from Wave 4)
 ---
-## Key Finding: Generated Track Lighting Variation
+## What We Know (cumulative)
-The generated_track changes lighting conditions (sun angle, shadows) on every
+
-env.reset() due to procedural generation. This means during training, every
+### Reward functions
-episode showed a different visual appearance of the same track. The model was
+- **v4** (base × efficiency × speed_bonus): works for generated_track; gives zero gradient on mountain hills
-forced to learn track-geometry features (road edges, markings) rather than
+- **v5** (speed × CTE-quality): works for mountain; circular driving exploit possible on flat track
-lighting-specific patterns. This visual robustness is almost certainly why
+- **v6** (v5 + efficiency gate ≥ 0.15): prevents circular exploit; may suppress early exploration
-Trial 9 can zero-shot generalise to mini_monaco.
+
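To make the v5/v6 difference concrete, here is a sketch of both rewards as described above. The exact definitions of `cte_quality` and `efficiency` are not shown in this commit; the sketch assumes both are already computed per step, and that steps below the v6 gate earn zero reward. Treat it as an illustration of the gating idea, not as the project's `SpeedRewardWrapper`.

```python
def reward_v5(speed: float, cte_quality: float) -> float:
    """v5: speed scaled by how well the car holds the centre line."""
    return speed * cte_quality


def reward_v6(speed: float, cte_quality: float, efficiency: float,
              gate: float = 0.15) -> float:
    """v6: v5 plus an efficiency gate.

    Assumption: below the gate the step is worth nothing, so driving fast in
    circles (high speed, low efficiency) no longer pays.
    """
    if efficiency < gate:
        return 0.0
    return reward_v5(speed, cte_quality)
```

Under these assumptions a circling car with speed 2.0, cte_quality 0.9 and efficiency 0.05 earns 1.8 per step under v5 and 0.0 under v6. The same gate is also a plausible reason v6 may suppress early exploration: a random policy that rarely clears 0.15 efficiency sees mostly zero reward.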
 ### Training approaches tried and their outcomes
 | Approach | Result |
 |---|---|
 | Single-track PPO (Exp 9, 13) | ✅ Reliable. Best per-track performance. |
 | Round-robin close-and-switch (Wave 4, Exp 10) | ❌ 80% failure rate. Disrupts PPO rollout buffer. |
 | Parallel DummyVecEnv 90k steps (Exp 11b) | ⚠️ Infrastructure works; 90k too few steps (194 steps on all tracks). |
 | Cross-track warm start both directions (Exp 15, 16) | ❌ Both failed. Single-track policies too specialised for naive transfer. |
 ### Mountain track physics (fixed 2026-04-27)
 The mountain_track.unity scene assigned Slippery physics material (staticFriction=0.1)
 to 4 track surface colliders. WheelPhys.cs scales wheel grip by surface staticFriction,
 so the car had one-fifth of its normal grip on the hill, which caused visible wheelspin.
 Fixed by assigning Road material (staticFriction=0.5) to those 4 colliders in
 `sdsim/Assets/Scenes/mountain_track.unity`. The project uses a pre-built Windows
 executable (DonkeySimWin/donkey_sim.exe), so this fix is deferred until the sim
 is rebuilt from source in Unity Editor. Proceed with Exp 17 using the existing binary.
 ### Key parameter knowledge
 - **lr:** 0.000725 (from Trial 9 and Exp 9 — consistent with good results)
 - **throttle_min:** 0.2 (v5/v6 reward gives non-zero gradient on hills even at 0.2)
 - **n_steer/n_throttle:** Relevant only for a discrete action space (our PPO setup uses continuous actions)
 - **Per-env throttle_min in DummyVecEnv:** Feasible — each env wrapped independently
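Per-env throttle floors work because each `make_env` closure builds its own wrapper stack. A minimal sketch, assuming actions of the form `[steer, throttle]` and a simple clamp `max(throttle_min, throttle)`; the project's `ThrottleClampWrapper` may rescale rather than clamp, and `THROTTLE_MIN_BY_TRACK` / `make_throttle_filter` are hypothetical names. In the real script the per-track value would instead be passed as `throttle_min=` to `ThrottleClampWrapper` inside `make_env`.

```python
THROTTLE_MIN_BY_TRACK = {
    'donkey-generated-track-v0': 0.2,
    'donkey-mountain-track-v0': 0.5,  # optional higher floor for the hill
}


def make_throttle_filter(track_id: str):
    """Build a per-track action filter; each call captures its own floor."""
    throttle_min = THROTTLE_MIN_BY_TRACK[track_id]

    def _filter(action):
        steer, throttle = action
        # Hypothetical clamp: raise throttle to the floor and cap it at 1.0.
        return [steer, min(1.0, max(throttle_min, throttle))]

    return _filter


# One filter per DummyVecEnv slot, in the same order as the env list.
filters = [make_throttle_filter(track) for track in THROTTLE_MIN_BY_TRACK]
```

Building the floor inside a factory (rather than a bare lambda in a loop) avoids Python's late-binding closure pitfall, the same reason the Exp 17 final eval uses `lambda e=inner: e`.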
 ---
-## Full Test Results — April 16
+## Open Strategy (as of April 27)
-| Test | Model | Track | Laps | Steps | Verdict |
+The goal is reliable multi-track generalisation. The validated path forward:
 |---|---|---|---|---|---|
 | 1 | Phase 2 champion | generated_road | n/a (not a loop) | 2000/2000 | ✅ DRIVES |
 | 2 | Wave 4 Trial 3 | generated_track | — | — | ❌ MODEL CORRUPTED |
 | 3 | Wave 4 Trial 9 | generated_track | 6 laps × 3 eps | 2000/2000 | ✅ DRIVES |
 | 4 | Wave 4 Trial 9 | mini_monaco | 2 laps per ep | 2000/2000 | ✅ DRIVES (zero-shot) |
 | 5 | Wave 4 Trial 14 | mini_monaco | 1 lap ep2 only | 257/901/253 | ⚠️ INCONSISTENT |
 | 6 | Wave 4 Trial 25 | mini_monaco | 0 | ~147/eps | ❌ CRASHES |
 | + | Wave 4 Trial 19 | generated_track | 5-6 laps × 2 eps | crash/2000/2000 | ✅ MOSTLY |
 | + | Wave 4 Trial 22 | generated_track | 0 | ~110/eps | ❌ SAME SPOT |
 | + | Wave 4 Trial 2 | generated_track | 0 | ~76/eps | ❌ CRASHES |
 | + | Trial 3 (recovered) | generated_track | 0 | ~104/eps | ❌ CRASHES |
---
+1. **Exp 17:** Parallel DummyVecEnv with 400k–500k steps
   - Two sim instances: generated_track:9091, mountain_track:9093
   - v6 reward on both (efficiency gate + CTE patience terminator)
   - throttle_min=0.2 both envs (or optionally 0.5 on mountain, 0.2 on generated)
   - lr=0.000725, checkpoint every 20k, best_model tracked throughout
   - Eval mini_monaco zero-shot at every checkpoint
 2. **If Exp 17 plateaus:** Try curriculum (generated_track only for 150k, then add mountain)
 3. **If still stuck:** Tune v6 efficiency gate threshold (check % steps gated in early training)
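The Exp 17 plan calls for three things: a checkpoint every 20k steps, a mini_monaco zero-shot eval at each checkpoint, and a best model tracked throughout. The sketch below covers only the bookkeeping. It assumes each eval reduces to a single score, such as mean steps survived. `BestModelTracker` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BestModelTracker:
    """Keep the checkpoint with the highest mini_monaco zero-shot score."""
    best_step: Optional[int] = None
    best_score: float = float("-inf")
    history: list = field(default_factory=list)  # (step, score) for every checkpoint

    def record(self, step: int, score: float) -> bool:
        """Log one checkpoint eval; return True if it is the new best."""
        self.history.append((step, score))
        if score > self.best_score:
            self.best_step, self.best_score = step, score
            return True
        return False
```

Keeping the full history, not just the best entry, also shows whether the score peaked and then drifted. Catching that peak is the reason for evaluating at every checkpoint.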
-## What We Know Now
+See `docs/TEST_HISTORY.md` for the full Exp 17 design.
 1. **Trial 9 is a genuine multi-track model.** It drives generated_track
   consistently (3/3) with clean laps, AND generalises zero-shot to
   mini_monaco (never seen in training). This is real progress.
 2. **The "amazing" overnight model (Trial 3) is lost.** The model.zip has
   a corrupted optimizer file. Policy weights were recovered but the model
   crashes at ~104 steps — the "amazing" driving was at an intermediate
   training checkpoint, not the final saved model.
 3. **Most Wave 4 high scores were not exploits — they were real.**
   Trials 5, 6, and 14 showed inconsistent results (crash some episodes,
   complete lap on others). The model was genuinely learning but unreliably.
   Only the original very high scores of Trials 14 and 25 (1573 and 1543) appear
   to have been exploits in the training eval.
 4. **Lighting variation on generated_track is a feature, not a bug.**
   Procedural generation changes sun angle / shadows each episode, forcing
   the model to learn geometry rather than appearance. This may be the key
   to Trial 9's generalisation ability.
 5. **Mountain_track training — unknown contribution.** We don't know if
   mountain_track training helped or hurt. Trial 9 drives generated_track
   and mini_monaco; whether it can drive mountain_track is untested.
 ---
 ## Open Questions for Strategy Discussion
 1. Can Trial 9 also drive mountain_track? (untested)
 2. Can Trial 9 drive generated_road? (untested — zero-shot to Phase 2 training track)
 3. Why does Trial 9 drive mini_monaco but other models with similar
   mini_monaco scores (Trial 14: 193, Trial 22: 193) don't drive it reliably?
 4. Would more training steps from Trial 9's hyperparameters produce
   an even better model?
 5. Is mountain_track necessary, or could we get Trial 9's results
   training on generated_track alone?
 ---
@ -102,9 +85,9 @@ Trial 9 can zero-shot generalise to mini_monaco.
 | Model | Path | Status |
 |---|---|---|
-| Phase 2 champion | models/champion/model.zip | ✅ Good |
+| exp13-gentrack-v4 | models/exp13-gentrack-v4/best_model.zip | ✅ Generated_track specialist |
-| Wave 4 Trial 9 | models/wave4-trial-0009/model.zip | ✅ Best model |
+| exp14-mountain-v5-finetune ft_036k | models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip | ✅ Mountain specialist (best overall mountain model) |
-| Wave 4 Trial 19 | models/wave4-trial-0019/model.zip | ✅ Good |
+| exp14-mountain-v5 | models/exp14-mountain-v5/best_model.zip | ✅ Mountain base (good, slightly worse than ft_036k) |
-| Wave 4 Trial 3 | models/wave4-trial-0003/model.zip | ❌ Corrupted |
+| Wave 4 Trial 9 | models/wave4-trial-0009/model.zip | ✅ Best generalising model; unreproducible |
-| Wave 4 Trials 1,2,5-8,10-25 | models/wave4-trial-XXXX/ | Available, mostly crash on generated_track |
+| Phase 2 champion | models/champion/model.zip | ✅ generated_road specialist only |
-
+| Wave 4 other trials | models/wave4-trial-XXXX/ | Mostly crash on all tracks |
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@ -508,3 +508,105 @@ For now:
 - keep the single-track champions as separate specialists
 - do **not** assume direct cross-track warm starts are beneficial
 ---
 ## Mountain Track Friction Fix (2026-04-27)
 ### Root cause
 `WheelPhys.cs` scales wheel grip by the static friction of whatever surface the
 wheel is touching: `fFriction.stiffness = hit.collider.material.staticFriction * originalForwardStiffness`.
 `mountain_track.unity` assigned the Slippery physics material (staticFriction=0.1)
 to 4 track surface colliders from the `long_road` prefab. This gave the car 1/5
 the normal grip on the hill, causing visible wheelspin even at full throttle.
 The Slippery material is intentional on genuinely icy surfaces (thunderhill) but
 was incorrect on mountain_track's asphalt hill.
 ### Fix applied
 Replaced all 4 Slippery material assignments with Road material (staticFriction=0.5)
 in `sdsim/Assets/Scenes/mountain_track.unity`.
 | Material | staticFriction | GUID |
 |---|---|---|
 | Slippery (removed) | 0.1 | c0e12c099c364af4e9e311a43d0f12c4 |
 | Road (applied) | 0.5 | 7884193b0ead347a38a13a67f294dfb5 |
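`WheelPhys.cs` multiplies forward stiffness by the surface's staticFriction, so the grip change follows directly from the two values in the table. The Python transcription below copies only the formula from the quoted line. The function name and the placeholder stiffness are illustrative.

```python
ROAD_STATIC_FRICTION = 0.5       # Road material (applied)
SLIPPERY_STATIC_FRICTION = 0.1   # Slippery material (removed)


def forward_stiffness(static_friction: float, original_forward_stiffness: float) -> float:
    # Same product as the WheelPhys.cs line quoted above.
    return static_friction * original_forward_stiffness


ORIGINAL = 1.0  # placeholder value; it cancels out of the ratio
grip_ratio = (forward_stiffness(SLIPPERY_STATIC_FRICTION, ORIGINAL)
              / forward_stiffness(ROAD_STATIC_FRICTION, ORIGINAL))
print(grip_ratio)  # 0.2, i.e. 1/5 of Road grip
```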
 ### To activate
 The training setup uses the pre-built Windows executable (`DonkeySimWin/donkey_sim.exe`),
 not a locally-compiled build. The scene file edit in sdsandbox/ has no effect on the
 running binary — it only matters if the sim is ever rebuilt from source in Unity Editor.
 **This fix is deferred.** Proceed with Exp 17 using the existing executable.
 If mountain hill training in Exp 17 specifically struggles (short episodes that plateau
 and never improve), that is the signal to pursue a Unity Editor rebuild.
 The scene file change is committed in sdsandbox/ and will apply automatically if the
 sim is rebuilt for any other reason. No Python code changes needed.
 ### Expected effect (once the sim is rebuilt)
 - Hill wheelspin should stop or be greatly reduced
 - throttle_min=0.2 + v5 reward should be even more effective on the hill
 - All mountain experiments run on the rebuilt sim benefit automatically
 ---
 ## Strategy Review and Exp 17 Plan (2026-04-27)
 ### Where the project stands
 After 16 experiments and 4 autoresearch phases, the core problem is clear:
 multi-track training is needed for generalisation, but the training method has
 been unreliable. Here is the summary of what each approach found:
 | Approach | Outcome |
 |---|---|
 | Round-robin close-and-switch (Wave 4, Exp 10) | 80% failure. PPO rollout buffer disrupted on env swap. Lucky seed (Trial 9) worked once but cannot be reproduced. |
 | Parallel DummyVecEnv 90k steps (Exp 11b) | Infrastructure valid, no catastrophic forgetting, but 90k steps / 2 tracks = ~45k effective per track. Not enough. |
 | Cross-track warm starts (Exp 15, 16) | Both directions failed. Single-track specialists do not transfer cleanly. |
 | Single-track PPO (Exp 9, 13, 14) | Reliable but no generalisation. |
 The conclusion: **parallel DummyVecEnv is the right architecture; the only known
 failure mode is training budget**. Exp 11b was mechanically sound but starved of steps.
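The parallel setup avoids the rollout disruption seen with close-and-switch because a `DummyVecEnv` steps every sub-env once per `step()` call. As a result, each PPO rollout of `n_steps` contains `n_steps` transitions from each track. The toy below mimics that stepping pattern with stdlib-only stand-in envs. It is not SB3 code; `ToyTrackEnv` and `collect_rollout` are hypothetical.

```python
import random


class ToyTrackEnv:
    """Stand-in env whose episodes end at random lengths, like crashes on a track."""

    def __init__(self, name: str, seed: int):
        self.name = name
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)

    def step(self, action):
        self.t += 1
        done = self.rng.random() < 0.05  # ~5% chance of "crashing" each step
        return (self.name, self.t), 1.0, done


def collect_rollout(envs, n_steps: int):
    """Step all envs together, auto-resetting finished ones as DummyVecEnv does."""
    obs = [env.reset() for env in envs]
    buffer = []  # one entry per (timestep, env) pair
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            next_obs, reward, done = env.step(action=None)
            buffer.append((env.name, obs[i], reward, done))
            obs[i] = env.reset() if done else next_obs
    return buffer
```

Every batch is split evenly between the tracks regardless of when episodes end. Close-and-switch, by contrast, feeds PPO one track at a time, and the notes above record the rollout buffer being disrupted on each env swap.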
 ### Exp 17 — Parallel DummyVecEnv, 400k–500k steps
 **This is the primary next experiment.**
 | Parameter | Value | Reason |
 |---|---|---|
 | Architecture | DummyVecEnv([generated_track:9091, mountain_track:9093]) | Validated in Exp 11b; no PPO disruption |
 | Total timesteps | 400,000–500,000 | ~200k–250k effective per track; Exp 11b proved 90k insufficient |
 | Reward | v6 on both envs (efficiency gate + CTE patience terminator) | Blocks circular exploit on generated_track; gate threshold may be tuned |
 | throttle_min | 0.2 both envs (or 0.5 mountain, 0.2 generated — see ADR-020) | v5/v6 gradient non-zero on hills at 0.2 |
 | learning_rate | 0.000725 | From Trial 9 and Exp 9 — consistent with best results |
 | Checkpoint | every 20,000 steps + best_model.zip tracked throughout | ADR-017: best model ≠ final model |
 | Eval | mini_monaco zero-shot at every checkpoint | Detect the peak before policy drifts |
 | Warm start | None — train from random weights | ADR-024: cross-track warm starts failed |
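The table's budget numbers can be checked with a few lines of arithmetic. The sketch makes two assumptions: the 450k total used by `exp17_parallel_450k.py`, and an even split between the two envs, since each vectorised step advances both. `Exp17Config` is a hypothetical container, not the script's actual config object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exp17Config:
    total_timesteps: int = 450_000
    n_envs: int = 2                  # generated_track:9091, mountain_track:9093
    learning_rate: float = 0.000725
    checkpoint_every: int = 20_000   # in total timesteps

    @property
    def steps_per_track(self) -> int:
        # Each vectorised step collects one transition per env.
        return self.total_timesteps // self.n_envs

    def checkpoint_steps(self) -> list:
        """Total-timestep counts at which a checkpoint is due."""
        return list(range(self.checkpoint_every, self.total_timesteps + 1, self.checkpoint_every))
```

With these defaults, each track gets 225k steps and there are 22 checkpoints, from 20k through 440k. The same arithmetic gives Exp 11b's ~45k per track from 90k total. If checkpoints come from SB3's `CheckpointCallback`, note that its `save_freq` counts vectorised `step()` calls; the SB3 docs suggest `max(save_freq // n_envs, 1)` for multi-env setups. With two envs, `save_freq=10_000` is needed for a checkpoint every 20k total timesteps.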
 **Setup checklist before running:**
 1. Two sim instances running: one on port 9091, one on port 9093
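 2. Each instance loaded with its configured track (generated_track on 9091, mountain_track on 9093)
 3. Use the existing pre-built executable; the mountain friction fix stays deferred until a Unity Editor rebuild (see Mountain Track Friction Fix above)
 4. Verify throughput: run a 2-minute timing benchmark and set the step cap accordingly (ADR-014)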
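The 2-minute benchmark can be turned into a run-time estimate and a step cap. ADR-014's actual rule is not reproduced here. The sketch assumes the step cap is the largest multiple of the 20k checkpoint interval that fits in an available time budget. `steps_per_second`, `estimate_hours`, and `step_cap` are hypothetical helpers.

```python
def steps_per_second(steps_in_benchmark: int, benchmark_seconds: float = 120.0) -> float:
    """Combined throughput across both sims (total timesteps per wall-clock second)."""
    return steps_in_benchmark / benchmark_seconds


def estimate_hours(total_timesteps: int, sps: float) -> float:
    """Wall-clock hours needed for total_timesteps at the measured throughput."""
    return total_timesteps / sps / 3600.0


def step_cap(sps: float, budget_hours: float, checkpoint_every: int = 20_000) -> int:
    """Largest multiple of checkpoint_every that fits in budget_hours."""
    max_steps = int(sps * budget_hours * 3600)
    return (max_steps // checkpoint_every) * checkpoint_every
```

Take a hypothetical benchmark that logs 2,400 total steps in 120 s. That is 20 steps/s, so 450k steps would take about 6.25 h, and an 8-hour budget would cap the run at 560k steps.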
 **Success criterion:** mini_monaco zero-shot score > 500 (at least 25% of a full
 2000-step episode) reliably across 3 evaluation sets, reproducible across 2+ runs.
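The success criterion can be made mechanical. The sketch reads "reliably across 3 evaluation sets" as all 3 per-set scores exceeding 500. It reads "reproducible" as at least two independent runs each passing that test. Both readings, and the function names, are assumptions.

```python
def run_passes(eval_set_scores, threshold: float = 500.0, n_sets: int = 3) -> bool:
    """A run passes if it has at least n_sets eval-set scores and every one exceeds threshold."""
    return len(eval_set_scores) >= n_sets and all(s > threshold for s in eval_set_scores)


def meets_success(runs, min_runs: int = 2) -> bool:
    """Exp 17 succeeds if at least min_runs independent runs pass."""
    return sum(run_passes(r) for r in runs) >= min_runs
```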
 ### Fallback: Curriculum training (if Exp 17 plateaus below 200)
 If Exp 17 cannot get past ~200 steps on mini_monaco:
 - Phase A: generated_track only, 150k steps (establish road-following)
 - Phase B: add mountain_track to DummyVecEnv, continue 250k more steps
 - Rationale: gives the policy a foundation before the harder mountain physics
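The two curriculum phases amount to a simple schedule over total timesteps. The sketch is hypothetical; only the 150k/250k split comes from the plan above.

```python
PHASE_A_STEPS = 150_000   # generated_track only
PHASE_B_STEPS = 250_000   # generated_track + mountain_track
TOTAL_STEPS = PHASE_A_STEPS + PHASE_B_STEPS


def active_tracks(step: int) -> list:
    """Tracks in the vectorised env at a given total-timestep count (hypothetical schedule)."""
    if step < PHASE_A_STEPS:
        return ["generated_track"]
    return ["generated_track", "mountain_track"]
```

Phase B changes the number of envs from one to two. Recent SB3 versions reject `set_env` when the env count differs from the model's. The usual route is therefore to save after Phase A and reload with `PPO.load(path, env=...)` on the two-env `DummyVecEnv`; check this against the installed SB3 version.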
 ### Fallback: v6 efficiency gate tuning (if gate is too aggressive)
 Log what fraction of steps are gated (reward zeroed) in the first 100k steps.
 If >40%, lower the gate threshold from 0.15 to 0.10 for the first 150k steps,
 then raise it back to 0.15. Prevents the gate from suppressing early exploration.
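The gate-tuning fallback needs two pieces: a counter for the fraction of gated steps, and a threshold that depends on that fraction. The sketch below is a minimal, hypothetical version. It assumes a step counts as gated when efficiency falls below the current threshold, matching the v6 "efficiency gate ≥ 0.15" rule.

```python
class GateMonitor:
    """Count how often the v6 efficiency gate zeroes the reward."""

    def __init__(self):
        self.steps = 0
        self.gated = 0

    def record(self, efficiency: float, threshold: float) -> None:
        self.steps += 1
        if efficiency < threshold:
            self.gated += 1

    @property
    def gated_fraction(self) -> float:
        return self.gated / self.steps if self.steps else 0.0


def gate_threshold(step: int, early_gated_fraction: float) -> float:
    """0.10 for the first 150k steps if more than 40% of early steps were gated, else 0.15."""
    if early_gated_fraction > 0.40 and step < 150_000:
        return 0.10
    return 0.15
```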