
Research Log — DonkeyCar RL Autoresearch

Chronological research findings, discoveries, bugs, and decisions. Every significant observation is recorded here for scientific reproducibility and future reference. Format: date, finding, evidence, action taken.


2026-04-12 — Project Kickoff and Initial Infrastructure

Finding: Grid Sweep as Research Baseline

Observation: Before any autoresearch, we ran an 18-config grid sweep across:

  • n_steer: [3, 5, 7]
  • n_throttle: [2, 3]
  • learning_rate: [0.001, 0.0005, 0.0001]
  • 3 repeats each
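
For reference, the full grid is just the Cartesian product of these lists. A minimal sketch (hypothetical names, not the repo's actual sweep script):

from itertools import product

GRID = {
    "n_steer": [3, 5, 7],
    "n_throttle": [2, 3],
    "learning_rate": [0.001, 0.0005, 0.0001],
}
REPEATS = 3

# 3 x 2 x 3 = 18 configs, 3 repeats each
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
assert len(configs) == 18

for config in configs:
    for seed in range(REPEATS):
        print(config, "seed", seed)  # stand-in for the actual trial runner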

Important caveat discovered later: This sweep used a random action policy (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.

Valid insight from this data: Action discretization matters even for random policy.
n_steer=7, n_throttle=2 outperformed n_steer=3, n_throttle=2 with random actions — more steering granularity helps even without learning.

Data location: outerloop-results/clean_sweep_results.jsonl (18 records)


2026-04-12 — Discovery: Random Policy Bug (Critical)

Finding: Inner Loop Was Never Training

Observation: The donkeycar_sb3_runner.py was calling env.action_space.sample() instead of model.learn(). This was introduced when we removed the broken model.save() call that caused NameError: name 'model' is not defined.

Root cause: Legacy code path removal was too aggressive — removed training along with the broken save call.

Impact:

  • All 300 autoresearch trials (two overnight runs) used random policy
  • learning_rate parameter was passed but completely ignored
  • mean_reward values reflect random-walk quality, not RL training quality
  • The GP+UCB found the best action space for random walking, not the best hyperparameters for learning

Valid salvage: The n_steer=8, n_throttle=5 finding is valid as a discretization insight.
Invalid: All learning_rate optimization in the 300-trial autoresearch runs.

Fix: Completely rebuilt runner with real PPO.learn() + evaluate_policy() + model.save().
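
The corrected inner loop, as a minimal sketch (assuming env is an already-wrapped DonkeyCar env; the real runner has more plumbing):

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, learning_rate, total_timesteps, model_path):
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=total_timesteps)   # the step that was missing
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=3)
    model.save(model_path)                         # only after model exists (ADR-005)
    return mean_reward, std_reward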

Decision record: ADR-005 — Never call model.save() before model is defined.


2026-04-12 — Autoresearch Infrastructure Proven

Finding: GP+UCB Autoresearch Works Correctly

Observation: The GP+UCB meta-controller correctly:

  • Loads prior results and fits a Gaussian Process
  • Uses UCB acquisition to balance exploration/exploitation
  • Proposes parameters outside the original grid (e.g., n_steer=6 was never in grid)
  • Converges toward higher-reward regions with each trial

Evidence: After 300 trials, the top-5 consistently clustered around n_steer=7-9, n_throttle=4-5, lr≈0.002 — a coherent high-reward region.

Conclusion: The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
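
For the record, the core of the meta-loop is small. A minimal GP+UCB sketch (illustrative, not the controller's exact code):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next(X_observed, y_observed, X_candidates, kappa=2.0):
    """Fit a GP to past (params, reward) pairs and pick the UCB maximiser."""
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(X_observed, y_observed)
    mu, sigma = gp.predict(X_candidates, return_std=True)
    ucb = mu + kappa * sigma  # high predicted reward OR high uncertainty wins
    return X_candidates[np.argmax(ucb)]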


2026-04-13 — Phase 1 Launch: First Real Training Attempt

Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps

Observation: First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At roughly 5-10 steps/sec for PPO+CNN training on CPU, this requires 2000-6000 seconds per trial — far exceeding the 600-second timeout.

Evidence: Trials 1-6 all timed out at exactly 600 seconds.

Fix: Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.

Lesson: Always calibrate timeout to actual sim + training speed before launching sweeps.
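
The calibration itself is one line of arithmetic. A sketch using this entry's numbers:

def expected_trial_seconds(timesteps, steps_per_sec):
    # Rough per-trial duration; measure steps_per_sec empirically first
    return timesteps / steps_per_sec

assert expected_trial_seconds(5000, 15) <= 480  # worst case ~333 s: fits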


2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)

Observation: During early Phase 1 training, the car's steering values changed but the car did not move.

Root cause: PPO with continuous action space outputs actions in [-1, 1] for all dimensions.
DonkeyCar expects throttle ∈ [0, 1]. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.

Fix: Added ThrottleClampWrapper that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.

Impact: Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
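
A minimal sketch of the wrapper (the repo's version may differ in details):

import gym
import numpy as np

class ThrottleClampWrapper(gym.ActionWrapper):
    """Force throttle into [0.2, 1.0] so the car always moves forward."""

    def action(self, action):
        steer, throttle = action
        return np.array([steer, np.clip(throttle, 0.2, 1.0)], dtype=np.float32)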


2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨

Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary

Observation: After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:

| Trial | mean_reward | n_throttle | lr | Verdict |
|---|---|---|---|---|
| 8 | 1936.9 | 2 | 0.00145 | 🚨 HACKED |
| 13 | 1139.4 | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |

Root cause: The SpeedRewardWrapper computed:

reward = speed × (1 - abs(cte) / max_cte)

The model discovered a policy that maximizes this formula without genuine track driving:

  1. Drive fast toward the track boundary
  2. Return to track center (momentarily low CTE = high reward)
  3. Repeat — "oscillation farming"

The crash penalty (-10) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.

Physical impossibility check: A car driving at max speed (≈5 m/s) perfectly centered for 3429 steps would accumulate ≈ 5.0 × 1.0 × 3429 = 17,145. Observed max was 1937 — so technically possible but the high variance (std_reward=34) across only 3 eval episodes and the user's direct observation confirm hacking.

User observation (direct visual confirmation): "The model found a way to rig the reward by just going left — it was off the track and then back on the track."

Impact: The entire Phase 1 dataset with reward_shaping=True is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.

Action taken:

  • Archived all Phase 1 results: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
  • Archived hacked models: models/ARCHIVED_reward_hacking/
  • Redesigned reward function entirely

2026-04-13 — Fix: Hack-Proof Reward Shaping Design

Finding: Multiplicative Speed Bonus Prevents Reward Hacking

Problem with additive formula: reward = speed × f(cte) can be maximized by maximizing speed independently of f(cte).

Solution — multiplicative on-track bonus:

if original_reward > 0:
    shaped = original_reward × (1 + speed_scale × speed)
else:
    shaped = original_reward  # No speed bonus when off track

Why this is hack-proof:

  • original_reward > 0 is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
  • When off track, original_reward ≤ 0 — no speed reward possible
  • The model cannot increase reward by going fast off-track
  • The formula is bounded: shaped ≤ original_reward × (1 + speed_scale × max_speed)

Author's insight: "Speed should only be rewarded if you are progressing down the track."

Implementation: agent/reward_wrapper.py — SpeedRewardWrapper v2.
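
A minimal sketch of v2 (assuming the sim reports speed via the info dict; speed_scale is an illustrative value):

import gym

class SpeedRewardWrapperV2(gym.Wrapper):
    def __init__(self, env, speed_scale=0.5):
        super().__init__(env)
        self.speed_scale = speed_scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if reward > 0:  # on track and centred, per the sim's own CTE signal
            reward *= 1.0 + self.speed_scale * info.get("speed", 0.0)
        return obs, reward, done, info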


2026-04-13 — Lesson: Reward Function Design Principles

From this experience, we derived the following principles for DonkeyCar RL reward shaping:

  1. Never reward speed unconditionally. Speed reward must be gated on track presence.
  2. The original DonkeyCar reward is the ground truth. Any shaping must respect it, not replace it.
  3. Multiplicative bonuses are safer than additive. They can't be maximized independently.
  4. High variance in eval reward is a red flag. std_reward=34 on 3 episodes suggests instability.
  5. Physically impossible reward values signal hacking. Establish theoretical reward bounds before training.
  6. Low n_throttle (=2) may enable hacking. With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.

Next Research Questions

  1. Does n_throttle=2 uniquely enable hacking? The hacked models all had n_throttle=2. With only 2 throttle states (stop/full-throttle), oscillation may be easier to exploit.
  2. What is the minimum timestep for genuine learning? The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
  3. Does the multiplicative reward fix change the optimal hyperparameter region? Re-run autoresearch with fixed reward and compare top configurations.
  4. Can we detect reward hacking automatically? A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training.
  5. What does a genuinely good reward look like? After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.

2026-04-13 — Critical Discovery: Circular Driving Exploit (v2 Reward Still Hackable)

Finding: Car Learns to Circle at Starting Line

User observation (direct visual): "The model found a way to rig the reward by going left in circles — it was off the track and then back on track, but detected as failure. Model uses this as best way to maximize reward."

Data confirmation:

| Trial | mean_reward | std_reward | cv% | r/step | Verdict |
|---|---|---|---|---|---|
| 1 | 270.56 | 0.143 | 0.1% | 0.086 | ⚠️ CIRCULAR (suspiciously low std) |
| 5 | 4582.80 | 0.485 | 0.0% | 0.957 | 🚨 CIRCULAR (confirmed) |
| 10 | 682.74 | 420.91 | 61.7% | 0.153 | ⚠️ UNSTABLE (sometimes circles, sometimes crashes) |

Statistical signature of circular motion:

  • cv (coefficient of variation = std/mean) < 1% with high reward → very consistent behavior
  • Circular driving IS very consistent: every circle is the same
  • Legitimate driving is stochastic: different obstacles, curves, luck
  • Trial 5: cv=0.0% over 3 eval episodes → textbook circling

Why v2 reward still allowed this:

  • v2 fix: reward = original × (1 + speed_scale × speed) ONLY when on track
  • Car circling at the starting line HAS: low CTE (on track centerline) + positive speed
  • Result: full speed bonus for circling → 4582 reward over 4787 steps
  • CTE and raw speed cannot distinguish forward from circular motion

Root Cause: Missing Dimension — Track Progress

The fundamental issue: neither CTE nor speed captures PROGRESS along the track.

  • CTE measures: am I near the centerline? (yes for circles)
  • Speed measures: am I moving? (yes for circles)
  • Progress measures: am I getting anywhere new? (NO for circles)

Fix: Path Efficiency Reward (v3)

Formula:

efficiency = net_displacement / total_path_length  (over sliding window of 30 steps)
shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)

Why this works:

  • Forward driving: efficiency ≈ 1.0 (all movement is productive)
  • Circular driving: efficiency ≈ 0.0 (lots of steps, car returns to start position)
  • The speed bonus disappears when circling → car incentivized to go FORWARD
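
A minimal sketch of the efficiency term (illustrative names; window size from the formula above):

from collections import deque
import math

class PathEfficiencyTracker:
    def __init__(self, window=30):
        self.positions = deque(maxlen=window)  # recent (x, z) positions

    def update(self, x, z):
        self.positions.append((x, z))
        if len(self.positions) < 2:
            return 1.0  # not enough history yet; assume efficient
        pts = list(self.positions)
        path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))  # distance travelled
        net = math.dist(pts[0], pts[-1])                           # net displacement
        return net / path if path > 1e-6 else 0.0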

Proof (tests):

  • test_efficiency_near_zero_for_circular_driving: efficiency < 0.2 after full circle
  • test_efficiency_near_one_for_straight_driving: efficiency > 0.90 for straight line
  • test_straight_driving_gets_higher_reward_than_circular: key guarantee test

Data archived:

  • autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl (12 records, circular)
  • models/ARCHIVED_circular_driving/ (trial-0001 through trial-0013)

Lesson: cv% is a Reward Hacking Indicator

| cv% | Interpretation |
|---|---|
| < 1% + high reward | Likely reward hacking (very consistent exploit) |
| 1-10% | Normal RL variance |
| > 50% | Unstable policy, inconsistent behavior |

This metric will be added to the autoresearch result logging and summary.
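
A sketch of the proposed flagging logic (thresholds are the ones suggested in this log; the "high reward" cutoff is a placeholder):

def flag_trial(mean_reward, std_reward, n_steps):
    cv = std_reward / mean_reward if mean_reward else float("inf")
    if cv < 0.01 and mean_reward > 100:  # 100 = placeholder for "high reward"
        return "suspect: reward hacking (cv < 1% with high reward)"
    if n_steps and mean_reward / n_steps > 2.0:
        return "suspect: implausible reward per step"
    if cv > 0.5:
        return "unstable policy"
    return "ok"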


2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!

Finding: Champion Model Drives the Track — Real RL Behaviour Proven

This is the first confirmed genuine driving result from the autoresearch pipeline.

Visual confirmation (user): "It is definitely driving! The donkeycar is driving along the track!"

Evaluation data — 3 episodes, 1500 max steps:

| Episode | Steps | Total Reward | Std | Efficiency |
|---|---|---|---|---|
| 1 | 599 | 1022.73 | — | 96-100% |
| 2 | 598 | 1023.35 | — | 96-100% |
| 3 | 599 | 1022.25 | — | 96-100% |
| Mean | 599 | 1022.78 | 0.45 | ~99% |

Champion Model Parameters:

  • agent: PPO, n_steer=7, n_throttle=3, lr=0.000680, timesteps=4787
  • Path: agent/models/champion/model.zip

Track Trajectory Analysis

Start:    Pos(6.25,  6.30)   → Starting line
Step 300: Pos(22.80, 2.09)   → Long straight, approaching first corner
Step 400: Pos(18.80, -6.96)  → Negotiating first right-hand curve ✅
Step 500: Pos(28.12, -5.61)  → Continuing along second straight
Step 560: Pos(33.12, -6.55)  → Approaching second corner
Step 599: CRASH CTE=8.26     → Off track at second corner ❌

The car successfully:

  • Accelerates from 0 → 2.3 m/s along the straight
  • Navigates the first right-hand curve
  • Follows the track for ~600 steps covering ~30+ position units

Failure Analysis: The S-Curve Crash

User observation: "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."

What the data shows:

  • Steps 540-560: CTE briefly near zero (0.24) — car approaches corner well
  • Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — car overshoots
  • Speed at crash: 2.23-2.30 m/s — too fast for the S-curve

Root cause: Only 4787 training timesteps — insufficient to learn:

  1. Speed reduction approaching corners
  2. Left-turn recovery after right-hand overshoot
  3. S-curve geometry (right → quick left transition)

Key insight: The model never sees the left-hand curve because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what's beyond.

Reward Shaping Victory

All 3 reward hacking fixes proved necessary and correct:

  • v1 additive → boundary oscillation exploit
  • v2 multiplicative → circular driving exploit
  • v3 path efficiency → genuine forward driving

The path efficiency metric (96-100% throughout entire run) confirms the car is making continuous forward progress — not circling, not oscillating.

Phase 1 → Phase 2 Transition

Phase 1 objective achieved: A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and correct trajectory.

Next objective (targeted autoresearch): Learn corner handling and speed modulation.

  • Increase timesteps to 10,000-50,000 per trial
  • The model needs to see the S-curve many times to learn the transition
  • Consider adding a CTE-rate-of-change penalty to discourage high speed at high CTE

This is Research!

The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represents real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution to solving the circular driving problem without requiring track-shape knowledge.


2026-04-13 — Reward v4: Full Sim Bypass (base × efficiency × speed)

Finding: v3 Still Allowed Circling — Base Reward Not Gated by Efficiency

Observation (user): Car turning left or right from start in Phase 2 runs (47k timestep trials).

Root cause discovered in donkey_sim.py:

# sim's own reward (lines 478-498):
if self.forward_vel > 0.0:
    return (1.0 - abs(cte)/max_cte) * self.forward_vel

forward_vel = dot(car_heading, velocity). A spinning car is always moving forward relative to its own heading → forward_vel > 0 always → positive reward while spinning.

Why v3 was insufficient:

  • v3 multiplied the SPEED BONUS by efficiency: original × (1 + scale × speed × eff)
  • But original (from sim) was already exploitable: CTE≈0 while spinning → original=1.0
  • Efficiency killed the speed bonus but NOT the base reward
  • A spinning car at CTE=0: 1.0/step × 47k steps = 47k total reward (never crashes in circle!)

Fix — v4 formula:

reward = base_CTE × efficiency × (1 + speed_scale × speed)

Where base_CTE = 1 - abs(cte)/max_cte computed from info dict, completely bypassing the sim.

  • Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed
  • Forward driving (eff≈1): reward = base × (1 + scale × speed)
  • All three terms must be high simultaneously to earn reward
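
A minimal sketch of v4, reusing the PathEfficiencyTracker sketch from the v3 entry (assumes info carries "cte", "speed" and a "pos" tuple, as gym-donkeycar exposes, but worth verifying; max_cte and speed_scale are illustrative):

import gym

class RewardWrapperV4(gym.Wrapper):
    def __init__(self, env, max_cte=8.0, speed_scale=0.5):
        super().__init__(env)
        self.max_cte = max_cte
        self.speed_scale = speed_scale
        self.tracker = PathEfficiencyTracker(window=30)

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # sim's own reward discarded
        base = 1.0 - abs(info["cte"]) / self.max_cte
        eff = self.tracker.update(info["pos"][0], info["pos"][2])
        reward = base * eff * (1.0 + self.speed_scale * info["speed"])
        return obs, reward, done, info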

Key test added: test_circling_at_zero_cte_gives_near_zero_reward — confirms the core v4 guarantee that the worst-case exploit (CTE=0 spinning) earns near-zero reward.

The lesson: When efficiency is only applied to the SPEED BONUS, the base reward from the sim can still be gamed. The efficiency multiplier must apply to the ENTIRE reward.


2026-04-14 — 🏆 PHASE 2 MILESTONE: All Top Models Complete the Track!

Finding: Track Completion Achieved — Multiple Distinct Driving Styles

User visual confirmation: All 3 top Phase 2 models successfully complete the entire track!

Model comparison at 3000 steps:

| Model | Steps | Reward | Std | Driving Style |
|---|---|---|---|---|
| Trial 20 (n_steer=3, n_throttle=5, lr=0.000225, 13k steps) | 2874 | 2297 | 5.7 | Right lane, very stable |
| Trial 8 (n_steer=4, n_throttle=3, lr=0.00117, 34k steps) | 2258 | 2072 | 0.4 | Left/center, oscillating |
| Trial 18 (n_steer=3, n_throttle=5, lr=0.000288, 16k steps) | 2256 | 2072 | 0.4 | Right shoulder, very accurate |

Key insight — the track ENDS! The runs don't time out — the car genuinely completes the full track. The CTE spike at the end is the car reaching the track boundary/finish.

Why Different Driving Styles Emerged

Action space discretization is the dominant factor:

  • n_steer=3: Only LEFT/STRAIGHT/RIGHT → decisive, committed steering → clean lane following
  • n_steer=4: 4 steer positions → oscillating correction policy (still completes track)
  • n_throttle=5: More speed granularity → smoother corner negotiation

CTE reward symmetry creates multiple valid solutions: The reward base_CTE × efficiency × speed is symmetric — driving 0.5m left of center = driving 0.5m right of center (same |CTE|). PPO random initialization determines which symmetric solution the model converges to. This is why Trials 20 and 18 drive on opposite sides of the road despite similar hyperparameters.

Emergent counterintuitive finding: FEWER steering bins → BETTER driving. Trial 20 (n_steer=3) outperforms Trial 8 (n_steer=4) in both distance and smoothness. With only 3 steering bins, the model is forced to commit to decisive actions, developing a cleaner driving policy. More action granularity introduced oscillation without improving performance.

Can We Control Driving Behaviour?

Yes! Through targeted reward shaping:

  1. Lane position targeting: reward = 1 - abs(cte - target_offset)/max_cte → bias to specific lane position
  2. Anti-oscillation penalty: Penalize rapid steering changes → eliminates Model 2 oscillation
  3. Asymmetric CTE: Penalize left-of-center more → enforces right-lane driving rule
  4. Speed zones: Reward deceleration before corners (future work)
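
For instance, idea 1 is a one-line change to the v4 base term. A hypothetical sketch:

def lane_biased_base(cte, max_cte=8.0, target_offset=0.5):
    # Peak reward at cte == target_offset (positive = right of centre)
    # instead of the symmetric peak at cte == 0
    return 1.0 - abs(cte - target_offset) / max_cte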

Phase 2 → Phase 3 Transition

Phase 2 objective ACHIEVED: Models complete the full track with genuine learned driving behaviour.

Phase 3 objectives:

  • Behavioral control (lane position, oscillation suppression)
  • Speed optimization (fastest lap time)
  • Multi-track generalization
  • Fine-tuning from Phase 2 champion

Phase 2 Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps


2026-04-14 — Track Switching API: exit_scene() Works Automatically

Finding: Automatic Scene Switching via unwrapped viewer

Problem: gym.make('donkey-generated-track-v0') ignores the scene name if the simulator already has a scene running — it just uses the current scene.

Root cause: The sim only responds to scene selection when it's at the main menu (scene_selection_ready state). If a scene is loaded, it sends need_car_config instead.

Fix: env.unwrapped.viewer.exit_scene() sends the exit message through the established websocket connection. Raw TCP socket approach failed because the DonkeyCar protocol requires proper framing.

Working procedure:

import gym
import time

temp_env = gym.make(current_scene_env_id)
temp_env.unwrapped.viewer.exit_scene()  # sends the exit message via websocket
time.sleep(4)                           # wait for the sim to reach the main menu
temp_env.unwrapped.viewer.quit()
env = gym.make(target_env_id)           # sim now loads the correct scene

Confirmed: the "loading scene generated_road" message appears in the logs after the switch.

Impact: Fully automated multi-track evaluation and training without user intervention!


2026-04-14 — PHASE 3 BEGINS: Multi-Track Generalization Evaluation


2026-04-14 — Multi-Track Generalization Baseline: Complete Results

Experiment: All 3 Phase 2 Champions vs All 10 Available Tracks

Setup: 3 episodes × 800 max steps per model per track. Automatic track switching via exit_scene API.

Results:

| Track | Trained | T20 Steps | T08 Steps | T18 Steps |
|---|---|---|---|---|
| Generated Road | YES | 321 | 800 | 53 |
| Generated Track | unseen | 52 | 52 | 106 |
| Mountain Track | unseen | 67 | 66 | 46 |
| Warehouse | unseen | 53 | 67 | 53 |
| AVC Sparkfun | unseen | 60 | 95 | 49 |
| Mini Monaco | unseen | 48 | 38 | 39 |
| Warren | unseen | 58 | 82 | 54 |
| Robo Racing League | unseen | 116 | 116 | 69 |
| Waveshare | unseen | 66 | 70 | 84 |
| Circuit Launch | unseen | 42 | 79 | 37 |

Verdict: T20 drives 1/10, T08 drives 1/10, T18 drives 0/10.

Note: Thunderhill not available in this simulator version.

Analysis: Why Models Overfit

  1. Visual overfitting: The camera input is an RGB image. The model learned features specific to the generated_road visual environment (road markings, sky colour, road texture). All other tracks have completely different visual appearances — the model's CNN policy doesn't recognise them as "drivable".

  2. Interesting near-misses: Robo Racing League gave 116 steps for both T20 and T08 before crashing — suggesting this track's visual appearance has some similarities to generated_road.

  3. T18 fails even on generated_road: The random road layout was different enough that T18 (which had learned to follow the right shoulder on the original road) immediately crashed. This shows the models aren't fully generalised even within the same track type with a new random layout.

Baseline Established

This is our pre-Wave 3 baseline: 1/10 tracks drivable. Wave 3 goal: 5+/10 tracks drivable through multi-track curriculum training.

Wave 3 Multi-Track Training Strategy

Curriculum approach (progressive difficulty):

Stage 1 — Same geometry, different visuals:

  • Train alternating: generated_road ↔ generated_track
  • Goal: Learn to ignore background (trees/shadows) while keeping road-following skill
  • Expected: Models that drive both generated courses robustly

Stage 2 — Different geometry:

  • Add mountain_track to the alternation
  • Goal: Learn to handle different road widths and curve radii

Stage 3 — Any track:

  • All available tracks in rotation
  • Goal: True domain generalisation

Domain randomisation: Even within a single track, the generated_road creates different layouts each episode. This natural randomisation is already helping — but we need visual diversity too.

Key hyperparameter change for Wave 3: Increase timesteps significantly (50k-200k per trial) to give the model enough experience on multiple tracks. The model needs to see each track many times to learn track-agnostic driving features.


2026-04-14 — Wave 3 Launch: Multi-Track Training + Visual Analysis

Finding: Track Visual Classification (from screenshots)

Observation: Examined all 10 available DonkeyCar track screenshots at the starting line.

Outdoor tracks (same domain — sky, asphalt, lane markings):

Track Road Surface Markings Background Training Role
Generated Road Grey smooth asphalt Yellow centre + white edge Bare desert TRAINED
Generated Track Same grey asphalt Yellow centre, orange cones Trees + grass TRAIN
Mountain Track Darker/wet asphalt Yellow centre, barriers Trees + mountains TRAIN
Mini Monaco Grey asphalt Yellow centre + white edge Trees + chain-link fence TEST (zero-shot)
Warren White painted lines on grass Yellow dashes Indoor tent, outdoor setting TEST (zero-shot)
AVC Sparkfun Cracked rough asphalt Orange markings Outdoor but very different SKIP (too different)

Indoor tracks (completely different domain — carpet/floor surface):

  • Warehouse (yellow floor), Robo Racing League (office interior), Waveshare (desktop mat), Circuit Launch (convention hall) — all SKIP for now

Key insight on Warren: Although technically under a tent shelter, Warren has proper road-style track geometry with white lane lines and yellow centre dashes, similar to outdoor road tracks. It was classified as a pseudo-outdoor track and included in the zero-shot test set (not indoor skip category).

Key insight on Robo Racing League 116-step anomaly: NOT visual similarity — the indoor office track looks nothing like generated_road. More likely the episode boundary tolerance was different, allowing the car to wander longer before triggering done=True.

Decision: Wave 3 Track Split

  • Training set (seen during training): generated_road, generated_track, mountain_track
  • Test set (zero-shot generalization benchmark): mini_monaco, warren
  • Metric: combined_test_score = mini_monaco_mean_reward + warren_mean_reward

This mirrors Will Roscoe's approach: train on multiple similar tracks, test on held-out track.

Implementation: Wave 3 Autoresearch System

New files:

  • agent/multitrack_runner.py — Inner training loop: round-robin across 3 training tracks, warm-starts from Phase 2 champion, evaluates on test tracks
  • agent/wave3_controller.py — GP+UCB outer loop: optimises for zero-shot test score
  • tests/test_wave3.py — 30 new tests (83 total, all passing)

Track switching mechanism: close_and_switch():

  1. env.close() + time.sleep(2) [ADR-006]
  2. send_exit_scene_raw() + 4s wait
  3. gym.make(next_env_id) + apply wrappers

Training strategy (round-robin): With steps_per_switch=10000 and 3 tracks, the model rotates: generated_road → generated_track → mountain_track → generated_road → ... Each track gets roughly equal time. GP can tune steps_per_switch to change rotation rate.
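
A sketch of that rotation (hypothetical helper names and assumed env IDs; the real logic lives in agent/multitrack_runner.py):

TRACK_IDS = ["donkey-generated-roads-v0", "donkey-generated-track-v0",
             "donkey-mountain-track-v0"]  # assumed IDs for the 3 training tracks

def multitrack_train(model, total_timesteps, steps_per_switch):
    trained, i = 0, 0
    env = make_track_env(TRACK_IDS[i])  # hypothetical: gym.make + wrappers
    while trained < total_timesteps:
        model.set_env(env)
        chunk = min(steps_per_switch, total_timesteps - trained)
        model.learn(total_timesteps=chunk, reset_num_timesteps=False)
        trained += chunk
        i = (i + 1) % len(TRACK_IDS)
        env = close_and_switch(env, TRACK_IDS[i])  # the procedure above
    return model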

GP+UCB parameter space:

  • learning_rate: [5e-5, 1e-3] — centred near Phase 2 champion (2.25e-4)
  • steps_per_switch: [2000, 25000] — how long to stay on each track
  • total_timesteps: [80000, 400000] — total training budget

Seed trials: First 2 trials use hardcoded params to bootstrap the GP:

  1. lr=2.25e-4, switch=10k, total=150k (near Phase 2 champion)
  2. lr=2.25e-4, switch=20k, total=300k (longer, less frequent switching)

Warm-start: All Wave 3 trials warm-start from models/champion/model.zip (Phase 2 champion Trial 20), which already knows how to drive generated_road. This dramatically speeds up training — the model starts from a working policy, not from scratch.
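
In SB3 this is a one-liner, since load() accepts an env. A minimal sketch:

from stable_baselines3 import PPO

# env: a wrapped DonkeyCar training env (see above)
model = PPO.load("models/champion/model.zip", env=env)
model.learn(total_timesteps=150_000, reset_num_timesteps=False)
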

Pre-Wave 3 baseline: 1/10 tracks drivable (0/2 test tracks) Wave 3 goal: Both test tracks drivable (mini_monaco + warren) — 2/2 held-out tracks