donkeycar-rl-autoresearch/DECISIONS.md

Architecture Decision Records — DonkeyCar RL Autoresearch

One ADR per major non-obvious technical choice. Agents read this to avoid re-opening settled decisions.


ADR-001: PPO over DQN as Primary Agent

Date: 2026-04-13
Status: Accepted

Context: DonkeyCar driving is a continuous control problem (steer ∈ [-1,1], throttle ∈ [0,1]). DQN requires discrete action spaces; we worked around this with DiscretizedActionWrapper. PPO supports continuous action spaces natively.

Decision: Use PPO as the primary agent. Keep DQN support for discrete action experiments.

Consequences:

  • PPO trains faster on continuous driving tasks (no discretization artifacts)
  • No need for DiscretizedActionWrapper with PPO (but keep it for DQN experiments)
  • PPO with CnnPolicy handles raw image observations natively
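
To make the "discretization artifacts" above concrete, here is a minimal sketch of how a discrete action index might map to a steering value. The names `steer_bins` and `nearest_bin`, and the evenly spaced layout, are assumptions for illustration; they are not taken from the actual DiscretizedActionWrapper.

```python
def steer_bins(n_steer: int) -> list[float]:
    """Evenly spaced steering values covering [-1, 1] (assumed layout)."""
    if n_steer < 2:
        raise ValueError("need at least 2 bins to cover [-1, 1]")
    step = 2.0 / (n_steer - 1)
    return [-1.0 + i * step for i in range(n_steer)]


def nearest_bin(steer: float, bins: list[float]) -> float:
    """Steering value a discrete agent would actually apply."""
    return min(bins, key=lambda b: abs(b - steer))


bins = steer_bins(5)              # five bins spaced 0.5 apart
applied = nearest_bin(0.2, bins)  # a gentle 0.2 correction snaps to the 0.0 bin
```

With five bins, any desired steering value between -0.25 and 0.25 collapses to straight ahead, which is the resolution loss that a continuous PPO policy avoids.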

Rejected alternatives:

  • DQN only — requires discretization; loses steering resolution
  • SAC — valid alternative but PPO is simpler and well-tested on DonkeyCar

ADR-002: Pure Numpy GP (TinyGP) over sklearn

Date: 2026-04-13
Status: Accepted

Context: We need a Gaussian Process surrogate model for the autoresearch controller. sklearn.gaussian_process exists but has had compatibility issues with our numpy version.

Decision: Use TinyGP — a pure numpy RBF kernel GP implemented in autoresearch_controller.py.
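
For orientation, a pure-numpy RBF GP of this kind can be quite small. The sketch below is not the TinyGP code from autoresearch_controller.py; the class name `TinyGPSketch`, the default `length_scale` and `noise` values, and the constant-mean treatment are all assumptions made for illustration.

```python
import numpy as np


class TinyGPSketch:
    """Minimal RBF-kernel GP regressor (illustrative; not the project's TinyGP)."""

    def __init__(self, length_scale: float = 1.0, noise: float = 1e-3):
        self.length_scale = length_scale
        self.noise = noise

    def _kernel(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Squared Euclidean distance between every row of a and every row of b.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-0.5 * np.maximum(sq, 0.0) / self.length_scale**2)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TinyGPSketch":
        self.X = np.asarray(X, dtype=float)
        self.y_mean = float(np.mean(y))
        K = self._kernel(self.X, self.X) + self.noise * np.eye(len(self.X))
        self.L = np.linalg.cholesky(K)
        resid = np.asarray(y, dtype=float) - self.y_mean
        # alpha = K^-1 (y - mean), solved through the Cholesky factor.
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, resid))
        return self

    def predict(self, Xs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        Xs = np.asarray(Xs, dtype=float)
        Ks = self._kernel(Xs, self.X)
        mean = Ks @ self.alpha + self.y_mean
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - np.sum(v**2, axis=0)  # RBF prior variance k(x, x) is 1
        return mean, np.sqrt(np.maximum(var, 0.0))


gp = TinyGPSketch(length_scale=0.5).fit(np.array([[0.0], [1.0]]), np.array([10.0, 20.0]))
mean, std = gp.predict(np.array([[0.0], [0.5], [3.0]]))
```

Near an observed point the predicted standard deviation is small; far from all data it returns to the prior value of about 1, which is the uncertainty signal an acquisition function can exploit.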

Consequences:

  • No sklearn dependency
  • Full control over kernel and noise parameters
  • Slightly less optimized than sklearn but sufficient for < 1000 data points

Rejected alternatives:

  • sklearn GaussianProcessRegressor — dependency issues
  • GPyTorch — overkill, adds PyTorch dependency
  • Botorch — same

ADR-003: JSONL Append-Only Results

Date: 2026-04-13
Status: Accepted

Context: Results from 300+ trials must be persistent, recoverable, and never lost.

Decision: All results are appended to JSONL files. Results files are never truncated or overwritten.

Consequences:

  • System can be interrupted and resumed at any point
  • Historical data is preserved even if a later trial fails
  • Easy to parse with json.loads(line) per line
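
A minimal sketch of the append-and-reload cycle, assuming one JSON object per line. The function names and the record fields (`trial`, `mean_reward`) are hypothetical; the real results schema may differ.

```python
import json
import os
import tempfile


def append_result(path: str, record: dict) -> None:
    """Append one trial result as a single JSON line; never truncates the file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the line to disk before the next trial starts


def load_results(path: str) -> list[dict]:
    """Read all complete records, skipping blank or partially written lines."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # e.g. a line cut short by an interrupted run
    return records


results_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
append_result(results_path, {"trial": 1, "mean_reward": 42.5})
append_result(results_path, {"trial": 2, "mean_reward": 97.0})
loaded = load_results(results_path)
```

Opening in append mode is what makes resume safe: a restarted controller reloads the existing records and keeps writing to the same file.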

Rejected alternatives:

  • SQLite — adds dependency, overkill for this volume
  • CSV — loses type information, harder to extend

ADR-004: Gaussian Process + UCB over Grid Search

Date: 2026-04-13
Status: Accepted

Context: We need an intelligent hyperparameter search strategy. Grid search was the starting point but misses non-grid-aligned optimal regions (proven: n_steer=8 was NOT in the original grid of [3,5,7]).

Decision: Gaussian Process + Upper Confidence Bound (UCB) acquisition. GP models the reward landscape; UCB balances exploration vs exploitation.

kappa=2.0 default: reasonable balance, can be increased for more exploration.

Consequences:

  • Finds optimal regions with fewer trials than grid search
  • Naturally handles continuous parameter spaces (learning_rate ∈ [0.00005, 0.005])
  • Requires at least 2 data points before GP can be fit (random sampling for first 2 trials)
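
The selection rule described above can be sketched as follows. This is an illustration only: `propose`, the `predict` callable, and the toy posterior values are assumptions, and the real controller would take its mean and standard deviation from the GP rather than from a lookup table.

```python
import random


def ucb(mean: float, std: float, kappa: float = 2.0) -> float:
    """Upper Confidence Bound: predicted reward plus an exploration bonus."""
    return mean + kappa * std


def propose(candidates, predict, n_observed, kappa=2.0, rng=random):
    """Pick the next candidate; sample randomly until the GP has 2 points to fit."""
    if n_observed < 2:
        return rng.choice(candidates)
    return max(candidates, key=lambda c: ucb(*predict(c), kappa=kappa))


# Toy surrogate standing in for the GP: (mean, std) per learning rate.
toy_posterior = {1e-4: (50.0, 5.0), 1e-3: (60.0, 1.0), 5e-3: (40.0, 15.0)}
candidates = list(toy_posterior)
best = propose(candidates, toy_posterior.get, n_observed=10)
greedy = propose(candidates, toy_posterior.get, n_observed=10, kappa=0.0)
```

With kappa=2.0 the uncertain candidate (5e-3) wins despite its lower mean, while kappa=0.0 reduces to pure exploitation and picks the highest-mean candidate (1e-3).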

Rejected alternatives:

  • Random search — better than grid but no learning
  • Tree-structured Parzen Estimator (TPE, as used by Optuna): valid alternative, adds dependency
  • CMA-ES — better for high-dimensional spaces; our space is 3D, GP is sufficient
  • Population-Based Training (PBT) — requires parallel sim instances (we only have 1)

ADR-005: No Model Saving Before Model is Defined

Date: 2026-04-13
Status: Accepted (bug fix — never repeat)

Context: The original donkeycar_sb3_runner.py called model.save(save_path) after removing the model training code. This caused NameError: name 'model' is not defined on every single run for 300 trials.

Decision: Never call model.save() without first verifying model is defined. Training and saving must be atomic — if training fails, no save attempt.

Pattern:

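import sys
from stable_baselines3 import PPO

try:
    # Create, train, and save in one block so save() is never reached without a model.
    model = PPO('CnnPolicy', env, ...)
    model.learn(total_timesteps=timesteps)
    model.save(save_path)
except Exception as e:
    log(f'Training failed: {e}')
    sys.exit(102)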

Rejected alternatives:

  • Checking if 'model' in locals() before save — fragile, hides bugs

ADR-006: env.close() + 2-Second Cooldown is Non-Negotiable

Date: 2026-04-13
Status: Accepted

Context: Early in the project, not calling env.close() between runs caused simulator zombie processes that locked up the entire system. 20+ consecutive runs work reliably with this pattern.

Decision: Every runner process MUST:

  1. Call env.close() in a try/except before exit
  2. Sleep 2 seconds after close
  3. Then exit

This applies even if training or evaluation fails.
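
One way to guarantee the close-then-cooldown sequence on every exit path is a try/finally wrapper. The sketch below is hypothetical: `run_with_cleanup` and `FakeEnv` are illustrative names, and the exit code 102 is borrowed from the ADR-005 pattern rather than from the actual runner.

```python
import time


def run_with_cleanup(env, work, cooldown_s: float = 2.0) -> int:
    """Run work(env), then always close the env and wait before returning."""
    exit_code = 0
    try:
        work(env)
    except Exception as e:
        print(f"Run failed: {e}")
        exit_code = 102
    finally:
        try:
            env.close()
        except Exception as e:
            print(f"env.close() failed: {e}")
        time.sleep(cooldown_s)  # give the simulator time to release the connection
    return exit_code


class FakeEnv:
    """Stand-in for the gym env so the sketch runs without a simulator."""
    closed = False

    def close(self):
        self.closed = True


def failing_work(env):
    raise RuntimeError("simulated training failure")


fake_env = FakeEnv()
code = run_with_cleanup(fake_env, failing_work, cooldown_s=0.0)
```

A real runner would keep the default 2-second cooldown and pass the returned code to `sys.exit()`; the cooldown is set to 0 here only so the example finishes immediately.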

Rejected alternatives:

  • Relying on Python garbage collection for env cleanup — proven to cause hangs

ADR-007: PPO with CnnPolicy for Image Observations

Date: 2026-04-13
Status: Accepted

Context: DonkeyCar provides 120x160x3 RGB camera images as observations. The policy must process images.

Decision: Use PPO('CnnPolicy', env, ...) from SB3. CnnPolicy automatically handles image preprocessing with a CNN feature extractor.

Consequences:

  • Larger model than MlpPolicy (image processing overhead)
  • Requires VecTransposeImage wrapper (SB3 handles this internally)
  • Training is slower per step but produces better driving behavior

Rejected alternatives:

  • MlpPolicy — cannot handle raw image inputs
  • Custom CNN — unnecessary complexity given SB3's built-in CnnPolicy

ADR-008: All Phases Planned, Phase 1 Executed First

Date: 2026-04-13
Status: Accepted

Context: User asked whether to implement Phase 1 only or all phases. Three phases identified:

  1. Real Training Foundation
  2. Multi-Track Generalization
  3. Racing / Speed Optimization

Decision: Plan all phases in full documentation, execute Phase 1 first. Do not start Phase 2 until Phase 1 produces a genuine champion model (mean_reward > 100 on training track). This creates a wave gate between Phase 1 and Phase 2.

Rationale: Phase 2 and 3 depend on having a real trained model. Without Phase 1 complete, there is nothing to generalize or optimize for speed.


ADR-009: Tests Must Not Require Live Simulator

Date: 2026-04-13
Status: Accepted

Context: The DonkeyCar simulator must be running on port 9091 for live training. Tests cannot depend on this.

Decision: All pytest tests mock the gym environment. Integration tests use a MagicMock gym env that returns fake observations, rewards, and done signals. Only manual/acceptance tests require the live simulator.

Pattern:

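from unittest.mock import MagicMock, patch

import gymnasium as gym
import numpy as np

@patch('gymnasium.make')
def test_runner_exits_cleanly(mock_make):
    # Fake env: one zero-image observation, then an episode that ends immediately.
    mock_env = MagicMock()
    mock_env.reset.return_value = (np.zeros((120,160,3)), {})
    mock_env.step.return_value = (np.zeros((120,160,3)), 1.0, True, False, {})
    mock_env.action_space = gym.spaces.Box(...)
    mock_make.return_value = mock_env
    # ... test runner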

ADR-010: Warren is an Outdoor/Road Track — Include in Generalization Benchmark

Date: 2026-04-12
Status: Accepted

Context: Warren (UCSD Warren Track v1.0) is under a tent but has proper road geometry: white lane lines, yellow centre dashes, orange traffic cones. Unlike purely indoor tracks (Robo Racing League, Waveshare, Circuit Launch, Warehouse) which use a carpet/hard floor as the road surface with painted lines, Warren has an actual grass+painted-road layout with genuine road markings.

Decision: Warren is classified as a "pseudo-outdoor" track — visually similar to outdoor road tracks despite being sheltered. It is included in the zero-shot test set (alongside mini_monaco) rather than the indoor-skip category.

Consequence: The Wave 3 generalization benchmark = 2 held-out tracks: mini_monaco (outdoor trees + fence) + warren (pseudo-outdoor tent + road markings).


ADR-011: Wave 3 Zero-Shot Generalization — Test Tracks Never Used in Training

Date: 2026-04-12
Status: Accepted

Context: Visual overfitting confirmed — Phase 2 champion drives only the track it was trained on (generated_road). CNN learned background-specific features (desert horizon, sky colour) rather than road-invariant features (lane markings, road edges).

Decision: Wave 3 uses a strict train/test split:

  • Training tracks: generated_road, generated_track, mountain_track
  • Test tracks (zero-shot only): mini_monaco, warren
  • Optimisation target: combined_test_score = mini_monaco_mean_reward + warren_mean_reward (the GP ONLY sees test-track performance — training performance is not the objective)

Rationale: This mirrors established domain generalisation practice. If we train the GP on training reward, we could find hyperparams that overfit the training tracks while still failing the test tracks. Only test performance correctly measures generalisation.

Consequence: Zero-shot evaluation happens at the end of every trial. If a trial crashes both test tracks, score=0. GP learns that those hyperparameters don't generalise.
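
The optimisation target reduces to a small function. In this sketch the function signature and the choice to treat a missing track result as 0 are assumptions; only the track names and the sum itself come from the decision above.

```python
def combined_test_score(test_rewards: dict[str, float]) -> float:
    """Sum of mean rewards on the held-out tracks; this is what the GP sees."""
    held_out = ("mini_monaco", "warren")
    # A track with no result (e.g. the eval never produced one) contributes 0.
    return sum(test_rewards.get(track, 0.0) for track in held_out)


score = combined_test_score({
    "mini_monaco": 120.0,
    "warren": 80.0,
    "generated_road": 2000.0,  # training-track reward is deliberately ignored
})
```

Because the training-track reward never enters the sum, a trial that excels on generated_road but fails both held-out tracks still looks poor to the GP.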


ADR-012: Warm-Start from Phase 2 Champion for Wave 3

Date: 2026-04-12
Status: Accepted

Context: Training PPO from scratch across 3 tracks would require ~500k+ timesteps to reach a competent policy. Phase 2 champion (Trial 20) already drives generated_road well.

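Decision: All Wave 3 trials warm-start from models/champion/model.zip (Phase 2 champion). PPO.load(path, env=new_env) loads the weights; model.learning_rate is then overridden with the GP-proposed learning rate. Note that SB3 builds model.lr_schedule at load time, so the override only reaches the optimizer if the schedule is rebuilt afterwards or the rate is passed through the custom_objects argument of PPO.load. If loading fails, the runner falls back to a fresh PPO.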

Rationale: The champion already knows how to follow a road. Warm-starting means Wave 3 only needs to teach generalisation — learning to apply the same skill to new visual inputs. This is far more efficient than teaching driving from scratch.

Risk: If the champion's policy is over-specialised (e.g., relies on very specific pixel features of desert background), warm-starting could hinder generalisation. This is why the GP tunes learning_rate — a higher LR will more aggressively overwrite specialised features.

ADR-013: Wave 4 — Train From Scratch on 2 Visually Distinct Tracks

Date: 2026-04-14
Status: Active

Decision: Remove generated_road from training set. Train from random weights (no warm-start) on generated_track + mountain_track only. Test zero-shot on mini_monaco.

Why generated_road was removed:

  • Too visually similar to generated_track — doesn't force generalisation
  • Phase 2 champion (trained only on generated_road) was used as warm-start in Wave 3
  • Warm-start caused catastrophic forgetting: generated_road reward went 2388→37 between rotations as the model forgot it while learning other tracks
  • The warm-start weights were a local minimum the model couldn't escape

Why no warm-start:

  • Phase 2 CNN features were specialised for generated_road visual patterns
  • 3090k steps of multi-track training insufficient to overcome that prior
  • Starting from random weights lets the CNN build features useful for both tracks simultaneously

Why generated_track + mountain_track:

  • Both outdoor, asphalt, yellow/white lane markings — same task category as mini_monaco
  • Visually distinct backgrounds (trees vs mountain/barriers) — model must learn to ignore background and follow road markings, not recognise specific scenes
  • If it can drive both, the learned features should generalise to mini_monaco (same visual category, never seen during training)

Proven result: Overnight Wave 4 Trial 3 (lr=0.000685, switch=17,499, total=157,743 steps) scored mini_monaco=1943 (full 2000-step eval, never crashed). Model saved at agent/models/wave4-trial-0003/model.zip.


ADR-014: Always Measure Throughput Before Launching Long Runs

Date: 2026-04-15
Status: Active (learned the hard way)

Decision: Before launching any autoresearch campaign, run a 5-minute timing benchmark to measure actual steps/sec. Then cap total_timesteps at (time_limit_minutes - overhead_minutes) × 60 × steps_per_sec × 0.85, where 0.85 is the safety margin.
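
As a worked example of the cap formula, the sketch below compares the measured and assumed throughput. The 120-minute limit and 10-minute overhead are hypothetical inputs chosen for illustration; only the 16 and 20 steps/sec figures and the 0.85 margin come from this ADR.

```python
def timestep_cap(time_limit_minutes: float, overhead_minutes: float,
                 steps_per_sec: float, safety: float = 0.85) -> int:
    """Largest total_timesteps that should finish inside the time limit."""
    usable_seconds = (time_limit_minutes - overhead_minutes) * 60
    return int(usable_seconds * steps_per_sec * safety)


# Hypothetical budget: 120-minute limit, 10 minutes of startup and eval overhead.
cap_measured = timestep_cap(120, 10, steps_per_sec=16)  # measured Wave 4 rate
cap_assumed = timestep_cap(120, 10, steps_per_sec=20)   # the Phase 2 assumption
```

Under these assumed inputs the measured rate gives a cap of roughly 90k steps, while the 20 steps/sec assumption allows roughly 112k. A budget sized from the assumed rate therefore overshoots the time limit at the real throughput.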

Why: We assumed 20 steps/sec based on Phase 2. Actual Wave 4 throughput is 16 steps/sec (mountain_track physics is heavier), so step budgets sized for the faster rate overshot the time limit. Trials 3, 4, 7, 8, and 9 timed out, wasting 10+ hours of compute.


ADR-015: Per-Segment Model Checkpointing is Non-Negotiable

Date: 2026-04-15
Status: Active

Decision: Save model.zip after every training segment, not just at the end. If the runner is killed (timeout, crash, Ctrl+C), the latest checkpoint is on disk and training is never completely lost.

Why: Five trials timed out with no saved model. Hours of gradient updates existed only in RAM and were lost on SIGKILL.


ADR-016: Verify Fixes Are Running Before Walking Away

Date: 2026-04-15
Status: Active

Decision: After committing any fix, verify the running process is actually using the new code (check PID, log output, parameter values) before declaring it fixed. Commit + push ≠ running.

Why: Multiple fixes (90k step cap, checkpointing, rescue eval) were committed but the controller was never restarted. The fixes never ran. This caused several more timeouts that the fixes were meant to prevent.

ADR-017: Always Save the BEST Model During Training, Never Just the Latest

Date: 2026-04-17
Status: Active (enforced)

Decision: Every training script must save the best model found during training, not just the final weights. Two mechanisms are approved:

  1. train_multitrack() in multitrack_runner.py — tracks best_segment_reward, saves best_model.zip on every new high score, reloads it at the end.
  2. SB3 EvalCallback(best_model_save_path=..., deterministic=True) for standalone scripts.

No training script may be written or run without one of these two mechanisms.
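
The first mechanism can be sketched as a loop that saves only on a new high score. This is not the train_multitrack() code: `train_keep_best` and its callable parameters are hypothetical stand-ins for the real training, evaluation, and save steps.

```python
from typing import Callable, Iterable


def train_keep_best(segments: Iterable[int],
                    train_segment: Callable[[int], None],
                    evaluate: Callable[[], float],
                    save_best: Callable[[], None]) -> float:
    """Train segment by segment, saving only when evaluation sets a new high."""
    best_reward = float("-inf")
    for steps in segments:
        train_segment(steps)
        reward = evaluate()
        if reward > best_reward:
            best_reward = reward
            save_best()  # e.g. model.save("best_model.zip")
    return best_reward


# Toy run: the policy peaks in segment 2, then drifts.
rewards = iter([120.0, 900.0, 15.0])
saves = []
best = train_keep_best(
    segments=[30_000, 30_000, 30_000],
    train_segment=lambda steps: None,
    evaluate=lambda: next(rewards),
    save_best=lambda: saves.append("best_model.zip"),
)
```

In the toy run the final segment scores 15, but the saved file still holds the segment-2 weights, which is exactly the case that saving only the latest model would lose.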

Why this matters: PPO policy weights can and do drift during long training runs. A model that could drive at step 30k may be broken at step 90k. Saving only the final weights throws away the best model found during training.

What was lost because this wasn't in place:

  • Wave 4 mountain_track Exp3/4/5: model was doing 20-second laps at step 30k. Final model at step 90k crashed in 13 steps. Irrecoverable.
  • Untold mid-training peaks across Wave 3 and Wave 4 that were never captured.

Root cause of the oversight: Phase 2 autoresearch used 13k-step trials on a simple single track. The final model happened to be the best model (no time to drift). This false assumption was carried forward into longer multi-track training where it was wrong. The word "checkpoint" was misleading — we were saving the latest, not the best.

Implementation: See train_multitrack() in multitrack_runner.py — the best_segment_reward tracking and best_model.zip save logic added 2026-04-17.

ADR-018: StuckTerminationWrapper is the correct collision fix — NOT OnCollisionStay

Date: 2026-04-18
Status: Active

Decision: Do NOT add OnCollisionStay to the Unity simulator. Use StuckTerminationWrapper (displacement < 0.5m over N steps → terminate).

Why OnCollisionStay is wrong: The car legitimately rubs against barriers while cornering — this should be allowed to continue. OnCollisionStay would fire on BOTH rubbing AND stuck scenarios, terminating valid driving attempts.

Why StuckTerminationWrapper is right:

  • Rubbing + still moving forward: displacement > 0.5m in 80 steps → continues
  • Stuck perpendicular, wheels spinning: displacement < 0.5m in 80 steps → terminates

The distinction between "rubbing" and "stuck" is made by checking positional progress, not collision contact. This is the correct signal.
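
The positional-progress check can be sketched independently of the gym wrapper. The `StuckDetector` class and its (x, z) position inputs are assumptions for illustration; the real StuckTerminationWrapper may read position from the env's info dict and differ in detail.

```python
import math
from collections import deque


class StuckDetector:
    """Flags 'stuck' when the car moved < min_displacement over stuck_steps steps."""

    def __init__(self, stuck_steps: int = 80, min_displacement: float = 0.5):
        self.min_displacement = min_displacement
        # Keep stuck_steps + 1 positions so oldest and newest are stuck_steps apart.
        self.positions = deque(maxlen=stuck_steps + 1)

    def reset(self) -> None:
        self.positions.clear()

    def update(self, x: float, z: float) -> bool:
        """Record the current position; return True if the episode should end."""
        self.positions.append((x, z))
        if len(self.positions) < self.positions.maxlen:
            return False  # not enough history yet
        x0, z0 = self.positions[0]
        return math.hypot(x - x0, z - z0) < self.min_displacement


# Rubbing along a barrier at 0.05 m/step still covers 4 m in 80 steps.
rubbing = StuckDetector()
rubbing_flags = [rubbing.update(0.05 * i, 0.0) for i in range(200)]

# Wedged against a barrier: position barely changes (0.08 m in 80 steps).
wedged = StuckDetector()
wedged_flags = [wedged.update(0.001 * i, 0.0) for i in range(200)]
```

The rubbing car never trips the detector, while the wedged car is flagged as soon as 80 steps of history are available. Collision contact plays no part in either outcome.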

Tuning note: stuck_steps=80 (~5 seconds at 16 steps/sec). Could be reduced to 40 (~2.5 seconds) if stuck periods are observably long.


ADR-019: Parallel DummyVecEnv for Multi-Track Training (Not Close-and-Switch)

Date: 2026-04-19
Status: Proposed (to be validated by Exp 11)

Context: Multi-track training via close_and_switch() — closing the env, reopening on a new track, calling model.set_env() — produced unreliable results. Wave 4 had 25 trials: only 4/25 scored >500, median 111. Exp 10 used nearly identical hyperparameters to the best Wave 4 trial and failed completely (crashes <180 steps on all tracks).

Root cause: PPO is an on-policy algorithm. Its rollout buffer, value function estimates, and advantage calculations are disrupted when the environment is swapped mid-training. The model catastrophically forgets one track while training on another.

Decision: Use SB3's DummyVecEnv with one env per track, each connected to a separate sim instance on a different port. PPO collects experience from ALL tracks in every rollout batch — no switching, no forgetting.

env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])

Consequences:

  • Requires multiple sim instances (one per training track)
  • More GPU/CPU usage — can be mitigated by running sims on separate machines
  • PPO sees both tracks in every batch — no catastrophic forgetting
  • No env close/reopen — stable training throughout
  • This is how SB3 is designed to work with multiple environments

Rejected alternatives:

  • close_and_switch (current): disrupts PPO; 21 of 25 Wave 4 trials (84%) scored 500 or less
  • Same-connection scene switching — untested, still sequential, fragile

Validation: Exp 11 will test this approach. If results are consistent across multiple runs (not lottery), this ADR is confirmed.


ADR-020: Mountain Track Hill — Throttle and Reward History

Date: 2026-04-19
Status: Accepted

Context: Mountain_track has a steep hill that the car must climb. Multiple experiments tested different throttle_min and reward combinations.

Confirmed findings (from Exp 19):

  • throttle_min=0.2 + v4 reward: car cannot get over hill. v4 reward gives zero gradient when speed≈0 AND efficiency≈0 simultaneously on hill.
  • throttle_min=0.5 + any reward: car gets over hill, BUT throttle_min is baked into the action space. Model cannot output throttle < 0.5. Result: crashes on tight corners (mini_monaco ~91 steps consistently).
  • throttle_min=0.2 + v5 reward (speed×CTE): model CAN learn to self-select high throttle on hill. Proved in Exp 9 (90k steps, mountain only) → 2000/2000. The v5 speed gradient is non-zero on hills, giving the model a learning signal.

When mountain fails in parallel training:

  • First check for training contamination (e.g., grass exploit on other track)
  • The grass exploit corrupts generated_track episodes → model learns exploit instead of driving → mountain gets corrupted gradient too
  • Fix the exploit first, then re-run. Do NOT immediately assume throttle_min is the cause.

If mountain still fails after exploit fixes:

  • Consider per-track throttle_min: throttle_min=0.5 for mountain env, throttle_min=0.2 for other envs (DummyVecEnv allows per-env wrappers)
  • This is feasible since each env in DummyVecEnv is wrapped independently

DO NOT:

  • Confuse mountain rollback with a stuck issue (it's a learning/reward issue)
  • Add termination conditions for rollback (interferes with slow hill learning)
  • Change throttle_min as the FIRST response when mountain fails

ADR-021: Generated Track Grass Exploit — Root Cause and Fix

Date: 2026-04-19
Status: Accepted

Context: generated_track has a physical gap in the boundary mesh at the first turn. The car finds this gap and drives off onto the grass indefinitely.

Root cause: donkey_sim.py determine_episode_over() has:

if math.fabs(self.cte) > 2 * self.max_cte:  # > 16.0m
    pass   # designed for bad startup frames, but means far-off-track = never terminates
elif math.fabs(self.cte) > self.max_cte:    # 8.0-16.0m
    self.over = True

The car exits through the gap, CTE quickly exceeds 16m, hits pass — episode never ends.

Fix: Python-side SpeedRewardWrapper CTE patience terminator:

  • If CTE > max_cte_terminate (4.0m) for cte_patience (20) consecutive steps → terminate
  • Catches the car at 4m (before blowing past 16m into the pass zone)
  • 4.0m chosen conservatively — legitimate cornering stays well below 4m CTE
  • Resets counter when car returns to within 4m (brief excursions allowed)
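
The patience rule in the bullets above can be sketched as a small counter. The `CtePatience` class is a hypothetical stand-in, not the SpeedRewardWrapper code; only the 4.0 m threshold, the 20-step patience, and the reset-on-return behaviour come from this ADR.

```python
class CtePatience:
    """Terminate after cte_patience consecutive steps with |CTE| above the limit."""

    def __init__(self, max_cte_terminate: float = 4.0, cte_patience: int = 20):
        self.max_cte_terminate = max_cte_terminate
        self.cte_patience = cte_patience
        self.count = 0

    def reset(self) -> None:
        self.count = 0

    def update(self, cte: float) -> bool:
        """Feed one step's CTE; return True when the episode should end."""
        if abs(cte) > self.max_cte_terminate:
            self.count += 1
        else:
            self.count = 0  # back within the limit: brief excursions are forgiven
        return self.count >= self.cte_patience


# Brief excursion (10 steps out, then back) followed by a real escape.
monitor = CtePatience()
cte_trace = [1.0] * 5 + [5.0] * 10 + [1.0] * 5 + [5.0 + i for i in range(30)]
flags = [monitor.update(c) for c in cte_trace]
first_terminate = flags.index(True)
```

The 10-step excursion does not end the episode because the counter resets when the car returns. The escape that starts at step 20 triggers termination on its 20th consecutive off-track step, regardless of how large the CTE has grown by then.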

Note: We cannot fix the Unity sim code directly.