# Architecture Decision Records — DonkeyCar RL Autoresearch
> One ADR per major non-obvious technical choice.
> Agents read this to avoid re-opening settled decisions.
---
## ADR-001: PPO over DQN as Primary Agent
**Date:** 2026-04-13
**Status:** Accepted
**Context:** DonkeyCar driving is a continuous control problem (steer ∈ [-1,1], throttle ∈ [0,1]). DQN requires discrete action spaces; we worked around this with DiscretizedActionWrapper. PPO supports continuous action spaces natively.
**Decision:** Use PPO as the primary agent. Keep DQN support for discrete action experiments.
**Consequences:**
- PPO trains faster on continuous driving tasks (no discretization artifacts)
- No need for DiscretizedActionWrapper with PPO (but keep it for DQN experiments)
- PPO with CnnPolicy handles raw image observations natively
**Rejected alternatives:**
- DQN only — requires discretization; loses steering resolution
- SAC — valid alternative but PPO is simpler and well-tested on DonkeyCar
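A minimal sketch of the action space in question, built with `gymnasium` spaces (the real env defines this itself; the sketch only shows why PPO needs no wrapper):
```python
import numpy as np
import gymnasium as gym

# DonkeyCar's continuous action vector: [steering, throttle].
# PPO samples from this Box directly; DQN would need the
# DiscretizedActionWrapper to map a finite action index onto it.
action_space = gym.spaces.Box(
    low=np.array([-1.0, 0.0], dtype=np.float32),   # steer min, throttle min
    high=np.array([1.0, 1.0], dtype=np.float32),   # steer max, throttle max
)
```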
---
## ADR-002: Pure Numpy GP (TinyGP) over sklearn
**Date:** 2026-04-13
**Status:** Accepted
**Context:** We need a Gaussian Process surrogate model for the autoresearch controller. sklearn.gaussian_process exists but has had compatibility issues with our numpy version.
**Decision:** Use TinyGP — a pure numpy RBF kernel GP implemented in autoresearch_controller.py.
**Consequences:**
- No sklearn dependency
- Full control over kernel and noise parameters
- Slightly less optimized than sklearn but sufficient for < 1000 data points
**Rejected alternatives:**
- sklearn GaussianProcessRegressor: the numpy compatibility issues described above
- GPyTorch: overkill; adds a PyTorch dependency
- BoTorch: same objection as GPyTorch
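For illustration, a minimal sketch of a pure-numpy RBF-kernel GP of the kind TinyGP implements (the actual kernel and noise parameters live in autoresearch_controller.py; the function names here are illustrative):
```python
import numpy as np

def rbf(A, B, length_scale=1.0, signal_var=1.0):
    # Squared-exponential kernel between point sets A (n,d) and B (m,d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, Xq, noise=1e-4, **kern):
    # Standard GP posterior mean/std at query points Xq given data (X, y).
    K = rbf(X, X, **kern) + noise * np.eye(len(X))
    Kq = rbf(X, Xq, **kern)
    mean = Kq.T @ np.linalg.solve(K, y)
    var = rbf(Xq, Xq, **kern).diagonal() - np.einsum('ij,ij->j', Kq, np.linalg.solve(K, Kq))
    return mean, np.sqrt(np.maximum(var, 0.0))
```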
---
## ADR-003: JSONL Append-Only Results
**Date:** 2026-04-13
**Status:** Accepted
**Context:** Results from 300+ trials must be persistent, recoverable, and never lost.
**Decision:** All results are appended to JSONL files. Results files are never truncated or overwritten.
**Consequences:**
- System can be interrupted and resumed at any point
- Historical data is preserved even if a later trial fails
- Easy to parse with `json.loads(line)` per line
**Rejected alternatives:**
- SQLite: adds a dependency; overkill at this volume
- CSV: loses type information and is harder to extend
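A minimal sketch of the append/parse round trip (the function names are illustrative):
```python
import json

def append_result(path: str, record: dict) -> None:
    # Append-only: 'a' mode, one JSON object per line, never truncate.
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def load_results(path: str) -> list[dict]:
    # Recovers every completed trial, even after an interrupted run.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```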
---
## ADR-004: GP+UCB Bayesian Optimization for Hyperparameter Search
**Date:** 2026-04-13
**Status:** Accepted
**Context:** We need an intelligent hyperparameter search strategy. Grid search was the starting point but misses non-grid-aligned optimal regions (proven: n_steer=8 was NOT in the original grid of [3,5,7]).
**Decision:** Gaussian Process + Upper Confidence Bound (UCB) acquisition. GP models the reward landscape; UCB balances exploration vs exploitation.
The default **kappa=2.0** gives a reasonable balance and can be increased for more exploration.
**Consequences:**
- Finds optimal regions with fewer trials than grid search
- Naturally handles continuous parameter spaces (learning_rate [0.00005, 0.005])
- Requires at least 2 data points before GP can be fit (random sampling for first 2 trials)
**Rejected alternatives:**
- Random search: better than grid search, but learns nothing from past trials
- Tree-structured Parzen Estimator (TPE, via Optuna): a valid alternative, but adds a dependency
- CMA-ES: better suited to high-dimensional spaces; our space is 3D, so a GP is sufficient
- Population-Based Training (PBT): requires parallel sim instances (we only have 1)
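A minimal sketch of the proposal step, reusing the `gp_predict` sketch from ADR-002 (`candidates` is assumed to be an (m, d) array of hyperparameter vectors):
```python
import numpy as np

def propose_ucb(X_obs, y_obs, candidates, kappa=2.0):
    # Fewer than 2 observations: the GP cannot be fit yet, so sample randomly.
    if len(y_obs) < 2:
        return candidates[np.random.randint(len(candidates))]
    # UCB score = posterior mean + kappa * posterior std, so candidates with
    # high predicted reward AND high uncertainty both attract trials.
    mean, std = gp_predict(np.asarray(X_obs), np.asarray(y_obs), candidates)
    return candidates[np.argmax(mean + kappa * std)]
```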
---
## ADR-005: No Model Saving Before Model is Defined
**Date:** 2026-04-13
**Status:** Accepted (bug fix; never repeat)
**Context:** An earlier edit to donkeycar_sb3_runner.py removed the model training code but left the `model.save(save_path)` call in place. This caused `NameError: name 'model' is not defined` on every single run for 300 trials.
**Decision:** Never call `model.save()` without first verifying `model` is defined. Training and saving must be atomic: if training fails, no save is attempted.
**Pattern:**
```python
try:
    model = PPO('CnnPolicy', env, ...)
    model.learn(total_timesteps=timesteps)
    model.save(save_path)  # only reached if training succeeded
except Exception as e:
    log(f'Training failed: {e}')
    sys.exit(102)
```
**Rejected alternatives:**
- Checking `if 'model' in locals()` before saving: fragile, and hides the real bug
---
## ADR-006: env.close() + 2-Second Cooldown is Non-Negotiable
**Date:** 2026-04-13
**Status:** Accepted
**Context:** Early in the project, failing to call env.close() between runs left zombie simulator processes that locked up the entire system. With this pattern in place, 20+ consecutive runs have completed reliably.
**Decision:** Every runner process MUST:
1. Call `env.close()` in a try/except before exit
2. Sleep 2 seconds after close
3. Then exit
This applies even if training or evaluation fails.
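A minimal sketch of the required shutdown sequence (the `shutdown` helper name is illustrative, not from the codebase):
```python
import sys
import time

def shutdown(env, exit_code=0):
    try:
        env.close()        # 1. always attempt to close the sim connection
    except Exception:
        pass               # a failed close must not skip the cooldown
    time.sleep(2)          # 2. give the simulator time to release resources
    sys.exit(exit_code)    # 3. only then exit
```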
**Rejected alternatives:**
- Relying on Python garbage collection for env cleanup: proven to cause hangs
---
## ADR-007: PPO with CnnPolicy for Image Observations
**Date:** 2026-04-13
**Status:** Accepted
**Context:** DonkeyCar provides 120x160x3 RGB camera images as observations. The policy must process images.
**Decision:** Use `PPO('CnnPolicy', env, ...)` from SB3. CnnPolicy automatically handles image preprocessing with a CNN feature extractor.
**Consequences:**
- Larger model than MlpPolicy (image processing overhead)
- Requires VecTransposeImage wrapper (SB3 handles this internally)
- Training is slower per step but produces better driving behavior
**Rejected alternatives:**
- MlpPolicy: cannot handle raw image inputs
- Custom CNN feature extractor: unnecessary complexity given SB3's built-in CnnPolicy
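A minimal sketch of model construction under this ADR (the env id is illustrative; the real runner builds its env elsewhere):
```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make('donkey-generated-roads-v0')  # illustrative env id
# CnnPolicy attaches SB3's built-in CNN feature extractor to the
# 120x160x3 image observations; SB3 applies VecTransposeImage internally.
model = PPO('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)
```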
---
## ADR-008: All Phases Planned, Phase 1 Executed First
**Date:** 2026-04-13
**Status:** Accepted
**Context:** User asked whether to implement Phase 1 only or all phases. Three phases identified:
1. Real Training Foundation
2. Multi-Track Generalization
3. Racing / Speed Optimization
**Decision:** Plan all phases in full documentation, execute Phase 1 first. Do not start Phase 2 until Phase 1 produces a genuine champion model (mean_reward > 100 on training track). This creates a wave gate between Phase 1 and Phase 2.
**Rationale:** Phase 2 and 3 depend on having a real trained model. Without Phase 1 complete, there is nothing to generalize or optimize for speed.
---
## ADR-009: Tests Must Not Require Live Simulator
**Date:** 2026-04-13
**Status:** Accepted
**Context:** The DonkeyCar simulator must be running on port 9091 for live training. Tests cannot depend on this.
**Decision:** All pytest tests mock the gym environment. Integration tests use a MagicMock gym env that returns fake observations, rewards, and done signals. Only manual/acceptance tests require the live simulator.
**Pattern:**
```python
import numpy as np
import gymnasium as gym
from unittest.mock import MagicMock, patch

@patch('gymnasium.make')
def test_runner_exits_cleanly(mock_make):
    mock_env = MagicMock()
    mock_env.reset.return_value = (np.zeros((120, 160, 3)), {})
    mock_env.step.return_value = (np.zeros((120, 160, 3)), 1.0, True, False, {})
    mock_env.action_space = gym.spaces.Box(...)
    mock_make.return_value = mock_env
    # ... test runner
```
---
## ADR-010: Warren is an Outdoor/Road Track — Include in Generalization Benchmark
**Date:** 2026-04-12
**Status:** Accepted
**Context:** Warren (UCSD Warren Track v1.0) is under a tent but has proper road geometry:
white lane lines, yellow centre dashes, orange traffic cones. Unlike purely indoor tracks
(Robo Racing League, Waveshare, Circuit Launch, Warehouse) which use a carpet/hard floor
as the road surface with painted lines, Warren has an actual grass+painted-road layout
with genuine road markings.
**Decision:** Warren is classified as a "pseudo-outdoor" track — visually similar to
outdoor road tracks despite being sheltered. It is included in the zero-shot test set
(alongside mini_monaco) rather than the indoor-skip category.
**Consequence:** The Wave 3 generalization benchmark consists of 2 held-out tracks:
mini_monaco (outdoor trees + fence) and warren (pseudo-outdoor tent + road markings).
---
## ADR-011: Wave 3 Zero-Shot Generalization — Test Tracks Never Used in Training
**Date:** 2026-04-12
**Status:** Accepted
**Context:** Visual overfitting confirmed — Phase 2 champion drives only the track it was
trained on (generated_road). CNN learned background-specific features (desert horizon,
sky colour) rather than road-invariant features (lane markings, road edges).
**Decision:** Wave 3 uses a strict train/test split:
- **Training tracks:** generated_road, generated_track, mountain_track
- **Test tracks (zero-shot only):** mini_monaco, warren
- **Optimisation target:** `combined_test_score = mini_monaco_mean_reward + warren_mean_reward`
(the GP ONLY sees test-track performance — training performance is not the objective)
**Rationale:** This mirrors established domain generalisation practice. If we train the GP
on training reward, we could find hyperparams that overfit the training tracks while still
failing the test tracks. Only test performance correctly measures generalisation.
**Consequence:** Zero-shot evaluation happens at the end of every trial. If a trial crashes
on both test tracks, its score is 0, and the GP learns that those hyperparameters don't generalise.
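A minimal sketch of the objective as the GP sees it (the dict layout is illustrative; the field names follow the objective above):
```python
def combined_test_score(trial: dict) -> float:
    # The GP only ever sees held-out-track performance; a trial whose
    # model crashes on both test tracks contributes 0.
    return (trial.get('mini_monaco_mean_reward', 0.0)
            + trial.get('warren_mean_reward', 0.0))
```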
---
## ADR-012: Warm-Start from Phase 2 Champion for Wave 3
**Date:** 2026-04-12
**Status:** Accepted
**Context:** Training PPO from scratch across 3 tracks would require ~500k+ timesteps to
reach a competent policy. Phase 2 champion (Trial 20) already drives generated_road well.
**Decision:** All Wave 3 trials warm-start from `models/champion/model.zip` (Phase 2
champion). `PPO.load(path, env=new_env)` loads weights; `model.learning_rate` is then
overridden with the GP-proposed learning rate. Falls back to fresh PPO if load fails.
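A minimal sketch of the warm-start-with-fallback logic (`new_env`, `gp_lr`, and `log` are illustrative names):
```python
from stable_baselines3 import PPO

try:
    # Load Phase 2 champion weights and attach the Wave 3 multi-track env.
    model = PPO.load('models/champion/model.zip', env=new_env)
    model.learning_rate = gp_lr  # override with the GP-proposed learning rate
except Exception as e:
    log(f'Warm-start failed ({e}); falling back to fresh PPO')
    model = PPO('CnnPolicy', new_env, learning_rate=gp_lr)
```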
**Rationale:** The champion already knows how to follow a road. Warm-starting means Wave 3
only needs to teach *generalisation* — learning to apply the same skill to new visual
inputs. This is far more efficient than teaching driving from scratch.
**Risk:** If the champion's policy is over-specialised (e.g., relies on very specific pixel
features of desert background), warm-starting could hinder generalisation. This is why the
GP tunes learning_rate — a higher LR will more aggressively overwrite specialised features.