Architecture Decision Records — DonkeyCar RL Autoresearch

One ADR per major non-obvious technical choice. Agents read this to avoid re-opening settled decisions.


ADR-001: PPO over DQN as Primary Agent

Date: 2026-04-13
Status: Accepted

Context: DonkeyCar driving is a continuous control problem (steer ∈ [-1,1], throttle ∈ [0,1]). DQN requires discrete action spaces; we worked around this with DiscretizedActionWrapper. PPO supports continuous action spaces natively.

Decision: Use PPO as the primary agent. Keep DQN support for discrete action experiments.

Consequences:

  • PPO trains faster on continuous driving tasks (no discretization artifacts)
  • No need for DiscretizedActionWrapper with PPO (but keep it for DQN experiments)
  • PPO with CnnPolicy handles raw image observations natively

Rejected alternatives:

  • DQN only — requires discretization; loses steering resolution
  • SAC — valid alternative but PPO is simpler and well-tested on DonkeyCar
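
For concreteness, the two action-space shapes (a sketch; the Box bounds follow from the context above, and the bin counts are illustrative, borrowed from ADR-004's original grid):

import numpy as np
import gymnasium as gym

# Continuous action space that PPO consumes natively: [steer, throttle]
continuous = gym.spaces.Box(low=np.array([-1.0, 0.0]),
                            high=np.array([1.0, 1.0]),
                            dtype=np.float32)

# What DQN needs instead: one discrete index per (steer, throttle) pair,
# e.g. 7 steering bins x 3 throttle bins; DiscretizedActionWrapper maps index -> pair
discrete = gym.spaces.Discrete(7 * 3)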

ADR-002: Pure Numpy GP (TinyGP) over sklearn

Date: 2026-04-13
Status: Accepted

Context: We need a Gaussian Process surrogate model for the autoresearch controller. sklearn.gaussian_process exists but has had compatibility issues with our numpy version.

Decision: Use TinyGP — a pure numpy RBF kernel GP implemented in autoresearch_controller.py.

Consequences:

  • No sklearn dependency
  • Full control over kernel and noise parameters
  • Slightly less optimized than sklearn but sufficient for < 1000 data points

Rejected alternatives:

  • sklearn GaussianProcessRegressor — dependency issues
  • GPyTorch — overkill, adds PyTorch dependency
  • BoTorch — same objection
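
The core of such a GP fits in a few lines. A minimal sketch of the approach (not the exact TinyGP code in autoresearch_controller.py; hyperparameter names are illustrative):

import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 * l^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise=1e-3):
    # Standard GP regression equations; the O(n^3) solve is fine for < 1000 points
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    K_ss = rbf_kernel(X_test, X_test, length_scale)
    alpha = np.linalg.solve(K, y_train)
    mu = K_s.T @ alpha                      # posterior mean
    v = np.linalg.solve(K, K_s)
    var = np.diag(K_ss - K_s.T @ v)         # posterior variance
    return mu, np.sqrt(np.maximum(var, 0))  # clamp tiny negative variances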

ADR-003: JSONL Append-Only Results

Date: 2026-04-13
Status: Accepted

Context: Results from 300+ trials must be persistent, recoverable, and never lost.

Decision: All results are appended to JSONL files. Results files are never truncated or overwritten.

Consequences:

  • System can be interrupted and resumed at any point
  • Historical data is preserved even if a later trial fails
  • Easy to parse with json.loads(line) per line

Rejected alternatives:

  • SQLite — adds dependency, overkill for this volume
  • CSV — loses type information, harder to extend
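
The access pattern is deliberately trivial (sketch; field names are illustrative):

import json

def append_result(path, record):
    # Append-only: open in 'a' mode, one JSON object per line
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def load_results(path):
    # Recover every completed trial, even after an interrupted run
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]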

ADR-004: Gaussian Process + UCB Acquisition for Hyperparameter Search

Date: 2026-04-13
Status: Accepted

Context: We need an intelligent hyperparameter search strategy. Grid search was the starting point, but it misses optimal regions that do not lie on the grid (proven in practice: the best n_steer found was 8, which was not in the original grid of [3, 5, 7]).

Decision: Gaussian Process + Upper Confidence Bound (UCB) acquisition. GP models the reward landscape; UCB balances exploration vs exploitation.

The default kappa=2.0 is a reasonable balance; increase it for more exploration.

Consequences:

  • Finds optimal regions with fewer trials than grid search
  • Naturally handles continuous parameter spaces (learning_rate ∈ [0.00005, 0.005])
  • Requires at least 2 data points before the GP can be fit (the first 2 trials use random sampling)

Rejected alternatives:

  • Random search — better than grid but no learning
  • Tree Parzen Estimator (TPE/Optuna) — valid alternative, adds dependency
  • CMA-ES — better for high-dimensional spaces; our space is 3D, GP is sufficient
  • Population-Based Training (PBT) — requires parallel sim instances (we only have 1)
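
UCB itself is one line on top of the GP posterior (sketch, reusing the gp_posterior helper from ADR-002; candidate generation is illustrative):

import numpy as np

def ucb_suggest(X_train, y_train, candidates, kappa=2.0):
    # Score = posterior mean + kappa * posterior std:
    # high mean rewards exploitation, high std rewards exploration.
    mu, sigma = gp_posterior(X_train, y_train, candidates)
    return candidates[np.argmax(mu + kappa * sigma)]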

ADR-005: No Model Saving Before Model is Defined

Date: 2026-04-13
Status: Accepted (bug fix — never repeat)

Context: The original donkeycar_sb3_runner.py still called model.save(save_path) after the model-training code had been removed. This caused NameError: name 'model' is not defined on every single run across 300 trials.

Decision: Never call model.save() without first verifying model is defined. Training and saving must be atomic — if training fails, no save attempt.

Pattern:

import sys
from stable_baselines3 import PPO

try:
    model = PPO('CnnPolicy', env, ...)
    model.learn(total_timesteps=timesteps)
    model.save(save_path)  # only reached when the model exists and training succeeded
except Exception as e:
    log(f'Training failed: {e}')
    sys.exit(102)  # non-zero exit code signals failure to the caller

Rejected alternatives:

  • Checking if 'model' in locals() before save — fragile, hides bugs

ADR-006: env.close() + 2-Second Cooldown is Non-Negotiable

Date: 2026-04-13
Status: Accepted

Context: Early in the project, failing to call env.close() between runs left simulator zombie processes that locked up the entire system. With this pattern in place, 20+ consecutive runs have worked reliably.

Decision: Every runner process MUST:

  1. Call env.close() in a try/except before exit
  2. Sleep 2 seconds after close
  3. Then exit

This applies even if training or evaluation fails.
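
Pattern (a sketch; run_training stands in for whatever the runner executes, and the finally block guarantees cleanup on both success and failure):

import sys
import time

exit_code = 0
try:
    run_training(env)       # training or evaluation; may raise
except Exception as e:
    log(f'Run failed: {e}')
    exit_code = 1
finally:
    try:
        env.close()         # 1. always close, wrapped so cleanup cannot mask the real error
    except Exception:
        pass
    time.sleep(2)           # 2. cooldown so the simulator tears down the session
sys.exit(exit_code)         # 3. then exit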

Rejected alternatives:

  • Relying on Python garbage collection for env cleanup — proven to cause hangs

ADR-007: PPO with CnnPolicy for Image Observations

Date: 2026-04-13
Status: Accepted

Context: DonkeyCar provides 120x160x3 RGB camera images as observations. The policy must process images.

Decision: Use PPO('CnnPolicy', env, ...) from SB3. CnnPolicy automatically handles image preprocessing with a CNN feature extractor.

Consequences:

  • Larger model than MlpPolicy (image processing overhead)
  • Requires VecTransposeImage wrapper (SB3 handles this internally)
  • Training is slower per step but produces better driving behavior

Rejected alternatives:

  • MlpPolicy — cannot handle raw image inputs
  • Custom CNN — unnecessary complexity given SB3's built-in CnnPolicy
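
A quick sanity check of the observation pipeline (sketch; assumes env is a registered DonkeyCar environment with image observations):

from stable_baselines3 import PPO

model = PPO('CnnPolicy', env, verbose=0)
# SB3 applies VecTransposeImage automatically: the sim's channels-last
# (120, 160, 3) observation becomes channels-first (3, 120, 160) for the CNN.
print(model.observation_space.shape)  # expected: (3, 120, 160)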

ADR-008: All Phases Planned, Phase 1 Executed First

Date: 2026-04-13
Status: Accepted

Context: User asked whether to implement Phase 1 only or all phases. Three phases identified:

  1. Real Training Foundation
  2. Multi-Track Generalization
  3. Racing / Speed Optimization

Decision: Plan all phases in full documentation, execute Phase 1 first. Do not start Phase 2 until Phase 1 produces a genuine champion model (mean_reward > 100 on training track). This creates a wave gate between Phase 1 and Phase 2.

Rationale: Phase 2 and 3 depend on having a real trained model. Without Phase 1 complete, there is nothing to generalize or optimize for speed.


ADR-009: Tests Must Not Require Live Simulator

Date: 2026-04-13
Status: Accepted

Context: The DonkeyCar simulator must be running on port 9091 for live training. Tests cannot depend on this.

Decision: All pytest tests mock the gym environment. Integration tests use a MagicMock gym env that returns fake observations, rewards, and done signals. Only manual/acceptance tests require the live simulator.

Pattern:

import gymnasium as gym
import numpy as np
from unittest.mock import MagicMock, patch

@patch('gymnasium.make')
def test_runner_exits_cleanly(mock_make):
    mock_env = MagicMock()
    # Gymnasium API: reset() -> (obs, info); step() -> (obs, reward, terminated, truncated, info)
    mock_env.reset.return_value = (np.zeros((120, 160, 3)), {})
    mock_env.step.return_value = (np.zeros((120, 160, 3)), 1.0, True, False, {})
    mock_env.action_space = gym.spaces.Box(...)
    mock_make.return_value = mock_env
    # ... test runner